The Hormozi caption style, measured: how to edit punchy business Shorts

"Edit it like Hormozi" is probably the most common style request in short-form video, and most people who say it are picturing the wrong thing: the 2021 clone look with ALL-CAPS caption walls, emoji, meme inserts, and green-screen b-roll. The style Alex Hormozi's team actually ships in 2024 and beyond is far more restrained, and because it is restrained, it is measurable.

So we measured it. As part of a 13-video, 4-creator study of short-form editing (July 2026), we broke down three Hormozi Shorts frame by frame: a rapid-fire listicle, a call-in advice clip, and a live workshop Q&A. We logged every hard cut with scene-detection scoring, sampled caption colors, counted words per caption group, and timed every payoff. This article is the resulting anatomy, plus how to reproduce each device in your own edits.

This is an independent editing analysis. We are not affiliated with or endorsed by Alex Hormozi.

What the style actually is (and is not)

Across all three videos we studied, the modern Hormozi look has zero b-roll, zero memes, zero screenshots, and roughly 100% face time. There is no title card. Speech starts at 0.0 seconds, and the caption is on screen from frame one. The entire retention load is carried by three things:

Idea density. In the listicle we measured, 12 complete aphorisms land in 52.8 seconds, a finished payoff every 4.2 seconds on average. The video's first sentence is its entire promise: "Here's 14 years of business advice in 60 seconds."
Big bold captions with one emphasis mechanism. One to four words per screen (median three), mixed-case white geometric sans, heavy drop shadow, no background box, no emoji. Emphasis is done with extra-bold weight on the operative word, never with color changes or karaoke-style per-word timing.
Semantic cuts plus constant micro-motion. Cuts land on idea boundaries, not on a timer, and between cuts the frame is never fully static: a slow zoom drift keeps pixels moving even during a 15-second listening shot.

If your mental image of this style includes flames, arrows, and a new sound effect every second, that is the old imitation, not the source. The measured style is closer to a well-shot podcast with unusually disciplined typography.

Caption anatomy: the measured spec

This is the part most edits get wrong, so here is the exact specification from the footage:

Property	Measured value
Words per screen	1 to 4, median 3; whole group swaps at once, no karaoke
Swap cadence	A new caption group roughly every 0.8 to 1.0 seconds
Case and color	Mixed case (not all-caps), white #FFFFFF bold geometric sans, drop shadow, no box, no emoji
Emphasis	Extra-bold weight on the operative word only; occasional size bump on a single-word screen
Position	Between vertical center and chest height, about 48% to 66% down the frame, over the speaker's chest
Speaker coding	Other voices render in italic; in the live Q&A the asker's captions are vivid yellow (#FFEB00 sampled) italic, while the host stays upright white

Two details do most of the work. First, the tiny word count: three words per screen forces the viewer's reading pace to match the speaking pace, which is what makes muted viewing feel synchronized rather than laggy. Second, the single emphasis mechanism. Weight only. The moment an edit mixes bold, color highlights, and boxed words in the same video, it stops reading as confident and starts reading as noisy. This "one mechanism" rule held across every creator in our study, not just Hormozi; it is law four of the seven retention laws we extracted from all 13 videos.
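To make the grouping rule concrete, here is a minimal Python sketch that packs word-level transcript timings into caption groups that swap as a whole. The three-word cap and the roughly one-second ceiling come from the measured spec; the `Word` and `CaptionGroup` types, the `group_captions` name, and the rule of breaking on punctuation are our own assumptions for illustration, not any particular editor's API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class CaptionGroup:
    text: str
    start: float
    end: float


def _close(words):
    """Turn a run of words into one caption group spanning their timings."""
    return CaptionGroup(
        text=" ".join(w.text for w in words),
        start=words[0].start,
        end=words[-1].end,
    )


def group_captions(words, max_words=3, max_duration=1.0):
    """Pack words into groups that appear and disappear as a unit.

    A group is closed when it reaches max_words, when the word just added
    pushes its span to max_duration or beyond, or when that word ends a
    clause (hypothetical rule: trailing punctuation).
    """
    groups = []
    current = []
    for word in words:
        current.append(word)
        span = current[-1].end - current[0].start
        ends_clause = word.text.endswith((".", ",", "!", "?", ";", ":"))
        if len(current) >= max_words or span >= max_duration or ends_clause:
            groups.append(_close(current))
            current = []
    if current:  # flush any trailing partial group
        groups.append(_close(current))
    return groups
```

Because each group swaps as a unit, no per-word (karaoke) timing is kept: only the group's start and end reach the renderer. Marking the operative word for extra-bold weight would need a separate per-word flag, which this sketch leaves out.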

The listicle format adds one more typographic device: a numbered chapter pill sitting just above the caption line. Dark charcoal rounded pill, the numeral in muted gold (#D7AD63 sampled), the category word in white all-caps: "1. MARKET", "2. PRIORITIES", and so on through twelve. The pill is the only all-caps element in the frame, and it doubles as a progress bar: the climbing count tells the viewer a finish line exists.

Cut density: fast, but slower than you think

The style has a reputation for machine-gun cutting. The measurement says otherwise. Across the three videos, hard cuts land every 3.5 to 5.6 seconds:

Figure: Hard cuts per video, from scene-detection scoring of the three studied Shorts. Even so, the frame is never static between cuts; see below.

Three things make this modest cut rate feel fast:

Cuts are semantic. In the listicle, all 12 chapter boundaries are hard cuts; in the call-in clip, the six real cuts align one to one with the argument beats. A new idea gets a new framing. Rhythmic filler cuts, the kind added every second regardless of content, do not appear.
Punch-ins alternate. Each cut toggles between roughly two crop levels: wide (hands and table visible) and tight (chest-up, face filling about 40% of the frame width). The alternation makes each cut register as escalation.
A slow zoom drift covers the holds. In the call-in video, scene-detection scores show low-level motion on nearly every frame from 17 seconds onward. The boldest stretch is a 15-second single shot of Hormozi silently listening to the caller, held together by nothing but a slow push-in and the caller's escalating stakes. No frame is ever truly static.
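For readers who want to run the same kind of measurement on their own footage, the following Python sketch shows one way to turn per-frame scene-detection scores into cut timestamps, an average cut interval, and a check for static frames. It assumes scores normalized to the 0 to 1 range with one score per frame; the function names, the 0.5 cut threshold, the 0.5-second minimum gap, and the 0.01 drift floor are hypothetical values chosen for illustration, not the settings used in our study.

```python
def find_hard_cuts(scores, fps, threshold=0.5, min_gap=0.5):
    """Return timestamps (seconds) of frames whose score reaches the threshold.

    Detections closer than min_gap seconds to the previous cut are ignored,
    so a single cut that scores high on two adjacent frames counts once.
    """
    cuts = []
    for frame, score in enumerate(scores):
        t = frame / fps
        if score >= threshold and (not cuts or t - cuts[-1] >= min_gap):
            cuts.append(t)
    return cuts


def seconds_per_cut(cuts, duration):
    """Average shot length: runtime divided by the number of shots.

    Assumption: n cuts split the video into n + 1 shots. Dividing by the
    cut count instead is another reasonable convention.
    """
    return duration / (len(cuts) + 1)


def static_fraction(scores, drift_floor=0.01):
    """Share of frames scoring below the drift floor, i.e. truly static."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s < drift_floor) / len(scores)
```

A clip with a continuous slow zoom should show a `static_fraction` near zero even where `find_hard_cuts` reports nothing for many seconds, which is the pattern described for the 15-second listening shot.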

Compare that with the Ali Abdaal grammar, where the camera barely cuts at all and a mutating on-screen overlay does the retention work instead. We broke that style down in "How to edit Shorts like Ali Abdaal." Same laws, opposite mechanism.

Structure: the question is allowed to be long

The most counterintuitive measurement is how patient the Q&A formats are. In the call-in clip, the caller's question runs 17.3 seconds, 27% of the entire runtime, before Hormozi says a word. In the live workshop clip, the question occupies 17.7 seconds, 31% of the video. That is an eternity by Shorts norms, and it works for two reasons: the question is itself entertaining (it contains the stakes, and in the live clip, the punchline that gets the room laughing), and watching the listener's face react to a problem the viewer might share is its own hook. The payoff, the actual answer, lands in the back half, which matches the open-loop placement we saw across the whole study.

The listicle is the opposite: no setup at all. Speech at 0.0 seconds, promise complete by 2.0 seconds, first payoff at 2.5 seconds, then a complete aphorism every 4.2 seconds until the end. The final chapter doubles as the closer, and the video simply stops on it. No CTA, no outro sentence, no fade. Across all three videos, there is not a single "like and subscribe" beat.

Audio follows the same restraint. The listicle carries a faint music bed; the call-in and live clips have no music at all. Fillers, stammers, and audience laughter are kept in. Trims land between beats, never inside sentences. The authenticity is deliberate: the edit removes dead time, not personality.

When this style works, and when it reads as noise

The style is a bet that your sentences can carry the video. It works when the content is a dense advice monologue or a sharp Q&A: short declarative claims, one idea per beat, quotable lines. The captions and punch-ins amplify writing that is already punchy.

It fails, and fails loudly, in three situations. If the underlying speech is meandering, three-word captions expose every filler clause instead of hiding it. If the content needs to show something (a process, a product, a chart), zero-b-roll is the wrong bet; use an explainer grammar instead. And if you stack emphasis mechanisms, bold plus color plus boxes plus emoji, you have recreated the 2021 clone look, which now signals low-effort content farming to viewers who have seen ten thousand such videos.

A useful self-test: mute the video. If the captions alone read like a good Twitter thread, the style fits. If they read like a transcript, fix the writing before the editing.

Reproduce it: settings per device

Any editor with word-level transcript timing can build this. The recipe, device by device:

1. Beat map first. Transcribe, then split the transcript into idea-beats (one claim or answer per beat, usually one to three sentences). Every cut, zoom, and graphic fires at a word timestamp from this map.
2. Hook at 0.0s. Trim everything before the promise or stakes sentence. No title card. Captions from frame one.
3. Semantic cuts, alternating crops. Cut at each beat boundary, alternating a tight and a wide framing (a 1.5x punch-in against the full frame is the right magnitude). Target a cut every 3.5 to 5.6 seconds; if the content gives you longer beats, that is fine: cover them with the drift in the next step.
4. Drift on any hold over 8 seconds. Add a slow continuous zoom across the whole beat so no frame is static.
5. Captions: 2 to 3 words, white, mid-frame. Mixed case, bold, drop shadow, no box, positioned around 50 to 62% down the frame so they sit on the chest, never the face. Bold the one operative word per line if your tool supports per-word weight; if not, leave lines un-emphasized rather than faking it with color.
6. Listicle variant: add the pill. A persistent numbered pill above the caption line that increments per item. Aim for a completed item roughly every 4 seconds.
7. End on the payoff. Cut immediately after the final line. No outro.
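Steps 3 and 4 of the recipe can be sketched as a small planning function. This is a minimal Python illustration under stated assumptions, not the implementation of any editor: the `Beat` and `Shot` types, the function names, the choice to start on the wide framing, and the 8% drift magnitude are all hypothetical. Only the 1.5x punch-in and the 8-second hold threshold come from the recipe above.

```python
from dataclasses import dataclass


@dataclass
class Beat:
    start: float  # seconds, taken from the beat map
    end: float


@dataclass
class Shot:
    start: float
    end: float
    scale_in: float   # crop scale on the first frame of the shot
    scale_out: float  # crop scale on the last frame of the shot


def plan_shots(beats, punch_in=1.5, drift_after=8.0, drift_amount=0.08):
    """One shot per beat, alternating wide (1.0x) and tight (punch_in).

    Holds longer than drift_after seconds get a slow push-in of
    drift_amount (assumed 8%) so that no frame is static.
    """
    shots = []
    for i, beat in enumerate(beats):
        base = 1.0 if i % 2 == 0 else punch_in  # assumption: open wide
        hold = beat.end - beat.start
        scale_out = base * (1 + drift_amount) if hold > drift_after else base
        shots.append(Shot(beat.start, beat.end, base, scale_out))
    return shots


def scale_at(shot, t):
    """Linearly interpolated crop scale at time t, clamped to the shot."""
    if shot.end <= shot.start:
        return shot.scale_in
    progress = (t - shot.start) / (shot.end - shot.start)
    progress = min(max(progress, 0.0), 1.0)
    return shot.scale_in + (shot.scale_out - shot.scale_in) * progress
```

A renderer would call `scale_at` once per frame. Linear interpolation is the simplest choice; an eased curve is a reasonable substitute if the drift start feels abrupt.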

This grammar is encoded in PandaStudio's editing agent

The measurements in this article are not just a blog post; they are the spec our editing agent runs. Tell WritePanda "edit this like Hormozi" and it applies the bold-caption recipe from this study: hook trimmed to 0.0 seconds, semantic cuts with alternating 1.5x punch-ins on beat boundaries, drift zooms on long holds, mixed-case white captions at two to three words per screen, and no b-roll unless you ask. Some devices are still approximations (per-word bold inside caption groups is a known gap, so the agent substitutes an emphasis overlay on the two or three most load-bearing lines), and we would rather tell you that than pretend.


FAQ

Should Hormozi-style captions be all-caps?

No. In all three videos we measured, the caption line is mixed case. The only all-caps element is the category word inside the numbered chapter pill. All-caps caption walls belong to the 2021 imitation style, not the current one.

How fast should the cuts be?

A hard cut every 3.5 to 5.6 seconds, placed on idea boundaries rather than on a timer. The "cut every second" feel comes from caption groups swapping roughly every 0.9 seconds and a continuous slow zoom, not from actual cuts.

Does this style use b-roll or memes?

The three videos we studied contain zero b-roll, zero memes, and zero screenshots. Retention is carried by idea density, captions, and cut cadence. If your content needs visual proof, a different grammar (like the explainer style) is a better fit; see our broader Shorts playbook for how to choose.

Do I need to script the video first?

The listicle format is tightly pre-scripted, but the Q&A formats keep fillers, stammers, and laughter in. What matters is that trims land between beats, never inside sentences, and that each beat contains exactly one idea.