How to edit YouTube Shorts like Ali Abdaal (we measured his videos frame by frame)
Search "edit like Ali Abdaal" and you get vibes: "keep it snappy," "use nice graphics," "be authentic." Nobody gives you numbers. So we did the boring thing. We ran four of his Shorts through scene detection, transcript timing, and frame-by-frame overlay tracking: 198 seconds of video, 5,940 frames at 30fps, 182,000 combined views. We logged every cut, every caption change, every overlay mutation, and every zoom, down to the timestamp.
The result is not vibes. It is a measurable editing grammar you can reproduce, and it is surprisingly different from what most "edit like Ali" advice says. The biggest surprise: his camera almost never cuts. The retention engine is something else entirely.
This is independent editorial analysis. We are not affiliated with or endorsed by Ali Abdaal.
The numbers, up front
Four Shorts, four slightly different formats, one consistent system underneath.
| Video | Length | Format | Payoff cadence | First speech |
|---|---|---|---|---|
| Ranking AI tools | 77s | Tier-list reaction | 11 verdicts, one every ~7s | 2.5s |
| The Odyssey Plan | 57s | Framework explainer | Overlay change every ~8s | 0.0s |
| My 3 creator rules | 30s | Listicle with cutaways | Visual change every ~3s | 0.0s |
| Best AI tool per use case | 34s | Rapid-fire list | 10 items, one every ~3.2s | 2s |
And the constants that held across all four: speech or payoff begins within 2.5 seconds, something visible changes state every 3 to 8 seconds, captions run 1 to 5 words per screen, filler words are kept in, and every video ends on its payoff frame with no outro. Here is the payoff cadence per format, plotted:
Device 1: The mutating overlay (his actual retention engine)
This is the headline finding. In the 77-second tier-list video, scene detection found essentially zero camera cuts above threshold. It is one static take. What changes instead is a persistent graphic: a white rounded panel anchored to the top ~45% of the frame, with tier rows (S through F) that fill with tool-logo chips as each verdict lands. Eleven mutations in 77 seconds.
The framework video does the same thing with a different skin: section headers in alternating accent colors (orange, teal, lavender) with short paragraph text that builds in as he reaches each point, mutating every ~8 seconds. The rapid-fire list swaps a top-third overlay (app logo, tool name in a per-tool accent color, use-case subtitle) every ~3.2 seconds, and that overlay is the only motion in the entire video.
How to do it: reserve the top ~40% of the 9:16 frame for a graphic that accumulates state: a list that fills, sections that stack, a label that swaps per item. Mutate it every 3 to 8 seconds, timed to the first word of each new item in your transcript. The unfinished graphic is the open loop; viewers stay to watch it complete. The face never gets covered, because the graphic owns the top and the face owns the bottom half.
Device 2: Captions at chest height, 1 to 5 words, one color trick
The caption system is identical across the tier-list and framework videos: a single line of 1 to 5 words, bold white sans, positioned at upper-chest height (roughly 62% down the frame), sitting below the overlay zone and above the desk. The emphasis mechanism is a selective per-word color highlight, yellow or blue, on the operative word: the tool name, the tier verdict.
Two details most imitators get wrong. First, it is ONE emphasis mechanism per video. Across our whole 13-video, 4-creator study, no creator ever mixed color highlights with bold-weight emphasis or boxed words in the same video. Second, in the rapid-fire list format there are no bottom captions at all: the top overlay IS the caption, with spoken words matching the overlay text almost verbatim.
How to do it: burn in captions at 3 to 4 words per line, bold white, positioned around 62% of frame height so they clear both the top graphic and the chin. Pick one highlight color and apply it only to the word that carries the sentence. If your format has a per-item overlay, drop the captions entirely and let the overlay do the work.
Device 3: One zoom, placed at the 75% mark
If you expect Hormozi-style punch-ins on every beat, the data says no. The framework video contains exactly one deliberate punch-in, at ~43 seconds of 57, which is the 75% mark, right where the overlays clear away and the payoff act begins. The framing change itself is a structural signal: overlays gone plus tighter shot equals "this is the part that matters."
The tier-list and rapid-list videos have zero zooms. Between items he uses jump cuts: the pauses between segments are trimmed hard, but trims land between beats, never inside a sentence, and filler words ("um", "uh") are deliberately kept. In one video he says "here's here's how it works" and it stays in the cut. Authenticity is a feature of the style, not a production failure.
How to do it: cut silence between items, keep the fillers, and save your one punch-in (about 1.5x) for the act change at roughly three quarters of the runtime. If you want the every-beat punch-in style instead, that is a different grammar; we measured it in the Hormozi breakdown.
Device 4: Hook anatomy, with timestamps
All four hooks are different on the surface and identical underneath: the payoff starts before the 3-second mark, and there is zero preamble. No "hey guys," no "in this video," in any of the four.
- Tier list: a 2-second word-pop title card ("RATING AI TOOLS" animating word by word over music, no speech), then the first words at 2.5s are already the first verdict: "Voicepal is S tier. I love it."
- Framework: he speaks from frame 1 with a second-person pain question about figuring out what to do with your life, a typographic overlay echoes the spoken line, the promise lands at 3 seconds, and the concept is named at 5. Question, promise, name, in 5 seconds.
- Listicle: a claim-stack: "This is the game of YouTube" at 0:00, escalating to the multi-millionaire claim by 0:03, with pill overlays punctuating each claim beat. Promise ("it's just three skills") by 6 seconds.
- Rapid list: a title card over him eating lunch at his desk, first item at 0:02. No claim, no question; the format itself is the hook, and the casual setting is deliberate contrast with the polished overlays.
How to do it: pick one of three openings: payoff-first verdict, pain question, or claim-stack. Whatever you pick, the first surviving sentence after your trim must be a verdict, claim, question, or promise, and it must start by 2.5 seconds. An optional 2-second title card is fine as long as speech is not wasted under it.
Device 5: End on the payoff frame, not an outro
The tier-list video ends at 76 seconds on the completed tier list, every row filled: the natural payoff frame the whole video was building toward. The listicle ends on the rule-3 graphic, not a talking head. The rapid list runs to a hard stop on item 10 with no outro at all. In the framework video, the payoff act (the final 25% of runtime) is overlay-free and tighter-framed, and it closes on the direct-to-camera insight.
How to do it: cut immediately after the final payoff word. Leave the completed graphic or the closing claim on screen for the last 1.5 to 2 seconds. No "thanks for watching," no fade, no subscribe plea. The completed state is also the loop close: it visually resolves the open loop from the first 5 seconds, which is one of the strongest replay triggers we cover in the Shorts virality playbook.
Common mistakes when copying this style
Cutting the camera instead of mutating a graphic.
The measured face-time in this style is ~95%, with near-zero camera cuts inside beats. If you add fast cuts AND an overlay system, the video reads as frantic, not kinetic. Pick the overlay as your engine and let the camera sit still.
Removing the fillers.
Over-cleaned speech breaks the casual-expert register the style depends on. Trim silences between beats; leave the "um"s inside them.
Covering the face with graphics.
In all four videos the zoning never breaks: graphics own the top ~40%, captions sit at chest height, the face keeps the rest. A center-frame text card over the face is the fastest way to look amateur.
Mixing emphasis mechanisms.
Color highlight OR bold weight OR boxed words. One per video. Mixing them is the single most common tell of a template pack applied without understanding.
Letting 10+ seconds pass with nothing changing.
The measured state-change window is 3 to 8 seconds, median around 5. Scan your edit for any stretch longer than 8 seconds with zero visual change and fill it: an overlay mutation, a graphic swap, a jump cut. This law held across every creator we measured; the full set is in the seven laws of retention editing.
We encoded this grammar into an editing agent
Full disclosure on why we did this study: we build WritePanda, a desktop video editor with a built-in editing agent. The measurements above are encoded as a recipe the agent applies, so telling it "edit this like Ali Abdaal" produces these exact parameters: speech within 2.5 seconds, a top-band graphic mutating every 3 to 8 seconds, 3 to 4 word captions at 62% frame height with a single highlight color, fillers kept, one 1.5x punch-in at the act change, and an ending cut on the payoff frame. You can download it and try that prompt on your own footage, or apply the numbers manually in whatever editor you use. They work either way.
FAQ
What editing style does Ali Abdaal use for Shorts?
Based on our four-video measurement: a single static talking-head take with roughly 95% face time, a persistent top-zone graphic that mutates every 3 to 8 seconds (tier lists, stacking sections, or per-item label swaps), short bold white captions with one highlight color, jump cuts between beats with fillers kept in, and at most one punch-in near the 75% mark. The graphics carry retention, not the camera.
How fast are the cuts in Ali Abdaal Shorts?
Slower than you think, because most are not camera cuts. His fastest measured format (the 30-second listicle) changes visual state every ~3 seconds using graphic cutaways; the tier-list video has essentially no camera cuts in 77 seconds and instead lands a verdict every ~7 seconds. The constant is a visible state change every 3 to 8 seconds, by whatever mechanism.
What captions does this style use?
One line of 1 to 5 words, bold white sans, at upper-chest height (about 62% down the frame), with a yellow or blue color highlight on the single operative word. In rapid-list formats the bottom captions disappear entirely and a top overlay (logo plus name plus use case) replaces them.
Can I automate this style with AI?
Mostly, yes. The trims, captions, overlay timing, music bed, and the single punch-in are all derivable from a transcript with word-level timestamps, which is exactly what agent-driven editing is good at. The part that still needs your judgment is the content itself: the verdicts, the framework, the claim. The edit grammar is reproducible; the opinions are not.