Kamal Sankarraj

We measured 9 top YouTube videos. None of them burn captions.

"Always add captions" might be the most repeated piece of video editing advice on the internet. It shows up in every listicle, every editing course, every "5 mistakes new YouTubers make" video. So we tested it against what top creators actually ship. We ran scene detection and frame classification across 9 recent long-form videos from Ali Abdaal, MKBHD, and Fireship: 963 hard cuts, 502 classified frames.

The result: 0 of 502 frames contained burned-in speech captions. Not "rarely." Zero. Three channels with completely different editing styles, an educator, a product reviewer, and a dev explainer, all made the same call. Meanwhile our separate short-form study found the exact opposite: 13 of 13 Shorts across 4 creators run always-on captions. The advice is not wrong. It is format-specific, and most people repeat it without the format qualifier.

The split, in one chart

[Chart: Videos with burned-in speech captions. Short-form (4 creators): 13 / 13 videos. Long-form (3 channels): 0 / 9 videos.]

On the long-form side that 0% held at the frame level too: 220 classified frames from Ali Abdaal, 148 from MKBHD, 134 from Fireship, and not one contained a transcript of the words being spoken. On the short-form side, always-on captions were one of the seven universal laws that held across every video we measured, with 1 to 6 words on screen whenever anyone speaks.

How we measured it (and the honest limits)

Per creator, we took 3 recent high-performing long-form videos. We ran ffmpeg scene detection at a 0.30 threshold to find hard cuts, then sampled frames every 15 to 20 seconds plus extra frames at hook, chapter, and ending points. Each frame was vision-classified into layout categories: talking head, b-roll, graphic with picture-in-picture, full-frame card, screen recording, and whether any on-screen text was a speech caption or something else. That produced 963 hard cuts and 502 classified frames across the 9 videos.
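As a rough illustration of the first two steps, here is a minimal Python sketch. It assumes ffmpeg is on the PATH and uses ffmpeg's `select` and `showinfo` filters with the 0.30 threshold mentioned above. The function names, the fixed sampling interval, and the way extra timestamps are passed in are all illustrative assumptions, not the exact pipeline used for the study.

```python
import re
import subprocess

SCENE_THRESHOLD = 0.30  # threshold stated in the article

def scene_cut_command(video_path: str, threshold: float = SCENE_THRESHOLD) -> list[str]:
    """Build an ffmpeg command that logs every frame whose scene score exceeds the threshold."""
    return [
        "ffmpeg", "-hide_banner", "-i", video_path,
        "-vf", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",
    ]

# showinfo prints one line per selected frame, including a "pts_time:<seconds>" field.
PTS_TIME = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")

def parse_cut_times(showinfo_log: str) -> list[float]:
    """Extract cut timestamps (in seconds) from ffmpeg's showinfo log text."""
    return [float(t) for t in PTS_TIME.findall(showinfo_log)]

def detect_cuts(video_path: str) -> list[float]:
    """Run ffmpeg and return hard-cut timestamps. Requires ffmpeg to be installed."""
    result = subprocess.run(scene_cut_command(video_path), capture_output=True, text=True)
    return parse_cut_times(result.stderr)  # ffmpeg writes its log to stderr

def sample_times(duration: float, interval: float = 15.0,
                 extra: tuple[float, ...] = ()) -> list[float]:
    """Timestamps at a fixed interval, plus hand-picked extras (hook, chapters, ending)."""
    times = {round(i * interval, 3) for i in range(int(duration // interval) + 1)}
    times.update(round(t, 3) for t in extra if 0 <= t <= duration)
    return sorted(times)
```

For a 60-second clip, `sample_times(60, 15.0, (3.0, 58.0))` yields the regular 15-second grid plus the two extra points. The frame classification step is not sketched here because the article does not describe the classifier.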

The limits, stated plainly: n is 9 videos from 3 channels. All are recent uploads, so this describes current style, not history. All three channels are at the top of their genres, which is the point of the study but also means the finding may not transfer to every niche. And frame sampling at 15 to 20 second intervals could theoretically miss a caption that flashes for under a second, though a speech caption that brief would not be functioning as a caption.
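To put a rough number on that sampling limit, here is a small back-of-the-envelope sketch. It assumes the sampling grid has a random offset relative to the caption and that separate caption appearances are independent; it also ignores the extra frames taken at hook, chapter, and ending points. The function names are hypothetical.

```python
def catch_probability(visible_s: float, interval_s: float) -> float:
    """Chance that a fixed-interval sampler lands on text visible for visible_s seconds."""
    return min(1.0, visible_s / interval_s)

def miss_all_probability(visible_s: float, interval_s: float, appearances: int) -> float:
    """Chance that every one of several independent appearances is missed."""
    return (1.0 - catch_probability(visible_s, interval_s)) ** appearances

# A 1-second flash against a 20-second interval is caught about 5% of the time,
# while text that stays on screen for a full interval or longer is always caught.
flash = catch_probability(1.0, 20.0)
always_on = catch_probability(30.0, 20.0)
```

Under these assumptions a one-off flash is easy to miss, but a rolling transcript that stays on screen whenever someone speaks would show up in nearly every sampled frame, which is why the zero count is informative about speech captions specifically.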

This is observational data. It tells you what top editors ship, not what would happen to your retention if you changed one variable. We think "what the best editors actually do" is more useful than "what everyone repeats," but keep the distinction in mind.

What they put on screen instead

None of these channels leave the screen bare. All three use on-screen text heavily. The difference is that the text is selective and editorial rather than a rolling transcript. Each channel has its own vocabulary.

Ali Abdaal: keyword pops and quote cards

Across 220 frames, zero speech captions. Instead: 1 to 4 word serif keyword pops near the lower third when a term matters, full-frame kinetic quote cards (cream background, word-by-word reveal, 1 or 2 words accent-colored) used 2 to 4 times per video for the tweetable line, numbered lists, flat diagrams, and a persistent chapter tag like "Habit #N" pinned top-left in his 23-chapter video. The text marks what matters; it never transcribes.

MKBHD: credit tags, one spec card, and a scorecard

Across 148 frames, zero speech captions. His persistent text is small black diagonal source-credit tags on every borrowed clip. Then a single white spec card with at most 4 bullets over hero b-roll about 20 seconds in, never repeated. Product labels sit under comparison shots, a checkbox scorecard appears at the "here's the catch" pivot, and full-frame custom bar charts carry the data comparisons. We broke his full system down in the MKBHD editing study.

Fireship: sticker stamps and live labels

Across 134 frames, no spoken-word captions anywhere, which is striking because Fireship never appears on camera. His keyword stamps are bold words on colored angled sticker rectangles rotated 3 to 10 degrees, things like PAGE FAULT or SELF-HOSTED. Terminals are real and live-typed, with a floating label beside the prompt for the flag under discussion, and paper scans get translucent highlighter plus hand-drawn red arrows. Full breakdown in the Fireship editing study.

Why the advice is right for Shorts and wrong for long-form

The caption advice comes from short-form, where it is genuinely universal. In our 13-video short-form study, always-on captions with a 1 to 6 word vocabulary held across every creator, and each style commits to exactly one emphasis mechanism (color highlight, bold weight, or a boxed word, never mixed). That environment explains the rule: feeds autoplay muted, the viewer has not chosen your video, and you have under 2.5 seconds before the scroll decision. Captions are how a muted Short communicates at all. If you edit short-form, the rule stands, and we walk through the execution in how to edit Shorts like Ali Abdaal.

Long-form is a different contract. The viewer clicked a title, sound is usually on, and the watch session is measured in minutes, not seconds. Ali Abdaal sits on both sides of this line: his Shorts run caption-style overlays, while his long-form videos run none. Same creator, same team, opposite call by format.

There is also a real cost to burned captions in long-form that short-form does not pay. These editors are managing a visual rhythm where something changes every few seconds: median shot length fell between 3 and 6.5 seconds across the 9 videos, with 7 to 13 visual changes per minute. Selective text is one of their strongest beats: a keyword pop or a spec card lands because the screen was not already full of words. A permanent caption bar spends that attention channel on redundancy. It is the same logic behind the deliberate 30 to 60 second static holds all three channels place at verdicts and payoffs, which we cover in the 7 laws of retention editing: contrast only works if the baseline is not maxed out.
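If you want to check the rhythm of your own edit, the shot-length figures can be approximated from a list of cut timestamps. This is a minimal sketch with hypothetical function names. It counts hard cuts only, so it will undercount "visual changes" where those include overlays or zooms that do not register as cuts.

```python
from statistics import median

def shot_lengths(cut_times: list[float], duration: float) -> list[float]:
    """Turn cut timestamps into shot durations, including the first and last shot."""
    bounds = [0.0, *sorted(cut_times), duration]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]

def rhythm_summary(cut_times: list[float], duration: float) -> dict[str, float]:
    """Median shot length in seconds and hard cuts per minute for one video."""
    lengths = shot_lengths(cut_times, duration)
    return {
        "median_shot_s": median(lengths),
        "cuts_per_min": len(cut_times) / (duration / 60.0),
    }

# Example with made-up timestamps: four cuts in a 30-second clip.
summary = rhythm_summary([4.0, 10.0, 13.0, 20.0], 30.0)
```

The example timestamps are invented for illustration and are not taken from the measured videos.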

The accessibility point, done properly

"But captions are for accessibility" is the strongest objection, and it deserves a straight answer. On long-form YouTube, the accessible option is the closed caption track, not burned text. A CC track can be toggled, resized, restyled, repositioned, and auto-translated by the viewer. Burned captions can do none of that; they impose one size and one language on everyone, including screen-magnifier users they may actively fight with.

So the pro move is the one the platform supports: upload a corrected caption file, or at minimum review and fix the auto-generated track. Every video in our sample had usable auto-subs (we relied on them to analyze hook anatomy). The top channels are not skipping accessibility. They are just not confusing it with an on-screen design decision.
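A corrected caption file can be as simple as a SubRip (.srt) file, which is one of the formats YouTube accepts for upload. The sketch below assumes you already have corrected cues as (start, end, text) tuples in seconds; the function names and cue structure are illustrative assumptions.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(cues: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) cues as SRT text, numbering the cues from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, 1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive cues
```

Writing the result of `to_srt(...)` to a UTF-8 text file gives you something to upload as the caption track, which viewers can then toggle and restyle on their own terms.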

What to put on screen instead, by genre

Your genre | Text vocabulary to steal
Educator / advice | Keyword pops (1 to 4 words) on key terms, one kinetic quote card for the line you want quoted, a persistent chapter tag if the video has many sections, numbered lists for frameworks.
Review / comparison | One spec card early (4 bullets max, never repeated), labels under comparison shots, a scorecard at the verdict pivot, bar charts for any numeric claim, credit tags on borrowed footage.
Tech / tutorial | Angled sticker stamps on jargon the moment you say it, floating labels beside the exact flag or button in a screen recording, highlighter and arrows on any document you show.
Any long-form | A corrected CC track for accessibility, and text that appears only when it adds something the audio did not already say.

The common thread: in every case the text answers "what should the viewer remember or look at right now," not "what is being said right now." That is the actual rule the 0-of-502 number encodes.

Where PandaStudio fits (and where it does not)

PandaStudio has a one-click burned-caption system, and after this study our advice about it is more specific: use it for Shorts, TikTok, and Reels, where the data says captions are mandatory, and leave it off for long-form YouTube. For long-form, the parts of the app that map to what these channels actually do are the transcript-based editing (cut the ums and dead air that actually hurt retention), zooms and annotations for selective emphasis, and lower thirds for labels. We do not yet ship a kinetic quote card or angled sticker stamp as a preset; this study is part of why those are on the template roadmap.

FAQ

Should I burn captions into my YouTube videos?

For Shorts, TikTok, and Reels: yes. All 13 short-form videos we measured use always-on captions. For long-form YouTube: the top channels we measured do not burn them in at all. Use the platform CC track for accessibility and selective text for emphasis.

Do burned-in captions hurt long-form retention?

We cannot claim that. This was an observational study, not an A/B test. What we can say is that 0 of 502 frames from 9 high-performing videos contained speech captions, so whatever these channels' retention rests on, it is not a rolling transcript.

What about deaf and hard-of-hearing viewers?

Serve them with a corrected closed-caption track, which viewers can toggle, resize, and restyle. That is strictly more accessible than burned text, which nobody can adjust.

Does this apply to every long-form channel?

No. It is 9 videos from 3 channels, all recent uploads, deliberately sampled from the top of three genres. Some long-form formats, like street interviews or repurposed podcast clips, do burn captions. Treat this as strong evidence against captions-by-default, not a ban.

Ali Abdaal, MKBHD (Marques Brownlee), and Fireship are not affiliated with PandaStudio and have not endorsed this article. All measurements were taken from their publicly available YouTube videos for analysis and commentary.