Published July 4, 2026

Kamal Sankarraj

9 min read

How Diary of a CEO edits podcast clips (and how to copy it)

Q: Does Diary of a CEO use music in its podcast clips?

No. All three clips we measured run dry conversation audio with no music bed and no sound effects. The retention work is done visually: speaker switches, angle cuts, and captions.

Q: How often does DOAC cut in a podcast clip?

Across the three clips we measured, a hard cut lands every 4.1 to 6.3 seconds on average. Cuts get denser as the payoff approaches and slower on emotional material.

Q: Where should the payoff sit in a podcast clip?

In the measured DOAC clips the payoff lands at the 55%, 62%, and 69% marks of the runtime. The recipe target is 55 to 80 percent: late enough to hold the viewer, early enough to leave room for application.

Podcast clipping is its own editing grammar. It is not a talking-head Short with two people in it. The retention engine is different (speaker switching instead of overlays or b-roll), the hook problem is different (you are choosing a moment from a two-hour conversation, not writing one), and the caption treatment is different (the captions have to tell you who is talking). Diary of a CEO runs the most codified version of this grammar on YouTube right now, so we measured it.

As part of a 13-video study across four creators, we broke down three DOAC vertical clips frame by frame: a two-speaker health dialogue (115.6s), a guest monologue on discipline (138.6s), and an emotional Q&A on grief (65.8s). Every number below comes from those breakdowns. This post covers what the system actually does, then gives you the recipe to copy it.

PandaStudio is not affiliated with or endorsed by Diary of a CEO or Steven Bartlett. This is an independent editing analysis of publicly available clips.

The visual engine is speaker switching, not b-roll

The first thing that stands out in the data: DOAC clips barely use b-roll. The health clip has zero. The grief clip has zero. The discipline monologue has about 9 seconds of it, roughly 6.5% of the runtime, and only where the guest names something concrete (a brain render over "86 billion neurons"). Everything else that keeps the screen moving is the conversation itself.

The rule is simple: the camera follows the voice. In the two-speaker health clip there are 7 speaker switches in 115.6 seconds, and every one lands exactly when the speaking turn changes. The tightest stretch is a rapid Q&A volley with 4 speaker switches in about 3 seconds. That volley is pure retention fuel: the screen state changes four times and the viewer never has to do any work to follow it.
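The rule can be sketched in a few lines of Python. This is a minimal illustration, not DOAC's tooling: it assumes you already have speaker turns with timestamps (for example from a diarized transcript), and the `Turn` type, the `host`/`guest` labels, and the sample timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str   # hypothetical labels such as "host" or "guest"
    start: float   # seconds from the start of the clip
    end: float


def speaker_switches(turns: list[Turn]) -> list[tuple[float, str]]:
    """Return (time, speaker) pairs, one per change of speaker.

    The first pair is the opening shot; every later pair is a camera switch
    placed exactly on the turn boundary.
    """
    shots: list[tuple[float, str]] = []
    current = None
    for turn in sorted(turns, key=lambda t: t.start):
        if turn.speaker != current:
            shots.append((turn.start, turn.speaker))
            current = turn.speaker
    return shots


# Made-up timings: a long guest answer, then a rapid volley.
turns = [
    Turn("guest", 0.0, 12.4),
    Turn("guest", 12.4, 20.0),   # same speaker continues, so no switch
    Turn("host", 20.0, 21.0),
    Turn("guest", 21.0, 21.8),
    Turn("host", 21.8, 22.5),
    Turn("guest", 22.5, 40.0),
]

shots = speaker_switches(turns)
for time, speaker in shots:
    print(f"{time:6.1f}s -> {speaker}")
print("switches:", len(shots) - 1)
```

Consecutive turns by the same speaker collapse into one shot, so the switch count depends only on where the voice actually changes.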

The non-speaker gets on screen too, but only as reaction cutaways: 1 to 1.5 seconds of the listener nodding or reacting while the speaker's audio continues underneath (a J-cut). The discipline clip does this twice in 138 seconds. It is a seasoning, not a pattern.

When one person talks for a long time, the edit fakes the switching instead. The health clip holds on the host for 45 seconds while he reads a study aloud, and breaks that hold with 7 internal angle cuts. The monologue clip alternates between three angles of the same guest every ~6.3 seconds. Add it all up and the measured cut cadence across the three clips is one hard cut every 4.1 to 6.3 seconds. Cuts are densest right before the payoff and deliberately slower on the grief material (every 5.1s). This is the podcast version of the 5-second state law from our seven laws of retention editing: something visible changes every 3 to 8 seconds, and in this style the mechanism is the speakers.
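To check your own edit against these cadence figures, a small sketch like the following can help. It assumes you have a list of hard-cut timestamps exported from your editor. The function names are hypothetical, and "average cadence" is taken here to mean runtime divided by number of shots, which may not be exactly how the figures in this post were computed.

```python
def average_shot_length(cut_times: list[float], duration: float) -> float:
    """Seconds per shot: runtime divided by the number of shots (cuts + 1)."""
    return duration / (len(cut_times) + 1)


def stale_stretches(cut_times: list[float], duration: float,
                    max_gap: float = 8.0) -> list[tuple[float, float]]:
    """Spans where the picture holds longer than max_gap seconds."""
    edges = [0.0, *sorted(cut_times), duration]
    return [(a, b) for a, b in zip(edges, edges[1:]) if b - a > max_gap]


# Hypothetical timestamps: a 45 s hold broken by 7 evenly spaced cuts.
hold = 45.0
cuts = [hold * i / 8 for i in range(1, 8)]

print(average_shot_length(cuts, hold))   # 5.625
print(stale_stretches(cuts, hold))       # []
print(stale_stretches([10.0], 30.0))     # [(0.0, 10.0), (10.0, 30.0)]
```

The 8-second default for `max_gap` comes from the upper end of the 3 to 8 second window; anything the second function returns is a stretch that needs a cut, a punch-in, or a caption change.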

Hook selection: three ways to open a two-hour conversation

In a scripted Short you write the hook. In a podcast clip you have to find it inside hours of conversation, and DOAC's three clips show three distinct solutions. All three share one constant: speech starts at 0.0 seconds, mid-conversation, under a white title banner in black caps that names the payoff and auto-exits by about 5.5 seconds.

1. The redacted pull-forward (health clip)

The strongest sentence in the clip lives at the 80-second payoff, so the edit pulls it to 0s and surgically cuts out the operative words. The guest says you can "change the trajectory of your life" by doing ____ every hour, and the blank is a 4-second silent hold with the show wordmark over it. You have to watch to the payoff to fill it in. This is the most aggressive open loop in the study.

2. The banner frontload (discipline clip)

The speech stays chronological, opening on a universally relatable line about regret. The teaser work is done entirely by the banner, which names "The Ulysses Pact" at 0s even though the guest does not say those words until 85 seconds in. The banner is the pulled-forward element; the audio is untouched.

3. The trap question (grief clip)

The clip opens on a wide two-shot of the host asking a rhetorical question every viewer answers in their head before the guest says "Never. Never." at the 4-second mark. No pull-forward, no redaction. The provocative question was already at the front of the source conversation, so the edit just commits to it.

Whichever hook you use, the payoff placement is strict. In the three clips the answer lands at the 69%, 62%, and 55% marks of the runtime. Never early. The hook opens the loop, the middle earns the wait, and the payoff sits in the back half.
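The placement arithmetic is simple enough to script. The sketch below is an illustration under stated assumptions: the function names, the 65% default target, and the source timestamp in the second example are hypothetical, while the 80-second payoff and 115.6-second runtime are the health-clip figures quoted above.

```python
def payoff_fraction(payoff_time: float, clip_length: float) -> float:
    """Where the payoff lands, as a fraction of the clip runtime."""
    return payoff_time / clip_length


def clip_window(payoff_in_source: float, clip_length: float,
                target: float = 0.65) -> tuple[float, float]:
    """Pick (start, end) in source time so the payoff lands at `target`.

    If the payoff is too close to the start of the source, the window is
    clamped to 0 and the payoff will land earlier than requested.
    """
    if not 0.55 <= target <= 0.80:
        raise ValueError("target should sit in the 55-80% band")
    start = max(0.0, payoff_in_source - target * clip_length)
    return start, start + clip_length


# Health clip: payoff at 80 s in a 115.6 s clip.
print(f"{payoff_fraction(80.0, 115.6):.0%}")  # 69%

# Hypothetical: the payoff sentence sits one hour into the source recording.
start, end = clip_window(payoff_in_source=3600.0, clip_length=100.0)
print(start, end)
```

With the defaults, a payoff one hour into the source gives a 100-second window running from roughly 3535 s to 3635 s, leaving about 35 seconds after the payoff for application.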

Captions that tell you who is talking

The base caption grammar is 1 to 3 words per screen, bold condensed uppercase with a heavy black outline, positioned around 60-70% of frame height (below the chin, above the title banner while it is up). Nothing unusual so far. The podcast-specific move is per-speaker color coding: in the two-speaker clip, every host line is captioned in yellow and every guest line in white. During the fast Q&A volley the color flips with each turn, so the captions alone tell you who is talking even with the sound off.
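A rough sketch of that caption grammar, assuming word-level timestamps with speaker labels from your transcription tool. The `Word` type, the color names, and the sample words are hypothetical; the yellow-host, white-guest mapping follows the measured two-speaker clip.

```python
from dataclasses import dataclass

# Host in yellow, guest in white, as in the measured two-speaker clip.
SPEAKER_COLORS = {"host": "yellow", "guest": "white"}


@dataclass
class Word:
    text: str
    speaker: str
    start: float
    end: float


def _screen(group: list[Word]) -> dict:
    """Build one caption screen from a run of words by a single speaker."""
    return {
        "text": " ".join(w.text for w in group).upper(),
        "color": SPEAKER_COLORS.get(group[0].speaker, "white"),
        "start": group[0].start,
        "end": group[-1].end,
    }


def caption_screens(words: list[Word], max_words: int = 3) -> list[dict]:
    """Group words into screens of at most max_words, never mixing speakers."""
    screens: list[dict] = []
    group: list[Word] = []
    for word in words:
        if group and (len(group) == max_words or word.speaker != group[0].speaker):
            screens.append(_screen(group))
            group = []
        group.append(word)
    if group:
        screens.append(_screen(group))
    return screens


# Made-up words and timings for illustration only.
words = [
    Word("how", "host", 0.0, 0.2),
    Word("often", "host", 0.2, 0.5),
    Word("every", "guest", 0.6, 0.9),
    Word("single", "guest", 0.9, 1.2),
    Word("hour", "guest", 1.2, 1.5),
    Word("really", "host", 1.6, 2.0),
]
for screen in caption_screens(words):
    print(screen)
```

Breaking the screen on every speaker change is what makes the color flip line up with the turn, so the captions stay readable with the sound off.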

The second move is register matching. The grief clip throws the loud template away entirely: small, mostly lowercase mixed-case text with a soft shadow, no yellow, no emoji, and a single operative word occasionally wrapped in a small colored box. Caps and outlines for health and discipline, a whisper for grief. The caption style is chosen to match the emotional register of the content, not applied as one house style everywhere.

The audio follows the same logic. All three clips are completely dry: no music bed, no sound effects, laughter kept in. And on the grief clip the silences are left long instead of trimmed tight, because the pauses are the content. This is the opposite of what most clipping tools do by default, and it is worth copying deliberately.

The copyable recipe

Here is the whole system as a checklist. It assumes a two-person recording; if you need to capture that first, see our guide to recording a remote podcast with per-participant tracks.

1. Pick a segment whose payoff can sit at 55-80%. Scan the transcript for the strongest single sentence, then cut a clip around it so that sentence lands in the back half. Aim for 60 to 140 seconds total.
2. Enter mid-conversation at 0.0s with a title banner. No intro, no context. White box, black caps, names the payoff, gone by 5.5 seconds. If the payoff sentence is quotable, use the redacted hook: pull it to 0s and mute the operative words.
3. Camera follows the voice. Switch to whoever is speaking, at the turn boundary. Insert the non-speaker only as 1-1.5s wordless reaction cutaways with the speaker's audio continuing.
4. Break any hold longer than ~12 seconds. Alternate punch-ins or crops of the same shot every 5-6 seconds so a monologue reads like multi-cam coverage.
5. Caption in 1-3 word screens, color-coded per speaker. Bold caps with outline at 60-70% height for standard content; drop to small lowercase with soft shadow for emotional material.
6. Keep the audio dry. No music, no SFX, keep the laughter. On serious clips, keep the pauses too.
7. End card for 1.5-2 seconds restating the title. Skip it on emotional clips; the grief clip ends cold on the guest's last word, and that restraint is the right call.
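Step 4 of the checklist is the most mechanical one, so here is a minimal sketch of it. It assumes you know where a single-shot hold starts and ends; the function name, the crop labels, and the 5.5-second target interval (the midpoint of the 5-6 second range) are illustrative choices, not measured values.

```python
def punch_in_plan(hold_start: float, hold_end: float,
                  max_hold: float = 12.0, target_interval: float = 5.5,
                  crops: tuple[str, ...] = ("wide", "punch-in")) -> list[tuple[float, str]]:
    """Split one long hold into evenly spaced, alternating crops.

    Returns (time, crop) pairs. Holds at or under max_hold are left alone.
    """
    length = hold_end - hold_start
    if length <= max_hold:
        return [(hold_start, crops[0])]
    # Choose a whole number of segments close to the target interval,
    # so the last segment is not a leftover sliver.
    segments = max(2, round(length / target_interval))
    step = length / segments
    return [(hold_start + k * step, crops[k % len(crops)]) for k in range(segments)]


plan = punch_in_plan(0.0, 45.0)
for time, crop in plan:
    print(f"{time:6.3f}s -> {crop}")
print("internal cuts:", len(plan) - 1)  # 7
```

With these defaults a 45-second hold splits into 8 segments of 5.625 seconds, which happens to give the same 7 internal cuts counted in the health clip. Pass three crop labels instead of two to imitate the three-angle rotation in the monologue clip.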

If your source material is solo talking-head rather than conversation, most of this still applies but the retention engine changes; that grammar is covered in how to edit Shorts like Ali Abdaal, where mutating overlays do the job that speaker switching does here.

Running this recipe in PandaStudio

We built this exact recipe into PandaStudio's editing agent. Because PandaStudio records each podcast participant as a separate local track, the speaker-follows-voice switching is exact rather than guessed from a single crop: the agent reads the transcript, finds the speaking turns, and switches the layout at each boundary. Ask it to "make podcast clips like DOAC" and it selects segments with a late payoff, adds the title banner, trims to the mid-conversation entry, sets 1-3 word captions, breaks long holds with punch-ins, and keeps the audio dry.

Honest caveats: a few of the finer DOAC touches are not one-click yet. Per-speaker caption colors and the redacted-hook build (mute the operative words, overlay the wordmark) are on our roadmap and currently take manual steps. The core grammar (clip selection, switching, banner, captions, punch-ins, end card) runs from the single prompt today.

PandaStudio is a desktop editor, free to try, with recording, transcription, and the agent built in. See the overview or grab the download.

FAQ

Does Diary of a CEO use music in its podcast clips?

No. All three clips we measured run dry conversation audio: no music bed, no sound effects, laughter kept in. The retention work is done visually by speaker switches, angle cuts, and captions.

How often does DOAC cut in a podcast clip?

Across the three measured clips, a hard cut lands every 4.1 to 6.3 seconds on average. Cuts get denser approaching the payoff (down to roughly 2.5-3 seconds in the health clip) and slower on emotional material.

Where should the payoff sit in a podcast clip?

In the measured clips the payoff lands at the 55%, 62%, and 69% marks of the runtime. The recipe target is 55-80%: late enough that the open loop holds the viewer, early enough that there is room to apply the idea before the end.

Can I copy this with a single camera?

Mostly. True speaker-follows-voice switching needs separate participant video, but the title banner, 1-3 word captions, payoff placement, dry audio, and synthetic angles (alternating crops every 5-6 seconds) all work on one recording. That combination gets you most of the effect.

Turn an episode into clips in one prompt

Record or import your podcast, then tell the PandaStudio agent to make podcast clips like DOAC. It handles segment selection, speaker switching, the title banner, captions, and export to 9:16.
