In-browser Kokoro text-to-speech

Step‑by‑Step Last updated: 2025‑01‑15

How to Make a 5‑Minute YouTube Voiceover with Kokoro TTS (Step‑by‑Step)

This tutorial shows creators how to produce a clean 5‑minute narration entirely in the browser. You will write a concise script, segment it into natural beats, generate audio with Kokoro, polish it with minimal mastering, and sync it to video.

1) Plan the Story (2–3 minutes)

Define a single promise for your video. Example: “Viewers learn to organize files in macOS without any third‑party apps.” Draft an outline with 4–6 chapters and estimate timing for each chapter.
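To check that an outline adds up to the target runtime, the chapter estimates can be summed in a few lines of shell. This is only a sketch: the chapter names and durations below are hypothetical placeholders, and `total_seconds` is an illustrative helper, not part of Kokoro.

```shell
# Hypothetical outline: chapter label and planned seconds, one chapter per line.
outline='intro 30
folders 75
sorting 90
cleanup 75
summary 30'

# Sum the second column to get the planned runtime in seconds.
total_seconds() {
  printf '%s\n' "$1" | awk '{ sum += $2 } END { print sum }'
}

total_seconds "$outline"   # prints 300, i.e. 5 minutes
```

If the total drifts far from 300 seconds, trim or merge chapters before you start writing.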

2) Write and Segment the Script (5–8 minutes)

Write in short sentences. Each segment should be a single sense unit: one idea, 1–3 sentences, and no more than 20 seconds when read aloud. Add punctuation and bracketed stage directions to guide rhythm and emphasis, as in this example:

[friendly] Welcome back to the channel! [pause 300ms]
Today, I’ll show you how to tidy your Mac desktop. [smile]
First, we’ll create a simple folder structure—no extra apps needed.
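A rough way to check the 20-second limit before generating any audio is to estimate duration from word count. The sketch below assumes a narration pace of 150 words per minute, which is a guess rather than a measured Kokoro rate, so time one real segment with your chosen voice and adjust `WPM`. The function names are hypothetical.

```shell
# Assumed narration pace in words per minute; adjust after timing a real segment.
WPM=150

# Print the estimated spoken length of a segment in whole seconds, rounded up.
# Bracketed stage directions such as [pause 300ms] are not counted as words.
estimate_seconds() {
  words=$(printf '%s\n' "$1" | sed 's/\[[^]]*]//g' | wc -w | tr -d ' ')
  echo $(( (words * 60 + WPM - 1) / WPM ))
}

# Print "ok" if the segment fits the 20-second limit, otherwise "too long".
check_segment() {
  if [ "$(estimate_seconds "$1")" -le 20 ]; then
    echo "ok"
  else
    echo "too long"
  fi
}
```

For instance, `check_segment "Today, I'll show you how to tidy your Mac desktop."` reports `ok`. The estimate ignores the length of any pauses, so treat it as a lower bound.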

3) Pick the Voice and Pace (2 minutes)

Choose a Kokoro voice that matches the tone of your video: friendly explainer, formal tutorial, or energetic promo. Preview two or three segments, then adjust punctuation and speed until every word is clear.

4) Generate and Export WAV (5–10 minutes)

  1. Open Kokoro Web and paste your first segment.
  2. Preview, adjust punctuation, and re‑preview until natural.
  3. Export WAV and name it sequentially: 001-intro.wav, 002-step1.wav, …
  4. Repeat for all segments.
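Zero-padded numbers keep the files in playback order in any file browser or shell glob. If you rename exports from the command line, a small helper like this one (hypothetical, shown only as a sketch) builds names in the same pattern:

```shell
# Build a zero-padded, sortable file name from a segment number and a short label.
# Pass the number without leading zeros (1, not 001).
segment_name() {
  printf '%03d-%s.wav\n' "$1" "$2"
}

segment_name 1 intro    # prints 001-intro.wav
segment_name 2 step1    # prints 002-step1.wav
```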

5) Mastering: Loudness and Clarity (3 minutes)

Target an integrated loudness of about −16 LUFS for stereo exports, and apply a light limiter to catch stray peaks. Optionally, add a 1–2 dB high‑shelf boost above 6 kHz for air and very light de‑essing to soften harsh “s” sounds.
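If you prefer the command line to an audio editor, ffmpeg's `loudnorm` and `treble` filters can approximate this chain. The sketch below makes several assumptions that are not from this tutorial: a true-peak ceiling of −1.5 dB, a loudness range of 11, and a 48 kHz output rate. Single-pass `loudnorm` lands near the target rather than exactly on it, and de-essing is left out. The helper names are hypothetical.

```shell
# Build the ffmpeg audio filter chain: optional high-shelf boost, then loudness normalization.
# Arguments: target LUFS (default -16), shelf gain in dB (default 0 = no shelf).
# TP=-1.5 and LRA=11 are assumed values, not settings from the tutorial.
mastering_filter() {
  target=${1:--16}
  shelf_db=${2:-0}
  chain="loudnorm=I=${target}:TP=-1.5:LRA=11"
  if [ "$shelf_db" != "0" ]; then
    # The EQ goes first so the loudness target applies to the equalized signal.
    chain="treble=g=${shelf_db}:f=6000,${chain}"
  fi
  printf '%s\n' "$chain"
}

# Master one file: input path, then output path.
# loudnorm resamples internally, so the output rate is pinned (48000 is an assumption).
master_wav() {
  ffmpeg -i "$1" -af "$(mastering_filter -16 1.5)" -ar 48000 "$2"
}
```

Calling `master_wav voiceover.wav voiceover-mastered.wav` would apply a 1.5 dB shelf and normalize toward −16 LUFS. Listen to the result before replacing your original file.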

6) Join Segments and Sync to Video (5 minutes)

# Join all segments in order with ffmpeg
ffmpeg -i 001-intro.wav -i 002-step1.wav -i 003-step2.wav -i 004-summary.wav \
  -filter_complex "[0:a][1:a][2:a][3:a]concat=n=4:v=0:a=1[out]" \
  -map "[out]" voiceover.wav
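
# With many segments, listing every input by hand gets tedious. One alternative
# is ffmpeg's concat demuxer, which reads file names from a text list. The helper
# below is a hypothetical sketch; it assumes all segments share the same sample
# rate and channel layout and that no file name contains a single quote.

```shell
# Print an ffmpeg concat list for every numbered WAV in a directory, in name order.
# Only base names are written, so save the list in the same directory as the segments.
make_concat_list() {
  for f in "$1"/[0-9][0-9][0-9]-*.wav; do
    [ -e "$f" ] || continue
    printf "file '%s'\n" "${f##*/}"
  done
}

# Usage, run from the directory that holds the segments:
#   make_concat_list . > list.txt
#   ffmpeg -f concat -i list.txt -c copy voiceover.wav
```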

In Premiere or CapCut, place voiceover.wav on an audio track beneath your video. Align chapter cuts first, then add B‑roll. Keep background music (if any) around −26 to −28 LUFS so it does not mask the speech.

7) Quality Checklist

  • Each segment expresses one idea; no run‑on sentences.
  • Voice fits the brand; pace is steady and clear.
  • Final loudness near −16 LUFS; no clipping.
  • Export formats: WAV master; MP3 192–256 kbps for delivery.
  • All assets named and backed up for future updates.
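
The clipping item on this checklist can be spot-checked with ffmpeg's `volumedetect` filter, which reports the highest sample peak as a `max_volume` line. The helpers below are a hypothetical sketch that parses that line. They check sample peaks only and do not measure LUFS.

```shell
# Extract the max_volume figure (in dB) from ffmpeg volumedetect output on stdin.
max_volume_db() {
  sed -n 's/.*max_volume: \(-\{0,1\}[0-9.]*\) dB.*/\1/p'
}

# Classify a peak value: "clipping risk" at 0 dB or above, "ok" below,
# "unknown" when no value was found.
peak_status() {
  awk -v peak="$1" 'BEGIN {
    if (peak == "") print "unknown"
    else if (peak + 0 >= 0) print "clipping risk"
    else print "ok"
  }'
}

# Usage:
#   peak=$(ffmpeg -hide_banner -i voiceover.wav -af volumedetect -f null - 2>&1 | max_volume_db)
#   peak_status "$peak"
```

For the MP3 delivery copy, a command such as `ffmpeg -i voiceover.wav -codec:a libmp3lame -b:a 192k voiceover.mp3` encodes at the lower end of the suggested bitrate range, assuming your ffmpeg build includes the LAME encoder.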

FAQ

Can I work offline? Yes. After the initial model download, Kokoro runs locally.

What if the pacing sounds choppy? Add commas and short pause directions such as [pause 200ms], using the same bracket notation as the script example in step 2.

How do I fix a tricky word? Respell it phonetically in your draft (for example, “kash” for “cache”), then re‑preview and adjust the spelling until it sounds right.


Author: Kokoro Web Team • Last updated 2025‑01‑15