Do I need to upload media?

No. Transcription and TTS can both run locally, so media and text stay on your device.

How do I sync captions and voiceover?

Keep the timestamps from transcription for captions. For voiceover, generate one WAV per segment and align the segments in your editor.

What export formats should I use?

Master to WAV (44.1 or 48 kHz), then deliver MP3 at 192–256 kbps for web.

From Local Captioning to Voiceover: Whisper‑Style Transcription + Kokoro TTS

This tutorial outlines a private pipeline to transcribe content locally, edit and segment the script, and generate clean voiceovers with Kokoro, all without sending media to servers.

1) Transcribe Locally

Run a local transcription tool to produce text and timestamps (e.g., .srt).
Export raw text for editing; keep timestamps for caption use.
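As a rough illustration of separating raw text from timestamps, the sketch below parses an .srt file into cues using only the Python standard library. The names `Cue`, `parse_srt`, and `plain_text` are hypothetical helpers, not part of any transcription tool, and the parser assumes well-formed cues (index line, timing line, then text).

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    index: int
    start: float  # seconds from the start of the media
    end: float    # seconds from the start of the media
    text: str


_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt_text: str) -> list[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    normalized = srt_text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = [ln for ln in block.split("\n") if ln.strip()]
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start_raw, end_raw = lines[1].split("-->")
        cues.append(
            Cue(
                index=int(lines[0].strip()),
                start=_to_seconds(start_raw),
                # Ignore any positioning data after the end timestamp.
                end=_to_seconds(end_raw.split()[0]),
                text=" ".join(ln.strip() for ln in lines[2:]),
            )
        )
    return cues


def plain_text(cues: list[Cue]) -> str:
    """Return one line of text per cue, ready for editing."""
    return "\n".join(cue.text for cue in cues)
```

Edit the output of `plain_text` as the script, and keep the original .srt (or the `Cue` list) untouched for captions.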

2) Edit and Segment

Remove filler words, fix names, and split the script into sense units that each take 20 seconds or less to speak. Keep punctuation and stage directions to guide prosody.
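Since a segment's spoken length is only known after synthesis, one way to approximate the 20-second limit is a word budget. The sketch below groups whole sentences until an estimated duration would be exceeded. The function name `split_into_segments` and the 150 words-per-minute speaking rate are assumptions for illustration; measure your chosen voice and adjust.

```python
import re


def split_into_segments(
    script: str,
    max_seconds: float = 20.0,
    words_per_minute: float = 150.0,  # assumed speaking rate, not a measured value
) -> list[str]:
    """Group whole sentences into segments that fit an estimated time budget."""
    max_words = int(max_seconds * words_per_minute / 60)
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s]

    segments: list[str] = []
    current: list[str] = []
    word_count = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new segment when adding this sentence would exceed the budget.
        if current and word_count + n_words > max_words:
            segments.append(" ".join(current))
            current, word_count = [], 0
        current.append(sentence)
        word_count += n_words
    if current:
        segments.append(" ".join(current))
    return segments
```

A single sentence longer than the budget is kept whole as its own segment, so check for overlong segments and split them by hand at a natural pause.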

3) Generate with Kokoro

Preview each segment, adjust pacing and emphasis, then export it as WAV.
Name files sequentially to simplify assembly (001.wav, 002.wav...).
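Zero-padded names matter because most tools sort files as text, where 10.wav would come before 2.wav. A minimal sketch of generating such names follows; `segment_filenames` is a hypothetical helper and does not call Kokoro.

```python
def segment_filenames(count: int, min_width: int = 3, ext: str = ".wav") -> list[str]:
    """Return zero-padded names (001.wav, 002.wav, ...) that sort in playback order."""
    # Widen the padding automatically if there are more than 999 segments.
    width = max(min_width, len(str(count)))
    return [f"{i:0{width}d}{ext}" for i in range(1, count + 1)]
```

Pair each name with its segment text, for example `zip(segment_filenames(len(segments)), segments)`, so a regenerated segment always overwrites the same file.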

4) Assemble and Master

Concatenate segments with a DAW or ffmpeg, then normalize integrated loudness to −16 LUFS.
Deliver MP3 at 192–256 kbps for web; keep the WAV masters.
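For the ffmpeg route, the assembly steps can be sketched as three commands: join the segments with the concat demuxer, normalize with the `loudnorm` filter, and encode MP3 with `libmp3lame`. The helper names below are hypothetical, and the true-peak (`TP=-1.5`), loudness-range (`LRA=11`), and 48 kHz output settings are assumptions rather than values from this tutorial. The sketch also assumes all segments share one sample rate and channel layout, and that paths contain no single quotes.

```python
from pathlib import Path


def concat_list_text(wav_paths) -> str:
    """Build the text of an ffmpeg concat list, with segments in sorted order."""
    ordered = sorted(Path(p) for p in wav_paths)
    return "".join(f"file '{p.as_posix()}'\n" for p in ordered)


def concat_cmd(list_path, out_wav) -> list[str]:
    """Join segments without re-encoding (requires identical audio formats)."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", str(list_path), "-c", "copy", str(out_wav)]


def loudnorm_cmd(in_wav, out_wav, target_lufs: float = -16.0) -> list[str]:
    """Single-pass loudness normalization toward the target integrated LUFS."""
    return ["ffmpeg", "-i", str(in_wav),
            "-af", f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11",
            "-ar", "48000",  # set the output rate explicitly (assumed 48 kHz master)
            str(out_wav)]


def mp3_cmd(in_wav, out_mp3, kbps: int = 192) -> list[str]:
    """Encode the web deliverable from the WAV master."""
    return ["ffmpeg", "-i", str(in_wav),
            "-c:a", "libmp3lame", "-b:a", f"{kbps}k", str(out_mp3)]
```

Write the result of `concat_list_text` to a text file, then run each command in order with `subprocess.run(cmd, check=True)`. Relative paths in the list are resolved against the list file's location. Single-pass `loudnorm` adjusts gain dynamically; a two-pass measurement or a DAW loudness meter is worth considering if the result must land precisely on −16 LUFS.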

Benefits

Privacy by default; no uploads.
Consistent narration quality and repeatable edits.
Faster iteration on phrasing and timing.

Author: Kokoro Web Team • Last updated 2025‑01‑15