1) Environment and Setup
For best results, use an up‑to‑date Chromium‑based browser. On desktops with recent GPUs, WebGPU delivers the lowest latency. When WebGPU isn’t available, Kokoro falls back to a high‑performance WebAssembly build that uses threads and SIMD where supported.
- Recommended browsers: Chrome, Edge, Arc (2025 builds), Firefox (WebAssembly path).
- Storage: The first model load (~150 MB) is stored in the browser cache, so later visits skip the download.
- Network: After the first load, inference runs fully offline.
2) Voice Selection and Prompting
Kokoro ships with several clear, neutral English voices. The fastest improvement to perceived quality is matching the voice to the script’s intent and pacing it properly.
- Choose a voice that matches your brand tone (friendly, formal, energetic).
- Use short sentences and explicit punctuation to guide prosody.
- Add stage directions in brackets, e.g., [warmly], [pause], [smile].
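If you keep bracketed directions in your working script, a small helper can separate them from the spoken text for your own tooling. This is a minimal sketch, not part of Kokoro: the function names `split_directions` and `pause_ms` are hypothetical, and it assumes directions never nest and that timed pauses are written like `[pause 400ms]`. How Kokoro itself treats bracketed text is not specified here.

```python
import re
from typing import List, Optional, Tuple

# Bracketed stage directions such as [pause 400ms] or [smile].
DIRECTION = re.compile(r"\[([^\[\]]+)\]")


def split_directions(script: str) -> List[Tuple[str, str]]:
    """Return ("text", ...) and ("direction", ...) items in script order."""
    items = []
    pos = 0
    for match in DIRECTION.finditer(script):
        spoken = script[pos:match.start()].strip()
        if spoken:
            items.append(("text", spoken))
        items.append(("direction", match.group(1).strip()))
        pos = match.end()
    tail = script[pos:].strip()
    if tail:
        items.append(("text", tail))
    return items


def pause_ms(direction: str) -> Optional[int]:
    """Parse "pause 400ms" into 400; return None for any other direction."""
    match = re.fullmatch(r"pause\s+(\d+)\s*ms", direction)
    return int(match.group(1)) if match else None
```

With the items separated, you can review the spoken text on its own, or insert silence of the parsed length when assembling segments in your editor.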
Lightweight prompting patterns
// Example: guiding emphasis and breaks
"Welcome to Kokoro Web! [pause 400ms] This guide shows you how to create private, on‑device voiceovers. [smile] Let’s begin."
3) SSML‑style Controls
Even without full SSML parsing, simple markup conventions help consistency: add explicit commas and periods, use parenthetical breaks, and keep proper nouns capitalized. For glossary terms or product names, consider phonetic hints.
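One way to keep phonetic hints consistent is to store them in a glossary and apply them to a copy of the script before synthesis. The sketch below is an assumption about workflow, not a Kokoro feature: `apply_phonetic_hints` is a hypothetical helper, it assumes glossary terms are plain words or phrases, and whether a given respelling actually improves pronunciation has to be checked by ear.

```python
import re
from typing import Dict


def apply_phonetic_hints(text: str, hints: Dict[str, str]) -> str:
    """Replace whole-word glossary terms with respellings, ignoring case."""
    if not hints:
        return text
    # Longest terms first, so a multi-word term wins over a shorter one.
    terms = sorted(hints, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): respelling for term, respelling in hints.items()}
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)


# Usage: keep the original script untouched and synthesize the hinted copy.
hinted = apply_phonetic_hints(
    "Check the schedule before recording.",
    {"schedule": "SKED-jool"},
)
```

Keeping the glossary separate from the script means the readable text stays clean, and a hint can be changed in one place.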
// Phonetic hint example (pseudo‑notation for your notes)
"Schedule" → "SKED‑jool" (US) or "SHEH‑dyool" (UK)
4) Performance Tuning (WebGPU vs. WebAssembly)
On GPUs supporting WebGPU, medium‑length lines synthesize in near‑realtime. On CPUs, WebAssembly with threads/SIMD remains practical for short‑form narration and batching.
- Prefer WebGPU for live previewing and iterative script work.
- Use WebAssembly for broad compatibility (locked‑down devices or VMs).
- Batch multiple lines into segments of 8–20 seconds to amortize setup overhead.
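The batching advice can be sketched as a greedy grouping of script lines by estimated duration. Everything here is illustrative: `batch_lines` and `estimate_seconds` are hypothetical names, and the speaking rate of 2.5 words per second is an assumption you should replace with a figure measured from your chosen voice and speed.

```python
from typing import List

WORDS_PER_SECOND = 2.5  # assumed speaking rate (about 150 wpm); calibrate it


def estimate_seconds(line: str) -> float:
    """Rough spoken duration of a line, based on its word count."""
    return len(line.split()) / WORDS_PER_SECOND


def batch_lines(lines: List[str], max_seconds: float = 20.0) -> List[List[str]]:
    """Group consecutive lines so each batch stays within max_seconds."""
    batches: List[List[str]] = []
    current: List[str] = []
    total = 0.0
    for line in lines:
        duration = estimate_seconds(line)
        # Close the current batch when this line would push it past the cap.
        if current and total + duration > max_seconds:
            batches.append(current)
            current, total = [], 0.0
        current.append(line)
        total += duration
    if current:
        batches.append(current)
    return batches
```

The final batch may come out shorter than 8 seconds, and a single line longer than the cap becomes its own batch, so check the result before synthesizing.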
5) Long‑Form Stability
Generating 3–8 minute narrations reliably requires consistent segmentation and a predictable export path.
- Segment your script by sense units (clauses) rather than fixed seconds.
- Keep per‑segment duration under ~20 seconds to reduce memory spikes.
- Export each segment, then assemble them in your DAW or editor (Premiere, CapCut, Audacity, ffmpeg).
# Example: joining segments with ffmpeg
ffmpeg -i part1.wav -i part2.wav -i part3.wav \
-filter_complex "[0:a][1:a][2:a]concat=n=3:v=0:a=1[out]" \
-map "[out]" narration.wav
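For scripts with more than a handful of segments, both steps can be scripted. The sketch below splits text at sentence ends (falling back to clause punctuation for over-long sentences) and builds the same ffmpeg concat command for any number of parts. The helper names and the 2.5 words-per-second rate are assumptions, and clauses produced by the fallback can still exceed the limit, so treat the output as a first pass.

```python
import re
from typing import List

WORDS_PER_SECOND = 2.5  # assumed speaking rate; calibrate for your voice


def split_sense_units(text: str, max_seconds: float = 20.0) -> List[str]:
    """Split at sentence ends; split over-long sentences at clause marks."""
    max_words = int(max_seconds * WORDS_PER_SECOND)
    units: List[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        if len(sentence.split()) <= max_words:
            units.append(sentence)
        else:
            # Fall back to clause boundaries, keeping the punctuation.
            units.extend(c for c in re.split(r"(?<=[,;:])\s+", sentence) if c)
    return units


def concat_command(parts: List[str], output: str) -> List[str]:
    """Build the ffmpeg argument list that joins audio-only segments."""
    if not parts:
        raise ValueError("need at least one segment")
    args = ["ffmpeg"]
    for path in parts:
        args += ["-i", path]
    inputs = "".join(f"[{i}:a]" for i in range(len(parts)))
    graph = f"{inputs}concat=n={len(parts)}:v=0:a=1[out]"
    return args + ["-filter_complex", graph, "-map", "[out]", output]
```

Passing the returned list to `subprocess.run(cmd, check=True)` runs the join without shell quoting issues.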
6) Audio Quality and Loudness
For polished results, normalize loudness to −16 LUFS for stereo (or −19 LUFS mono). Apply a gentle high‑shelf EQ for air and a soft limiter to catch peaks.
- Sample rates: 44.1 kHz or 48 kHz WAV for master export.
- Post‑processing: light de‑essing and 1–2 dB limiting.
- Delivery: export MP3 192–256 kbps for web; WAV for editing.
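If you assemble with ffmpeg, its `loudnorm` filter can apply the loudness targets above. This is a rough single-pass sketch under stated assumptions: the function names are hypothetical, the true-peak ceiling (`TP=-1.5`) and loudness range (`LRA=11`) are assumed starting values rather than recommendations from this guide, and MP3 encoding assumes your ffmpeg build includes `libmp3lame`. EQ and de-essing are left to your editor.

```python
from typing import List

# Targets from this guide: -16 LUFS for stereo, -19 LUFS for mono.
TARGET_LUFS = {2: -16, 1: -19}


def loudnorm_command(src: str, dst: str, channels: int = 2,
                     sample_rate: int = 48000) -> List[str]:
    """Build an ffmpeg command that normalizes loudness in a single pass."""
    if channels not in TARGET_LUFS:
        raise ValueError("expected 1 (mono) or 2 (stereo) channels")
    # TP and LRA below are assumed values; adjust them to taste.
    audio_filter = f"loudnorm=I={TARGET_LUFS[channels]}:TP=-1.5:LRA=11"
    # loudnorm resamples internally, so set the output rate explicitly.
    return ["ffmpeg", "-i", src, "-af", audio_filter,
            "-ar", str(sample_rate), dst]


def mp3_command(src: str, dst: str, kbps: int = 192) -> List[str]:
    """Build an ffmpeg command that encodes a web-delivery MP3."""
    return ["ffmpeg", "-i", src, "-codec:a", "libmp3lame",
            "-b:a", f"{kbps}k", dst]
```

Normalize the WAV master first, then encode the MP3 from the normalized file so both deliverables share the same loudness.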
7) Private Voiceover Pipeline
Many teams pair Kokoro with local transcription or translation so that nothing touches a server. A popular stack is local Whisper‑style transcription, script editing in a browser doc, then TTS with Kokoro, all offline.
- Draft: outline → write → mark pauses and emphasis.
- Generate: preview per segment → adjust speed/voice → export.
- Assemble: line up with visuals → add music/SFX → normalize.
8) Common Pitfalls and Fixes
- Runs slowly? Switch to a WebGPU‑capable browser or reduce segment length.
- Choppy rhythm? Add commas and brief parenthetical pauses like (…).
- Mispronounced names? Add phonetic hints in your working script and adjust spelling.
- Export crackles? Ensure matching sample rate across your editor and exports.
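A quick way to rule out sample-rate mismatches is to read the rate from each exported file before importing it into your editor. The sketch uses Python's standard `wave` module, which reads PCM WAV files only; if your exports use another encoding (for example 32-bit float), it will raise an error and you would need a different reader. The function names are hypothetical.

```python
import wave
from typing import Dict, Optional


def sample_rate(source) -> int:
    """Read the sample rate from a PCM WAV path or binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


def mismatched_rates(sources: Dict[str, object],
                     expected: Optional[int] = None) -> Dict[str, int]:
    """Return {name: rate} for every file whose rate differs from expected.

    When expected is None, the first file's rate is used as the reference.
    """
    rates = {name: sample_rate(src) for name, src in sources.items()}
    if expected is None and rates:
        expected = next(iter(rates.values()))
    return {name: rate for name, rate in rates.items() if rate != expected}
```

An empty result means every segment matches; otherwise, re-export or resample the listed files to your project rate before assembling.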
9) When to Choose WebGPU vs. WebAssembly
Use WebGPU to iterate fast on modern hardware, or rely on WebAssembly to reach locked‑down environments. Kokoro abstracts both paths so your workflow remains the same.
10) Checklist
- Script segmented into sense units under ~20s each
- Voice matches tone and audience; pacing verified
- Punctuation and optional stage directions added
- Exported WAV at 44.1/48 kHz, normalized and limited
- Saved project files for later updates
Author: Kokoro Web Team • Last updated 2025‑01‑15