Can I use Kokoro offline?

Yes. After the initial model download, synthesis runs fully on device without uploading text or audio.

How do I improve voice quality?

Segment scripts into short sentences, add commas and brief pauses, choose an appropriate voice, and normalize loudness on export.

Why is the first load slow?

The model (~150 MB) has to be downloaded and compiled on first use. Subsequent sessions are much faster because of caching.

Kokoro TTS in the Browser: A 2025 Practical Guide to Private, On‑Device Voiceovers

Kokoro is a compact English text‑to‑speech (TTS) model designed to run directly in your browser. It uses WebGPU where available and falls back to WebAssembly, so you can generate natural voiceovers without sending text or audio to any server. This guide covers setup, performance tuning, voice selection, SSML‑style controls, long‑form stability, and production workflows.

1) Environment and Setup

For best results, use an up‑to‑date Chromium‑based browser. On desktops with recent GPUs, WebGPU delivers the lowest latency. When WebGPU isn’t available, Kokoro falls back to a high‑performance WebAssembly build, including threads and SIMD where supported.

Recommended browsers: Chrome, Edge, Arc (2025 builds), Firefox (WebAssembly path).
Storage: The model (~150 MB) is downloaded on first load and stored locally in the browser cache.
Network: After the first load, inference runs fully offline.

Tip: If your first load is slow, keep the tab active until the progress bar completes. Subsequent sessions are dramatically faster thanks to cache hits.
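
The WebGPU‑first, WebAssembly‑fallback choice described above can be sketched as a small feature check. This is a minimal illustration, not Kokoro's actual loader: `pickBackend`, `Backend`, and `NavigatorLike` are hypothetical names, and the only real API assumed is `navigator.gpu.requestAdapter()` from the WebGPU specification.

```typescript
// Hypothetical helper for illustration; not part of Kokoro's API.
type Backend = "webgpu" | "wasm";

// Minimal shape of the browser's `navigator` that this sketch relies on.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<unknown> };
}

async function pickBackend(nav: NavigatorLike | undefined): Promise<Backend> {
  // `gpu` is only present in browsers that expose WebGPU.
  if (!nav || !nav.gpu) return "wasm";
  try {
    // requestAdapter() resolves to null when no suitable GPU is available.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any failure as "WebGPU not usable" and fall back.
    return "wasm";
  }
}
```

In a browser you would pass the global `navigator` object. Taking it as a parameter keeps the check easy to test outside a browser.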

2) Voice Selection and Prompting

Kokoro ships with several clear, neutral English voices. The fastest improvement to perceived quality is matching the voice to the script’s intent and pacing it properly.

Choose a voice that matches your brand tone (friendly, formal, energetic).
Use short sentences and explicit punctuation to guide prosody.
Add stage directions in brackets, e.g., [warmly], [pause], [smile].
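
If you want to handle bracketed directions yourself before synthesis, one option is to split each line into spoken text, pauses, and notes. The sketch below is an assumption about how such a preprocessing step could look: `parseScript`, `ScriptPart`, and the 300 ms default pause are hypothetical, and it accepts an optional duration such as `[pause 400ms]`.

```typescript
// Hypothetical preprocessing step; names and defaults are illustrative.
type ScriptPart =
  | { kind: "text"; text: string }
  | { kind: "pause"; ms: number }
  | { kind: "direction"; label: string };

const DEFAULT_PAUSE_MS = 300; // assumed length of a bare [pause]

function parseScript(line: string): ScriptPart[] {
  const parts: ScriptPart[] = [];
  const tag = /\[([^\]]+)\]/g; // matches [anything without a closing bracket]
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = tag.exec(line)) !== null) {
    // Text between the previous tag and this one is spoken text.
    const before = line.slice(last, m.index).trim();
    if (before) parts.push({ kind: "text", text: before });
    const body = m[1].trim();
    // "[pause]" or "[pause 400ms]"; anything else is kept as a direction.
    const pause = /^pause(?:\s+(\d+)\s*ms)?$/i.exec(body);
    if (pause) {
      parts.push({ kind: "pause", ms: pause[1] ? Number(pause[1]) : DEFAULT_PAUSE_MS });
    } else {
      parts.push({ kind: "direction", label: body });
    }
    last = m.index + m[0].length;
  }
  const rest = line.slice(last).trim();
  if (rest) parts.push({ kind: "text", text: rest });
  return parts;
}
```

With this split, text parts go to synthesis, pause parts can be rendered as silence of the given length, and direction labels stay in your notes as reminders for voice and pacing choices.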

Lightweight prompting patterns

// Example: guiding emphasis and breaks
"Welcome to Kokoro Web! [pause 400ms] This guide shows you how to create private, on‑device voiceovers. [smile] Let’s begin."

3) SSML‑style Controls

Even without full SSML parsing, simple markup conventions help consistency: add explicit commas and periods, use parenthetical breaks, and keep proper nouns capitalized. For glossary terms or product names, consider phonetic hints.
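
One way to keep phonetic hints consistent across a long script is a small glossary that swaps each term for its respelling before synthesis. This is a sketch under assumptions: `applyPhoneticHints` and the glossary format are hypothetical, and whether a given respelling actually improves pronunciation has to be checked by ear for your chosen voice.

```typescript
// Hypothetical helper: replaces whole-word glossary terms with respellings.
function applyPhoneticHints(text: string, glossary: Record<string, string>): string {
  let out = text;
  for (const term of Object.keys(glossary)) {
    const respelling = glossary[term];
    // Escape regex metacharacters so terms like "C++" are matched literally.
    const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // \b limits matches to whole words (works for terms that start and
    // end with a letter or digit); "gi" makes it case-insensitive.
    const pattern = new RegExp("\\b" + escaped + "\\b", "gi");
    // A function replacer avoids special handling of "$" in the respelling.
    out = out.replace(pattern, () => respelling);
  }
  return out;
}
```

Keep the original script as the source of truth and apply the glossary only to the copy you feed to synthesis, so captions and transcripts keep the normal spelling.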

// Phonetic hint example (pseudo‑notation for your notes)
"Schedule" → "SKED‑jool" (US) or "SHEH‑dyool" (UK)

4) Performance Tuning (WebGPU vs. WebAssembly)

On GPUs that support WebGPU, medium‑length lines synthesize in near real time. On CPUs, WebAssembly with threads and SIMD remains practical for short‑form narration and batch jobs.

Prefer WebGPU for live previewing and iterative script work.
Use WebAssembly for broad compatibility (locked‑down devices or VMs).
Batch multiple lines into segments of 8–20 seconds to amortize setup overhead.
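
The batching advice above can be approximated by estimating each line's spoken duration from its word count. The sketch below assumes a speaking rate of 150 words per minute, which is a rough guess, not a Kokoro figure. `estimateSeconds` and `batchLines` are hypothetical names.

```typescript
// Assumed speaking rate; measure your chosen voice and adjust.
const WORDS_PER_MINUTE = 150;

function estimateSeconds(text: string): number {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return (words / WORDS_PER_MINUTE) * 60;
}

// Greedily groups consecutive lines so each batch stays under maxSeconds.
function batchLines(lines: string[], maxSeconds: number = 20): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let currentSeconds = 0;
  for (const line of lines) {
    const secs = estimateSeconds(line);
    // Start a new batch when adding this line would exceed the limit.
    if (current.length > 0 && currentSeconds + secs > maxSeconds) {
      batches.push(current);
      current = [];
      currentSeconds = 0;
    }
    current.push(line);
    currentSeconds += secs;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

A single line longer than the limit still ends up in its own batch, so very long lines should be split before batching.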

Works offline: After the initial download, you can disconnect from the network and continue generating audio.

5) Long‑Form Stability

Generating 3–8 minute narrations reliably requires consistent segmentation and a predictable export path.

Segment your script by sense units (clauses) rather than fixed seconds.
Keep per‑segment duration under ~20 seconds to reduce memory spikes.
Export each segment separately, then assemble them in your DAW, editor, or command‑line tool (Premiere, CapCut, Audacity, ffmpeg).
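
Splitting by sense units can be sketched as: break at sentence ends first, then at clause punctuation only when a sentence is too long. The helper below is a simplified illustration. `splitIntoSenseUnits` is a hypothetical name, the 50‑word limit assumes roughly 20 seconds at about 150 words per minute, and the naive punctuation rules will mis-split abbreviations and decimals such as "e.g." or "44.1".

```typescript
// Hypothetical splitter: sentences first, clauses only when needed.
function splitIntoSenseUnits(script: string, maxWords: number = 50): string[] {
  const countWords = (s: string): number =>
    s.trim().split(/\s+/).filter(Boolean).length;
  const clean = (pieces: string[] | null): string[] =>
    (pieces || []).map((s) => s.trim()).filter(Boolean);

  // A sentence is a run of text plus any trailing ., !, or ? characters.
  const sentences = clean(script.match(/[^.!?]+[.!?]*/g));
  const units: string[] = [];
  for (const sentence of sentences) {
    if (countWords(sentence) <= maxWords) {
      units.push(sentence);
      continue;
    }
    // Too long: regroup the sentence's clauses (split after , ; :).
    let current = "";
    for (const clause of clean(sentence.match(/[^,;:]+[,;:]?/g))) {
      const candidate = current ? current + " " + clause : clause;
      if (current && countWords(candidate) > maxWords) {
        units.push(current);
        current = clause;
      } else {
        current = candidate;
      }
    }
    if (current) units.push(current);
  }
  return units;
}
```

A clause that is itself longer than the limit is kept whole, which matches the advice to prefer sense units over fixed durations.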

# Example: joining segments with ffmpeg
ffmpeg -i part1.wav -i part2.wav -i part3.wav \
  -filter_complex "[0:a][1:a][2:a]concat=n=3:v=0:a=1[out]" \
  -map "[out]" narration.wav
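
For more than a handful of segments, writing the filter string by hand gets error‑prone. The sketch below builds the same ffmpeg arguments as the command above for any number of inputs. `buildConcatArgs` is a hypothetical helper, and it only constructs the argument list and does not run ffmpeg.

```typescript
// Hypothetical helper: builds ffmpeg arguments for an audio-only concat.
function buildConcatArgs(inputs: string[], output: string): string[] {
  if (inputs.length === 0) throw new Error("need at least one input file");
  const args: string[] = [];
  for (const file of inputs) args.push("-i", file);
  // One "[i:a]" label per input, in order: "[0:a][1:a][2:a]..."
  const labels = inputs.map((_, i) => "[" + i + ":a]").join("");
  args.push(
    "-filter_complex", labels + "concat=n=" + inputs.length + ":v=0:a=1[out]",
    "-map", "[out]",
    output
  );
  return args;
}
```

In Node you could pass the result to `child_process.spawn("ffmpeg", args)`. Because the arguments are an array rather than a shell string, file names with spaces need no extra quoting.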

6) Audio Quality and Loudness

For polished results, normalize loudness to −16 LUFS for stereo (or −19 LUFS mono). Apply a gentle high‑shelf EQ for air and a soft limiter to catch peaks.

Sample rates: 44.1 kHz or 48 kHz WAV for master export.
Post‑processing: light de‑essing and a limiter set for 1–2 dB of gain reduction.
Delivery: export MP3 192–256 kbps for web; WAV for editing.
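
The arithmetic behind loudness normalization is simple once you have a measurement: the gain in dB is the target minus the measured loudness, and the linear multiplier is 10^(dB/20). The sketch below assumes the integrated loudness was measured elsewhere (for example by your editor's loudness meter). `gainToTarget` and `peakAfterGain` are hypothetical names.

```typescript
// Hypothetical helpers; the LUFS measurement itself is not computed here.
function gainToTarget(
  measuredLufs: number,
  targetLufs: number = -16
): { gainDb: number; linear: number } {
  const gainDb = targetLufs - measuredLufs; // e.g. -16 - (-22) = +6 dB
  return { gainDb, linear: Math.pow(10, gainDb / 20) };
}

// Largest absolute sample value after applying the gain.
// A result above 1.0 means the audio would clip without a limiter.
function peakAfterGain(samples: Float32Array, linear: number): number {
  let peak = 0;
  for (let i = 0; i < samples.length; i++) {
    peak = Math.max(peak, Math.abs(samples[i] * linear));
  }
  return peak;
}
```

If `peakAfterGain` reports a value above 1.0, that is the case the soft limiter is meant to catch.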

7) Private Voiceover Pipeline

Many teams pair Kokoro with local transcription or translation, with no servers involved. A popular stack is Whisper‑style transcription run locally, script editing in a browser doc, then TTS with Kokoro, all offline.

Draft: outline → write → mark pauses and emphasis.
Generate: preview per segment → adjust speed/voice → export.
Assemble: line up with visuals → add music/SFX → normalize.

8) Common Pitfalls and Fixes

Runs slowly? Switch to a WebGPU‑capable browser or reduce segment length.
Choppy rhythm? Add commas and brief parenthetical pauses like (…).
Mispronounced names? Add phonetic hints in your working script and adjust spelling.
Export crackles? Make sure the sample rate is the same in your exported files and your editor project.
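
To check sample rates without opening every file in an editor, you can read the value straight from each WAV header. The sketch below is a minimal reader for standard RIFF/WAVE files. `readWavSampleRate` is a hypothetical helper, and it assumes you already have the file's bytes in a `Uint8Array`.

```typescript
// Hypothetical helper: returns the sample rate stored in a WAV "fmt " chunk.
function readWavSampleRate(bytes: Uint8Array): number {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Reads a 4-character chunk tag such as "RIFF" or "fmt ".
  const tag = (offset: number): string =>
    String.fromCharCode(bytes[offset], bytes[offset + 1], bytes[offset + 2], bytes[offset + 3]);

  if (bytes.length < 12 || tag(0) !== "RIFF" || tag(8) !== "WAVE") {
    throw new Error("not a RIFF/WAVE file");
  }
  let offset = 12; // first chunk starts after "RIFF", size, "WAVE"
  while (offset + 8 <= bytes.length) {
    const size = view.getUint32(offset + 4, true); // little-endian chunk size
    if (tag(offset) === "fmt ") {
      if (offset + 16 > bytes.length) break; // truncated fmt chunk
      // Chunk data starts at offset + 8; the sample rate is 4 bytes into it.
      return view.getUint32(offset + 12, true);
    }
    offset += 8 + size + (size % 2); // chunks are padded to even length
  }
  throw new Error("fmt chunk not found");
}
```

Running this over all exported segments and comparing the results flags a mismatched file before it reaches your timeline.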

9) When to Choose WebGPU vs. WebAssembly

Use WebGPU to iterate fast on modern hardware, or rely on WebAssembly to reach locked‑down environments. Kokoro abstracts both paths so your workflow remains the same.

10) Checklist

Script segmented into sense units under ~20s each
Voice matches tone and audience; pacing verified
Punctuation and optional stage directions added
Exported WAV at 44.1/48 kHz, normalized and limited
Saved project files for later updates

Author: Kokoro Web Team • Last updated 2025‑01‑15