When should I choose WebGPU?

Choose WebGPU when you want the lowest latency for interactive preview, repeated script tuning, or high-volume local generation.

When should I choose WASM?

Choose WASM when compatibility matters more than speed, including managed devices, browser restrictions, or systems where GPU support is unreliable.

What makes a workflow page useful?

Useful workflow pages explain the goal, the setup, the tradeoffs, the quality checks, and the failure cases so the reader can actually ship the result.

Kokoro Web Workflow Playbook: Practical TTS Paths for Creators, Teams, and Educators

This page is the practical version of the site: the part that helps you move from “I have text” to “I have a finished voice track” without guessing. It explains how to pick the right browser path, how to prepare scripts, how to export safely, and how to check quality before you publish.

1. Decide What the Page Should Do

Most weak TTS pages try to do everything. A better workflow starts by naming the job. Are you making a 30-second product teaser, a 5-minute YouTube narration, a lesson module, or an accessibility layer for a web app? Each one needs a different script shape, pacing, and export target.

That choice affects how you split paragraphs, whether you keep the audio in a timeline editor, and whether you need a fast preview path or a stable batch path. The goal is not just to generate speech. The goal is to generate speech that fits the context without a second pass of cleanup.

2. Choose the Browser Path

Kokoro Web supports two practical execution modes: WebGPU for speed and WASM for fallback compatibility. The right choice depends on the device in front of you.

Scenario	Best Path	Why
Repeated editing on a modern desktop	WebGPU	Fast preview turns small text changes into immediate feedback.
Managed laptop or locked-down browser	WASM	Better compatibility when GPU support is limited or policy-restricted.
Long batch export	Hybrid	Preview with WebGPU, then run the full export on whichever path has been most stable on that device.
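
One way to turn the table above into code is a small capability check that prefers WebGPU and falls back to WASM. This is a minimal sketch, not Kokoro Web's actual selection logic: the `pickBackend` function and the `NavigatorLike` and `GpuLike` types are hypothetical names introduced here. The only browser API it relies on is `navigator.gpu.requestAdapter()`, which resolves to `null` when no adapter is available.

```typescript
// Minimal structural types so the sketch compiles without WebGPU type packages.
// These names are illustrative assumptions, not part of any library.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend = "webgpu" | "wasm";

// Returns "webgpu" only when an adapter is actually granted; otherwise "wasm".
async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  // Browsers without WebGPU do not expose navigator.gpu at all.
  if (!nav.gpu) return "wasm";
  try {
    // requestAdapter() resolves to null when no suitable GPU is available.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Policy-restricted or broken GPU stacks may reject; treat that as a fallback case.
    return "wasm";
  }
}
```

In a browser you would call it as `pickBackend(navigator as NavigatorLike)`. Passing the navigator in as a parameter keeps the check easy to exercise outside a browser.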

3. Prepare Scripts for TTS Instead of for Reading

A script that reads well on screen is not always a script that sounds good. TTS benefits from punctuation that marks real pauses, shorter sentences, and explicit hints when a term is likely to be misread.

Keep one sentence focused on one idea.
Use commas to mark breathing points and periods to end a thought.
Break long paragraphs into smaller semantic units before generating audio.
Spell out acronyms only when the spoken form should differ from the written form.

For example, a paragraph about “browser support” should not be a single wall of text. Split it into a setup statement, a browser note, and a fallback note. That gives you a cleaner edit and a better chance of catching awkward pronunciation before you export the final track.
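
The splitting advice above can be sketched as a small helper that breaks text into sentences and packs them into short segments. This is an illustration under stated assumptions: `splitIntoSegments` and the 200-character default are hypothetical, and the sentence rule is deliberately naive, so abbreviations and decimals such as "e.g." or "3.5" will be split at the period.

```typescript
// Splits text into sentences, then packs whole sentences into segments
// that stay at or under maxChars where possible.
function splitIntoSegments(text: string, maxChars = 200): string[] {
  // A sentence is a run of non-terminators followed by terminators,
  // or any trailing text that has no terminator.
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  const segments: string[] = [];
  let current = "";

  for (const raw of sentences) {
    const sentence = raw.trim();
    if (!sentence) continue;
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length > maxChars && current) {
      // Adding this sentence would overflow the segment, so start a new one.
      segments.push(current);
      current = sentence;
    } else {
      // A single sentence longer than maxChars is kept whole rather than cut mid-sentence.
      current = candidate;
    }
  }
  if (current) segments.push(current);
  return segments;
}
```

A small `maxChars` gives quick previews of individual sentences; a larger value keeps related sentences together so pacing across them can be judged in one listen.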

4. Use a Repeatable Production Pipeline

When a team generates audio repeatedly, the workflow matters as much as the voice model. The pipeline below is intentionally simple because simple pipelines are easier to audit, hand off, and repeat.

Write the source script in your editor of choice.
Split it into segments that are short enough to preview quickly.
Run the first pass in the browser and listen before you export anything.
Adjust punctuation, names, and emphasis cues after the first pass.
Export WAV for editing and archiving, then render the delivery format you need.
Store the script, exports, and notes together so a future revision can be reproduced.

This is the difference between a tool demo and a production workflow. A production workflow leaves behind evidence: the script version, the export format, the browser path used, and the notes that explain why a given voice or speed was selected.
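
The evidence listed above can be kept as a small manifest stored next to the exports. The sketch below is one possible shape, not a format Kokoro Web defines: `ExportRecord`, its field names, and `toManifestJson` are all hypothetical.

```typescript
// One record per exported file. Field names are illustrative assumptions.
interface ExportRecord {
  scriptVersion: string;            // e.g. the script file the audio was generated from
  exportFormat: "wav" | "mp3";
  browserPath: "webgpu" | "wasm";   // which execution mode produced the audio
  voice: string;                    // placeholder; use whatever voice name your setup reports
  speed: number;
  notes: string;                    // why this voice or speed was chosen
}

// Serializes the records as indented JSON so the file diffs cleanly in version control.
function toManifestJson(records: ExportRecord[]): string {
  return JSON.stringify({ records }, null, 2);
}
```

Writing the manifest at export time, rather than reconstructing it later, is what makes a future revision reproducible.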

5. Keep Quality Checks Small and Concrete

Content becomes low-value when a page offers generic praise instead of checks you can actually apply. For TTS, the best quality checks are specific and measurable:

Does the first sentence set the context without sounding rushed?
Do names and acronyms sound correct on the first try?
Is the loudness suitable for the destination platform?
Are the pauses natural enough that the listener can follow along without strain?
Does the exported file's duration line up with the project timeline at its frame rate?

If the answer to any of these is no, fix the script before adding heavier post-processing. In most cases, a cleaner script gives a better result than a heavier mastering chain.
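
Two of the checks above can be made numeric. The sketch below assumes the audio is available as mono floating-point samples in the range -1 to 1; `rmsDbfs` and `durationInFrames` are hypothetical helper names. Note that platforms usually state loudness targets in LUFS, which requires the weighting and gating defined in ITU-R BS.1770, so the RMS level here is only a rough proxy for comparing takes against each other.

```typescript
// RMS level in dB relative to full scale. Returns -Infinity for empty or silent input.
function rmsDbfs(samples: Float32Array): number {
  if (samples.length === 0) return -Infinity;
  const sumSquares = samples.reduce((acc, s) => acc + s * s, 0);
  const rms = Math.sqrt(sumSquares / samples.length);
  return rms === 0 ? -Infinity : 20 * Math.log10(rms);
}

// Length of a clip expressed in video frames, rounded to the nearest frame.
function durationInFrames(sampleCount: number, sampleRate: number, fps: number): number {
  const seconds = sampleCount / sampleRate;
  return Math.round(seconds * fps);
}
```

Logging both numbers for every segment makes it easy to spot a take that is noticeably quieter than its neighbors or a clip that no longer fits its slot in the timeline.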

6. A Simple Folder Structure

Teams often lose time because nobody knows where the source script or last good export lives. A lightweight folder convention helps a lot:

project/
  script/
    narration-v1.md
    narration-v2.md
  audio/
    wav/
      intro.wav
      chapter-01.wav
    delivery/
      narration-final.mp3
  notes/
    pronunciation.md
    export-settings.md

That structure is boring, and that is the point. Boring structures are easy to search, easy to reproduce, and easy to explain to someone who joins the project later.
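
If exports are saved by a script rather than by hand, a tiny helper keeps file names consistent with the tree above. `chapterWavPath` is a hypothetical name, and the layout it produces is simply the convention shown in this section.

```typescript
// Builds a WAV path that follows the folder convention, zero-padding chapters below 10.
function chapterWavPath(projectRoot: string, chapter: number): string {
  const padded = chapter < 10 ? `0${chapter}` : String(chapter);
  return `${projectRoot}/audio/wav/chapter-${padded}.wav`;
}
```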

7. Troubleshooting

Problem	Likely Cause	Fix
Speech sounds flat	Sentences are too long or punctuation is missing.	Split the text and add natural pauses.
Name is mispronounced	The written form is ambiguous.	Add a phonetic hint or rewrite the term.
Long export fails or lags	Segments are too long or the session ran too long.	Use shorter segments and save intermediate WAV files.
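
Saving intermediate WAV files, as the last fix suggests, needs nothing more than a 44-byte header in front of the samples. The sketch below encodes mono float samples as 16-bit PCM. It is an illustration, not Kokoro Web's own exporter: `encodeWav` is a hypothetical name, and the sample rate is a parameter because it depends on what your model and setup actually produce.

```typescript
// Encodes mono float samples (range -1..1) as a 16-bit PCM WAV file.
function encodeWav(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  // RIFF container header (all multi-byte values are little-endian).
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);   // file size minus the first 8 bytes
  writeAscii(8, "WAVE");

  // Format chunk.
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // format chunk size for PCM
  view.setUint16(20, 1, true);              // audio format 1 = uncompressed PCM
  view.setUint16(22, 1, true);              // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true);             // bits per sample

  // Data chunk.
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  samples.forEach((s, i) => {
    // Clamp to avoid integer wrap-around on out-of-range samples.
    const clamped = Math.max(-1, Math.min(1, s));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(44 + i * bytesPerSample, Math.round(scaled), true);
  });

  return new Uint8Array(buffer);
}
```

In a browser, the returned bytes can be wrapped in a `Blob` with type `audio/wav` and offered as a download after each segment, so a failed long session loses only the segment in progress.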

8. Example Workflows

YouTube narration

Write the outline first, then turn each section into a voice segment. Preview the intro and call to action before you generate the rest of the narration. This avoids wasting time on a full render when the hook still needs work.

Accessibility narration

Use shorter phrases and avoid overly decorative punctuation. The goal is clarity for a listener who may already be juggling the main interface and the voice output at the same time.

Internal training content

Prefer consistent terminology and stable exports. Training content is useful when people can re-listen later and hear the same terms spoken the same way.

9. What This Site Tries Not to Do

We do not try to publish one thin page for every keyword variation. We do not hide the limitations of browser TTS behind generic marketing copy. We also do not call a page “complete” unless it explains enough for a real user to finish the task.

That editorial choice matters for AdSense as well. A site with a few real, useful, and inspectable pages is much easier to trust than a site with many pages that only differ by title.

Author: Kokoro Web Team - Last updated 2026-07-08
