Text to Speech

Type text, pick a voice, hear it spoken — entirely in your browser.

First run is slow.

The SpeechT5 model (~25 MB) downloads from Hugging Face on first synthesis and caches in your browser. Subsequent runs are typically 5–15 seconds for 100 words.

Type text and hear it spoken with one of five natural-sounding voices. SpeechT5 (~25 MB) runs entirely in your browser via @huggingface/transformers — your text never leaves your device, no API key, no signup. Output is a downloadable WAV file ready for use as a voiceover, accessibility narration, or audiobook draft.

How it works

4-step walkthrough

  1. Type text

    Up to 600 characters per synthesis run. For longer narrations, split the script into chunks, synthesize each one separately, then combine them with Audio Merge or in your editor.

  2. Pick a voice

    Five preset voices from the CMU Arctic dataset: two female (US), two male (US), and one Canadian male. Each has a distinct timbre suited to different content (narration, dialogue, podcasts).

  3. Synthesize

    On first run, the SpeechT5 model (~25 MB) downloads from Hugging Face's public CDN and caches in your browser. Subsequent runs skip the download. A 100-word synthesis takes 5–15 seconds on average hardware.

  4. Preview and download

    An inline audio player auto-plays the result. Download as WAV (16 kHz mono) for further editing.
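
The download step can be sketched as follows. This is a minimal sketch, not Dropvert's actual code: it assumes the synthesizer yields mono Float32Array samples in the range [-1, 1] at 16 kHz and writes them as 16-bit PCM (the bit depth is an assumption; the page only states 16 kHz mono). The name encodeWav is hypothetical.

```typescript
// Hypothetical helper: wrap mono float samples in a 16-bit PCM WAV container.
// Assumes samples are in [-1, 1]; values outside that range are clamped.
function encodeWav(samples: Float32Array, sampleRate = 16000): ArrayBuffer {
  const bytesPerSample = 2; // 16-bit PCM (assumed bit depth)
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte header + audio data
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // bytes per second
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale to the signed 16-bit range, little-endian.
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return buffer;
}
```

In a browser, the returned buffer can be wrapped as `new Blob([encodeWav(samples)], { type: "audio/wav" })` and passed to `URL.createObjectURL` to feed an audio element or a download link.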

Why use Dropvert

Local-first, free, no upload required

  • Browser-side — your text never touches a server, no API key required.
  • Five voice presets with distinct character.
  • No per-character pricing, no monthly limit, no signup.
  • Output is WAV — losslessly editable in any DAW or video editor.
  • Works offline after the model is cached.

Frequently asked questions

5 answered

How does the quality compare to ElevenLabs / Murf / Descript?
Honest answer: not as good. SpeechT5 is a high-quality open-source TTS model, but it doesn't match the latest commercial offerings on prosody (intonation, emphasis) or speaker variety. For YouTube narration, accessibility, and rough drafts, it's solid. For commercial voiceover work where production quality matters, you'll likely want a paid service.
Why is text limited to 600 characters?
SpeechT5 generates one continuous audio stream per run, and longer inputs degrade quality and can cause the output to break down or cut off mid-sentence. 600 characters is roughly 100 words, or 30–60 seconds of speech, which is generally enough for a single take. For longer scripts, split the text into chunks and synthesize each one separately.
Can I use a different voice or clone my own?
The five presets are baked in for v1. Voice cloning (using a sample of your own voice) requires a fundamentally different model class and isn't feasible in-browser yet. It is tracked as a future enhancement for when suitable models become small enough.
What languages are supported?
SpeechT5 in this configuration is English-only. Multilingual TTS in the browser is possible (e.g., Bark, MMS-TTS) but the models are larger; we'll add language options once model size becomes feasible.
Is my text uploaded?
No. Both the input text and the synthesized audio stay in your browser. The model file is fetched from Hugging Face's public CDN — that's the only network request, and it doesn't carry your text.
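
The chunking suggested above for scripts longer than 600 characters can be sketched as follows. This is a minimal sketch under stated assumptions, not the tool's own logic: splitIntoChunks is a hypothetical helper that prefers sentence boundaries, falls back to word boundaries for an overlong sentence, and hard-slices any single word longer than the limit.

```typescript
// Hypothetical helper: split a script into chunks of at most maxChars,
// preferring sentence boundaries so each chunk reads as a natural take.
function splitIntoChunks(text: string, maxChars = 600): string[] {
  // A "sentence" is text up to and including ., ! or ? plus trailing
  // whitespace, or whatever text remains at the end of the input.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = "";

  const flush = (): void => {
    const trimmed = current.trim();
    if (trimmed.length > 0) chunks.push(trimmed);
    current = "";
  };

  for (const sentence of sentences) {
    if ((current + sentence).trim().length <= maxChars) {
      current += sentence;
      continue;
    }
    flush();
    if (sentence.trim().length <= maxChars) {
      current = sentence;
      continue;
    }
    // The sentence alone exceeds the limit: pack it word by word.
    for (const word of sentence.trim().split(/\s+/)) {
      const candidate = current.length === 0 ? word : current + " " + word;
      if (candidate.length <= maxChars) {
        current = candidate;
        continue;
      }
      flush();
      // A single word longer than the limit is hard-sliced.
      let rest = word;
      while (rest.length > maxChars) {
        chunks.push(rest.slice(0, maxChars));
        rest = rest.slice(maxChars);
      }
      current = rest;
    }
    current += " "; // keep a space before the next sentence
  }
  flush();
  return chunks;
}
```

Each returned chunk can then be pasted into the tool as its own synthesis run, and the resulting WAV files joined afterwards.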
