Question 1

How does the quality compare to ElevenLabs / Murf / Descript?

Accepted Answer

Honest answer: not as good. SpeechT5 is a high-quality open-source TTS model, but it doesn't match the latest commercial offerings on prosody (intonation, emphasis) or speaker variety. It's solid for YouTube narration, accessibility, and rough drafts. For commercial voiceover work where production quality matters, you'll likely want a paid service.

Question 2

Why is text limited to 600 characters?

Accepted Answer

SpeechT5 generates one continuous audio stream per run, and quality degrades on longer inputs: the model can lose track of the text partway through a sentence. 600 characters is roughly two short paragraphs, or 30–60 seconds of speech, which is generally enough for a single take. For longer scripts, split the text into chunks of 600 characters or fewer, ideally at sentence boundaries, and synthesize each chunk separately.
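
Splitting a longer script into chunks could look something like the sketch below. This is only an illustration, not the tool's own code: the function name `splitIntoChunks` and its `maxChars` parameter are hypothetical, and the sentence detection is a simple punctuation heuristic. It packs whole sentences into chunks up to the limit, falls back to word boundaries for an oversized sentence, and hard-cuts any single word longer than the limit.

```typescript
// Hypothetical helper for illustration; not part of the tool itself.
// Splits text into chunks of at most maxChars characters, preferring
// sentence boundaries, then word boundaries, then a hard cut.
function splitIntoChunks(text: string, maxChars: number = 600): string[] {
  if (maxChars < 1) {
    throw new RangeError("maxChars must be at least 1");
  }

  // Naive sentence heuristic: a run of text plus its closing . ! or ?
  const matches = text.match(/[.!?]*[^.!?]+[.!?]*/g);
  const sentences = (matches === null ? [] : matches)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";

  // Append a piece (already <= maxChars) to the current chunk,
  // starting a new chunk when it would overflow the limit.
  const push = (piece: string): void => {
    if (current.length === 0) {
      current = piece;
    } else if (current.length + 1 + piece.length <= maxChars) {
      current += " " + piece;
    } else {
      chunks.push(current);
      current = piece;
    }
  };

  for (const sentence of sentences) {
    if (sentence.length <= maxChars) {
      push(sentence);
      continue;
    }
    // Oversized sentence: break on whitespace, hard-cutting long words.
    for (const word of sentence.split(/\s+/)) {
      for (let i = 0; i < word.length; i += maxChars) {
        push(word.slice(i, i + maxChars));
      }
    }
  }

  if (current.length > 0) {
    chunks.push(current);
  }
  return chunks;
}
```

Each returned chunk can then be pasted in and synthesized as its own take. Chunks that end at sentence boundaries tend to join more naturally when the resulting audio clips are stitched together afterwards.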

Question 3

Can I use a different voice or clone my own?

Accepted Answer

The five presets are baked in for v1. Voice cloning (using a sample of your own voice) requires a fundamentally different model class and isn't feasible in-browser yet. It's tracked as a future enhancement for when suitable models become small enough.

Question 4

What languages are supported?

Accepted Answer

SpeechT5 in this configuration is English-only. Multilingual TTS in the browser is possible (e.g., Bark, MMS-TTS), but the models are larger; we'll add language options once they are small enough to be practical.

Question 5

Is my text uploaded?

Accepted Answer

No. Both the input text and the synthesized audio stay in your browser. The model file is fetched from Hugging Face's public CDN — that's the only network request, and it doesn't carry your text.
