Question 1

How accurate is the transcription?

Accepted Answer

On clean English speech, Whisper-Base reaches a word error rate (WER) of roughly 5-10%, which is roughly comparable to professional human transcribers. Whisper-Tiny is closer to 10-20%. Accuracy drops with heavy accents, background noise, overlapping speakers, technical jargon, and music. For best results, transcribe clean recordings of one or two speakers.
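To make the percentages concrete: word error rate is the number of word-level substitutions, insertions, and deletions needed to turn the transcript into a reference, divided by the number of reference words. The sketch below is one minimal way to compute it, assuming lowercase whitespace tokenization with no punctuation handling; the function name `word_error_rate` is made up for illustration and is not part of the tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into zero words costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fox jumped over the lazy dog today"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

One wrong word out of ten gives a 10% WER, so a 5-10% rate means about one error every 10 to 20 words.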

Question 2

Why is the first run so slow?

Accepted Answer

The Whisper model (about 75 MB for Tiny or 150 MB for Base) downloads from Hugging Face's CDN on first use. After that, it is cached in your browser, so later runs skip the download. The transcription itself takes 30-90 seconds per minute of audio on commodity hardware.
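For a rough sense of how long a file will take once the model is cached, the 30-90 seconds per audio minute figure can be turned into a range. This is back-of-the-envelope arithmetic only; the helper name `estimate_processing_minutes` is hypothetical and actual speed depends on your device and the model you pick.

```python
def estimate_processing_minutes(
    audio_minutes: float,
    seconds_per_audio_minute: tuple[float, float] = (30.0, 90.0),
) -> tuple[float, float]:
    """Return (best case, worst case) processing time in minutes.

    The default rate is the 30-90 seconds per audio minute quoted above.
    """
    fast, slow = seconds_per_audio_minute
    return (audio_minutes * fast / 60.0, audio_minutes * slow / 60.0)


low, high = estimate_processing_minutes(10)
print(f"A 10-minute recording: roughly {low:.0f} to {high:.0f} minutes")
```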

Question 3

Can I transcribe long audio (like a 1-hour podcast)?

Accepted Answer

Yes, technically. Whisper processes audio in 30-second chunks with a 5-second overlap, so there is no hard limit on length. In practice, expect 30-60 minutes of processing time for an hour of audio with the Tiny model. Browser memory may become an issue past roughly 2 hours of audio.
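The chunking can be pictured as a sliding window. The sketch below assumes "5-second overlap" means each chunk starts 25 seconds after the previous one, so neighbouring chunks share 5 seconds of audio; the tool's actual windowing may differ, and `chunk_spans` is an illustrative name rather than anything in the tool.

```python
def chunk_spans(
    duration_s: float,
    chunk_s: float = 30.0,
    overlap_s: float = 5.0,
) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds for overlapping chunks."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap must be non-negative and shorter than the chunk")
    if duration_s <= 0:
        return []

    step = chunk_s - overlap_s  # how far the window advances each time
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)  # last chunk is cut at the end of the audio
        spans.append((start, end))
        if end >= duration_s:
            break
        start += step
    return spans


print(chunk_spans(70))
print(len(chunk_spans(3600)), "chunks for one hour of audio")
```

Under these assumptions an hour of audio becomes 144 chunks, each processed independently, which is why total length is not capped but processing time grows with it.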

Question 4

How do I get subtitles for a YouTube video?

Accepted Answer

Download the video first (with yt-dlp or a browser extension), drop it here, and pick SRT or WebVTT as the output format. You can then upload the resulting file directly in YouTube's captions panel.

Question 5

Why are there sometimes timing gaps in the SRT output?

Accepted Answer

Whisper's chunked timestamps only approximate when each phrase is spoken in the audio; they are not frame-perfect. For a 30-minute podcast, expect ±0.5–1 second of drift on each cue. That is acceptable for most subtitle use cases, but not for the tight sync a music video needs.
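If you want to check a transcript for gaps yourself, the sketch below shows how timed segments map to SRT cues and how to list pauses between consecutive cues. It assumes segments arrive as `(start_seconds, end_seconds, text)` tuples, which is an illustrative shape and not necessarily what the tool uses internally; the function names are likewise hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


def find_gaps(
    segments: list[tuple[float, float, str]],
    min_gap_s: float = 0.5,
) -> list[tuple[float, float]]:
    """Return (previous cue end, next cue start) wherever the pause is at least min_gap_s."""
    gaps = []
    for (_, prev_end, _), (next_start, _, _) in zip(segments, segments[1:]):
        if next_start - prev_end >= min_gap_s:
            gaps.append((prev_end, next_start))
    return gaps


segments = [(0.0, 2.5, "Hello"), (3.4, 5.0, "world")]
print(to_srt(segments))
print(find_gaps(segments))
```

A gap reported here may be real silence or simply timestamp drift, so small gaps are usually safe to leave alone.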

Question 6

Can it identify different speakers?

Accepted Answer

Not in v1. Whisper transcribes audio but does not do speaker diarization (separating "speaker 1" from "speaker 2"). For interviews or podcasts where you need that, you would need a follow-up pass with a diarization model, which the tool does not currently include.

Question 7

Is my audio uploaded?

Accepted Answer

No. Audio decoding (FFmpeg) and Whisper inference both run entirely in your browser. The model downloads from Hugging Face's CDN on first use; that's the only network request, and it doesn't carry your audio.

Audio Transcription
