Audio Transcription

Transcribe audio or video to text via Whisper — in your browser.

Transcribe podcasts, voice memos, meeting recordings, lecture videos, and interviews to text, all in your browser using OpenAI's Whisper model. Output plain text, or SRT / WebVTT subtitle files with timestamps. The audio never leaves your device, a privacy guarantee that upload-based cloud transcription services cannot offer.

How it works

4-step walkthrough

  1. Drop audio or video

    Most formats work: MP3, WAV, FLAC, AAC, OGG, M4A, Opus, AIFF, MP4, MOV, MKV, WebM. Dropvert extracts the audio with FFmpeg if you give it video, then resamples to the 16 kHz mono format Whisper expects.

  2. Pick the model

    Tiny (~75 MB): fast inference, decent accuracy on clean audio. Base (~150 MB): roughly 2× slower but noticeably better on noisy or accented speech. Both run in your browser via @huggingface/transformers.

  3. Pick the output format

    Plain text for one continuous transcript. SRT or WebVTT for video subtitles with chunked timestamps that line up with the source audio.

  4. Transcribe and download

    The first run downloads the chosen Whisper model once and caches it; subsequent runs skip the download. Expect roughly 30–60 seconds of processing per minute of audio with the Tiny model on average hardware.
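
To make the conversion in step 1 concrete, here is a minimal sketch of turning decoded audio into 16 kHz mono. Dropvert does this with FFmpeg; the helper names below are hypothetical, and the plain linear interpolation (with no anti-aliasing filter) is a simplification for illustration only, not what FFmpeg does internally.

```typescript
// Hypothetical helpers for illustration; the tool itself relies on FFmpeg.
const WHISPER_SAMPLE_RATE = 16000;

/** Average all channels into one mono channel. */
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

/** Resample by linear interpolation (no anti-aliasing filter). */
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number = WHISPER_SAMPLE_RATE
): Float32Array {
  if (fromRate === toRate || input.length === 0) return input.slice();
  const outLength = Math.round((input.length * toRate) / fromRate);
  const output = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Blend the two nearest source samples.
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}
```

For example, one second of 48 kHz stereo would go through `downmixToMono` and then `resampleLinear(mono, 48000)`, yielding 16,000 samples.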

Why use Dropvert

Local-first, free, no upload required

  • Browser-side: your audio never leaves your device. Services such as Otter, Rev, and Descript require you to upload the recording; this is the privacy-first alternative.
  • Built on OpenAI Whisper, an open-source speech recognition model that commercial transcription products also build on. Expect a word error rate of roughly 5–20% on clean English, depending on model size (Base toward the low end, Tiny toward the high end).
  • Output as text, SRT, or WebVTT — pair with the Subtitle Converter for further format conversion.
  • Free, no signup, no per-minute pricing.
  • Multilingual — Whisper auto-detects the language; both Tiny and Base models support 99 languages.
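
As a rough illustration of the text, SRT, and WebVTT outputs listed above, the sketch below formats timestamped chunks into the two subtitle formats. The `Chunk` shape and function names are assumptions made for this example, not Dropvert's actual internals. The only difference between the formats shown here is the millisecond separator (comma in SRT, period in WebVTT), the numbered cues in SRT, and the `WEBVTT` header.

```typescript
// Hypothetical chunk shape: text plus start/end times in seconds.
interface Chunk {
  text: string;
  start: number;
  end: number;
}

function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

/** Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT). */
function formatTimestamp(seconds: number, separator: "," | "."): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + separator + pad(ms, 3);
}

/** SRT: numbered cues, comma before milliseconds, blank line between cues. */
function toSrt(chunks: Chunk[]): string {
  return chunks
    .map((c, i) =>
      (i + 1) + "\n" +
      formatTimestamp(c.start, ",") + " --> " + formatTimestamp(c.end, ",") + "\n" +
      c.text.trim() + "\n")
    .join("\n");
}

/** WebVTT: "WEBVTT" header, then unnumbered cues with a period before milliseconds. */
function toWebVtt(chunks: Chunk[]): string {
  const cues = chunks.map((c) =>
    formatTimestamp(c.start, ".") + " --> " + formatTimestamp(c.end, ".") + "\n" +
    c.text.trim() + "\n");
  return ["WEBVTT\n"].concat(cues).join("\n");
}
```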

Frequently asked questions

7 answered

How accurate is the transcription?
Whisper Base on clean English speech: roughly 5-10% word error rate (about 5 to 10 wrong words per 100), which approaches the accuracy of careful human transcription. Tiny is closer to 10-20%. Accuracy drops with heavy accents, background noise, overlapping speakers, technical jargon, and music. For best results, transcribe clean recordings of one or two speakers.
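
For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into a reference text, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance; the function name and the simple lowercase/whitespace tokenization are choices made for this example, not part of the tool.

```typescript
/** Word error rate: word-level edit distance divided by reference length. */
function wordErrorRate(reference: string, hypothesis: string): number {
  const tokenize = (s: string): string[] =>
    s.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : Infinity;

  // prev[j] holds the distance between the ref words seen so far and the first j hyp words.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);

  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // match or substitution
      ));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

One wrong word in a ten-word sentence gives a word error rate of 0.1, that is, 10%.
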
Why is the first run so slow?
The Whisper model (about 75 MB for Tiny or 150 MB for Base) downloads from Hugging Face's CDN on first use. After that, it is cached in your browser and subsequent runs skip the download. The transcription itself takes roughly 30-60 seconds per minute of audio with the Tiny model on commodity hardware, and about twice that with Base.
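
To put rough numbers on this, the sketch below splits a first run into download time and transcription time. The function names and the example bandwidth are hypothetical; the per-minute rate is a parameter because it varies with model and hardware.

```typescript
/** Seconds to download a model of the given size at the given bandwidth. */
function downloadSeconds(modelMegabytes: number, bandwidthMbps: number): number {
  // 1 megabyte = 8 megabits.
  return (modelMegabytes * 8) / bandwidthMbps;
}

/** Seconds to transcribe, given a processing rate in seconds per minute of audio. */
function transcriptionSeconds(audioMinutes: number, secondsPerAudioMinute: number): number {
  return audioMinutes * secondsPerAudioMinute;
}

/** Total for a first run; later runs skip the download term. */
function firstRunSeconds(
  modelMegabytes: number,
  bandwidthMbps: number,
  audioMinutes: number,
  secondsPerAudioMinute: number
): number {
  return downloadSeconds(modelMegabytes, bandwidthMbps) +
    transcriptionSeconds(audioMinutes, secondsPerAudioMinute);
}
```

For instance, a 75 MB model on an assumed 50 Mbps connection takes 12 seconds to download, and 10 minutes of audio at 30 seconds per minute takes 300 seconds to transcribe.
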
Can I transcribe long audio (like a 1-hour podcast)?
Yes. Whisper processes audio in 30-second chunks with 5-second overlap, so there is no hard limit on length. In practice, expect 30-60 minutes of processing time for an hour of audio with the Tiny model. Browser memory may become a constraint beyond about 2 hours of audio.
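
The chunking described above can be sketched as follows. This assumes the simplest reading of "5-second overlap": each 30-second window starts 25 seconds after the previous one, so neighbours share 5 seconds. The function name is hypothetical, and the actual windowing inside the transcription library may differ.

```typescript
interface ChunkWindow {
  start: number; // seconds
  end: number;   // seconds
}

/** Split a recording into fixed-length windows that overlap their neighbours. */
function chunkWindows(durationS: number, chunkS: number = 30, overlapS: number = 5): ChunkWindow[] {
  if (overlapS >= chunkS) throw new Error("overlap must be shorter than the chunk");
  const windows: ChunkWindow[] = [];
  const step = chunkS - overlapS; // 25 s between window starts by default
  for (let start = 0; start < durationS; start += step) {
    const end = Math.min(start + chunkS, durationS);
    windows.push({ start: start, end: end });
    if (end >= durationS) break; // the last window reaches the end of the audio
  }
  return windows;
}
```

Under this assumption a one-minute clip is covered by three windows (0–30 s, 25–55 s, 50–60 s), and the number of windows grows linearly with the length of the recording.
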
How do I get subtitles for a YouTube video?
Download the video first (with yt-dlp or a browser extension), drop it here, and pick SRT or WebVTT as the output format. You can then upload the resulting file, without further conversion, in the subtitles section of YouTube Studio.
Why are there sometimes timing gaps in the SRT output?
Whisper's chunk-level timestamps only approximate when each phrase is spoken; they are not frame-accurate, so a cue can start slightly late or end slightly early and leave a small gap. In a 30-minute podcast, for example, expect each cue to be off by roughly 0.5–1 second. That is acceptable for most subtitle use cases, but not for tightly synced music videos.
Can it identify different speakers?
Not in v1. Whisper transcribes audio but does not do speaker diarization (separating "speaker 1" from "speaker 2"). For interviews or podcasts where you need that, you would need a follow-up pass with a diarization model, which the tool does not currently include.
Is my audio uploaded?
No. Audio decoding (FFmpeg) and Whisper inference both run entirely in your browser. The model downloads from Hugging Face's CDN on first use; that's the only network request, and it doesn't carry your audio.
