Stems Separation

Split a song into vocals, drums, bass, and instruments via Demucs AI — in your browser.

Drop files anywhere or click to browse

MP3, WAV, FLAC — works best on full mixes

WebGPU not detected.

Stems separation will fall back to WASM, which works but is roughly 5–10× slower. For best performance, use Chrome 113+, Edge 113+, or Safari 18+.
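The detect-and-fall-back decision can be sketched as a feature check. This is a minimal sketch, assuming the page probes `navigator.gpu` and hands the result to an inference runtime such as ONNX Runtime Web (whose execution providers are named `"webgpu"` and `"wasm"`); `pickBackend` is a hypothetical helper, not Dropvert's actual code:

```typescript
type Backend = "webgpu" | "wasm";

// Choose the inference backend. navigator.gpu only exists in
// WebGPU-capable browsers (Chrome/Edge 113+, Safari 18+), and even
// there requestAdapter() can resolve to null, e.g. on blocklisted GPUs.
async function pickBackend(): Promise<Backend> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return "wasm"; // API missing entirely: use the slow path
  const adapter = await gpu.requestAdapter();
  return adapter ? "webgpu" : "wasm";
}
```

Checking the adapter, not just the `navigator.gpu` property, matters: some browsers expose the API but refuse to hand out an adapter on unsupported hardware.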

Split a song into four separate stems — vocals, drums, bass, and other instruments — using Demucs, the open-source AI model that powers most professional stem-separation services. The model runs entirely in your browser via WebGPU. Your audio never leaves your device, which is a privacy story most stem-separation services can't match.

How it works

4-step walkthrough

  1. Drop a song

     MP3, WAV, FLAC, AAC, OGG, M4A. Works best on full mixes (vocals + instruments together).

  2. Wait for the model

     On first run, Demucs (~250 MB) downloads from Hugging Face's CDN. It's cached afterward, so subsequent runs skip the download.

  3. Watch the chunked inference

     Demucs processes audio in 7-second chunks with 0.5-second overlap. Each chunk produces 4 stems × 2 channels of output. The progress indicator shows the current chunk.

  4. Preview, download, repeat

     Each stem appears with its own audio player and a download button. Download all four as a zip, or pick the ones you want.
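The chunk schedule in step 3 comes down to simple arithmetic over sample indices. The 7-second window and 0.5-second overlap are the numbers from the walkthrough; the scheduling code itself is an illustrative sketch, not Dropvert's implementation:

```typescript
const SAMPLE_RATE = 44100;
const CHUNK = 7 * SAMPLE_RATE;      // 7-second window, in samples
const OVERLAP = 0.5 * SAMPLE_RATE;  // 0.5-second overlap between windows
const HOP = CHUNK - OVERLAP;        // stride between window starts

// Compute the [start, end) sample range of every chunk for a track
// of the given length. The last chunk is clamped to the track end.
function chunkRanges(totalSamples: number): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < totalSamples; start += HOP) {
    ranges.push([start, Math.min(start + CHUNK, totalSamples)]);
    if (start + CHUNK >= totalSamples) break; // this chunk reached the end
  }
  return ranges;
}
```

The overlapping half-second gives the separator context at chunk edges; the overlapped regions are typically crossfaded when the per-chunk outputs are stitched back together.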

Why use Dropvert

Local-first, free, no upload required

  • Browser-side AI — your music never gets uploaded to a third-party server. The standard for stem separation has always been "upload, wait, download"; this is the privacy-first alternative.
  • The same model (Hybrid Transformer Demucs) used by LALAL.AI, Moises, and others. Quality is comparable to the paid services.
  • Four-stem output: vocals, drums, bass, and "other" (typically guitars, keys, leads).
  • Free, no signup, no watermark on the output.
  • Output is full-quality WAV — lossless and uncompressed, ready for further editing in any DAW.

Frequently asked questions

6 answered

Why does the first run take 10+ minutes?
Two compounding reasons: (1) the Demucs model is ~250 MB and downloads on first use; (2) the inference itself is heavy — even with WebGPU, expect 2–4× the song duration. A 3-minute song typically takes 6–12 minutes total on first run, 4–8 minutes after the model is cached.
How clean are the stems?
For most modern pop, rock, and electronic music: very clean. Vocals are typically 90%+ isolated; drums and bass similarly. Older recordings, very dense mixes, classical music, and bootleg-quality audio produce more bleed between stems. The quality is comparable to LALAL.AI and Moises, which use the same Demucs model under the hood.
Can I get just vocals (or just instrumental)?
You always get four stems. To get an instrumental, mix the drums + bass + other tracks together in any DAW. To get just vocals, use the vocals stem directly. We could add a "vocals only / instrumental only" preset that does this combination automatically — let us know if there's demand.
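Mixing stems back into an instrumental is just sample-wise addition, which is what a DAW does when you play the three tracks together. A hypothetical helper illustrating that combination; it assumes all stems share the same length and sample rate, which holds for stems separated from one song:

```typescript
// Sum any number of stems (e.g. drums + bass + other) into one buffer.
// Separation is additive, so summing the non-vocal stems reconstructs
// the instrumental.
function mixStems(stems: Float32Array[]): Float32Array {
  const out = new Float32Array(stems[0].length);
  for (const stem of stems) {
    for (let i = 0; i < stem.length; i++) out[i] += stem[i];
  }
  return out; // clip or normalize only on export, if the sum exceeds ±1.0
}
```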
What's the maximum song length?
Practically, 5–6 minutes is comfortable. Beyond that, browser memory becomes the bottleneck — the model needs to hold both the input and an output roughly four times its size (four stems) in memory. Splitting longer songs in half and processing each half separately is a workaround.
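That workaround can be as simple as slicing the decoded sample buffer before inference. `splitInHalf` is a hypothetical helper; `subarray` returns views into the same buffer, so the split itself allocates no extra memory:

```typescript
// Split a decoded sample buffer at its midpoint so each half can be
// run through the separator independently.
function splitInHalf(samples: Float32Array): [Float32Array, Float32Array] {
  const mid = Math.floor(samples.length / 2);
  return [samples.subarray(0, mid), samples.subarray(mid)];
}
```

In practice you would want to cut at a quiet moment (or overlap the halves slightly and crossfade) so the seam isn't audible in the recombined stems.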
Will WebGPU work on my browser?
Chrome 113+, Edge 113+, Safari 18+, and Firefox Nightly with the flag enabled. Without WebGPU, the tool falls back to WASM execution — works but 5–10× slower.
Is my song uploaded?
No. The audio decoding (FFmpeg) and model inference (ONNX Runtime) both run entirely in your browser. The Demucs model file is fetched from Hugging Face's public CDN — that's the only network request, and it doesn't carry your audio.

Related tools

4 suggestions