Audio · 5 min read

How to Remove Vocals from a Song (For Karaoke or Remixes)

AI-powered stem separation can pull the vocals out of a finished song, leaving an instrumental backing track. Browser-based, no upload, no signup. Here's how it works and what to expect.

Removing vocals from a song without access to the original studio multitrack used to be close to impossible. The vocals and instruments were already mixed together in the final stereo file, and there was no reliable way to pull them apart. Older "vocal removal" tools (the kind that subtract one stereo channel from the other to cancel whatever sits in the center) gave terrible results: muffled audio, missing low end, weird artifacts.
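To see why the old center-channel trick loses the low end, here is a minimal sketch of it in Python. It assumes each channel is a plain list of float samples; the `center_cancel` helper and the toy signals are made up for illustration and are not part of any tool mentioned in this guide.

```python
def center_cancel(left, right):
    """Classic 'vocal remover': subtract the right channel from the left.

    Anything mixed identically into both channels cancels to zero.
    That is usually the lead vocal, but also the bass and kick drum.
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    return [(l - r) / 2 for l, r in zip(left, right)]


# Toy stereo mix: vocal and bass sit in the center (equal in both channels),
# while the guitar is panned hard left (left channel only).
vocal = [0.5, -0.5, 0.25, -0.25]
bass = [0.2, 0.2, -0.2, -0.2]
guitar = [0.1, 0.3, -0.1, -0.3]

left = [v + b + g for v, b, g in zip(vocal, bass, guitar)]
right = [v + b for v, b in zip(vocal, bass)]

# Only the guitar survives, at half its level. The bass is gone with the vocal.
mono = center_cancel(left, right)
```

The subtraction cannot tell a vocal from a bass line. It only knows "same in both channels", which is why this approach removes far more than the singer and returns a mono result.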

AI stem separation changed this. Modern models like Demucs and Spleeter are trained on large sets of songs for which both the final mix and the separate multitrack stems are available, so they learn to take a finished stereo song and separate it back into vocals / drums / bass / other instruments. For most modern recordings the result gets close to the studio multitracks, though some bleed between stems remains.

How to remove vocals

Stems Separation on Dropvert runs Demucs entirely in your browser. Drop a song, get back four separate audio tracks: vocals, drums, bass, and "other" (typically guitars, keys, melodic instruments). To get an instrumental, mix the drums + bass + other tracks together; to get just vocals, use the vocals stem directly.
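Mixing the drums, bass, and "other" stems into an instrumental is just a sample-by-sample sum. A minimal sketch, assuming the stems have already been decoded into equal-length lists of float samples in the range -1.0 to 1.0 (the `mix_stems` helper is hypothetical, and a DAW or audio editor does the same job):

```python
def mix_stems(*stems):
    """Sum equal-length stems sample by sample, clamping to [-1.0, 1.0]."""
    if not stems:
        raise ValueError("need at least one stem")
    length = len(stems[0])
    if any(len(s) != length for s in stems):
        raise ValueError("all stems must have the same number of samples")
    # zip(*stems) yields one tuple per sample position, one value per stem.
    return [max(-1.0, min(1.0, sum(samples))) for samples in zip(*stems)]


# Toy stems, four samples each.
drums = [0.5, -0.25, 0.0, 0.75]
bass = [0.25, 0.25, -0.5, 0.5]
other = [0.0, -0.25, 0.25, 0.25]

# Instrumental = everything except the vocals stem.
instrumental = mix_stems(drums, bass, other)
```

The clamp is a crude guard against clipping. If the summed stems regularly exceed full scale, lowering the overall gain before summing will sound better than hard clamping.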

  1. Drop the song. MP3, WAV, FLAC, AAC, OGG, M4A all work. Best results on full mixes (vocals + instruments together).
  2. The first run downloads the Demucs model (~250 MB) from Hugging Face's CDN. It is cached afterwards, so subsequent runs skip the download.
  3. Click "Separate stems." On WebGPU-capable browsers (Chrome, Edge, Safari 18+), inference takes 2-4× the song duration. On WASM-only browsers, expect 5-10× the song duration.
  4. Each stem appears with its own audio player and download button.
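The timing figures in this guide translate into a quick estimate. The sketch below uses only the multipliers quoted here (2-4× the song length on WebGPU, 5-10× on WASM). The `estimate_minutes` function and the backend labels are illustrative names, and real timings depend on your hardware.

```python
# Processing time as a multiple of song length, as quoted in this guide.
SPEED_FACTORS = {
    "webgpu": (2, 4),
    "wasm": (5, 10),
}


def estimate_minutes(song_seconds, backend):
    """Return a (low, high) estimate of processing time in minutes."""
    low, high = SPEED_FACTORS[backend]
    return (song_seconds * low / 60, song_seconds * high / 60)


# A 3-minute song:
webgpu_range = estimate_minutes(180, "webgpu")  # (6.0, 12.0)
wasm_range = estimate_minutes(180, "wasm")      # (15.0, 30.0)
```

The one-time model download on the first run is not included in these numbers.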

Privacy note: the song itself never gets uploaded. Demucs runs in your browser; your audio stays on your device. This matters for the obvious reasons (copyright exposure, keeping control of your own work) and the less obvious ones (server-side stem separation services have to upload your audio to their servers, and some may reuse it for training).

How clean are the separated vocals/instrumental?

For most modern pop, rock, and electronic music: very clean. Vocals are typically 90%+ isolated; the instrumental is similarly clean.

Quality drops with:

  • Older recordings (pre-1980s) — recording techniques were different, separation quality is reduced.
  • Very dense mixes — heavily-layered productions where everything is bleeding into everything.
  • Classical music — Demucs is trained mostly on pop/rock; performance on orchestral works is uneven.
  • Bootleg-quality audio — low-bitrate inputs lose information that the model needs.
  • Backing vocals or vocal effects — the model targets the lead vocal; heavily-effected backing vocals or vocoded textures may end up split between stems.

For typical commercial pop / rock songs, expect results in the same range as server-side services such as LALAL.AI and Moises, which process your audio on their own servers rather than in your browser.
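If you want a number for "how clean" a stem is, one common measure in source-separation research is the signal-to-distortion ratio (SDR), in decibels. The sketch below is a simplified version that assumes you have the true isolated stem to compare against, which is only the case for songs with published multitracks. The percentages quoted in this section are not defined in terms of SDR, so treat this as a separate yardstick.

```python
import math


def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio in dB (higher is cleaner).

    reference: the true isolated stem, as float samples.
    estimate:  the separated stem produced by the model.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error == 0:
        return math.inf   # perfect reconstruction
    if signal == 0:
        return -math.inf  # silent reference, but the estimate is not silent
    return 10 * math.log10(signal / error)


# Toy example: the estimate is the reference scaled down by 10%.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
score = sdr_db(reference, estimate)  # about 20 dB
```

Published benchmarks use more elaborate variants of this calculation, but the idea is the same: compare the energy of the true stem with the energy of whatever the model got wrong.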

What you can do with the stems

Karaoke

Remove the vocals, keep the instrumental, sing over it. The vocals stem also tells you the original singer's pitch and timing for reference.

Remix / mashup

Pull just the drums from one song and the vocals from another, then mix them together in a DAW. This is how a lot of bedroom-producer mashups get made.

Cover versions

Say you want to cover a song, solo or with a band. Pull the original vocals as a reference, then practice your cover against the instrumental backing track.

Studio analysis

Producers can isolate the drums or the bass line, or listen to the melodic parts without the vocals on top. This is useful for learning how a particular track was arranged or for sampling.

Cleaning up live recordings

Sometimes a live recording has talking or audience noise mixed in with the music. Stem separation can help pull the music away from the chatter, though results vary more than on studio recordings.

Things you can't do (yet)

  • Pull out specific instruments from "other" — Demucs gives you 4 stems. The "other" stem is typically a mix of guitars, keys, and other melodic content that the model couldn't categorize as drums, bass, or vocals.
  • Get the song back to studio-quality multitrack — even 90%+ separation isn't perfect. Listen carefully and you'll hear small bleed between stems, especially on transients.
  • Process audio in real time — the model takes minutes for a 3-minute song. Live performance use cases need a different (specialized real-time) model.

Combine with other tools

Stems Separation outputs WAV files at full quality.
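Uncompressed WAV stems take up real disk space, so it helps to estimate the size before saving four of them. The sketch below assumes CD-style audio (44.1 kHz, 16-bit, stereo). This guide does not state the exact output format, so those defaults are assumptions, and the small WAV header is ignored.

```python
def wav_bytes(seconds, sample_rate=44_100, channels=2, bytes_per_sample=2):
    """Approximate size of uncompressed PCM audio, ignoring the WAV header."""
    return int(seconds * sample_rate * channels * bytes_per_sample)


# A 3-minute song, under the assumed format:
per_stem = wav_bytes(180)   # 31,752,000 bytes, roughly 32 MB
all_stems = 4 * per_stem    # 127,008,000 bytes, roughly 127 MB
```

If space matters, converting the stems you keep to FLAC preserves the audio exactly while shrinking the files.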

Common questions

How long does it take? On WebGPU-capable browsers: 2-4× the song duration (a 3-minute song = 6-12 minutes). On WASM-only: 5-10× (15-30 minutes). First run adds the model download (~250 MB).

Is this legal? Owning a copy of a song doesn't grant you the right to redistribute derivative works without permission from the rights holders. For personal use (karaoke at home, learning to play along, studying a mix), stem separation is generally fine. For commercial use (selling instrumentals, releasing remixes), you typically need a license.

Can I separate stems on mobile? Technically yes, but slowly: phones have less RAM and slower GPUs than laptops. A 3-minute song on a 2024 phone might take 20-30 minutes. We recommend a desktop or laptop for the best experience.

Are my songs uploaded? No. The Demucs model runs entirely in your browser. Your songs stay on your device.

What model does this use? HT Demucs (Hybrid Transformer Demucs) by Meta AI. It is open-source and among the strongest models for stem separation as of 2024.
