How to Make a Scanned PDF Searchable (OCR)

A scanned PDF is just a series of page images, so Find (Ctrl+F / Cmd+F) doesn't work and you can't copy text out of it. OCR adds an invisible text layer that makes the document searchable. Here's how.

Scanned PDFs look like normal PDFs but they're really just sequences of images of pages. The text in them is visible to your eyes but invisible to your computer — Find / Cmd+F returns nothing, copy-paste captures empty space, and search engines can't index the content.

The fix is OCR (Optical Character Recognition): a software pass that reads the images, identifies the characters, and adds an invisible text layer to the PDF. After OCR, the document looks identical but is fully searchable, copyable, and indexable.

How to OCR a PDF in your browser

OCR PDF on Dropvert does this entirely in your browser using Tesseract (the standard open-source OCR engine).

  1. Drop the scanned PDF.
  2. Pick the language. English is the default; Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese (Simplified and Traditional), Japanese, Korean, and Arabic are also available.
  3. Pick output mode:
    • Searchable PDF: same-looking PDF with an invisible text layer over each page. Find / copy-paste / indexing all work.
    • Plain text: a .txt file with page-marker headers. Useful when you don't need the PDF, just the words.
  4. Click "Run OCR." The tool renders each page to a canvas, runs OCR on it, and overlays the recognized text back onto the page.
  5. Download.

A 10-page scan typically takes 60-90 seconds on average hardware. The OCR engine downloads on first use (~5-10 MB per language) and caches in your browser for subsequent runs.
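To get a rough feel for longer documents, the 10-page figure above works out to about 6-9 seconds per page. The sketch below simply scales that rate linearly. The linear scaling and the function name are assumptions for illustration; real timings depend on your hardware, the page resolution, and how much text is on each page.

```python
# Per-page range derived from the guide's "60-90 seconds for 10 pages".
SECONDS_PER_PAGE = (6.0, 9.0)


def estimate_ocr_seconds(pages: int) -> tuple:
    """Return a (low, high) estimate in seconds, assuming time grows linearly with page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    low, high = SECONDS_PER_PAGE
    return pages * low, pages * high


low, high = estimate_ocr_seconds(50)
print(f"50 pages: roughly {low / 60:.0f}-{high / 60:.1f} minutes")
```

The first run also has to fetch the language data, so expect it to take a little longer than the estimate.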

What "searchable PDF" actually means

When you OCR a scanned PDF, the output has two layers per page:

  1. The visible image — the original scan, untouched. This is what you see.
  2. The invisible text layer — the OCR'd words positioned at their detected coordinates on the page, with opacity 0 so they don't render visually.

When you Find or copy-paste, the PDF reader works against the invisible text layer. When you read or print, you see the original image. Both layers stay in the same document, so the searchable PDF "behaves like" a regular born-digital PDF.
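To make the text layer concrete, here is a minimal sketch of how one recognized word could be placed invisibly on a PDF page. It is not Dropvert's code. It uses PDF text rendering mode 3 ("3 Tr", neither fill nor stroke), which is another common way to hide OCR text and differs from the opacity-0 approach described above. The function names, the font resource name F1, and the pixel-based bounding box inputs are all assumptions for illustration, and the string escaping only covers plain ASCII words.

```python
def px_to_pt(px: float, dpi: float) -> float:
    """Convert scan pixels to PDF points (1 point = 1/72 inch)."""
    return px * 72.0 / dpi


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_word_ops(word: str, left_px: float, bottom_px: float,
                       height_px: float, page_height_px: float,
                       dpi: float, font: str = "F1") -> str:
    """Build content-stream operators that place `word` invisibly at its detected position.

    left_px / bottom_px are measured from the top-left corner of the page image,
    the way OCR engines usually report bounding boxes (hypothetical inputs here).
    """
    x = px_to_pt(left_px, dpi)
    # Image rows count down from the top; the PDF y axis counts up from the bottom.
    y = px_to_pt(page_height_px - bottom_px, dpi)
    size = px_to_pt(height_px, dpi)
    return (f"BT 3 Tr /{font} {size:.2f} Tf "
            f"1 0 0 1 {x:.2f} {y:.2f} Tm "
            f"({escape_pdf_string(word)}) Tj ET")


# A word found 300 px from the left on a 300 DPI letter-size scan (3300 px tall).
print(invisible_word_ops("Invoice", 300, 600, 50, 3300, 300))
```

The page image is drawn as usual, and operators like these sit on top of it. A reader's Find and copy functions see the word, while nothing extra is painted.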

This is the same technique that Adobe Acrobat, ABBYY FineReader, and most other OCR tools use. The output works in any PDF reader.

Accuracy depends on the source

OCR accuracy is bounded by the quality of the scan. Roughly:

Source quality                                  Expected accuracy
300 DPI scan, clean paper, common font          99%+
200 DPI scan, slight skew                       95-98%
100 DPI scan, off-center, faded paper           85-95%
Phone photo of a document, varied lighting      70-90%
Heavy handwriting, decorative fonts             50-80%

For best results: scan at 300 DPI in black-and-white or grayscale, keep the page flat, avoid skew. Phone photos with good lighting usually work but produce more errors than a flatbed scan.
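If you want to measure accuracy on your own scans, one common approach is to compare the OCR output against a hand-typed reference and count character-level edits. The sketch below assumes the percentages above are per-character figures, which the guide does not state explicitly; the function names are illustrative. Read that way, 99% on a 2,000-character page still means about 20 wrong characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference characters recovered correctly, clamped to 0..1."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# A typical OCR slip: the digit 1 read as a lowercase L.
print(f"{char_accuracy('Total due: 100', 'Total due: l00'):.1%}")
```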

What if my PDF is already searchable?

Born-digital PDFs (exported from Word, Google Docs, web pages, etc.) already have a proper text layer and don't need OCR. Running OCR on a born-digital PDF would duplicate the text imperfectly — the OCR engine would re-recognize the rendered glyphs, often introducing errors that aren't in the source.

To check whether a PDF is already searchable, open it and use Find to search for a word you know is on the page. If Find locates it, the PDF doesn't need OCR.
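For a batch of files, the same check can be automated. The sketch below assumes you have already pulled the text of each page out with whatever PDF library you use (that extraction step is not shown) and flags pages that contain almost no characters. The function name and the 20-character threshold are arbitrary assumptions, not part of OCR PDF.

```python
from typing import List, Sequence


def pages_needing_ocr(page_texts: Sequence[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is (nearly) empty.

    page_texts holds one string per page, as returned by a PDF text extractor.
    Whitespace is ignored so that blank lines don't count as content.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        visible_chars = len("".join(text.split()))
        if visible_chars < min_chars:
            flagged.append(number)
    return flagged


pages = ["Quarterly report for the board of directors", "", "  \n "]
print(pages_needing_ocr(pages))
```

A document where only some pages are flagged is often a born-digital file with a few scanned inserts.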

Multilingual documents

For PDFs in multiple languages, you'll need to pick the dominant language. Tesseract handles single-language documents best; running an English model on a French document drops accuracy substantially.

For mixed-language documents, either run OCR once per language and combine the results, or accept lower accuracy on the languages you didn't select. Future versions of OCR PDF may add multi-language selection; for now, single-language gives the best output.
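One way to combine separate runs is to keep, for each page, the result from whichever language scored the highest confidence. Tesseract reports a confidence value for recognized words; the sketch below assumes you have averaged it per page yourself. The PageResult type and combine_runs function are hypothetical names for illustration, and OCR PDF does not do this for you.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PageResult:
    lang: str          # language code used for this run, e.g. "eng"
    text: str          # recognized text for the page
    confidence: float  # mean word confidence for the page, 0-100 (assumed precomputed)


def combine_runs(runs: Sequence[Sequence[PageResult]]) -> List[PageResult]:
    """Given one result list per language run, keep the most confident result for each page."""
    if not runs:
        return []
    if len({len(run) for run in runs}) != 1:
        raise ValueError("every run must cover the same number of pages")
    # zip(*runs) groups the candidates for page 1, then page 2, and so on.
    return [max(candidates, key=lambda r: r.confidence) for candidates in zip(*runs)]


eng = [PageResult("eng", "Hello world", 91.0), PageResult("eng", "Bonjour le rnonde", 58.0)]
fra = [PageResult("fra", "Hel1o wor1d", 62.0), PageResult("fra", "Bonjour le monde", 93.0)]
print([page.lang for page in combine_runs([eng, fra])])
```

This works at page granularity, so it helps when languages alternate between pages but not when a single page mixes them.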

Common questions

Will the OCR'd text replace the original images? No. The original page images stay exactly as they were; the OCR'd text is added as an invisible overlay. The PDF still looks like a scan.

Why is my OCR'd PDF a bit larger than the original? The text layer plus an embedded font for the OCR'd text adds a small overhead. Usually 5-10% file-size increase.

Can I OCR handwritten text? Tesseract is designed for printed text. Handwriting recognition is a different problem, requires different models, and isn't supported by OCR PDF v1. For handwritten content, Image to Text sometimes works but accuracy is much lower.

Will pages with mixed text and images still work? Yes. The OCR engine recognizes text regions and ignores image-only regions. The invisible text layer is added only where text was detected.

Are my files uploaded? No. The PDF, the OCR engine, and the output all stay in your browser. The Tesseract language model downloads from a CDN on first use; that's the only network request.
