This Free Tool Makes (Almost) Any PDF Searchable and Selectable
There’s nothing worse than opening a PDF and realizing you can’t use the search function or even highlight text. This usually happens when the PDF was created by scanning a paper document – it’s just a series of images. Most modern scanning software uses optical character recognition (OCR) so words are both searchable and selectable, but sometimes you come across documents where this is not the case.
In such cases, the free and open source OCRmyPDF is ideal. It’s a command line application that converts a scanned PDF into a PDF/A file with an OCR text layer, which means you can search and select the text.
The best way to install the app is with your package manager on Linux and with Homebrew on Mac. Windows users can technically install the app by installing Python and a few other dependencies; check the OCRmyPDF documentation if you want to do a little digging.
Once you’ve set up the app, you can use it by typing ocrmypdf followed by the name of the document you want to add OCR to, then the name of the document you want to create. So for example, ocrmypdf before.pdf after.pdf would take “before.pdf”, add OCR, and then create a new document called “after.pdf”.
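If you would rather trigger that command from a script than type it by hand, here is a minimal sketch in Python. It assumes ocrmypdf is already installed and on your PATH. The helper names build_basic_command and add_ocr are made up for this illustration and are not part of OCRmyPDF.

```python
import shutil
import subprocess
from pathlib import Path


def build_basic_command(source: str, destination: str) -> list[str]:
    """Return the argument list for: ocrmypdf <source> <destination>."""
    return ["ocrmypdf", source, destination]


def add_ocr(source: str, destination: str) -> bool:
    """Run OCRmyPDF on one file and return True if it exited successfully."""
    # Assumption: the ocrmypdf executable is installed and reachable on PATH.
    if shutil.which("ocrmypdf") is None:
        print("ocrmypdf was not found on PATH; install it first.")
        return False
    if not Path(source).is_file():
        print(f"Input file not found: {source}")
        return False
    # Passing a list (not a single string) avoids shell quoting problems
    # with file names that contain spaces.
    result = subprocess.run(build_basic_command(source, destination))
    return result.returncode == 0


if __name__ == "__main__":
    add_ocr("before.pdf", "after.pdf")
```

The script does the same thing as the typed command: it reads “before.pdf” and writes “after.pdf”, leaving the original untouched.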
This process will take some time, depending on the size of the document, and may not be entirely accurate if the image quality is poor. Even so, I found that it worked quite well on the oldest and most poorly compressed PDFs I could find.
And there’s more you can do here: the cookbook in the OCRmyPDF documentation describes a bunch of options. You can compress the images in the PDF, for example, by adding --pdfa-image-compression jpeg to your command. You can automatically rotate any pages with sideways text by adding --rotate-pages to the command. Or perhaps the PDF you’re processing already contains OCR that you think is low quality – you can add --redo-ocr to the command; this will discard the existing OCR information and start over.
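To see how these options slot into the command, here is a small sketch in Python that assembles the command line from a few switches and prints it. Only the three flags mentioned above are used. The function name build_command and its keyword arguments are hypothetical names chosen for this example. The sketch does not check whether a given combination of flags is accepted by OCRmyPDF, so consult the documentation before combining them.

```python
import shlex


def build_command(source, destination, *, jpeg_images=False,
                  rotate_pages=False, redo_ocr=False):
    """Assemble an ocrmypdf command line as a list of arguments.

    The keyword names are illustrative; only the flag strings themselves
    come from the article.
    """
    args = ["ocrmypdf"]
    if jpeg_images:
        # Compress images in the PDF/A output as JPEG.
        args += ["--pdfa-image-compression", "jpeg"]
    if rotate_pages:
        # Rotate pages whose text is sideways.
        args.append("--rotate-pages")
    if redo_ocr:
        # Discard the existing OCR information and start over.
        args.append("--redo-ocr")
    args += [source, destination]
    return args


if __name__ == "__main__":
    command = build_command("before.pdf", "after.pdf",
                            jpeg_images=True, rotate_pages=True)
    # Prints: ocrmypdf --pdfa-image-compression jpeg --rotate-pages before.pdf after.pdf
    print(shlex.join(command))
```

You can paste the printed line straight into a terminal, or hand the list to subprocess.run if you want the script to execute it.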
You get the idea: there’s a lot here. Check out the documentation for the full list of options, because this thing can do a lot more.