How Do I Create a Searchable PDF Archive?
In this week’s Lifehacker tech-advice column (keep the questions coming, everyone!), we’re helping a reader who has too many important papers that need to make a magical digital transition. At least, that sounds a lot more fun than “optical character recognition,” which doesn’t exactly roll off the tongue.
Lifehacker reader Phil writes:
“David, your columns are informative, helpful and well written. Thank you.
I am copying (scanning) the minutes of a nonprofit’s meetings. The organization meets twice a month. Each set of minutes, representing one meeting and usually consisting of one or two typewritten pages, is a file. I have about 20 minutes to do. I will merge all of the files for one year into one “year” file, and then merge 10 of those files into another, separate “decade” file. I know that’s too much information.
Anyway, I would like these files to be searchable.
Could you recommend a decent free or inexpensive desktop OCR program, or any other method that would make the minutes searchable? I’d really like to keep the formatting of the original minutes. There are never photographs in the minutes. I am using Windows 10.”
Thanks for the kind words, Phil! I’m happy to help, not because of the flattery, but because your question is one that many readers (myself included) have probably wondered about. I have a whole pile of things that I would like to move from the physical world to the digital one, so that I can Marie Kondo the original documents and photographs into oblivion. Sheaves of paper do not bring me joy.
You have several options to try. I’d start with the obvious one: Google. Assuming you’re creating PDFs, upload the files to Google Drive. Right-click any individual PDF, hover over Open With, and select Google Docs. Google will then try to run OCR on your PDF, and you can save the resulting file as a document. You can then search that document (and any others you convert) right in Drive.
However, the more I think about it, the less elegant this solution seems, considering how many files you have to work with. Instead, I’d try a program like TesseractStudio.Net, or just Tesseract OCR if you’re not afraid of the command line. You should be able to use either one to create OCR data from your files, which you can then search directly through Windows or macOS. OCRmyPDF is another option, similar to Tesseract OCR, but again, you’ll be typing commands to apply OCR to your files. There’s no GUI and no (direct) Windows support.
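If you do go the command-line route, a short script can save a lot of typing. Here is a minimal sketch in Python that runs OCRmyPDF over every PDF in a folder. It assumes the `ocrmypdf` command is installed and on your PATH and that your minutes are in English; the function names and folder names are made up for illustration.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_ocr_command(src: Path, dst: Path, language: str = "eng") -> list[str]:
    """Build the argument list for one OCRmyPDF run (nothing is executed here)."""
    # --skip-text leaves pages that already have a text layer untouched.
    # -l selects the OCR language; "eng" is an assumption about the minutes.
    return ["ocrmypdf", "--skip-text", "-l", language, str(src), str(dst)]


def ocr_folder(in_dir: Path, out_dir: Path) -> list[Path]:
    """Run OCRmyPDF on every PDF in in_dir and write searchable copies to out_dir."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        # check=True stops the batch if OCRmyPDF reports an error for a file.
        subprocess.run(build_ocr_command(src, dst), check=True)
        written.append(dst)
    return written
```

You would call it with something like `ocr_folder(Path("scans"), Path("searchable"))`. Writing the results to a separate folder keeps your original scans untouched in case the OCR output isn’t what you hoped for.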
There’s also Paperwork, an open-source document cataloging tool that comes with built-in OCR. I would definitely give it a look, given that it’s designed as a single piece of software for archiving, sorting, and searching documents. It sounds like exactly what you’re looking for.
I have not used PDF-XChange Viewer, but others have recommended it as an option. The free version will add watermarks to your PDFs, but it can create PDFs from images and, if I’m right, add OCR to these and to any existing PDFs you have. It’s worth exploring, even if it’s not a perfect (free) solution. Likewise, FreeOCR can take your images or PDFs, apply OCR, and export the results as plain text files or Word documents. If you don’t mind searching your archives that way, it’s an option.
As for paid solutions, there is always Adobe Acrobat Pro or Foxit PhantomPDF. Both will let you add OCR to PDFs, and you can process all of your documents as one large batch (or create a script that does this with the contents of folders). You might even be able to do all of this during the apps’ free trials, as long as the trials don’t limit the OCR features. I’ve also seen other people with your specific problem successfully use an application like PDF OCR, which could be a cheaper alternative.
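Whichever tool does the OCR, a small script can also work out which files belong in each “year” and “decade” file before you merge anything. The sketch below is only an illustration: it assumes a hypothetical naming scheme in which every file name starts with a four-digit year (for example `2001-01-10.pdf`), and it only plans the groups. The merging itself would still happen in whatever PDF tool you settle on.

```python
import re
from collections import defaultdict

# Hypothetical naming scheme: each file name starts with a four-digit year.
YEAR_PREFIX = re.compile(r"^(\d{4})")


def group_by_year(filenames):
    """Map each year to the sorted list of file names that start with it."""
    groups = defaultdict(list)
    for name in filenames:
        match = YEAR_PREFIX.match(name)
        if match:  # names without a leading year are skipped
            groups[int(match.group(1))].append(name)
    return {year: sorted(names) for year, names in sorted(groups.items())}


def group_by_decade(years):
    """Map each decade (for example 2000) to the sorted years that fall in it."""
    decades = defaultdict(list)
    for year in sorted(years):
        decades[year - year % 10].append(year)
    return dict(decades)
```

Because the names sort by date under this scheme, each “year” list comes out in meeting order, which is the order you would want the pages to appear in the merged file.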
That’s all I can think of off the top of my head (and after a little research). Hopefully one of these solutions will work for you without costing you a fortune. Post a reply and let me know which app works best for you!