How to Extract Text from Image-only PDFs with Zotero

If you have a PDF of a book chapter or journal article, it’ll be one of two basic types.¹

On the one hand, it might have real text inside it. If so, you’ll be able to select specific letters or words inside the PDF.

On the other hand, it might just be a series of page images. If this is what you have, you can click on it all you want, but all you’ll select is the whole page image.
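If you’d rather not click around in every file, you can also check for real text from a script. What follows is only a minimal sketch in Python (my choice for illustration). It assumes that Poppler’s pdftotext tool is on your PATH, and the function names and the 20-character threshold are hypothetical values I picked for the example.

```python
import subprocess


def looks_like_real_text(extracted, min_chars=20):
    """Guess whether extracted text is substantial enough to be a text layer.

    min_chars is an arbitrary, hypothetical threshold chosen for this sketch.
    """
    # str.split() with no arguments drops spaces, newlines, and form feeds.
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def has_text_layer(pdf_path, max_pages=3):
    """Run Poppler's pdftotext on the first few pages and inspect the result.

    Assumes the `pdftotext` executable can be found on PATH.
    """
    result = subprocess.run(
        # -l sets the last page to convert; "-" sends the text to stdout.
        ["pdftotext", "-l", str(max_pages), str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return looks_like_real_text(result.stdout)
```

An image-only PDF should give back little more than page breaks, so has_text_layer would return False for it. A scan with a stray line of embedded text could still fool a check this simple.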

Even when you have real text in a PDF, you’ll have various issues if you try to copy and paste from it. And you probably shouldn’t be doing a lot of that anyhow. Strings of quotations generally aren’t the most effective way to make an argument.

But having real text inside your PDF chapter or article will make that PDF searchable and easier to annotate if you intend to read it electronically, underline or highlight text, or otherwise use your PDF like electronic paper.

If your PDF doesn’t have real text inside it, however, you can use Zotero to add it through “optical character recognition” (OCR). That is, you can have Zotero

“look” at an image-only PDF,
give a best guess about what text is on the page, and
save that text back with the image into another, combined PDF.

The OCR may not be perfect. But it will make your PDFs more usable.

1. Get Zotero ready.

To get Zotero ready to add text to your image-only PDFs, you’ll first need to

install Tesseract OCR for Windows or Linux or Mac and
download and extract Poppler for Windows or Linux or Mac to a directory where you’re okay with it staying.

Once you have these tools, install the Zotero OCR extension in Zotero.

After you restart Zotero,

Go to Tools > Zotero OCR Preferences.
For the path to your OCR engine, enter the path to tesseract.exe (e.g., C:\Program Files\Tesseract-OCR\tesseract.exe).
For the path to pdftoppm, enter the path where you have Poppler’s pdftoppm.exe (e.g., C:\Users\[yourusername]\poppler-0.68.0\bin\pdftoppm.exe).
Customize the other options according to your preferences, and click “OK.” If you want Zotero’s OCR text back in a PDF file, you should at least leave the “Save output as a PDF with text layer” box checked. But you may want to leave unchecked the option to overwrite the initial PDF, just in case something goes amiss with the conversion.
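If you’re not sure where Tesseract or Poppler ended up on your computer, a short script can look for them before you fill in the preferences. This is a rough sketch in Python rather than anything Zotero requires. The find_tool helper is a hypothetical name, and the extra folder it checks is just the example Tesseract location mentioned above.

```python
import shutil


def find_tool(name, extra_dirs=()):
    """Return the full path to an executable, or None if it is not found."""
    # Look on the system PATH first.
    found = shutil.which(name)
    if found:
        return found
    # Then try any additional folders the caller suggests.
    for directory in extra_dirs:
        found = shutil.which(name, path=str(directory))
        if found:
            return found
    return None


if __name__ == "__main__":
    # Example folder only; adjust it to wherever you installed the tools.
    guesses = [r"C:\Program Files\Tesseract-OCR"]
    for tool in ("tesseract", "pdftoppm"):
        print(tool, "->", find_tool(tool, guesses) or "not found")
```

Whatever path the script prints for each tool is the one to paste into the matching preference field. If it prints “not found,” you’ll need to track down the folder yourself.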

2. Create a PDF with real text.

At this point, Zotero is ready to

run OCR on any image-only PDF in your library and
create a new PDF that maps these page images to real text.

To do so, find an image-only PDF in Zotero, right-click it, and choose “OCR selected PDF(s).”

After you click this option, you’ll want to be patient. The process may take a while, even with a comparatively short PDF. And it can look like not much is happening.

But eventually, you should get a command line window that gives you some progress indicators as Tesseract works through your PDF.

When Tesseract finishes, you’ll see a new linked attachment in Zotero with a “.ocr.pdf” ending to the file name. You can use this file to interact with the real text that Tesseract worked out for your PDF’s page images. Zotero’s indexer and your PDF reader’s find function can use that text too.
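If you’re curious what’s going on while you wait, the conversion boils down to two command-line steps: Poppler’s pdftoppm turns each page into an image, and Tesseract reads those images and writes a PDF with a text layer. The Python sketch below only builds those two commands. It’s my approximation of the general approach, not the extension’s actual code, and the function names, the 300 dpi setting, and the file names are assumptions made for illustration.

```python
def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Build the command that renders each PDF page to a PNG image.

    pdftoppm writes files named <image_prefix>-<page number>.png.
    The 300 dpi default is an assumption, not a documented extension setting.
    """
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(image_prefix)]


def tesseract_command(image_list_file, output_base, language="eng"):
    """Build the command that OCRs the page images into one PDF.

    image_list_file is a plain text file with one image path per line.
    Tesseract appends ".pdf" to output_base on its own.
    """
    return ["tesseract", str(image_list_file), str(output_base), "-l", language, "pdf"]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(pdftoppm_command("scan.pdf", "page")))
    print(" ".join(tesseract_command("pages.txt", "scan.ocr")))
```

With an output base of “scan.ocr,” Tesseract would write “scan.ocr.pdf,” which matches the file ending you see on the new attachment.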

If you want to be able to search the new text in your PDF from Zotero, you might want to rebuild or update your Zotero index (Edit > Preferences > Search > Rebuild Index …).

3. Clean up the leftovers.

If you don’t care to keep the leftovers from the conversion process, you can clean them up at this stage. Just right-click either the new linked file attachment or the original one in your Zotero library, and choose to “Show File.”

You’ll then be shown the Zotero storage folder where your PDFs are stored. Be sure not to touch the .zotero-ft-cache or .zotero-ft-info files. But any leftover text (“.txt”) files you can delete.

And if you’re satisfied with the results of the conversion, you can also delete your original PDF from this folder and rename the “.ocr.pdf” file to omit the “.ocr” portion of its file name. It should then have the same name as your original PDF.

So, the original stored file link in Zotero (the one without the little chain icon) should work to open it. And you can also delete the Zotero link to the “.ocr.pdf” file (which you’ve now renamed).
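If you have a lot of folders to tidy, the file-level cleanup described above can be scripted. Here’s a cautious sketch in Python. The function names are hypothetical, it assumes the leftovers follow the naming described in this step, and it defaults to a dry run that only prints what it would do.

```python
from pathlib import Path

# Zotero's own index files, which should be left alone.
PROTECTED = {".zotero-ft-cache", ".zotero-ft-info"}


def plan_cleanup(filenames):
    """Return (to_delete, to_rename) for the files in one storage folder."""
    names = set(filenames)
    to_delete, to_rename = [], []
    for name in sorted(names):
        if name in PROTECTED:
            continue
        if name.endswith(".txt"):
            # Leftover text output from the conversion.
            to_delete.append(name)
        elif name.endswith(".ocr.pdf"):
            original = name[: -len(".ocr.pdf")] + ".pdf"
            if original in names:
                # The image-only original is replaced by the OCR version.
                to_delete.append(original)
            to_rename.append((name, original))
    return to_delete, to_rename


def apply_cleanup(folder, dry_run=True):
    """Print the plan for a folder; change files only when dry_run is False."""
    folder = Path(folder)
    names = [p.name for p in folder.iterdir() if p.is_file()]
    to_delete, to_rename = plan_cleanup(names)
    for name in to_delete:
        print("delete", name)
        if not dry_run:
            (folder / name).unlink()
    for old, new in to_rename:
        print("rename", old, "->", new)
        if not dry_run:
            (folder / old).rename(folder / new)
    return to_delete, to_rename
```

Run apply_cleanup on a storage folder first with the default dry run, and only pass dry_run=False once the printed plan looks right. The script doesn’t touch your Zotero library itself, so you’d still remove the extra “.ocr.pdf” link from inside Zotero.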

Conclusion

Having real text in a PDF makes it possible to search that document. It also makes it easier to mark it up. Older PDFs or PDFs of older sources might not come with this real text already in them, and OCR is rarely perfect.

But you can use Zotero to add a good amount of accurate text to your image-only PDFs, which will make annotating and referencing these files that much easier.

¹ Header image provided by Zotero via Twitter.



Posted

13 June 2022

J. David Stark