Convert scanned PDFs to text documents using Google Drive

Google Drive might be best known as a cloud-based storage solution and a tool for word processing, spreadsheets and presentations, but it has also become quite useful for working with PDF files.

Not only does Google Drive include a Download as PDF option, which allows you to save your text documents, spreadsheets and presentations as PDFs, but it also lets you upload PDF files to the Google Drive file management system so that you can organize and share them along with your other Google Drive files.

Google Drive also includes a powerful option to convert scanned PDFs (that is, PDF files which have been scanned from paper to an image and then converted to a PDF) into text-based documents via optical character recognition (OCR) technology when the PDF files are uploaded.

The OCR process is presumably powered by Tesseract, an OCR engine whose development began at Hewlett-Packard in 1985, which was released as open source in 2005, and which Google has sponsored (and extensively developed) since 2006.

This article takes a closer look at that conversion of scanned PDFs to text documents.

How to convert scanned PDFs to text documents

The first step is to ensure that your settings are configured to use Google Drive’s OCR features. Once you log in and open your Google Drive account, click on the cog icon in the top-right of your browser window to bring up the settings menu.

The window shown below will now pop up. Make sure that the Convert text from PDF and image files option is selected. This is the option that is going to convert your scanned PDF files to text documents. As an alternative, you can check Confirm settings before each upload if you would prefer to decide which files to convert on an individual basis.

Figure 1. Upload settings

Click on the ‘Upload’ button, highlighted below, then click Files, select your scanned PDF files and then click on the Open button.

Figure 2. Upload file(s)

The upload and OCR process will now begin. The window below should appear at the bottom right of your screen and will show you the status of the upload.

Figure 3. Upload status

Once Google Drive has finished doing its thing, the newly minted text document will be shown in the file list. Click on it to open it and take a look.
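If you would rather script the upload than click through the web interface, the same conversion can probably be requested through the Google Drive API. The article itself only covers the web interface, so treat the following as a minimal sketch under stated assumptions: it uses Drive API v3 with the google-api-python-client package, it expects an already authorized `service` object (for example from `googleapiclient.discovery.build("drive", "v3", credentials=...)`), and the helper names `build_ocr_request` and `upload_with_ocr` are made up for illustration. The idea is that uploading a PDF while asking for a Google Docs target type prompts Drive to run OCR during the import.

```python
from pathlib import Path

# MIME type of a native Google Docs document. Requesting it as the target
# type for an uploaded PDF asks Drive to convert the file on import.
GOOGLE_DOC_MIME = "application/vnd.google-apps.document"


def build_ocr_request(pdf_path, ocr_language=None):
    """Build keyword arguments for Drive API v3 files().create().

    Hypothetical helper: it only assembles the request, so it can be
    inspected without any network access.
    """
    kwargs = {
        "body": {
            "name": Path(pdf_path).stem,   # document title without ".pdf"
            "mimeType": GOOGLE_DOC_MIME,   # convert to a Google Doc
        },
        "fields": "id, name",              # only return what we need
    }
    if ocr_language:
        # Optional language hint for OCR (ISO 639-1 code, e.g. "en").
        kwargs["ocrLanguage"] = ocr_language
    return kwargs


def upload_with_ocr(service, pdf_path, ocr_language=None):
    """Upload a scanned PDF and return the metadata of the converted doc.

    Assumes `service` is an authorized Drive v3 client.
    """
    # Imported here so build_ocr_request works without the client library.
    from googleapiclient.http import MediaFileUpload

    media = MediaFileUpload(str(pdf_path), mimetype="application/pdf",
                            resumable=True)
    request = service.files().create(
        media_body=media, **build_ocr_request(pdf_path, ocr_language))
    return request.execute()
```

Calling `upload_with_ocr(service, "scans/invoice.pdf", ocr_language="en")` should then return a small dictionary with the `id` and `name` of the new document, which will appear in the file list just like one converted through the browser. The file path here is only an example.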

The result

Converting a text-based PDF (that is, a PDF which uses text objects and can also include image objects and other object types) to a Microsoft Word document is notoriously hard. So you can imagine that converting a scanned PDF, which does not include any actual text objects, into a Google Document is going to be even harder because of the OCR step required.

I have found that Google Drive is quite good at taking a scanned PDF and recognizing its text, but has a hard time maintaining the look and feel (read: layout and formatting) of the original scanned PDF. This isn’t surprising: as I mentioned before, it’s notoriously hard to do, and people currently pay a lot of money for inferior solutions.

If you use Google Drive as a tool to get text from scanned PDF files so that it can be copied and pasted elsewhere, then I think you’ll find it very useful. But if you’re looking to replicate a scanned PDF as a text-based document with the exact same look and feel, it will still require quite a bit of manual tinkering to get the final output looking right.

About the Author: Rowan Hanna
