Turning Paper Into Searchable PDFs

Luckily, after the release of Acrobat 5, Adobe listened to users’ disappointment over the exclusion of the Paper Capture plug-in. Today we can download the plug-in and, as we could with Acrobat 4, convert scanned images into real, searchable text.

Update: The capture plug-in is included in Acrobat 6 Professional but not Acrobat 6 Standard.

While Acrobat’s Paper Capture plug-in is not going to give you the power to turn an office completely paperless, it’s particularly useful for making key data and info available across a network. For example, imagine having information like commonly accessed contracts and documentation on a computer network, instead of locked away as paper in a filing cabinet.

Once a document is on your computer, the Paper Capture plug-in lets you make it searchable through its optical character recognition (OCR) engine. The plug-in turns the images of text in your scans into actual text characters that are editable and searchable. Rather than scouring the pages to locate the information you’re after, you can find it almost instantly using Acrobat’s built-in Find tool.

Getting started

The first step is to scan the paper into digital form. This is the key step in getting the most out of your electronic copy: because Paper Capture runs over the scanned image, the clearest and best-aligned pages are the most likely to be processed well. We won’t go into too much detail, as we’re presuming you have a scanner and a basic understanding of how to use it. Scanning your documents at 300 dots per inch (DPI) or higher will help a lot.
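To get a feel for why scan resolution matters, here is a rough back-of-the-envelope sketch in Python. It is not part of Acrobat or the plug-in; the function name `text_height_pixels` and the 10-point example size are our own assumptions. It relies only on the fact that one typographic point is 1/72 of an inch, so the number of pixels available for each line of text grows in step with the DPI.

```python
# Rough illustration only: how many pixels tall is printed text once scanned?
# The function name and the 10 pt example size are assumptions for this sketch.

POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def text_height_pixels(font_size_pt: float, dpi: int) -> float:
    """Approximate height in pixels of text set at font_size_pt, scanned at dpi."""
    return font_size_pt / POINTS_PER_INCH * dpi


if __name__ == "__main__":
    for dpi in (72, 150, 300, 600):
        height = text_height_pixels(10, dpi)
        print(f"{dpi:>3} DPI: 10 pt text is about {height:.0f} px tall")
```

The more pixels each character gets, the more detail the OCR engine has to work with, which is one reason a higher-resolution scan tends to be recognized more reliably.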

Using the Paper Capture plug-in

Once you’ve downloaded and installed the plug-in, in Acrobat go to Tools > Paper Capture.

PDF Output Styles

Once you’ve scanned your pages and have them in Acrobat, you can convert the ‘PDF Image Only’ format in a number of ways, depending on how you want to use your PDFs. Here’s an overview of the different styles:

  • PDF Formatted Text and Graphics: Used to be known as ‘PDF Normal.’ When you run Paper Capture, all bitmapped text it recognizes is replaced with the equivalent text characters. This format creates a smaller file than the other styles; however, it changes the original look of the document.
  • PDF Searchable Image (Exact): Used to be known as ‘PDF Image + Text.’ This creates the largest file but, as the name suggests, is the most accurate. When the plug-in is run, a layer of text is placed behind the image, so the page appears exactly as it did when you scanned it, but is now searchable.
  • PDF Searchable Image (Compact): This is the compromise between the two styles above, producing smaller files than the Exact method. The general look and feel of the image is retained and it becomes searchable, but the quality is not quite as good as with the Exact method.

You’ll notice that along with the different styles you have the option to downsample the pages. The lower the DPI, the smaller the file size. If you’re only going to be using the PDF on your computer (and don’t plan to print it), then downsampling is a good way to reduce the file size.
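As a rough illustration of how much downsampling can save, the Python sketch below estimates the uncompressed size of one scanned page at a few resolutions. The function name `raw_page_bytes`, the US Letter page size and the 8-bit grayscale depth are assumptions made for the example. Real PDF sizes will be smaller and will vary with the compression used, so treat the numbers as relative rather than absolute.

```python
# Rough illustration only: uncompressed bitmap size of a scanned page.
# The page size (US Letter) and 8-bit grayscale depth are assumptions;
# actual PDF files are compressed and will be smaller.


def raw_page_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Estimate the uncompressed size in bytes of a page scanned at dpi."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    for dpi in (300, 150, 72):
        size = raw_page_bytes(8.5, 11, dpi, 8)
        print(f"{dpi:>3} DPI: about {size / 1_000_000:.1f} MB uncompressed")
```

Because the pixel count depends on both width and height, halving the DPI cuts the raw image data to about a quarter, not a half.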

PDF Formatted Text and Graphics

The PDF Formatted Text and Graphics style includes extra editing features that let you manually process words the plug-in was uncertain about. These words are shown with a border around them. As OCR technology is not 100% accurate, the uncertain words are kept as images so that you can go through them manually.

To start accepting or rejecting the conversion of each text image into real text, go to Tools > Touch Up Text > Find First Suspects.

This opens a window in which you accept or reject each change.

Final Thought

If you want to take it a step further, indexing a set of files makes locating content much simpler. Once the index is set up, it’s just a matter of using the Search tool in Acrobat.


About the Author: Richard Crocker
