In going back through the comments and questions I’ve read, I see that one area that concerns many people is how to use the OCR (Optical Character Recognition) abilities of Acrobat. Here’s an overview, and I’ll try to deal with other OCR issues very soon.
When you get a document that has been scanned rather than exported from the software that created it, such as MS Word, it’s just an image (i.e. a picture). Remember, to a computer, a picture of the letter ‘A’ is not the same as the text character ‘A,’ so when you try to text-search an image, you get no hits because there’s no text to search. Typical scanned litigation documents are in the TIFF (image) format. There are also many software and hardware packages that scan paper directly into PDF. For now, I’m not going to address using Acrobat or other tools as the scanning software. For our purposes today, let’s just say, ‘you’ve got those image files that you want to convert into something you can search.’
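If the picture-versus-text distinction feels abstract, here is a minimal sketch of it in Python. Everything in it is made up for illustration: the tiny 5x7 "scan" of the letter A, the `scan_to_pixels` helper, and the naive `search` function are my own stand-ins, not anything Acrobat or a real scanner actually uses.

```python
# A 5x7 "scan" of the letter A. '#' is black ink, '.' is white paper.
GLYPH_A = [
    ".###.",
    "#...#",
    "#...#",
    "#####",
    "#...#",
    "#...#",
    "#...#",
]


def scan_to_pixels(rows):
    """Turn the drawing into raw pixel bytes: 255 for ink, 0 for paper."""
    return bytes(255 if ch == "#" else 0 for row in rows for ch in row)


def search(haystack: bytes, word: str) -> bool:
    """A naive text search: look for the word's character codes in the data."""
    return word.encode("ascii") in haystack


image_page = scan_to_pixels(GLYPH_A)  # what a scanner hands you
text_page = "A".encode("ascii")       # what a word processor would export

print(search(text_page, "A"))   # True: the character code for 'A' is there
print(search(image_page, "A"))  # False: only light and dark dots are there
```

Your eye sees an A in both cases, but the second file holds nothing except dots. Working out which letter those dots form is the job OCR does.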
The unique thing about PDF is that you can have an exact image of the document, plus the text, plus all kinds of metadata ALL IN ONE FILE. This is a wonderful thing, but I will expound on its wonderfulness later… With the ‘Paper Capture’ tools in Acrobat, the software reads the picture and figures out what the text is. So while you still see the ‘image,’ the recognized text now sits underneath it, where search tools can read it. OCR is not perfect, and it works best on first-generation, laser-printed images (just like your eyes do). In the past decade, however, OCR technology has gotten surprisingly accurate.
A couple of key points here. First, this discussion applies only to Acrobat, not to Reader. Second, prior to Acrobat 6, Adobe allowed you to perform ‘paper capture’ only on documents of up to 50 pages. So if you have Acrobat 4 or 5, you’ve got a 50-page limit (although, of course, there are ways to work around it). I think that Adobe still offers the Capture Server product for large-scale scanning and OCR work. It’s meant for use in a high-volume production environment, such as at a litigation support vendor. In my experience, in government at least, people were leery of using it because you paid by the page. That is, you could buy a 100,000-page license, and then you had to fill ‘er up again for the next 100,000. Acrobat 6 Professional allows you to ‘capture’ or OCR large documents without buying the separate server, but it is still not truly a substitute for industrial-strength tools in a production environment. It is, however, capable of a surprising level of automation, and as far as I can tell, it’s not dumbed down in its character recognition capabilities.
So here you are with a big old TIFF file. Or, if you are like me and occasionally have opposing counsel who just wants to jerk your chain, a PDF file that was produced in ‘image only’ format from MS Word and contains no text.
In Acrobat 6, go to File > Create PDF > From File and select the TIFF file that you want to convert. That brings your image into the PDF format, but still doesn’t make it word-searchable.
You can also choose ‘From Multiple Files’ if you want to do a batch.
Now, go to Document > Paper Capture > Start Capture. The dialog that comes up gives you some choices. You can do a page, all pages, or a range (which might be a good choice if you have, say, a few pages of text followed by lots of charts). Be sure to click the ‘Edit’ button to see the other things you can do, like select English as the recognized language. The PDF Output Style choice you probably want is ‘Searchable Image (Exact).’ As a rule, I wouldn’t do any downsampling of the image, even though downsampling would reduce the size of the resulting file.
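To put a rough number on the downsampling trade-off, here is a back-of-the-envelope sketch in Python. The letter-size page and the 300 and 150 dpi figures are my assumptions for illustration, not Acrobat's settings, and the counts are for raw pixels only. Real TIFF and PDF files are compressed, so actual savings will differ.

```python
def pixel_count(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in a page scanned at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


# Assumed example: a US letter page (8.5 x 11 inches).
full = pixel_count(8.5, 11, 300)  # scanned at 300 dpi
half = pixel_count(8.5, 11, 150)  # downsampled to 150 dpi

print(full)         # 8415000
print(half)         # 2103750
print(full / half)  # 4.0
```

Halving the resolution throws away three out of every four pixels. That is where the smaller file comes from, and it is also detail you can never get back if you need to look closely at the image later.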
Click OK, and the OCR engine will start up. If you are running a normal Windows box of moderate memory and processor speeds, pretty much every other process will choke while Acrobat reads the document and converts the pictures of letters into text letters. If it’s a heavily formatted, 1,000-page document, go have lunch or save it for the end of the day because this is going to take a while. Adobe does provide a process window that keeps you apprised of events.
When it’s done, don’t forget to File > Save the document. And there you have it. (At this point, I always like to do a little test by running a quick search on a word that I see on the first page. It just makes me feel better to know that it worked. I also have a continuing dialogue about what to do with the original TIFF file…)
As I said, if your image file is from a laser-printed copy, and it’s a decent scan, the OCR accuracy is amazingly good. But it may have garbled some words, so if you want to get really fancy, go back to Document > Paper Capture and select ‘Find first OCR suspect’ or ‘Find all OCR suspects.’ This identifies characters that the OCR engine had problems with and gives you a chance to correct the text. You can fix the spelling if it’s important to you, say for a proper name or a key term. That way you can be sure that the search software will find it. Otherwise, for a common word, I’d just save time and let it slide.
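I don’t know how Acrobat decides internally what counts as a ‘suspect,’ but a common way to think about it is that the OCR engine attaches a confidence score to each word and flags the low ones. Here is a minimal Python sketch under that assumption. The `OcrWord` type, the 0-to-1 confidence scale, the 0.80 threshold, and the sample words are all hypothetical.

```python
from typing import NamedTuple


class OcrWord(NamedTuple):
    text: str
    confidence: float  # hypothetical scale: 0.0 (a guess) to 1.0 (certain)


def find_suspects(words, threshold=0.80):
    """Return (position, word) for every word scored below the threshold."""
    return [(i, w) for i, w in enumerate(words) if w.confidence < threshold]


def find_first_suspect(words, threshold=0.80):
    """Return the first low-confidence word, or None if there are none."""
    suspects = find_suspects(words, threshold)
    return suspects[0] if suspects else None


def page_contains(words, term):
    """Case-insensitive search of the recognized text, word by word."""
    return any(term.lower() in w.text.lower() for w in words)


# Made-up OCR output: 'Smith' was misread as 'Srnith', 'that' as 'tbat'.
page = [
    OcrWord("The", 0.99), OcrWord("deponent,", 0.95), OcrWord("Mr.", 0.97),
    OcrWord("Srnith,", 0.55), OcrWord("testified", 0.93), OcrWord("tbat", 0.61),
]

print(find_first_suspect(page))     # (3, OcrWord(text='Srnith,', confidence=0.55))
print(len(find_suspects(page)))     # 2
print(page_contains(page, "Smith")) # False: the search misses the garbled name

# Correct only the proper name and let the common word slide.
page[3] = OcrWord("Smith,", 1.0)
print(page_contains(page, "Smith")) # True
```

The point of the sketch is the last four lines: until the name is corrected, a search for it comes up empty, which is exactly why a proper name is worth the effort and a common word usually isn’t.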