OCR: Yuor Sarech Enigne Cna’t Raed Tihs

The limitations of relying on text searching become clear when you use a search engine on OCR’d documents. OCR software has gotten much, much better, but you can still count on 20+ errors on most non-laser-printed pages. I never count on text searching to locate the smoking gun…

There’s a little silliness out there on the web illustrating how you can still make sense of words if the first and last letters are intact, even if all the others are scrambled.

That only works for humans. Computers can only search the actual strings of letters. Some text-search products claim to have ‘fuzzy’ searching capabilities. The only one I’ve seen that came anywhere close to working was Excalibur (now Convera) RetrievalWare, and it doesn’t look like they are still marketing that aspect of it. It’s expensive and takes significant tech expertise; the implementation I used was set up by Aspen Systems, a huge lit support contractor. Others may have had better luck than I did with the ‘fuzzy’ features of other search tools. Firms used to pay to retype text files in an effort to clean up ‘dirty OCR.’ Surely everyone has better things to do with their lives… In addition, it’s not that easy to get to the ‘text layer’ of a PDF file to alter the underlying text.
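For readers curious what ‘fuzzy’ searching means in practice, here is a minimal sketch using Python’s standard difflib module. It is not how RetrievalWare or any commercial product works; the function name fuzzy_find, the 0.8 similarity cutoff, and the sample sentence (with the classic OCR mistake of reading ‘m’ as ‘rn’) are all assumptions made up for illustration.

```python
import difflib
import re


def fuzzy_find(query, text, cutoff=0.8):
    """Return (word, similarity) pairs for words in text that roughly match query.

    Hypothetical helper: the 0.8 cutoff is an arbitrary illustrative choice.
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    target = query.lower()
    hits = []
    for word in words:
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        score = difflib.SequenceMatcher(None, target, word.lower()).ratio()
        if score >= cutoff:
            hits.append((word, round(score, 2)))
    return hits


# Made-up OCR output in which "m" was misread as "rn".
ocr_text = "The rnemorandum was sent to the cornmittee on Tuesday."

print("memorandum" in ocr_text)            # exact search misses the word
print(fuzzy_find("memorandum", ocr_text))  # approximate search still flags it
```

Comparing the query against every word like this is slow on a large collection, and a looser cutoff starts pulling in false hits, which may be part of why the ‘fuzzy’ features in real products so often disappoint.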

Remember, the image layer is a picture of the letters; the text layer is the letters themselves. To a computer there is a huge difference, and on an OCR’d PDF they are not the same. When you use Find or Search, you are looking for strings of letters. If those letters are garbled, you won’t find the words you are looking for.
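To make the point concrete, here is a small Python sketch that garbles a sentence the way the title of this post is garbled (first and last letters kept, the letters in between shuffled) and then runs an exact search against it. The function names and the sample sentence are assumptions for illustration, and real OCR errors look different from shuffled letters, but the effect on Find is the same.

```python
import random


def scramble_word(word, rng):
    """Shuffle the interior letters of a word, keeping the first and last in place."""
    interior = word[1:-1]
    # Short words, or words whose interior is one repeated letter, cannot change.
    if len(word) <= 3 or len(set(interior)) < 2:
        return word
    middle = list(interior)
    # Reshuffle until the interior actually differs from the original.
    while "".join(middle) == interior:
        rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble(text, rng):
    """Scramble every space-separated word in text."""
    return " ".join(scramble_word(w, rng) for w in text.split())


rng = random.Random(42)  # fixed seed so repeated runs give the same garbling
original = "your search engine cannot read this"
garbled = scramble(original, rng)

print(garbled)                 # still readable to a person
print("search" in original)    # True: the exact string is there
print("search" in garbled)     # False: the string of letters has changed
```

A person reads the garbled line without much trouble; the exact search comes back empty because the string of letters it was asked for is no longer there.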

I’ve long believed that your doc management system should let you both search for a document and navigate to it. Text searching works very well on electronically generated PDFs (like those made from word processing or email files). Otherwise, I use it as a ‘rough cut’ tool, to turn a big pile of documents into a number of smaller piles. PDF (and specifically some tools built into Acrobat) helps a lot.

That way, once you’ve found a good document, you know where to find it again; don’t keep running those queries every time.

