Google has implemented technology that will allow it to index the full text of scanned PDF documents. In the past, such documents were rarely indexed at all.
While the search giant has provided full-text indexing of PDF documents for some time, scanned documents posed special problems. PDFs come in various flavors, including text-only, image-plus-text and image-only. The first two are produced when a PDF is generated directly from an electronic source such as a Word document. As they already include text, they are relatively easy to index.
By contrast, image-only PDFs are typically created by scanning paper documents. Computers may not recognize text in such documents: while the resulting PDFs look like the printed originals, they are in fact flat images without any textual content.
As Evin Levey, Google Product Manager, put it in the original blog post:
“To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter ‘O’, just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.”
In order to index the document’s text, Optical Character Recognition (OCR) needs to be performed. OCR is the process of comparing the shapes in the scanned image with known character patterns to determine which shapes represent text, and which characters they are. Once complete, this allows the document’s text to be properly indexed.
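To make the idea of comparing shapes with known characters concrete, here is a minimal sketch of template matching, one of the simplest forms of OCR. It is a toy illustration only and not a description of Google's system: the 3x5 bitmaps, the three-character "database" and the names TEMPLATES, score, recognize and recognize_line are all assumptions invented for this example.

```python
# Toy template-matching OCR. Each glyph is a tuple of strings, where
# "#" is an inked pixel and "." is blank. The templates and function
# names are hypothetical and exist only for this illustration.

TEMPLATES = {
    "0": ("###",
          "#.#",
          "#.#",
          "#.#",
          "###"),
    "1": (".#.",
          "##.",
          ".#.",
          ".#.",
          "###"),
    "7": ("###",
          "..#",
          ".#.",
          ".#.",
          ".#."),
}


def score(glyph, template):
    """Count the pixels on which the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the best-matching character and the fraction of agreeing pixels."""
    best = max(TEMPLATES, key=lambda ch: score(glyph, TEMPLATES[ch]))
    total_pixels = len(glyph) * len(glyph[0])
    return best, score(glyph, TEMPLATES[best]) / total_pixels


def recognize_line(glyphs):
    """Recognize a sequence of glyphs and join the results into text."""
    return "".join(recognize(glyph)[0] for glyph in glyphs)


# A scanned "7" with one stray pixel, as might come from a smudge.
noisy_seven = ("###",
               "..#",
               ".#.",
               "##.",
               ".#.")

if __name__ == "__main__":
    print(recognize(noisy_seven))
    print(recognize_line([TEMPLATES["1"], TEMPLATES["0"], noisy_seven]))
```

The returned match fraction hints at why the process is error-prone: a smudged or ambiguous shape, such as a circle that could be a zero or the letter ‘O’, may match several templates almost equally well, so real OCR systems generally rely on more than pixel-by-pixel comparison.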
Google has updated its system and commenced indexing. For more information and examples, check out Levey’s original blog post.