Introduction to PDF content extraction

With the formalization of the PDF/A standard for archiving, PDF looks more likely than ever to become the archive format of the future. As a result, a growing number of information workers will have access to PDF documents whose source files and authoring applications are no longer available. This can cause serious problems if the documents need to be updated at a later date or if the content needs to be repurposed for delivery via another medium. This is just one of many scenarios in which PDF extraction fills a crucial role.

What is PDF extraction?

Simply put, PDF extraction is the conversion of PDF data into another, reusable form. While some content manipulation can be accomplished with Acrobat (and even more is possible with Acrobat plug-ins like PitStop Professional and ARTS PDF ImageWorks), serious content editing within Acrobat is generally frowned upon. For all the advances in its native editing features, Acrobat was never intended to be a composition or layout application. Complex editing should ideally be done in the original source application(s) before regenerating the PDF document. What, then, can be done for files requiring significant revision when the source files are either unavailable or unusable?

The answer, of course, is to extract the relevant data from the PDF files. Assuming that the PDF files in question are not secured to prohibit text and graphics selection, the next step is simply to choose which of the available tools will work best in a given environment. Acrobat users can convert entire PDF documents into image and other formats using the Save As command, while the Select and Snapshot tools enable users to extract selected objects from PDF files.
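As a rough illustration of the "not secured" assumption above: in a PDF protected with the standard security handler, the permissions are stored as an integer bit field (the P entry of the encryption dictionary), and the bit with value 16 governs copying or extracting text and graphics. The sketch below only tests that bit on a number you have already obtained from the file. The function name and the sample values are assumptions made for illustration, and this is not a description of how Acrobat or any particular tool performs the check.

```python
# Bit 5 of the permissions field (value 16) controls "copy or otherwise
# extract text and graphics" under the standard security handler.
EXTRACT_BIT = 1 << 4  # 16


def can_extract(permissions: int) -> bool:
    """Return True if the extraction bit is set in a permissions value.

    `permissions` is the signed integer taken from an encrypted PDF's
    P entry. How that value is read from the file is outside this
    sketch (hypothetical caller responsibility).
    """
    return (permissions & EXTRACT_BIT) != 0


# Illustrative values only, not taken from any real document.
print(can_extract(-44))  # extraction bit set
print(can_extract(-60))  # extraction bit cleared
```

An unencrypted PDF has no such field at all, in which case extraction is not restricted in the first place.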

There are numerous third-party tools designed specifically to extract information from PDF files. Most are more specialized than Acrobat’s ‘Save As’ command, and many are also more powerful. That said, a large number of them are plug-ins, meaning that they extend rather than compete with Acrobat’s conversion features.

OCR and ‘image only’ PDFs

When it comes to PDF extraction, it’s crucial to know what kind of source PDF will be used. PDF is a very versatile format, and it comes in several different ‘flavors’: text only, image plus text, and image only.

Text only and image plus text are similar in that they contain searchable text, the only significant difference being that image plus text PDFs contain both text and images (funnily enough). When it comes to content extraction, the only flavor that poses inherent problems is image only.
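The three flavors can be told apart by asking two questions of each page: does it carry any text, and does it carry any images? The following is a minimal sketch of that decision, assuming you already have those two facts per page from whatever PDF tool you use. The PageSummary type and the classify_flavor function are hypothetical names introduced here, not part of any product mentioned in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageSummary:
    """Hypothetical per-page facts gathered by some PDF inspection tool."""
    has_text: bool    # page contains selectable/searchable text
    has_images: bool  # page contains raster images


def classify_flavor(pages: List[PageSummary]) -> str:
    """Classify a document into one of the three flavors described above."""
    any_text = any(p.has_text for p in pages)
    any_images = any(p.has_images for p in pages)
    if any_text and any_images:
        return "image plus text"
    if any_text:
        return "text only"
    if any_images:
        return "image only"  # the flavor that will need OCR
    return "empty"           # no text and no images found


scanned = [PageSummary(has_text=False, has_images=True)]
print(classify_flavor(scanned))
```

A real document may mix scanned and text pages, so a per-page check is often more useful in practice than a single label for the whole file.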

Image only PDFs can be created by imaging applications or by scanning. In such files, text may not be recognized: while the resulting PDFs look like the printed originals, they are in fact flat images without any textual content. This can be sufficient for some purposes, but if you want to select, search or extract text, then Optical Character Recognition (OCR) will need to be performed. OCR is the process of comparing the shapes in the page image with characters in a database to determine which shapes represent text.
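The idea of comparing shapes against a character database can be shown with a toy example. The sketch below is a deliberately simplified stand-in, assuming tiny 3x5 black-and-white glyphs and a three-letter "database". Commercial OCR engines use far more sophisticated techniques, and all names here (TEMPLATES, mismatches, recognize) are made up for illustration.

```python
# A tiny "character database": each glyph is 5 rows of 3 pixels,
# where '#' is ink and '.' is background.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def mismatches(glyph, template):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(glyph, template)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


def recognize(glyph):
    """Return the database character whose bitmap differs least."""
    return min(TEMPLATES, key=lambda ch: mismatches(glyph, TEMPLATES[ch]))


# A scanned "L" with one stray pixel of noise in the middle row.
noisy_l = ("#..", "#..", "##.", "#..", "###")
print(recognize(noisy_l))
```

Because the match is "closest" rather than "exact", a little scanner noise does not prevent recognition, which is also why OCR output should be proofread: a badly degraded shape can land closer to the wrong character.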

With Acrobat’s Paper Capture plug-in, it’s possible to perform OCR and add an invisible layer of text (known as ‘hidden text’) to the image PDF. In effect, this makes it an image plus text PDF document. Further, Adobe offers a powerful server-based solution called Adobe Acrobat Capture that handles large volumes of PDF documents.

Alternatively, the AdLib OCR Add On for the AdLib eXpress range can also be used for this purpose. Gemini, on the other hand, boasts a character mapping facility that can be used to convert image only PDFs into a variety of editable formats such as HTML and RTF.

In any case, the best fit for a given workflow will depend on the technologies already in place, the type and volume of PDF input and, of course, the all-important bottom line.

That’s it for now, but if you have any topics or issues you’d like to see covered on Planet PDF in the future, please drop us a line.


About the Author: Dan Shea
