When a file is converted to PDF, it loses its meaning. On the surface all the information is there, and to your eyes it looks exactly the same, but underneath, all the method, structure and intelligence used when designing the original document has been lost.† This is the heart of the challenge in converting PDF files back to formats like DOC (Microsoft Word), RTF and HTML, and it is not dissimilar to the challenge faced when OCRing paper-based documents.
Once you have your PDF file, the original layout and meaning formed from text-based building blocks (including words, lines and line breaks, paragraphs, columns, tables, headers/footers and outlines) are long gone. The content of a PDF just describes how and where on the page each object should be displayed.
This is a far cry from where you would be if you went back to the original file in Microsoft Word, Open Office, Google Docs, Adobe InDesign or a similar application. These kinds of word processing and desktop publishing applications follow similar principles, which is why converting files between them (while certainly not perfect) is a much simpler process.
How files are normally designed and edited in word processing applications
Most word processing applications use the same sort of principles for formatting and giving meaning to content. For the sake of this article, I’ll use Microsoft Word as the example. Here are a few of the main ones:
- Paragraphs let you work with text that reflows across lines and can be quickly reformatted using styles to adjust spacing, indent, size and more.
- Columns let you build more complex page layouts and, in many cases, make content easier to follow by grouping it in meaningful ways.
- Tables let you lay out tabular information not suited to the more linear formatting offered by paragraphs and columns.
- Headers & footers let you repeat content consistently across multiple pages.
PDF to Word is like the OCR process
If you’re familiar with optical character recognition (OCR) and converting paper to electronic form, you might have already grasped some of the complexities we’re dealing with. Apart from recognizing fonts and how they should be displayed on the page, the challenges are much the same for both, as all meaning and structure are gone from the contents.
The loss of the text stream
Take a look at the screenshot below. The first three lines show how the text is displayed on the page in a PDF. The second set shows how many separate objects that same text is broken into inside the PDF. For each small text object, the PDF includes co-ordinates that simply describe where it should be positioned on the page and how it should be displayed.
[Image: Text objects in PDF]
The first challenge for exporting text back out of PDF files comes when the streams of text from the original word processor get broken up into these seemingly random chunks. From here we must start to discern what their relationship is to the content around them. This process begins by sucking out all the text from the PDF.
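To make the problem concrete, here is a minimal Python sketch of one way to start relating the chunks to each other once they have been pulled out of the page. It is an illustration only, not how any particular converter works. The `TextFragment` type, the function names and the two threshold values are all assumptions made up for this example. The sketch groups fragments whose baselines are nearly equal into lines, sorts each line left to right, and inserts a space only where there is a visible horizontal gap.

```python
from dataclasses import dataclass


@dataclass
class TextFragment:
    """One positioned text object (hypothetical shape, for illustration)."""
    x: float      # left edge, in points from the left of the page
    y: float      # baseline, in points from the bottom of the page
    width: float  # rendered width of the fragment, in points
    text: str


def group_into_lines(fragments, y_tolerance=2.0):
    """Group fragments with nearly equal baselines, top of page first."""
    lines = []  # each entry is [baseline_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    # Within each line, order the fragments left to right.
    return [sorted(frags, key=lambda f: f.x) for _, frags in lines]


def join_line(fragments, gap_threshold=1.5):
    """Concatenate one line, adding a space wherever there is a clear gap."""
    parts = []
    prev_right = None
    for frag in fragments:
        if prev_right is not None and frag.x - prev_right > gap_threshold:
            parts.append(" ")
        parts.append(frag.text)
        prev_right = frag.x + frag.width
    return "".join(parts)


def rebuild_text(fragments):
    """Return the page text as a list of reconstructed lines."""
    return [join_line(line) for line in group_into_lines(fragments)]


# Fragments arrive in no particular order, split at arbitrary points.
fragments = [
    TextFragment(120, 700.0, 18, "file"),
    TextFragment(72, 686.5, 30, "is co"),
    TextFragment(72, 700.0, 20, "Wh"),
    TextFragment(102, 686.0, 40, "nverted"),
    TextFragment(92, 700.0, 25, "en a"),
]
print(rebuild_text(fragments))  # ['When a file', 'is converted']
```

Even this toy version has to guess at two thresholds, and it only recovers lines. Paragraphs, columns, tables and headers/footers each need further guesswork of the same kind.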
† It is possible to create PDF files with embedded structure information in them; however, most PDF files don’t have this structure.