What are you putting in your PDF files?

Mark is the CEO of IDRsolutions, the company behind JPedal, a 100% Java PDF library. He blogs at Java PDF Blog.

PDF files are generally judged on how they appear. This is a shame because it is possible to create a well-crafted PDF (with lots of practical uses) or a horrible PDF (which is pretty useless) and the two versions will superficially look identical onscreen.

This article will explain what the difference is and how you can tell. I get to see an awful lot of PDF files in my day job developing a Java PDF viewer, so I would like to tell you about the good, the bad and the ugly.

Because PDF files are designed to be read and printed (jobs they do very well), most people judge them on their superficial appearance. The format actually lets you put just about anything inside a PDF — images, text, vector graphics — and what you put in can alter the flexibility of the PDF and what you can use it for.

Some PDF files contain just images. Even if they look like they contain text and shapes, these are just bitmapped images inside a PDF. They look okay, but you cannot search them (there is no text in them), and you need an OCR tool to extract any text. They also tend to be large, need lots of memory, and scale badly: zooming in gives a pixelated display. You can spot these files very easily by zooming into them or by trying to select the text (Ctrl-A). If you cannot select any of the text you can see, chances are that it’s an image.
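If you have a whole folder of files to check, the select-the-text test can be roughly approximated in code. The following is only a crude sketch in Python, not how a real PDF library does it: it scans the raw bytes of the file for font dictionaries and image objects without parsing the PDF. The function name `looks_image_only` is my own invention for this illustration, and the check gives up (returns `None`) when the file uses compressed object streams, because the dictionaries it looks for may then be hidden.

```python
import re
from typing import Optional

# Image XObjects are declared with /Subtype /Image (not /ImageMask).
_IMAGE = re.compile(rb"/Subtype\s*/Image\b")
# Font dictionaries are declared with /Type /Font (not /FontDescriptor).
_FONT = re.compile(rb"/Type\s*/Font\b")


def looks_image_only(data: bytes) -> Optional[bool]:
    """Guess from raw bytes whether a PDF holds only bitmapped images.

    True  -> images found and no font dictionary visible
    False -> a font dictionary is visible, or no images were found
    None  -> inconclusive: object streams may hide the dictionaries
    """
    if _FONT.search(data):
        # A declared font suggests the file contains at least some real text.
        return False
    if b"/ObjStm" in data:
        # Compressed object streams (PDF 1.5+) can hide fonts from a byte scan.
        return None
    return _IMAGE.search(data) is not None
```

You would call it with the file contents, for example `looks_image_only(Path("scan.pdf").read_bytes())` after `from pathlib import Path`, where `scan.pdf` is a placeholder name. Treat the answer as a hint only; selecting the text in a viewer remains the more reliable test.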

Less common, but still found, are PDF files where the text has been converted to shapes. Again, these files tend to be on the large side, and you cannot search them for text. You need OCR to get text out of them.

The biggest complaint we see about the PDF file format concerns text extraction. People complain that it is very hard to get formatted text out of a PDF file for editing. This is because PDF was originally designed as a final display format (unlike Word documents) and did not include any document structure.

Since PDF version 1.4, though, it has been perfectly possible to include tags in the PDF file. This allows the extraction of formatted content, but only if the PDF was created with the tags included. Most PDF creation tools still leave them out. I wrote an article on my blog explaining how you can check whether the tags are present.
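As a quick first pass, you can also look for the markers of a tagged PDF in the raw bytes of the file. This is a minimal sketch in Python under some loud assumptions: it does not parse the PDF, the function names `tag_hints` and `looks_tagged` are hypothetical names chosen for this example, and it will miss the markers if the document catalog is stored inside a compressed object stream. A tagged PDF normally has a `/StructTreeRoot` entry in its catalog and a `/MarkInfo` dictionary with `/Marked true`.

```python
import re

# /Marked true lives in the MarkInfo dictionary of a tagged PDF.
_MARKED_TRUE = re.compile(rb"/Marked\s+true\b")


def tag_hints(data: bytes) -> dict:
    """Report which tagged-PDF markers are visible in the raw bytes."""
    return {
        # The structure tree root holds the tag hierarchy.
        "struct_tree_root": b"/StructTreeRoot" in data,
        # The file declares itself as tagged.
        "marked_true": _MARKED_TRUE.search(data) is not None,
    }


def looks_tagged(data: bytes) -> bool:
    """True only when both markers are visible; False proves nothing
    if the catalog is hidden inside a compressed object stream."""
    hints = tag_hints(data)
    return hints["struct_tree_root"] and hints["marked_true"]
```

A result of `True` is a reasonable sign that the tags are there. A result of `False` only means the markers were not visible to this simple scan, so check with a proper PDF tool before concluding the file is untagged.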

I often find people complain that the PDF file format is defective, when actually the features are there — they just have not been used properly. If we were all more ‘choosy’ about how our PDF files were made, a lot of issues would go away. So next time you work with PDF files, remember that not all PDF files are created equal and look beyond the immediate appearance to see how well-made they are. It will have a big impact on what use can be made of the files.


About the Author: Mark Stephens
