PDF’s Brittleness: A Lament

PDF is one of the most brittle and unforgiving formats I’ve ever worked with, in the sense that if you introduce even one stray byte into the middle of a file, you stand a 99% chance of wrecking the reading frame for the whole file (kind of like a frameshift error in DNA giving rise to ‘nonsense proteins’), totally breaking the world.

I’ve worked with brittle file formats over the years, especially in the area of data compression, and I can tell you that there are a number of interesting approaches to the problem of ‘brittleness’ in files that are sensitive to ‘spot mutations.’ Some of the work on this goes back to the 1940s. I’m talking about error detection and recovery schemes involving checksum algorithms — the stuff of Xmodem/Ymodem/Kermit/etc. — plus much subtler kinds of annealing. Many of these schemes involve adding redundancy to a file and so don’t work to the advantage of a COMPRESSION scheme (duh!!!), but the point is, you can add robustness back to a brittle file in almost any dosage you want.
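
By ‘dosage’ I mean, concretely, how finely you checksum. Here’s a minimal sketch in Python — my own illustration, not any particular protocol — of per-block framing, where block size is the dial: smaller blocks localize damage better, at the cost of more overhead.

```python
import struct
import zlib

def frame(data, block_size=4096):
    """Split raw bytes into blocks, each prefixed with its length and
    CRC-32, so a flipped byte condemns one block, not the whole file."""
    out = bytearray()
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        out += struct.pack(">II", len(block), zlib.crc32(block))
        out += block
    return bytes(out)

def unframe(framed):
    """Yield (block, ok) pairs; ok is False for a corrupted block."""
    pos = 0
    while pos < len(framed):
        length, crc = struct.unpack_from(">II", framed, pos)
        block = framed[pos + 8:pos + 8 + length]
        yield block, zlib.crc32(block) == crc
        pos += 8 + length
```

Real schemes go further and add resync markers, so that a hit to a length field can’t re-frameshift everything downstream; the point is simply that the robustness dial exists, and you choose the setting.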

Is PDF a destructive format…?

My observation with regard to PDF is this. PDF is a brittle format. You look at it sideways and it breaks. Mainly I’m talking about the absolute need to reconcile every object offset to an ‘xref’ table. Anybody who has tried to hand-edit a PDF file knows what I am talking about. If you skew an offset, you screw an offset. The reason for having this built-in brittleness is, ostensibly, performance. Table lookups are faster than walking a linked list. With a huge document, all search, navigation, update, and display performance characteristics depend on the speed of direct table lookups.
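
For anyone who hasn’t poked inside a PDF, here is roughly what that lookup machinery amounts to, as a simplified Python sketch. (It assumes a single classic ‘xref’ section; real files complicate this with incremental updates, multiple subsections, and cross-reference streams.)

```python
import re

def read_xref_offsets(path):
    """Sketch: find the xref table via the trailing 'startxref' keyword,
    then read the byte offset it promises for each in-use object."""
    with open(path, "rb") as f:
        data = f.read()

    # A PDF ends with: startxref\n<byte offset of xref table>\n%%EOF
    m = re.search(rb"startxref\s+(\d+)", data[-1024:])
    if not m:
        raise ValueError("no startxref found")
    xref_pos = int(m.group(1))

    # A subsection header is "<first object number> <entry count>",
    # followed by one fixed-width 20-byte entry per object.
    header = re.match(rb"xref\s+(\d+)\s+(\d+)\s+", data[xref_pos:])
    if not header:
        raise ValueError("xref offset does not point at an xref table")
    first, count = int(header.group(1)), int(header.group(2))
    entries_at = xref_pos + header.end()

    offsets = {}
    for i in range(count):
        entry = data[entries_at + 20 * i : entries_at + 20 * (i + 1)]
        offset, _gen, kind = entry.split()[:3]
        if kind == b"n":                      # "n" = in-use, "f" = free
            offsets[first + i] = int(offset)  # where "N 0 obj" begins
    return offsets
```

Now insert a single stray byte anywhere ahead of those objects: every stored offset points into the middle of something else, and the reader is lost. That’s the frameshift.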

But we pay a terrible price for this performance, it seems to me. PDF files are, in fact, too easily breakable. It’s a curious situation. I’ve never seen a file format this brittle that didn’t depend, somewhere, on cyclic redundancy checks (CRCs) for a check of file integrity. That is, before you do ANYTHING with the file, the first thing you do upon opening it is run a CRC calculation (which takes very little time if you do it right), and if the CRC check flunks, you pack it up and tell the user to go home right then and there; you don’t bother trying to do anything with the file, because you know it’s corrupt. (Well, you ‘know’ with a high degree of probability that it is corrupt.)

CRCs are a very strict check of file integrity, because one flipped bit in a 100-megabyte file will make the CRC show up bad. I mean, we’re talking about a very sensitive integrity sniffer here!
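
Here is the sort of gate I mean, sketched with Python’s zlib.crc32. (The expected_crc argument is hypothetical — PDF defines no stored-checksum field, which is rather the point of this lament.)

```python
import zlib

def file_crc32(path, chunk_size=1 << 16):
    """Stream a CRC-32 over the whole file; cheap even at 100 MB."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc

def open_if_intact(path, expected_crc):
    """Hypothetical gate: refuse to parse a file whose CRC flunks.
    (expected_crc would have to come from somewhere out-of-band.)"""
    if file_crc32(path) != expected_crc:
        raise IOError("CRC mismatch: file is almost certainly corrupt")
    return open(path, "rb")
```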

PDF perhaps doesn’t need that degree of integrity assurance, but by the same token, it doesn’t need to break down completely just because I introduced a stray whitespace character somewhere in the middle of an otherwise perfectly good file. That’s the kind of designed-in lack of robustness that bothers me. It’s the kind of straitjacket no file format needs, frankly.

Some suggestions and solutions

My solution would be this. No frameshift errors should ever break a PDF file. Ever. What this means is that no PDF file should carry its own hard-coded ‘xref’ table. The reading application should produce it dynamically, on the fly, at file-open time. At most, the PDF file should store a table of in-use versus defunct objects, so that the reading app can know which objects are usable (includable) for the ‘xref’ table. But as far as calculating object offsets for every object… that’s something that can and should be done at runtime, by the consuming app. Once only, at file-open. Remember, it only has to be done once. After the speed hit of that initial table-tally, you’re home free. (Or, you’re as free as you were before.)
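
And this isn’t fantasy; a recovery-minded reader can do it today by ignoring the stored table entirely. Here is a rough Python sketch of just-in-time ‘xref’ compilation — one linear scan at file-open. (Being regex-based, it can be fooled by ‘N M obj’ appearing inside a stream; a real implementation would track stream boundaries.)

```python
import re

def rebuild_xref(path):
    """Derive the xref table instead of trusting the one in the file:
    scan for 'N G obj' headers and record each object's byte offset."""
    with open(path, "rb") as f:
        data = f.read()

    offsets = {}
    for m in re.finditer(rb"(\d+)\s+(\d+)\s+obj\b", data):
        # Keep the last definition seen, mimicking incremental updates,
        # where a later copy of an object supersedes an earlier one.
        offsets[int(m.group(1))] = m.start()
    return offsets
```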

Just-in-time ‘xref’ compilation can only bring more flexibility and robustness to the format, it seems to me. It would certainly go a long way toward encouraging people to experiment with the format. Not only would shameless hackers like myself be more likely to spend more time hacking around inside files, but people who write ‘producing’ apps would be more likely, I think, to produce PDF as an output format. Think how much easier it would be to produce dynamic PDF on a server via Perl if you didn’t have to fuss with ‘xref’ tables, for instance.

