PDF’s Brittleness: A Lament

PDF is one of the most brittle and unforgiving formats I’ve ever worked with, in the

sense that if you introduce even one stray byte into the middle of a file, you stand

a 99% chance of wrecking the reading frame for the whole file (kind of like a frameshift

error in DNA giving rise to ‘nonsense proteins’), totally breaking the world.

I’ve worked with brittle file formats over the years, especially in the area of data

compression, and I can tell you that there are a number of interesting approaches to the

problem of ‘brittleness’ in files that are sensitive to ‘spot mutations.’ Some of the work

on this goes back to the 1940s. I’m talking about error detection and recovery schemes

involving checksum algorithms; the stuff of Xmodem/Ymodem/Kermit/etc. Plus much more subtle

kinds of annealing. Many of these schemes involve adding redundancy to a file and so don’t

work to the advantage of a compression scheme (duh!), but the point is, you can add

robustness back to a brittle file in almost any dosage you want.

Is PDF a destructive format…?

My observation with regard to PDF is this. PDF is a brittle format. You look at it

sideways and it breaks. Mainly I’m talking about the absolute need to reconcile every object

offset to an ‘xref’ table. Anybody who has tried to hand-edit a PDF file knows what I am

talking about. If you skew an offset, you screw an offset. The reason for having this

built-in brittleness is, ostensibly, performance. Table lookups are faster than walking a

linked list. With a huge document, all search, navigation, update, and display performance

characteristics depend on the speed of direct table lookups.
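For anyone who hasn’t stared at one, a cross-reference table looks roughly like the sketch below (a minimal, hand-made fragment, not taken from any real file). Each in-use entry is a fixed-width 10-digit byte offset pointing at the start of an object:

```
xref
0 3
0000000000 65535 f 
0000000009 00000 n 
0000000074 00000 n 
```

Because every entry is an absolute byte offset from the start of the file, inserting or deleting even one byte ahead of an object shifts that object’s true position without updating its entry — which is exactly the ‘frameshift’ failure mode described above.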

But we pay a terrible price for this performance, it seems to me. PDF files are, in fact,

too easily breakable. It’s a curious situation. I’ve never seen a file format this brittle

that didn’t depend, somewhere, on cyclic redundancy checks (CRCs) for a check of file

integrity. That is, before you do ANYTHING with the file, the first thing you do upon

opening it is run a CRC calculation (which takes very little time if you do it right), and

if the CRC check flunks, you pack it up and tell the user to go home right then and there;

you don’t bother trying to do anything with the file, because you know it’s corrupt. (Well,

you ‘know’ with a high degree of probability that it is corrupt.)

CRCs are a very strict check of file integrity, because one flipped bit in a 100-megabyte

file will make the CRC show up bad. I mean, we’re talking about a very sensitive

integrity sniffer here!
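To see just how sensitive a CRC is, here’s a minimal sketch in Python using the standard library’s CRC-32. The helper name `crc_ok`, the dummy data, and the choice of which bit to flip are all my own illustration, not anything from a real file format:

```python
import zlib

def crc_ok(data: bytes, expected: int) -> bool:
    # zlib.crc32 computes the CRC-32 checksum of the byte string
    return zlib.crc32(data) == expected

# Pretend this is the file's payload, with its CRC stored at write time
original = b"A" * 1000
stored_crc = zlib.crc32(original)

# Flip a single bit deep inside the "file"
corrupted = bytearray(original)
corrupted[500] ^= 0x01
corrupted = bytes(corrupted)

print(crc_ok(original, stored_crc))   # True
print(crc_ok(corrupted, stored_crc))  # False
```

One flipped bit out of 8,000 and the check fails — that’s the kind of sensitivity being described here, and the check itself costs a single linear pass over the file.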

PDF perhaps doesn’t need that degree of integrity assurance, but by the same token, it

doesn’t need to break down completely just because I introduced a stray whitespace character

somewhere in the middle of an otherwise perfectly good file. That’s the kind of designed-in

lack of robustness that bothers me. It’s the kind of straitjacket no file format needs.


Some suggestions and solutions

My solution would be this. No frameshift errors should ever break a PDF file. Ever. What

this means is that no PDF file should carry its own hard-coded ‘xref’ table. The reading

application should produce it dynamically, on the fly, at file-open time. At most, the

PDF file should store a table of in-use versus defunct objects, so that the reading app can

know which objects are usable (includable) for the ‘xref’ table. But as far as calculating

object offsets for every object… that’s something that can and should be done at

runtime, by the consuming app — once only, at file-open. After the speed hit of that initial table-tally, you’re home free. (Or at least as free as you were before.)
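A just-in-time ‘xref’ build could be as simple as one scan for object headers. Here’s a rough Python sketch of the idea — the function name `rebuild_xref`, the regex, and the toy two-object body are my own illustration, and a production version would need to handle streams, comments, and free objects:

```python
import re

def rebuild_xref(pdf_bytes: bytes) -> dict[int, int]:
    """Scan for 'N G obj' headers and record each object's byte offset."""
    offsets = {}
    for m in re.finditer(rb"(\d+)\s+\d+\s+obj\b", pdf_bytes):
        offsets[int(m.group(1))] = m.start()
    return offsets

# A toy two-object "PDF" body (not a complete, valid file)
body = (b"%PDF-1.4\n"
        b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
        b"2 0 obj\n<< /Type /Pages >>\nendobj\n")

print(rebuild_xref(body))  # {1: 9, 2: 45}
```

The point is that the offsets fall out of a single linear pass at file-open time, so a stray byte merely shifts the computed table instead of invalidating a hard-coded one.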

Just-in-time ‘xref’ compilation can only bring more flexibility and robustness to the

format, it seems to me. It would certainly go a long way toward encouraging people to

experiment with the format. Not only would shameless hackers like myself be more likely to

spend more time hacking around inside files, but people who write ‘producing’ apps would be more likely, I think, to produce PDF as an output format. Think how much easier it would be to produce dynamic PDF on a server via Perl if you didn’t have to fuss with ‘xref’ tables, for instance.

What’s your view? Talkback at the Planet PDF Forum.


About the Author: Kas Thomas
