Navigating the Internal Structure of a PDF Document

A PDF Document has many layers of abstraction, and you can look at it from several perspectives, each with its own advantages and disadvantages. At the lowest level, the PDF File contains the raw document data. Next up, the COS Layer organizes this data into a tree of simple objects. At the PD Layer, these simple objects are put together to implement useful intermediate-level structures like Fonts and Images. These are in turn organized into higher-level constructs like Annotations and Pages. Some of these objects are also used to impose logical structure, such as paragraphs and article threads. And there are more layers still.

Each of these layers of abstraction has its own independent set of rules. For example, a file can be perfectly legal at the file format level and still not contain any useful objects. Likewise, the COS Object Tree may contain many objects that do not contribute to the document display, or that are completely unintelligible to Acrobat, and still be a legal object tree.

Knowing how to navigate these structures is essential to any PDF-related development effort. But what does a real document look like at the COS Object level? What are these objects, and what is really necessary to make a PDF Document? In the following text the structure of a real PDF Document is laid bare.

PDF File Structure:

Don’t let this next section discourage you. It’s an introduction to the file format. The infinitely more understandable PDF object structure follows.

The PDF File Format is text with some binary data mixed in. If you open a PDF in a text editor you’ll see the raw objects that define the structure and content of the document. Explicit object definitions are prefixed with some text that looks like ‘12 0 obj’. The number 12 is the object number, and the 0 that follows it is the generation number; together they identify the object. An object defined this way is called indirect, since other objects can refer to it by its number. You will also see objects without this prefix. These are called direct objects and are always contained inside other objects. A container object refers to an indirect object with the syntax ‘12 0 R’, which here includes the object defined with ‘12 0 obj’. There are only 8 low-level, or COS, object types, not counting the special null object.
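To make the numbering concrete, here is a minimal Python sketch that scans a small hand-written fragment for ‘N G obj’ definitions and ‘N G R’ references, where N is the object number and G is the second number in the prefix (the generation number, usually 0). The fragment, the two regular expressions, and the helper names find_definitions and find_references are illustrative assumptions, not part of any PDF library. A plain pattern match like this would also fire on look-alike bytes inside binary stream data, so treat it as a way to explore a file, not as a parser.

```python
import re

# A tiny hand-written fragment in PDF syntax (hypothetical, not from a real file).
SAMPLE = b"""12 0 obj
<< /Type /Page /Parent 3 0 R /Contents 20 0 R >>
endobj
20 0 obj
<< /Length 5 >>
stream
hello
endstream
endobj
"""

# "12 0 obj": object number, generation number, then the keyword obj.
DEFINITION = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
# "12 0 R": object number, generation number, then the keyword R.
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_definitions(data: bytes) -> list[tuple[int, int]]:
    """Return (object number, generation) for every 'N G obj' prefix found."""
    return [(int(num), int(gen)) for num, gen in DEFINITION.findall(data)]


def find_references(data: bytes) -> list[tuple[int, int]]:
    """Return (object number, generation) for every 'N G R' reference found."""
    return [(int(num), int(gen)) for num, gen in REFERENCE.findall(data)]


definitions = find_definitions(SAMPLE)
references = find_references(SAMPLE)
print(definitions)  # [(12, 0), (20, 0)]
print(references)   # [(3, 0), (20, 0)]
```

In this fragment, object 12 refers to objects 3 and 20, but only 12 and 20 are defined, so a complete file would have to define object 3 somewhere else.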

The first 5 are scalar (single value) types:

  1. Integer – in the file as a number without a decimal point.
  2. Boolean – in the file as the text ‘true’ or ‘false’.
  3. Real Number – in the file as a number with a decimal point.
  4. Name – in the file as ‘/text’, i.e. a forward slash, ‘/’, followed by some text, with no white space or delimiter characters (such as parentheses or brackets) allowed.
  5. String – in the file as either ‘(…characters…)’ or ‘<…hexadecimal character codes…>’.

The next 3 are container types:

  1. Dictionary – in the file as ‘<<…other objects…>>’. Dictionary entries are always in pairs, a Name Object followed by any other object type.
  2. Array – in the file as ‘[…other objects…]’. A list of objects with no delimiters between them, separated by white space only where necessary.
  3. Stream – in the file as ‘20 0 obj<<…stream attribute objs…>>stream…binary data…endstream’. This is the most complex type. It’s actually a Dictionary Object mated with a string of bytes. The Dictionary contains information necessary for accessing the data in the string of bytes. Streams are always indirect objects, so their definition always begins with an object number prefix like ‘20 0 obj’.
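Because each of these types has a distinctive look in the file, a few checks on the opening characters are enough to tell them apart. The following Python sketch does that for one object at a time. The function name classify_cos_object and its patterns are assumptions made for illustration. It does not validate the object, and it ignores comments, escapes, and nesting, so it is a reading aid and not a parser.

```python
import re

# Optional "N G obj" prefix and "endobj" suffix around an indirect object.
OBJ_PREFIX = re.compile(r"^\d+\s+\d+\s+obj\s*")
OBJ_SUFFIX = re.compile(r"\s*endobj$")


def classify_cos_object(text: str) -> str:
    """Guess the COS type of a single object from the way it is written."""
    body = OBJ_PREFIX.sub("", text.strip())
    body = OBJ_SUFFIX.sub("", body)

    # Scalar types.
    if body in ("true", "false"):
        return "Boolean"
    if re.fullmatch(r"[+-]?\d+", body):
        return "Integer"
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)", body):
        return "Real"
    if body.startswith("/"):
        return "Name"

    # "<<" must be tested before "<", or dictionaries would look like hex strings.
    if body.startswith("<<"):
        # A stream is a dictionary followed by the keyword "stream".
        if re.search(r">>\s*stream\b", body):
            return "Stream"
        return "Dictionary"
    if body.startswith("(") or body.startswith("<"):
        return "String"
    if body.startswith("["):
        return "Array"
    return "Unknown"


samples = [
    "42",
    "3.14",
    "true",
    "/Type",
    "(Hello)",
    "<48656C6C6F>",
    "[1 2 /Name (text)]",
    "<< /Type /Page >>",
    "20 0 obj<< /Length 5 >>stream\nhello\nendstream",
]
for sample in samples:
    print(classify_cos_object(sample), "<-", repr(sample))
```

The order of the checks matters more than the checks themselves: the Dictionary test has to come before the hexadecimal String test, and the Stream test has to look past the Dictionary that every Stream begins with.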

About the Author: Thom Parker
