PDF Anatomy 101

OK class, today we are going to dissect a PDF document and explore what it’s made of.

For an in-depth look at the internal workings of PDF, check out the PDF Reference Manual.

The basic chemistry

A PDF document is made up of three types of objects, or classes of objects. They are:

  • Document objects.
  • Page objects.
  • Content objects.

Cutting things open. Scalpel please!

A document object must contain at least a cross-reference table (xref) and one page object. Optionally, it can also contain such things as named destinations, document info, hidden templates, thumbnails, bookmarks and more.

A page object usually contains at least one content object (there can be more than one). It can also hold page information (cropping, logical page numbers and page rotation), links, article threads, annotations (including sound and file annotations), form fields, digital signatures, page actions, etc.

A content object contains only the marking operators and resources (e.g. fonts, images) that make up the background of the page.

Putting things under the microscope

This object-based construction is what makes certain operations possible. Think of it as containers that fit inside one another in a particular order: a content object is contained within a page object, and a page object (with all the content objects it contains) is contained within a document object. A document object also has information of its own apart from its page and content objects. There are many advantages to this approach.
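To make the nesting concrete, here is a minimal sketch of the three containers as Python dataclasses. This is a toy model, not how a PDF file is actually stored, and the class and field names (ContentObject, PageObject, DocumentObject and so on) are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ContentObject:
    # Marking operators plus the resources (fonts, images) they rely on.
    operators: list = field(default_factory=list)
    resources: list = field(default_factory=list)


@dataclass
class PageObject:
    # A page holds its content objects plus page-level extras.
    contents: list = field(default_factory=list)
    rotation: int = 0
    form_fields: list = field(default_factory=list)
    article_threads: list = field(default_factory=list)


@dataclass
class DocumentObject:
    # A document holds its pages plus information of its own.
    pages: list = field(default_factory=list)
    bookmarks: list = field(default_factory=list)
    info: dict = field(default_factory=dict)


# Build the containers from the inside out: content -> page -> document.
content = ContentObject(operators=["BT", "ET"], resources=["Helvetica"])
page = PageObject(contents=[content], form_fields=["name"])
doc = DocumentObject(pages=[page], info={"Title": "Demo"})
```

Reaching a font from the top means walking down through each container in turn, for example doc.pages[0].contents[0].resources.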

For example, when you replace a PDF page, you are removing all the content objects contained within that page object. Other things that apply only to that page, such as cropping and rotation info, are removed as well.

You are also making modifications to some of the information within the document object, namely the cross-reference table (xref), which is essentially a map of where everything in a PDF file is located, including the replacement page.
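Why does the xref need touching at all? A rough sketch helps. Here the table is simplified to a plain mapping from object number to the byte offset where that object starts. The build_xref helper and the sample objects are hypothetical, and a real xref carries more detail than this.

```python
def build_xref(objects, header=b"%PDF-1.4\n"):
    """Map each object number to the byte offset where it starts."""
    xref = {}
    offset = len(header)  # the first object begins right after the header
    for number in sorted(objects):
        xref[number] = offset
        offset += len(objects[number])
    return xref


objects = {
    1: b"1 0 obj\n<< /Type /Page >>\nendobj\n",
    2: b"2 0 obj\n<< /Type /Catalog >>\nendobj\n",
}
before = build_xref(objects)

# Swap in a longer object 1: everything stored after it moves,
# so the map has to be rebuilt.
objects[1] = b"1 0 obj\n<< /Type /Page /Rotate 90 >>\nendobj\n"
after = build_xref(objects)
```

Object 1 still starts at the same place, but object 2 now sits further into the file, which is why the map changes whenever a page is swapped.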

However, the rest of the page object remains. This allows things such as form fields and article threads (as well as other enhancements that can be so much work to create) to remain and be used by the replacement page. This can be a real time saver.
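Here is a small sketch of what replacing a page amounts to, following the description above. It is a toy model only: the Page class and the replace_page function are hypothetical names, not part of any real PDF library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    contents: list = field(default_factory=list)
    rotation: int = 0
    crop_box: Optional[tuple] = None
    form_fields: list = field(default_factory=list)
    article_threads: list = field(default_factory=list)


def replace_page(pages, index, incoming):
    """Swap the contents, rotation and crop box of pages[index] in place."""
    target = pages[index]
    target.contents = list(incoming.contents)
    target.rotation = incoming.rotation
    target.crop_box = incoming.crop_box
    # form_fields and article_threads are deliberately left alone,
    # so the replacement page inherits them.


pages = [Page(contents=["old artwork"], rotation=90,
              form_fields=["name"], article_threads=["story-1"])]
replace_page(pages, 0, Page(contents=["new artwork"]))
```

After the call, pages[0] shows the new artwork with the incoming rotation, while the form field and the article thread are still attached.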

Now let’s compare this with deleting a PDF page. Deleting a page actually deletes the page object. This means that everything the page object contains is also deleted. Thus, article threads, form fields and more are removed and can no longer be used by another page, and bookmarks for that page become non-functional. Depending on your situation, deleting a page and all that comes with it can also be a big time saver.
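Deleting can be sketched the same way. Again this is a toy model with hypothetical names (Page, Document, delete_page, bookmark_works). The point is only that the whole page object goes, and any bookmark aimed at it is left pointing at nothing.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # pages are compared by identity, not by value
class Page:
    label: str
    form_fields: list = field(default_factory=list)


@dataclass
class Document:
    pages: list = field(default_factory=list)
    bookmarks: dict = field(default_factory=dict)  # title -> target Page


def delete_page(doc, index):
    """Remove the whole page object; its form fields leave with it."""
    return doc.pages.pop(index)


def bookmark_works(doc, title):
    """A bookmark works only if its target page is still in the document."""
    target = doc.bookmarks[title]
    return any(page is target for page in doc.pages)


intro = Page("intro", form_fields=["name"])
body = Page("body")
doc = Document(pages=[intro, body], bookmarks={"Start": intro, "Body": body})

removed = delete_page(doc, 0)
```

The "Start" bookmark still exists in doc.bookmarks, but bookmark_works(doc, "Start") now reports that it no longer leads anywhere, and the "name" form field is gone along with the page that held it.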

Classification and documentation

Now that we’ve discovered the mysteries of PDF construction, it’s time to bring it all together in an orderly fashion and to make a permanent record of it. Guess what? It’s your lucky day, since I’ve already taken care of this final step for you. Use the chart that I’ve designed. You can even print it any size you want.

Anatomy of a PDF Object (25K)


About the Author: Bryan Guignard
