Certainly XML (eXtensible
Markup Language) has caught on. More and more businesses see the value of using
XML. However, because XML is such a flexible data format, XML’s use varies widely.
In general, companies produce XML for interchange (to augment, or even replace
EDI or Electronic Data Interchange), as a standard data format for integration
with databases, or as a default file format from which documents can be constructed
and published in a variety of formats. Some might see some conflict between
PDF and XML, as both provide a standard way to present and exchange information.
In combination with the eXtensible Stylesheet Language (XSL), which plays a role
similar to HTML’s cascading style sheets, XML data can be precisely displayed in
a browser that supports XML and XSL.
Where do XML and PDF files meet?
With the release
of PDF version 1.3 (with Acrobat 4.0), Adobe provided a means to embed XML-like
data into PDF files. The structure of the document can be extracted and rendered
as an XML, HTML or other structured document. An important point: the structure,
called a structure tree, is not a true XML document. It must be processed and
marked up further before it can truly be called an XML document. So PDF supports
the embedding of structured data into a PDF document. Let’s talk about:
- how we extract
structure from a PDF document;
- how we add structure
to a PDF document; and
- why we would
do any of this.
1) How to extract structure from a PDF file
We’ll begin with
how to extract data from PDF files, as this question is far more common than
how to add structure. To extract a structure tree from a PDF file, it must have
an embedded structure tree within it. Not all PDF files have a structure tree.
The embedded, XML-like structure tree is called logical structure. Logical structure
is independent of the PDF file’s representation, or how images, text and other
design elements appear when viewed. We’ll cover how to add the structure next,
but it’s important to know that there are really two ways to extract structure
from a document. A document’s structure is either defined explicitly using markup
and embedded in the PDF file (more on this later), or implied by the
document’s layout, that is, the way the document flows on the page. For instance,
you can determine the structure of a newspaper quite simply. There’s a headline,
sub-heads, a byline (or slug), the first paragraph, second paragraph, pictures
(inset or otherwise), a callout, etc. These items occur and are displayed in a
predictable order within the document, based on how Westerners read.
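To make the idea of implied structure concrete, here is a minimal Python sketch. It assumes some earlier step has already pulled text blocks out of a page along with their font size and position; the `TextBlock` type, the sample page, and the rules themselves are hypothetical illustrations, not how any commercial extraction tool works.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A run of text from a page (hypothetical input format)."""
    text: str
    font_size: float  # in points
    top: float        # distance from the top of the page, in points
    left: float       # distance from the left edge, in points


def infer_roles(blocks):
    """Label each block with a role using simple layout heuristics."""
    # Assume Western reading order: top to bottom, then left to right.
    ordered = sorted(blocks, key=lambda b: (b.top, b.left))
    sizes = [b.font_size for b in ordered]
    # Treat the most common font size as body text.
    body_size = max(set(sizes), key=sizes.count)
    largest = max(sizes)
    labelled = []
    for block in ordered:
        if block.font_size == largest and largest > body_size:
            role = "headline"
        elif block.font_size > body_size:
            role = "subhead"
        elif block.text.lower().startswith("by "):
            role = "byline"
        else:
            role = "paragraph"
        labelled.append((role, block.text))
    return labelled


page = [
    TextBlock("What happens next", 14.0, 300.0, 36.0),
    TextBlock("City council approves new park", 28.0, 40.0, 36.0),
    TextBlock("By A. Reporter", 10.0, 80.0, 36.0),
    TextBlock("The council voted on Tuesday...", 10.0, 100.0, 36.0),
    TextBlock("Construction is expected to begin...", 10.0, 320.0, 36.0),
]

roles = infer_roles(page)
for role, text in roles:
    print(f"{role}: {text}")
```

Rules this simple break down quickly on multi-column or heavily designed pages, which is one reason the results of heuristic extraction vary so much from document to document.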
What methods are
there to extract structure from PDF files? Let’s focus on extracting content
from a PDF file and forming a valid XML file. Here, as mentioned, there are two
approaches and several solutions.
If you created
this PDF file using QuarkXpress or Adobe Illustrator, you probably have a document
that does not have any structure in its data. Without explicitly defining your
data as you do with an XML or HTML document, the structure of the data must
be implied. A few companies offer services and software that extract XML, HTML,
or both from PDF files (some examples are Iceni, Televisual, and Texterity, all
of which are featured in the tools section on PlanetPDF.com).
These services or software analyze the page’s layout using heuristics and build
structured data from the PDF file. The result is the XML file we were looking for. However,
depending on the nature of the original PDF document, and by nature I mean the
way it was organized visually on the page, the result may be a bit disappointing.
The structure may not be correct, putting some segments of the page in the wrong
order and including elements such as page numbers, footers, etc. that should
be excluded. Regardless of the quality of the results, you’ll get a data file
you can edit by hand more quickly than copying and pasting from the PDF file
yourself. Some time ago I built an application that relied heavily on Iceni’s
Gemini technology. The application extracted content from newspaper real estate
display ads (the ones with pictures of all the houses). The system exported
thousands of advertisements into an XML file that was then parsed and imported
into a relational database. The process to that point was totally automated
and saved hundreds of hours each day compared to other proposed solutions.
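The last step of a pipeline like that one, parsing the extracted XML and loading it into a relational database, might look roughly like the following Python sketch. The element names, the table layout, and the sample data are all assumptions made for illustration; they are not the format produced by Gemini or any other extraction tool.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML, standing in for the output of an extraction tool.
SAMPLE_XML = """<ads>
  <ad id="1"><address>12 Elm St</address><price>250000</price></ad>
  <ad id="2"><address>98 Oak Ave</address><price>310000</price></ad>
</ads>"""


def load_ads(xml_text, conn):
    """Parse the ad XML and insert one database row per <ad> element."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ads "
        "(id INTEGER PRIMARY KEY, address TEXT, price INTEGER)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (
            int(ad.get("id")),
            ad.findtext("address", default=""),
            int(ad.findtext("price", default="0")),
        )
        for ad in root.iter("ad")
    ]
    conn.executemany(
        "INSERT INTO ads (id, address, price) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory database for the example
count = load_ads(SAMPLE_XML, conn)
print(count, "ads imported")
```

A production version would also need to validate the XML and handle missing or malformed fields, since heuristic extraction does not guarantee clean data.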
The second method,
which applies only if the PDF file includes a structure tree, enables direct
access to the data embedded within the PDF file. Within PDF 1.3, logical structure
is described in one or more structure trees. This access is accomplished via
the Acrobat API’s PDSEdit methods. As Adobe’s API documentation describes, PDSEdit
lets you navigate the structure tree, search within the structure tree,
and bookmark sections or specific content. Here’s where these structure trees
take on XML-like features. You could, for instance, search for words that are
contained within the structure tree’s "title" element. (Here, I’m using "element"
as you might when describing an XML file). Clearly, there’s some value to this.
You could bookmark your PDF document based on its inherent structure (headings,
figures, etc.) rather than manually adding bookmarks based on selected page
elements. Likewise, you could scan the document’s structure and extract headlines
for use in a database.
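The kinds of operations described above (walking the tree, searching within one type of element, pulling out headings, and marking the result up as XML) can be sketched in plain Python. This is only a model of the idea: the `StructElem` class and its element names are hypothetical, and none of this is the PDSEdit API, which is a C interface documented by Adobe.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class StructElem:
    """One node of a simplified, hypothetical structure tree."""
    kind: str                     # e.g. "Document", "Title", "H1", "P"
    text: str = ""
    kids: list = field(default_factory=list)


def walk(elem):
    """Yield every element in the tree in document order."""
    yield elem
    for kid in elem.kids:
        yield from walk(kid)


def find_text(root, kind, word):
    """Return the text of elements of one kind that contain a word."""
    return [
        e.text for e in walk(root)
        if e.kind == kind and word.lower() in e.text.lower()
    ]


def to_xml(elem):
    """Convert the tree into an ElementTree element, one tag per node."""
    node = ET.Element(elem.kind)
    if elem.text:
        node.text = elem.text
    for kid in elem.kids:
        node.append(to_xml(kid))
    return node


tree = StructElem("Document", kids=[
    StructElem("Title", "XML and PDF"),
    StructElem("H1", "Extracting structure"),
    StructElem("P", "Not all PDF files have a structure tree."),
    StructElem("H1", "Adding structure"),
])

headings = [e.text for e in walk(tree) if e.kind == "H1"]
xml_text = ET.tostring(to_xml(tree), encoding="unicode")
print(headings)
print(xml_text)
```

The list of headings is what you would feed into bookmarks or a database, and the serialized output shows the extra markup step needed before a structure tree becomes a real XML document.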
Should you want
to build a custom application that extracts content from PDF files, refer to
the Acrobat Core API Overview (http://partners.adobe.com/asn
2) How to add structure to a PDF file
With some understanding
of how to get structured data from PDF files, you may wish to change the way
in which you create your PDF files to get better results when the data is extracted.
The most effective way to extract well-formed, structured data from your PDF
files is to embed the data.
The PDF file then
becomes a container for the structured data and the look and feel of the document
(as PDF files are commonly used).
Adding structure to a PDF file is accomplished with one of two methods:
- adding the structure through PDF markup; and
- adding the structure
programmatically, using the Acrobat API.
Remember that the
logical structure of a document is independent of, yet related to, the page structure;
and that the structured data does not impact how images, text boxes and other
page elements appear in a PDF page.
1) To add structured
data to your PDF when the PDF is created, use pdfmark. Pdfmarks are PostScript
extensions that generate a variety of advanced Acrobat objects including links,
bookmarks, annotations, logical structure, etc. The pdfmarks are added to the
PostScript language code when the authoring application prints the document.
When the Acrobat Distiller application creates the PDF file from the PostScript
code, it generates the structure information added via the pdfmarks. Adding
the structure requires an authoring application that permits the creation of
pdfmarks. More information about pdfmarks can be found in the pdfmark reference
manual (http://partners.adobe.com/asn/developer/acrosdk/docs.html).
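To give a feel for what pdfmarks look like, here is a small Python sketch that generates bookmark pdfmarks from a list of headings. It deliberately covers only bookmarks (the `/OUT` pdfmark), not logical structure, whose pdfmark operators are described in the reference manual. The helper names and the sample headings are hypothetical.

```python
def ps_string(text):
    """Escape text for use as a PostScript literal string."""
    escaped = (
        text.replace("\\", "\\\\")
            .replace("(", "\\(")
            .replace(")", "\\)")
    )
    return f"({escaped})"


def bookmark_pdfmarks(headings):
    """Build one bookmark (/OUT) pdfmark line per (title, page) pair."""
    lines = []
    for title, page in headings:
        lines.append(f"[ /Title {ps_string(title)} /Page {page} /OUT pdfmark")
    return "\n".join(lines)


marks = bookmark_pdfmarks([
    ("Introduction", 1),
    ("Extracting structure (part 1)", 3),
])
print(marks)
```

Lines like these would be added to the PostScript code before it is distilled; how they get there depends on the authoring application.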
Note that when
you use Acrobat 4.0’s Web Page capture, the logical data structure of the HTML
document that is converted to a PDF file can be preserved. To preserve the structure
of captured Web pages:
- Select File
> Open Web Page…
- In the Open
Web Page Dialog box, click "Conversion Settings…".
- In the Conversion
Settings dialog box, select the Add PDF Structure checkbox.
- When opening
a Web page, the HTML structure will be added to the PDF file.
2) As an alternative,
the Acrobat API provides a means to add structured data as one or more structure
trees. So, if you have a PDF file that you want to add structured
data to (it need not correspond to the layout of the document), you can do so
by using the Acrobat API. Further discussion of the API and the various calls
used to create structure is quite technical; refer to the Acrobat Core API
Overview (http://partners.adobe.com/asn/developer/acrosdk/docs.html).
3) Why would we want to store structured data within PDF files?
From the numerous
emails I received related to this topic, the interest in storing the logical
structure within PDF files is primarily for repurposing the content. Publishers
spend a large percentage of their time and money finding ways to convert one
document’s file type into another, and more time and money actually doing the
file conversion. Clearly, one benefit to the logical structure of documents being
maintained within PDF files is the combination of look and feel with structured
data. If designers using QuarkXpress or InDesign and other desktop publishing
or traditional design tools could publish electronically to a format that both
maintained the integrity of their design (as PDF does) and the data structure
of those elements (as XML does), then we may have found the perfect file format.
Well, that sounds good, but there is more to it. Remember that XML, combined with
the XSL presentation style sheet language, promises to offer presentation capabilities
similar to those of PDF, though we’re not quite there yet. The advantage of XML (with XSL) is the
ability, from a publisher’s perspective, to separate the data (or content) from
the presentation (or look and feel). PDF does a very good job at preserving
the interrelationship of these two elements, and this control over maintaining
look and feel, as well as offering portability, are two of many reasons why PDF has
proven to be an excellent file format for publishers. Now if we could represent
the data of our documents in a PDF file’s structure tree and render its look
and feel using a PDF file’s unique page description language, then we’d have
something (and it would look a lot like what’s been proposed with the XSL standard).
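The separation of content from presentation can be illustrated with a short Python sketch in which one XML source is rendered two different ways. Plain Python functions stand in for XSL style sheets here, and the element names and sample article are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source document: content only, no presentation.
ARTICLE = """<article>
  <title>XML and PDF</title>
  <para>PDF preserves look and feel.</para>
  <para>XML preserves structure.</para>
</article>"""


def to_html(root):
    """Render the article as a simple HTML fragment."""
    parts = [f"<h1>{escape(root.findtext('title', default=''))}</h1>"]
    parts += [f"<p>{escape(p.text or '')}</p>" for p in root.findall("para")]
    return "\n".join(parts)


def to_text(root):
    """Render the same article as plain text with an underlined title."""
    title = root.findtext("title", default="")
    lines = [title.upper(), "=" * len(title)]
    lines += [p.text or "" for p in root.findall("para")]
    return "\n".join(lines)


root = ET.fromstring(ARTICLE)
html_out = to_html(root)
text_out = to_text(root)
print(html_out)
print(text_out)
```

The content is written once and each rendering function decides how it looks, which is the same division of labor XSL proposes for XML.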
As you consider
how to maintain the integrity of data for future use, the best format is a structured
format and there is no better format than XML. Alternatively, you may have requirements
to maintain the branding image or look and feel of your documents; here PDF
is an excellent choice. How can you get the maximum value from what both XML
and PDF offer? One solution could be to embed XML-like structured data in all
your PDF files, and PDF 1.3 can support this. An alternative, and a method I’ve
long advocated since my Adobe Press book Internet Publishing with Acrobat
(Adobe Press, 1996), is a concurrent publishing model, where your primary
data is stored as XML, and this data is then used within your design application.
Careful though, as many designers want to modify the data you provide for better
fit, or they simply take artistic license with the data (or copy). You’ll need
to assess the applicability and how well you can enforce a concurrent publishing
model. To download the chapter from my book on the concurrent publishing model,
between PDF files and XML is technically possible; however, applications that
make use of this technology have yet to appear with any frequency. Certainly e-businesses
see the value of XML. As they begin considering the impact of structured data
when assessing their marketing materials (and other documents often published
as PDF files), the popularity of embedding logical structure within PDF files
may increase. In the short term, the popularity of PDF files is not likely to
decrease as XML gains acceptance. Instead, publishers who wish to repurpose
their PDF files as structured file formats, such as HTML or XML, will make use
of the conversion or extraction tools available that produce structured content.
And there the relationship between XML and PDF may remain, much as the relationship
between PDF and HTML stands today.