Kevin Newman is a Senior Developer at Debenu and first cut his teeth with PDF back in the late 1990s, when he began work on what would one day grow into Quick PDF Library.
The PDF specification provides many options for the display of textual content and the related extraction of the text content. In this article I will try to highlight the key areas and terms that you will encounter when working under the hood with fonts in PDF files. Key terms that you should take note of are in bold.
PDF text definitions
Each block of text in a PDF document consists of four sets of data.
- The encoded characters, which are sequences of bytes that represent the individual character codes that make up the text
- The font data, which is a group of glyphs (character visualizations), each accessed by a unique number called a Glyph ID
- A map that links the encoded character codes to Glyph IDs
- A map that links the character codes to Unicode values. This map is not needed when displaying the PDF but is required to allow the user to extract text content from the document (for example when selecting text and copying it to the clipboard to be pasted into another application).
Multiple blocks of encoded characters can be linked to the same maps and font data.
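As a rough illustration of how these four pieces relate, the sketch below models them as plain Python objects. The class and field names (FontResource, TextBlock, code_to_gid, code_to_unicode) are hypothetical and chosen only for this example; they are not PDF object names, and the glyph descriptions are placeholder strings.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FontResource:
    """Hypothetical container for the font data and the two maps."""
    glyphs: Dict[int, str]           # Glyph ID -> glyph description (placeholder)
    code_to_gid: Dict[int, int]      # character code -> Glyph ID (needed for display)
    code_to_unicode: Dict[int, str]  # character code -> Unicode (needed for extraction)


@dataclass
class TextBlock:
    """A run of encoded character codes that points at a shared font."""
    encoded: bytes
    font: FontResource


def glyph_ids(block: TextBlock) -> List[int]:
    # Display path: each single-byte code is looked up in the code -> Glyph ID map.
    return [block.font.code_to_gid[code] for code in block.encoded]


def extract_text(block: TextBlock) -> str:
    # Extraction path: codes without a Unicode entry become U+FFFD.
    return "".join(block.font.code_to_unicode.get(code, "\ufffd")
                   for code in block.encoded)


font = FontResource(
    glyphs={10: "outline of H", 11: "outline of i"},
    code_to_gid={1: 10, 2: 11},
    code_to_unicode={1: "H", 2: "i"},
)
# Two blocks of encoded characters sharing the same maps and font data.
first = TextBlock(b"\x01\x02", font)
second = TextBlock(b"\x02\x02", font)
```

Note that display only needs code_to_gid, while copying text to the clipboard only needs code_to_unicode, which is why a PDF can render correctly and still produce garbage when text is extracted.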
The font data can be stored in a number of possible formats:
- Adobe Type 1 Font Format, known as a Type 1 font
- Adobe Compact Font Format, known as a CFF font
- A standard TrueType font
- An OpenType font which has a similar structure to a TrueType font but allows the glyph outline descriptions to be either TrueType or CFF format.
- A Type 3 font, which uses PDF drawing commands to define the glyph outlines. This font format allows greater flexibility over the appearance of the glyphs but does not include a hinting mechanism, resulting in reduced visual quality for small text or low resolutions.
Font data can be embedded into the PDF. This allows the PDF to be viewed in exactly the same way on any computer. If the font data is not embedded, the font is simply specified by name.
Non-embedded fonts force the PDF viewer application to look on the user’s computer for a similar font. This may result in differences in the display of the text when viewed on different computers with different installed fonts.
Embedded font data can consist of a complete font file or it can be a font subset, which contains the font data for only some of the glyphs. For example, if a PDF consists mainly of English text with a small piece of Japanese text, the font for the Japanese text could be subsetted to include only the glyphs actually used, discarding the font data for the unused glyphs. This dramatically reduces the size of the embedded font data, resulting in a smaller file size for the PDF.
Fonts can specify either simple or composite encoding.
Simple encoding uses 8-bit character codes mapped to a character set. This means that a maximum of 256 different character codes can be used with the font.
Predefined character sets can be used directly or adjusted using a differences array. The predefined character set encodings are:
- StandardEncoding is the default character set used for Latin-text Type 1 font programs. This is a direct mapping to the order of the glyphs in the font.
- MacRomanEncoding is the standard 8-bit character set used in Western versions of the Apple Mac OS operating system. Type 1 and TrueType fonts usually contain an internal cmap table to map character codes to Glyph IDs.
- WinAnsiEncoding is the standard Windows-1252 character set used in Western versions of the Microsoft Windows operating system. Type 1 and TrueType fonts usually contain an internal cmap table to map character codes to Glyph IDs.
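The way a differences array adjusts a predefined character set can be sketched as follows. This assumes the array layout used in PDF encoding dictionaries: an integer code, followed by one or more glyph names that are assigned to that code and the codes after it. The function name apply_differences and the tiny base table are illustrative only; a real base encoding has up to 256 entries.

```python
from typing import Dict, List, Union


def apply_differences(base: Dict[int, str],
                      differences: List[Union[int, str]]) -> Dict[int, str]:
    """Return a copy of `base` (code -> glyph name) with the overrides applied."""
    encoding = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            # An integer restarts the run at this character code.
            code = item
        else:
            if code is None:
                raise ValueError("glyph name found before any starting code")
            # A name is assigned to the current code, then the code advances.
            encoding[code] = item
            code += 1
    return encoding


# Illustrative three-entry base table and a differences array with two runs.
base = {65: "A", 66: "B", 67: "C"}
adjusted = apply_differences(base, [66, "Euro", "bullet", 200, "Aacute"])
```

In this example codes 66 and 67 are replaced, code 200 is added, and code 65 keeps its predefined meaning.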
The alternative is composite encoding which uses a two step process to encode characters:
- In the first step, the defined encoding format is used to translate a character code to a character identifier or CID. The CID value is used to look up font metrics for the character (such as the vertical and horizontal width of the character).
- The second step uses a setting from the font definition to translate the CID to a Glyph ID allowing the character to be displayed using the glyph description in the font data.
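The two steps above can be sketched as a pair of dictionary lookups. The names lookup_glyph, cmap, widths and cid_to_gid are hypothetical, the tables are made up for the example, and the fallback width of 1000 units is an assumption of this sketch rather than something stated in this article.

```python
from typing import Dict, Tuple


def lookup_glyph(code: int,
                 cmap: Dict[int, int],
                 widths: Dict[int, int],
                 cid_to_gid: Dict[int, int],
                 default_width: int = 1000) -> Tuple[int, int, int]:
    """Resolve one character code to (CID, width, Glyph ID)."""
    # Step 1: the encoding translates the character code to a CID,
    # and the CID is used to find the font metrics.
    cid = cmap[code]
    width = widths.get(cid, default_width)
    # Step 2: the font definition translates the CID to a Glyph ID.
    gid = cid_to_gid[cid]
    return cid, width, gid


# Made-up tables: code 0x8140 -> CID 633 -> Glyph ID 12.
cmap = {0x8140: 633, 0x0041: 34}
widths = {34: 667}
cid_to_gid = {633: 12, 34: 3}
result = lookup_glyph(0x8140, cmap, widths, cid_to_gid)
```

Keeping the metrics keyed by CID rather than by Glyph ID is what lets the same encoding and width tables be reused with different font data.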
Composite encoding allows characters to be encoded using multiple bytes. Fixed‑length encoding uses the same number of bytes for each character while variable‑length encoding can use varying byte lengths for different character codes.
For example, UCS‑2 is a fixed‑length encoding that always uses two bytes for each character with the character codes representing characters in the Unicode character set.
UTF‑16 is another encoding for the Unicode character set. It is a variable‑length encoding that uses two bytes for most common characters and four bytes for others.
The variable-length UTF-8 encoding uses from one to four bytes to encode the Unicode character set. Single-byte character codes are used for most English text, two or three bytes for common characters in other languages and four bytes for rarely used characters.
Shift JIS uses one or two bytes to encode Japanese characters corresponding to Code Page 932 used in the Japanese version of Microsoft Windows.
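The byte counts described above can be checked with Python's built-in codecs. This is only a sketch of the encodings themselves, not of how a PDF stores text. Python has no separate UCS-2 codec, so utf-16-be is used here; for characters in the Basic Multilingual Plane it produces the same two bytes as UCS-2. The helper name byte_length is hypothetical.

```python
def byte_length(text: str, codec: str) -> int:
    """Number of bytes needed to encode `text` with the named codec."""
    return len(text.encode(codec))


# Sample characters written as escapes so the source stays ASCII:
# U+0041 LATIN CAPITAL LETTER A, U+00E9 e with acute,
# U+65E5 (a common kanji), U+1F600 (an emoji outside the BMP).
SAMPLES = ["A", "\u00e9", "\u65e5", "\U0001F600"]


def length_table(codecs=("utf-16-be", "utf-8")):
    """Map each sample character to its encoded length per codec."""
    return {ch: {codec: byte_length(ch, codec) for codec in codecs}
            for ch in SAMPLES}


table = length_table()
# Shift JIS covers ASCII and Japanese characters, but not the other samples.
sjis_ascii = byte_length("A", "shift_jis")
sjis_kanji = byte_length("\u65e5", "shift_jis")
```

The table shows UTF-16 growing from two to four bytes only for the emoji, while UTF-8 grows from one byte to four across the samples and Shift JIS uses one byte for the Latin letter and two for the kanji.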
Identity‑H is a simple fixed-length mapping that uses 16-bit character codes. Each two-byte character code maps directly to a CID value.
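Because every Identity-H character code is two bytes and equals its CID, decoding a string reduces to reading big-endian 16-bit integers. A minimal sketch, with the hypothetical helper name identity_h_cids:

```python
import struct
from typing import List


def identity_h_cids(data: bytes) -> List[int]:
    """Split an Identity-H encoded string into CID values."""
    if len(data) % 2 != 0:
        raise ValueError("Identity-H strings must contain an even number of bytes")
    # ">H" reads each pair of bytes as an unsigned big-endian 16-bit integer.
    return [cid for (cid,) in struct.iter_unpack(">H", data)]


cids = identity_h_cids(b"\x00\x41\x01\x02")
```

Here the four bytes yield two CIDs, 0x0041 and 0x0102.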
For composite encoding, a structure called a CMap is used to define the encoding format and the character set of the encoded character codes. A special type of CMap called a ToUnicode CMap is used to translate from character codes to Unicode values so that text can be extracted.
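To give a feel for what a ToUnicode CMap contains, the sketch below reads only its simplest kind of entry: pairs of hexadecimal strings between beginbfchar and endbfchar, where the second string holds the Unicode text as UTF-16BE. It deliberately ignores range entries and everything else in the CMap, so treat it as an illustration rather than a parser. The function name parse_bfchar and the sample text are made up for this example.

```python
import re
from typing import Dict

_BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
_HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> Dict[int, str]:
    """Collect source code -> Unicode string from bfchar sections only."""
    mapping = {}
    for block in _BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in _HEX_PAIR.findall(block):
            # The destination is UTF-16BE and may hold more than one character.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


sample = """
2 beginbfchar
<0003> <0048>
<0004> <00660069>
endbfchar
"""
to_unicode = parse_bfchar(sample)
```

The second entry shows why the destination is a string rather than a single character: one glyph, such as an "fi" ligature, can map to several Unicode characters.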
When composite encoding is used with either TrueType fonts or OpenType fonts containing TrueType glyph outlines, a CIDToGIDMap structure is used to translate CIDs to Glyph IDs.
When composite encoding is used with either CFF fonts or OpenType fonts containing CFF glyph outlines, the font data itself is used to translate CIDs to Glyph IDs.
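For the TrueType case, a CIDToGIDMap can be supplied as a stream of bytes in which the Glyph ID for CID c occupies bytes 2c and 2c + 1, high-order byte first. Assuming that layout, the lookup is a slice and an integer conversion. The helper name gid_for_cid and the sample bytes are illustrative.

```python
def gid_for_cid(cid_to_gid_stream: bytes, cid: int) -> int:
    """Look up the Glyph ID for `cid` in a decoded CIDToGIDMap stream."""
    offset = 2 * cid
    if cid < 0 or offset + 2 > len(cid_to_gid_stream):
        raise IndexError("CID %d is outside the CIDToGIDMap" % cid)
    # Each entry is a two-byte big-endian Glyph ID.
    return int.from_bytes(cid_to_gid_stream[offset:offset + 2], "big")


# Made-up map for three CIDs: CID 0 -> 0, CID 1 -> 5, CID 2 -> 256.
stream = bytes([0x00, 0x00, 0x00, 0x05, 0x01, 0x00])
gid = gid_for_cid(stream, 1)
```

A subsetted font typically benefits from this indirection, since the CIDs in the page content can stay unchanged while the Glyph IDs in the smaller font are renumbered.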
PDF Standard Fonts
The PDF specification provides a list of fonts known as the standard 14 fonts. These fonts are guaranteed to be available in all PDF viewer applications that conform to the PDF specification, so text using these standard fonts does not require embedded font data.
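A tool that decides whether to embed font data might start with a check like the one below. The fourteen names are the standard font names; the function name is_standard_14 is hypothetical, and the check is an exact, case-sensitive match that does not attempt to recognise aliases such as "Arial".

```python
STANDARD_14 = frozenset({
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
})


def is_standard_14(font_name: str) -> bool:
    """True if `font_name` is exactly one of the standard 14 font names."""
    return font_name in STANDARD_14


helvetica_ok = is_standard_14("Helvetica")
arial_ok = is_standard_14("Arial")
```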
Adobe Font Packs
Adobe Acrobat and Adobe Reader have a feature that automatically downloads and installs a Font Pack when they open a PDF document containing non-embedded fonts whose names match fonts in the Font Pack.
For example, the font named HeiseiKakuGo-W5 is part of the Asian Font Pack. A PDF document could contain a non-embedded font with this font name. If Adobe Acrobat or Adobe Reader opens such a PDF document and the font isn’t available on the system, the application will offer to download the Asian Font Pack automatically so that it can display the content of the PDF.
The PDF specification provides a list of predefined CMaps. A PDF document may contain text that has been encoded using any of these predefined CMaps without embedding the CMap into the file. This reduces the file size of the PDF and allows text to be extracted easily.