PDFlib has announced the release of the new version of its PDF content extraction engine. The latest edition improves page content analysis, supports right-to-left languages like Arabic and Hebrew, and offers advanced Unicode post-processing controls.
The updates to the engine have been implemented in the PDFlib TET (Text Extraction Toolkit) family of products: PDFlib TET 4, PDFlib TET PDF IFilter 4, and TET Plugin 4. The results of PDF text extraction have been enhanced with improved shadow removal, word boundary detection and de-hyphenation, along with superscript and subscript detection. More workarounds for non-conforming PDF documents improve the robustness of text extraction; the enhanced repair mode can successfully extract text from damaged PDFs.
TET 4 rearranges bidirectional text in Arabic or Hebrew documents into the proper logical order. Unicode post-processing controls offer folding, decomposition and normalization according to the Unicode standard, which is useful for adapting the extracted text to the requirements of the application.
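To give a rough idea of what this kind of Unicode post-processing does to extracted text, here is a minimal sketch using Python's standard `unicodedata` module. It does not use the TET API; the `postprocess` function and the sample string are assumptions made up for illustration, and case folding stands in here for the broader notion of folding.

```python
import unicodedata


def postprocess(text: str) -> str:
    """Illustrative post-processing of extracted text (not TET's API)."""
    # NFKC: compatibility decomposition followed by canonical composition.
    # This replaces ligatures such as U+FB01 with the plain letters "fi".
    normalized = unicodedata.normalize("NFKC", text)
    # Case folding maps characters to a common form for caseless matching.
    return normalized.casefold()


# Hypothetical extracted text containing a ligature and an accented letter.
sample = "\ufb01le Caf\u00e9"
print(postprocess(sample))

# Canonical decomposition (NFD) splits U+00E9 into "e" plus a combining acute.
decomposed = unicodedata.normalize("NFD", "\u00e9")
print([hex(ord(ch)) for ch in decomposed])
```

Which normalization form fits best depends on the consuming application: a search index typically prefers folded, compatibility-normalized text, while an archival workflow may want to keep the original characters.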
TET is also available as a free plugin for Adobe Acrobat. The plugin supports Unicode syntax for search text and can highlight search hits on a page. PDFlib also offers the TET Cookbook, a collection of programming examples that demonstrate the use of TET for text and image extraction tasks.
For more information about the product, check out the official vendor website.