When I first started doing discovery using electronic documents — about 10 years ago — the industry practice was to scan paper documents to TIFF images. The docs were then sometimes put through the Optical Character Recognition (OCR) process to make them word-searchable. In addition, a bibliographic database was hand coded so that it was possible to search for a document by author, addressee, date, or some other field that had been included in the bib database. The vagaries of both the OCR and bibliographic databases can wait for another time… but the problems were many and the costs were high.
Most frustrating to me, however, was the fact that, having gone to all this trouble, each document was split into at least three pieces: the images (often kept on optical jukeboxes because hard drive capacities were orders of magnitude smaller), from which you could view or print an exact copy of the original; the text (often kept in a litigation support database like Summation or XyIndex); and the metadata (in a bibliographic database such as FoxPro, Lotus Notes, or something similar). Then there was the custom code that tied all these pieces together, so that you could do word searches and then view the images on which there were hits.
A recurring complaint from many lawyers was that when you did a full text search, you couldn’t see your hits on the document itself — because the text was separate from the image. There was a plethora of proprietary image viewers, few of which were either fast or stable. None were free.
It was frustrating to have documents in pieces, because you could not take a subset of electronic documents (say, documents relating to a specific witness or issue) and share them with someone on the team, like an expert witness, who didn’t or shouldn’t have access to the entire document system. In order to have any access to the system, a person needed to have all of the software, as well as the image files, OCR text database, and bibliographic database (plus the code that glued them all together). Most maddening to me was the fact that, having identified such a set of documents, about the only thing I could do was print them, thus coming full circle to a stack of paper. Not searchable, no metadata, just paper.
All of which led me to team up with some techies to try to create a litigation management system that would solve these (and other) problems. I came up with a design for something we dubbed a ‘SmartDoc.’ It would have the image, text and metadata all in one file. You could view thumbnails, so that you could visually identify pages. It would support hyperlinking within the documents, and to other items in the database. It would be smart enough to know when it had been produced in discovery, which witness files it was in, and whether it was privileged. It could be encrypted and password protected at the document level, not the database level. We worked on this problem for a while with very limited success, and then went on our way to solve a few other problems of the day (locating docs via GIS, developing a timeline interface for case development, and other sci-fi scenarios).
By the time I got back to the SmartDoc idea, Acrobat 4 was out. And there it was: image, text and metadata all in one unified package. You could search it and see your hits on the page. It had built-in metadata fields, and you could basically add whatever metadata you wanted. It could be hyperlinked, automated, and password protected. The ‘big iron’ search engines we preferred could index it (hell, it included the Verity search engine). The viewer was free.
Despite my enthusiasm for the format, some in litigation support had objections. Some thought PDF files were too big. (‘Compared to what?’ I would ask. ‘The TIFF file alone, or the whole shebang?’) Cost was also an issue since Adobe was basically charging by the page for its Acrobat Capture OCR product. However, my crude ROI calculations led me to conclude that over the life of a case, using PDF would pay for itself many times over in database costs and attorney time alone. Not to mention copying and storage of all the duplicates you had to print in order to distribute documents.
Since then, the case for PDF in litigation has only strengthened. At the core, I still like the fact that a PDF can be a ‘unified’ document, with the image, text, and metadata all in one file. Moreover, since I started on the quest for the ‘SmartDoc,’ corporate enterprises, the courts, government agencies, and yes, even some lawyers, have adopted PDF as a standard. Adobe says it has distributed 500 million copies of the Reader. A PDF can be a smart, self-contained document.
In future posts, I’ll talk about the reasons NOT to accept PDFs from the other side in discovery.