Why PDF/A should matter to you

By Jose Machado posted 08-03-2015 09:10

  

One very important part of any Information Governance strategy is deciding how long documents must be kept in the organization before being destroyed. These rules are influenced by several factors, such as local legislation (retention periods required by law), technical limitations (how much storage space is available) and business needs.

Retention periods can be very long. Some examples in the UK:

  • Human resources medical records must be kept for up to 40 or 50 years in some situations.
  • Retention obligations for government records (building, accounting, health & safety, etc.) are seldom under 10 years. For documents such as parliamentary papers, they can reach up to 100 years.
  • Company formation records and financial records of election must be kept for around 100 years.
  • Insurance certificates must be kept for well over 10 years.

It is essential for any organization to be able to open and display its documents, unaltered, for the whole duration of their retention period. This task is not as straightforward as it may seem, as some widely used file formats are not suitable for long-term archiving. Microsoft Office files, for instance, use a proprietary format. Nothing guarantees that these documents will be readable and/or unaltered in the long term.

To achieve this goal, the most widely accepted solution in the industry today is the PDF/A standard. As defined in ISO 19005-1:2005, it is "a file format based on PDF which provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files."

It essentially consists of a PDF file that contains everything it needs to be displayed correctly, on any platform, at any time. The standard ensures the file has no elements that depend on external software to be displayed, such as embedded audio and video. All required fonts and colour information must be included within the file, in a platform-independent format. All metadata embedded in the file must comply with the XMP standard. Password encryption is not allowed. There are also page size limits to respect.
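As a rough illustration of two of these constraints, the Python sketch below scans a file's raw bytes for the PDF/A identification that conforming files carry in their XMP metadata (the pdfaid:part and pdfaid:conformance properties) and for an /Encrypt key, which signals encryption. The function name quick_pdfa_hints is hypothetical, and the byte scan is only a heuristic that assumes the XMP packet is stored uncompressed. It reports what a file claims about itself; it does not check fonts, colour or anything else, so it is no substitute for a real PDF/A validator.

```python
import re
from typing import Dict, Optional, Union

# XMP can serialise a property as an element (<pdfaid:part>1</pdfaid:part>)
# or as an attribute (pdfaid:part="1"); both forms are matched here.
_PDFA_PART = re.compile(rb"pdfaid:part\s*(?:>|=\s*[\"'])\s*(\d)")
_PDFA_CONF = re.compile(rb"pdfaid:conformance\s*(?:>|=\s*[\"'])\s*([A-Za-z])")


def quick_pdfa_hints(data: bytes) -> Dict[str, Union[bool, Optional[int], Optional[str]]]:
    """Heuristic only: report what the raw bytes of a PDF claim about PDF/A."""
    part = _PDFA_PART.search(data)
    conf = _PDFA_CONF.search(data)
    return {
        # True when the XMP packet declares a PDF/A part (1, 2, 3, ...)
        "claims_pdfa": part is not None,
        "part": int(part.group(1)) if part else None,
        # Conformance level letter, normalised to upper case (A, B, U)
        "conformance": conf.group(1).decode("ascii").upper() if conf else None,
        # An /Encrypt key suggests encryption, which PDF/A forbids
        "has_encrypt_key": b"/Encrypt" in data,
    }
```

To try it on a real file, read the bytes first, for example quick_pdfa_hints(Path("contract.pdf").read_bytes()) with pathlib's Path, where contract.pdf is a placeholder name. A file that claims PDF/A and also shows an /Encrypt key deserves a closer look with a proper validator.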

Despite the constraints imposed by the standard, PDF/A documents offer plenty of features required by many areas of application. They can be digitally signed, which makes them a good option for legal documents. They are capable of reliably displaying any set of characters, in any language. Colours and graphics can be displayed without loss of fidelity. Like any PDF document, they allow for indexing and full text search. Also, the XMP metadata structure provides a well-known set of structured information about the document, enabling further search options.

Documents can be transformed to PDF/A in a variety of ways. Physical documents can be scanned directly into PDF/A format. Electronic documents can be printed into PDF/A files. Even existing PDFs can be transformed into PDF/A.

This task can become considerably more complex for large volumes of documents. Tens of thousands of document renditions per day is not an uncommon figure for some organizations. In cases like these, the best option is to integrate a server-based PDF rendering solution with the ECM system currently in place.
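To give a feel for what batch rendition involves, here is a minimal Python sketch that converts a list of existing PDFs in parallel. It is not a recommendation of a particular product: it assumes the open-source Ghostscript tool is installed as gs and that a PDFA_def.ps definition file (shipped with Ghostscript, and normally edited to point at an ICC colour profile) is available. The helper names build_gs_command and convert_batch, the "_pdfa" file suffix and the worker count are all made up for the example. Depending on the Ghostscript version, extra options may be needed, and the output should still be checked with a PDF/A validator.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def build_gs_command(src: Path, dst: Path, pdfa_def: Path, part: int = 2) -> List[str]:
    """Build a Ghostscript command line that asks for PDF/A output."""
    return [
        "gs",
        f"-dPDFA={part}",                 # requested PDF/A part
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=RGB",
        "-dPDFACompatibilityPolicy=1",    # drop features that would break PDF/A
        f"-sOutputFile={dst}",
        str(pdfa_def),                    # definition file, then the input file
        str(src),
    ]


def convert_batch(
    sources: Iterable[str],
    out_dir: str,
    pdfa_def: str,
    run: Callable[..., object] = subprocess.run,
    workers: int = 4,
) -> List[Path]:
    """Convert each source file; out_dir is assumed to exist already."""
    target_dir = Path(out_dir)

    def convert_one(source: str) -> Path:
        src = Path(source)
        dst = target_dir / f"{src.stem}_pdfa.pdf"
        # check=True makes a failed conversion raise instead of passing silently
        run(build_gs_command(src, dst, Path(pdfa_def)), check=True)
        return dst

    # Threads are enough here: the heavy work happens in the gs subprocesses
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, sources))
```

The run parameter exists so the conversion step can be swapped out, for instance to call a commercial rendering server instead of Ghostscript, or to test the batching logic without any converter installed. In a real ECM integration this loop would typically be fed from a queue of newly stored documents rather than a fixed list.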

AIIM keeps a list of certified PDF/A compliant solutions.

Further information and references can be found on the PDF Association website. In particular, the document PDF/A in a Nutshell provides a comprehensive introduction to the subject.

 

Comments

08-20-2015 12:32

That's right Julia, I mean the text within the document, not the metadata. PDF became an ISO standard in 2008. Before that, it was a proprietary format belonging to Adobe.
OCR processing takes a lot of resources, and I imagine Windows 7 does it on the fly (I have never tried this functionality myself, though). Usually, OCR and indexing happen asynchronously, so that when users search for documents, it's just a matter of querying the index server.

08-20-2015 10:02

Just so that I am totally clear on what is meant by full text indexing: you don't mean the metadata tags in the TIFF but indexing the content based on OCR, yes? While Windows 7 permits searching TIFF image documents based on textual content, the "while" can be really, really long, as OCR takes a lot of time (just my opinion). I have had push-back at times in previous jobs where I advocated PDF for scan/import, and management there asserted PDF was "too proprietary".

08-11-2015 11:07

It's a much more limited format for archiving. You can't full text index a TIFF file, for instance.

08-11-2015 11:03

What about TIFF? Is it not a long-term-storage file format as well?