PDF Generated Gibberish

By John Phillips posted 05-08-2012 17:09

  

I recently enjoyed the extensive discussion on this blog that followed when Serge Huber, CTO of Jahia Solutions, posted “After Flash, why PDF must die.” The post and its comments contained many informative opinions, as well as a lot of actual data on technical issues, information management standards, and more. It was striking to see the diversity of concerns about the PDF “standard” as a broadly viewable and archival information format.

Then I decided to make a PDF of a web page, and the Adobe PDF printer driver failed – again. Instead of a clean PDF copy of a bill payment receipt, I got gibberish characters that made the document largely unreadable. This had first happened last year, before a hard disk crash, and most of the technical advice I found on the Adobe and Microsoft sites was not very helpful. Some blog advice was to reinstall Adobe Acrobat. Then the disk drive crashed, I installed a newer version of Adobe Acrobat on the new drive, and the problem went away. Or so I thought.

Then, yesterday, it came back, so I researched it again. This time the answer was in “Adobe Acrobat 10 Displays/Prints Gibberish” (http://helpspot.business.uconn.edu/index.php?pg=kb.page&id=344). The problem involves the Adobe option “Rely on system fonts only, do not use document fonts,” which applies when the PDF file is generated. Checking or unchecking this box changes how the PDF is created. The example of character gibberish shown there was exactly what my documents had looked like. In some cases the failure was not apparent in the first few lines of the generated document; it became apparent only when one looked at the entire document.

This could cause real problems when automatically rendered PDFs are to be used as official records, unless human eyeballs catch the anomaly. An automated OCR check of each PDF might catch the problem, but it would have to run on every rendered PDF to identify specific creation failures, consuming more CPU cycles during document conversion. Finding and deleting these garbage PDFs retrospectively, after many had already been stored in an ECM system, would be costly for a system owner or operator. Imagine the reaction of corporate legal counsel if gibberish were all they could produce during discovery proceedings to attest to the innocence of their client. And if the fonts are altered from the original document, do you really have an archival-quality rendition?
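
To make the idea of an automated check a little more concrete, here is a minimal sketch in Python. It assumes the text has already been pulled out of the rendered PDF by some OCR or text-extraction step (not shown), and it simply asks what fraction of the tokens in each block of lines look like ordinary words or numbers. The function names, the 10-line chunk size, and the 0.6 threshold are illustrative assumptions, not anything taken from Adobe or an ECM product, and a heuristic this crude will misjudge abbreviations and non-English text.

```python
import re

# Hypothetical threshold: below this share of word-like tokens, flag the chunk.
# It would need tuning against real documents.
MIN_WORDLIKE_RATIO = 0.6

_VOWELS = set("aeiouyAEIOUY")
_EDGE_PUNCT = ".,;:!?()[]\"'$%-/"


def is_wordlike(token: str) -> bool:
    """Crude test: a number/date/amount, or letters containing a vowel."""
    core = token.strip(_EDGE_PUNCT)
    if not core:
        return False  # token was nothing but punctuation
    if re.fullmatch(r"[\d.,/:-]+", core):
        return True  # amounts, dates, times
    if not re.fullmatch(r"[A-Za-z][A-Za-z'-]*", core):
        return False  # stray symbols or unexpected characters
    return any(ch in _VOWELS for ch in core)


def wordlike_ratio(text: str) -> float:
    """Share of whitespace-separated tokens that look like words or numbers."""
    tokens = text.split()
    if not tokens:
        return 1.0  # nothing to judge, so do not flag
    return sum(is_wordlike(t) for t in tokens) / len(tokens)


def suspicious_chunks(text: str, chunk_lines: int = 10) -> list[int]:
    """Return 0-based indices of line chunks that look like gibberish."""
    lines = text.splitlines()
    flagged = []
    for start in range(0, len(lines), chunk_lines):
        chunk = " ".join(lines[start:start + chunk_lines])
        if wordlike_ratio(chunk) < MIN_WORDLIKE_RATIO:
            flagged.append(start // chunk_lines)
    return flagged
```

Checking in chunks rather than over the whole document matters because, as noted above, the gibberish does not always show up in the first few lines. Anything flagged this way would still go to a person for review; the sketch only narrows down where to look.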

So, despite our reliance on automated ECM and ERM solutions to get our daily work done, it is still important not to turn our futures over completely to computer-based robots or automated systems. All the technical wizardry in the world during systems design will not assure records of evidentiary value and archival quality if the systems do not perform exactly as expected. Human review of software performance must be factored into every automated system, or our futures may depend on archival records of questionable quality due to the unexpected generation of gibberish.



#ERM #ScanningandCapture #PDFs #EnterpriseContentManagement #ArchivalRecords #DocumentEvidence #ElectronicRecordsManagement

Comments

06-06-2012 23:00

Re: a PDF output function on their website. - Yes, I am seeing this on some of the more considerate Web sites that know you want a business record.
It would just be nice if we were better informed when we were actually creating the "garbage" in the first place. I guess one man's plain old meaty content file is another software driver's poisonous garbage input.

06-06-2012 22:50

That sounds reasonable, but my own personal use of Adobe is to quickly render PDFs of business records, so I would need to have a solution that was easily applied to all of the documents I am creating.

06-06-2012 22:47

Regarding - Simply automating some of the tasks (The conversion to PDF) is not sufficient...You also need to automate the validation and detection of unexpected behavior.
I agree and have learned the hard way that there is a reason most of my Acrobat configurations have forced me into reviewing the document after it was created. Strange anomalies often creep in but can be corrected if the document is just "eye-balled" right after generation. Most of the anomalies can be "adjusted" out of the configuration and the driver can then do its job correctly.

05-24-2012 13:02

"the Adobe PDF printer driver failed – again" The print driver did what it was supposed to do. All software has configuration options, and yours was not set up for what you needed. Let's not blame the hammer if you miss the nail.
Moreover, this is a UX problem with your vendor's website. It's possible to provide a nice print media CSS that would avoid your problem, just as it's possible for them to put a PDF output function on their website.
Garbage in, garbage out has always been the rule.

05-22-2012 23:38

Gibberish PDFs appeared in our organisation, and we identified that this occurred when Calibri, the MS default font, was selected. When we reinstated the corporate TrueType font, the problem was solved (without applying the Adobe fix).
Gibberish is not only in PDFs; it is also appearing in:
- emails (e.g. AIM Webinar Invitations)
- documentation (native format) from external parties
- information copied from websites
I have noticed fonts no longer have the TT (True Type) indicator. TTs are reliable with respect to legibility at low resolution, non-romanised characters (other languages), digital rights management and consistent published appearance (True Typeface).
I don't think the Adobe fix will address the other instances where gibberish occurs, so wouldn't it be safer to use TTs?

05-22-2012 10:56

Thanks for the great article. It's important to remember that when you automate a task, you automate all of the steps. One of the steps when someone manually generates a PDF is to look at it and make sure it looks OK. Why not include that in your automated process?
I talk about this in my post here:
http://www.adlibsoftware.com/blog/why-some-software-leaves-you-validating-results-after-converting-from-word-to-pdf.aspx
Simply automating some of the tasks (The conversion to PDF) is not sufficient...You also need to automate the validation and detection of unexpected behavior.
Then the humans can focus on dealing with exceptions instead of having to look over the shoulders of our heavy-lifting software-based robots ;-)

05-22-2012 10:35

I just read Mr. Phillips's article on PDF Generated Gibberish; the same thing happened to me yesterday. I followed the instructions to select the Adobe PDF printer and deselect “Rely on system fonts only, do not use document fonts”, and magically everything works well now. Thanks again!