The Enron Data Set is a collection of email data that was previously hosted by EDRM and in 2012 became an Amazon Web Services Public Data Set. It has served for many years as an industry-standard collection for electronic discovery training and is a valuable public resource for researchers in many disciplines.
It has never been a secret that the data set originally made available by the Federal Energy Regulatory Commission (FERC) contained a high level of personally identifiable information (PII) and even protected health information (PHI) about the company’s former employees.
In May of this year, consultants from Nuix carried out a major cleansing effort, identifying many items containing private, health and financial information. ZyLAB responded to the invitation of the EDRM co-founders, George Socha and Tom Gelbmann, to come forward and support this effort with an additional cleansing exercise, in order to protect the privacy of hundreds of individuals and to help locate any additional private data that may still exist in the data set.
By using the brand-new Visual Search & Classification technology in combination with the existing deep processing, content analytics and search capabilities, several hard-to-find items were identified, such as documents containing social security and credit card numbers, protected health information, 1040 tax forms, and even indecent pictures.
This effort shows that the right search and content analytics technology is required to find PII and PHI regardless of spelling errors, OCR errors, deliberately hidden data, aliases, code words, digital format, location, or language, and regardless of whether the data contains any explicit text at all, as is the case with images, video or audio recordings.
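To make the simplest part of this concrete, here is a minimal sketch of pattern-based detection for social security and credit card numbers. It is not the technology used in the Enron cleansing; the patterns, the function names `luhn_valid` and `find_pii`, and the sample text are assumptions for illustration only. Plain patterns like these miss OCR errors, aliases and image content, which is exactly why the more advanced analytics described above are needed.

```python
import re

# Illustrative patterns only (assumptions, not a complete rule set).
# US social security numbers written as 123-45-6789, excluding ranges
# that are never issued.
SSN_RE = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")
# 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for candidate SSNs and card numbers."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        # The checksum filters out most digit runs that are not card numbers.
        if luhn_valid(m.group()):
            hits.append(("credit_card", m.group()))
    return hits


if __name__ == "__main__":
    sample = "SSN 123-45-6789, card 4111 1111 1111 1111, order 1234 5678 9012 3456."
    for kind, value in find_pii(sample):
        print(kind, value)
```

The Luhn checksum is what keeps the order number in the sample from being reported as a credit card: it has the right shape but fails the check.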
Why is this relevant to you? Well, in today’s legal climate, violations of privacy, protected health information and data protection rules are becoming the number one concern of organizations building Big Data collections. Several fines and penalties have already been imposed on non-compliant organizations. Regulators such as the FTC in the US and lawmakers in Europe, such as the European Commission and the European Parliament, are working hard on very strict new legislation that carries even more risk for enterprise and government organizations.
So, what should you do? Here are some steps to follow:
First, make sure that you fully know what types of privacy and health-related data your organization has to maintain.
Next, ask yourself if you really need to keep the original data, because it can often be more cost-effective to (auto-)redact data and automatically remove or replace names, addresses, phone numbers, social security numbers and other details that can link data to an individual. Once the data is anonymized, your privacy risks and data management costs will drop dramatically.
Then, create different storage areas for different types of data. Protect sensitive data by using encryption and additional access control.
Most important: know what you have. Use data analytics tools, such as those used for the additional cleansing of the Enron data set, to identify misplaced information. Do not do this just once, but on a regular basis.
And finally, make sure to apply data retention policies accurately: do not keep data longer than required, certainly not data containing high privacy and data protection risks!
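Two of the steps above, auto-redaction and retention checks, can be sketched in a few lines. This is a minimal sketch under stated assumptions: the patterns, placeholders, function names (`redact`, `is_expired`) and the seven-year retention period are all hypothetical, and names and street addresses are deliberately left out because they need entity recognition rather than simple patterns.

```python
import re
from datetime import date, timedelta

# Hypothetical (pattern, placeholder) pairs; a real deployment needs many more.
REDACTIONS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"), "[PHONE]"),
]


def redact(text: str) -> str:
    """Replace each matched identifier with its placeholder."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text


def is_expired(created: date, retention_days: int, today: date) -> bool:
    """Return True if an item is older than its retention period."""
    return today - created > timedelta(days=retention_days)


if __name__ == "__main__":
    print(redact("Call 713-555-0100 or mail jane.doe@example.com; SSN 123-45-6789."))
    # Assumed policy for illustration: keep items for seven years.
    print(is_expired(date(2015, 1, 1), 7 * 365, date(2024, 1, 1)))
```

Running a check like `is_expired` on a schedule, rather than once, matches the advice to review the collection on a regular basis.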
By implementing these simple steps in combination with the right technology, you will be able to reduce the risks of your Big Data collections and make your Big Data defensible!
#BigData #redaction #e-discovery #PII #PHI #privacy