Developing your Assessment Plan for Defensible Disposition

By Richard Medina posted 04-01-2013 00:42


As I outlined here, an effective defensible disposition methodology primarily consists of developing and then executing four major stages:

  1. The Defensible Disposition Policy
  2. The Technology Plan
  3. The Assessment Plan
  4. The Disposition Plan

This post focuses on the third stage, the Assessment Plan.

Developing your Assessment Plan

The Assessment Plan specifies which information and systems you’re investigating and the particular processing rules you’re going to use. The first step in developing the Assessment Plan is to do the legwork and get a good picture of where all the information is, what repositories it’s in, and anything else you can learn about it that will help you create the rules (described below) and develop a plan of attack. This may take several days or weeks.

You then create processing rules based on file attributes. Note that I’m not using terms like "metadata" and the other terms vendors use when they talk about analytics and classification, because we find them confusing. Instead we sort the attributes that tell you what kind of file you’re dealing with into three categories:

  1. Environmental attributes around the file (e.g. file location, ownership)
  2. File attributes about the file (e.g. file type, age, author)
  3. Content attributes within the file (e.g. keywords, character strings, word proximity, word density)

You should then combine these attributes and create sets of rules that machines can use to sort the files, and that help the automated processes flag some of the files as exceptions that need human attention. Start with the simplest rules in which you have the most confidence. And then do multiple passes through the pile, each time using more complex rules on a pile that’s getting successively smaller and smaller.

A general rule of thumb is to use simple file level attributes in the first pass. In later passes, dig deeper and run discovery against environmental attributes, like location or access controls. Then use content attributes within the files, like character strings.
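To make the multi-pass idea concrete, here is a minimal sketch in Python. It is an illustration rather than how any particular analysis tool works: the `FileInfo` fields, the three example rules, and the labels ("purge", "exception", "retain") are all hypothetical. Each pass only sees the files that earlier passes could not decide, so the pile shrinks as the rules get more complex.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class FileInfo:
    # Hypothetical record of what a crawl might collect about one file.
    path: str
    extension: str
    owner_active: bool
    text: str = ""


# A rule returns a label, or None when it cannot decide.
Rule = Callable[[FileInfo], Optional[str]]


def file_pass(f: FileInfo) -> Optional[str]:
    # Pass 1: simple file level attribute (assumed junk extensions).
    return "purge" if f.extension in {".tmp", ".bak"} else None


def environment_pass(f: FileInfo) -> Optional[str]:
    # Pass 2: environmental attribute; flag for human attention.
    return "exception" if not f.owner_active else None


def content_pass(f: FileInfo) -> Optional[str]:
    # Pass 3: content attribute (assumed keyword).
    return "retain" if "contract" in f.text.lower() else None


def run_passes(
    files: List[FileInfo], passes: List[Rule]
) -> Tuple[Dict[str, str], List[FileInfo]]:
    decided: Dict[str, str] = {}
    remaining = list(files)
    for rule in passes:
        undecided = []
        for f in remaining:
            label = rule(f)
            if label is None:
                undecided.append(f)
            else:
                decided[f.path] = label
        remaining = undecided  # the next pass works on a smaller pile
    return decided, remaining
```

Calling `run_passes(files, [file_pass, environment_pass, content_pass])` returns the labeled files plus whatever is still unidentified after the last pass.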

Let’s walk through each of the three kinds of attributes.

#1: Environmental Attributes

The first are environmental attributes – characteristics around the file. Take a look at this table:


  • Access Controls (#1) can tell a lot about a file. Suppose that the author of a file or the only person allowed to see a file has been terminated. Or suppose that the author or sole authorized reader is a named custodian in litigation. Facts like these help tell us whether or not we must keep the document.
  • The Location in File Path (#2) is helpful because it can tell us what department the file is associated with, and what other documents it might be related to. But unfortunately file path is only really helpful in this way if you’re looking at organizations that are well organized and practice good document hygiene in their naming conventions and file paths. For organizations that restructure often and don’t take the time to remap their documents and paths, using file paths can be problematic.

    But the good news is that using file paths you will often come across lots of orphaned material. If this is the case, you can do something simple and effective. Establish a retention policy mandating that work in progress (WIP) materials in NAS environments should be retained for just 3 years. Then move the orphaned files into quarantine or dark archive for 3 years. Now the clock has started ticking. If the orphaned material is not accessed in 3 years, you can assume it’s garbage and purge it – perhaps after doing a quick keyword search to check that it’s not under litigation hold.
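The quarantine clock can be sketched as a small decision function. This is only an illustration of the policy described above: the function names, the labels it returns, and the idea that you can obtain a last-access date and a legal hold flag for each file are assumptions.

```python
from datetime import date
from typing import Optional

RETENTION_YEARS = 3  # the WIP retention period from the policy above


def add_years(d: date, years: int) -> date:
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Feb 29 moved into a non-leap year: fall back to Feb 28.
        return d.replace(year=d.year + years, day=28)


def disposition(
    quarantined_on: date,
    last_accessed: Optional[date],
    today: date,
    on_legal_hold: bool,
) -> str:
    if on_legal_hold:
        return "hold"    # never purge material under litigation hold
    if last_accessed is not None and last_accessed >= quarantined_on:
        return "review"  # someone used it, so it may have value
    if today >= add_years(quarantined_on, RETENTION_YEARS):
        return "purge"   # untouched for the full retention period
    return "wait"        # the clock is still ticking
```

For example, a file quarantined on 1 April 2013 and never accessed would come back as "wait" until 1 April 2016 and as "purge" from that day on.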

#2: File Attributes

  • If you think about file attributes or characteristics about the file, there’s a lot to work with in terms of whether a file is an exact or near duplicate (#3). I’d caution, though, that dupes can be problematic because you often have to determine which of several (2, 5 – or even 20) duplicates is the authoritative source and which are copies.

    Duplicates can also potentially mess up access controls. If you are using an analysis tool to crawl the shared drives in your NAS environments looking for duplicates, there’s nothing you can do about this unless you take a single copy and put it in an ECM system like FileNet, Documentum, OpenText, or possibly SharePoint. You can then stub it (hyperlink to it) in the file system so everyone gets pointed to it, thereby maintaining its access controls.
  • With File types (#4), we can look at the extension or better yet crack open the header of the file and look at the MIME type to identify file types that are inappropriate on corporate networks, such as iTunes libraries.
  • Metadata (#5) can be used – though it often doesn’t exist or just can’t be trusted. But it’s usually useful to go after file age and author, or confidentiality or security markings. We usually use metadata attributes in combination with other variables.
  • And then there’s File Name (#6). You’re in luck if your organization has been using common naming conventions. It’s also sometimes apparent that you’re looking at system generated files, for example weekly reports and sales forecasts. These are definitely not record-worthy. They are transitory in nature, and if that’s apparent from the naming convention we can sample a handful on a regular basis and be fairly confident that there’s no need to retain them.
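As an illustration of the exact-duplicate check (#3), here is a minimal sketch that groups files by a SHA-256 hash of their contents. The function names are hypothetical, and this is not how any specific analysis tool does it. It finds exact duplicates only; near duplicates need a different technique, and the code makes no attempt to decide which copy is the authoritative source.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    # Hash the file in chunks so large files do not have to fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exact_duplicates(root: Path) -> List[List[Path]]:
    # Map each content hash to every file under root that has that content.
    by_hash: Dict[str, List[Path]] = defaultdict(list)
    for p in sorted(root.rglob("*")):
        if p.is_file():
            by_hash[sha256_of(p)].append(p)
    # Keep only the hashes shared by more than one file.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Each group returned by `exact_duplicates(Path("/some/share"))` is a set of byte-for-byte identical files that a person or a later rule still has to resolve.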

#3: Content Attributes

  • Within the file is the last set of attributes we would use – and I’d caution you that many of them are the most difficult for today’s analysis and classification technology to work with. Here we’re looking at strings of characters, pattern matching, and the like.
  • What’s interesting is that some content attribute analysis is actually reaching maturity – specifically keyword assessment (#7) in the e-discovery space. That’s fairly robust with prebuilt lexicons that let you go after certain terms with some very proficient tools.
  • But character or word patterns (#8) are what has traditionally been called classification: using the proximity and frequency of words, or pieces of words, to discern the document type after training the tool with exemplary samples. It’s the cutting edge of the content analysis and classification industry today.
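To give a feel for what content attributes like word proximity and word density mean in practice, here is a toy sketch. It is far simpler than a trained classifier or a commercial e-discovery lexicon, and the function names, the tokenizing rule, and the example terms are all assumptions made for illustration.

```python
import re
from typing import List, Set


def tokens(text: str) -> List[str]:
    # Lowercase the text and split it into simple word tokens.
    return re.findall(r"[a-z0-9']+", text.lower())


def within_proximity(text: str, term_a: str, term_b: str, max_gap: int) -> bool:
    # True if the two terms occur no more than max_gap word positions apart.
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)


def keyword_density(text: str, lexicon: Set[str]) -> float:
    # Fraction of the words in the text that appear in the lexicon.
    words = tokens(text)
    if not words:
        return 0.0
    return sum(1 for w in words if w in lexicon) / len(words)
```

A rule might then say, for example, that a file is probably a contract when "agreement" appears within a few words of "contract" and the density of contract terms passes some threshold. Where to set that threshold is a judgment call that you would tune by sampling.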

Assessment Results and Summary

The assessment results after multiple passes will show you how much of the pile is:

  • Unnecessary “junk” that can be purged
  • Records that should be retained according to the records retention schedule
  • High-value non-records that should be retained according to policy (e.g., 3 years or 7 years)
  • Information that just can’t be identified by analytics, classification, or any passes using sets of the above attributes assembled into rules

This last piece may be large (we often see 40-50%). The only recourse for this segment is to stage it for disposition – move it someplace like quarantine with read-only status and start the clock ticking. If a file is read, we assume it has value. If it’s not read, we assume it’s transitory and of little value. We may crawl the segment for legal hold just prior to disposition. But then we purge it.

#DefensibleDisposition #metadata #analytics #InformationGovernance #assessment #ElectronicRecordsManagement #classification