Structured vs. Unstructured Records – Really?

By John Phillips posted 06-08-2011 14:44


One of the things that always amazes me is how the proverb “the more things change, the more they stay the same” still applies today in the world where technology and recordkeeping concepts merge. Why are we using the terms “structured data” and “unstructured data”? Where did these terms originate? Why are they important? Are they really of any value?

My recollection of first hearing these terms goes back to the 1980s, when the IT shops that owned “Management Information Systems” were being pushed aside politically in organizations as the business units all moved en masse to personal computers. A Microsoft DOS-based “IBM personal computer” and Lotus 1-2-3 spreadsheet software were far more responsive and useful than a “minicomputer”-based application, managed by the IT shop, that expected you to “submit jobs” using “terminal emulators” and then wait for the results to be displayed onscreen or picked up somewhere in paper format. In defense against the migration to PCs, the IT shops just said, “We’re in charge of the structured data, and all of the rest of that stuff is just unstructured.” The implication was that if you stored your business information on PCs with local application software, instead of accessing it from their databases, you were some kind of risky mental lightweight. Of course, the business units were focused on what actually accomplished their tasks on time, so they cut their central IT support budgets, bought PCs, and, as they say, the rest is history.

To this day, we still see widespread use of these terms. However, in almost every case, the term “structured data” can simply be replaced by the term “database data,” as this describes the format and presentation requirements of this information. Unstructured data can simply be described as “electronic objects,” because that also implies that there could be a variety of actual data file formats and presentation requirements for the information stored within those electronic files. Is structured data structured? Yes, in that it has sufficient metadata associated with the multiple “records” within the electronic file to display in tabular column and row formats. Or you can add some formatting and display metadata by presenting the database information with associated headers, footers, pagination, etc., thereby creating an informational electronic object. This is the proverbial “snapshot” of some of the data in the database.

But what exactly is unstructured about unstructured data? In fact, most word processing files can display data with headers, footers, pagination, margins, fonts, sections, tables of contents, footnotes and other structure. Presentation graphics software stores data that can be displayed as slides with headers, slide numbers, graphics, master slides, notes slides, etc. Spreadsheet files can display data as columns and rows of “records” with associated column headings and pagination. Simple application-resident search commands can access the contents of these files, and all of these formats can store vector or raster images embedded within the electronic files. Is an HTML file not structured? Try the “view source” command in some browsers or just open the hypertext markup language file with a text editor and you will instantly see what is meant by a “tagged text” file. Headers, footers, and tagged text are everywhere. Who is telling us that this is not structured?
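
To make the “tagged text” point concrete, here is a minimal sketch in Python that prints the nesting of tags in a small HTML page using only the standard library. The sample page and the names OutlineParser and outline_of are made up for illustration, and the sketch assumes every tag in the page is explicitly closed.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collects an indented outline of the opening tags in an HTML document.

    Assumes every element is explicitly closed; unclosed tags such as a bare
    <br> would throw the indentation off in this simple sketch.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current nesting depth, then go one level deeper.
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Leaving an element: come back up one level.
        self.depth -= 1


# Hypothetical sample page, invented for this example.
SAMPLE_HTML = (
    "<html><head><title>Quarterly Report</title></head>"
    "<body><h1>Sales</h1><p>Up 4%.</p></body></html>"
)


def outline_of(html_text):
    """Return the tag outline of html_text as a list of indented tag names."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.outline


if __name__ == "__main__":
    print("\n".join(outline_of(SAMPLE_HTML)))
```

The outline that comes back is a tree of head, title, body, heading and paragraph, which is hard to describe as anything other than structure.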

Where does this concept of structured data begin or end? Are XML files not structured? Are spreadsheets not structured? It seems that the only real distinction between structured and unstructured data is the storage of the data in a format that requires a query language like SQL to access the information in each electronic object or file. In fact, you can query and display data from both XML files and spreadsheet files, just not as powerfully as with standard SQL accessing database data.
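
As a rough illustration of querying these so-called unstructured formats, the Python sketch below asks the same question (which invoices belong to one customer?) of an XML document and of spreadsheet-style data saved as CSV. The invoice data, field names and function names are all invented for the example. The XML query relies on the limited XPath subset in the standard library's ElementTree module, which is a good deal less powerful than SQL.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this example.
XML_TEXT = """<invoices>
  <invoice id="1001"><customer>Acme</customer><total>250.00</total></invoice>
  <invoice id="1002"><customer>Globex</customer><total>75.50</total></invoice>
  <invoice id="1003"><customer>Acme</customer><total>120.00</total></invoice>
</invoices>"""

CSV_TEXT = """id,customer,total
1001,Acme,250.00
1002,Globex,75.50
1003,Acme,120.00
"""


def invoice_ids_from_xml(xml_text, customer):
    """Return the ids of invoices for one customer, using an XPath-style query."""
    root = ET.fromstring(xml_text)
    # [customer='...'] keeps only <invoice> elements whose <customer> child
    # has exactly that text. (Sketch only: the name is not escaped.)
    matches = root.findall(f"./invoice[customer='{customer}']")
    return [invoice.get("id") for invoice in matches]


def invoice_ids_from_csv(csv_text, customer):
    """Return the ids of invoices for one customer from spreadsheet-style rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["id"] for row in reader if row["customer"] == customer]


if __name__ == "__main__":
    print(invoice_ids_from_xml(XML_TEXT, "Acme"))
    print(invoice_ids_from_csv(CSV_TEXT, "Acme"))
```

Both functions return the same two invoice ids, so the data is clearly queryable. What is missing compared with a database is the richer machinery of joins, indexes and aggregation.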

Think about this. Despite all of the lofty references to structured data somehow being more accessible and manageable than all of that unstructured stuff, one almost never uses structured data without displaying it as a report or a screen of formatted data. In other words, unless it is converted to a page, document, or other record that is pretty much functionally the same as unstructured data, it is largely useless for most people. Try giving your management some structured data in the form of a tabular data dump, without report titles, headers, columnar metadata, pagination and other formatting, and see how they like the structured data. What they will expect is that you convert the structured data into a report with some informational context and value. They will expect delivery as a PDF file attachment or (gasp) a paper report, both of which really resemble those “unstructured” electronic objects. And they will have to be managed with software intended to manage those objects, probably ECM/ERM software.
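
The gap between a data dump and a report can be sketched in a few lines of Python. The example below pulls rows from a throwaway in-memory SQLite database and renders them twice: once as a raw dump and once with a title, column headings and a row count. The table, the figures and the function names are hypothetical, and a real report would of course add pagination and more context.

```python
import sqlite3


def fetch_rows():
    """Query a small in-memory database (made-up sales figures)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)", [("East", 1200), ("West", 950)]
    )
    rows = conn.execute(
        "SELECT region, amount FROM sales ORDER BY region"
    ).fetchall()
    conn.close()
    return rows


def raw_dump(rows):
    """The 'structured data' as it comes out: values only, no context."""
    return "\n".join(",".join(str(cell) for cell in row) for row in rows)


def format_report(title, headers, rows):
    """Wrap the same rows in a title, column headings and a row count."""
    # Each column is as wide as its widest value, heading included.
    widths = [
        max(len(str(cell)) for cell in column) for column in zip(headers, *rows)
    ]

    def line(cells):
        padded = (str(cell).ljust(width) for cell, width in zip(cells, widths))
        return "  ".join(padded).rstrip()

    lines = [title, "=" * len(title), line(headers)]
    lines.append(line("-" * width for width in widths))
    lines.extend(line(row) for row in rows)
    lines.append(f"Rows: {len(rows)}")
    return "\n".join(lines)


if __name__ == "__main__":
    rows = fetch_rows()
    print(raw_dump(rows))
    print()
    print(format_report("Regional Sales", ("Region", "Sales"), rows))
```

The second output is the one management asks for, and once it is saved to a file it is an electronic object like any other document.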

Are we not limiting ourselves to the professional perspective of IT shops from decades ago by continuing to use their terminology? Let’s think about how we might better deal with this issue. It seems that it boils down to two types of software to manage slightly different file formats – database software and content management software. What do you think?

#ElectronicRecordsManagement #structuredcontent #erecords #unstructuredcontent #xml #records #ECM