The Ocean of Big Data

By John Phillips posted 06-11-2012 15:53


We have all heard the phrase “That’s like trying to boil the ocean.” We hear it when we appear to be attempting too much and are probably going to fail, as inadequate manpower, resources, and budget slowly erode any real achievement of goals. Projects that are “trying to boil the ocean” are typically said to need better focus and a better-defined scope to enable actual accomplishment of quantifiable objectives. It sometimes seems we learn little from the past, even though many contemporary challenges have been seen before. Big Data proponents may soon be embarrassed as they experience the aphorism brought to us by George Santayana: “Those who cannot remember the past are condemned to repeat it.”

So, where have we previously seen many of the almost insurmountable challenges brought to us by Big Data? In a 1990s concept termed Data Warehousing. As with Big Data, many of the proponents of (Big) Data Warehousing thought that limits on CPU processing power, data storage, and data analytics were the fundamental challenges that needed to be surmounted. They thought we needed more advanced computer technologies, greater rates of content capture, increased storage capacities, and more integration of divergent software platforms and data formats. Then, the realities of information creation context, data security constraints, metadata consistency, and overall e-records management became evident. It also became apparent that general usability had to be designed into what was otherwise a collection of information silos just dumped together.

I wrote about this in an article published in the April 1997 Records Management Quarterly professional journal entitled “What’s in that Data Warehouse?” It is shocking how many of the information management issues enumerated are still true today – of Big Data:

  1. “A data warehouse is an electronic means to store a large amount of reference or historical data typically used to support decision-making and information retrieval needs of an organization.” Sound like Big Data?
  2. “The usual components of a data warehouse are 1) software tools to extract the data from existing databases, 2) software to manage the new database (warehouse) environment in which data are to be stored, and 3) software to retrieve and analyze the data in the data warehouse.” And then Big Data proponents want to also throw unstructured data into Terabyte and Petabyte warehouses?
  3. “After data is incorporated into a data warehouse, how does one assure that a “part number” from one database origin is the same as a “sub-assembly number” from another database origin?” – Metadata and taxonomies anyone?
  4. “The information that creeps into the corporate data warehouse may be used for purposes beyond that originally intended by those loading and storing the information.” Is there going to be a Big Data Security and Privacy Czar? Which data-driven e-records are non-sensitive, sensitive, or need-to-know only?
  5. “Although many data warehouses in use today are used for strategic planning or data analysis activities rather than as operating production computing systems, there will eventually be a tendency to treat the data warehouse as an authoritative source of electronic business data and records.” Who determines authenticity and accuracy of evidence in a Big Data Set? And,
  6. “Data compilations have been requested in court proceedings for many years. It is entirely conceivable that the data in a data warehouse will eventually be used in legal proceedings against the owner of the data.” Is everyone ready for Big e-Discovery?

Big Data activities may at some point seem like trying to boil the global InfoSphere. How does one define boundaries of input to assure relevance for objectives? There is no question that the ability to access and analyze large sets of data can be extremely useful for certain specific, time-constrained, focused purposes. But how do you calculate ROI on each data set added to the Big Data Set? Of value to whom? For what purpose? There is already a lot of discussion in information technology blogs and communities that we need better tools, storage, and concepts. To achieve what goals, and whose? Does it really make sense to dump together potentially unrelated data sets and then turn around and try to figure out how to access and use the “stuff” in the new “Big Whopping Data Set”?

If Big Government, the Big Corporation, and Big Labor are often considered negatively, will Big Data meet a similar fate? Or maybe the Big Bad Data Wolf will threaten to blow our house down as it did in the Three Little Pigs story in the book English Fairy Tales. I doubt it, since most business data is still “housed” in data centers made of bricks (and mortar). Oh, and whatever happened to the “Small is Beautiful” concept of a few years ago? Remember when small, appropriate technologies were all the rage because they “empowered” people? So far, is Big Data empowering or overwhelming?

So, is there anything fundamentally new about Big Data? I am not sure there really is. Maybe it is just a Big Fad. Bigger usually means more complex, more costly, and more challenging to implement, and it makes ROI or added value to a particular organization harder to establish. It will be interesting to see what John Mancini says Wednesday in the AIIM Webinar on “Big Data? Big Hype!” I may agree with him already.


#ReallyandTrulyBigData #ERM #ECM #Databases