BigData: old problem, new solutions

By Serge Huber posted 06-07-2012 09:52


When people hear about BigData, many assume it is a new problem that requires new understanding and new capabilities, be they human or electronic. But BigData is not really a new problem: it is always relative to the storage capacity and computing power available at a given time. For as long as computers have existed, large datasets have been created or collected, and they have posed challenges for automated processing and analysis.

As an example, I’d like to draw from personal experience. When I was a student at the Swiss Federal Institute of Technology in Lausanne (EPFL), I worked on a dataset called “The Visible Human”. This was an incredible project: a real person’s body, donated to science, was frozen, cut into thin slices, and scanned at high resolution to build a complete and realistic database of human anatomy. The dataset was then made public so that scientists and doctors could use it however they wanted, which made it possible to advance medical research significantly.

My project was quite advanced at the time. It built on research done at my school to make it possible to process this large dataset in real time, using “large” clusters of computers to extract data from this complex set very efficiently. I deliberately put the word “large” in quotes because back then it meant about four computers working together, which might seem small today but was already quite powerful. The school had developed a library called CAP that made it relatively easy to harness the cluster’s computing power, and had also attached it to another cluster serving as network-attached storage (NAS).

You might be wondering how large this dataset was. In 1997, the complete dataset of high-resolution slices was 15GB. That might seem small by today’s standards, but trust me, it was an incredibly large amount of data back then, when maximum hard disk capacity was around 1GB. So you can imagine the number of disks needed to store a 15GB dataset.

Despite these challenges, the clustering worked quite well, and it was possible, in near real-time, to retrieve any slice from the huge dataset. My project involved further processing the extracted slices to reconstruct 3D voxel objects that could then be viewed by doctors to look at a specific structure, be it an organ or any other part of the human body.
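
To make the reconstruction step a little more concrete, here is a minimal sketch in Python. It is not the original project code (which ran on a cluster using the CAP library); it only illustrates the general idea of stacking 2D slices into a 3D voxel volume and picking out a structure by intensity. The function names `build_volume` and `extract_structure`, the tiny integer grids, and the intensity-range selection are all assumptions made for illustration.

```python
from typing import List, Tuple

Slice = List[List[int]]   # one scanned slice: rows of pixel intensities
Volume = List[Slice]      # slices stacked along the body axis, indexed [z][y][x]


def build_volume(slices: List[Slice]) -> Volume:
    """Stack equally sized 2D slices into a 3D voxel volume (hypothetical helper)."""
    if not slices:
        raise ValueError("at least one slice is required")
    height, width = len(slices[0]), len(slices[0][0])
    for s in slices:
        if len(s) != height or any(len(row) != width for row in s):
            raise ValueError("all slices must have the same dimensions")
    # Copy the rows so the volume does not alias the input slices.
    return [[row[:] for row in s] for s in slices]


def extract_structure(volume: Volume, low: int, high: int) -> List[Tuple[int, int, int]]:
    """Return (z, y, x) coordinates of voxels whose intensity lies in [low, high]."""
    return [
        (z, y, x)
        for z, s in enumerate(volume)
        for y, row in enumerate(s)
        for x, value in enumerate(row)
        if low <= value <= high
    ]


# Two 2x2 slices standing in for high-resolution scans.
slices = [
    [[0, 10], [0, 0]],
    [[10, 10], [0, 0]],
]
volume = build_volume(slices)
structure = extract_structure(volume, 5, 15)
```

A real pipeline would of course work on far larger slices and would need proper segmentation rather than a simple intensity range, but the data layout is the same: a stack of slices addressed by three coordinates.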

Let’s fast-forward to today, where the scale of BigData has of course increased by huge factors, and once again far exceeds nominal disk capacities. Analyzing these datasets causes all kinds of problems, but many of them have been seen before, in some form or another. The truly new problem is how to do this in an efficient and cost-effective way, while improving the speed of analysis.

So, since the problem is not really new, how have the solutions improved? For one, large clusters are now far more accessible than ever before. With Amazon Web Services, for example, one can easily rent a lot of computing power to perform either one-time or ongoing large-scale data analysis. Also, since one of the most closely studied kinds of data is commercial user behavior, a lot more manpower has been invested in building tools for large-scale data analysis, such as the work going on at the Apache Hadoop project. Although it is a fairly low-level library, it already makes it much easier for developers to put large sets of computers to work on various problems, be they data analysis or any other form of extreme computing load.
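
Hadoop is best known for the MapReduce programming model, and the shape of that model can be sketched in a few lines of plain Python. This is not Hadoop's API, and it runs on a single machine; it only shows the map, group, and reduce steps that a framework would distribute across a cluster. The log format (`user page` per line) and the names `map_event`, `reduce_counts`, and `run_job` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_event(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (page, 1) for each well-formed 'user page' log line."""
    parts = line.split()
    if len(parts) == 2:
        yield parts[1], 1


def reduce_counts(page: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all the counts emitted for one page."""
    return page, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, group by key, then reduce, all in one process."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_event(line):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical user-behavior log; malformed lines are skipped by the map step.
log = ["alice /home", "bob /home", "alice /pricing", "garbage"]
page_views = run_job(log)
```

The appeal of the model is that the map and reduce functions stay this simple, while the framework takes care of splitting the input, moving intermediate data between machines, and recovering from failures.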

BigData is really a problem built for the open source world: it makes a lot of sense for companies to share experience and developer time on the infrastructure work, which is complex and expensive, while focusing their own efforts on the actual algorithms and products that they will sell to customers. The common problem areas and the sharing of implementations have given birth to new open source solutions that are now widely available, making it possible to build on top of existing infrastructure. Among the interesting new technologies that can help deal with large data are the various NoSQL implementations, and the associated interoperability standards that make it a little easier to access various data stores (such as REST, JSON, JCR, CMIS or the upcoming WEMI standard).

The algorithmic part is now considered the hard part, especially since it crosses over into other domains such as mathematics, semantics, or whatever specialized business or scientific skill set is required to analyze the data. So in effect, the real reason BigData is hard is the need to bring together teams that mix quite different knowledge areas. For companies this might be difficult, as they might not have the skill sets in-house and might need to outsource or hire new specialists. This is possibly also why a lot more work on BigData has happened in universities, where it is much easier to find the various skill sets needed.

In the content world, BigData is often associated with analytics, where the main interest is to process users’ behavior to improve content personalization and make sure it fits the needs and interests of the content consumers. But there are also other interesting uses, such as content semantic analysis or integration with business data systems to extract new knowledge out of existing structured or unstructured data (as in the EU-sponsored IKS project).

For content management systems, dealing with BigData will mostly mean integrating with external data analysis systems, as the former are more concerned with managing data, while the latter deal with processing it. It is my hope, and my focus, that the open source world will rise to the challenge, rather than leaving us with a world of incompatible vendors selling proprietary solutions for data analysis, as is now mostly the case in web analytics.

#analytics #ContentManagement #BigData #NoSQL