The Dark Side of Big Data

By Johannes Scholtes posted 09-24-2012 07:35

The ongoing information explosion is reaching epic proportions and has earned its own name: Big Data. Big Data encompasses both challenges and opportunities. The opportunity, which is where many parties focus, is to use the collective Big Data to predict and recognize patterns and behavior, increase revenue and optimize business processes. But there is also a dark side to Big Data: requirements for eDiscovery, governance, compliance, privacy and storage can lead to enormous costs and new risks.

Applying content analytics helps to mitigate the dark side of Big Data while also making it possible to benefit from its power.

New data formats (multimedia, in particular), different languages, cloud and other off-site locations, and the continual increase in regulations and legislation (which may contradict previous protocols) add even more complexity to this puzzle.

Several commercial tools exist to gain direct access to all such data formats and repositories, regardless of whether they are on-site or off-site and regardless of the language in which the content is written. But once you have access to the data and have applied the traditional tools to unpack compound objects (e.g., ZIP and PST files) and manage the vast volumes, how can you derive true understanding from it?

This is where content analytics come into play, and they are becoming an essential toolset, particularly for overcoming the challenges of unstructured and multimedia data. In essence, we need computers to battle the data explosion we’ve caused with other computers. Applying content analytics helps to mitigate the risks of Big Data, and also to benefit from its power: broad analysis that yields new insights.

Content analytics such as text mining and machine learning technology from the field of artificial intelligence can be used very effectively to manage Big Data. Think of tasks such as, but not limited to, identifying exact and near-duplicates, structuring and enriching the content of text and multimedia data, identifying relevant (semantic) information, facts and events, and ultimately, predicting what is about to happen or classifying information automatically. As a result of these content analytics efforts, users can explore and understand repositories of Big Data better and can also apply combinations of advanced search and data visualization techniques more easily.
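To make the first of these tasks concrete, here is a minimal sketch of exact and near-duplicate detection. The post does not describe a specific method, so the choices here are assumptions for illustration: exact duplicates are found by hashing normalized text, and near-duplicates by Jaccard similarity over three-word shingles. The function names, the 0.5 threshold and the sample documents are all hypothetical.

```python
import hashlib
import re


def content_hash(text: str) -> str:
    """Fingerprint for exact-duplicate detection, ignoring case and extra whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences (shingles) found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(docs: dict, threshold: float = 0.5):
    """Return (exact_pairs, near_pairs) as lists of document-id pairs.

    The threshold is an assumed value; a real collection would need tuning.
    """
    ids = sorted(docs)
    hashes = {i: content_hash(docs[i]) for i in ids}
    shingle_sets = {i: shingles(docs[i]) for i in ids}
    exact, near = [], []
    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            if hashes[a] == hashes[b]:
                exact.append((a, b))
            elif jaccard(shingle_sets[a], shingle_sets[b]) >= threshold:
                near.append((a, b))
    return exact, near


# Hypothetical sample documents, for illustration only.
docs = {
    "a": "The quarterly report is attached for your review.",
    "b": "the quarterly   report is attached for your review.",
    "c": "The quarterly report is attached for your review and comments.",
    "d": "Lunch is at noon on Friday.",
}
exact_pairs, near_pairs = find_duplicates(docs)
print("exact:", exact_pairs)
print("near:", near_pairs)
```

The pairwise comparison is quadratic in the number of documents, so it only suits small collections. At Big Data volumes one would typically replace it with an approximate technique such as MinHash with locality-sensitive hashing.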

Using these types of automated processes also requires an unbiased evaluation of the results and defensible processes. In other words, the quality and reliability of the automatic structuring, enrichment, classification and prediction techniques need to be measured using existing best practices. Only then will end users accept the use of such technology for mission-critical processes. Many such best practices exist in the field of information retrieval.
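As an illustration of what such a measurement can look like, the sketch below scores an automatic classifier against a human-reviewed sample. The post does not name specific measures; precision, recall and F1 are assumed here because they are widely used in information retrieval. The document ids and the function name are hypothetical.

```python
def precision_recall_f1(predicted: set, relevant: set):
    """Compare the documents a classifier flagged against those a human reviewer marked relevant."""
    true_positives = len(predicted & relevant)
    # Precision: what share of the flagged documents is actually relevant.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what share of the relevant documents was actually flagged.
    recall = true_positives / len(relevant) if relevant else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical review sample, for illustration only.
predicted = {"doc1", "doc2", "doc3", "doc4"}          # flagged by the classifier
relevant = {"doc2", "doc3", "doc4", "doc5", "doc6"}   # marked relevant by a reviewer

precision, recall, f1 = precision_recall_f1(predicted, relevant)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting both numbers matters for defensibility: high precision alone says nothing about how many relevant documents were missed, which is usually the greater risk in eDiscovery and compliance work.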

By using the right content analytics methods, one can implement a defensible process that controls the risks of Big Data on the one hand, and benefits from its predictive power and the new insights it offers on the other!

#grc #DefensibleDisposition #InformationGovernance #LegacyInformationClean-up #darkside #BigData #e-discovery

Permalink

https://community.aiim.org/blogs/johannes-scholtes/2012/09/24/the-dark-side-of-big-data

Comments

Azana Baksh

09-26-2012 08:58

Johannes, nice article. We are seeing an increase in businesses seeking specialized skills to help address challenges that arose with the era of big data. The HPCC Systems platform from LexisNexis helps to fill this gap by allowing data analysts themselves to own the complete data lifecycle. Designed by data scientists, ECL is a declarative programming language used to express data algorithms across the entire HPCC platform. Their built-in analytics libraries for Machine Learning and BI integration provide a complete integrated solution from data ingestion and data processing to data delivery. More at http://hpccsystems.com
