How To: Exploiting Big Data with Indexes

By Kevin Neal posted 07-09-2013 15:41



Use Case:  In today’s business environment, more than ever, it’s simply not good enough to be average.  Organizations of all sizes have to strive to create competitive advantages, understand trends and gain better insight into operational efficiency.  One of the most useful techniques to accomplish these goals is to exploit Big Data through analysis.  However, this is challenging due to the volume, velocity and variety of content that must be analyzed.  Image-only files are useless in data analysis.  Therefore, the all-important first step in exploiting all of your content is to apply indexes so that computer systems can properly begin to understand the information.

  1. Reporting:  Business executives are generally paid good money to make important decisions about the business, and these decisions are often based on reports.  These reports are often compiled from various data sources such as spreadsheets, interviews with customers or employees and possibly other documents.  Gathering data this way is not only time-consuming but also problematic because the data is often presented in an inconsistent manner.  For this reason you will want to use a Big Data system such as Splunk, where business executives can have instant access to real-time data sets from various sources, presented through dashboards or graphics that clearly show trends or other information pertinent to the decision-making process.
  2. Predictive analytics:  Historical reporting is fantastic for analyzing information, yet that information describes the past.  Imagine if you could proactively determine a trend or predict future events with solid data.  This is a major benefit of Big Data aggregation.  For example, given the right set of data you can probably predict whether mortgage interest rates will increase or decrease in a particular geography.  You would use statistics such as current available housing inventory, real-time unemployment rates and possibly the latest transactions within a certain time period.  The same Big Data aggregation concept applies to a completely different application of predictive analytics: the field of Healthcare.  If you can feed enough Index information into a Big Data solution, then healthcare providers can narrow down the proper diagnosis for people with illnesses much more quickly, and this can enrich people’s lives.
  3. Business process improvement:  There is always room for improvement, especially in the business world, and the most effective way to bring about positive improvement is through visibility into the business processes themselves.  Once you understand a process, you can apply metrics to it, such as the time needed to complete a task or the steps needed to finish a project.  A Big Data solution such as Splunk is an ideal complement to efficiency-improving technologies: ABBYY Data Capture, with a tangible return on investment through reduced labor costs associated with manual data entry, and Box, with highly effective collaboration where enterprise workers can get work done quickly and be more effective overall in their business activities.  Deploying a Big Data analysis system together with Data Capture efficiency and secure Collaboration on mobile is certainly one way to achieve better process improvement, but just imagine all the possibilities that open up with the data itself.  And it all starts by Exploiting Big Data with Indexes.

Features:
  • Automatic indexing of relevant data
  • Full-page recognition for complete index
  • Touch indexing for structured data extraction

Benefits:
  • Reduces costs associated with manual data entry
  • Ability to analyze all data sets
  • Offers ease of use for high user adoption

Solution Description:   This solution might sound elaborate and complicated, but it’s actually straightforward and logical.  There are three basic concepts: Index Creation (ABBYY technology), Index Analysis (Splunk) and secure Image Storage (Box).  We will use several technologies to create indexes for various reasons, and then we will feed all these indexes to our Big Data system so that the software can do what it does best.  The Big Data system allows administrators to easily aggregate all this data and then create dashboards, reports and other useful business intelligence tools.  So the process is quite logical:  Capture indexes for all sources, including existing databases, paper documents and, of course, images, and send all these indexes to Big Data.  Then send the images to Box for safe storage, easy access and effective collaboration.
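
As a rough illustration of the "capture indexes and send them to Big Data" step, the sketch below writes each index record as one line of key="value" pairs to a file that a Splunk file or directory input could then monitor.  Everything named here is an assumption for illustration: the field names (doc_id, doc_type, invoice_total), the output file name and the two helper functions are hypothetical and are not part of any ABBYY or Splunk API.  Real index fields would come from whatever your capture software exports.

```python
import os
import tempfile
from datetime import datetime, timezone


def format_index_record(fields, timestamp=None):
    """Render one index record as a single line of key="value" pairs."""
    ts = (timestamp or datetime.now(timezone.utc)).astimezone(timezone.utc)
    parts = [ts.strftime("%Y-%m-%dT%H:%M:%SZ")]
    for key in sorted(fields):
        # Escape embedded double quotes so each value stays one quoted token.
        value = str(fields[key]).replace('"', '\\"')
        parts.append(f'{key}="{value}"')
    return " ".join(parts)


def append_index_record(path, fields, timestamp=None):
    """Append one record to the file the indexer is assumed to monitor."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(format_index_record(fields, timestamp) + "\n")


if __name__ == "__main__":
    # Hypothetical index fields for one captured invoice.
    record = {"doc_id": "INV-0001", "doc_type": "invoice", "invoice_total": "125.40"}
    out_path = os.path.join(tempfile.gettempdir(), "capture_indexes.log")
    append_index_record(out_path, record)
    print(format_index_record(record))
```

A timestamp followed by key="value" pairs is a plain-text layout that Splunk can generally break into fields at search time without extra configuration, which is why it is used here; a CSV or database export from your capture tool would serve the same purpose.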


System Requirements:

Note:  This is a software developer and systems integrator solution.  We are using Splunk as our Big Data aggregator in this solution because it is so easy to configure, yet extremely effective.  Splunk can only perform well when you provide lots of “Index” information.  As the first screen print listed below shows, “Index” is at the core of the architecture, and Big Data cannot even begin analyzing different data sets without it.

  1. Box account
  2. ABBYY FlexiCapture for Automatic Data Capture
  3. ABBYY Recognition Server for Full-Page recognition
  4. ABBYY TouchTo for touch indexing
  5. Splunk Big Data software (free download)


Configuration Steps (Complexity = Moderate to Involved):

  1. Start Splunk and choose Add data
  2. Depending on the output type and format of your indexes, select the proper Splunk Add Data function
  3. Now connect Splunk to your data source(s)
    1. For example, for Recognition Server you might choose ‘From files or directories’ and, as an option, Preview data before indexing
    2. …and for FlexiCapture you might choose ‘Any other data…’ then ‘Consume data from databases’ because you output to a SQL database directly
    3. …and for TouchTo you might choose ‘A file or directory of files’
  4. After connecting all the index data sources to Splunk it is advisable to review the Splunk Manager options to familiarize yourself with all the various settings and configurations available
  5. Now that you have configured Splunk to utilize Indexes from your various Data Capture and Conversion sources, you will want to gather information contained within Box.  To do this, a software developer would utilize the Box API (Application Programming Interface) to import data such as tags, get comments or get file info
  6. A complete list of all the Splunk Indexes can be viewed in Manager
  7. Once all the indexes have been aggregated within Splunk, organizations can truly realize the benefits of Big Data with detailed reporting, predictive analytics and/or improved business processes via simple visual tools such as dashboards
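
For step 5, a developer might start from something like the following sketch, which prepares Box API requests for a file’s info (including its tags) and its comments.  It assumes the Box v2 REST endpoints for file info and file comments and an OAuth 2.0 bearer token; the file ID and token are placeholders and the helper function names are hypothetical.  Check the current Box API documentation before relying on any of it.

```python
import json
import urllib.request

# Assumed base URL of the Box v2 REST API.
BOX_API_BASE = "https://api.box.com/2.0"


def build_file_info_request(file_id, access_token, fields=("name", "tags")):
    """Prepare (but do not send) a GET request for file info, including tags."""
    url = f"{BOX_API_BASE}/files/{file_id}?fields={','.join(fields)}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )


def build_comments_request(file_id, access_token):
    """Prepare (but do not send) a GET request for the comments on a file."""
    url = f"{BOX_API_BASE}/files/{file_id}/comments"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )


def fetch_json(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    request = build_file_info_request("FILE_ID", "ACCESS_TOKEN")
    print(request.get_method(), request.full_url)
```

With a valid token, passing either request to fetch_json would return the decoded response.  Those values could then be written out as index records for Splunk to consume alongside the indexes from your capture sources.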


Associated screen prints on this solution:

1.  Splunk architecture with Index at the core

2.  Start Splunk

3.  Add data

4.  Splunk add From files or directories

5.  Data preview

6.  Any other data…

7.  Consume data from databases

8.  Splunk add A file or directory of files

9.  Splunk Manager

10.  Splunk Indexes Manager

11.  Splunk dashboard

What do you think?  “Big Data” is still a relatively new idea and many use cases are just coming to light.  How can you imagine using Big Data?  The possibilities to innovate in this area are tremendous.  Do you have a story to tell?

#data #tag #indexes #indexing #box #tags #metadata #ScanningandCapture #BigData #tagging