Language is Not Just a Jumbled Bag of Words: Why Natural Language Processing Makes a Difference in Content Analytics

By Johannes Scholtes posted 03-27-2012 07:43


State-of-the art text analysis supports multiple languages, which is critical when investigations go global and involve collections of information in various languages. In such scenarios, the technology obviously adapts to differences in character sets and words, but the tools also need to incorporate statistics and linguistic properties (i.e., conjunction, grammar, sentiments or meanings) of a language in order to achieve acceptable performance. This Natural Language Processing can dramatically influence the insight that is drawn from text.

However, some text analytics products take a so-called “bag of word” (BOW) approach, in which all words (maybe with the exception of a list of high frequency noise words) are dumped into a mathematical model, without any additional knowledge or interpretation of linguistic patterns and properties such as word order (“a good book” versus “book a good”), synonyms, spelling and syntactical variations, co-references and pronouns resolution or negations.

This “bag of words” approach takes simplicity one step too far. Here is why:

There are special issues that one has to take into account when applying, for instance, entity-, fact-, event- and concept extraction techniques in text mining and where Natural Language Processing can make the difference:

  • Variant Identification and Grouping: It is sometimes needed to recognize variant names as different forms of the same entity giving accurate entity counts as well as the location of all appearances of a given entity. For example, one needs to recognize that the word "Smith", in the example, refers to the "Joe Smith" identified earlier and therefore groups them together as aliases of the same entity.
  • Normalization: Normalizes entities such as dates, currencies, and measurements into standard formats, taking the guesswork out of the metadata creation, search, data mining, and link analysis processes. An example would be good here…
  • Entity Boundary Detection: Will the technology consider “Mr. and Ms. John Jones” as  one or two entities? And what will the processor consider to be the start and end of an excerpt like,  “VP John M.P. Kaplan-Jones, Ph.D. M.D.”?

Such basic normalizations will not only dramatically reduce the size of the data set, it will also result in better data analysis and visualization: entities that would not be related without normalization can be the missing link between two datasets especially if they are written differently in different parts of the data set or if they are not recognized as being a singular or plural entity properly.

In addition, one of the other problems in the discovery and identification of entities, facts, events and concepts, is the resolving of the so called anaphora and co-references. This is the linguistic problem to associate pairs of linguistic expressions that refer to the same entities in the real world.

Consider the following text:

“A man walks to the station and tries to catch the train. His name is John Doe. Later he meets his colleague, who has just bought a card for the same train. They work together at the Rail Company as technical employees and they are going to a meeting with colleagues in New York.”

The text contains various references and co-references. Various anaphora and co-references will have to be disambiguated before it is possible to fully understand and extract the more complex patterns of events. The following list shows examples of these (mutual) references:

  • Pronominal Anaphora: he, she, we, oneself, etc.
  • Proper Name Co-reference: For example, multiple references to the same name.
  • Apposition: the additional information given to an entity, such as “John Doe, the father of Peter Doe”.
  • Predicate Nominative: the additional description given to an entity, for example “john Doe, who is the chairman of the soccer club”.
  • Identical Sets: A number of reference sets referring to equivalent entities, such as “Giants”, “the best team”, and the “group of players” which all refer to the same group of people.


There are various ways of approaching these problems: (i) with an in-depth linguistic analysis of a sentence, or (ii) using a large already-annotated corpus. Both techniques have their advantages and disadvantages. There is still a lot of research required in this area in the coming years to improve the quality of these types of analyses, but there are already many reliable techniques to resolve co-references and anaphora.

Now, if all these natuaral language processing techniques are applied one will be able to:

  • Extract 2-4x more (and better) entities, facts, events and concepts than other providers that use, for instance, bag of words technology which completely ignores linguistic patterns, negations, pronouns and co-references.
  • Increase overall extraction quality easily with 20-30% up to 95+% recall, especially due to the proper interpretation of negations and better boundary detection.
  • Provide a smaller and better set for data analysis such as the derivation of relationship networks and correlation patterns between custodians in social networks and email.
  • Create much better training sets for machine learning algorithms such as those used for machine assisted review or predictive coding.

Language is not a jumbled bag of words; ignoring simple linguistic structures such as synonyms, spelling and syntax variations, co-references and pronouns resolution or negations will result in technology that will probably never exceed 60-65% recall and precision. This stunted performance is caused by the simple fact that lots of relevant information is ignored or wrongly used to build training sets for machine learning. To end users, 60-65% is only 15% off from random selection, which is often interpreted as unreliable behavior for eDiscovery, Governance, Enterprise Information Archiving and other Information Management initiatives. Yet surprisingly, many eDiscovery-related software products do not support Natural Language Processing, and therefore, they completely ignore relevant linguistic patterns like those described in here.

Of course, there are a few exceptions; some eDiscovery and information management products do offer Natural Language Processing as part of their text analysis.

For text mining, a very in-depth analysis is often not necessary; a reasonably light analysis can be sufficient to identify the most important elements of a sentence: the subject clause, the verb clause, potential proper nouns, references, and other relationships. In many cases finite-state parsers or shallow parsers can be used with the support of dictionaries. These analyses are also commonly known as a part-of-speech (POS) analysis.

It is not that hard to make these systems better and more reliable. A few good tips are listed here!

#NaturalLanguageProcessing #e-discovery #ElectronicRecordsManagement #SharePoint #Content Analytics #Text-Mining