There is a whole lot of information out there about the "value" of Content Analytics, Autoclassification, etc. (e.g. AIIM's recent "Using Analytics" whitepaper), but I'm having difficulty finding anything, anywhere, regarding the skills & education required to actually DO it.
Anyone out there have some advice to give? I'd really like to extend my Records Management skills into the IT/CA realm, and any advice on where to start would be a huge help! If there are Applied Content Analytics training programs out there, that'd be great! But even knowing which skills I should obtain (e.g. SQL database administration, programming languages to learn, or statistical skills) would be a good start!
Good question! If you are looking for courses, you may want to look at https://www.mooc-list.com/categories/statistics-data-analysis, which lists online courses across technologies and concepts. I did a MOOC (Massive Open Online Course) earlier this year, and it's a good way to learn from top experts virtually.
As regards unstructured information, the three key areas in my opinion are:
1. Text Analytics (curation). There are free online courses from Stanford, as well as "Text Mining and Analytics" from the University of Illinois at Urbana-Champaign on Coursera, and plenty of open-source material for Python (python.org). I recommend teaching yourself using Python; there is plenty free on the web, or buy a cheap O'Reilly book.
2. Natural Language Processing (understanding). Python has the NLTK library (Natural Language Toolkit, NLTK 3.0 documentation) and there are a lot of books on this.
3. Machine Learning (prediction). This can include using neural networks (e.g. word2vec), topic modelling, etc. to surface latent patterns in text. For auto-classification or auto-categorization (big difference), check out the Pointwise Mutual Information (PMI) measure as an algorithm; it is great for generating discriminatory clues.
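To give a feel for how Pointwise Mutual Information can surface discriminatory clues, here is a minimal sketch in plain Python. The tiny labelled sample, the category names and the `pmi_scores` helper are all made up for illustration; a real project would use a proper tokenizer, far more documents and some smoothing for rare terms.

```python
import math
from collections import Counter

def pmi_scores(docs):
    """Compute PMI(term, label) = log2(P(term, label) / (P(term) * P(label))).

    docs is a list of (label, text) pairs; probabilities are estimated
    from document counts (a term counts once per document).
    """
    n_docs = len(docs)
    label_counts = Counter()
    term_counts = Counter()
    joint_counts = Counter()
    for label, text in docs:
        terms = set(text.lower().split())  # naive whitespace tokenization
        label_counts[label] += 1
        for term in terms:
            term_counts[term] += 1
            joint_counts[(term, label)] += 1

    scores = {}
    for (term, label), joint in joint_counts.items():
        p_joint = joint / n_docs
        p_term = term_counts[term] / n_docs
        p_label = label_counts[label] / n_docs
        scores[(term, label)] = math.log2(p_joint / (p_term * p_label))
    return scores

# Hypothetical labelled sample, invented for this example.
docs = [
    ("invoice", "payment due invoice total"),
    ("invoice", "invoice number payment terms"),
    ("contract", "agreement between parties terms"),
    ("contract", "parties sign agreement"),
]
scores = pmi_scores(docs)

# Print the strongest clues first.
for (term, label), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {label:10s} {score:.2f}")
```

A term that appears only in one category (like "payment" here) gets a positive score for that category, while a term spread evenly across categories (like "terms") scores around zero, so it is not much use as a clue. Note that a word seen in a single document also scores highly, which is why a minimum frequency cut-off is usually worth adding.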
Knowledge Organization Systems (KOS) such as thesauri, taxonomies, authority lists and rules are key to mix with the statistical approaches: hybrid is best!
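As a rough illustration of what "hybrid" can mean in practice, the sketch below combines a controlled vocabulary lookup with statistical clue weights. Everything in it is hypothetical: the mini-thesaurus, the category names, the clue weights (imagine they came from a statistical measure run on a labelled sample) and the `kos_weight` parameter are assumptions for the example, not a recommended configuration.

```python
from collections import Counter

# Hypothetical mini-thesaurus: variant term -> preferred term (the KOS part).
THESAURUS = {
    "bill": "invoice",
    "invoices": "invoice",
    "agreement": "contract",
    "contracts": "contract",
}

# Hypothetical categories, named after the preferred terms.
CATEGORIES = {"invoice", "contract"}

# Hypothetical statistical clues: term -> (category, weight).
CLUES = {
    "payment": ("invoice", 1.0),
    "total": ("invoice", 1.0),
    "parties": ("contract", 1.0),
    "sign": ("contract", 1.0),
}

def classify(text, kos_weight=2.0):
    """Score categories using thesaurus hits plus statistical clues."""
    scores = Counter()
    for token in text.lower().split():
        # KOS evidence: map variants to the preferred term.
        preferred = THESAURUS.get(token, token)
        if preferred in CATEGORIES:
            scores[preferred] += kos_weight
        # Statistical evidence: add the clue weight, if any.
        if token in CLUES:
            label, weight = CLUES[token]
            scores[label] += weight
    if not scores:
        return None  # no evidence either way
    return scores.most_common(1)[0][0]

print(classify("Please pay the bill total"))
print(classify("both parties sign here"))
```

The thesaurus catches vocabulary a human already knows matters ("bill" means "invoice"), while the statistical clues pick up patterns nobody wrote down. Weighting the KOS hits higher is just one possible design choice.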
Put these three components together and you can also build apps that some marketers are calling 'insight engines' or 'cognitive search', as they augment human thinking and mimic, in a very simple and limited way, how we think. Because they can use more information than we can read, they can suggest what we may miss, or what we look at but don't see because we are biased.
You may find my presentations on SlideShare of interest (Paul Cleverley's Presentations on SlideShare); some of them cover text analytics and it's all free.
If you would like a chat to get started, I'm more than happy to share what I know with the community from an academic perspective.