Blog Mining Gets Real

The torrent of facts, data, figures and insights that blogs deliver daily is random and chaotic, yet immensely valuable in the right context. For companies committed to getting to the truth of where they stand with their prospects, customers, suppliers and many other stakeholders, blogs are becoming the medium of choice.

It is one thing to cull through blogs, pull out snippets and circulate them around the company; it is quite another to take the time to mine the blogs for usable and measurable competitive, distribution, product development, service, and support insights. Capturing information from the chaos of all that data is the challenge. That’s why blog mining is starting to get real.

Having watched this area for some time, I see an intersection. On the one hand are text mining, data mining, linguistic analysis, statistical analysis, and latent semantic indexing techniques (Google, for example, uses latent semantic indexing in indexing web pages). On the other is the glaring need to interpret blogs, which looks like a problem many marketing, sales, and service departments grapple with daily.

I decided to dive in and see what’s real and what’s not in this arena by visiting each vendor’s site and trying to find useful downloads that would give me the ability to complete unstructured content analysis from blogs on my laptop. What I found out is the basis of this article.

Business Intelligence’s Perfect Storm

What is immediately apparent is that every vendor with any type of text mining, data mining, linguistic or natural language processing capability is quickly announcing text mining and unstructured content applications. No doubt many business intelligence vendors can hardly wait for text and blog mining to take off. SAS and SPSS, two of the data mining powerhouses in the BI market, each have applications purpose-built for text mining.

While SPSS doesn’t offer a test drive of its application, called Clementine, SAS does offer a download of SAS Enterprise Miner, which requires that Base SAS or SAS/STAT be installed beforehand. Autonomy and Verity, two search engine vendors, have created text mining applications in addition to productizing their search technologies. Inxight Software and Stratify, Inc. specialize in text mining. SAP and IBM also have text mining applications. Other companies worth keeping track of in this area are Insightful Corporation and ClearForest Corporation.

Of all these companies, SAS’ downloads came closest to making it possible to analyze blog content without paying for an application. Had SPSS made Clementine available as a free trial download, analyzing a blog would have been quick to accomplish. SPSS does offer the complete SPSS 13.0 for Windows for download, yet there isn’t much in the way of text mining tools in that application.

No search for applications to quantify blogs would be complete without checking out Wolfram Research. They are the developers of the Mathematica series of quantitative analysis applications, and they offer a free trial of Mathematica CalcCenter 3, which supports a wide variety of algebraic, statistical, data analysis and report writing functions. The CalcCenter 3 download is available, yet its computational functions are disabled.

Natural Language Processing

There is much work being done in this area that is worth watching. First, there’s IBM and its significant research efforts in natural language processing, which you can read about here. Microsoft is also investing heavily in natural language processing, and you can see their research page here. The best-of-breed vendors in this arena show much potential for taking the unstructured data of blogs and building linguistically significant relationships. These include Attensity, which excels at taking unstructured data and not only quantifying it, but even monetizing the decisions surrounding the data.

Island Data uses natural language processing for organizing unstructured content in its Insight RT Suite. There’s also Centor, which got its start using natural language processing to find linguistic patterns in unstructured content from the automotive industry and has since branched into high tech. Attensity and Centor each require a server component, while Island Data is hosted.

MIT has the best download for natural language processing that is immediately available on the Web. It’s a creation of the MIT Media Lab: the ConceptNet Project, a shareware toolkit for handling natural language processing. While the kit is free, making it work takes much reading of documentation and loading of free prerequisite server components from the ConceptNet site.

Blog Mining as a Service

Yet another option for analyzing blogs comes from information services companies that monitor a select set of blogs and summarize them. In the case of Intelliseek, the site itself offers a set of tools worth checking out. Included in the May 19th refresh of this site are clickable trend charts (called the BlogPulse Trend Tool) and enhanced conversation tracking. To get a sense of how often your company is mentioned in blogs, use the Trend Charts to compare yourself to your industry or to competitors.
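For readers who want to see the idea behind a trend chart on their own collection of posts, here is a minimal sketch in Python. It is not how BlogPulse works internally; it simply counts, per month, how many posts mention each term. The function name `monthly_mentions` and the sample posts are hypothetical and exist only for illustration.

```python
import re
from collections import Counter
from datetime import date


def monthly_mentions(posts, terms):
    """Count, per month, the posts that mention each term.

    posts is a list of (date, text) pairs; matching is case-insensitive
    and on whole words only.
    """
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    counts = {term: Counter() for term in terms}
    for posted_on, text in posts:
        month = posted_on.strftime("%Y-%m")
        for term, pattern in patterns.items():
            if pattern.search(text):
                counts[term][month] += 1
    return counts


# Hypothetical sample data, invented for this sketch.
sample_posts = [
    (date(2005, 4, 3), "SAP announced a new CRM module this week."),
    (date(2005, 4, 18), "Our Siebel rollout is behind schedule."),
    (date(2005, 5, 2), "Comparing SAP and Siebel for mid-market CRM."),
    (date(2005, 5, 20), "Why crm projects fail."),
]

trend = monthly_mentions(sample_posts, ["SAP", "Siebel", "CRM"])
for term, by_month in trend.items():
    print(term, dict(sorted(by_month.items())))
```

Counting posts rather than raw occurrences keeps one long post from dominating a month, which is a design choice you may want to change for your own data.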

Comparing the last six months of mentions in blogs of SAP, Siebel and CRM produced the result shown here. Another company, Buzzmetrics, also tracks blog mentions of your company and has a methodology for measuring the impact of word-of-mouth influences on purchase decisions. Relatives of mine who work in the financial side of the movie industry say word-of-mouth is the single biggest influencer of ticket sales, so no doubt Buzzmetrics will see some business from the studios this summer.

What’s most interesting about techdirt is its open-door approach, which lets anyone submit stories for publication on the site. It’s more like an opinionated news feed than a collection of press releases, and it shows what a blog can become. See what the Harvard Business School had to say about techdirt here.

In my effort to find tools for analyzing blogs without spending a bundle, I found their site techdirt ci, which is short for competitive intelligence. The concept of techdirt ci is to scan blogs, web pages, and any other form of publicly available electronic information and then deliver to your company a personalized blog covering market information, competitive analysis and major news from your industry.


Like the unstructured content captured on Web forms that never really gets used, the raw data sets generated by blogs’ explosive growth are ones your company really can’t afford to ignore. Consider these recommendations for capitalizing on blogs:

  • Start a blind blog to capture what customers really think of your company.
  • Set up the necessary tools to analyze the blogs.
  • Take a competitor’s blog and analyze it using any of the tools mentioned here.
  • Check out at least once a week, if nothing else to watch the statistics of growth around blogging.
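As a starting point for the analysis recommended above, the simplest thing to try on a laptop is a count of the most frequent meaningful words in a post. The sketch below is in Python and assumes nothing about any vendor's product; the `top_terms` function, the short stop-word list, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is",
    "our", "for", "on", "with", "we", "it",
}


def top_terms(text, n=5):
    """Return the n most frequent words, skipping stop words and very short words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)


# Hypothetical text standing in for a competitor's blog post.
competitor_post = (
    "Our new support portal is live. Customers asked for faster support, "
    "and the portal delivers faster answers with pricing details for every plan."
)

print(top_terms(competitor_post, 3))
```

Run over several weeks of posts, a count like this gives a rough picture of what a competitor is emphasizing, though it is no substitute for the linguistic analysis the commercial tools perform.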

Bottom line: At the beginning of the year, many industry watchers considered blogs one of the top ten trends. It’s becoming very clear that blog mining is part of that mix.

Louis Columbus, a CRM Buyer columnist, is a former senior analyst with AMR Research. He recently completed the book Getting Results from Your Analyst Relations Strategies, which is available on
