How Smart Can Analytics Get?
According to Forrester, the average company's data warehouse today is somewhere between one and 10 terabytes in size. So what happens to analytics over the next decade, as the norm moves toward the petabyte range? How will this hunger for precise analysis, combined with a flood of raw new data, set the stage for powerful, advanced analytics outcomes?
New architectures for data and logic processing are ushering in a game-changing era of advanced analytics.
These new approaches support massive data sets to produce powerful insights and analysis -- yet with unprecedented price-performance. As we enter 2010, enterprises are including more forms of diverse data into their business intelligence (BI) activities. They're also diversifying the types of analysis that they expect from these investments.
At the same time, more kinds and sizes of companies and government agencies are seeking to deliver ever more data-driven analysis for their employees, partners, users, and citizens. It boils down to giving more communities of participants what they need to excel at whatever they're doing. By putting analytics into the hands of more decision makers, huge productivity wins across entire economies become far more likely.
But such improvements won't happen if the data can't effectively reach the application's logic, if the systems can't handle the massive processing scale involved, or the total costs and complexity are too high.
In this sponsored podcast discussion we examine how convergence of data and logic, of parallelism and MapReduce -- and of a hunger for precise analysis with a flood of raw new data -- are all setting the stage for powerful advanced analytics outcomes.
To help learn how to attain advanced analytics and to uncover the benefits from these new architectural activities for ubiquitous BI, we're joined by Jim Kobielus, senior analyst at Forrester Research, and Sharmila Mulligan, executive vice president of marketing at Aster Data Systems. The discussion is moderated by BriefingsDirect's Dana Gardner, principal analyst at Interarbor Solutions.
Listen to the podcast (51:12 minutes).
Here are some excerpts:
Jim Kobielus: Advanced analytics is focused on how to answer questions about the future. It's what's likely to happen -- forecast, trend, what-if analysis -- as well as what I like to call the "deep present," really current streams for complex event processing.
What's streaming in now? And how can you analyze the great gushing streams of information that are emanating from all your applications, your workflows, and from social networks?
Advanced analytics is all about answering future-oriented, proactive, or predictive questions, as well as current streaming, real-time questions about what's going on now. Advanced analytics leverages the same core features that you find in basic analytics -- all the reports, visualizations, and dashboarding -- but then takes it several steps further. ...
What Forrester is seeing is that, although the average data warehouse today is in the 1-10 terabyte range for most companies, we foresee the average warehouse size going, in the middle of the coming decade, into the hundreds of terabytes.
In 10 years or so, we think it's possible, and increasingly likely, that petabyte-scale data warehouses or content warehouses will become common. It's all about unstructured information, deep history, and historical information. A lot of trends are pushing enterprises in the direction of big data. ...
We need to rethink the platforms with which we're doing analytical processing. Data mining is traditionally thought of as being the core of advanced analytics. Generally, you pull data from various sources into an analytical data mart.
That analytical data mart is usually on a database that's specific to a given predictive modeling project, let's say a customer analytics project. It may be a very fast server with a lot of compute power for a single server, but quite often what we call the analytical data mart is not the highest performance database you have in your company. Usually, that high performance database is your data warehouse.
As you build larger and more complex predictive models you quickly run into resource constraints on your existing data-mining platform. So you have to look for where you can find the CPU power, the data storage, and the I/O bandwidth to scale up your predictive modeling efforts.
But [there is] another challenge, which is advanced analytics producing predictive models. Those predictive models increasingly are deployed in-line to transactional applications to provide some basic logic and rules that will drive such important functions as "next best offer" being made to customers based on a broad variety of historical and current information.
How do you inject predictive logic into your transactional applications in a fairly seamless way? You have to think through that, because, right now, quite often analytical data models, predictive models, in many ways are not built for optimal embedding within your transactional applications. You have to think through how to converge all these analytical models with the transactional logic that drives your business.
Sharmila Mulligan: What we see with customers is that the advanced analytics needs and the new generation of analytics that they are trying to do is driving the need for a new data platform.
What you've got is a situation where enterprises want to be able to do more scalable reporting on massive data sets with very, very fast response times. On the reporting side, in terms of the end result to the customer, it is similar to the type of report they are trying to achieve, but the difference is that the quantity of data that they're trying to get at, and the amount of data that these reports are filling up is far greater than what they had before.
That's what's driving a need for a new platform underneath some of the preexisting BI tools that are, in themselves, good at reporting, but what the BI tools need is a data platform beneath them that allows them to do more scalable reporting than you could do before. ...
Previously, the choice of a data management platform was based primarily on price-performance, being able to effectively store lots of data, and get very good performance out of those systems. What we're seeing right now is that, although price performance continues to be a critical factor, it's not necessarily the only factor or the primary thing driving their need for a new platform.
What's driving the need now, and one of the most important criteria in the selection process, is the ability of this new platform to be able to support very advanced analytics.
Customers are very precise in terms of the type of analytics that they want to do. So, it's not that a vendor needs to tell them what they are missing. They are very clear on the type of data analysis they want to do, the granularity of data analysis, the volume of data that they want to be able to analyze, and the speed that they expect when they analyze that data.
They are very clear on what their requirements are, and those requirements are coming from the top. Those new requirements, as it relates to data analysis and advanced analytics, are driving the selection process for a new data management platform.
There is a big shift in the market, where customers have realized that their preexisting platforms are not necessarily suitable for the new generation of analytics that they're trying to do.
We see the push toward analysis that's really more near real-time than what they were able to do before. This is not a trivial thing to do when, it comes to very large data sets, because what you are asking for is the ability to get very, very quick response times and incredibly high performance on terabytes and terabytes of data to be able to get these kind of results in real-time.
Kobielus: Let's look at what's going to be a true game changer, not just for business, but for the global society. It's a thing called "social network analysis."
It's predictive models, fundamentally, but it's predictive models that are applied to analyzing the behaviors of networks of people on the Web, on the Internet, Facebook and Twitter, in your company, and in various social network groupings, to determine classification and clustering of people around common affinities, buying patterns, interests, and so forth.
As social networks weave their way into not just our consumer lives, but our work lives, our life lives, social network analysis -- leveraging all the core advanced analytics of data mining and text analytics -- will take the place of the focus group.
You're going to listen to all their tweets and their Facebook updates and you're going to look at their interactions online through your portal and your call center. Then, you're going to take all that huge stream of event information -- we're talking about complex event processing (CEP) -- you're going to bring it into your data warehousing grid or cloud.
You're also going to bring historical information on those customers and their needs. You're going to apply various social network behavioral analytics models to it to cluster people into the categories that make us all kind of squirm when we hear them, things like "yuppie" and "Generation X" and so forth.
Social network analysis becomes more powerful as you bring more history into it -- last year, two years, five years, 10 years worth of interactions -- to get a sense for how people will likely respond ... to new offers, bundles, packages, campaigns and programs that are thrown at them through social networks.
If you can push not just the analytic models, but to some degree bring transactional applications, such as workflow, into this environment to be triggered by all of the data being developed or being sifted by these models, that is very powerful.
Mulligan: One of the biggest issues that the preexisting data pipeline faces is that the data lives in a repository that's removed from where the analytics take place. Today, with the existing solutions, you need to move terabytes and terabytes of data through the data pipeline to the analytics application, before you can do your analysis.
There's a fundamental issue here. You can't move boulders and boulders of data to an application. It's too slow, it's too cumbersome, and you're not factoring in all your fresh data in your analysis, because of the latency involved.
One of the biggest shifts is that we need to bring the analytics logic close to the data itself. Having it live in a completely different tier, separate from where the data lives, is problematic. This is not a price-performance issue in itself. It is a massive architectural shift that requires bringing analytics logic to the data itself, so that data is collocated with the analytics itself.
MapReduce plays a critical role in this. It is a very powerful technology for advanced analytics and it brings capabilities like parallelization to an application, which then allows for very high-performance scalability.
What we see in the market these days are terms like "in-database analytics," "applications inside data," and all this is really talking about the same thing. It's the notion of bringing analytics logic to the data itself. ...
In the marriage of SQL with MapReduce, the real intent is to bring the power of MapReduce to the enterprise, so that SQL programmers can now use that technology. MapReduce alone does require some sophistication in terms of programming skills to be able to utilize it. You may typically find that skill set in Web 2.0 companies, but often you don't find developers who can work with that in the enterprise.
What you do find in enterprise organizations is that there are people who are very proficient at SQL. By bringing SQL together with MapReduce what enterprise organizations have is the familiarity of SQL and the ease of using SQL, but with the power of MapReduce analytics underneath that. So, it's really letting SQL programmers leverage skills they already have, but to be able to use MapReduce for analytics. ...
One of the biggest requirements in order to be able to do very advanced analytics on terabyte- and petabyte-level data sets, is to bring the application logic to the data itself. Earlier, I described why you need to do this. You want to eliminate as much data movement as possible, and you want to be able to do this analysis in as near real-time as possible.
What we did in Aster Data 4.0 is just that. We're allowing companies to push their analytics applications inside of Aster's MPP database, where now you can run your application logic next to the data itself, so they are both collocated in the same system. By doing so, you've eliminated all the data movement. What that gives you is very, very quick and efficient access to data, which is what's required in some of these advanced analytics application examples we talked about.
What kind of applications can you push down into the system? It can be any app written in Java, C, C++, Perl, Python, .NET. It could be an existing custom application that an organization has written and that they need to be able to scale to work on much larger data sets. That code can be pushed down into the apps database.
It could be a new application that a customer is looking to write to do a level of analysis that they could not do before, like real-time fraud analytics, or very deep customer behavior analysis. If you're trying to deliver these new generations of advanced analytics apps, you would write that application in the programming language of your choice.
Kobielus: In this coming decade, we're going to see predictive logic deployed into all application environments, be they databases, clouds, distributed file systems, CEP environments, business process management (BPM) systems, and the like. Open frameworks will be used and developed under more of a service-oriented architecture (SOA) umbrella, to enable predictive logic that's built in any tool to be deployed eventually into any production, transaction, or analytic environment.
Dana Gardner is president and principal analyst at Interarbor Solutions, which tracks trends, delivers forecasts and interprets the competitive landscape of enterprise applications and software infrastructure markets for clients. He also produces BriefingsDirect sponsored podcasts. Follow Dana Gardner on Twitter. Disclosure: Aster Data Systems sponsored this podcast.