Social network analysis has, in a real sense, been with us almost as long as we’ve been doing predictive analytics. Customer churn analysis is the killer app for predictive analytics, and it is inherently social. It’s long been known that individual customers don’t always make the churn decision (whether to renew or bolt to the competition) in isolation. As they run the continual calculus called loyalty in their heads and hearts, they’re receiving fresh feeds of opinion from their friends and families, following the leads of peers and influencers, and keeping a finger to the cultural wind. You could also make a strong case for social networking (i.e., individual behaviors spurred, shaped, and encouraged within communities) as a key independent variable driving cross-sell, up-sell, fraud, and other phenomena for which we’ve long built predictive models.
The other day, a Forrester client was asking me for educated guesses on how fast the average enterprise data warehouse (EDW) is likely to grow over the next several years, and as I was working through the analysis, I couldn’t avoid the conclusion that social network analysis—for predictive and other uses—will be an important growth driver (though not the entire story). I’d like to lay out my key points.
First off, I need to reiterate, per my blog post from last month, that social network analysis is much more than parsing a stream of tweets to see who’s flaming whom these days. At heart, it involves exploring the shifting web of relationships among people based on their profiles, interactions, and affinities.
Second, this definition clearly encompasses call-detail-record (CDR) analysis, which is a core telecommunications industry application of predictive analytics and data mining. We all know CDR analysis as the means by which carriers track our calls for billing, collection, usage monitoring, fraud detection, and other core operational requirements. Of course, CDRs also constitute a core data set that carriers leverage for sales, marketing, customer service, churn analysis, “friends and family” programs, and other key functions.
Third, CDRs are just one of many types of interaction, transaction, and behavioral records being leveraged by today’s online service providers, of which traditional telcos are just one category. Increasingly, customer-generated GPS and other geolocation data is becoming just as key for operational and predictive uses, especially for wireless carriers. Likewise, clickstream analysis is the lifeblood of personalization and customer experience optimization in Web 2.0 social networks, enterprise portals, clouds, and other online environments.
Fourth, CDRs, geolocation data, clickstreams, tweetstreams, audit log records, and other “event” data are beginning to flood into enterprise data warehouses (EDWs), where they are being aggregated for historical and predictive analysis—in other words, for social network analysis in the broader context discussed above. In fact, event data represents one of the most important new categories of information causing the EDW to balloon into the hundreds of terabytes and even petabytes. Another important new information category in the EDW is unstructured text. Some new information types—such as tweetstreams—straddle both categories: event data that is unstructured.
Fifth, today’s vanguard of petabyte-scale EDWs (the “outliers”) tends to cluster in particular verticals, most notably telecommunications and Web 2.0 pure-plays. In these verticals, which one should regard as the core of the new “cloud” paradigm, these EDWs are used primarily for CDR analysis, customer churn analysis, next best offer, online experience optimization, fraud detection, and other applications that rely on social network analysis.
Sixth, the growth of cloud computing in this decade, across all verticals, will create a huge demand for petabyte-scale EDWs to drive the social network analysis that is central to this way of doing business. The very large EDWs that today are vertical-specific outliers will, by the end of this decade, move into the horizontal, cross-industry mainstream. Where distributed analytical databases are concerned, we’re all skyrocketing toward Planet Petabyte.
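To make the CDR point concrete, here is a minimal Python sketch of the kind of relationship analysis described above. It is an illustration only, not any carrier's actual method: the CDR tuples, the subscriber names, the `churned` set, and both helper functions are hypothetical. It builds an undirected call graph from call records and computes, for each subscriber, the share of their contacts who have already churned, which is one simple "social" signal a churn model might use.

```python
from collections import defaultdict

# Hypothetical CDR rows: (caller, callee, duration_seconds)
cdrs = [
    ("alice", "bob", 120),
    ("alice", "carol", 45),
    ("bob", "carol", 300),
    ("dave", "alice", 60),
]

# Hypothetical set of subscribers who have already left the carrier
churned = {"bob", "carol"}


def build_call_graph(records):
    """Map each subscriber to the set of people they exchanged calls with."""
    graph = defaultdict(set)
    for caller, callee, _duration in records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def churned_contact_share(graph, churned_subscribers):
    """For each subscriber, the fraction of their contacts who have churned."""
    return {
        person: len(contacts & churned_subscribers) / len(contacts)
        for person, contacts in graph.items()
    }


graph = build_call_graph(cdrs)
shares = churned_contact_share(graph, churned)

for person in sorted(shares):
    print(f"{person}: {shares[person]:.2f} of contacts have churned")
```

At real CDR volumes the same join-and-aggregate logic would run inside the warehouse rather than in memory, which is part of why this kind of analysis pushes EDW sizes up.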
Now, to close the loop on EDW sizing, here is the rough order-of-magnitude I like to use on such questions. Generally, Forrester breaks out key EDW sizing metrics into the following areas: storage, loading, and usage concurrency. As a rough estimate, approximately 90 percent of deployed data warehouses have storage capacity (raw, uncompressed data) under 10 terabytes (TBs), have loading capacity less than 1 TB/hour, and usage concurrency under 100 users.
Generally, we foresee average EDW capacities across all industries doubling every 2-3 years throughout this decade, with the primary gating factors being the cost of storage and the efficiency of compression. In other words, it won’t be as fast as Moore’s Law (i.e., doubling every 18 months), but more like every 24-36 months. In the early years of this decade, the annual EDW-capacity growth rate will probably be less than 25 percent, but, with advances in storage, compression, and cloud technology/adoption, the annual growth rate will probably accelerate throughout the decade, reaching 200 percent by 2019. This is consistent with the “doubling every 24-36 months” average growth rate that I sketched out for the decade as a whole.
With those educated guesses and assumptions in mind, it’s plausible to forecast that, by the end of this decade, the average EDW will be between 10 and 40 times larger than it is now. That is, by 2020, 90 percent of EDWs will have a storage capacity in the 100s of TBs, with petabyte-scale EDWs common, 10 TB/hour loading the norm, and usage concurrency often in the 1,000s of concurrent queries/accesses.
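The arithmetic behind that range can be checked with a short Python sketch. It assumes a constant doubling period over a ten-year span, which is a simplification (the post expects the growth rate to vary over the decade), and the function names are my own, chosen for illustration.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` if capacity doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


def annual_rate(doubling_months: float) -> float:
    """Constant annual growth rate equivalent to the given doubling period."""
    return 2 ** (12 / doubling_months) - 1


for months in (24, 36):
    factor = growth_factor(10, months)
    rate = annual_rate(months) * 100
    print(f"doubling every {months} months: "
          f"{factor:.1f}x over 10 years, about {rate:.0f}% per year")
```

Under these assumptions, doubling every 24 months compounds to 32x over ten years and doubling every 36 months to roughly 10x, which brackets the low end and most of the range of the 10-to-40-times forecast.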
Clearly, cloud-based storage is key to realization of this forecast. I’m working on a forthcoming Forrester report addressing the virtualization of EDWs into the cloud, and storage virtualization is a core technology. That report will be published in the next quarter or so. I’d love to hear your thoughts on all this.
James Kobielus also blogs at http://blogs.forrester.com/business_process/.

Interesting perspective. Some additional thoughts if I may...
I agree with your analogy of Social Networking Data as CDR events, but I think this is where the comparison must stop. The fundamental issue really is the quality of the data associated with Social Networking Events (let's call 'em SNEs) versus Call Detail Records (CDRs).
SNEs are by their very nature devoid of precision and semantic quality, attributes that are crucial if one is to develop strategies and make informed business decisions by looking at this data (amongst other data sources, of course). SNEs are very volatile and devoid of real content. It may, for instance, be interesting as a side note to notice that population X is connected to population Y at a point in time, under a specific context (say, a U2 concert). But the speed at which the connections are made and unmade is just too quick for the Enterprise to react in any meaningful way.
CDRs, however, have lasting power. When you place a call to your friend in Chicago, the call is made, revenue is generated, and a further pattern is reinforced. A business decision (a friends and family package, for instance) can be developed from this "event".
I guess, thinking about it while writing this post, the difference is that an SNE connection is made freely and requires no spending on the part of the generating parties, whereas one has to pay money to place a CDR-generating call. Inherently, there is more information value in the latter event than in the former.
Where I do agree with you is that statistically, we are going to have to store and look at A LOT of SNEs to find trends, and that will indeed fuel our collective voyage to Planet Petabyte.
Philippe
As far as the growth of information goes, social media is just one contributor. Consider the other data that both current (e.g. smartphones) and new (e.g. healthcare devices) wireless electronic devices will be posting to web data stores in the future (what device can't make room for a network connection these days?). While all that data may not be publicly available, it will need to be stored somewhere. One example to consider: when the boomers get all the home healthcare devices they will need to monitor their health, all those devices will likely be reporting data to their doctors and/or directly to their online health records. It would be fun to consider a variety of "data generators" and determine which will require the most data storage in the future.
-Zach What's Your Social Media Information Strategy? http://bit.ly/3Zmoav