I had the good fortune of participating in day one of the *IE BI and Predictive Analytics Summit last Thursday in Chicago. With over 300 attendees, the 2012 version was considerably larger than the one I attended two years ago and, in my view, better as well. The insights on the conduct of data science in large organizations were especially informative.

Digvijay Lamba of WalmartLabs discussed a system for using big data to deliver “unexpected insights” for the very lucrative Walmart Halloween season. Lamba was the first of many speakers to emphasize domain expertise as a critical data scientist skill. To help close the gap between business and tech/analytics, his group has articulated a “social genome” consisting of products, people, locations, events, and interests. Idea dashboards are generated by cross-classifying the genome taxonomy with data sources that include transactions, web data, social media, blogs, etc.
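To make the cross-classification concrete, here is a minimal sketch of how taxonomy-by-source idea cells might be enumerated. The facet and source names follow the talk, but everything else is my own illustrative assumption, not WalmartLabs’ implementation:

```python
# A minimal sketch of the cross-classification idea -- facet and source
# names come from the talk; the rest is an illustrative assumption.
from itertools import product

# Facets of the "social genome" taxonomy as described
genome = ["products", "people", "locations", "events", "interests"]

# Data sources to cross-classify against
sources = ["transactions", "web", "social_media", "blogs"]

# Each (facet, source) cell is a candidate idea dashboard,
# e.g. ("events", "social_media") -> Halloween chatter by event
for facet, source in product(genome, sources):
    print(f"dashboard: {facet} x {source}")
```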

Robin Glinton and Herman Asorey of Sears Holdings’ data science center of excellence eat their own dog food with a system that helps manage their Operational Data Engine, which supports thousands of analytics users on a large, massively parallel Teradata platform. Starting with over a hundred system performance metrics, SH deploys dimension-reducing principal components analysis to distill key performance indicators, among them parallel efficiency, gating efficiency, burn rate, and time expansion. The DS group then creates segments using techniques like k-means clustering, and examines KPI trends against features such as the volatility and momentum so important in financial services. With support vector machines and neural nets as their essential classification engines, the presenters note off-the-charts efficiency improvements from the effort.
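For a feel of that pipeline, here is a minimal sketch of the PCA-then-cluster-then-classify sequence in scikit-learn. The data, component count, and parameters are illustrative assumptions, not Sears Holdings’ actual system:

```python
# A minimal sketch of the described pipeline -- shapes and parameters
# are assumptions, not Sears Holdings' code.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))     # 500 observations x ~100 raw system metrics

# 1. Distill the raw metrics into a handful of KPI-like components
X_std = StandardScaler().fit_transform(X)
kpis = PCA(n_components=4).fit_transform(X_std)   # e.g. parallel efficiency, ...

# 2. Segment observations in the reduced KPI space
segments = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(kpis)

# 3. Classify new observations into segments (SVM as the classification engine)
clf = SVC(kernel="rbf").fit(kpis, segments)
print(clf.predict(kpis[:3]))
```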

Anne Hale of Pfizer was quick to acknowledge that her work on customer segmentation for company drugs involves no big data. Years of primary research on segmenting and predicting the potential for Pfizer products have led her to reject most attitudinal measures in favor of behavioral ones. And the linkage hypotheses that relate company behaviors to leading indicators and, in turn, to physician intent to prescribe Pfizer drugs have taken her down the path of the same simultaneous equation models (SEMs) taught in econometrics courses. Indeed, I hadn’t seen “path analysis” since grad school 30 years ago, when I rejected the technique as too grand. Hale proves me wrong: SEMs have been a successful staple of her marketing work at Pfizer for decades.
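To give a flavor of such a linkage path model, here is a minimal sketch using the semopy package on simulated data. The variable names, effect sizes, and two-link structure are my assumptions for illustration, not Hale’s actual model:

```python
# A minimal path-model sketch, assuming the semopy package and made-up
# variables -- behavior -> leading indicator -> intent to prescribe.
import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(1)
n = 200
behavior = rng.normal(size=n)                    # company behaviors (e.g. detailing)
leading = 0.6 * behavior + rng.normal(size=n)    # leading indicator (e.g. awareness)
intent = 0.5 * leading + rng.normal(size=n)      # physician intent to prescribe

data = pd.DataFrame({"behavior": behavior, "leading": leading, "intent": intent})

# Two linked structural equations, estimated simultaneously
desc = """
leading ~ behavior
intent ~ leading
"""
model = semopy.Model(desc)
model.fit(data)
print(model.inspect())                           # path coefficients and standard errors
```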

Stanford-trained economist and Chief Data Scientist of Accretive Health, Scott Nicholson, is both excited and frustrated by the opportunities for DS in health care. Having cut his teeth at analytics hotbed LinkedIn, Nicholson has seen the potential of data science and fully appreciates the possibilities in health care. On the other hand, health care lags the Internet world in technology and analytics and is saddled with legal and compliance concerns – its reticence about open source being one example. Undismayed, Nicholson sees a big present and future in health care for DS. And like Digvijay Lamba, Nicholson obsesses over domain expertise, defining data science as “using data to solve problems end-to-end, starting from asking the right questions to making insights actionable.”

Mukund Raghunath of consultancy Mu Sigma distinguishes muddy from clear data science challenges. The latter lend themselves to the traditional scientific, problem-driven cycle of hypothesis, data, and analysis. Muddy problems, in contrast, generally demand discovery-driven solutions, with initial data observation needed to clarify the business issues. While acknowledging that data science portfolios must include both types, Raghunath argues that discovery-driven solutions, even though they present lower information-to-noise ratios, are less biased and more likely to lead to game-changing results.

My favorite among the outstanding presentations was Clifford Lyon’s discussion of the impact of experimentation, or A/B testing, on different aspects of the Web user experience with CBS Interactive sites. Colors, typography, positioning, navigation, graphics, tag lines, and quotations are all in play at CBS. The value of experimentation is, of course, that cause and effect can be established between design decisions and subsequent traffic behavior. The experimentation cycle is simple (a sketch of one pass follows the list):

  1. Create variations
  2. Apportion users randomly
  3. Measure key indicators
  4. Conduct tests and perform analyses
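
Here is the promised sketch of one pass through that cycle, using a two-proportion z-test from statsmodels on simulated click-through data. The conversion rates and sample size are made up; CBS’s actual tooling and metrics were not shown:

```python
# A minimal A/B sketch with made-up conversion numbers.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)

# 1. Two variations (A = control, B = new tag line)
# 2. Apportion users randomly, 50/50
n_users = 10_000
assignment = rng.integers(0, 2, size=n_users)    # 0 -> A, 1 -> B

# 3. Measure a key indicator (here, simulated click-through)
clicks = np.where(assignment == 0,
                  rng.random(n_users) < 0.030,   # A converts at 3.0%
                  rng.random(n_users) < 0.033)   # B converts at 3.3%

# 4. Test: two-proportion z-test on clicks by arm
counts = np.array([clicks[assignment == 1].sum(), clicks[assignment == 0].sum()])
nobs = np.array([(assignment == 1).sum(), (assignment == 0).sum()])
stat, pvalue = proportions_ztest(counts, nobs)
print(f"z = {stat:.2f}, p = {pvalue:.3f}")
```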

Lyon notes that only one out of every six experiments yields results of business value. Multivariate testing, with multiple factors and their interactions, is particularly nettlesome. Finally, Lyon’s team must re-test its successes over time, often with A/A randomization, lest what provided lift at first no longer does.
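To illustrate why multivariate testing gets nettlesome, here is a minimal sketch of a two-factor experiment analyzed with a logistic regression that includes the interaction term. The factors, rates, and effects are invented for illustration:

```python
# A minimal two-factor sketch with an interaction -- invented factors
# and effect sizes, not CBS's design.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 20_000
df = pd.DataFrame({
    "color": rng.choice(["blue", "red"], size=n),
    "layout": rng.choice(["wide", "narrow"], size=n),
})

# Simulate a response where the two factors interact
base = 0.03
lift = (0.004 * (df["color"] == "red")
        + 0.002 * (df["layout"] == "wide")
        - 0.005 * ((df["color"] == "red") & (df["layout"] == "wide")))
df["converted"] = (rng.random(n) < base + lift).astype(int)

# Logistic regression with main effects and their interaction
fit = smf.logit("converted ~ C(color) * C(layout)", data=df).fit(disp=0)
print(fit.summary().tables[1])
```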
Good stuff. I wish I could have returned to hear more presentations on day two. I’ll certainly keep future *IE Group Chicago-area analytics conferences on my radar.