Mike Driscoll's talk, The Social Effect: Predicting Telecom Customer Churn with Call Data, was a good illustration of predictive analytics in a larger data warehousing and BI context. Driscoll and his team analyzed billions of calls, millions of records, and thousands of defectors in a Greenplum data warehouse, looking for predictors of churn. Driscoll's a big proponent of the open source R Project for Statistical Computing to support his workflow of data munging, data modeling, and data visualization. And with a Ph.D. in bioinformatics, he often thinks like an epidemiologist, in this case looking for indications of contagious churn behavior. Using several social network analysis packages available in R, Driscoll's team appears to have found that churn in an individual's social network of calling accounts in a given month is likely to lead to more churn in subsequent months, a clear indication of a network effect. That contagion is overwhelmingly the strongest signal the team found in the data. A next step is to work with marketing to intervene on early network churn with email campaigns to minimize losses from the affected networks. I'd love to see the results of those experiments.
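Driscoll's analysis ran at warehouse scale in R, but the core idea, that exposure to a churned contact raises one's own churn risk, can be sketched on a toy call graph. Everything below (the call records, months, and churn sets) is invented for illustration and is not Driscoll's actual method or data:

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee) pairs define the social graph.
calls = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e"), ("e", "f")]
# Hypothetical churn events: month -> set of accounts that churned that month.
churned_in_month = {1: {"b"}, 2: {"a", "f"}}

# Build an undirected adjacency list from the call records.
neighbors = defaultdict(set)
for caller, callee in calls:
    neighbors[caller].add(callee)
    neighbors[callee].add(caller)

def exposed(account, month):
    """True if any of the account's calling contacts churned in `month`."""
    return bool(neighbors[account] & churned_in_month.get(month, set()))

# Compare month-2 churn rates for accounts exposed vs. unexposed to month-1 churn.
active = set(neighbors) - churned_in_month[1]
exposed_accts = {a for a in active if exposed(a, 1)}
unexposed_accts = active - exposed_accts

def rate(group):
    """Share of the group that churned in month 2."""
    return len(group & churned_in_month[2]) / len(group) if group else 0.0

print(round(rate(exposed_accts), 3), round(rate(unexposed_accts), 3))  # → 0.5 0.333
```

In this toy network, accounts whose contacts churned in month 1 churn at a higher rate in month 2, which is the kind of signal a contagion analysis looks for at scale.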
Driscoll and colleague Jim Porzak are co-chairs of the highly successful Bay Area useR Group (R Programming Language), which held its monthly meeting at the conference Tuesday evening. The speaker was none other than the venerable John Chambers, recently “retired” to teach at Stanford after a 40+ year career as a research scientist at Bell Labs. Along the way, Dr. Chambers became the first statistician named a Bell Labs Fellow in 1997, and received the prestigious Association for Computing Machinery (ACM) Software System Award for the design of the S system, the predecessor of the R statistical platform. The ACM acknowledged that S “has forever altered the way people analyze, visualize, and manipulate data”. Chambers recently published Software for Data Analysis: Programming with R, which has deepened my R programming knowledge immeasurably.
Chambers took his useR audience on a trip through statistical programming time, discussing functional, object-oriented, data-stream, and procedural methods, and showing the evolution of approaches to extend S and paradigms for interfacing S with other languages. He even presented dumbstruck participants with a sketch of an initial S design, now more than 35 years old!
Prior to Chambers' talk, I met David Smith, Director of Community and author of the outstanding Revolutions blog for commercial R vendor Revolution Computing, at a reception hosted by Revolution. Funny, it seemed everyone at PAW was an R advocate for that hour!
Rhodes scholar Sean Gourley, co-founder of Younoodle, presented The Mathematics of Innovation: Predicting Startup Success, describing his company's business of predicting the future success of early-stage startups. The typical VC portfolio has a 2:6:2 yield: every 10 investments yield 2 successes, 6 breakevens, and 2 failures. Yet, because successes are “Extremistan” events, the rewards of a few can overcome a score of breakevens and failures. But what if predictive analytics could help VCs boost the predictability of early startup success so that 2:6:2 becomes 3:5:2? The VCs would make a lot more money.
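The arithmetic behind that claim is easy to sketch. The payoff multiples below (10x for a success, 1x for a breakeven, 0x for a failure) are my own assumptions for illustration, not figures from Gourley's talk:

```python
# Assumed payoff multiples per outcome (illustrative only).
PAYOFFS = {"success": 10.0, "breakeven": 1.0, "failure": 0.0}

def portfolio_multiple(successes, breakevens, failures):
    """Gross return multiple on a portfolio of equal-sized investments."""
    n = successes + breakevens + failures
    gross = (successes * PAYOFFS["success"]
             + breakevens * PAYOFFS["breakeven"]
             + failures * PAYOFFS["failure"])
    return gross / n

print(portfolio_multiple(2, 6, 2))  # 2:6:2 → 2.6x gross
print(portfolio_multiple(3, 5, 2))  # 3:5:2 → 3.5x gross
```

Under these assumed payoffs, converting a single breakeven into a success lifts the portfolio from 2.6x to 3.5x gross, which is why even a modest gain in predictability is so valuable.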
Younoodle is building a large startup database that it continuously refines to help improve predictions. Among the key areas of focus: 1) How good is the founding team? (Elite schools are important, but not for all members of the founding team.) 2) Competitor DNA profiles are critical. (The startup should occupy a vacant space in the competitive landscape clustering.) 3) What is the company worth today? (Traditional discounted cash flows are useless.) The company has had preliminary success using predictors from 1)-3) above to approach its goal of “lifting” 2:6:2 to 3:5:2. If Younoodle attains that goal, its own VC prospects are outstanding, indeed.
The keynote on Wednesday, February 17, Response Modeling is the Wrong Modeling: Maximize Impact with Net Lift Modeling, by Kim Larsen, was perhaps my favorite session, combining the best of both the practice and theory of predictive modeling. Net lift modeling exercises the two legs of super-crunching, predictive models and randomized experiments, looking for incremental or “swing” impact. Lift modeling is uninterested in the “self selectors” who respond positively to both campaign interventions and control, focusing instead on the swing prospects who respond only to the “treatment,” or campaign. Indeed, swing clients are to self selectors as “skill” is to “common causes” in the measurement of performance: the former positive outcomes are more important to assess than the latter.
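The basic net lift calculation from a randomized experiment can be sketched in a few lines. The records below are invented; in practice the modeling aims at individual-level swing prospects, not just this aggregate difference:

```python
# Hypothetical experiment records: (treated?, responded?) per customer.
records = [
    (True, True), (True, True), (True, False), (True, False), (True, True),
    (False, True), (False, False), (False, False), (False, False), (False, True),
]

def response_rate(rows):
    """Fraction of rows that responded."""
    return sum(responded for _, responded in rows) / len(rows)

treated = [row for row in records if row[0]]
control = [row for row in records if not row[0]]

# Net lift: the response the campaign caused, beyond what the self selectors
# would have done anyway (the control response rate).
net_lift = response_rate(treated) - response_rate(control)
print(round(net_lift, 2))  # 0.6 - 0.4 → 0.2
```

Here the campaign gets credit only for the 20-point incremental response, not the 60% raw treated response rate, which is precisely the point of net lift over ordinary response modeling.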
Larsen proceeded to introduce the concepts of weight of evidence, net information value, log odds ratios, overfitting, penalized estimation, training, validation, naïve Bayes, k-NN classifiers, and bifurcated logistic regression – topics covered in an advanced SAS training seminar – in an understandable way. Ever the practical analyst, he advocates examining the incremental lift in the top decile of responders as a good test of the impact of a campaign.
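To make “weight of evidence” concrete, here is a sketch of the classic single-predictor WoE and information value (IV) calculation; Larsen's “net” variants extend the same idea to the treatment/control difference. The bin counts below are invented for illustration:

```python
import math

# Invented bin counts for one binned predictor: (responders, non_responders).
bins = [(20, 180), (50, 150), (80, 120), (50, 50)]

total_resp = sum(r for r, _ in bins)   # 200 responders overall
total_non = sum(n for _, n in bins)    # 500 non-responders overall

iv = 0.0
for resp, non in bins:
    p_resp = resp / total_resp         # bin's share of all responders
    p_non = non / total_non            # bin's share of all non-responders
    woe = math.log(p_resp / p_non)     # weight of evidence: a log odds ratio
    iv += (p_resp - p_non) * woe       # accumulate information value

print(round(iv, 3))  # → 0.561
```

A predictor's IV summarizes how well its bins separate responders from non-responders; ranking candidate predictors by (net) information value is a common screening step before fitting the model itself.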
My one regret for the trip was not staying the additional night to participate in the Wednesday afternoon sessions on simulation and tree ensembles. I won't make that mistake next year. A request for 2011 PAW in San Francisco: arrange a keynote from one of the Stanford statistics professors Trevor Hastie, Robert Tibshirani, or Jerome Friedman, co-authors of The Elements of Statistical Learning. The learning approach to predictive modeling sits comfortably between traditional statistical methods and the more esoteric computer science foundations of data mining. Who better to update attendees on the state of the craft than the guys who wrote the book from the top statistics program in the country?
Steve Miller also blogs at miller.openbi.com.