11 a.m. CST: Away from the New Age and trip-hop music in The W Hotel lobby and through the registration and pleasantries, the event kicks off with salesforce.com Data Scientist Aron Clymer.
Clymer heads up the Product Intelligence team, which helps salesforce.com’s product teams maximize value and adoption of its as-a-service offerings. His team consists of six data scientists and three business analysts covering 150 product teams, all in-house at the vendor’s San Francisco headquarters. They’re dealing with approximately 1 billion behavioral data transactions per day, all loaded into Hadoop and filtered into its EDW.
In mining data for value, Clymer gave a nod to the “gross national happiness” index by the landlocked Asian nation of Bhutan as akin to the way salesforce.com gauges sentiment of its products and customer interactions. His team works off a hub-and-spoke model for information tied to product intelligence, with a view framed by customer behavior along one axis and customer sentiment along another. Customer dimensions and metrics start with “executive level questions: first, let me make sure I understand your question, and secondly ... what are you going to do with the results?”
In one look at customer case issues across sentiment transactions, the salesforce.com product team hit a basic but repeated request: what’s my password? Multiple behavioral data points led to some variation of this customer question, so Clymer’s team worked with the customer experience people at salesforce.com to put together a how-to video for password issues. Those cases “plummeted,” and Clymer attributed nearly $1 million in savings, in time and effort spent dealing with those cases, to the video.
“We never would’ve thought that was a big deal ... but that turned out to be one of our largest case volume issues,” he says.
11:47 a.m.: Google Data Scientist Yannet Interian takes the microphone. She works at Google+, following years digging through data projects related to ads for television and YouTube at the search engine behemoth.
Interian notes that one of the problems enterprises may have when starting on big data projects is a lack of data tied to their business question. From this perspective, the more data, the better, Interian says, adding that this is why enterprise search often “doesn’t work well.”
Her basic process starts something like this: you begin with a hypothesis; select variables based on parameters and business concerns with that hypothesis; pull huge data pools on that into MapReduce tables; join and aggregate based on data points and patterns; then analyze and visualize, often repeating the entire process “over and over.”
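For readers who want to picture that loop, here is a minimal sketch in plain Python. It is not Interian's or Google's code: the records, the field names and the hypothesis ("mobile users watch longer than desktop users") are all made up for illustration, and a real job would run the join and aggregation across MapReduce tables rather than in-memory lists.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; the field names are illustrative assumptions.
VIEWS = [
    {"user": "u1", "minutes": 12},
    {"user": "u2", "minutes": 3},
    {"user": "u1", "minutes": 8},
    {"user": "u3", "minutes": 5},
]
USERS = {"u1": "mobile", "u2": "desktop", "u3": "desktop"}


def join_and_aggregate(views, users):
    """Join each view to its user's device, then average minutes per device."""
    grouped = defaultdict(list)
    for row in views:
        # "Map" step: emit a (device, minutes) pair for every joined record.
        device = users.get(row["user"])
        if device is not None:
            grouped[device].append(row["minutes"])
    # "Reduce" step: collapse each device's values to a single mean.
    return {device: mean(values) for device, values in grouped.items()}


def analyze(averages):
    """Check the illustrative hypothesis against the aggregated numbers."""
    return averages["mobile"] > averages["desktop"]


averages = join_and_aggregate(VIEWS, USERS)
hypothesis_holds = analyze(averages)
```

If the result does not settle the question, the next pass through the loop would swap in different variables or a refined hypothesis and run the same steps again.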
12:28 p.m.: Carlson Wagonlit Travel senior BI manager Catalin Ciobanu previously worked much more on the “science” side of data science, as a physicist at Fermilab.
Now, Ciobanu takes an “object-oriented approach to data analysis” at the Paris-based business travel enterprise, constructing a clearer picture from data that can be unclear at first blush. Ciobanu offered these bits of advice:
- Have a clear image of what the output might look like and write it down. (“Sometimes you have it in your head but put it on paper to see.”)
- Understand the objects you’re working with and their interaction. “Event” reconstruction is key.
- Choose your variables wisely. For all purposes, an object is a set of values assumed by a set of variables.
- Data profiling is a useful concept. Pulling visualizations from those profiles and samples is a “short cut to insight.” And to get to worthwhile profiles, here are a few questions to ask: Do your distributions make sense? If not, why? And do your distributions speak to your “anticipated” outcome from the start of analysis?
- Be mindful of 50-variable analysis. Six is typically enough, 10 should probably be the max. Ciobanu warned about infinitely multiplying factors until there is too much noise or unrelated data.
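The profiling and variable-count advice above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Ciobanu's method: the trip records, the field names and the "anticipated" ranges are hypothetical, and the cap of 10 variables simply encodes the upper bound he mentioned.

```python
from statistics import mean, median, pstdev

MAX_VARIABLES = 10  # upper bound suggested in the talk; six is typically enough


def profile(rows, variables):
    """Summarize the distribution of each chosen numeric variable."""
    if len(variables) > MAX_VARIABLES:
        raise ValueError(
            f"{len(variables)} variables requested; consider {MAX_VARIABLES} or fewer"
        )
    summary = {}
    for name in variables:
        # Skip missing values so one gap does not break the whole profile.
        values = [row[name] for row in rows if row.get(name) is not None]
        if not values:
            summary[name] = {"count": 0}
            continue
        summary[name] = {
            "count": len(values),
            "min": min(values),
            "median": median(values),
            "mean": mean(values),
            "max": max(values),
            "stdev": pstdev(values),
        }
    return summary


def surprises(summary, expected):
    """List variables whose observed range falls outside what was anticipated."""
    flagged = []
    for name, (low, high) in expected.items():
        stats = summary[name]
        if stats["count"] == 0 or stats["min"] < low or stats["max"] > high:
            flagged.append(name)
    return flagged


# Hypothetical trip records; names and values are illustrative only.
TRIPS = [
    {"nights": 2, "fare": 310.0},
    {"nights": 1, "fare": 250.0},
    {"nights": 45, "fare": 280.0},
    {"nights": 3, "fare": None},
]

summary = profile(TRIPS, ["nights", "fare"])
flagged = surprises(summary, {"nights": (0, 30), "fare": (0, 2000)})
```

Writing the expected ranges down before running the profile mirrors the first bullet: the anticipated output is on paper, so a distribution that "doesn't make sense" shows up as a flagged variable rather than a hunch.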
2:10 p.m.: After a lunch break, we hear from the chief data scientist at Accretive Health, Scott Nicholson.
Nicholson meshed his background in e-commerce and at LinkedIn with his newfound excitement over the industry-specific challenges to IT in health care, a field he sees as “next” in terms of data management demand. Even with that interest, you'll still be spending plenty of time digging through dirty data, Nicholson says. In health care, Nicholson estimates that about 80 percent of his time is spent cleansing, munging and loading data.
“You acknowledge that this is a mess. I kind of have to be a master of cleaning, extracting and trusting my data before I do anything with it,” Nicholson says. Suggestions that Nicholson made for data analysts and scientists working in the health care field:
- Constantly acknowledge HIPAA and statistical identification
- You’ll need to find enterprise versions of open source software
- Follow the deployment all the way through, working with physicians and nurses who understandably “need to know why”
- The default for the industry is to be slow-moving. “Make sure people understand your timeline may be different from theirs”
- Communicate to understand differing terminology (the word “model” holds different meanings for a doctor than a data scientist, for example)
- Transparency is a “high-performing black box”
2:59 p.m.: Clymer and George Mason University professor Kirk Borne began a trio of workshop discussions on data scientist issues related to methods for business-minded analysis, finding the correct questions to ask from the start, and a look at expanding the training and knowledge related to the data science field.
The inaugural data scientist event was scheduled to be followed up Thursday by two more *IE summits covering analytics and business intelligence.