Another view of "big data" landed in my lap this week in a chat with North Carolina State University's Richard Kouri, the executive director of the biosciences management program at the College of Management.
NC State's Office of Technology Transfer has been working with IBM on a big data project that seeks to commercialize research projects coming out of university R&D by finding likely customers, partners or investors through online research.
While this project is a little different from the complex-event, data-stream monitoring kind of big data we were talking about last week, it does involve high-speed in-memory processing and pattern matching for fast turnaround of information.
The idea here is to reach out and selectively attack likely data sources specifically chosen by analyst/investigators: blogs, white papers, research forums, industry and government websites, any good place where high-speed processing and embedded analytics can use keywords and rules to unearth useful correlations to likely partner candidates for NC State.
Long story short, this task is presently handled by 19 employees who comb and keyword-search sources to find likely partners that are already investing in technology like that coming out of NC State or might have reason to be interested in the university's R&D. This kind of commercialization intelligence, gathered project by project, is largely manual and takes months -- and NC State, I'm told, has some 3,000 research projects eligible for development.
So the university commissioned a pair of pilots with IBM's Emerging Technologies Business to see how it might more quickly commercialize a Salmonella vaccine and another product involving a drug delivery system.
The pilots called in IBM LanguageWare text analytics, BigSheets and ICA tools to let investigators fire off keyword-based questions against specifically targeted data sources and return thousands of results that were "colder" or "hotter" depending on how closely and repetitively the words appeared.
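To make the "colder" and "hotter" idea concrete, here is a minimal Python sketch of scoring a document by how often the keywords appear and how close together they sit. It is a toy under stated assumptions, not how LanguageWare, BigSheets or ICA actually rank results: the function names, the `window` parameter and the scoring formula are all made up for illustration.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def heat_score(text, keywords, window=10):
    """Hypothetical 'heat' score: keyword hits plus nearby keyword pairs.

    frequency   = how many times any keyword appears (repetition)
    close_pairs = pairs of *different* keywords that appear within
                  `window` tokens of each other (closeness)
    """
    tokens = tokenize(text)
    wanted = {k.lower() for k in keywords}
    positions = [(i, t) for i, t in enumerate(tokens) if t in wanted]
    frequency = len(positions)
    close_pairs = sum(
        1
        for (i, a), (j, b) in combinations(positions, 2)
        if a != b and abs(i - j) <= window
    )
    return frequency + close_pairs


def rank(docs, keywords):
    """Return document names ordered from hottest to coldest."""
    return sorted(docs, key=lambda name: heat_score(docs[name], keywords), reverse=True)
```

Calling `rank(docs, ["salmonella", "vaccine"])` on a dictionary of source names and their text would put the pages that mention both terms repeatedly and near each other at the top of the list.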
What's interesting here is the addition of domain experts like Kouri, who holds a Ph.D. in radiation biology and works out of the NC State business school. That makes him -- in his words -- a "typical boundary spanner" with business and science skills who can figure out the mix of market and science he needs to find a likely partner. In other words, he already knows the likely sources and where to start, and the upgrade lets high-speed analytical processing churn through vast amounts of information on the Web, report back and even learn as it goes.
From there he can reductively improve the results, add a paper from Deloitte Consulting or Booz Hamilton and impute new relevant search terms for analytic processing. He might start with 250 companies that work on vaccines related to Salmonella, and come back with 1,000 results that become hotter by correlation so he can reduce the list from thousands, to hundreds, to 25 and to 8 final best results -- which is pretty much what he did.
"We make it loose, then we make it tight and then we make it loose again," Kouri says. "This is reiterative, as you apply the domain experts to interpret what you pulled out. And then you remember a guy at UCLA who published a paper in 2009 that should have showed up, which usually means you need to match and reset keywords, which improves the result again."
To me this is a really cool example of mixing an analyst/domain expert's skills with what we're (maybe prematurely) assuming is "big data" processing too fast to pause for reflection or human intervention.
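The loose-then-tight loop Kouri describes can be sketched in a few lines of Python. This is only an illustration of the idea, not the pilots' actual workflow: the document names, their text, the extra search terms and the `min_hits` threshold are all hypothetical. Each round keeps the sources that hit enough keywords and reports any source the expert expected to see but did not, which is the cue to reset the keywords.

```python
def hit_count(text, keywords):
    """Count how many distinct keywords appear in the text."""
    words = set(text.lower().split())
    return len(words & keywords)


def run_round(docs, keywords, min_hits, expected):
    """One pass: keep sources with enough hits, flag expected ones that are missing."""
    keywords = {k.lower() for k in keywords}
    kept = {name for name, text in docs.items()
            if hit_count(text, keywords) >= min_hits}
    missing = set(expected) - kept
    return kept, missing


# Hypothetical sources; "known_paper" stands in for the paper the expert remembers.
docs = {
    "company_a": "salmonella vaccine for poultry",
    "company_b": "cattle vaccine distribution",
    "known_paper": "oral immunization against typhimurium infection",
}

# Round 1 (loose): few keywords, low threshold. The remembered paper is missing.
kept1, missing1 = run_round(docs, {"salmonella", "vaccine"}, 1, {"known_paper"})

# Round 2: the expert adds terms taken from the missing paper, then tightens the threshold.
wider = {"salmonella", "vaccine", "typhimurium", "immunization"}
kept2, missing2 = run_round(docs, wider, 2, {"known_paper"})
```

The point of returning `missing` alongside `kept` is that the human stays in the loop: the software narrows the list, and the expert's memory of what should have turned up drives the next set of keywords.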
"It democratizes information to smaller sets of users in the way a small company can get to information that only a big company could once get," Kouri told me. "But the big company used to do it the wrong way because it collected data once and tried to distribute the same set of information across 50 business functions. You really wanted 50 different runs of this so each business function had its own set of information."
And the reason I'll agree with NC State and IBM that this project really is "big data" and not just another research vault is that it adheres to the three principles Neil McGovern from Sybase wisely laid out last week: high volume, high velocity and a short timeline for analysis of information that is volatile and going to change pretty fast.
Researchers know pipelines are always on some schedule of currency and decaying value, and "there is a huge pipe for a lot of data to fly through," Kouri said.
But what he thinks about most are the two ends of that pipe. "I need, at minimum, an engagement process at the front end that lets me get to the critical sets of questions relatively quickly. What I also need at the back end is a way to display the resulting data in a way that I can make informed business decisions fairly quickly. The stuff in between works but the stuff at the ends is where we are working and it's still kind of fuzzy."
(I never even got to the plans for Hadoop and cloud deployment NC State is migrating the IBM pilots to, or the chat with a really smart IBM strategist who's really a startup specialist, but I hope to. Thanks to Chris and Richard and Kevin for an article I hope will follow. -ed)