My company, OpenBI, is getting ready to celebrate a birthday. When we opened for business eight years ago, we branded ourselves as a data warehousing/business intelligence consultancy focused on open source software. In 2014, OpenBI has expanded well beyond open source, and while DW/BI is still central to the business, our portfolio now includes big data and data science. I'd guess the current mix is roughly 50-50 traditional and new.
That evolution pleases me and certainly mirrors changes in the industry. A statistician with a data science background, I was more the company “R guy” early on. Now, it seems, all OpenBI'ers are R guys or at least R (Python-pandas) apprentices.
In the last 3 weeks alone, I can cite half a dozen overtures from staff/customers on R topics and will mention a few.
Curmudgeonly partner Bryan skyped for my take on MonetDB.R, an R package that connects to the open source analytic database MonetDB and that a customer had chartered him to investigate. I hadn't heard of the package, but quickly installed and exercised it and liked what I saw. MonetDB.R handles the expected movement of data between MonetDB and R with aplomb. But it's the package's additional capabilities that set it apart. Within R, once the metadata connection to a MonetDB table or query is made, the “monet.frame” object assumes many methods of the ubiquitous R data.frame, even though it's virtual. The supported functions generate SQL for MonetDB under the hood.
One monet.frame capability that's especially welcome is the “sample” function that enables programmers to select a random sample of data into a physical R data.frame from an underlying MonetDB table. That table could, of course, be much larger than what would fit into R's memory space. The database in effect “serves” samples to R for subsequent analytics. Very handy for large data.
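For readers who'd like to see the "database serves the sample" idea in code, here's a minimal sketch in Python. It is not MonetDB.R: it assumes SQLite (through the standard-library sqlite3 module) as a stand-in for MonetDB, and the `measurements` table, its columns, and the `sample_rows` helper are all hypothetical names made up for illustration.

```python
import sqlite3


def sample_rows(conn, table, n):
    """Ask the database for n random rows so only the sample reaches the client."""
    # Hypothetical helper: the table name is interpolated directly, so it is
    # assumed to come from trusted code rather than user input.
    query = f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT ?"
    return conn.execute(query, (n,)).fetchall()


# Assumption: a small in-memory SQLite table stands in for a large MonetDB table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO measurements (id, value) VALUES (?, ?)",
    [(i, i * 0.5) for i in range(10_000)],
)

# Only 100 rows are pulled into client memory, whatever the table size.
sample = sample_rows(conn, "measurements", 100)
print(len(sample))
```

The random selection happens inside the database engine, so the client only ever holds the sample. How efficiently a given database does that is a separate question this sketch doesn't address.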
Partner Dave knows quite a bit about R and loves to analyze sports data, as anyone who's seen one of his Hadoop demos can attest. His latest initiative applies the PageRank algorithm to rate college basketball teams by performance, using game results for the 351 Division I teams scraped from the Web. Once the data's munged, he organizes it into an R data.frame suitable for graph/network analysis by the igraph package. The findings are encouraging: the PageRank ratings correlate reasonably well with established indexes like RPI and BPI. Clever application of a learning algorithm.
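To give a flavor of how PageRank can rate teams, here's a minimal sketch in Python. It is not Dave's code, which uses R and igraph. It assumes each game contributes a directed edge from the loser to the winner, uses the conventional damping factor of 0.85, and runs on four made-up teams with made-up results; the `pagerank` function is a plain power iteration written for illustration.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over (loser, winner) edges."""
    teams = sorted({team for edge in edges for team in edge})
    n = len(teams)
    out_links = {team: [] for team in teams}
    for loser, winner in edges:
        out_links[loser].append(winner)

    rank = {team: 1.0 / n for team in teams}
    for _ in range(iterations):
        # Teams with no losses have no outgoing edges; spread their rank evenly.
        dangling = sum(rank[t] for t in teams if not out_links[t])
        new_rank = {t: (1 - damping) / n + damping * dangling / n for t in teams}
        # Each loser passes an equal share of its rank to every team that beat it.
        for loser, winners in out_links.items():
            if winners:
                share = damping * rank[loser] / len(winners)
                for winner in winners:
                    new_rank[winner] += share
        rank = new_rank
    return rank


# Made-up results: each tuple is (loser, winner).
games = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "A"), ("D", "B"), ("D", "C")]
ratings = pagerank(games)
for team, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(team, round(score, 3))
```

In this toy schedule, A beats everyone and D loses to everyone, so the ratings come out in the order A, B, C, D. A real season would add repeated matchups, and one might weight edges by margin of victory, which this sketch ignores.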
Partner Kevin approached me with two issues: the first, integration of R with the Tableau visualization platform, the second a prod to help him look at Shiny, a package distributed by RStudio that makes it easy for developers to turn their R analyses into interactive web applications.
Tableau-R integration is at the variable level, which is different from that offered by competing tools Spotfire and Omniscope. That's both good news and not-so-good news. The good is that in many circumstances the R-computed Tableau attributes behave as one would want with Tableau filters/hierarchies, as a stock index data POC we assembled demonstrates.
Senior architect Jon called the other day, asking for suggestions that'd shed light on an e-marketing customer's data. Jon had recently completed Coursera self-training in data science and R, so we decided to explore a half-million-record data file in R to see what goodies we could find to show the “boss”. R readily consumed the CSV, and we then used the lattice graphics package to showcase the individual attributes and relationships. We built dotplots for frequencies, densityplots for univariate distributions, scatterplots for inter-attribute relationships, and hexbinplots for “big” data. The R motif of using graphics to explore the data and tell a preliminary story worked well for both consultant and customer.
There's a lot of excitement from everyone at OpenBI surrounding the soon-to-be-released RScript transformation plugin for Pentaho Data Integration. With its rich, open source functionality, PDI's long been OpenBI's platform of choice for ETL, data integration and wrangling. Now its considerable capabilities can be combined in transformation flows that promote the building of R data structures from complicated inputs, as well as facilitate computations in R that broadcast to, for example, model-scoring and report-writing steps. We believe PDI-RScript can be a productivity boon for data science professionals.
What's all this mean? To me, more than anything, it means that OpenBI consultants and the companies we work with are becoming ever more data-driven and are now less content with traditional DW/BI, demanding larger, more flexible and quicker evidence-based answers. This, I believe, is good news for both customers and consultants.