The last Friday in April was the all-hands meeting for OpenBI in Chicago. Thrice annually, the entire company gets together for sharing, learning and playing, not necessarily in that order. It's a great opportunity for our geographically dispersed staff to connect. And it's a welcome respite from what's often the grind of day-to-day consulting projects.
The itinerary this time was socializing Thursday evening, followed by presentations on company business all day Friday, capping off with more socializing Friday evening. I'm pretty sure everyone looked forward to a peaceful Saturday.
The presentations Friday were especially good, with three of the six revolving around data integration, which is central to OpenBI's business. First up was Pentaho Data Integration for ETL and big data, followed by Alteryx for data blending and advanced analytics, and finally Python-Pandas-Scikit-learn for data science.
In the group discussion that followed the talks, it was “speculated” that each platform had its intelligence niche. As a rough point of departure, we see PDI as the platform to build the data warehouse and promote BI; Alteryx, with its tight integration with the R Project for Statistical Computing, as the application to support the DI “blending” needs of predictive analytics professionals; and Python-Pandas-Scikit as the platform to handle the programming, data management and machine learning demands of data scientists.
Pentaho Data Integration has been the ETL workhorse for OpenBI all of our eight years. It's safe to say we've done more development in PDI than in any other platform. Open source PDI is central to many of OpenBI's DW build projects, especially those for new “data-driven” companies with no legacy ETL platform. The crown jewel of the Pentaho BI Suite, PDI's as powerful as Informatica at a fraction of the cost, readily handling every knotty integration challenge we throw at it. PDI also shines in the OEM work OpenBI does helping SaaS vendors add analytics to their existing applications. And PDI has big data capabilities that help impose order on the very noisy existing Hadoop ecosystem. Finally, there's incipient integration with R that might put PDI in play with the predictive analytics and data science markets.
Alteryx sits squarely in that data integration-statistical nexus, offering powerful DI for the predictive analytics professional. But it's the integration with Tableau and especially with R where Alteryx shines. Indeed, Alteryx offers the strongest interoperability with R of any DI tool I've worked with. The sweet spot for Alteryx appears to be as the analytics hub for established quant departments in mid to large companies, bringing powerful blending and statistical analysis to practitioners who're less-than-guru R, Python or SAS programmers. “Data Scientists Not Required! Data Blending and Advanced Analytics in the Hands of Every Analyst.”
Where data scientists are required, Python-Pandas-Scikit is a great choice. Don't believe me? Ask Harvard University, which uses the platform as the programming foundation for its famous CS109 course.
The Python of 2014 is much more than the agile data scripting language you learned 10 years ago. With millions of users and countless contributors, there seems no end to new Python capabilities, some of which, like the libraries numpy, scipy and pandas, have transformed the way programming is done in Python. Loops, lists and comprehensions to munge data are now often replaced by arrays, dataframes and vectorized operations sourced from community-contributed libraries. Pandas for data management is a terrific library that competes directly with R. And the exciting Berkeley Data Analytics Stack (BDAS), a next generation platform for Hadoop big data analysis, has APIs to Python. Our data scientists are salivating.
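To make that shift from loops to dataframes concrete, here's a rough sketch of the same small aggregation written both ways. The sales records, column names and function names are invented purely for illustration, not drawn from any OpenBI project; the pandas calls (DataFrame, groupby, sum) are standard.

```python
import pandas as pd

# Hypothetical sales records, made up for this illustration only.
records = [
    {"region": "east", "units": 10, "price": 2.5},
    {"region": "west", "units": 4, "price": 2.5},
    {"region": "east", "units": 3, "price": 4.0},
    {"region": "west", "units": 7, "price": 4.0},
]


def revenue_by_region_loop(rows):
    """Old style: walk the rows one at a time and accumulate into a dict."""
    totals = {}
    for row in rows:
        revenue = row["units"] * row["price"]
        totals[row["region"]] = totals.get(row["region"], 0.0) + revenue
    return totals


def revenue_by_region_pandas(rows):
    """Dataframe style: whole-column arithmetic, then a grouped sum."""
    df = pd.DataFrame(rows)
    # Multiplies the two columns element-wise, with no explicit loop.
    df["revenue"] = df["units"] * df["price"]
    return df.groupby("region")["revenue"].sum().to_dict()


if __name__ == "__main__":
    print(revenue_by_region_loop(records))
    print(revenue_by_region_pandas(records))
```

Both functions return the same totals per region. The difference is that the pandas version states what to compute column by column and leaves the iteration to the library, which is the style the newer Python data stack encourages.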
The final word on these three DI platforms? Each is excellent in its respective niche, whether DW/BI, Predictive Analytics or Data Science, but also has the “legs” to extend reach and provide bigger solutions. OpenBI's excited to work with all three.