I’m getting close to being comfortable with my new notebook computer. It’s quite the performer, what with fast multi-processing, a speedy hard drive and 8 GB RAM.
And while the computer serves all the business needs of helping me manage OpenBI, I’m in the process of turning it into my personal data science tool chest as well.
Data scientists and science of business BI practitioners often compartmentalize their computing needs into data management/integration (“munging”), information visualization and statistics/analytics. That’s the way I’ve organized software to date on my new machine.
So far, I’ve worked in “software twos” on the computer. I’ve installed two relational databases, two tools for moving/integrating data, two visualization programs and two (actually three) statistical platforms. All software is either open source or demo-licensed from OpenBI partners.
For database management, my starting point is the ubiquitous open source MySQL. Over the years, I don’t think I’ve ever had a problem installing, configuring or connecting MySQL to anything. It just works. MySQL is certainly not the best option for “large” analytics data, but functions well for the modest sizes I use on my notebook. Despite the hand-wringing from open-sourcers when Oracle took over, MySQL remains my trusty friend.
For larger, analytic databases, I’ve become enamored with VectorWise from Actian, the re-branded Ingres. In the admittedly simple tests I’ve conducted over the last few weeks, VectorWise has been blazingly fast on my new machine. The load of a 52M-row table with 30 integer attributes completes in just over three minutes. And GROUP BY queries scanning all rows start returning output almost instantaneously. I was able to get the ODBC and JDBC drivers working with no problems this time. I do, however, anxiously await the promised new client management tools.
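For readers who want to try a similar full-scan aggregation test, here is a minimal sketch. It is an illustration only: it uses Python's built-in SQLite as a stand-in rather than VectorWise, and the table name `facts`, the columns `a1` to `a3`, and the 100,000-row size are assumptions made for the example, not the test described above.

```python
import random
import sqlite3

# In-memory SQLite database as a stand-in engine (assumption: not VectorWise).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (a1 INTEGER, a2 INTEGER, a3 INTEGER)")

# Generate a small, reproducible table of integer attributes (hypothetical data).
rng = random.Random(0)
rows = [
    (rng.randint(0, 9), rng.randint(0, 99), rng.randint(0, 999))
    for _ in range(100_000)
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", rows)
conn.commit()

# A GROUP BY query that has to scan every row to compute its aggregates.
query = """
    SELECT a1, COUNT(*) AS n, AVG(a3) AS mean_a3
    FROM facts
    GROUP BY a1
    ORDER BY a1
"""
result = conn.execute(query).fetchall()
for a1, n, mean_a3 in result:
    print(a1, n, round(mean_a3, 1))
```

Timing the same statement against a row store and a column store at realistic row counts is one simple way to see the kind of difference described here.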
A former programmer, I just had to have an agile language as a foundation for data munging. And Ruby, a pure OO language in the scripting tradition of Perl and Python, fills that role for me with aplomb. I love the conciseness and elegance of the language. It’s an absolute pleasure to work with on the scripting challenges of moving, reshaping and recoding data that I give it.
But for industrial-strength integration, even a language as expressive as Ruby cannot match a full-featured ETL platform. PDI (Pentaho Data Integration), the crown jewel of the Pentaho BI suite and, increasingly, a competitor to the larger Informatica and DataStage, is a no-brainer for me. And why not? Commercial open source PDI is functionally rich and low in cost compared to its proprietary foes.
Visualization/agile intelligence platform number one is Tableau. If there’s a cleaner, simpler, more elegant and powerful BI tool, I’ve yet to find it. I’m not sure I’ve ever spent time going through Tableau documentation or demos. I just fire it up, load a data set and I’m off to the visualization races. The 5.4M record data set that caused slight angst with Tableau on 4 GB RAM? No problem now.
At the same time, I’m a big fan of competitor Omniscope from U.K. vendor Visokio. Omniscope’s robust and easy to use as well. I now appreciate Omniscope’s poor man’s data integration functions. And I’m an even bigger advocate of its interoperability with R that lets me use R’s language against non-R data – and then visualize the whole works. Very powerful – and agile – stuff.
I was able to take advantage of the increased memory on this machine to load a 20M record data frame with the 64-bit version of the R Project for Statistical Computing. And I certainly had a lot more space to maneuver in with the 8 GB memory limit. Still, large data’s a hardship for vanilla, open source R. I didn’t even think of attempting predictive models with my entire large data set, settling instead for a 2M record random sample. Much as I love to use R as a staple for my data science work, the memory-strapped community edition is probably not the answer for predictive models with big data.
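One way to draw a random sample like that without holding the full data set in memory is reservoir sampling. The sketch below is a generic illustration in Python, not the approach used above in R; the function name `reservoir_sample`, the fixed seed, and the scaled-down sizes (20,000 records out of 200,000, the same 10% ratio as 2M out of 20M) are assumptions made for the example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream.

    Only k records are held in memory at once, so the stream can be far
    larger than available RAM.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # replacing a uniformly chosen slot in the reservoir.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# Scaled-down stand-in for the 20M -> 2M case: keep 10% of 200,000 rows.
rows = range(200_000)
subset = reservoir_sample(rows, k=20_000)
print(len(subset))
```

In practice `records` could be any iterator, such as lines read from a large CSV file, and the resulting sample can then be handed to a memory-bound tool for modeling.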
Having worked with SAS software for 20 years, I do occasionally miss the power of its data step programming language and selected procs, though I can gladly live without the ugly macro language.
World Programming System to the rescue for my addiction. The U.K. company offers a SAS compiler clone at a fraction of the cost. The product works like a champ, running my extensive, mothballed SAS scripts without a hitch. On my new notebook, the performance and capacity of WPS trump community R handily. I’d love to see what the 64-bit version can do. In my mind, WPS is a serious statistical contender for the data scientist. Indeed, I’m enthusiastic about serious but inexpensive R-WPS collaboration.
Enterprise R from Revolution Analytics, my third statistical platform, installs with a wonderful development environment and provides web integration as well as many answers for community R’s performance weaknesses. I look forward to RA’s enhancements that will enable it to interoperate seamlessly with community R.
In the meantime, I await Oracle R Enterprise for Windows. Can ORE solve R’s capacity limitations? Stay tuned.