Up to Date on Open Source Analytics
I’ve been updating the computational analytics platform on my Wintel notebook the last few days. I’d fallen behind several versions on each of the main tools and decided to get them all back in sync at once. The good news for hackers like me is that there are so many freely available, open source analytics products to choose from. The bad news is that it takes a focused effort to stay up to date on the latest largesse.
First up were the relational databases MySQL and PostgreSQL, along with the accompanying ODBC and JDBC drivers. Both are solid SQL databases capable of handling small to mid-sized analytics challenges with ease. Support is available for enterprise customers from vendors Oracle and EnterpriseDB, respectively. For my work, both database installations and configurations were uneventful. Though I only tested the ODBC drivers, I was able to connect both databases to several front-end tools with no problems.
The next hurdle was the trusty R Project for Statistical Computing along with the updates to the packages I most often use. Like most R upgrades, going from 2.14.1 to 2.15.1 was a snap – I wish all my installs went so smoothly. To maximize the size of memory-limited data I can handle, I use the 64-bit build on my 8 GB RAM machine.
After R came the latest command line and GUI releases of Octave, a high-level interpreted language used for numerical computations. Suspiciously similar in function and syntax to the commercial MATLAB, Octave’s vector, matrix and array computational constructs provide powerful “capabilities for the numerical solution of linear and nonlinear problems, and for performing other numerical experiments.” Octave, like MATLAB, is popular in advanced statistics courses that involve heavy matrix algebra, and is a staple of Stanford professor Andrew Ng’s highly-regarded Coursera offering, Machine Learning. A useful comparison of the numerical features of Octave/MATLAB, R and the numerical Python library, NumPy (see below), is detailed here.
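For a flavor of that Octave/NumPy overlap, here’s a minimal sketch of the kind of linear solve Octave writes as A \ b, expressed with NumPy; the matrix and right-hand side are made-up values for illustration:

```python
import numpy as np

# Solve the linear system A x = b -- the NumPy analogue of Octave's "A \ b"
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)
print(x)  # -> [2. 3.]
```

The same two lines of math look nearly identical across Octave, MATLAB, and NumPy, which is exactly what makes moving among them so painless.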
As my agile language of choice, I teeter between Python and Ruby, preferring the pure object-oriented syntax of Ruby but acknowledging Python’s sway in the marketplace. As a consequence of that indecision, my Python environment was in disarray, 18 months having passed since the last update. Not only did I have to download the latest core language, I also had to update each of the supporting computational packages and modules.
Python 2.7 installed without a hitch, as did the 2D graphical library, matplotlib. I’d forgotten just how powerful matplotlib is. Its “cost,” though, is a fair amount of code to set up the graphics. Probably worth it, however.
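To show what that setup “cost” looks like in practice, here’s a small sketch of a two-curve plot; the figure contents and file name are invented for illustration, and the Agg backend is chosen so it runs headless:

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend; renders without a display
import matplotlib.pyplot as plt
import numpy as np

# The typical boilerplate: figure, axes, labels, legend -- then the plot itself
x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("A minimal matplotlib example")
ax.legend()
fig.savefig("sine_cosine.png", dpi=100)
```

A dozen lines for one chart is more ceremony than R’s plot(), but the payoff is fine-grained control over every element of the figure.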
Numerical computation in Python is handled deftly by the NumPy package. NumPy’s powerful N-dimensional array object and extensive linear algebra functionality conspire to provide much of the capability of MATLAB-Octave to Python programmers. My NumPy update went smoothly.
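A quick sketch of that N-dimensional array object in action; the array values here are just illustrative:

```python
import numpy as np

# Build a 3-dimensional array and exercise a few core operations
a = np.arange(24).reshape(2, 3, 4)

print(a.shape)        # (2, 3, 4)
print(a.sum(axis=0))  # collapse the first axis: a 3x4 array
print(a.mean())       # 11.5 -- the mean of 0..23

# Broadcasting: add a length-4 row vector to every row of the array
row = np.array([10, 20, 30, 40])
b = a + row
print(b[0, 0])        # [10 21 32 43]
```

Vectorized operations and broadcasting like this are what let NumPy stand in for MATLAB-Octave’s matrix idioms without explicit loops.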
I decided to add the SciPy package this go round. SciPy, which provides efficient routines for numerical integration, differential equations, and optimization used by mathematicians, scientists and engineers, depends on NumPy, and is built to work with NumPy arrays. I plan to take SciPy for a spin soon.
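As a first spin of my own, here’s a minimal sketch of two of those routines; the integrand and objective function are toy examples I chose for illustration:

```python
from scipy import integrate, optimize

# Numerical integration: the integral of x^2 on [0, 1] is exactly 1/3
value, error = integrate.quad(lambda x: x ** 2, 0.0, 1.0)
print(value)  # ~0.3333333333

# Scalar minimization: (x - 2)^2 + 1 has its minimum at x = 2
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)
print(result.x)  # ~2.0
```

Both calls take NumPy-compatible functions and return results with error estimates or convergence details attached, which fits the package’s scientist-and-engineer audience.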
I wasn’t so fortunate with RPy, a package I’ve used extensively in the past that interfaces Python to R, allowing Python to “manage all kinds of R objects and execute arbitrary R functions (including the graphic functions).” Sadly, RPy has apparently not been updated recently for Windows, and as a result won’t work with the current 2.15.1 release of R. I’m now investigating RPy2 as a replacement.
Last week, I had the good fortune to participate in an O’Reilly Webcast entitled Python for Data Analysis, by Wes McKinney, author of a soon-to-be-released book of the same name. McKinney introduced me to IPython, an interactive shell for execution of Python code that provides “a web-based notebook with the same core features but support for code, text, mathematical expressions, inline plots and other rich media … (as well as) support for interactive data visualization and use of GUI toolkits.” After installing IPython, I can interactively execute scripts from Python with calls to matplotlib, NumPy and other libraries – capabilities similar to those I have now with R and Octave. Good stuff.
The Webcast also delivered a wealth of information on using Python for data analysis and “munging”. Much of the hour was spent discussing the Python package pandas, built on top of matplotlib and NumPy, which “aims to be the fundamental high-level building block for doing practical, real world data analysis in Python … (with the) broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.” Certainly an ambitious goal, though I was impressed with the capabilities McKinney demoed, and plan to invest the effort to include Python/IPython/pandas in my computational tool chest.
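To give a taste of the style McKinney demoed, here’s a small sketch of pandas filtering and aggregation; the DataFrame and its contents are invented for illustration:

```python
import pandas as pd

# A small table of the kind pandas is built to "munge" (values are made up)
df = pd.DataFrame({
    "tool": ["MySQL", "PostgreSQL", "R", "Octave", "NumPy"],
    "type": ["database", "database", "stats", "numeric", "numeric"],
})

# Filter rows by a condition, then group and count in one expressive line each
numeric_tools = df[df["type"] == "numeric"]
counts = df.groupby("type").size()

print(numeric_tools["tool"].tolist())  # ['Octave', 'NumPy']
print(counts["database"])              # 2
```

The labeled-axis DataFrame plus one-liner split-apply-combine operations are what give pandas its R-like feel inside Python.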
I’ll have more to say about data analysis with Python in subsequent blogs.