Ruby Monday – Part 1 introduced the agile programming language Ruby as my current favorite “data munging” tool in the tradition of Awk, Perl and Python. Today's post elaborates on Ruby's suitability for the basic data management tasks that precede BI and analytics. In next week's finale, I'll outline the “Ruby solution” to my stock portfolio returns simulation.
For data manipulation tasks, Ruby has modern control structures, powerful data constructs and versatile functions/methods that make it easy to craft “munging” programs. Full-blown regular expression syntax and exhaustive string-handling capabilities facilitate text search and replace. With powerful Ruby containers/collections (arrays, enumerables and hashes) it's easy, for example, to organize complicated data in memory, pivot text files, merge data sets, compute frequencies and cross tabs, and do complex array calculations with set and range manipulation. Ruby data structures make sorting and control break processing simple, and posting delimited output files a snap. The built-in libraries include functions to manage operating system files, directories and IO, making Ruby quite suitable for system administration. Ruby also works deftly with the outside world: it can consume run-time arguments and manage processes with the OS's pipe and fork commands. I often use these capabilities to wrap R functions and invoke the statistical package in batch from the command line. A powerful exception handling capability is built in, while blocks and iterators promote code economy, supplanting loops for most aficionados. Indeed, the more I work with the language, the more convinced I become that, short of a sophisticated ETL environment, Ruby's an ideal choice for developing BI data movement programs.
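As a rough illustration of the hash-based frequency and cross-tab work described above, here is a minimal sketch. The sample records, the `region,product,units` layout and the method names `frequencies`, `crosstab` and `delimited` are all hypothetical, made up for this example rather than taken from any real program.

```ruby
# Hypothetical sample data: one "region,product,units" record per line.
LINES = [
  "east,widget,10",
  "west,gadget,5",
  "east,gadget,7",
  "west,widget,3",
  "east,widget,2"
].freeze

# Frequency count: number of records per region.
def frequencies(lines)
  counts = Hash.new(0) # missing keys start at zero
  lines.each do |line|
    region, _product, _units = line.split(",")
    counts[region] += 1
  end
  counts
end

# Cross tab: total units by region (rows) and product (columns).
def crosstab(lines)
  # Each new row key gets its own zero-defaulting inner hash.
  table = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  lines.each do |line|
    region, product, units = line.split(",")
    table[region][product] += units.to_i
  end
  table
end

# Render the cross tab as tab-delimited text, sorted by row and column label.
def delimited(table)
  columns = table.values.flat_map(&:keys).uniq.sort
  header = (["region"] + columns).join("\t")
  rows = table.keys.sort.map do |row|
    ([row] + columns.map { |col| table[row][col] }).join("\t")
  end
  ([header] + rows).join("\n")
end

puts delimited(crosstab(LINES))
```

The block form of `Hash.new` is what keeps the nested table short: there is no need to check whether a row exists before adding to one of its cells.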
In addition to the built-in modules, there are standard libraries that extend the capabilities of the language. Date and time manipulation is straightforward. The Net set of HTTP, IMAP, POP, SMTP, and Telnet libraries facilitates network and Internet access. Benchmarking and profiling packages support the calibration of Ruby programs. Test::Unit provides a unit testing framework borrowed from Smalltalk. RubyGems, a standardized packaging framework, makes it easy to install, upgrade and uninstall packaged Ruby code, both locally and remotely. CGI, SOAP, sockets and threads are supported by libraries of the same names, and there are several variants of XML libraries. Collaboration with Windows is through Win32API and WIN32OLE. Tk is a graphical interface recognizable to Perl and Tcl enthusiasts. Even left-over Curses fans from the '80s are not forgotten.
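To give a flavor of two of these standard libraries, the sketch below uses `date` for calendar arithmetic and `benchmark` for timing. The helper names `days_between` and `month_ends` and the sample dates are hypothetical, chosen only for illustration.

```ruby
require "date"
require "benchmark"

# Whole days between two date strings, e.g. for holding-period calculations.
def days_between(start_text, end_text)
  (Date.parse(end_text) - Date.parse(start_text)).to_i
end

# Last calendar day of each month in a year; a day of -1 counts from month end.
def month_ends(year)
  (1..12).map { |month| Date.new(year, month, -1) }
end

puts days_between("2008-01-01", "2008-12-31")
puts month_ends(2008).first(3).map(&:to_s).join(", ")

# Time a block of work; the result is elapsed wall-clock seconds as a Float.
elapsed = Benchmark.realtime do
  10_000.times { days_between("2008-01-01", "2008-12-31") }
end
puts format("10,000 date differences took %.4f seconds", elapsed)
```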
One of the attractions Ruby shares with the R Project for Statistical Computing is an abundance of freely-available outside libraries built by an enthusiastic open source community. Perhaps the most important of these for BI programmers is the database-independent interface, DBI, which “provides an abstraction layer between Ruby code and the underlying database, allowing you to switch database implementations really easily. It defines a set of methods, variables, and conventions that provide a consistent database interface, independent of the actual database being used.” The database drivers (DBDs) front-ended by DBI that I've used include Oracle, ODBC and MySQL. Programmers who've embedded SQL in 3GL code will recognize the cursor, prepare, execute and fetch idiom. I'm also quite fond of Rio, the Ruby I/O Facilitator, which provides a single interface to all Ruby I/O libraries, including the Web. And my understanding of the basic collaborative filtering analytics behind the Netflix Prize came from a quite useful Ruby library implementing singular value decomposition, SVD.
Ruby is more than just an easy-to-use structured programming language, offering support for modules and full-blown object orientation. Programmers design classes, which are instantiated as objects that are manipulated with methods and messages. Classes can inherit from other classes as refinements and are never closed: programmers can always add methods to the classes they build as well as to Ruby's built-in classes. Though Ruby doesn't support multiple inheritance directly, its mixin concept provides a controlled, multiple-inheritance-like capability. If Ruby's rich data and programming constructs are not enough, the language can be extended in low-level C through an API.
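The mixin and open-class ideas can be sketched in a few lines. The module `Summarizable` and the class `Returns` below are hypothetical names invented for this example; `Enumerable` and `Array` are Ruby's own.

```ruby
# A mixin: a module of methods that any class can include.
# It assumes the including class responds to #to_a (Enumerable provides it).
module Summarizable
  def mean
    values = to_a
    return nil if values.empty?
    values.inject(0.0) { |sum, value| sum + value } / values.size
  end
end

# A user-defined class picks up behavior from two modules at once,
# which is Ruby's controlled stand-in for multiple inheritance.
class Returns
  include Enumerable   # supplies max, min, sort, to_a, ... on top of #each
  include Summarizable # supplies mean

  def initialize(*values)
    @values = values
  end

  def each(&block)
    @values.each(&block)
  end
end

# Classes are never closed: reopen the built-in Array and mix in the same module.
class Array
  include Summarizable
end

monthly = Returns.new(2, -1, 5)
puts monthly.mean
puts monthly.max
puts [1, 2, 3].mean
```

Reopening a built-in class affects every array in the program, so this is a convenience best used sparingly.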
There's no shortage of guides for aspiring Ruby programmers. I purchased several books a few years back, and the number of titles has grown appreciably since. I especially like Programming Ruby, The Pragmatic Programmer's Guide, dubbed PickAxe by devotees. This book, in tandem with the accompanying website curricula, provides a compelling learning environment. I notice there's an updated edition covering Ruby 1.9; it might well be time to re-invest in the latest version. I'm sure the newer O'Reilly Media books are excellent as well. Those in a hurry to get started with Ruby should spend time with Mitch Fincher's outstanding online tutorial.
And lest one think that Ruby is only for programming in the small, the app/dev framework Ruby on Rails, according to expert Nathan Torkington, O'Reilly Program Chair for OSCON, “is astounding. Using it is like watching a kung-fu movie, where a dozen bad-ass frameworks prepare to beat up the little newcomer only to be handed their asses in a variety of imaginative ways.” It's hard not to be impressed with the screencast showcasing Rails development.
Rails claims an enthusiastic cadre of contributors in addition to the core development team, much like R. Though it may be partly happenstance, I've spoken to two prospects over the last few weeks who are ecstatic enterprise users of Rails. If Ruby's success follows a trajectory similar to R's, look for it to continue growing rapidly as a development platform of choice. I know at least one “organization” that's already on the bandwagon!
Steve Miller also blogs at miller.openbi.com.