About a month ago, I downloaded an open source visualization tool for a quick evaluation. I had planned on using my trusty delimited file of census data, with 276,000 records and several dozen attributes, to see how the tool performed with real world volumes. Alas, I ran into a problem right away. My data file, which consists of roughly equal numbers of numeric fields and string fields interpreted as category values, failed to load. It seems the tool could only handle commas as delimiters, whereas my file used semi-colons. And I couldn't simply change all semi-colons to commas, since the text fields had embedded commas that would subvert the load with a comma delimiter. What I needed to do was first surround each text field with quotes to protect the embedded commas, then change the semi-colon delimiter to a comma.
I could easily have solved my problem using OpenOffice Calc for a one-off solution with CSV files, but decided instead to dust off a favorite open source tool, the programming language Ruby. The short script I ended up writing first looked at 100 records to get a consensus of which fields were numeric and which were text, storing the text field column positions. It then read all records, prefixing and appending quotes to text fields. Finally, the script changed the delimiter from semi-colon to comma, writing its results to a file that appeased the visualization tool loader. It took me a couple of hours, but I now have a tool that can generically address this type of data problem.
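For readers who want to try something similar, here is a minimal sketch of the steps just described. It is not the original script. The helper names (`numeric?`, `text_columns`, `quote`, `convert`) are hypothetical, and it rests on a few assumptions: "consensus" is taken to mean a simple majority vote over the non-blank sampled values, a field is numeric if Ruby's `Float()` can parse it, the file has no quoted fields to begin with, and the whole file fits in memory.

```ruby
# Sketch only: helper names and the majority-vote rule are assumptions,
# not details taken from the original script.

SAMPLE_SIZE = 100 # number of leading records used to classify columns

# A field counts as numeric if Kernel#Float can parse it.
def numeric?(field)
  Float(field.strip)
  true
rescue ArgumentError
  false
end

# Returns the positions of columns judged to be text: those where more
# than half of the non-blank sampled values fail to parse as numbers.
def text_columns(rows)
  width = rows.map(&:length).max || 0
  (0...width).select do |col|
    values = rows.map { |r| r[col] }.compact.reject { |v| v.strip.empty? }
    text_votes = values.count { |v| !numeric?(v) }
    text_votes * 2 > values.length
  end
end

# Wraps a field in double quotes, doubling any quotes already inside it.
def quote(field)
  '"' + field.gsub('"', '""') + '"'
end

# Splits each line on in_delim, quotes the text columns, and rejoins
# the fields with out_delim. Returns an array of output lines.
def convert(lines, in_delim: ';', out_delim: ',')
  rows = lines.map { |line| line.chomp.split(in_delim, -1) }
  text_cols = text_columns(rows.first(SAMPLE_SIZE))
  rows.map do |fields|
    fields.each_with_index
          .map { |f, i| text_cols.include?(i) ? quote(f) : f }
          .join(out_delim)
  end
end

# Command-line use: ruby requote.rb input.txt output.csv
if __FILE__ == $PROGRAM_NAME && ARGV.length == 2
  File.write(ARGV[1], convert(File.readlines(ARGV[0])).join("\n") + "\n")
end
```

One limitation of this approach: a stray text value containing a comma in a column voted numeric is left unquoted and would still break the load, so a larger sample or a stricter rule may be needed for messier files.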
Developed by Yukihiro Matsumoto (“Matz”) in 1993 and released to the public in 1995, Ruby's considered an agile language, a much preferred moniker to the earlier and pejorative scripting language. Depending on whom you ask, agile language can mean quite a few things. Among them is that Ruby:
- is suitable for both beginners and experts,
- is simple and elegant yet powerful,
- is one-pass compile/execute,
- supports dynamic typing,
- supports rapid development,
- is transparent -- has high-level programming constructs geared to the problem domain,
- can be used for both simple scripting and advanced application development,
- is object-oriented and extensible,
- is portable,
- is freely-available open source, and
- has an abundance of add-on libraries built by a community of developers.
Tasks in Ruby are typically completed with less than half the code of Java or C++ -- and generally the programming time is considerably less as well.
For my BI needs, Ruby is the latest and best “data munging” interpreter in the venerable tradition of Awk, Perl and Python – tools I've programmed with productively for 30 years. Indeed, besides SAS, Perl was generally my choice for data warehousing ETL before the emergence of Informatica and Ascential in the 1990s. But Perl, “The Duct Tape of the Internet”, is now dated in comparison to Python and Ruby, both object-oriented at core. According to Matz, “Perl and Python were not exactly what I was looking for. I wanted a language more powerful than Perl and more object-oriented than Python.” Ruby's certainly more OO-modern than Python, though Python still has a larger user base and more external libraries, reflecting its seniority. I much prefer Ruby to Python or any other language now, but sometimes choose Python for its special functionalities.
Ruby Monday – Part 2 will detail the features of Ruby that I believe make it particularly useful for BI analysts.
Steve Miller's blog can also be found at miller.openbi.com.