I’ve gotten quite a bit of reaction to my blog on Pretty Big Data.
One respondent wondered what a highly technical write-up like that was doing in Information-Management.com, but most of the mail has been favorable. Several correspondents expressed surprise at the performance of the VectorWise analytical database, and a few others liked the idea of linking VectorWise with R through ODBC/JDBC.
One writer, though, said while what I’d shown was all well and good, my examples were against data that were already aggregated before I invoked R functions and visuals. How could that approach work on the detail record level of the 120M+ row table? Touche. I’d been wondering the same thing!
It’s no secret that R’s memory-bound, with the size of data sets it can accommodate limited to something well less than available memory. And while R can handle hundreds of millions of records on Linux hardware with souped-up memory, no way my “little” 8G machine is going to load 120M records with over 20 attributes into an R data frame. 20M, maybe. So what’s an analytics geek to do? Randomly sample cases from the “big data” population, of course!
Over the last week I’ve experimented with two different approaches of sampling records from my 120M row database table into R data frames for subsequent analyses. I now believe both of these techniques are viable for R/VectorWise analytics.
The first method is a distinctly VectorWise SQL solution. VW supports a number of “random” functions, the simplest for my need being randomf, which returns a random floating point number between 0 and 1. If I ‘select * from flight where randomf() <.25’, for example, roughly 25% of the rows, chosen randomly, will be returned. So if I wish to sample 200,000 records, my where clause becomes ‘where randomf()<200,000/123,000,000’. If I invoke such a query in an RODBC session, all rows where randomf() is less than that constant will be returned to an R data frame. Voila.
Well, almost. Delivering records based on randomf() only will not guarantee exactly the sample size I request. For that, I’d have to ask for more than is needed, then use VW’s syntax to limit the number of selected rows actually fetched. A bit clunky but workable.
A second, more complicated but potentially more rewarding technique, combines capabilities of both R and VectorWise. The rub is that I have to add a sequential “rownum” identifier field to the flight table first. With that, given a sample size of, say, 500,000, from the 120M, I invoke R’s powerful “sample” function to generate a vector of random rownums, sorting them and writing a text file – a very fast operation. I then call a VW script that bulk loads the text file into a VectorWise “join” table. That data, in turn, is merged in a query with the big flight table to return the half a million records, which are finally loaded into an R data frame.
I tested both methods extensively with sample sizes ranging from 2,000 to 5,000,000 on the flight data. In VectorWise, the turnaround time has as much to do with the number of columns returned as the number of rows. With the 17 attributes of the flight table I deem most important, the elapsed times range from 15 seconds to little more than a minute for just about all queries.
And, non-intuitively, the first technique is not much faster than the second – maybe 15%. That’s testimony to VectorWise’s join performance – a 5,000,000 row join without an index on a run-of-the-mill relational database would choke my notebook. The advantage of the second technique is that I can maintain information on precisely which records are in each sample over time.
Truth be told, I was surprised – and happy – with the performance of both methods. I even use them to write csv files to load into Tableau for visual exploration. Now if I can only find a way to have R operate on VectorWise data natively, without R size constraints!
This week I’m off to the International R Users Meeting, useR!-2012, at Vanderbilt University in Nashville. I’m honored to be on the Program Committee of a conference alongside some of the top statisticians and computational programmers in the world. My big challenge? Choosing among the many outstanding-looking presentations that’ll occur simultaneously. I’ll blog and tweet on my experiences.