A few months ago, I conversed with friends at Revolution Analytics about “pretty” big data for statistical analysis. David Smith and Joe Rickert of RA had read one of my blogs and expressed interest in the multi-million record census data set I’d built from the Public Use Microdata Sample (PUMS).
I, on the other hand, was intrigued by the 120M+ record file RA was using to showcase their RevosScaleR big data statistical analysis capabilities. Turns out the “airline on time performance data” they analyzed was accessible from an American Statistical Association website, input for an ASA competition on statistical computing and graphics. I decided to see what I could come up with from the data set – though well after the contest had concluded.
At the beginning, I’d planned to simply programmatically read the files directly from the website, but decided instead to first download them individually to my notebook. Then I consolidated the 22 yearly files into a single large one. The ruby program that did the work also checked to assure that each record was complete and that there were no egregious missing data anomalies.
The aggregate data file consists of characteristics of a large percentage of commercial flights within the U.S. between 1987 and 2008. Among the attributes are airline carrier, flight number, year, month and day of departure, time of scheduled and actual departure, time of scheduled and actual arrival, origin and destination airports, departure and arrival delays, flight distance, scheduled and actual elapsed flight times, airplane tail number, cancellation and diversion codes, and a number of delay attributes.
One problem with the data is that while there are both departure and arrival time attributes, there’s only a single flight date for each record. This leads to what I believe are a number of calculation errors involving end of day departures where flights arrived at their destination the following day. In some cases, there appear to be erroneous large negative departure and arrival delays. Rather than attempt to fix the bad data, I simply eliminated suspicious records. Fortunately, the pool of such cases represent less that .2% of the total.
I knew there was no way that Windows community edition R would be able to accommodate over 120M records in a data frame. At the same time, though, I wanted to use some of my favorite R functions and graphics on the data. The good news is that an awful lot of meaningful analyses can be constructed from basic attribute frequencies. So rather than attempt to work with 120M records, I was often able to first summarize the data in multiple ways, then feed the attribute counts to R procedures. Kind of the like the query-enhancing aggregate tables we all know and love in BI.
The World Programming System (WPS) SAS-clone software came to my rescue first. Because of its virtual memory management, I was able to create a WPS data set of all 121M records, then use proc freq to build individual data sets of frequencies as needed. For example, I generated counts by airline carrier and by hour of day, by carrier and year, and by carrier and delay minutes. From these frequency data sets, I used Bridge2R to build corresponding R data frames. At that point I’d aggregated data that could be submitted to many R functions that accept count data for quick computation and visualization. The 121M record WPS data set took about half an hour to build. It took another 5 minutes to create five aggregated frequency data sets. With those, the Bridge2R and R work was very fast. The process is a bit clunky, but it works.
Flush with WPS/R success, I decided to test out the high-performance VectorWise analytic database from Actian, connected by ODBC to R via the RODBC package. The performance of this combination was a head turner. It took 10 minutes to load the 121M records into a VW table. With ODBC connectivity installed, I was able to issue count-groupby SQL statements in my R session that quickly returned frequencies from the database into R data frames. Once the counts were in R, the visuals were produced immediately.
I’m really just starting to see what can be done with the combination of VectorWise and R, but have been gratified by the results so far. Figure 1a shows the frequencies of flights by carrier and by hour of departure using R’s unsexy but highly-informative dot plots. In each case, the attribute of interest is sorted by count size for ease of interpretation. Figure 1a shows DL (Delta Airlines) having the most flights over the 1987-2008 time period. 8 AM is the most frequent departure hour, as indicated in 1b.
Figure 2a displays the on-time percentage of carriers aggregated over 21 years. OT percentage is defined as the portion of flights that arrive within 15 minutes of schedule. 2b is the same data with a zero minimum x axis. For these visuals I joined in the full carrier name.
Figure 3 is a kernel density function fit to arrival delay minutes. By default, the R kd function I used expects record-level data, but accepts frequencies as an option. The median delay is zero with quite a long positive tail. I’m pretty confident some of the extreme points are in error.
High-end quantiles for arrival delays by month are displayed in Figure 4, produced by an R function that can also accept frequencies. It’s not surprising that the travel-intense, summer months are the worst for delays, followed by the end of year holidays. September and October appear to be optimal for delay-avoiders.
Finally, Figure 5 contrasts over-time performance of 4 airlines: Delta (DL), American (AA), United (UA) and Southwest (WN). 5a shows the number of flights for WN steadily increasing from 1987 to 2008, while the three others display steep drops after 9/11 from which they still hadn’t recovered in 2008. Perhaps 5b provides some explanation: WN generally had better on-time performance than the three competitors.
Total elapsed time on my notebook computer for the database, statistical and visual computations for the five analyses? Less than half a minute! I’ll report more on this data in a subsequent blog.