I had a nice conversation with an old stats colleague the other day. He's still a SAS analyst, while I've mostly migrated to R, though I like SAS-clone WPS a lot as a data science platform. We reminisced about the '80s and '90s of statistics and pre-data-warehouse analytics, a time when SAS was not only our statistical package of choice but also a primary data integration programming platform. The discussion evolved to the topic of data exploration, which occupies much of our DS effort today. Fortunately, the tools we have to explore data have advanced considerably over the years.
My friend and I agreed that the first priority with a new data set is determining the distribution of values for each of the attributes. Initially, we wish to see frequencies for the responses of each variable. Those give us a general sense of the data, its distribution and its quality. For categorical attributes, we prefer to visualize frequencies sorted from most to least in an unadorned graphic; for numeric attributes that assume many different values, we like histograms and perhaps even the more sophisticated kernel density plots to detail the shape of the data. In the end, we decided to share the tools in our chests for showcasing data distribution -- mine in R, his in SAS.
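As a minimal base-R illustration of that first step -- sorted frequency counts for a categorical attribute -- consider a made-up vector (the actual analyses below use data.table at much larger scale):

```r
# Toy example: frequency counts sorted from most to least common.
# The state abbreviations here are illustrative only.
x <- c("IL", "CA", "IL", "NY", "CA", "IL")
freqs <- sort(table(x), decreasing = TRUE)
print(freqs)
```

Even this one-liner gives a quick sense of a variable's distribution before any graphics are drawn.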
I noted in a blog a few weeks back that the data.table package is starting to take the R community by storm. data.table substantially extends the functionality of R's ubiquitous data.frame and, in addition, provides many performance and capacity enhancements. For me, it's the go-to data management package in R today.
Putting together a fast frequencies function powered by data.table is a snap, as is generically superimposing a nice lattice package visual around the results. Consider an example from the provider utilization and payment data set recently published by the Centers for Medicare & Medicaid Services. The 1.7G CSV file consisting of over 9.15M records with 27 attributes loads into an R data.table in little more than a minute on my 16G RAM Wintel notebook.
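A sketch of what such a frequencies function might look like -- the function name is hypothetical and the commented-out file name is illustrative, not the actual code behind the figures:

```r
library(data.table)

# fread loads the large CMS CSV quickly into a data.table
# (file name below is illustrative):
# dt <- fread("Medicare_Provider_Util_Payment_PUF.txt")

# Hypothetical frequencies function: count rows per category,
# order descending by count, and attach a percent-of-total column.
frequencies <- function(dt, var) {
  f <- dt[, .(count = .N), by = var][order(-count)]
  f[, pct := 100 * count / sum(count)]
  f
}

# e.g. frequencies(dt, "nppes_provider_state")
```

The `by` grouping with `.N` is what makes this fast even on 9M+ records; the result is itself a data.table ready to feed a lattice plot.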
The frequencies visual for 3 attributes is presented in Figure 1. 1a displays the counts for “nppes_provider_state” in descending frequency order with a natural zero point in the lattice package dotplot. The graph includes 60 of the 61 distinct values, accounting for a high percentage of total records. Optional parameters to the calling function allow inclusion of all categories. Note the non-states like VI and GU that may be subsetted out of further analyses.
The 1b dotplot displays “provider_type”, limiting to just the first 30 of 89 distinct values that nonetheless account for almost 90% of the cases. And finally, 1c presents counts for the first 60 of the 880,000 individual providers, “npi”. The top care-giver is represented in over 600 of the individual records. The almost seven figures of category-counts are returned in just 2 seconds.
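The dotplots above might be approximated with lattice along these lines; the toy counts below stand in for the real CMS frequencies:

```r
library(lattice)

# Illustrative data standing in for actual provider_type counts.
f <- data.frame(
  provider_type = c("Internal Medicine", "Family Practice",
                    "Nurse Practitioner"),
  count = c(1100, 900, 400)
)

# reorder() sorts the categories by count so the dotplot reads
# from most to least frequent, top to bottom.
p <- dotplot(reorder(provider_type, count) ~ count, data = f,
             xlab = "count", main = "provider_type frequencies")
print(p)
```

Because lattice functions return a trellis object, the same call can be wrapped generically around the output of any frequencies computation.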
Well-distributed numeric variables like “average_submitted_chrg_amt” are too granular for frequencies visuals. Better for these are graphs like the histogram, which “bins” the attributes into ranges, and the densityplot, which plots kernel density estimates. You can think of a densityplot as a histogram where the number of bins is infinitely large.
Figure 2 displays both a histogram and a densityplot for the base 10 log of billing attribute average_submitted_chrg_amt for the state of Illinois cases. Why the log? average_submitted_chrg_amt is quite skewed to the right -- a long tail of very large charges -- and the log transform makes the distribution much more "bell-like" and thus easier to visualize. 2a depicts the histogram, while 2b represents the densityplot. Both are scaled to density, with the 387,623 records representing the area under the curves.
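A sketch of the Figure 2 approach, using simulated right-skewed charges rather than the actual CMS data:

```r
library(lattice)

# Simulated right-skewed "charges", a stand-in for
# average_submitted_chrg_amt; lognormal gives the long right tail.
set.seed(42)
charges <- rlnorm(10000, meanlog = 5, sdlog = 1.2)

# The log10 transform tames the tail and makes the shape bell-like.
lcharges <- log10(charges)

# Histogram scaled to density, so area under the bars sums to 1.
h <- histogram(~ lcharges, type = "density", nint = 50,
               xlab = "log10(charge)")

# Kernel density estimate: in effect a histogram with
# "infinitely many" bins.
d <- densityplot(~ lcharges, plot.points = FALSE,
                 xlab = "log10(charge)")
```

Printing `h` and `d` side by side reproduces the histogram/densityplot pairing described above.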
The combination of frequency counts driven by simple dotplots for categorical attributes, along with histograms/densityplots for granular numeric variables, presents a powerful one-two punch for initially assessing the distribution of variables in a data set. Add in variable transformations and data subsetting and you have a comprehensive first-look lens. And the next exploration step for me? Tableau-R, Spotfire-R or Omniscope-R to investigate multivariate relationships.