OpenBI’s college recruiting starts the week after next and goes through October. I’m looking forward to the campus visits and anticipate we may hire as many as three students, two for next June and one for January. We’re already started thinking about a revamped training curriculum for the “newbies”.
To that end, I’ve found what I think might be a good source of data for exercises in Python, R and Tableau. The data sets derive from the Public Use Microdata Sample (PUMS) of the American Community Survey (ACS), and include an extensive dictionary of attributes, such as residence, age, sex, race, education, occupation and income, on the population sample.
I set out to take a look at the 2007-2011 ACS 5-year PUMS data. Downloading the hefty file was a snap using Python’s urllib library. Extracting the multiple CSV file contents of the zipfile was straightforward as well. In the end, there were four identically-organized CSV files totaling over 10 gigabytes, in excess of 15 M cases, and hundreds of attributes, most of which weren’t useful.
No way could I load all that data into Python memory, so I scoured the documentation to determine a small set of attributes of interest. Since the variable names were included in the header for each file, I was able to write a quick Python script to create a consolidated CSV file with just the desired variables from each input. That CSV was, in turn, used as input to R to produce a data frame.
On my 16 GB RAM notebook, memory-ravenous 64 bit R had no problem absorbing the 15 M+, 12 attribute records. The meta-data, existing R attributes and some simple code collaborated to formulate several additional variables. The first was an 8-category educational attribute, the second combined 2 features to divine a 5-category race factor. I then built a 7 M+ record subset of the “working” population – those 18 years of age or more with wages in excess of $1000/year. Finally, I randomly sampled 80% of the working subset, a bit less than 5.7 M rows, to use as training data, the remainder held out for test. I saved the data sets for further R and subsequent Tableau processing. I’ll take on the Tableau visualization on another occasion.
I continue to be impressed with R’s “big data” performance on my notebook. I was able to execute several models predicting wage as a function of age, sex, education and race against all 5.7 M training records in reasonable – less than 2 minutes – time. Only the computationally-intensive gradient boosting failed due to inadequate resources.
Once I completed the models and accompanying graphs, I just had to brag about the work – maybe six hours total – to Bryan, a skeptical OpenBI partner. If I’m a credulous proof of concept kind of guy, Bryan’s ever the hard-nosed, production cynic. And that’s good, because he delivers BI applications to customers.
When I discussed the data and analytics challenges with Bryan, his “BI” take was to use the freely-available Pentaho Data Integration (PDI) platform for ETL, rather than Python. He confided he also would likely land the data in a MySQL or Postgres staging repository for further scrutiny. The calcs I performed in R he thought might just as easily be done in PDI.
Bryan wasn’t dogmatic about his thoughts, however, opining that one-off analyses don’t generally mandate production-level tools and scrutiny. He said he’d build the R and Tableau datasets with PDI, since even as a quick-and-dirty, that’s a tool he knows well. His ultimate message: use tools in accord with both what you know and the deployment/maintenance challenges.
One point we did agree on was that, be it data science or business intelligence, an ETL platform such as PDI from Pentaho should be front and center in the practitioner’s tool chest. Much as I enjoy programming in Python and as powerful as the language is, the high-level visual data movement and transformation abstraction of a tool such as PDI wins for production-capable data migration – and maybe even for some one-off’s.
Add PDI (or a similar, readily-accessible ETL platform) to the data science arsenal.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access