I'm very excited about some of the work my company OpenBI, a BI consultancy focused on open source (OS) solutions, is currently doing. Over the last month or so, we've been collaborating with OS analytics database leader Infobright and OS BI platform vendor Jaspersoft -- with special sauce from the freely available R Project for Statistical Computing -- to build an enterprise-level, cloud-based analytics demo to showcase at the upcoming MySQL User Conference.
The plan was to assemble a complete open source BI prototype: database management, ETL, query/reporting, dashboards, OLAP, and statistical models/graphics. We decided early on to use real monthly data from the Current Population Survey, and were then off to the development races. We started work on a script that included database/table definitions, ETL, and comprehensive BI and statistical back ends. We'd deploy the completed script in turn to our Amazon cloud account. No sweat, I thought. We've honed our expertise on open source BI in the cloud. We've performed this drill several times already. Alas, not so fast ...
We had no problem with the initial database and ETL sides, correctly loading all 206,404 records to Infobright. In fact, we had little difficulty deploying the initial demo reports, dashboards, and OLAP cubes. We were even able to seamlessly integrate R with Jaspersoft to run statistical scripts that exploit R's powerful graphics capabilities as if they were native to Jaspersoft. What I hadn't planned for in our tight schedule, however, were the data problems and delays that routinely plague BI development. Shame on me!
The team quickly got the raw data loaded correctly using JasperETL. The main data file actually represented two different record types, household and individual. Once loaded, we conducted sanity inspections of each of the attributes and further reviewed individual tabulations and summaries to sync with those posted on the census site.
Our travails started when we looked to condense the initial data tables to consider households, individuals, and income-generating adults, respectively. It took a bit of experimentation to find the right combinations of attributes and categories to restrict the data as we wished. Truth be told, I cheated a bit at the end, eliminating several hundred records of full-time employed adults with little or no income.
We committed more time than originally allocated to cross tabulate, summarize, and validate separate variables we knew to be related. With over 100 attributes, many of which relate in theory, there was quite a bit of sanity-checking work to be done before we were comfortable with the data.
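The demo itself used JasperETL and R, but the sanity-check idea is easy to illustrate. A minimal sketch in Python/pandas, with hypothetical attribute names and values standing in for the real CPS columns: a cross tabulation of two variables that should agree in theory quickly surfaces implausible combinations.

```python
import pandas as pd

# Hypothetical sample mimicking two related CPS-style attributes;
# employment status and income bracket should agree in theory.
df = pd.DataFrame({
    "employment": ["employed", "employed", "unemployed", "retired", "employed"],
    "income_bracket": ["25-50k", "50-75k", "none", "none", "25-50k"],
})

# Cross tabulate the pair: any count in a cell that should be
# empty (e.g. employed adults with no income) flags a data problem.
check = pd.crosstab(df["employment"], df["income_bracket"])
print(check)
```

Repeating this for each related pair of attributes is tedious but cheap insurance before the data reaches reports and models.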
Other challenges emerged from both the absolute number of attributes and the granularity of the important categorical variables. There are, for example, 21 separate values for race, and 17 categories for education -- far more than are sensible for basic OLAP and statistical analysis. So for these and several other categorical attributes, we created new items collapsing levels of the old to create more compact, analytic-friendly dimensions. Of course, testing the new creations took time not allocated on the plan.
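The collapsing itself is a simple recode. A sketch in Python/pandas, with an illustrative (not actual CPS) mapping from fine-grained education labels to a handful of analysis-friendly levels:

```python
import pandas as pd

# Hypothetical mapping collapsing many education categories into a
# compact dimension; label names are illustrative only.
collapse = {
    "5th or 6th grade": "Less than high school",
    "9th grade": "Less than high school",
    "High school graduate": "High school",
    "Some college but no degree": "Some college",
    "Bachelor's degree": "College degree",
    "Master's degree": "College degree",
}

education = pd.Series(
    ["9th grade", "Bachelor's degree", "Some college but no degree"]
)
# map() recodes each value; anything missing from the dict becomes
# NaN, which itself is a useful check that the mapping is complete.
education_grp = education.map(collapse)
print(education_grp.tolist())
```

The same pattern applies to race and the other over-granular dimensions; the testing burden comes from verifying that every original level lands in exactly one new bucket.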
We also had to map the text dimension columns to their integer-valued counterparts. Education, for example, came in two flavors: an attribute of labels such as "5th or 6th grade" and "Some college but no degree," and one of corresponding arbitrary integers like 33 and 40. The challenge was to sync those representations so that the ordering of categories for OLAP and statistical analysis would be sensible. Left to their own devices, most analytics tools present categories in alphabetical order -- almost certainly not what is desired.
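One way to sync the two representations, sketched in Python/pandas with hypothetical label/code pairs: sort the labels by their integer counterparts, then declare an ordered categorical so downstream tools respect that order instead of falling back to alphabetical.

```python
import pandas as pd

# Hypothetical pairing of education labels with their arbitrary
# integer codes (values are illustrative, not the real CPS codes).
codes = {
    "5th or 6th grade": 33,
    "High school graduate": 39,
    "Some college but no degree": 40,
    "Bachelor's degree": 43,
}

# Order the labels by their integer counterparts, then build an
# ordered categorical so sorts and groupings follow that order.
ordered_labels = sorted(codes, key=codes.get)
education = pd.Categorical(
    ["Bachelor's degree", "5th or 6th grade"],
    categories=ordered_labels,
    ordered=True,
)
print(list(education.categories))
```

With the ordering baked into the column, OLAP drill-downs and statistical summaries both present education levels in their natural progression.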
The additional data work expanded our demo deployment by over 30%. As a stats guy who routinely sets expectations that the data side comprises more than three quarters of BI effort, I felt a bit humbled by my oversight. Hopefully, this refresher lesson will stay with me for a while this time!