I’m very excited about some of the work my company OpenBI, a BI consultancy focused on open source (OS) solutions, is currently doing. Over the last month or so, we’ve been collaborating with OS analytics database leader Infobright and OS BI platform vendor Jaspersoft -- with “special sauce” from the freely-available R Project for Statistical Computing -- to build an enterprise-level, cloud-based analytics demo to showcase at the upcoming MySQL User Conference.

The plan was to assemble a complete open source BI prototype – database management, ETL, query/reporting, dashboards, OLAP, and statistical models/graphics. We decided early on to use real monthly data from the Current Population Survey, and were then off to the development races. We started work on a script that included database/table definitions, ETL, and comprehensive BI and statistical back ends. We’d deploy the completed script in turn to our Amazon cloud account. No sweat I thought. We’ve honed our expertise on open source BI in the cloud. We’ve performed this drill several times already. Alas, not so fast ...

We had no problem with the initial database and ETL sides, correctly loading all 206,404 records to Infobright.  In fact, we had little difficulty deploying the initial demo reports, dashboards, and OLAP cubes. We were even able to seamlessly integrate R with Jaspersoft to run statistical scripts that exploit R’s powerful graphics capabilities as if they were native to Jaspersoft. What I hadn’t planned for in our tight schedule, however, were the data problems/delays that routinely plague BI development. Shame on me!

The team got the raw data loaded correctly quickly using JasperETL. The main data file actually represented two different record types, household and individual. Once loaded, we conducted sanity inspections of each of the attributes and further reviewed individual tabulations and summaries to synch with those posted on the census site. 

Our travails started when we looked to condense the initial data tables to consider households, individuals, and income-generating adults, respectively. It took a bit of experimentation to find the right combinations of attributes and categories to restrict the data like we wished. Truth be told, I cheated a bit at the end, eliminating several hundred records of full-time employed adults with little or no income.

We committed more time than originally allocated to cross tabulate, summarize and validate separate variables we knew to be related. With over 100 attributes, many of which relate in theory, there was quite a bit of sanity checking –work though that had to be done to provide us comfort with the data.

Other challenges emerged from both the absolute numbers of attributes and the granularity of the important categorical variables. There are, for example, 21 separate values for race, and 17 categories for education – far more than are sensible for basic OLAP and statistical analysis. So for these and several other categorical attributes, we created new items collapsing levels of the old to create more compact, analytic-friendly dimensions. Of course, testing the new creations took time not allocated on the plan.

We also had to map the text “dimension” columns to their integer-valued counterparts. Education, for example, came in two flavors: an attribute of labels such as “5th or 6th grade” and “Some college but no degree” and one of corresponding arbitrary integers like 33 and 40. The challenge was to synch those representations so that the ordering of categories for OLAP and statistical analysis would be sensible. Left to their own devices, most analytics tools present categories in alphabetical order – almost certainly not what is desired.

The additional data work expanded our demo deployment by over 30%. A stats guy who routinely sets expectations that the data side comprises more than three quarters of BI effort, I felt a bit humbled by my oversight. Hopefully, this refresher learning will stay with me for a while this time!

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access