Poor Man's BI: Getting Started with Open Source Tools for Analytic Intelligence
This is an article from the June 2006 issue of DM Review's Extended Edition. Click on this link for more information on DMR Extended Edition or to download this issue in a PDF format.
In his award-winning book The World is Flat, Thomas L. Friedman argues convincingly that open sourcing is one of 10 significant "flatteners" changing the world. Though much of that disruption is just now starting to play out in the market, it's safe to say that the open source software movement is no longer at the fringes of the technology. The successes of Linux, Apache, JBoss, MySQL, Postgres, OpenOffice, Mozilla and Perl, to name just a few products, have changed the landscape of IT, probably forever. While open source communities are thriving, businesses building on the open source model are just starting to hit their stride, struggling with strategies of how to make money surrounding "free" products. The successes in open source are expanding its scope, particularly as Fortune 500 software consumers and producers commit. Most open source technology use to date has been in commoditized areas, especially those serving ubiquitous infrastructure needs. But this is starting to change, and open source commercial business intelligence (OSBI) platforms are emerging. Still, the serious OSBI products are just now coming to market, and the business models driving the product companies are works in progress. While these companies, platforms and business models evolve, there is no need for patience from companies seeking BI answers. "Poor man's BI," the integration of open source stalwarts Python, PostgreSQL, OpenOffice and R, can be used to provide a meaningful startup BI solution now.
Consider a minimal architecture for BI, an architecture that, while not complete, can certainly deliver valuable startup intelligence to an enterprise. Data integration consolidates data from disparate sources to a common warehouse environment through myriad processes, including ETL (extract, transform and load). The data warehouse is the repository of intelligence data, functioning as the foundation of all inquiry - serving the marts and "convenience stores" that source reporting and analytics. Query and reporting accesses the data warehouse and marts using a consistent interface that (hopefully) insulates users from much of the complexity of the database language. And analytics brings a more sophisticated level of statistical-like methods and procedures, including exploration, estimation, inference, predictive models, time-series and multivariate, to the solution of business problems.
I call it poor man's BI, but the combination of the open source language Python, database PostgreSQL, desktop OpenOffice and analytics package R can deliver value along significant points of the BI lifecycle today and can indeed function as a BI proof-of-concept platform while the market sorts out the disparate new players and models. Poor man's BI is capable for data integration and ETL (Python), data warehousing (PostgreSQL), query and reporting (OpenOffice) and analytics/graphics (R). One thing is certain: Python, PostgreSQL, OpenOffice and R are products with established open source communities that will not disappear.
The Python/PostgreSQL/OpenOffice/R BI platform presents a maturity that will be comforting to users of poor man's BI. Like Linux, each product has reaped the benefits of especially strong early design and deployment leadership, either from an individual, a commercial venture or academia. Each tool is now supported by an enthusiastic community of developers and users. Each product is both stable - having survived the rigors of years of development and testing - and evolving to add significant new capabilities. A strong indicator of this maturity is the ease of product installation, both on UNIX/Linux and Windows. Each product can now be readily installed from either binary or source code on Windows, Linux or UNIX. Compare that to the old days of source code, make files and ... prayer.
The cornerstone of poor man's BI is the power of Python, a language developed by Guido van Rossum and named after Monty Python's Flying Circus. Python is similar in functionality to Perl, but was designed as object oriented from the ground up. Like Perl and later Ruby, open source Python has grown to enjoy enormous success, with an estimated 1,000,000 users today. Python differs from C and Java languages, which take low-level instructions and require programmers to closely manage computer resources. For its role in BI, the three most important features of Python are its power and breadth of capabilities, its ease of use/productivity and its expandability/scale as an open source model. Some refer to Python (and Perl/Ruby) as a dynamic scripting language. Python can indeed produce very succinct and powerful scripts, and is dynamic in that variables are not declared and the traditional compile/link/execute cycle reduces to one-step compile/execute, but the moniker "agile" is probably more appropriate. Python is general purpose and readily addresses a variety of problems - text, file, systems, database, GUI and Web management - and makes programming simple and fun. Python is designed to handle very small tasks but can also be used for large-scale programming. To borrow a phrase, Python makes the routine simple and the difficult possible.
As important as the foundation capabilities and ease of use are to the Python community, it is the scale of the open source model that drives much of Python's success. Programmers worldwide can develop modules to share with the Python community, earning reputation and status for their good work. The volume of Python modules available for download is impressive; many have special applicability to BI. Some are products of research inquiry from staff at top universities and research centers. On more than one occasion, I have started down the path of new development only to find that I was reinventing code already published in open source. And, despite the cries of quality risk from some, I've found the modules I've downloaded generally to be of high quality. Democratization of the open source model, where the community can add to an established, high-quality foundation, is Python's biggest boon for poor man's BI.
Python is particularly adept at the text and data processing steps that are at the core of ETL. Rich data structures, such as lists, dictionaries and sets, facilitate easy-to-understand programmatic manipulation of data. Powerful regular expression and string-and-file handling classes simplify the routines of cleansing data. Table lookups using dictionaries are simple and quick. Functional programming constructs that supplant loops are often available and are continuously added to new releases. Rotating data from vertical to horizontal, a common and often thorny task, is straightforward with Python. Modules that allow connectivity to relational databases through established protocols are available for open source as well as proprietary databases. A powerful exception-handling capability assures that programming can be near bulletproof. Python, a C-based platform, has Jython, based on Java, as a first cousin. Jython programmers can intermix Python syntax and Java, with access to the rich Java foundation classes. Finally, Python's object orientation makes it straight forward to develop reusable code. Modules are developed and shared simply by placing them in the Python path.
Python's denotation as a scripting language does not come without merit. The functionality of 1980s and '90s shell scripts can now be embedded in a comprehensive programming environment with a consistent means for handling errors. Python can manage files on the network, fork new processes, execute programs in place, execute dynamic code in situ, FTP data from the Internet and process XML - in short, manage the data warehousing scripting process - with a simple and consistent language.
Python can also be used for the complicated query and reporting that befuddles SQL and SQL-based tools. Queries such as "list the top 20 baseball hitting streaks over the last 50 years" or "which portfolios never experience more than a 10 percent decline in daily value" are best served by the rich combination of SQL and procedural-based data structures in agile languages like Python. Indeed, supporting the measurement of portfolio performance, the financial services community is one of the most pervasive users of Perl/Python/Ruby.
PostgreSQL, the database of choice for the poor man's BI data warehouse, has its origins, appropriately enough, at the University of California, Berkeley, one of the main incubators for the open source movement. PostgreSQL enjoys the 30-year legacy of Ingres and Postgres, both developed as student projects by the top-rated computer science department at Berkeley under Michael Stonebraker. PostgreSQL is a full-featured relational database offering SQL99 compliance with object-relational extensions. PostgreSQL supports transaction processing, referential integrity and triggers as well as an embedded database language, PL/pgSQL, which is similar to Oracle's PL/SQL. Programmers can extend the PostgreSQL server by adding functions, domains, new data types and procedural languages. Arrays and composite types similar to records are particularly pertinent for BI. PostgreSQL supports access to a wealth of APIs, including C, C++, JDBC, Perl, Python, Ruby, PHP, Tcl/Tk, DBI and ODBC. Performance in PostgreSQL is tunable, the server supporting several index types, tablespaces, storage options, partitioning and trace/explain features. At least one open source vendor is distributing a variant of PostgreSQL designed for the needs of large-scale BI, and another has extended PostgreSQL to accept Oracle-programmable server syntax.
OpenOffice is an open source version of StarOffice from Sun Microsystems that offers a suite comparable to Microsoft Office, and runs on Windows, Linux and UNIX. OpenOffice modules include Writer (word processing), Impress (presentations), Calc (spreadsheet), Base (database), Draw and Math. Like Microsoft Access, Base can be used as both a database for small applications and as a front-end query, reporting and forms-based tool for accessing RDBMSs through ODBC and JDBC. Though not as sophisticated as the commercially sold query and reporting (Q&R) tools that insulate users from the complexities of the database through a semantic layer, Base nonetheless offers powerful GUI-based Q&R that guides users through query construction. The results of database queries can then be seamlessly shared with Calc for subsequent what-if analysis. Base forms can be used to develop simple front ends tied to database tables with little effort. Indeed, OpenOffice database connectivity is all about getting a fair amount with little effort - a prerequisite for a poor man's tool.
R is an integrated, interactive programming environment that consists of an object-oriented language with rich vector and array operators, a comprehensive set of statistical and data analyses functions/procedures, a data management facility and a robust collection of graphics. Originally written by staff of the University of Auckland in New Zealand, R shares heritage with the S language developed at Bell Labs and commercialized by Insightful Corporation. A core group of developers is now chartered with maintaining R and releasing new versions several times a year. A worldwide community of statisticians and data analysts extends R by writing packages showcasing the latest analytics and making them accessible to the user community. The newest R statistical procedures are often posted by the very people who developed the theory. R special interest groups (SIGS) work to extend the product in meaningful ways, providing support via email communities. The SIGs help users in distress solve problems as disparate as Windows Registry anomalies and interpretation of residuals from GARCH time-series models. I once submitted a question to a list, went to lunch and had a problem-solving response from the Netherlands when I returned an hour later. If only for-pay software support were so responsive!
R's role in BI is one of analytics, visualization and predictive modeling. An R session might deploy a series of scripts designed for the specific tasks surrounding analysis, the details of which can be hidden from users through functions and packages. The results of each step are readily stored and shared for subsequent use. A script would first grab data from the data warehouse using an ODBC or DBI interface, storing the retrievals in a series of R structures such as data frames or matrices. Then, the script(s) could reshape and otherwise "munge" the data to meet the data-passing needs of the functions or procedures it wishes to use. R's ability to restructure data is a design strength. It is very easy, for example, to rotate vertical storage to horizontal and vice versa. R's vector and array operators make looping an exception rather than a rule. There are powerful "by group" functions available for computing statistics on each combination of a set of dimension variables and storing the results for subsequent use. Once the data is shaped properly, it can be submitted to a seemingly limitless collection of procedures and predictive models. The results of those models are stored as objects and available for subsequent graphical analysis deploying R's rich and industry-pacing innovations, such as Trellis. The graphs can then be exported to a variety of formats, including PDF and JPEG for email and network posting, making R an ideal tool for driving the poor man's scorecard.
We have discussed the Python/PostgreSQL/ OpenOffice/R platform as a series of powerful, individual components, each of which plays a significant role in poor man's BI. Python, the scripting language, can manage the data and system administration tasks, kicking off and conditionally executing jobs, assuring that tasks are completed as scheduled. Python, the database language, can access PostgreSQL efficiently through DBI and ODBC that, in tandem with Python's file, string, XML and URL modules, provide a foundation for ETL. OpenOffice provides the window to the database. R, the data analysis language, is used to transform data to intelligence. There is, however, additional value when all components are integrated as one. And Python, the agile language, functions as the glue to unite all tools. The Python/R community has developed RPy, a module that allows access to R functions and data structures from within Python programs. With RPy, all of the capabilities of R - functions, models, graphics, etc. - are available to Python data structures that can, in turn, write to R. This allows Python to coordinate the administrative, database, analytical and Web-delivered aspects of BI, fundamentally enabling complete BI solutions in Python.
Poor Man's BI in Practice
One illustration of the power of poor man's BI is in support of a startup portfolio management company that deploys open source technologies to measure the performance of its products against benchmark indexes. The company uses Python to collect daily stock and index values from Internet sources, consolidating the data to a mart housed in PostgreSQL. OpenOffice Base and Calc are the primary windows into native PostgreSQL, used to validate data and provide access through spreadsheets. Additional scripts written in Python and R push/pull PostgreSQL data into R, where R's array, statistics and graphics are programmed to produce the performance structures. Larger Python/R collaborations sending reports to customers are in the works.
Poor man's BI has proven a very effective proof-of-concept architecture, scaling from small to midsized data volumes as the company experiments with solutions. In time, they will probably migrate to one of the evolving OSBI platforms, using the knowledge gained from poor man's BI to build efficient intelligence.
Open source is now established as an IT force. The democratization of software development/support has proven successful and will continue to expand. Open source for BI is on the horizon and promises comprehensive platform solutions in the very near future. Until that time, poor man's BI can provide more than adequate ETL, data management, Q&R and analytics startup support for business.
The community Web site, http://www.python.org/, is the place to start for information, documentation and downloads of Python. Python is also distributed on most Linux flavors and is supported by a for-profit company, ActiveState, www.activestate.com, that markets development tools for Perl and Python. Like Python, PostgreSQL originates from a community Web site,http://www.postgresql.com/, that is a great point of departure for information and software downloads. Many versions of Linux also distribute PostgreSQL. There are a growing number of commercial PostgreSQL vendors whose business revolves on products and support surrounding different open source flavors of PostgreSQL. Greenplum (http://www.greenplum.com/) supports Bizgres for large-scale BI with commercial products and services. ExtenDB (http://www.extendb.com/) promotes a clustered database environment built on PostgreSQL and markets
both open source and commercial versions of its products in addition to support. EnterpriseDB (http://www.enterprisedb.com/) markets a version of PostgreSQL layered to support Oracle syntax, generating revenue through support and services. Pervasive Postgres, http://www.pervasive-postgres.com/, touts 20 years of experience with PostgreSQL and markets add-on products, services and support. OpenOffice, the desktop competitor to Microsoft Office, is available from http://www.openoffice.org/. The OpenOffice site offers a wealth of information on the OpenOffice community. The R Project for Statistical Computing, http://www.r-project.org/, is the community site for R, supported by the Comprehensive R Archive Network (CRAN), http://www.cran.r-project.org/, for downloads and documentation. Many analytics and financial engineering companies use R as an enabler for services to customers.