
Whence WEKA - Open Source Data Mining, Part 1

Open BI Forum

Information Management Online, November 22, 2007

Steve Miller

Steve would like to thank colleague John Bowman, Ph.D., for his contribution to this column.

 

The discipline of statistical data analysis is expanding, and that's great news for business intelligence (BI). Oh, the traditional stats built on probability theory we learned in college are stronger than ever, but now the focus of statistics has broadened to include the study of algorithms for data analysis.

The Open BI Forum has written on John Tukey's work with exploratory data analysis (EDA), which, along with Monte Carlo, resampling and other computer experimentation methods, has spawned data analysis approaches free of the restrictive assumptions of our college courses and promoted new visualization techniques now widely used in BI. In addition, the disciplines of computer science, decision science, operations research and statistics have conspired to create the exciting new field of machine learning (ML) - algorithms that automate the discovery of patterns in databases. ML, in conjunction with traditional regression and multivariate statistical models, is the foundation for data mining (DM). A DM capability is critical for businesses looking to differentiate through super crunching. Actually, DM is an unfortunate moniker. The activity is really knowledge mining, and the more accurate terms knowledge discovery through ML and knowledge discovery in databases (KDD) are sometimes used instead. But DM is commonplace, so we'll stick with it.

 

There is no shortage of proprietary DM tools in today's marketplace. Statistical vendors SAS, SPSS, Insightful and STATISTICA have standalone offerings, as does mining-centric KDnuggets. Enterprise juggernauts Oracle, Microsoft and IBM market DM solutions. Comprehensive BI players like Business Objects and Cognos have DM components, while ERP vendors Oracle and SAP showcase analytic functionalities as part of their business suites.

 

To the further benefit of the BI marketplace, there are now competitive, freely available, open source (OS) DM solutions as well. The R Project for Statistical Computing, the lingua franca of academic statistical computing, supports more than a dozen DM packages written by its community. And the WEKA project is an open source initiative from the University of Waikato in New Zealand devoted to developing and applying ML technology to practical DM problems. More than 150 ML algorithms are currently implemented in WEKA.

 

A major challenge for DM vendors is integrating ML technologies with existing BI platforms to promote a seamless, cradle-to-grave intelligence solution fueled by high-end analytics. WEKA is seeking to overcome this obstacle by linking its initiative to the Pentaho BI project - a comprehensive open source BI platform. While WEKA will persist as an open source project on its own, its DM software is now part of Pentaho's commercial open source BI solution.

 

This Open BI Forum column is the first of a two-part interview series with Mark Hall, Ph.D., core member of the WEKA development team and now liaison between WEKA and Pentaho. Part 1 focuses on defining DM with ML and discusses the open source roots of the Waikato Environment for Knowledge Analysis (WEKA) project. Part 2 (January 2008) will revolve around applications of DM technology for business and plans for integrating the WEKA suite with Pentaho.

 

 

 

Steve Miller: Tell us about the University of Waikato in New Zealand. Could you describe your academic “DM” background and current research focus at the university?

 

Mark Hall: The University of Waikato is home to about 14,000 students, both domestic and international. The campus is situated in the city of Hamilton, which is located in the heart of the Waikato district in the central North Island of New Zealand. Waikato has an international reputation and is strong in computing, chemistry, biology and mathematics.

 

I completed a Ph.D. in ML at Waikato in 1999. Originally, I was interested in symbolic sequence modeling, with music the primary application. At the time, I was looking into ways of building and combining multiple, separate PPM compression models, each focusing on a different aspect of music - pitch, duration, contour and so forth. This led to feature and model selection. So, I came up with a simple method for selecting predictors (either features or models) based on correlations. The feature selection method turned out to be a good, general-purpose preprocessing step for standard propositional ML algorithms. The focus of my thesis turned from music modeling/compression to ML. Current research of the ML group at the University of Waikato is centered on fundamental methods. A look at the publications pages on the group’s Web site shows papers on tree learning, rules, Bayesian methods and ensemble learning.
