NOV 22, 2007 2:02am ET

Whence WEKA - Open Source Data Mining, Part 1


Steve would like to thank colleague John Bowman, Ph.D., for his contribution to this column.


The discipline of statistical data analysis is expanding, and that's great news for business intelligence (BI). Oh, the traditional stats built on probability theory we learned in college are stronger than ever, but now the focus of statistics has broadened to include the study of algorithms for data analysis.


The Open BI Forum has written on the work of John Tukey with exploratory data analysis (EDA), which, along with Monte Carlo, resampling and other computer experimentation methods, has spawned data analysis approaches free of the restrictive assumptions of our college courses and promoted new visualization techniques now widely used in BI. In addition, the disciplines of computer science, decision science, operations research and statistics have conspired to create the exciting new field of machine learning (ML) - algorithms that automate the discovery of patterns in databases. ML, in conjunction with traditional regression and multivariate statistical models, is the foundation for data mining (DM). A DM capability is critical for businesses looking to differentiate through super crunching. Actually, DM is an unfortunate moniker - it's really knowledge mining, and the even more accurate terms knowledge discovery through ML and knowledge discovery in databases (KDD) are sometimes used instead - but DM is commonplace, so we'll stick with it.


There is no shortage of proprietary DM tools in today's marketplace. Statistical vendors SAS, SPSS, Insightful and STATISTICA have standalone offerings, as does mining-centric KDnuggets. Enterprise juggernauts Oracle, Microsoft and IBM market DM solutions. Comprehensive BI players like Business Objects and Cognos have DM components, while ERP vendors Oracle and SAP showcase analytic functionalities as part of their business suites.


To the further benefit of the BI marketplace, there are now competitive, freely available, open source (OS) DM solutions as well. The R Project for Statistical Computing, the lingua franca of academic statistical computing, supports more than a dozen DM packages written by its community. And the WEKA project is an open source initiative from the University of Waikato in New Zealand devoted to developing and applying ML technology to practical DM problems. More than 150 ML algorithms are currently implemented in WEKA.


A major challenge for DM vendors is integrating ML technologies with existing BI platforms to promote a seamless, cradle-to-grave intelligence solution fueled by high-end analytics. WEKA is seeking to overcome this obstacle by linking its initiative to the Pentaho BI project - a comprehensive open source BI platform. While WEKA will persist as an open source project on its own, its DM software is now part of Pentaho's commercial open source BI (OSBI) solution.


This Open BI Forum column is the first of a two-part interview series with Mark Hall, Ph.D., core member of the WEKA development team and now liaison between WEKA and Pentaho. Part 1 focuses on defining DM with ML and discusses the open source roots of the Waikato Environment for Knowledge Analysis (WEKA) project. Part 2 (January 2008) will revolve around applications of DM technology for business and plans for integration of the WEKA suite with Pentaho.




Steve Miller: Tell us about the University of Waikato in New Zealand. Could you describe your academic “DM” background and current research focus at the university?


Mark Hall: The University of Waikato is home to about 14,000 students, both domestic and international. The campus is situated in the city of Hamilton, which is located in the heart of the Waikato district in the central north island of New Zealand. Waikato has an international reputation and is strong in computing, chemistry, biology and mathematics.


I completed a Ph.D. in ML at Waikato in 1999. Originally, I was interested in symbolic sequence modeling, with music the primary application. At the time, I was looking into ways of building and combining multiple, separate PPM compression models, each focusing on a different aspect of music - pitch, duration, contour and so forth. This led to feature and model selection. So, I came up with a simple method for selecting predictors (either features or models) based on correlations. The feature selection method turned out to be a good, general-purpose preprocessing step for standard propositional ML algorithms. The focus of my thesis turned from music modeling/compression to ML. Current research of the ML group at the University of Waikato is centered on fundamental methods. A look at the publications pages on the group’s Web site shows papers on tree learning, rules, Bayesian methods and ensemble learning.
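Editor's illustration: the interview does not spell out how the correlation-based selection works, so the following is only a minimal sketch of one common formulation, not Dr. Hall's actual implementation. It assumes Pearson correlation as the measure, a merit score of k*r_cf / sqrt(k + k(k-1)*r_ff) (where r_cf is the mean predictor-target correlation and r_ff the mean predictor-predictor correlation), and a greedy forward search. The function names and the toy data (pitch, duration, pitch_noisy) are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def merit(subset, features, target):
    """Reward correlation with the target, penalize redundancy in the subset."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def select_features(features, target):
    """Greedy forward search: keep adding the feature that most improves merit."""
    selected, best = [], 0.0
    remaining = list(features)
    while remaining:
        score, name = max((merit(selected + [f], features, target), f) for f in remaining)
        if score <= best:  # no candidate improves the subset, so stop
            break
        selected.append(name)
        remaining.remove(name)
        best = score
    return selected


# Hypothetical toy data: the target depends on two uncorrelated predictors,
# and a third predictor is a near-duplicate of the first.
features = {
    "pitch": [-1, -1, 1, 1],
    "duration": [-1, 1, -1, 1],
    "pitch_noisy": [-1, -2, 1, 1],
}
target = [-3, -1, 1, 3]  # equals 2 * pitch + duration

selected = select_features(features, target)
print(selected)
```

On this toy data the search keeps pitch and duration and drops pitch_noisy: the near-duplicate correlates well with the target, but it correlates even more strongly with pitch, so adding it lowers the merit of the subset.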


SM: WEKA is a university research project with the goal of developing ML techniques that can be applied to practical DM problems. Could you give a brief history of WEKA? Who started the project? What were the goals in creating WEKA? How has the software evolved over time? Also, tell us a little about the role the University of Waikato has played in this project. What is your current role with WEKA?


MH: The project was started in 1992 with Professor Ian Witten. At that time, he applied to the New Zealand government for funding to build a state-of-the-art facility for developing techniques of ML. The idea was to produce a freely available workbench for ML to develop new learning methods and to explore the application of ML in the agricultural industries in New Zealand. The goal of WEKA was to provide a framework that facilitated research in the field. While it was possible for a researcher to get software from other disciplines at the time, different programming languages and different data formats were the rule, making the task of comparing your own research to that of others more difficult and time-consuming.

