Steve would like to thank colleague John Bowman, Ph.D., for his contribution to this column.
The discipline of statistical data analysis is expanding, and that's great news for business intelligence (BI). Oh, the traditional stats built on probability theory we learned in college are stronger than ever, but now the focus of statistics has broadened to include the study of algorithms for data analysis.
The Open BI Forum has written on the work of John Tukey with exploratory data analysis (EDA), which, along with Monte Carlo, resampling and other computer experimentation methods, has spawned data analysis approaches free of the restrictive assumptions of our college courses and promoted new visualization techniques now widely used in BI. In addition, the disciplines of computer science, decision science, operations research and statistics have conspired to create the exciting new field of machine learning (ML) - algorithms which automate the discovery of patterns in databases. ML, in conjunction with traditional regression and multivariate statistical models, forms the foundation for data mining (DM). A DM capability is critical for businesses looking to differentiate through super crunching. Actually, DM is an unfortunate moniker - it's really knowledge mining, and the even more accurate terms knowledge discovery through ML and knowledge discovery in databases (KDD) are sometimes used instead - but DM is commonplace, so we'll stick with it.
There is no shortage of proprietary DM tools in today's marketplace. Statistical vendors SAS, SPSS, Insightful and STATISTICA have standalone offerings, as does mining-centric KDnuggets. Enterprise juggernauts Oracle, Microsoft and IBM market DM solutions. Comprehensive BI players like Business Objects and Cognos have DM components, while ERP vendors Oracle and SAP showcase analytic functionalities as part of their business suites.
To the further benefit of the BI marketplace, there are now competitive, freely available, open source (OS) DM solutions as well. The R Project for Statistical Computing, the lingua franca of academic statistical computing, supports more than a dozen DM packages written by its community. And the WEKA project is an open source initiative from the University of Waikato in New Zealand devoted to developing and applying ML technology to practical DM problems. More than 150 ML algorithms are currently implemented in WEKA.
A major challenge for DM vendors is integrating ML technologies with existing BI platforms to promote a seamless, cradle-to-grave intelligence solution fueled by high-end analytics. WEKA is seeking to address this obstacle by linking its initiative to the Pentaho BI project - a comprehensive open source BI platform. While WEKA will persist as a standalone open source project, its DM software is now part of the Pentaho commercial open source BI solution.
This Open BI Forum column is the first of a two-part interview series with Mark Hall, Ph.D., core member of the WEKA development team and now liaison between WEKA and Pentaho. Part 1 focuses on defining DM with ML and discusses the open source roots of the Waikato Environment for Knowledge Analysis (WEKA) project. Part 2 (January 2008) will focus on applications of DM technology for business and plans for integration of the WEKA suite with Pentaho.
Steve Miller: Tell us about the University of Waikato in New Zealand. Could you describe your academic DM background and current research focus at the university?
Mark Hall: The University of Waikato is home to about 14,000 students, both domestic and international. The campus is situated in the city of Hamilton, which is located in the heart of the Waikato district in the central north island of New Zealand. Waikato has an international reputation and is strong in computing, chemistry, biology and mathematics.
I completed a Ph.D. in ML at Waikato in 1999. Originally, I was interested in symbolic sequence modeling, with music the primary application. At the time, I was looking into ways of building and combining multiple, separate PPM compression models, each focusing on a different aspect of music - pitch, duration, contour and so forth. This led to feature and model selection. So, I came up with a simple method for selecting predictors (both features or models) based on correlations. The feature selection method turned out to be a good, general-purpose, preprocessing step for standard propositional ML algorithms. The focus of my thesis turned from music modeling/compression to ML. Current research of the ML group at the University of Waikato is centered on fundamental methods. A look at the publications pages on the groups Web site shows papers on tree learning, rules, Bayesian methods and ensemble learning.
SM: WEKA is a university research project with the goal of developing ML techniques that can be applied to practical DM problems. Could you give a brief history of WEKA? Who started the project? What were the goals in creating WEKA? How has the software evolved over time? Also, tell us a little about the role the University of Waikato has played in this project. What is your current role with WEKA?
MH: The project was started in 1992 with Professor Ian Witten. At that time, he applied to the New Zealand government for funding to build a state of the art facility for developing techniques of ML. The idea was to produce a freely available workbench for ML to develop new learning methods and to explore the application of ML in the agricultural industries in New Zealand. The goal of WEKA was to provide a framework that facilitated research in the field. While it was possible for a researcher to get software from other disciplines at the time, different programming languages and different data formats were the rule, making the task of comparing your own research to that of others more difficult and time consuming.
Between 1993 and 1996, the infrastructure (including the ARFF data format) and user interface for WEKA was developed. In late 1996, the first public version of WEKA (Version 2.1) was released. The software ran on UNIX systems and had a TCL/TK user interface. Most learning algorithms were written in C, although, there was some Prolog and Lisp code as well. The last release of WEKA 2 (Version 2.3) was made in mid 1998. By that time there was quite a wide selection of algorithms included, and there was a facility (based on UNIX makefiles) for configuring and running large-scale experiments involving many algorithms applied to many data sets. However, by this time, maintaining and updating the various libraries that WEKA depended on with UNIX/Linux was proving tiresome. It was also the case that, especially for those inexperienced with UNIX/Linux, installing WEKA was nontrivial. Early in 1997, the group made the decision to rewrite the entire system from scratch in Java. This was quite a bold move at the time and there was some criticism of this decision (mainly concerning the performance of Java compared to compiled languages). The 100 percent Java, WEKA 3 was released in mid-1999. Initially, there was a console-only version that accompanied the new DM book by Ian Witten and Eibe Frank. A development version with graphical user interfaces came a little later.
My current role with WEKA involves development and support, but I've switched from the academic world to the business world. I now work for the Pentaho corporation and focus on WEKA as a product and DM services based on WEKA. I continue to have strong ties with the group at Waikato and usually meet with them once a week.
SM: One somewhat inclusive definition of DM or KDD I like is: Statistics at scale, speed and simplicity. A second depiction of DM is the confluence of database technology, information science, statistics and ML for automated discovery in large databases. How would you define DM?
MH: I like the simple definition given in Ian and Eibes book: the extracting of implicit, previously unknown and potentially useful information from data.1 ML and statistics provide the technical basis for DM.
SM: Comment, if you can, on the current state of DM. For what sorts of problems has it proven particularly successful? Could you give examples you find interesting or instructive? In what areas does DM still have a long way to go? Are there new frontiers of DM development? How do you see DM evolving in the future?
MH: DM appears to have gained widespread acceptance and found successful applications in many walks of life. Many online retail stores employ some kind of product recommendation system driven by DM techniques. The U.S. government is aided in its search for terrorism by DM. The term even crops up frequently in popular media. For example, the security analysts on the TV show 24 regularly data mine their databases for information crucial to the threat at hand. DM is successfully applied in the areas of banking, direct marketing, CRM, and fraud detection. A recent online poll suggests there is growth in the application of DM in the travel/hospitality and entertainment industries.
SM: Consider the following knowledge discovery functionalities discussed by Han and Kamber in their text Data Mining: Concepts and Techniques:characterization, discrimination, association, prediction, classification, clustering, outlier andtime related.2 How would you rate their relative importance for DM? For which of these is WEKA particularly well-suited?
MH: I dont think it makes sense to talk about the relative importance of these different categories. If your application falls into the market basket analysis domain, then its quite likely that association learning is going to be far more important than classification to you. However, having said that, data preprocessing functionalities in the form of cleaning, attribute selection and outlier detection are normally important regardless of the final type of learning that is applied. WEKA has all eight categories mentioned above covered to a greater or lesser extent. In particular, WEKA is strong in the areas of supervised learning (both classification and regression). There are currently 118 classification/regression schemes implemented. Attribute selection and other preprocessing techniques are also well represented in WEKA.
SM: WEKA is an open source project. As such, all code is freely available for others to use and modify. What does the development organization for WEKA look like? Is there a core team from the University of Waikato? How do interested developers contribute code? Could I write a new DM algorithm for the project? If so, what quality control procedures are in place?
MH: There is a small team of core developers from the University of Waikato. The code for the first release of WEKA 3 was essentially written by three Ph.D. students/postdocs: Eibe Frank, Len Trigg and myself. Since then, students have come and gone and, in the course of their studies, contributed code to WEKA, but the core development team has remained small. Those interested in submitting code to WEKA should contact us directly. The basic process of, for example, getting a new classifier into WEKA, involves making sure that the code meets our standards and adheres to the conventions we have for writing a new learning scheme. We then check to see if there is a publication in a reasonable conference or journal to back up the new method. If so, we take a quick look at it (if we arent already familiar) to see whether there are good experimental results that demonstrate an advantage of using the new method over already established methods from the literature. If its easy to do in WEKA, we normally have a quick go at trying to replicate the experimental results as well.
SM: How large is the community for WEKA? Does the open source model make it easier to fix bugs, add new features and respond to feedback from users? How, if at all, has the open source model contributed to WEKA innovation and the project's success?
MH: The WEKA community is pretty healthy. More than 2,000 people subscribe to the WEKA mailing list, and we see around 1000 downloads, on average, from SourceForge each day. The open source model has certainly helped with finding bugs. In some cases, where people have sent us patches or located a problem in the source code, it has hastened the process of fixing them as well. The open source model has been critical to the success of WEKA and the project as a whole. The goal was to facilitate research in ML by providing an open and extensible platform. The uptake of WEKA both in academia and in the commercial world is proof that this was the correct route.
SM: The R Project is an open source platform for data analysis, programming, statistical modesl and graphics that is now lingua franca of academic statistical computing. Can you contrast traditional statistics with DM as we've defined it? There have been several successful overtures to integrate components of R and WEKA. What are your thoughts about an inclusive R/WEKA platform? Do you see such open source analytics as viable alternatives to proprietary alternatives like SAS and SPSS for business going forward?
MH: Gads, this is a tough question. Off the top of my head, Id say that, traditionally, statistics has been about testing hypothesis, while DM is about generating hypothesis. Its also probably true that there is little in ML or DM that hasnt already been considered in statistics. Now, before Im drawn and quartered by outraged data miners and statisticians alike, its worth stating that, in many cases, an area of research has benefited from people working in parallel on similar problems but in different fields. Where ML has made large contributions is in providing efficient algorithms and work on scalability. R and WEKA are two separate projects, each with their own loyal following. I think that there is a real risk of turning a lot of users off by trying to create some kind of large unified system. I definitely think that open source analytics such as R or WEKA are genuine alternatives for businesses to consider and compare to proprietary offerings. Open source software is easy to obtain and evaluate for very low cost and little risk.
SM: The WEKA platform that is freely downloadable is built on a Java class library and can be accessed currently through three interfaces: Java API, Windows/Unix command line and GUI. The Java API can make WEKA ubiquitous for analytics, facilitating the inclusion of WEKA models in new or evolving business applications. Are you seeing new projects/applications based on the WEKA core? The fact that WEKA is a Java application and runs in a virtual machine imposes restrictions on the size of data sets bounded by available memory. Commercial vendors like SAS that exploit virtual memory will certainly use that limitation in competitive situations. How can the WEKA community respond?
MH: A search for WEKA on SourceForge results in a list of 34 projects, the majority of which extend, enhance or utilize WEKA in some fashion. There are projects that provide add-ons for WEKA that enable it to be applied to special domains (e.g. BioWeka, AstroWeka and Weka-CG) or provide specialist learning algorithms or preprocessing (Proper, Wekatransform, Judge and Agent Academy). There are some projects that have looked into applying WEKA in parallel and there are distributed processing environments (Grid Weka, Weka-Parallel and weka4WS). As far as the memory issue goes, once you have a trained model, there is no problem in using it to predict (score) as much data as you like. Furthermore, I would say that 95 percent of the time-effective models are created from carefully cleaned and sub-sampled data sets anyway. There are computational considerations as well. Most powerful methods have a runtime complexity that limits the amount of training data that can be applied.
SM: WEKA has quite an extensive library of algorithms, spanning all the standard discovery models (regression both multiple and logistic, Bayesian, neural networks, support vector machines, clustering, classification and regression trees, multivariate statistical techniques, and more). Has the WEKA project promoted new algorithms or implementations? How are new algorithms incorporated into the code base? Do intellectual property laws restrict in any way the algorithms you use?
MH: WEKA has abstract classes and interfaces for classification/regression, clustering, preprocessing (filters), association learning and feature selection. New algorithms are easily incorporated by extending/implementing these classes and interfaces. Javas reflection mechanism is used to provide automatic generation of UI property sheets to allow graphical configuration of an algorithms parameters. The WEKA project hasnt actively promoted any particular algorithms. Yet, you could say there is an indirect effect from the groups interests and biases. For example, there is little in the way of evolutionary or fuzzy techniques in WEKA.
SM: So far, we've discussed mining of structured data, primarily numbers, dates and factors. On your Web site I noticed a link to the keyword extraction algorithm (KEA) project. KEA automates the process of assigning keywords to electronic documents. How does this project relate to WEKA? Is WEKA heading towards unstructured DM? Will WEKA provide text analytics capabilities in the foreseeable future?
MH: The KEA software is part of the New Zealand Digital Library (NZDL) project at Waikato. The NZDL project has subprojects on metadata extraction, one of which is KEA, and text mining (mainly based on compression techniques). The KEA software uses WEKA as a library and employs the naïve Bayes algorithm for learning. WEKA isn't heading towards unstructured DM, though there are facilities in WEKA to support text classification and clustering. For example, there are tools to convert sets of documents to ARFF, the common WEKA data file format, using the bag-of-words representation, along with implementations of several stemming algorithms.
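For readers unfamiliar with the ARFF format mentioned here and earlier in the interview, a minimal file declares a relation, its attributes (nominal or numeric) and the data rows. The relation, attribute names and values below are purely illustrative:

```
% Comments in ARFF begin with a percent sign
@relation weather

% Nominal attributes list their allowed values; numeric ones are typed
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,70,yes
```

By convention, the attribute to be predicted (here, play) is listed last, which is why WEKA tools commonly default the class index to the final column.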
1. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, 2005.
2. Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2006.