This article looks at the technical issues involved in integrating data mining into warehousing environments. This is, of course, only part of the larger challenge of integrating data mining into an organization. We show why a database integration approach is a feasible alternative to current developments in application integration, and we present recent developments in the field of data mining that solve many of the current integration issues. These developments allow data mining functionality to be integrated into database engines by storing the models themselves in the database. Apart from reducing the complexity of online use of models, this approach also opens the way for fully automated model updates and maintenance of the analysis environment.
The Data Environment
Figure 1: Data Environment
Figure 1 shows a typical data environment. The data warehouse contains data from several source systems to provide an integrated view of several subject areas (e.g., customer or product). Vertical applications directly support activities such as campaign management, using any available data and directly interacting with the CRM environment. This is the domain of business users. The analysis environment uses the same data. This is where data analysts try to model the available data and enrich it with new insights for use in vertical applications.
The most important technical aspect of data mining integration is the consistent operational application of models created in an analytical environment. In typical client/server IT architectures, data mining functionality has been positioned anywhere from server to client. While the server has always had the necessary processing power and detail data available, development initially focused mainly on application aspects at the client. This resulted in many desktop query and analysis tools incorporating some form of data mining. However, models developed locally are often hard to deploy to other applications, quite apart from questions about their reliability, because they depend on locally available data and processing power.
To find a solution to this deployment problem, tool vendors started implementing functionality to export models as C code or in proprietary macro languages. This may be part of the solution, but model maintenance across several platforms is far from trivial if the only way to apply a model from an application is to recompile a function module. Another solution to the deployment problem is to create mining models in the application environment itself, avoiding the need for deployment. Data mining functionality is then incorporated in vertical applications.
Functions and Skills
The problem with this approach is that the responsibility for model building shifts from data analysts to decision-makers, while the specific skills needed to build reliable models are usually not part of their expertise. A big technical challenge for data mining is data quality and data preparation. These aspects often have such deep technical implications that decision-makers typically should not handle them. These problems in the data are the domain of data analysts and, as such, they should also be responsible for building the various models. Building relevant models, however, evidently requires the business expertise of the decision-maker.
Decision-makers would be able to build models if all data preparation steps could be automated. This, however, is:
- Hardly possible due to the complexity and diversity of data preparation,
- Not desirable, because the data preparation stage itself usually reveals many aspects of the data, your organization and its subjects.
The one step that could easily be automated is updating a model to reflect updates in the data. However, if you wanted to automate this step, it would be trivial to automate the whole process simply by scheduling the update or basing it on triggers. Manual intervention by decision-makers would then not be necessary.
A functional separation between environments should not lead to a physical separation. A difference in available data between the analysis environment and the application environment may cause a mapping problem. The mapping problem shows up when a model is based on data or functionality that is not available in the application environment, so you have to find a way to map or impute the model's inputs from the data that is available.
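For example, a missing model input can sometimes be imputed at scoring time from data that the application environment does have. The sketch below is purely illustrative; the customers and segment_averages tables and all column names are hypothetical.

    -- Sketch: preparing model input when a field the model expects
    -- (household income) is not directly available, by imputing it
    -- from a segment average. All names are hypothetical.
    CREATE VIEW scoring_input AS
    SELECT c.customer_id,
           c.age,
           COALESCE(c.income, s.avg_income) AS income
    FROM   customers c
           LEFT JOIN segment_averages s ON s.segment = c.segment;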
Integration of basic data mining functionality into vertical applications does not solve current issues with deployment:
- Vertical applications often do not have the amount of data available that is needed to build reliable mining models.
- They do not support the required functionality.
- They often run on platforms such as desktop PCs that lack the necessary resources.
At first sight, the deployment problem seems to be an application integration problem. If a vertical application were able to call an analysis application to apply a model, we would not have to deploy models; we could simply call on them. Several vendors are currently spending considerable effort trying to integrate analysis and application environments in this way.
Models and Results
Let us take a step back and look again at what the problem we are trying to solve actually entails. The decision-maker is heavily involved during the building phase of the model, but once the model is being applied, its internals are of no further interest. The decision-maker is interested in results, not in models. Sophisticated business applications use relational databases as their central data repository or reference point. As such, it would be ideal to use the existing mechanism for accessing this data (SQL) to apply mining models.
This means that the model should be located close to the data instead of inside either the analysis environment or the application environment. You should be able to "start" the model from within your application and get the results back in that same application.
Getting results back from a database is trivial for an application with a database connection. Database results are the outcome of a database query, so what we actually need is a way to "hide" a model in a query. The available mechanism for hiding functionality in a query is a database function, so the model could become a database function. A disadvantage of this approach is that the deployment problem crops up again, this time as maintenance of the database function itself. The most flexible solution would be a generic database function that takes the model to be applied as a parameter. The only question left now is where to store the models.
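As a sketch of what such a generic function could look like in use (the function name apply_model, its signature and all column names are hypothetical, not taken from any specific product):

    -- Sketch: a generic scoring function that takes the model to apply
    -- as a parameter and returns a score per row.
    SELECT c.customer_id,
           apply_model('churn_tree_v2', c.age, c.tenure, c.revenue) AS churn_score
    FROM   customers c;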
To stay clear of the deployment problem, the analysis environment should be able to store models where a database function can easily access them. Such a generic way of storing models would also enable applications to transparently access the results of applying them. Recent developments have resulted in a generic language for describing data mining models in a format that can be stored as records in a database table. We can now create a table holding the different models in such a way that applying a model to the data becomes a database join between a table containing models and various tables containing data. When model application becomes a database join, it can be hidden by a database view. A SQL query containing a model name and a data source will simply produce records with the model's results. In other words, every application that can issue SQL is able to transparently use dependable data mining models.
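A minimal sketch of this design could look as follows; all table, column and function names are illustrative, and apply_model here takes the stored model document itself rather than a model name:

    -- Sketch: models stored as ordinary rows; applying a model becomes
    -- a join, hidden behind a view.
    CREATE TABLE mining_models (
        model_name  VARCHAR(64) NOT NULL PRIMARY KEY,
        task        VARCHAR(32) NOT NULL,  -- e.g., 'churn'
        created_on  DATE        NOT NULL,
        model_pmml  CLOB        NOT NULL   -- the model document
    );

    CREATE VIEW churn_scores AS
    SELECT c.customer_id,
           apply_model(m.model_pmml, c.age, c.tenure, c.revenue) AS churn_score
    FROM   customers c,
           mining_models m
    WHERE  m.model_name = 'churn_tree_v2';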
With this solution, data mining also becomes an integral part of all the capabilities we recognize in current relational database systems. Examples of this are:
- Backup and recovery of data mining models along with data;
- Abstraction through database views (e.g., automatic selection of the most recent or best model; see the sketch after this list);
- Security, scalability, parallel processing, hardware clustering, etc.
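The second point can be made concrete with a view that always scores against the most recently created model for a given task, again using the hypothetical names from the earlier sketch:

    -- Sketch: applications query this view and automatically get
    -- scores from the latest churn model, whatever its name.
    CREATE VIEW current_churn_scores AS
    SELECT c.customer_id,
           apply_model(m.model_pmml, c.age, c.tenure, c.revenue) AS churn_score
    FROM   customers c,
           mining_models m
    WHERE  m.task = 'churn'
      AND  m.created_on = (SELECT MAX(created_on)
                           FROM   mining_models
                           WHERE  task = 'churn');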
With this approach, technical model deployment issues are reduced to simple inserts, updates or deletes on the database tables containing the models.
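For example, rolling out a new model version could be as simple as the following, where :pmml_document is a host variable holding the model text exported by the analysis tool (names are again illustrative):

    -- Sketch: deploying a new model is an ordinary insert;
    -- retiring one is a delete.
    INSERT INTO mining_models (model_name, task, created_on, model_pmml)
    VALUES ('churn_tree_v3', 'churn', CURRENT_DATE, :pmml_document);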
An independent association of data mining vendors (the Data Mining Group, DMG, http://www.dmg.org) has specified an open XML-based standard for describing data mining models: the Predictive Model Markup Language (PMML). In addition, generic application functions have been defined for several database management systems as extensions of SQL.
A typical example of such a design is shown in Figure 2.
Figure 2: Model Building and Application Environments
This design naturally supports functional separation of analysis and application environments while maintaining logical integrity. Models can be created in several analytical applications, based on the same data that is available to the application environment. These models are stored as PMML in one or more database tables. Any application that accesses the data can call a single database function to apply any chosen model to it.
We have seen that current data mining architectures enable both dedicated analysis environments and transparent application environments. At the same time, data warehousing is moving from batch updates to online data processing. What we need is not just an easy way of making models available, but also an easy way of updating models when new data arrives. The hard work is mainly in the initial creation of a model, so automated periodic updates can already be provided without analyst intervention. The current challenge lies in finding ways to automatically detect the need to update models, or to create new ones, based on the availability of new data.
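As a sketch of what such automation could look like, a trigger on a load log table could refresh a model after every completed warehouse load. Trigger syntax varies per database, and refresh_model is a hypothetical stored procedure that rebuilds the model from its source table and inserts the new version into the models table:

    -- Sketch: re-estimate the churn model after each completed load.
    -- Syntax varies per DBMS; refresh_model is hypothetical.
    CREATE TRIGGER refresh_churn_model
    AFTER INSERT ON load_completed_log
    FOR EACH STATEMENT
    CALL refresh_model('churn', 'customer_facts');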
Rudi van Lent works for Trillium Software as a channel sales engineer for EMEA. When this article was written, van Lent was an IT specialist at IBM Global Services. He has been working in the area of business intelligence for the last three years and specializes in data mining, data cleansing and visualization. He can be reached at firstname.lastname@example.org.
Damiaan Zwietering is a senior IT specialist at IBM Global Services in The Netherlands. Since 1996, Zwietering has worked on customer projects in the business intelligence field. His activities consist of services presales, technical consulting, architecture design, modeling and implementation of data warehouse projects. He has specialized in the analytical side of data warehousing and CRM, applying data mining to gain insight into segmentation, target marketing and fraud or risk analysis for various industries. He can be reached at email@example.com.