As the use of the Internet as a channel and e-business models have matured, a key trend is the merging of e-business initiatives with the rest of a company's operations. Carrying out these efforts requires increased visibility into all aspects of the business, both on the Web and off. This drives the need for an integrated view of all of the business data in the data warehouse.

Many enterprises already have some form of data warehouse to support business intelligence for the non-Web aspects of the business. Many of these companies have also installed some Web analytics tools, although they are typically separate from the rest of the data warehousing infrastructure. Examples of this include standalone packages such as WebTrends or outsourced Web analytics platforms that are incapable of bringing together information from both the Web site and back-end transactional systems, particularly when the data is stored in a data warehouse. The canned reports generated by these tools usually revolve around statistics that can be gleaned from site traffic, but are not structured to answer key business questions or provide a flexible analytical environment.

This has led to a trend to incorporate Web-based data into preexisting data warehouses, or to build new clickstream data warehouses that are capable of showing both e-business activities and the non-Web aspects of the business in an integrated fashion. My coauthors and I recently published a book on the topic, Clickstream Data Warehousing (Wiley, 2002), and this article highlights some of the major issues and solutions.

Integrating Clickstream and Business Transaction System Data

One of the big problems in merging Web and non-Web data into a data warehouse is the lack of data integration between the Web site and what is stored in the data warehouse from other online transaction processing (OLTP) systems. Web applications are normally designed and developed by separate groups within IT, and the technologies and design approaches for these applications differ radically from the traditional OLTP applications.

It isn't only the IT departments that are divided. The divide within IT often reflects the separation of the Web channel from the rest of the business operations ­ IT simply mirrors this structure. As a result, an integrated data warehouse must typically overcome organizational issues in addition to the technical challenges.

The root cause of many data integration problems between Web and other applications is due to these organizational conflicts. The boundaries are embodied in the systems that support the different aspects of the business.

Fortunately, most businesses realize that they must better align the business units and, therefore, the supporting technology. In order to address the core data integration problems, the data warehouse integration team must deal with users and IT groups across those organizational boundaries.

Data Integration Issues

The primary problems faced when integrating e-business data with the rest of the data warehouse are usually related to either data quality, missing data or an inability to link data between the various systems.

Missing data is usually the biggest issue. Because building basic Web site functionality has been the top priority in most organizations, little effort has been expended to make the Web applications log the data required for reporting and analysis needs.

For example, many Web sites maintain their own user IDs. Unfortunately, these IDs are not always retained on the back-end processing or fulfillment systems, which generally have their own customer IDs. This means that repeat visits by the same user on the Web site are not easily mapped to the customer that is stored in the core business applications.

A related problem is that most Web applications capture only the minimum data required to feed a transaction to the underlying business transaction systems. This means that data required by business users to view the e-business channel is not readily available from the business transaction systems that are feeding the data warehouse, and because of the independent applications, the data can't be tied together easily.

My experience with clickstream data is that most Web data problems stem from this lack of data rather than data quality issues. Because of these data integration problems, integrating clickstream data with other business data may be more difficult than anticipated and involve a significant amount of Web application rework.

Addressing the Problems with Web-Based Data

Most data warehouse projects require modifications to some source systems in order to provide complete data for analysis. This holds true for Web applications as well. Furthermore, it may not be possible to handle all clickstream data problems within the data warehouse. The only realistic solution may be to change the Web applications so they collect the data required for analysis. This echoes a standard mantra in data warehousing, "Fix the source and not the data."

The problems that are most frequently encountered involve user identity and tracking, and linking the clickstream data (pages viewed, sessions, etc.) to the transaction data stored in the underlying OLTP systems.

Disparate Web site user identities and business transaction-system customer IDs create a host of problems. While the Web site may know who the user is, this data is not typically sent to the business transaction system. Instead, the data for the transaction is sent to the OLTP system, and it determines the customer ID based on the entered data. A related problem is inconsistent tracking of user activities on the Web site which makes associating these activities with a given transaction difficult.

To associate clickstream data with the corresponding back-end transactions, user identity and clickstream components from the Web must be linked to transactions associated with this data. This is accomplished using a technique called page tagging.

Adding Page Tagging to the Web Site

The goal of page tagging is to ensure that every Web page can be categorized by function and content. This allows analysts to do much more detailed work than would otherwise be possible. The difficulty is that, in most cases, the URL can't be relied on to provide all the information desired by analysts, particularly if pages are dynamically generated.

The key element in page tagging is content coding. Content coding is driven by the end users' analysis needs. Therefore, it is important that analytical users define the attributes of transactions that they are interested in tracking for each page, or at least sign-off on the attribute list created by the data warehouse team. It is not uncommon to require additional tags for specific transactions so that the Web site records can be more easily linked to the data in the OLTP system.

If the pages are dynamic, then the Web application that generated the pages must implement a page-tagging mechanism. Regardless of how the page is created, there must be a way to uniquely identify each page on the Web site and store that information inside the page. The most popular Web application packages, such as BEA WebLogic, IBM WebSphere and ATG Dynamo, do not do this automatically, and providing page tags can require a significant amount of custom coding.

Even if pages are tagged by the Web application or content management system, the information may not be easily accessible to the data warehouse. For example, in an internally coded page-tagging environment, the page attributes are placed inside the page HTML in the form of meta tags or formatted comments. This means the data warehouse must parse the page's HTML in order to access the data, and then it must link this with log data to tie the page information to a specific user.

Cookies may seem like the solution for capturing user or transaction-related data, but this can be deceiving. Cookies or parameters in query strings are often encrypted, and logged clickstream data may have to be decrypted to make it usable.

In a dynamic page generation environment, the Web application that generates pages can use codes in the query string that are meaningful only within that application. For example, the following URL contains encoded parameters that are meaningless unless you know how the Web application interprets them: http://www.efilms.com/movie/new_releases/0,4277,,00.html? id=52683. The page name and parameter might be meaningful to a Web developer familiar with the ATG Dynamo application server used at this site, but it is probably meaningless to a data warehouse developer or user. This is the URL that would show up in the Web server log, and data warehouse load programs need to decode these parameters into a user legible format.

In general, the data warehouse designer must work with the Web site developers to determine how to process the data. This will require Web site modifications and the active participation of Web developers during the data warehouse design and implementation process.

The Gap in Recorded User Data

The Web site should track all visitors, and it should track them consistently. Many Web sites track customers who have registered or otherwise logged in, but do not track all visitors to the site, leaving a large gap in the data collected. Most important is the ability to track a given user across multiple visits. The Web site must generate a unique identifier for every user and track the user with that identifier.

Uniquely identifying customers or external users is a difficult task for many organizations. Multiple independent systems contain customer data, and the norm is for each system to utilize its own method for keying customer data. This becomes apparent whenever someone asks for aggregated information about customers, such as "What percentage of revenue is generated by the top 10 customers?" Trying to answer this simple question is confounded by each system's unique customer identifiers.

This problem is magnified with a Web site because many Web sites do not tightly manage user IDs. We've encountered numerous companies with loose user ID management, making user tracking over time difficult, if not impossible.

The resolution to these problems is twofold. First, a single Web application must be responsible for generating unique user IDs. The Web application must maintain a mapping of user IDs to customer identifiers in the underlying business transaction systems, or it must send its user ID along with the other transaction data so the OLTP system has a record of the data. This addresses the worst case scenario where the only way to match a Web user to a customer in the data warehouse is by comparing profile information stored in both places.

Extending to the Data Warehouse

Assuming you have an existing data warehouse, it should contain business transaction information at an atomic level, but it is almost certainly missing the dimensions needed to analyze the data associated with the Web channel.

For example, the Web is a distinct channel and not something that can be easily fit into a preexisting channel dimension. Stuffing data about the Web channel into the current sales geography dimension, for example, only provides a minimal level of analytical capability and violates many of the principles of an appropriate dimensional architecture. Other clickstream dimensions that allow for behavioral analysis, such as Web geography or page dimensions, would not exist in the typical data warehouse.

Marketing data is another area where there is a conflict between Web site and OLTP data. Web promotions often have different data sources, are often managed by a different marketing department than other promotions and are not easily identified within the back-end systems that fulfill the transactions. This means that the current data warehouse is probably missing both the data sources and the appropriate attributes needed to analyze promotions that occur within the Web channel.

When adding clickstream data to an existing data warehouse, it is important to examine the schema and decide how it should be modified to support the new requirements for analysis. Adding a few dimensions may not be enough. You may need to add an entire set of facts and dimensions. For example, the ability to link transactions to site activity provides detailed insight into what users are doing. Site activity is not something that exists in any form in a traditional data warehouse.

The main lesson is that integrating e-business data into your data warehouse is not always simple, and most of the time it will require modifications to the Web site in order to be effective.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access