Editor's Note: For this special issue, we asked our columnists to cover a variety of e-business topics. Their insightful commentary provides a well-rounded outlook on the benefits and challenges of the e-world. Regular column format will return next month.

We've all heard or read about the new world of clickstream analysis, that is, the ability to examine all of one's Internet interactions: what sites one visited, what products one considered buying, what one actually did buy, and so on. The difficulty for the dot-com companies performing clickstream analysis is making sense of the massive amount of information they collect and using that information on the current or next visit by the individual being tracked. The ideal situation occurs when the dot-com can answer the following questions immediately upon one's arrival at its Web site:

  • Who is the person entering the Web site?
  • What sorts of products and services should we offer him or her?
  • What types of banner ads make sense to present to him or her?

Sounds simple, doesn't it? Unfortunately, it is not that easy. In fact, without a sound and proven architecture, it becomes not just difficult but almost impossible. Let's examine what is needed to create the ideal situation, keeping in mind that all this has to happen in Internet time.

What is Clickstream Data?

Let's start with an understanding of Internet data. Though relatively simple in nature, clickstream data is quite comprehensive and informative. It consists of the following pieces of information:

Tracking information:

  • The user's identification consisting of a client IP address or proxy server identification
  • Customer or user identity (the authorized-user element, present when a secure log-on is required)
  • Date and time that the Web server responded to the request
  • The request in the form of "GET" or "POST"

Server request information about the uniform resource locator (URL):

  • Status of request (a code of 200, or OK, indicates success)
  • Number of bytes sent

Prior site information:

  • URL
  • Host name
  • Path
  • Documents
  • Query string

In addition, you can capture browser and operating system information, cookie information and session and user identification.
With this vast array of information, we can now answer basic questions such as those posed in the introduction of this article. Figure 1 lists the types of questions we can answer with this invaluable asset.
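
The fields listed above map closely onto the "combined" access-log layout that many Web servers write, so a short sketch may help make them concrete. The log format, the regular expression, the sample line and the function name parse_click are all assumptions made for illustration; your own server may record its clickstream differently.

```python
import re
from datetime import datetime

# Pattern for the widely used "combined" access-log layout (an assumption;
# the article does not name a specific log format).
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_click(line):
    """Return one clickstream record as a dict, or None if the line is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Date and time that the Web server responded to the request.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # Servers commonly write "-" when no bytes were sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


# Hypothetical sample line, invented for this example.
sample = ('192.0.2.10 - jsmith [10/Oct/2000:13:55:36 -0700] '
          '"GET /catalog/shoes?color=red HTTP/1.0" 200 2326 '
          '"http://www.example.com/home.html" "Mozilla/4.08"')
click = parse_click(sample)
```

Each parsed record carries the tracking information (client, user, time, method), the request result (status, bytes) and the prior site (referrer) in one place, which is the raw material for everything that follows.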


From the questions listed in Figure 1, it is obvious that we need access to both current, real-time data and historical, analytical data. The e-business environment requires both tactical and strategic data structures and an architecture that supports both. We have found that a good solution for this clickstream customer analysis starts with an architecture that has proven very stable over time: the Corporate Information Factory. Figure 2 shows the entire Corporate Information Factory architecture.

Figure 1: Types of Questions Answered with Clickstream Data

The Need for the ODS

If the clickstream analysis is required in real time or near real time, as for questions one through five, then the data structures and access will resemble those created in the operational data store (ODS). This requires programmatic processes to analyze current clickstream data as well as the content data that goes with it (a customer-focused ODS).

The ODS is used to determine not only who this person is, but also what recent purchases they have made, when they last purchased, when they last visited the site, where they live, how to contact them and so on. With the advent of the Class 4 ODS (see last month's column "A New Class of Operational Data Store"), we now have real-time access to strategic analysis results such as the best banner ad or the appropriate coupon to offer now. The ODS simply stores the results of analyses (appropriate banner ad IDs or coupon IDs). Now we can answer questions six through nine as well and instantly produce the right information for our customer.
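
To illustrate the idea of an ODS that stores only the results of strategic analysis, here is a minimal sketch in which an in-memory dictionary stands in for the ODS. The record layout, the sample customers, the ID values and the offers_for function are hypothetical; a real ODS would be a database, but the point is the same: at visit time the site does a single keyed lookup and never reruns the analysis.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class OdsCustomer:
    customer_id: int
    name: str
    last_visit: date
    last_purchase: Optional[date]
    # Results computed in the analytical environment and stored in the
    # ODS as plain identifiers (hypothetical ID values below).
    banner_ad_id: Optional[str] = None
    coupon_id: Optional[str] = None


DEFAULT_BANNER = "BANNER-GENERIC"  # assumed fallback for unknown visitors

# A dictionary standing in for the ODS; the rows are invented sample data.
ods = {
    1001: OdsCustomer(1001, "J. Smith", date(2000, 10, 1), date(2000, 9, 15),
                      banner_ad_id="BANNER-SHOES", coupon_id="COUPON-10PCT"),
    1002: OdsCustomer(1002, "A. Jones", date(2000, 10, 5), None),
}


def offers_for(customer_id):
    """Return (banner_ad_id, coupon_id) using a single keyed lookup."""
    customer = ods.get(customer_id)
    if customer is None:
        # Unknown visitor: no analysis results exist yet.
        return DEFAULT_BANNER, None
    return customer.banner_ad_id or DEFAULT_BANNER, customer.coupon_id
```

Calling offers_for(1001) returns the stored banner and coupon IDs for that customer, while an unknown visitor falls back to the generic banner and no coupon.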

Figure 2: The Corporate Information Factory

The operational data store structure must be designed for very fast access and usually requires a database designed in third normal form, to which it is easy to add new data. Identifiers must be created quickly to add new customers into the ODS. This may require creating surrogate keys if IP addresses are blocked or cannot always be trusted. Based on where the customer has been or is going, data rules (created by marketing) are used to send the appropriate discounts, coupons and ad banners. In other words, the Web site can readily act on that information upon identification of the visitor.
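
Both ideas in this paragraph, surrogate keys and marketing-defined rules, can be sketched in a few lines. This is only an illustration under stated assumptions: it keys visitors on a cookie identifier rather than the IP address, and the class name, the rule table and the offer IDs are all hypothetical.

```python
import itertools


class CustomerKeyRegistry:
    """Hands out surrogate keys so identity does not depend on the IP address."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self._by_cookie = {}

    def key_for(self, cookie_id):
        # Reuse the key for a returning visitor; mint a new one otherwise.
        if cookie_id not in self._by_cookie:
            self._by_cookie[cookie_id] = next(self._next_key)
        return self._by_cookie[cookie_id]


# Hypothetical marketing rules: (path prefix the visitor is browsing, offer).
# More specific prefixes come first because the first match wins.
MARKETING_RULES = [
    ("/catalog/shoes", "COUPON-SHOES-10PCT"),
    ("/catalog", "BANNER-NEW-ARRIVALS"),
]


def offer_for_path(path):
    """Return the first offer whose prefix matches the path, or None."""
    for prefix, offer in MARKETING_RULES:
        if path.startswith(prefix):
            return offer
    return None


registry = CustomerKeyRegistry()
first_key = registry.key_for("cookie-abc")
second_key = registry.key_for("cookie-xyz")
```

Keeping the rules in a simple ordered table means marketing can change them without touching the lookup code, which matters when the response has to happen in Internet time.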

Data Marts and the DW

To analyze clickstream information over time, the data warehouse with its various data marts is the appropriate environment, one meant for heavy-duty analysis. The quality of the data at this level can be questionable, with as much as 40 percent of the data incomplete or bad. The decision must be made whether to drop the bad data or keep it, knowing that it may be unusable. Whatever the decision, it must be acted upon quickly so that analysis can begin.
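
The drop-or-keep decision can be made explicit in the load process. The sketch below assumes each click is a dictionary and that four fields form the minimum usable record; the field names and the completeness rule are assumptions, not a standard.

```python
REQUIRED_FIELDS = ("client", "time", "url", "status")  # assumed minimum set


def is_complete(record):
    """A record is usable only if every required field is present and non-empty."""
    return all(record.get(field) not in (None, "", "-") for field in REQUIRED_FIELDS)


def partition_records(records, keep_bad=True):
    """Split records into (good, bad); discard the bad ones when keep_bad is False."""
    good, bad = [], []
    for record in records:
        (good if is_complete(record) else bad).append(record)
    return good, (bad if keep_bad else [])


def bad_fraction(records):
    """Share of records failing the completeness check, between 0 and 1."""
    if not records:
        return 0.0
    return sum(not is_complete(r) for r in records) / len(records)
```

Measuring bad_fraction on each load gives a running view of data quality, and the keep_bad flag records the decision in one place so it can be applied consistently and without delay.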

The data warehouse structure can now support a variety of analyses as well as different levels of summarization. Data marts are created to perform different analytical functions (customer buying habits, demographic pattern analysis, campaign analysis, analysis of responses to ads and coupons, etc.). Analysis can occur in a star schema, an OLAP cube or perhaps some data mining or exploration environment. Note that one of the important analyses is whether the customer is a transactional one (likely to be only a one-time buyer) or one that wants a relationship with our company (that is, one with whom we are willing to expend the energy and money to develop this relationship).
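
As one small example of the summarization a data mart performs, the following sketch computes a response rate per campaign. The event layout, a (campaign ID, responded) pair for every ad or coupon shown, and the function name are assumptions made for the example.

```python
from collections import defaultdict


def response_rates(events):
    """Summarize (campaign_id, responded) events into a response rate per campaign."""
    shown = defaultdict(int)
    responded = defaultdict(int)
    for campaign_id, did_respond in events:
        shown[campaign_id] += 1
        if did_respond:
            responded[campaign_id] += 1
    # Every campaign in `shown` has at least one event, so no division by zero.
    return {campaign: responded[campaign] / shown[campaign] for campaign in shown}
```

In a star schema, the same calculation would be a grouped count over a fact table of ad impressions joined to a campaign dimension; the dictionary version only shows the shape of the result.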

To accomplish this in the necessary time frame (Internet time), we need to develop appropriate data acquisition and data delivery processes. Obviously the ODS will be a major source (and recipient) of data for our strategic analysis environment. Web server logs are usually collected throughout the day and accumulated for analysis within the data warehouse. Therein lies the great difference between acting on the immediate clickstream data (ODS structure) and analyzing the Web logs for future marketing ventures based on historic evidence (data warehouse and data marts).

Marketing must continue to study this clickstream data and customer responses to its ads, banners and coupons to correct any flaws in its strategy. This analysis can determine how customers navigate through the Web site, where they paused, what they examined and passed over, when and where they left the Web site, and so on. This navigational information is invaluable feedback for future enhancements that make the Web site more usable.
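
A rough sketch of this navigational analysis follows. It assumes each click has already been reduced to a (session ID, timestamp, page) tuple; the tuple layout and the function names are hypothetical, and how sessions are identified is left to your own environment.

```python
from collections import Counter, defaultdict


def paths_by_session(clicks):
    """Group (session_id, timestamp, page) tuples into time-ordered page lists."""
    sessions = defaultdict(list)
    for session_id, _timestamp, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        sessions[session_id].append(page)
    return dict(sessions)


def exit_pages(paths):
    """Count how often each page was the last one viewed in a session."""
    return Counter(pages[-1] for pages in paths.values() if pages)


def never_viewed(paths, all_pages):
    """Pages on the site that no session visited (passed over entirely)."""
    seen = {page for pages in paths.values() for page in pages}
    return sorted(set(all_pages) - seen)
```

Pages that show up often in exit_pages, or that appear in never_viewed, are natural candidates for the enhancements discussed above.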

Clickstream analysis is an invaluable tracking mechanism into the very personality of the Web site customer and must be handled appropriately so that you can act in a timely manner with suitable messages and products. This requires an architecture with a mix of both tactical and strategic structures, having fitting flows of data occurring between these structures. Whether you are responding instantly with a banner ad or determining if past campaigns were successful, you must have an architecture that is flexible enough to handle these diverse requirements. The Corporate Information Factory has proven to be just such a reliable, consistent and maintainable architecture for your dot-com enterprise and the massive clickstream data flowing into your site.
