One of the more heartening trends of 2002 is that bad management is finally getting a bad name. Think Enron, Global Crossing, dot-bombs of all types, questionable accounting practices, etc. If your job title contains the words visionary, guru, wizard, evangelist or champion, allow me to suggest that this is the very last bubble, and it may be time to refurbish your image. I even have a recommendation: Be the first to start a clickstream data warehouse project at your enterprise.

I know what you are thinking: Clickstream data warehousing died with the Internet. Let's examine this statement. First, has the Internet actually died? Ask yourself the following questions:

  • Did you buy more goods and services over the Internet in 2001 than in 2000?
  • Did you install a high-speed Internet connection at your home last year or spend much of the year badgering your local cable and telephone companies to provide service?
  • Does your enterprise have a Web site?
  • Including both internal and external Web sites, how many sites does your enterprise now have? Too many to count?
  • Do portions of your enterprise supply chain now run over the Internet?
  • Do your employees obtain corporate information, everything from sales seminars to 401(k) statements, using the Web?

My guess is that you answered "yes" to most of these questions, and it is clear that the Internet continues to be the most powerful trend in information technology (IT). But, again, I know what you are thinking: I can't possibly go to my boss and ask for money to create a clickstream data warehouse.

Let's address this dilemma by asking another question: Does your enterprise already have a data warehouse for non-Internet-related business activities? I would guess that the answer is yes. If traditional business activity warrants a data warehouse, why wouldn't Internet activity warrant the same level of business intelligence?

Out of Sight, Out of Mind

I think one of the reasons that Web activity is not given the same level of management attention as other kinds of business activity is that it is carried out entirely by machines, without human involvement. Web servers continue to operate whether or not anyone is using them, and whether or not anyone is happy with the content and services your sites provide. Your Web applications will dutifully serve up "link not found" errors, page abandonments, shopping cart abandonments and site abandonments, as silently as an owl in flight, unless somebody actually analyzes the resulting clickstreams. Imagine mail-order catalogs with random blank pages, or call centers that hang up on users in the middle of their calls, and you begin to get a picture of what is probably happening on Web sites that lack clickstream analysis. Without a clickstream data warehouse, the e-business environment can be remarkably opaque: often nothing more than a set of grand assumptions with little understanding of the actual dynamics of the underlying user community.

The irony is that clickstream data is remarkably easy to collect. Unlike the tortured mechanisms that have been created to extract data from operational systems into data warehouses in traditional environments, all popular Web servers automatically record user clickstreams in several standard log file formats. This avalanche of detailed user-behavior data can be transformed and loaded into clickstream data warehouses.
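
To make this concrete, here is a minimal sketch of what reading that raw material could look like, assuming Apache-style Combined Log Format records. The field layout, sample line and parsing code are illustrative only; a real project should verify the log format its own servers are configured to write.

```python
# Minimal sketch: parsing Apache-style Combined Log Format records,
# one common source of raw clickstream data. The layout below is
# illustrative; real deployments should confirm their configured format.
import re
from datetime import datetime

COMBINED_LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Turn one raw log line into a dictionary suitable for staging."""
    match = COMBINED_LOG_PATTERN.match(line)
    if not match:
        return None  # malformed or nonstandard line; route to an error file
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

sample = ('192.0.2.10 - - [15/Jan/2002:10:31:07 -0500] '
          '"GET /catalog/item42 HTTP/1.1" 200 5120 '
          '"http://www.example.com/home" "Mozilla/4.0"')
print(parse_log_line(sample)["url"])
```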

Another aspect that has yet to be exploited is the richness of the information a clickstream contains. Clickstream data is much more comprehensive than the CRM-style data collected by many companies. The kicker is that in the Internet environment, an enterprise can know everything about what a user does on a Web site, whether the user is a customer or not. Thus, the enterprise can get a picture of total market behavior that goes well beyond its customer base, which is a key competitive advantage.

Unless your enterprise is a monopoly, your business serves only a fraction of the total market. While you may understand your existing customers well in the CRM sense, the key to growth and ultimate market dominance is understanding what non-customers want and how to convert them. Clickstream data warehouses are the mechanism for moving beyond CRM to eRM: electronic relationship management of the entire user marketplace, not just your customers.

Log File Analysis Tools

One fatal error in thinking that spread like an epidemic through the venture capital-financed dot-com community was the belief that the site traffic statistics produced by Web log file analysis tools were a sufficient replacement for the business intelligence garnered from detailed clickstream data analysis. Many venture capitalists (VCs) required startups to use these tools, and both the VCs and the companies used the resulting site traffic statistics to "prove" that their businesses were robust and worthy of going public.

However, increasing site traffic does not necessarily mean business is good. If few visitors actually buy anything, if shoppers abandon shopping carts at the last moment, or if the cost of fulfilling an order exceeds its purchase price, the business is on very shaky ground. Without more sophisticated analyses, such as what happened over the course of a site visit, why a shopping cart was abandoned or how effective a promotion targeted at a particular user population was, log file analysis tools can dupe management and investors into thinking that a business is sound. You need a clickstream data warehouse to conduct these kinds of analyses; the traffic statistics emitted by log file analysis tools are really aimed at Web site administrators, not business management and investors.
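
As a small illustration of the difference, the sketch below computes cart abandonment and conversion rates from session-level records. The record layout and field names are hypothetical stand-ins for rows from a session-grain fact table, not a prescribed design.

```python
# Illustrative sketch only: the session records and field names below are
# hypothetical, standing in for rows from a session-level fact table in a
# clickstream data warehouse.
sessions = [
    {"session_id": 1, "added_to_cart": True,  "completed_order": False, "revenue": 0.0},
    {"session_id": 2, "added_to_cart": True,  "completed_order": True,  "revenue": 89.50},
    {"session_id": 3, "added_to_cart": False, "completed_order": False, "revenue": 0.0},
]

def cart_abandonment_rate(sessions):
    """Share of sessions that put items in a cart but never completed an order."""
    carted = [s for s in sessions if s["added_to_cart"]]
    if not carted:
        return 0.0
    abandoned = [s for s in carted if not s["completed_order"]]
    return len(abandoned) / len(carted)

def conversion_rate(sessions):
    """Share of all sessions that ended in a completed order."""
    return sum(1 for s in sessions if s["completed_order"]) / len(sessions)

print(f"Cart abandonment: {cart_abandonment_rate(sessions):.0%}")
print(f"Conversion:       {conversion_rate(sessions):.0%}")
```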

Building a Clickstream Data Warehouse

As so often happens in modern enterprises, management moves forward in an uneasy partnership with IT. After reading this article, I hope that management realizes the need for a clickstream data warehouse. The next step is to ask IT to implement it, and the initial reaction of most IT personnel, even those who are familiar with other data warehouse environments, is one of caution. This is not surprising. The technology base for Web applications is not rooted in the familiar SQL databases of older ERP and financial applications. Web sites use HTML, HTTP, cookies, Web bugs, JavaScript, J2EE, Apache, Microsoft IIS, BEA WebLogic, IBM WebSphere and many other new and unfamiliar technologies. To overcome this knowledge barrier, the staff that builds the clickstream data warehouse needs to become familiar with these technologies. Education is the key, and expert consultants or a handful of new books on the topic can provide it.

Just as the data warehousing staff needs education about Web technologies, the Web site implementation group also needs an education in data warehouse requirements. Most Web developers are unfamiliar with databases, SQL, data warehousing concepts (such as star schemas), ETL (extract, transform and load) tools, query tools and so forth. Many Web application server products do not log critical clickstream information such as user cookies and dynamically generated page content unless customizations are made to the implementation. Without these customizations, a clickstream data warehouse can be hobbled from the start by missing business information.
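
To give a flavor of what such a customization might involve, here is a hypothetical sketch of a small piece of Python middleware that records the user-identity cookie and the logical page name alongside each request, information a stock access log may omit. The cookie name and response header it looks for are assumptions made for illustration, not a standard.

```python
# Hypothetical sketch: a small WSGI middleware that logs the user-identity
# cookie and the logical page template with each request. The "user_id"
# cookie and "X-Page-Template" header are illustrative assumptions.
import logging
from http.cookies import SimpleCookie

clickstream_log = logging.getLogger("clickstream")
logging.basicConfig(level=logging.INFO)

class ClickstreamLogger:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
        user_id = cookies["user_id"].value if "user_id" in cookies else "-"

        captured = {}
        def capture(status, headers, exc_info=None):
            captured["status"] = status
            # Assume the application tags responses with the template it rendered.
            captured["page"] = dict(headers).get("X-Page-Template", "-")
            return start_response(status, headers, exc_info)

        response = self.app(environ, capture)
        clickstream_log.info(
            "user=%s page=%s path=%s status=%s",
            user_id, captured.get("page", "-"),
            environ.get("PATH_INFO", "-"), captured.get("status", "-"),
        )
        return response
```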

Implementation of a clickstream data warehouse is somewhat similar to other data warehouse projects, with four important exceptions:

  • The source data, consisting of HTTP transaction data (the clicks), cookies, URL query strings and other Web data, is exotic and subject to quirks caused by Web-page caching, browser idiosyncrasies and other issues. The project-specific limitations of the source data must be thoroughly understood before proceeding with implementation.
  • The identity of Web users is established using a variety of nontraditional technologies, such as user cookies, session cookies, Web bugs and registration information. Exactly which site activities can be identified, and to what degree, must be an integral part of the design of the clickstream data warehouse.
  • The dimensional schema for a clickstream data warehouse is very different from other, more traditional star schemas, with dimensions such as Users, Content, Activity, Host Geography and Referrer. Also, the User Activity fact table has three levels of aggregation (hits, page views and sessions), and none of these is a straightforward rollup from the level below.
  • Clickstream ETL is tricky and counterintuitive (a simplified sessionization sketch follows this list). If the clickstream data is to mesh with existing data warehouse data, dimensions such as Time and Customer (called User in a Web environment because you generally don't have to be a customer to use a Web site) have to be conformed to deal with the new situation. This can involve a considerable amount of rework to an existing data warehouse schema.
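
To see why the ETL is counterintuitive, consider sessionization: individual hits must be grouped into per-user sessions before any session-level facts can be loaded, and nothing in the raw logs explicitly marks where one session ends and the next begins. The sketch below shows one simplified approach based on an inactivity timeout; the field names and the 30-minute threshold are illustrative assumptions rather than a prescribed design.

```python
# Simplified sketch of one tricky ETL step: grouping individual hits into
# sessions per user cookie, using an inactivity timeout. Field names and the
# 30-minute threshold are illustrative assumptions.
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(hits):
    """Assign a session number to each hit, per user, ordered by time.

    `hits` is an iterable of dicts with at least 'user_cookie' and 'time'
    (a datetime). Returns the hits with a 'session_seq' field added.
    """
    hits = sorted(hits, key=lambda h: (h["user_cookie"], h["time"]))
    last_seen = {}     # user_cookie -> time of that user's previous hit
    session_seq = {}   # user_cookie -> current session number
    for hit in hits:
        user = hit["user_cookie"]
        previous = last_seen.get(user)
        if previous is None or hit["time"] - previous > SESSION_TIMEOUT:
            session_seq[user] = session_seq.get(user, 0) + 1  # new session starts
        hit["session_seq"] = session_seq[user]
        last_seen[user] = hit["time"]
    return hits

hits = [
    {"user_cookie": "abc", "time": datetime(2002, 1, 15, 10, 0)},
    {"user_cookie": "abc", "time": datetime(2002, 1, 15, 10, 5)},
    {"user_cookie": "abc", "time": datetime(2002, 1, 15, 11, 30)},  # new session
]
for h in sessionize(hits):
    print(h["user_cookie"], h["time"], "session", h["session_seq"])
```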

With the help of an expert consultant or a good book on this topic, you should be able to implement a clickstream data warehouse in approximately five to six months. The resulting business intelligence from your enterprise's Web initiatives will identify you with the good management side of the information technology equation, and that's a precious thing in these troubled times.

Published in January 2002, Clickstream Data Warehousing, by Mark Sweiger, Mark Madsen, Jimmy Langston and Howard Lombard (John Wiley & Sons, Inc.), is designed as the benchmark reference volume on clickstream data warehousing. The book consists of two parts. Part 1, Clickstream Data Warehouse Architectural Foundations, explains the new Web technologies one needs to understand to conduct a clickstream data warehouse project, including log file formats, cookies and user identity mechanisms, Web servers, application servers and other Web application architecture components. In many cases, the data warehouse project team is unfamiliar with Web application infrastructure, and this lack of knowledge can result in slipped schedules, lukewarm user acceptance and even project failure.

Part 2, Building a Clickstream Data Warehouse, Step-by-Step, makes up the remainder of the volume. It is a handbook on designing and implementing a clickstream data warehouse, aimed at all members of the clickstream data warehouse project team. It covers all the issues, including project staffing and management, schema design, physical database design, ETL (extract, transform and load) and end-user analysis. Just as data warehouse-oriented staff can learn about important Web architecture issues in Part 1, Web site-oriented staff, such as site designers and Webmasters, can use Part 2 to discover which Web site design features are required to support a good clickstream data warehouse.

The book also has a companion Web site at www.ClickstreamDataWarehousing.com. The site is designed to extend the information in the book and offer the data warehouse community a source for further enlightenment and interaction. It includes an annotated table of contents, book reviews, additional material referenced in the book (including a clickstream data warehouse project plan), related articles, links to related sites and a discussion forum.
