Within the Internet bubble, clickstream data warehousing was an early and innovative development. It encouraged the use of the first significant new type of data source since the relational database – the Web log – and promised to get inside the Web visitor's head in a way not previously imagined in order to create revenue opportunities. However, while clickstream data warehousing brought forth a tidal wave of new data points about customer clicks, it did little to support the consumption of that data or its integration and transformation into accurate, usable, trustworthy information. That transformation required a series of data processing steps that are now part of the standard IT repertoire. The enduring truth of the post-Internet era is that data warehousing was not a paradigm shift; as a result, it did not participate in much of the hype and collapse that characterized the dot-com meltdown.

Fast forward to today, and successful e-tailers such as Amazon do indeed operate with terabyte clickstream data structures that are the source of significant analysis of customer behavior, CRM-type promotions and collaborative filtering. That material is now mainstream, and the surviving Web e-tailers understand the practices. Their enterprise data models (and implementations) now include key data dimensions and attributes essential to the Web such as page hierarchies, sessions, user IDs and shopping carts.

The process of handling clickstream data is now well understood, though not necessarily simple. For example, it requires activities such as reformatting the Web server logs, parsing log event records, resolving IP addresses, matching sessions, identifying pages and user IDs, performing match-merge processing, and identifying customers, products, abandoned shopping carts and click-throughs. The extent to which such processes have become a de facto industry standard is indicated by the ready availability of Web log extractors from the ETL vendors; the work is often facilitated by using a connector or adapter from one of the best-of-breed ETL tools such as Ascential, Hummingbird, Informatica or SAS. As indicated, building an intelligent information integration process to get from a click on a Web page to a relationship with a customer was indeed the right thing to do – and those Web-oriented firms that did not do so (regardless of the reason) no longer exist. The outcome is that the clickstream is similar to other data in its life cycle: it starts out as transactional data and, through various transformations along the information supply chain, becomes decision support and a source of analytic insights. In data warehousing, the new realities are the old data management realities:
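To make those steps concrete, the following sketch shows how the first few of them – parsing raw log event records and matching sessions – might look in practice. It is a minimal illustration, assuming an Apache-style combined log format and a 30-minute inactivity timeout; the function names and the sessionization heuristic are hypothetical, not drawn from any particular ETL tool.

```python
import re
from datetime import datetime, timedelta

# Regex for an Apache-style "combined" log line (a common Web server log format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

SESSION_TIMEOUT = timedelta(minutes=30)  # a common, but assumed, sessionization rule

def parse_log_line(line):
    """Parse one raw log event record into a dictionary, or return None if malformed."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    event = m.groupdict()
    # Drop the timezone offset and keep the timestamp for session matching.
    event["ts"] = datetime.strptime(event["ts"].split()[0], "%d/%b/%Y:%H:%M:%S")
    return event

def sessionize(events):
    """Group parsed events into sessions keyed by visitor (IP + user agent),
    starting a new session after 30 minutes of inactivity."""
    sessions, last_seen, session_ids = {}, {}, {}
    next_id = 0
    for e in sorted(events, key=lambda e: e["ts"]):
        visitor = (e["ip"], e["agent"])
        if visitor not in last_seen or e["ts"] - last_seen[visitor] > SESSION_TIMEOUT:
            next_id += 1
            session_ids[visitor] = next_id
            sessions[next_id] = []
        sessions[session_ids[visitor]].append(e)
        last_seen[visitor] = e["ts"]
    return sessions
```

Downstream steps – resolving IP addresses, match-merge processing against customer records, identifying abandoned carts – would operate on the sessions this produces, which is exactly the kind of plumbing the commercial Web log extractors package up.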

Perception of business value migrates in the direction of the user interface. If the work of upstream data integration is successful, it will result in an "Aha!" experience as the business analyst gains an insight about customer relations, product offerings or market dynamics. However, a new or better user interface is not in itself the cause of the breakthrough; without the work of integrating the upstream data, the result would not have been possible.

Data integration requires schema integration. Data integration is arguably a trend, with many of the enterprise application integration (EAI), extract, transform and load (ETL) and customer data integration (CDI) vendors leading the charge. This is useful and valid, but as a trend it is subject to marketing hype that leaves users with incomplete results. A schema is a database model (structure) that represents the data accurately and in a way that is meaningful. To compare entities such as customers, products, sales or store geography across different data stores, the schemas must be reconciled as to consistency and meaning. If the meanings differ, then translation (transformation) rules must be designed and implemented. The point is that IT developers cannot "plug into" data integration by purchasing a "plug-in" for a tool without also undertaking the design work to integrate (i.e., map and translate) the schemas representing the targets and sources.
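As an illustration of what that design work produces, the sketch below applies a declarative field map plus a translation rule to one source record. The schemas, field names and currency rates here are hypothetical; the point is that the mapping and translation logic must be designed by people who understand both schemas before any tool can apply it.

```python
# A minimal schema-mapping sketch: source field names and codes are assumed,
# not taken from any real system.

# Declarative map from a source order-entry schema to a target warehouse schema.
FIELD_MAP = {
    "cust_no": "customer_id",
    "prod_cd": "product_id",
    "ord_dt":  "order_date",
    "amt":     "sales_amount_usd",
}

def to_usd(amount, currency):
    """Translation rule for a field whose meaning (currency) differs across sources."""
    rates = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}  # illustrative rates only
    return round(amount * rates[currency], 2)

def transform_row(source_row):
    """Apply the field map plus translation rules to one source record."""
    target = {FIELD_MAP[k]: v for k, v in source_row.items() if k in FIELD_MAP}
    target["sales_amount_usd"] = to_usd(source_row["amt"], source_row["currency"])
    return target

# A record expressed in the source schema emerges in the target schema.
print(transform_row(
    {"cust_no": 1042, "prod_cd": "SKU-77", "ord_dt": "2003-06-01",
     "amt": 100.0, "currency": "EUR"}
))
```

The tool can execute a map like this; it cannot discover it. That discovery – deciding that cust_no and customer_id mean the same thing, and that amounts must be converted before they can be compared – is the schema integration work no plug-in replaces.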

Design consistent and unified definitions of product, customer, channel, sales, store geography, etc. This is the single most important action an IT department can undertake regarding a data warehousing architecture. Key data dimensions and attributes now also include those relevant to the Web, such as page hierarchies, sessions, user IDs and shopping carts. Every department (finance, marketing, inventory, production) wants the same data in a different form – that is why the star schema design and its data warehouse implementation were invented. Extensive research is available on how to avoid the religious wars between data warehouses and data marts by means of a flexible data warehouse design.
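The sketch below illustrates the idea with a small star schema in which a clickstream fact table and a sales fact table share conformed customer and product dimensions. The table and column names are illustrative assumptions, not a prescribed design.

```python
import sqlite3

# Conformed dimensions shared by two fact tables, so finance, marketing and
# Web analysis all slice the same definitions of customer, product and page.
DDL = """
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_id TEXT, segment TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, sku TEXT, category TEXT);
CREATE TABLE dim_page     (page_key     INTEGER PRIMARY KEY, url TEXT, page_hierarchy TEXT);

-- Web clickstream fact: one row per page event, keyed to the shared dimensions.
CREATE TABLE fact_click (
    session_id   TEXT,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    page_key     INTEGER REFERENCES dim_page(page_key),
    event_time   TEXT,
    cart_added   INTEGER
);

-- Sales fact: uses the same customer and product dimensions, so Web behavior
-- and revenue are analyzed against one consistent definition of customer.
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    order_date   TEXT,
    sales_amount REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
print("Star schema tables:", [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])
```

Because both fact tables join to the same dimension rows, each department gets the data in its own form while the underlying definitions stay unified – which is the point of conforming the dimensions in the first place.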

Though this author was no better at timing the April 2000 signal of the bursting bubble than anyone else, this prediction – the destiny of the clickstream to become just another enterprise data source – is a correct call I made early and often. In summary, the Web – e-commerce logs, Web-sourced e-mail submissions, etc. – is now just another enterprise data source. The transformation of the clickstream into a source of insight about fundamental business imperatives – which customers are buying which products or services – is now part of the mix of heterogeneous data assets.
