The coming year will show what data warehousing looks like now that the world is flat. Data warehousing trends in a flat world will be driven by open source platforms for data management, offshore everything and the commoditization of infrastructure through relatively low cost servers (which, if they were any less expensive, would be disposable).

Saying the world is flat is a metaphor for the leveling effect created by the digitalization, globalization and commoditization of information technology. An increasingly level playing field exists between the Midwestern United States and Bangalore, India, or Shanghai, China. An abundance of fiber-optic bandwidth, along with the Internet, open source, outsourcing, offshoring and the infusion of information into business processes, means that productivity-enhancing innovations will depend on ever-expanding communities of collaboration, communication and cooperation. But if the world is flat, it is still not completely flat.

Friction is required in order for trends to gain traction and move forward. Bumps in the level playing field of data warehousing in a flat world are presented by complex and heterogeneous master data, data quality issues and the opacity of distributed information. In most cases, forward motion will not be linear and may even double back on itself in circular and difficult ways. Three paradoxes will characterize the dynamics around data warehousing trends in the year ahead:

  • Data persistence amid the location transparency of one of the most powerful trends - service-oriented architecture (SOA);
  • The proprietary data warehousing appliance in a world of increasingly commoditized infrastructure; and
  • Thin slices of information in a tidal wave of data.

The paradox of data persistence amid the location transparency of SOA is the first bump in the road to data warehousing in a flat world.
The beauty of SOA for data warehousing is that it offers location transparency combined with the action-at-a-distance characteristic of Web-centric computing. SOA is one of the best approaches to have come along since the Web was invented. It enables enterprises to make the Web useful, reusable and manageable for business purposes, consistent with the basic first principles of tight cohesion, loose coupling and object-oriented design. In short, it enables in a practical way one of the holy grails of business computing - information as a service.

At the same time, the challenges SOA presents to traditional data warehousing should not be underestimated. It is the exact architectural opposite of traditional data warehousing, especially if the latter is a large, centralized, persistent data store. SOA wants the underlying data to be transparent and location independent. But as powerful as computers have become, there is still room for doubt about whether they are powerful enough to perform really big joins on the fly without a performance penalty or regard for data movement. This is why the "virtual data warehouse" remains an illusion with minimal justification and minimal adoption in the enterprise. A tension exists in building a complete architecture - for example, in the form of a computing grid - between being realistic about performance and abstracting away location in order to do what SOA does best - provide information as a service. In short, the race between growing volumes of complex data and computing power is expected to continue even as implementations of SOA for data warehousing go forward.
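To make the performance concern concrete, here is a minimal Python sketch contrasting a "virtual warehouse" query that joins data from two remote systems on the fly with a query against a table that was joined and persisted at load time. The sources, names and figures are hypothetical and simulated in memory; the point is only that the virtual path pays the data movement and join cost on every request.

```python
# Minimal sketch (hypothetical data and names) contrasting a "virtual" federated
# join with a query against a persisted warehouse table.

# Two remote sources, simulated here as in-memory lists of dicts.
crm_customers = [{"cust_id": 1, "name": "Acme"}, {"cust_id": 2, "name": "Globex"}]
erp_orders = [{"cust_id": 1, "amount": 1200.0}, {"cust_id": 1, "amount": 300.0},
              {"cust_id": 2, "amount": 80.0}]

def federated_revenue_by_customer():
    """'Virtual warehouse' style: pull both sides over the wire, join on the fly.
    Cost grows with the full volume of both sources on every query."""
    totals = {}
    for order in erp_orders:                      # full scan of remote orders
        totals[order["cust_id"]] = totals.get(order["cust_id"], 0.0) + order["amount"]
    return [{"name": c["name"], "revenue": totals.get(c["cust_id"], 0.0)}
            for c in crm_customers]               # full scan of remote customers

# Persisted warehouse style: the join was done once during the load, so the
# query only touches one local, pre-joined table.
warehouse_revenue = [{"name": "Acme", "revenue": 1500.0},
                     {"name": "Globex", "revenue": 80.0}]

print(federated_revenue_by_customer())
print(warehouse_revenue)
```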

This is one area where the SOA approach will learn a few lessons from the proponents of extract, transform and load (ETL) tools. Data transformation is now a service, too. ETL technology has demonstrated significant ingenuity and innovation in building metadata adapters, connectors and interfaces to a wide variety of data sources and targets on a truly dizzying variety of platforms. When this is combined with the concurrent development of on-the-fly data integration technology able to juxtapose, combine and compare unstructured and semistructured information interactively, albeit with constraints, then the stage is set for a breakthrough in squeezing latency out of the information supply chain and delivering business answers to those who need them in time to act on the recommendations.
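As an illustration of data transformation as a service, the following is a minimal Python sketch of the classic ETL pattern - extract from a source, transform to the warehouse's conventions, load into the target. The field names and formats are hypothetical placeholders for the adapters and connectors a real tool would provide.

```python
# Minimal ETL sketch (hypothetical source fields and target schema): extract rows
# from a source, transform them to the warehouse's conventions, and load them.

from datetime import datetime

def extract(source_rows):
    """Extract: read raw records from a source system (simulated as a list)."""
    return list(source_rows)

def transform(rows):
    """Transform: standardize names, parse dates and normalize amounts."""
    out = []
    for row in rows:
        out.append({
            "customer_name": row["cust"].strip().upper(),
            "order_date": datetime.strptime(row["date"], "%m/%d/%Y").date(),
            "amount_usd": float(row["amt"]),
        })
    return out

def load(rows, warehouse_table):
    """Load: append conformed rows to the target table (simulated as a list)."""
    warehouse_table.extend(rows)

warehouse = []
raw = [{"cust": " acme ", "date": "01/15/2006", "amt": "1200.00"}]
load(transform(extract(raw)), warehouse)
print(warehouse)
```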

The standalone ETL tool is being upgraded to the data integration hub, which includes ETL-like processing for large batch data volumes and information integration messaging for time-sensitive updates. A data integration hub is an ideal point in the data warehousing architecture to check on (and improve) data quality and rationalize heterogeneous master data to a conforming paradigm.
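The kind of quality gate such a hub can enforce might look, in outline, like the following minimal Python sketch. The validation rules and field names are hypothetical stand-ins for the checks an enterprise would actually apply before records reach the warehouse.

```python
# Minimal sketch (hypothetical rules) of a data quality gate at an integration
# hub: inbound records are validated before they reach the warehouse, and
# rejects are routed to an exception queue for remediation.

import re

QUALITY_RULES = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "email":       lambda v: isinstance(v, str) and re.match(r"[^@]+@[^@]+\.[^@]+", v),
    "amount_usd":  lambda v: isinstance(v, (int, float)) and v >= 0,
}

def quality_gate(records):
    """Split inbound records into clean rows and rejects with failure reasons."""
    clean, rejects = [], []
    for rec in records:
        failed = [field for field, rule in QUALITY_RULES.items()
                  if field not in rec or not rule(rec[field])]
        if failed:
            rejects.append({"record": rec, "failed_rules": failed})
        else:
            clean.append(rec)
    return clean, rejects

inbound = [{"customer_id": 7, "email": "buyer@example.com", "amount_usd": 42.0},
           {"customer_id": -1, "email": "not-an-email", "amount_usd": 10.0}]
clean, rejects = quality_gate(inbound)
print(len(clean), "clean,", len(rejects), "rejected")
```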

Wherever you have data, you have master data. The care and management of that data is how the information system comes to represent the market context in which the business operates. Master data is one of the ways to set the standard for defining data and information quality. If the master data is out of line, so is the quality of the information. The ERP revolution raised the hope of finally consolidating master data around a single transactional system of record. But those hopes were disappointed as proliferating instances of ERP applications were supplemented with customer relationship management (CRM), supply chain management (SCM) and analytic applications (data marts) corresponding to each. The result was a proliferation of silos and data marts. In short, the single version of the truth and its representation of the system of record continues to be a point on the horizon toward which our system development efforts converge but which we never seem to reach. If it is supposed to be a master file, then why are there so many of them? We are chasing a moving target. In the year ahead, the IT function will regroup around master data management, acknowledge that large data warehouses are common and differentiate on the ability to perform near real-time and real-time updates. Going forward, the critical path to enterprise data warehousing will lie through the design and implementation of consistent and unified representations (masters) of customers, products and whatever other master data entities are needed to run your business.
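By way of illustration, here is a minimal Python sketch of rationalizing heterogeneous master data. The source systems, identifiers and matching logic are hypothetical, and real master data management uses far more sophisticated matching, but the shape of the problem is the same: many source keys collapsing onto one conformed master.

```python
# Minimal sketch (hypothetical systems and keys) of conforming customer master
# data: records from ERP, CRM and a data mart are matched on a normalized name
# and collapsed onto a single conformed master key.

def normalize(name):
    """Crude matching key: lowercase, strip punctuation and legal suffixes."""
    key = "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").strip()
    for suffix in (" inc", " corp", " llc"):
        if key.endswith(suffix):
            key = key[: -len(suffix)]
    return key.strip()

source_records = [
    {"system": "ERP",  "id": "E-102", "name": "Acme Corp"},
    {"system": "CRM",  "id": "C-88",  "name": "ACME, Inc."},
    {"system": "MART", "id": "M-7",   "name": "Globex LLC"},
]

masters = {}                      # conformed masters keyed by normalized name
for rec in source_records:
    key = normalize(rec["name"])
    master = masters.setdefault(key, {"master_key": len(masters) + 1,
                                      "name": rec["name"], "source_ids": []})
    master["source_ids"].append((rec["system"], rec["id"]))

for master in masters.values():
    print(master)
```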

A tipping point has been reached, and going forward, you will need only one kind of database to run both the transactional and BI parts of enterprise systems. There will still be different instances due to performance requirements to support diverging transactional and BI workloads, but both will operate with the same database. Proprietary systems that operate with special-purpose technology stacks and databases are out. Open systems - including de facto standards such as IBM DB2, Oracle and Microsoft SQL Server - are in. Open source databases will remain outside the mainstream due to a lack of features, functions and experience, but will exert a remorseless flattening influence on the major players through downward pressure on prices.

This is completely consistent with the trend toward data warehousing appliances that has emerged over the past two years and will continue to gain traction in 2006. Most firms do not have the in-house expertise to balance computing power, disk I/O and network capacity in a labor-intensive, iterative process of data warehouse system configuration. Preconfigured data warehousing appliances, predefined quasi-appliances and balanced configuration systems will gain even more market traction, reaching $2.5 billion within eighteen months (or about 20 percent of the overall data warehousing market). However, the majority of those dollars will go to large, established, late-arriving major innovators, not the original upstart, proprietary ones. They will operate with a standard relational database.

The paradox of the data warehousing appliance - a proprietary and special purpose solution assembled out of low-cost, commodity components - will ultimately define the outer boundary of the appliance market as enterprise data marts. There is no reason why four-way Dell Intel servers should cost three to five times as much when overlaid by a proprietary parallel database as they do when purchased retail. They will not. Discounting will reach the point of no return under the pressure of an increasing coefficient of flatness dictated by open source, commodity infrastructure and competition defined in such terms. Meanwhile, data marts, no matter how big, rarely grow up to be data warehouses. The appliance phenomenon will itself be flattened, and it will merge with and be subsumed by enterprise data warehousing within three years, but only because the major players will have succeeded in co-opting the technology by then.

The third paradox is that of thin slicing. One of the keys in a coherent data warehousing design is deciding on the proper level of granularity. With the collection of point-of-sale records, individual transactions and now RFID tags, the tendency in BI has been shifting to finer and finer granularity. The relevant customer, inventory or service processes are put under an increasingly fine-grained microscope. The idea is that if you have the right thin slice - the transaction that shows the customer is about to churn - then you can make the right offer and keep the customer. But all those thin slices add up to a veritable mountain of data. It is true that the outlier in the data mining algorithm has a good chance of being a fraudulent claim or other interesting anomaly, but accumulating all the detailed data to find the trend against which the outlier is an outlier results in an explosion of data points and volume.
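A minimal Python sketch (on synthetic data) shows why: flagging the one anomalous transaction means scanning and retaining the entire population of detail records, because the outlier is only an outlier against that baseline. The thresholds and figures here are illustrative, not a recommendation.

```python
# Minimal sketch (synthetic data) of the thin-slicing paradox: finding the one
# anomalous transaction requires keeping and scanning every fine-grained record,
# because the outlier is only an outlier relative to the full population.

import random
import statistics

random.seed(0)
# Millions of point-of-sale records in practice; a small synthetic sample here.
amounts = [random.gauss(40.0, 5.0) for _ in range(10_000)]
amounts.append(400.0)  # one suspicious transaction buried in the detail

mean = statistics.fmean(amounts)
stdev = statistics.stdev(amounts)

outliers = [(i, a) for i, a in enumerate(amounts)
            if abs(a - mean) / stdev > 6]   # flag amounts more than 6 sigma out
print(f"scanned {len(amounts)} rows, flagged {len(outliers)} outlier(s): {outliers}")
```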

The paradox of thin slicing is that it leads to an explosion of data. The veteran salesperson knows immediately whether the prospect will buy or is lying about his intentions and gains enormous bandwidth by not wasting time on those who will not. But when an ordinary analyst tries to reverse engineer the veteran's method, an explosion of data results. The devil is in the details, and the details are numerous. The flicker of contempt in the client's expression shows the relationship with the bank (or the mobile phone company) is in trouble, but to get at that expression you have to code every millisecond in a 10-minute transcript, and that is 600,000 data points. The blink of an eye is indeed a short piece of data, and you only need one to make the proper inference. But how do you know which one? It turns out that there are a lot of blinks. The advantage is to the one who first develops the smart methods in predictive analytics to identify the right blink. With large data warehouses of clean, consistent, rationalized data becoming increasingly common, the competitive advantage shifts to those firms able to mine that data for predictive analytics about customers, product demand and market dynamics.
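What that shift toward smart methods might look like, in miniature: the following sketch uses synthetic data and hypothetical feature definitions, and assumes the scikit-learn library is available, to score each customer's churn risk from a handful of thin-slice features rather than asking an analyst to review every interaction.

```python
# Minimal sketch (synthetic data, hypothetical feature meanings) of predictive
# analytics for churn: a simple model learns which "thin slices" matter and
# scores current customers so the retention offer goes to the right ones.
# Assumes scikit-learn is installed.

from sklearn.linear_model import LogisticRegression

# Features per customer: [complaints_90d, days_since_last_purchase, spend_trend]
X_train = [[0, 5, 1.2], [1, 20, 0.9], [4, 60, 0.4], [0, 10, 1.0],
           [3, 45, 0.5], [5, 90, 0.2], [0, 7, 1.1], [2, 30, 0.7]]
y_train = [0, 0, 1, 0, 1, 1, 0, 1]          # 1 = customer churned

model = LogisticRegression().fit(X_train, y_train)

# Score current customers and surface the riskiest ones for a retention offer.
current = {"cust_17": [4, 55, 0.3], "cust_42": [0, 8, 1.1]}
for cust, features in current.items():
    risk = model.predict_proba([features])[0][1]
    print(f"{cust}: churn risk {risk:.2f}")
```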

Regardless of filtering or predictive functions, a key challenge of data warehousing is to get the data out in a timely way. Many enterprises have demonstrated the ability to build really big data warehouses - to get the data in. Super large, multiterabyte data warehouses are now common. More of a challenge is to update this information and get access to it with low latency - to get the data out. This happens in a conforming and performing way much less frequently than the press and vendor hype might suggest. If you had spent $10 million on a proprietary system over the past five years and it was underperforming, would you want to see it written up in the press against your name? Of course not.

Better to build another data mart. Or is it? It's a bit of a dirty little secret that the result of this failure to master latency on the part of proprietary databases is the proliferation of data marts around some of the supposedly centralized, high-performance data warehouses. Going forward, the advantage - and a key differentiator among data warehousing competitors - will be to those enterprises that are able to perform sustained real-time update of the data warehouse, gaining access to low latency data in a timely way. An obvious corollary of this principle will be the value of (and trend to) data mart consolidation.
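What sustained near real-time update might look like, in outline: the following minimal Python sketch (with a hypothetical change-record format) applies small micro-batches of captured changes to the warehouse as upserts rather than waiting for a nightly full reload.

```python
# Minimal sketch (hypothetical change-record format) of near real-time update:
# small batches of change records are applied to the warehouse table as upserts
# every few minutes instead of a nightly full reload.

warehouse = {101: {"status": "active", "balance": 500.0}}   # keyed by customer_id

def apply_micro_batch(changes, table):
    """Upsert a small batch of change records captured since the last cycle."""
    for change in changes:
        key = change["customer_id"]
        if change["op"] == "delete":
            table.pop(key, None)
        else:                                   # insert or update
            table.setdefault(key, {}).update(change["fields"])

# One micro-batch as it might arrive from change data capture on the source.
batch = [
    {"op": "update", "customer_id": 101, "fields": {"balance": 350.0}},
    {"op": "insert", "customer_id": 202, "fields": {"status": "new", "balance": 0.0}},
]
apply_micro_batch(batch, warehouse)
print(warehouse)
```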

One trend that will be strictly limited in its traction deserves honorable mention. The much heralded convergence of structured data and unstructured content will continue to hang fire (not happen) due to immaturity of the technology, applications and business case. Until XML is driven into the database and becomes as easy to use and ubiquitous as SQL, managing the content for business intelligence advantage will be a nonstarter. Metadata is making progress in enabling intelligent information integration, but there is still a long way to go to render semantics sufficiently transparent to scale up to hundreds of systems.

No forecast is complete without commenting on the next high concept, the grid. The grid will make plodding progress over the course of several years in distributed industries with islands of automation. One of those industries is health care. Why health care? Because it requires a highly distributed, heterogeneous architecture and presents a compelling business scenario. The emergence of a health care computing grid in the U.S. is a possibility over the next three to five years. A looming catalyst is the formulation, by major employers such as the federal government, the companies comprising the Technology CEO Council and other major stakeholders, of an employee medical record as a single version of the truth about individual physical well-being. The next requirement will be to put this on a virtual private health care network - a grid, if you will - that enables shared computing resources and communications to reduce medical errors, duplicate clinical testing, inconsistent diagnoses and redundant storage of the same information.

Such a health care provider computing grid puts this discussion right back where it started: using commodity components to flatten the inefficiencies in the information supply chain between participants in the digital economy. Suffice it to say that grid computing is different from linking clusters of servers, though that is part of it, and relies on still-emerging standards to manage platform and computing heterogeneity along with advanced scheduling, workload management, security and fault tolerance. Much more work than can be accomplished in this short article will be needed before this computing grand challenge is engaged and tamed.

For those still sufficiently challenged by mundane, large-scale commercial computing, the top issue will be to use data warehousing systems to optimize operational (transactional) ones. You know the forecast; now source it. You know the top customer issues; now craft a promotion and communicate it in a timely way to take advantage of the narrow window for action. Top companies are doing this today - but not many. Everyone is working hard, but most are still not working smart - not taking advantage of the breakthroughs in processing power and software to design businesses and business processes that are as agile and responsive as the demands coming at them.

Innovations in business processes as well as data warehousing will enable enterprises to connect the dots between the two realms. On the business side, sales and marketing will connect the dots between the business question "Which customers are leaving, and why?" and the BI available from the data warehouse. Finance will connect the dots between the question "Which clients, products and categories are the profit winners, and which are the profit losers?" and the consistent, unified view of customer and product master data in the warehouse. Operations will connect the dots between questions about supplier and procurement efficiency, stock outages, capital risks and reserves, and dynamic pricing on the one hand and the aggregations of transactional data in the warehouse on the other. The result will be an even smarter enterprise working smarter.
