It is a bold statement of the obvious to say that data warehouses are data-centric. This is different from the way that order entry systems are process-centric, payment card validation is real-time sensitive or airplane avionics is computationally intense. Yet, as data warehouses are incorporated into leading-edge business intelligence (BI) applications, they take on characteristics of many of the most demanding system profiles and service level agreements. The data-centric nature of data warehousing is on a collision course with these other requirements unless a way of reconciling and rationalizing the competing trade-offs can be found.

The suitability of a service-oriented architecture (SOA) to transform data warehouses into information as a service addresses this challenge, though with important conditions and qualifications.1 For example, going forward, data access services should include the operations required by data warehousing, such as aggregation, multidimensional roll-ups and complex joins. Look for SOA-enabled master data management functions to support both transactional systems and data warehousing. A SOA provides a method of abstracting away the underlying location of the master data, hiding the complexity under the hood as long as the defined service contract is honored. However, for the foreseeable future, do not forget to check on the location of the data when large data volumes or complex joins are invoked. With SOA, the service, not the database, is the primary integration vehicle. It answers the business question at hand by accessing and integrating the underlying data stores on the fly. This can be an effective method for addressing ad hoc queries against modest volumes of data (as demonstrated by the class of tools referred to as enterprise information integration, or EII). Yet, in the real world of finite processing power, disk storage mechanical arm movement and network limitations, data placement remains a critical success factor. For large data volumes with complex, heterogeneous data, planners can safely assume that the database will remain a key data integration target.
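The location abstraction described above can be sketched in a few lines. This is a minimal illustration, not a real SOA stack: the service name, operation and in-memory "table" are all hypothetical, standing in for a defined service contract whose callers never see where the data lives.

```python
from abc import ABC, abstractmethod

class DataAccessService(ABC):
    """Hypothetical service contract: callers invoke operations such as
    aggregation without knowing which store answers them."""

    @abstractmethod
    def aggregate(self, measure: str, by: str) -> dict:
        ...

class WarehouseBackedService(DataAccessService):
    """One possible provider; a federated or EII-style provider could
    implement the same contract without the caller changing."""

    def __init__(self, rows):
        self._rows = rows  # stand-in for a warehouse fact table

    def aggregate(self, measure, by):
        totals = {}
        for row in self._rows:
            totals[row[by]] = totals.get(row[by], 0) + row[measure]
        return totals

# The caller holds only the abstract contract.
svc: DataAccessService = WarehouseBackedService([
    {"region": "east", "sales": 100},
    {"region": "west", "sales": 250},
    {"region": "east", "sales": 50},
])
print(svc.aggregate("sales", "region"))  # {'east': 150, 'west': 250}
```

The point of the sketch is that swapping the provider class changes where the data lives without touching the consumer, which is exactly the abstraction a SOA promises.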

First-generation data warehouses answered the question, "What customers are buying or using what product or service, and when and where are they doing so?" This enabled basic trend analysis: take two data points and a ruler, draw a straight line through them and extrapolate from the past to the future - in effect, predicting the past. The result was basic applications in market trend analysis and similar pattern detection. Second-generation data warehouses deployed advanced applications in customer profiling and lifetime value (LTV) calculation as well as demand planning and forecasting. Third-generation data warehouses are still being developed at leading-edge companies. A key characteristic of such a system is that it closes the loop back to the transactional systems from which the data initially was derived, with the goal of optimizing operational processing. Such systems are intolerant of latency and need to reduce processing delays to a minimum. They put a premium on real-time and near real-time results, and require careful design using components and the principles of loose coupling and tight cohesion. Winning use-case scenarios include forecasting inventory replenishment, detecting fraud based on current customer behavior and presenting purchase recommendations while the customer is still on the phone. In some cases, real time is too late. If a retailer is out of stock, all it can do is order and wait - adding up the lost sales. Predictive algorithms capable of generating alerts based on the three years of accumulated data in a demand plan are needed to turn such contingencies around and improve top-line revenue.
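The "two data points and a ruler" style of first-generation trend analysis is literally linear extrapolation. A minimal sketch, with invented figures used purely for illustration:

```python
def extrapolate(p1, p2, t):
    """First-generation trend analysis: fit a straight line through two
    (time, value) observations and project it forward to time t."""
    (t1, v1), (t2, v2) = p1, p2
    slope = (v2 - v1) / (t2 - t1)
    return v1 + slope * (t - t1)

# Hypothetical monthly sales: 100 units in month 1, 120 in month 2.
# "Predict" month 6 by assuming the past trend simply continues.
print(extrapolate((1, 100), (2, 120), 6))  # 200.0
```

This is precisely what the essay means by "predicting the past": the projection contains no information beyond the two historical points, which is why later generations moved to richer forecasting and predictive models.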

To date, such challenging third-generation data warehouses often had to be designed and implemented opportunistically using whatever technology was available. If a trickle-load extract, transform and load (ETL) tool was at hand, then that was what was employed, notwithstanding its obvious limitations. In the final analysis, many different solutions were cobbled together and did work. However, such workarounds tended to increase total cost of ownership and operating complexity. It is a tribute to the designers and implementers that the advantages conferred by such solutions were great enough to absorb the higher-than-average costs implied by proprietary tools and nonstandard approaches.

For example, at a large retailer, the near real-time inventory replenishment and optimization requirements of the data warehouse could not be accommodated by a legacy parallel data warehouse. In order to complete a high-performance memory-to-memory data transfer, the Web service and the database had to execute on the same database node. The legacy appliance did not support such a contingency. This was only one differentiator - but a key one - in the client's decision to migrate to a standard relational database on a generally available, shared-nothing UNIX technology stack. The point is that as data warehousing systems address the need to handle real-time and near real-time scenarios, SOA moves to center stage of third-generation data warehouses because of its message-based format and adherence to open standards.

Just as SOA presents new opportunities for getting things right with low-latency data warehousing, so data warehousing raises the bar on SOA. Most of the services provided by SOA have been relevant to transactional systems - open account, update customer, check inventory. The success of SOA has been in the area of transactional systems - running the business on a day-to-day basis. The business benefits of data warehousing are largely in getting above day-to-day operations: tracking market, customer and product trends, distinguishing winning brands from losers, or forecasting demand based on historical shipments or sales. SOA is still developing in key areas such as complex data transformation, master data management and the discovery of structural metadata. Data warehousing and the business intelligence applications it supports will require a whole new set of capabilities from SOA: services such as queries and reports on top customers in rank order by date, dollar volume of business and location; on-demand forecasts and reports by product ID; calculations and reports on customer LTV by individual customer ID, customer group and sales region; and calculation and presentation of purchase recommendations for individual customers.
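A "top customers in rank order by dollar volume" operation of the kind listed above reduces, underneath the service interface, to an aggregation-and-rank query. A minimal sketch using an in-memory SQLite table; the schema, table name and figures are invented for illustration, where a real service would run against warehouse fact tables:

```python
import sqlite3

# Illustrative schema and data only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("Acme", "2007-01-05", 1200.0),
    ("Bolt", "2007-01-07", 800.0),
    ("Acme", "2007-02-11", 300.0),
    ("Cask", "2007-02-12", 2500.0),
])

def top_customers(conn, limit=10):
    """One candidate service operation: customers ranked by total spend."""
    return conn.execute(
        "SELECT customer, SUM(amount) AS total "
        "FROM orders GROUP BY customer "
        "ORDER BY total DESC LIMIT ?", (limit,)
    ).fetchall()

print(top_customers(con, 2))  # [('Cask', 2500.0), ('Acme', 1500.0)]
```

Wrapping such queries as named, discoverable service operations - rather than leaving them as ad hoc SQL - is what turns a warehouse query into information as a service.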

As these services are abstracted, generalized, componentized, packaged and receive wider distribution as part of data warehousing packages and libraries, more enterprises will be able to design and implement third-generation data warehousing systems. In this context, SOA shows up as an improved means to an end. Infrastructure agility promotes bottom-line business results. Leading enterprises will focus on business process innovation within the enterprise - specifically, such processes as business workflow and optimization services. This will be enabled through business process execution, monitoring and reporting of key performance indicators (KPIs) based on corporate data, enterprise-wide event capture and correlation, and the predictive analytics built on them. SOA is a mechanism on the critical path to enable this closed-loop processing. Hence, as we move into an "information as a service" world, the traditional data warehouse is evolving into more than just a traditional structured repository. It is now an information service hub for publishing and subscribing to customer, product, employee, pricing and related master data. It does not just aggregate detailed transactional data - in general, a good thing - it also optimizes transactional processes by detecting patterns and providing advance warning of pending business events.
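The publish-and-subscribe role of the information service hub can be sketched in a few lines. This is a toy, single-process illustration; the hub class, topic name and record layout are all hypothetical stand-ins for enterprise messaging infrastructure:

```python
from collections import defaultdict

class InformationHub:
    """Minimal publish/subscribe sketch of the information service hub role:
    producers publish master data changes, consumers react to them."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, record):
        for callback in self._subscribers[topic]:
            callback(record)

hub = InformationHub()
alerts = []
# A downstream replenishment process subscribes to product master data;
# a low on-hand count becomes an advance warning rather than a stock-out.
hub.subscribe("product.master", lambda rec: alerts.append(rec))
hub.publish("product.master", {"sku": "1234", "on_hand": 3})
print(alerts)  # [{'sku': '1234', 'on_hand': 3}]
```

The design point is the decoupling: the publisher of the master data change neither knows nor cares which operational processes are listening, which is what lets the warehouse close the loop back to transactional systems.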

However, no matter how elegant, SOA still does not abolish distributed data. SOA sits at the opposite end of the spectrum from traditional data warehousing, especially where the latter is a large, centralized, persistent data store. SOA wants the underlying data to be transparent and location-independent. This implies the need for improvements in run-time metadata so that data sources can register with the system control center, get on the wire and be accessible as a service. This is also why the virtual data warehouse has remained an idea with only an occasional, clumsy implementation: in the race between growth in data volume and complexity on one side and computing power on the other, the decisive winner continues to be data living in I/O subsystems with all-too-limited bandwidth. An application programming interface (API) that hides a grid of data stores under the hood is one likely scenario. In many cases, however, the large-scale data aggregation that is the key step in business intelligence will be best supported by large centralized data stores rather than voluminous data transport. Thus, the data warehouse - as a persistent data store - will complement SOA as the place for data integration in those scenarios where data volume or complexity presents engineering challenges to real-world capabilities.
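An API that hides a grid of data stores might look like the following sketch. The class and the registered stores are invented for illustration; the `register` call stands in for the run-time metadata registration described above, and real federation would of course involve network access, query planning and the bandwidth limits the essay warns about:

```python
class FederatedCatalog:
    """Sketch of an API that hides several data stores behind one lookup.
    Registration stands in for run-time metadata: a source registers,
    gets on the wire and becomes reachable through the same call."""

    def __init__(self):
        self._stores = {}

    def register(self, name, lookup):
        # Each store contributes a callable that returns None on a miss.
        self._stores[name] = lookup

    def get(self, key):
        for lookup in self._stores.values():
            value = lookup(key)
            if value is not None:
                return value
        return None  # key not held by any registered store

catalog = FederatedCatalog()
catalog.register("crm", {"cust-1": "Alice"}.get)
catalog.register("erp", {"cust-2": "Bob"}.get)
print(catalog.get("cust-2"))  # Bob
```

The caller never learns which store answered, which is the location transparency SOA wants - and the probe-every-store loop is also a miniature of why large aggregations over such a federation become expensive once real volumes and real networks are involved.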


1. This essay is not intended to be a tutorial on SOA. As a working definition, in SOA "all functions, or services, are defined using a description language and where their interfaces are discoverable over a network. The interface is defined in a neutral manner that is independent of the hardware platform, the operating system, and the programming language in which the service is implemented. One of the most important advantages of a SOA is the ability to get away from an isolationist practice in software development, where each department builds its own system without any knowledge of what has already been done by others in the organization. This 'silo' approach leads to inefficient and costly situations where the same functionality is developed, deployed and maintained multiple times. A SOA is based on a service portfolio shared across the organization and it provides a way to efficiently reuse and integrate existing assets." For details go to
