Is bigger really better? That's the question many business users and IT executives have asked themselves before embarking on a data warehousing journey. After hearing horror stories of failed projects building massive warehouses, many smaller and mid-size companies have abandoned building data warehouses. Until now! With the unveiling of SQL Server, Plato and the accompanying data warehousing tool kit, Microsoft will truly ring in the era of distributed and affordable data warehousing. Distributed data warehousing is a term that is almost taboo for many warehousing pundits. For many years now, the gurus have been preaching the concept of a large enterprise data warehouse encompassing many subject areas. In theory, this concept is sound, but often not practically viable. Most companies are not Wal-Mart or Sears and have neither the resources nor the funding to build a large enterprise data warehouse. In distributed data warehousing, smaller data stores from individual subject areas would be built separately and joined physically over a network. Logically, the individual data stores would encompass a complete enterprise data warehouse but would be physically separate entities. Users would logically see one central data warehouse, but the physical implementation of that warehouse may be in several different locations.

Different Concept

This concept differs from the current data marts in two significant ways. First, today's data marts are mostly stove-piped ­ they are not built to be integrated with other data marts. In a distributed data warehouse, the focus is on eventually scaling up to include many subject areas, so much attention is paid to the initial design. The initial subject area implementation would be based on a data store development standard that would need to be defined before the initial implementation. This standard would then be the basis for the development of other data stores. In this way, the individual stores would be designed to work together as one warehouse. Secondly, today's data marts have no mechanism to communicate with each other. For example, if a user in London who had a departmental data mart needed to see data from the New York data mart, there may be no way for the data marts to communicate with each other. In a distributed data warehouse environment, the user would need no knowledge of the locations of the individual stores. The warehouse manager would automatically route the user request to the appropriate location. The individual data stores strung together are a logical data warehouse.

Advantages of Distributed Data Warehousing

The concept of distributed data warehousing affords several advantages to companies. First and foremost, it is a cost-efficient solution. By initially investing in a local NT server, a department could build a small individual data store that could be later connected to other data stores that form the "larger" distributed data warehouse. For example, a 20-person analyst user group may initially decide to build one data store; and after some growth, they might decide to build another. Instead of undertaking the painstaking task of integrating the two into one central store, the two could virtually combine using networks and cutting-edge software to form the distributed data warehouse. Secondly, the solution is scalable to support gigabytes, terabytes and more in the future. The distributed data warehouse has no size limitations. As the need for additional data stores arise, only additional servers need to be added into the distributed environment for each of those stores. This eliminates the entire discussion of "Will my server be able to support x terabytes?" Additionally, smaller data stores have less inherent cleansing/integrity issues, thereby increasing/facilitating more accurate data analysis and business forecasting.

From a business standpoint, the distributed warehouse is also very advantageous. By starting with one small data store, the IT side is able to deliver to its business users quickly, thereby lowering project costs. This will accelerate the ROI (return on investment) and allow the IT side to gain credibility and buy-in for further warehousing projects. The initial hardware costs are also much less costly than investing in the long-term systems required to support a large warehouse. In building a single server data warehouse, many companies have historically needed to invest in an expensive UNIX machine to support growth. In the distributed environment, an NT machine will support the initial business needs, and additional servers can be purchased (if necessary) as the warehouse expands. This way, both IT and the business side will be less hesitant to pursue initial projects as the cost of failure is much lower than in a larger warehousing effort.

New Data Warehousing Strategies

Until now, it would have been very difficult to build a distributed data warehouse. Most products do not support the integration of individual data stores over the network. Most products are solely focused on the concept of either building an enterprise data warehouse or individual data marts. This will change with the introduction of the Microsoft data warehousing strategy that is predicated on the concept of supporting distributed data warehousing.

The Microsoft data warehousing solution is focused on an NT-based solution involving SQL Server 7.0 and Plato, Microsoft's OLAP Server. NT-based networks provide an ideal distributed environment for building a distributed warehouse. The first data store can be set up in a single NT server domain utilizing Plato and SQL Server. As additional data stores are added in the same or different domains, they can be networked together to form the distributed warehouse. Microsoft provides the plumbing for this solution. As part of Plato, Microsoft allows administrators to create "virtual" OLAP servers that link data stores in more than one server. For example, an administrator in New York could set up a virtual OLAP server that includes servers from New York, London and Hong Kong. Thus, when the user is "slicing and dicing" data from London, the request is being made against the London server. The locations of the server are transparent to end users, and users communicate with Plato as they would against any single server. The individual data stores could also be managed centrally from a single NT server. The Plato product is especially nice because it also features HOLAP (Hybrid OLAP) technology, which is neither MOLAP nor ROLAP, but rather whichever one the corporation decides to use. Instead of continuing the great MOLAP versus ROLAP debate, Microsoft chooses to avoid it and supports both. NT also provides a great mechanism for warehouse security by integrating the Plato and SQL Server security with NT server security, making the administration of security features very easy. Above all, deploying a data warehouse with SQL Server, NT and Plato will undoubtedly be cheaper than any other solution currently on the market, as purchasing each of the tools for extraction, transformation, meta data and presentation is very costly. Microsoft plans to bundle all these tools in one affordable package. Distributed data warehousing is something that will take the market by storm, and Microsoft has built a product that will support this concept extremely well.

Distributed data warehousing using Microsoft products will open the data warehousing process to everyone. Until now, data warehousing has been an exclusive game played by a select few. The low-cost, low-risk option of distributed data warehousing is perfect for first-time builders. For a small company that wants to dabble in data warehousing, an NT-based solution with Plato is great. It allows for scalability, a high level of integration and centralized management. Most users also have a great deal of comfort using Microsoft products, and Excel will provide an excellent front end for Plato. Microsoft has positioned itself well in the data warehousing market and will now mark the beginning of an era in distributed data warehousing.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access