The Web[1] has rapidly become an effective medium for people and businesses to communicate and collaborate. Nearly every area of information technology has been affected by its pervasiveness, and data warehouses are no different: a large (and rapidly growing) percentage of them are being connected to the Web. By increasing access to your warehouse, you can increase the average knowledge level of your organization. Of course, even before the Web you could increase the leverage of your warehouse by giving more people access to it, but it was never easy. You had to struggle with proprietary networks, proprietary client/server protocols and special client-side applications. The Web (and Web-based technologies) makes it far easier. You don't have to worry about installing additional client software (everyone just uses a Web browser) or distributing application updates to all users (the application logic is stored centrally on the server, not in the browser). And, since the Web is ubiquitous, you don't have to worry about connectivity issues. The infrastructure is already in place to let you access your warehouse from anywhere in the world.
But, universal access via the Web creates a whole set of issues that must be handled. Distilled to its essence, Web access means your warehouse will be exposed to more access by more users to more data. These increases will, in turn, put more strain on your warehouse. Whatever level of scalability requirements your warehouse had will become magnified once you connect your warehouse to the Web.
Web Access Means More Users
Let's look at the "more users" issue. One of the most powerful reasons for connecting your warehouse to the Web is to make it more accessible. Logically, then, it follows that if you increase accessibility, you will have more users than you would otherwise have. And, not only will you have more users, but each user will typically access the warehouse more frequently. Because Web browsers are ubiquitous, and because most people keep them running on their computers all the time, it becomes easier for users to use the warehouse more frequently. That is, there is an increased inclination to access the warehouse, simply because it is so easy to do so.
But it's not just the increased size of your user population that stresses the system. It's also the rate at which the population grows to that size. This problem is most noticeable if a portion of your warehouse is "public": for example, it's available to all employees via a corporate intranet, or over the Internet to your suppliers or customers. Once you decide to make the data "public," you can no longer easily control the ramp-up in the number of users. At first, a few users will experiment with the warehouse, then others will see the benefit and want to use it as well, and so on. Very quickly, you will find yourself with a large user population that is growing exponentially.
Hypergrowth and the Web
I call this phenomenon "hypergrowth." It refers to the fact that your user base will grow faster than you can scale up your warehouse's resources to meet the growing requirements. You simply can't add new CPUs, disk drives and memory (and test them all to make sure you have no bottlenecks) fast enough to keep up with demand.
There are two approaches to dealing with hypergrowth. The first is to avoid the problem in the first place. Most warehouse developers become over-zealous about making the warehouse instantly available to the "public." In most cases, I would suggest proceeding with caution. If you're not sure of the usage patterns, I would recommend against making the warehouse public initially. Instead, use the traditional approach of rolling it out to a few users, then a few more, and so on. This can be done by password-protecting access and giving the password only to select groups. (Of course, people can share the password with people outside their group, but you can monitor usage to see if unintended users are accessing your warehouse.)
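As a rough illustration of the usage monitoring mentioned above, the following sketch flags users who appear in an access log but are not in the groups you rolled out to. The log format (a list of user/query pairs), the user names and the function name are all assumptions made for this example; a real warehouse or Web server would expose its access records in its own way.

```python
from collections import Counter

def unintended_users(access_log, approved_users):
    """Count requests per user for users outside the approved rollout groups.

    access_log is assumed to be an iterable of (user, query) pairs;
    this shape is hypothetical, not a real log format.
    """
    counts = Counter(
        user for user, _query in access_log if user not in approved_users
    )
    return dict(counts)

# Hypothetical data: two approved pilot users and one outsider.
approved = {"alice", "bob"}
log = [
    ("alice", "sales by region"),
    ("carol", "claims by month"),
    ("bob", "sales by product"),
    ("carol", "claims by adjuster"),
]

print(unintended_users(log, approved))
```

A report like this only tells you that the rollout has leaked beyond the intended groups; what to do about it (expand capacity, or tighten access) is still a judgment call.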
However, sometimes there is a valid reason to make your warehouse accessible to your entire organization. If you can't avoid hypergrowth, you need a second approach that lets you handle it. The key is to note that hypergrowth only occurs during the initial stages of your warehouse's life cycle. The trick is to build the initial iteration of your warehouse so that it has enough resources to handle where you will be when the hypergrowth subsides. To do this, you have to determine where you think that point will be by looking at the growth plan for the warehouse. How many users do you expect, what workload will they generate and in what time frame?
After looking at the growth plans for your warehouse (or at least making your best guesses), you can define a graph that looks something like Figure 1. According to this graph, we can see that we expect the hypergrowth phase to level off sometime in May, supporting roughly 350 users. So, even though we plan to start with only about 50 users, since we expect to grow extremely rapidly to 350 users, we build our first iteration to support 350 users rather than 50.
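The sizing rule can be sketched in a few lines of code. This is only an illustration under stated assumptions: the starting point of 50 users and the leveling off at roughly 350 users in May come from the example above, while the intermediate monthly figures, the 5 percent growth threshold and the function name are hypothetical.

```python
from typing import List, Tuple

def plateau_point(plan: List[Tuple[str, int]],
                  max_growth: float = 0.05) -> Tuple[str, int]:
    """Return the first (month, users) entry after which month-over-month
    growth stays at or below max_growth for the rest of the plan."""
    for i in range(len(plan) - 1):
        tail_is_flat = all(
            (plan[j + 1][1] - plan[j][1]) / plan[j][1] <= max_growth
            for j in range(i, len(plan) - 1)
        )
        if tail_is_flat:
            return plan[i]
    # Growth never levels off within the plan: size for the last point.
    return plan[-1]

# Hypothetical growth plan; only 50, 350 and May come from the article.
growth_plan = [
    ("Jan", 50), ("Feb", 110), ("Mar", 220), ("Apr", 310),
    ("May", 350), ("Jun", 355), ("Jul", 360),
]

month, users = plateau_point(growth_plan)
print(f"Size the first iteration for about {users} users (plateau in {month})")
```

The point of the exercise is the last line: the first iteration is sized for the plateau, not for the starting population.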
Web Access Means More Data
Next, let's look at why Web-enabled data warehouses imply more data. The answer is intuitive: the graphical nature of the Web makes it natural and simple to request multimedia data. For example, an insurance company may choose to store not only the traditional numeric and text data about car accident insurance claims, but also a digitized photograph of the car itself. End users could use the numeric and text data to perform their analytical processing, and then drill down on specific data items to get both the traditional data on a particular record and the related image. With the potential need to store large numbers of images, the trend toward larger and more rapidly growing data warehouses will only accelerate. In addition, not only will your warehouse be responsible for storing more data, but requests for multimedia data also require much more bandwidth than requests for traditional data types.
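To get a feel for the bandwidth effect, here is a back-of-the-envelope calculation. Every number in it is an assumption chosen for illustration (record size, image size, number of users and drill-downs per hour); substitute your own measurements.

```python
def hourly_transfer_bytes(users: int, requests_per_user: int,
                          bytes_per_response: int) -> int:
    """Total bytes sent per hour if every user makes the same requests."""
    return users * requests_per_user * bytes_per_response

# All figures below are hypothetical.
TEXT_RECORD_BYTES = 2_000      # numeric and text data for one claim
IMAGE_BYTES = 200_000          # one digitized photograph
USERS = 350
DRILLDOWNS_PER_HOUR = 10

text_only = hourly_transfer_bytes(USERS, DRILLDOWNS_PER_HOUR,
                                  TEXT_RECORD_BYTES)
with_image = hourly_transfer_bytes(USERS, DRILLDOWNS_PER_HOUR,
                                   TEXT_RECORD_BYTES + IMAGE_BYTES)

print(f"text only:  {text_only:,} bytes/hour")
print(f"with image: {with_image:,} bytes/hour")
print(f"ratio:      {with_image / text_only:.0f}x")
```

Under these assumed sizes, attaching one image to each drill-down multiplies the traffic by about a hundred, which is why multimedia requests deserve their own line in the capacity plan.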
Ultimately, what does more access by more users to more data really mean? It means that the requirement for a scalable warehouse environment is increased. Addressing these issues requires scalable design principles, such as those I've historically discussed in this column. Just remember that the requirement to use these scalable techniques will be even more critical if your warehouse is connected to the Web.
[1] For simplicity, we will use the term "Web" to refer to both the World Wide Web and to intranets that use Web-based technology.