Figure 1: Bill Inmon's Corporate Information Factory
Figure 1 makes it easy to see how data volumes grow while demands on the data increase as well. For example, this occurs when the data is processed and put into another form or database: an online data store, where an efficient delta processing capability is important, or additional business intelligence data marts, where aggregating the warehouse data into summary tables can easily take hours upon hours. Managing all of this data movement, and the time it takes to process all of the information, creates a new set of IT challenges.
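To make the delta processing point concrete, here is a minimal sketch of the idea: compare a prior snapshot of the data with the current one and pass only the changes downstream, instead of reprocessing everything. The record layout and key names are hypothetical, chosen only for illustration.

```python
# Minimal delta-processing sketch: compare yesterday's snapshot with
# today's and emit only the inserts, updates, and deletes, so downstream
# stores process changes rather than the full volume.

def compute_delta(old, new):
    """Return (inserts, updates, deletes) between two keyed snapshots."""
    inserts = {k: v for k, v in new.items() if k not in old}
    updates = {k: v for k, v in new.items() if k in old and old[k] != v}
    deletes = [k for k in old if k not in new]
    return inserts, updates, deletes

# Hypothetical account snapshots keyed by customer id.
yesterday = {"c0": "balance=10", "c1": "balance=100", "c2": "balance=250"}
today     = {"c1": "balance=100", "c2": "balance=300", "c3": "balance=50"}

inserts, updates, deletes = compute_delta(yesterday, today)
print(inserts)  # {'c3': 'balance=50'}
print(updates)  # {'c2': 'balance=300'}
print(deletes)  # ['c0']
```

A real implementation would stream sorted snapshots from disk rather than hold them in dictionaries, but the principle - process the delta, not the whole volume - is the same.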
Increasing Demands on IT Departments
Due to the growing data volumes and business need for more information and analysis, there has also been an increase in the number of applications being developed and supported in most organizations. This has placed significant demands on IT professionals since there hasn't always been a corresponding increase in staff. Companies are also increasingly focused on controlling costs. Management wants to solve the challenges caused by data volumes and demand for real-time information with the minimum possible expenditure of money and resources.
To find a solution, it is first important to understand where the volumes of data come from and why they are increasing. Here are some examples of areas that generate large volumes of data:
- Banking, insurance, financial transactions - checks, ATMs, credit cards
- Consumer buying behavior - supermarket check-out scanner data
- Healthcare - pharmaceutical records
- Communications - call detail records
- Internet and e-commerce - Web logs, clickstream data
How have these data volumes grown? From ATM withdrawals to deposited checks, consider how many banking transactions are conducted in a month. Or think about how many items are purchased at the supermarket using a bonus card. Then look at the number of visitors who browse the company Web site in a given day. In each case there is ever more data to aggregate, filter, reformat and analyze. And as the value of the information increases, it also becomes necessary to utilize more and more historical data.
If a business inherently involves large data volumes, whether the data consists of call detail records or Web clicks, there is the potential for performance bottlenecks in the CIF. Below is a list of some of the places where bottlenecks most frequently occur today:
- The impact of large data volumes on I/O time and network resources. Handling this much data efficiently requires optimized file and database access methods.
- The aggregation of data to produce summaries at user query time, which typically cannot deliver the response times users expect. Because of this, many data marts are designed to precalculate and pre-store the summaries, which can take huge amounts of processing and data movement. Optimized aggregation algorithms are needed to enhance performance (see the pre-aggregation sketch after this list).
- The amount of raw data that comes from Web servers. This data is typically very verbose; for example, URLs and CGI parameter strings contain large amounts of text, with only a few characters being relevant to a specific application. Web data parsers can consume huge amounts of CPU processing, so optimized pattern matching is needed (see the parsing sketch after this list).
- Databases that do not have highly optimized access methods for extracting data. Understandably, database vendors are more motivated to get data in than out, but organizational data flow is rarely that simple: there are usually numerous dissimilar databases, and the data needs to be moved among them. Optimized database access methods are needed to extract data at high speed.
- The amount of data involved in database loads. These loads include index creation, which requires sorting. Given the volumes in typical loads, the fastest possible sorting algorithms are necessary (see the external-sort sketch after this list).
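On the aggregation point, the following is a minimal sketch of pre-aggregation: roll the detail rows up into a summary table once, at load time, so user queries read the small summary instead of scanning the detail. The field names (store, day, amount) are illustrative, not taken from any particular schema.

```python
from collections import defaultdict

# Pre-aggregation sketch: one pass over detailed fact rows builds a
# summary table that user queries can read directly.
detail = [
    ("store_1", "2004-06-01", 19.95),
    ("store_1", "2004-06-01", 5.00),
    ("store_2", "2004-06-01", 12.50),
    ("store_1", "2004-06-02", 7.25),
]

summary = defaultdict(float)
for store, day, amount in detail:
    summary[(store, day)] += amount  # accumulate sales per (store, day)

# The pre-stored summary now answers "sales by store and day" without
# touching the detail rows.
for (store, day), total in sorted(summary.items()):
    print(store, day, round(total, 2))
```

At warehouse scale the same roll-up would be done with sort- or hash-based aggregation over billions of rows, which is why the efficiency of the aggregation algorithm matters so much.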
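On the Web data point, here is a sketch of extracting only the relevant fields from a verbose log line. The common-log-style format is an assumption for illustration; the key optimization, compiling the pattern once rather than per line, is what keeps CPU cost manageable at clickstream volumes.

```python
import re

# Compile the pattern once; matching millions of lines against a
# precompiled pattern is far cheaper than recompiling per line.
# The log format below is assumed for illustration.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3})'
)

line = ('10.0.0.7 - - [01/Jun/2004:10:02:11 +0000] '
        '"GET /cart?item=42&sess=abc HTTP/1.1" 200')

m = LOG_LINE.match(line)
if m:
    # Only a few characters of the verbose line are relevant downstream.
    print(m.group("ip"), m.group("url"), m.group("status"))
```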
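And on the sorting point, here is a toy external merge sort, the standard technique when the keys to be sorted exceed memory, as they do during large loads and index builds. The chunk size is artificially small so that both phases are visible.

```python
import heapq
import tempfile

# External merge sort sketch: sort memory-sized runs, spill each to disk,
# then stream a k-way merge over the sorted runs.
def external_sort(keys, chunk_size=4):
    run_files = []
    # Phase 1: sort fixed-size runs in memory and spill them to disk.
    for i in range(0, len(keys), chunk_size):
        run = sorted(keys[i:i + chunk_size])
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(k + "\n" for k in run)
        f.seek(0)
        run_files.append(f)
    # Phase 2: k-way merge reads all runs back in sorted order.
    merged = [line.rstrip("\n") for line in heapq.merge(*run_files)]
    for f in run_files:
        f.close()
    return merged

print(external_sort(["k9", "k2", "k7", "k1", "k8", "k3", "k5"]))
```

Production sort utilities layer techniques such as replacement selection, multi-level merges and asynchronous I/O on top of this skeleton, which is where the large performance differences between implementations come from.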
Figure 2: Hardware Can't Solve Elapsed-Time Problems
IT departments continually encounter systems that were not designed for performance. Frequently this is because the system was implemented at an early stage of the business, when there wasn't much data volume. Another problem is that the designers assumed volume growth could be handled simply by upgrading the system or adding additional systems, or that hardware capacity would just continue to increase the way it has for the last 20 years. There are several reasons why this typically doesn't work.
- Runtimes are almost never linear with data volume growth - when the amount of data is doubled, the processing time more than doubles.
- Hardware increases are non-linear in the other direction (doubling the number of CPUs doesn't double the throughput). The combination of these two effects means that hardware upgrades don't deliver nearly as much as expected (see the back-of-the-envelope sketch after this list).
- Adding processors only improves performance if the application was well parallelized to start with.
- Increases in hardware capacity (faster CPUs, bigger disks, faster networks, grids) allow new applications to be automated. These new applications typically generate higher volumes of data than the old ones, and that data then needs to be processed into usable information.
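A back-of-the-envelope sketch makes the first two bullets concrete. The n log n cost model for sorting follows from the load discussion above; the 10% serial fraction in the parallelism example is an assumed figure, used only to show the shape of the curve (this is Amdahl's law, not a measurement of any particular system).

```python
import math

# 1. Sort-dominated runtimes grow as n*log(n), so doubling the data
#    more than doubles the work.
def work(n):
    return n * math.log2(n)

n = 1_000_000_000
print(work(2 * n) / work(n))  # ~2.07: twice the data, 2.07x the work

# 2. Amdahl's law with an assumed 10% serial fraction: doubling CPUs
#    from 8 to 16 falls well short of doubling throughput.
def speedup(cpus, serial=0.10):
    return 1 / (serial + (1 - serial) / cpus)

print(speedup(16) / speedup(8))  # ~1.36x throughput from 2x the CPUs
```

Stack the two effects and the shortfall compounds: twice the data needs about 2.07x the work, while twice the hardware supplies only about 1.36x the throughput.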