The author would like to thank R. H. Terdeman for assisting with this month's column.

The corporate information factory (CIF) is the framework that describes modern corporate information systems. The center of the CIF is the data warehouse with its granular, integrated historical data. From the CIF flow many different architectural components such as the data marts, exploration warehouse, operational data store (ODS) and others. Figure 1 depicts the classical CIF.

The CIF is a convenient way to get one's arms around the whole framework of information processing. Indeed, one of the primary benefits of the CIF is the ability to see the larger picture: to look at corporate information processing holistically.

One holistic perspective of the CIF is the flow of data through it. More to the point, one can look at the speed with which data moves from the point where it enters an application to the point where it is actually used for analysis in an exploration warehouse or data mart. This speed can be referred to as the velocity of the flow of data through the CIF, and it is one of the factors that affects the usefulness of the data. The greater the velocity, the more useful the data; conversely, if the velocity is too slow, the data can become almost useless.

Figure 1: The Corporate Information Factory

As an example of why velocity matters, consider a CIF in which it takes several weeks for data to flow from the originating applications to the analytical components. In that amount of time, much could change in terms of business conditions: the stock market could fall, the competition could introduce a new product, a key executive could resign, a merger could be announced, and so forth. Because of the very slow speed of the flow of data, the organization is ill-equipped to respond.

Now consider an organization where the velocity of data through the CIF is much faster, for instance, an hour or two. (A velocity of an hour or two to pass through all of the different components of the CIF is a terrific speed.) When an organization can pass data through the CIF at this rate, its analytical and strategic functions are able to be very responsive to business changes. The corporation can be proactive, rather than reactive.

The velocity of flow through the CIF needs to be measured in two ways. The two ways can be referred to as t1-t2 and t2-t3. T1-t2 refers to the flow of data from the originating source system to the data warehouse. T2-t3 refers to the flow of data from the data warehouse to the outlying analytical component (see Figure 2).

Figure 2: The Velocity of the Movement of Data Across the CIF

T1-t2 is a push velocity. In other words, data is pushed through the system to the data warehouse as fast as it can be pushed. The data flows from the different sources into the data warehouse. T2-t3 is a pull velocity. T2-t3 indicates that data is pulled from the data warehouse to the outlying analytical component as fast as it is needed.

There are fundamental differences between a push velocity and a pull velocity. As a rule, a pull velocity can be much slower than a push velocity. For example, suppose an organization is doing exploration processing. A unit of data may arrive in the data warehouse at moment n, but that unit of data may not be pulled into the exploration warehouse until moment m. There may be a long period of time between n and m. In fact, a unit of data may arrive in the data warehouse and never be pulled into the exploration warehouse. The same phenomenon is observable for all the components outside the data warehouse: the data marts, the decision support system (DSS) applications and so forth. These analytical components pull data only on an as-needed basis. Therefore, a measurement of the speed of t2-t3 must be treated with caution. A good velocity measurement of t2-t3 is how quickly the data can be made available to the different analytical components, not how quickly the data is actually pulled. Measuring actual pulls can produce very misleading results.
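The distinction between the two t2-t3 measures can be sketched with a few timestamps. All names and times below are illustrative, not taken from the article:

```python
from datetime import datetime, timedelta

# Hypothetical event times for one unit of data in the warehouse.
arrived_in_warehouse = datetime(2001, 5, 1, 6, 0)   # moment n: data lands in the warehouse
ready_for_pull = datetime(2001, 5, 1, 6, 30)        # data exposed to marts/exploration
actually_pulled = datetime(2001, 5, 3, 14, 0)       # moment m: an analyst finally pulls it

# A sound t2-t3 measure: how quickly the data *can* be available.
availability_latency = ready_for_pull - arrived_in_warehouse

# A misleading t2-t3 measure: how long until the data was actually pulled,
# which reflects analyst demand rather than CIF performance.
pull_latency = actually_pulled - arrived_in_warehouse

print(availability_latency)  # 0:30:00
print(pull_latency)          # 2 days, 8:00:00
```

The half-hour availability latency says something about the CIF; the two-day pull latency says only that no one happened to need the data sooner.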

Measuring t1-t2 is a different matter altogether. Since t1-t2 is a push velocity, it is fair to measure t1-t2 on the basis of actual speed.

The midpoint for t1-t2 and t2-t3 is the data warehouse where the detailed, integrated, granular, historical data resides. The data warehouse is the essence of reusable data. The foundation of data found at the data warehouse serves as a basis for data mart processing, DSS application processing, exploration and data mining, ODS processing and many other things. The data in the data warehouse is looked at in a variety of ways in order to meet the different needs of different organizations. Because the data is granular and time-stamped, it still provides a basis for reconcilability.

Is the source of data for t1-t2 the same? Not at all. There normally are many sources of data that lead into the data warehouse. Each source of data will have its own unique t1-t2. In order to calculate the CIF velocity of t1-t2, all sources, each with its own unique t1-t2, must be accounted for. In other words, the corporate t1-t2 is a function of the many underlying t1-t2's. In the same manner, t2-t3 for the CIF is a function of the many underlying t2-t3's that constitute the CIF.
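The article does not prescribe which function combines the underlying t1-t2's into a corporate t1-t2. A minimal sketch, assuming hypothetical source names and latencies, and choosing worst-case latency as one plausible aggregate:

```python
from datetime import timedelta

# Hypothetical per-source t1-t2 latencies: time from capture in each
# source application to arrival in the data warehouse.
source_t1_t2 = {
    "orders_app": timedelta(hours=2),
    "billing_app": timedelta(hours=6),
    "crm_app": timedelta(days=1),
}

def corporate_t1_t2(latencies):
    # One plausible aggregate: the corporation's warehouse is only as
    # current as its slowest feed, so take the worst case across sources.
    return max(latencies.values())

print(corporate_t1_t2(source_t1_t2))  # 1 day, 0:00:00
```

An average weighted by data volume would be another defensible choice; the point is only that the corporate figure must account for every feed, not just the fastest one.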

One interesting perspective is the velocity of data through the CIF as a tradeoff against other processing. For example, if velocity were all that mattered, then velocity could be reduced to a very quick time indeed. However, a fair amount of processing occurs as data passes through the CIF. As data passes from the operational environment to the data warehouse, integration and data cleansing occur. If all that mattered was the velocity of data flow, then integration and data cleansing could be eliminated; however, integration and data cleansing are an essential part of the CIF process. Therefore, the velocity of data flow is impeded by the need for integration and data cleansing. By the same token, as data passes from the warehouse to the analytical components, a fundamental restructuring of data takes place. Data is selected, added, aggregated, resequenced, merged and so forth as it passes out of the data warehouse. The speed of flow can be enhanced by eliminating or minimizing the work done after leaving the data warehouse, but doing so only weakens the value of the CIF. Therefore, there is a trade-off between the velocity of data flow and the work that is done as data passes from one component of the CIF to another.

Another interesting aspect of the velocity of the flow of data through the CIF is the fact that data passes through some parts of the CIF and remains resident in other parts. For example, data remains resident in the data warehouse for a long time, but data merely passes through the ETL component of the CIF.

Some portions of the CIF are already sensitive to the velocity of data flow. For example, from the beginning the ODS has been classified according to the speed of the flow of data. There are four ODS classes. A class I ODS (a rarity) is one in which the flow of data into the ODS from the operational application source is almost immediate. In class II and class III ODSs, the flow is more relaxed (the much more normal case). A class IV ODS is fed from the data warehouse, and the rate of flow is very relaxed. The notion of velocity has thus been a part of the CIF for a while now, but has only recently been applied to the CIF as a whole.
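The four-class scheme can be encoded as a simple enumeration. The descriptions below paraphrase the classes as characterized above; any numeric refresh thresholds would be implementation-specific and are deliberately omitted:

```python
from enum import Enum

class ODSClass(Enum):
    # ODS classes distinguished by the speed of data flow into the ODS.
    CLASS_I = "near-immediate feed from operational applications (rare)"
    CLASS_II = "more relaxed feed from operational sources (common)"
    CLASS_III = "still more relaxed feed from operational sources (common)"
    CLASS_IV = "fed from the data warehouse at a very relaxed rate"

for ods_class in ODSClass:
    print(ods_class.name, "-", ods_class.value)
```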
