The Corporate Information Factory (CIF) is a logical architecture whose purpose is to deliver business intelligence and business management capabilities driven by data provided from business operations. The CIF has proven to be a stable and enduring technical architecture for enterprises of any size that want to build strategic and tactical decision support systems (DSSs). The CIF consists of producers of data and consumers of information. Figure 1 shows all the components found within the Corporate Information Factory architecture.
The producers in the Corporate Information Factory capture the data (integration and transformation) from the operational systems and assemble it (data management) into a usable format (data warehouse or operational data store) for consumption by the business consumers. The CIF consumers acquire the information produced (data delivery), manipulate it (data marts) and assimilate it into their own environments (decision support interface or transaction interface).
Figure 1: The Corporate Information Factory Architecture
We use the simple model of separating these two fundamental processes into "getting data in" versus "getting information out." Figure 2 demonstrates the relationship between these two processes as well as the distinct components of the CIF involved in each. (Please see my September 1999 DM Review column, "Are You an Inny or an Outty," for more information on these processes.)
Figure 2: Getting Data In vs. Getting Information Out
In the following sections of this article, each of the components of the Corporate Information Factory will be defined in detail.
Producers: Getting Data In
Producers are the first link in the information food chain. They synthesize data into raw information and make it available for consumption across the enterprise. Each producer will be defined and explained in further detail.
Operational systems are the family of systems (operational, reporting, etc.) from which the Corporate Information Factory inherits its characteristics. These are the core systems that run the day-to-day business operations and are accessed through application program interfaces (APIs). The operational environment represents a major source of data for the CIF. Other sources may include external data and informal data such as contract notes, e-mails and spreadsheets.
The ability or inability to capture the appropriate data in operational systems sets the stage for the value of the Corporate Information Factory itself. The success or failure of the CIF depends heavily on these operational systems to supply the richness in data needed to understand the business and to provide the history needed to judge the health of the business.
Let's examine some of the problems surrounding the operational environment. Unfortunately these problems often find their way into the systems and processes in the CIF as well.
- The operational systems are usually built around the product they support. Corporations are moving their organizational focus from product to customer in an effort to differentiate their offerings and ultimately survive in the business environment. Operational systems have inherently focused on product and thus lack the ability to recognize or manage data about customers. As a result, they cannot answer such fundamental questions as:
- What products belong to a customer (for integrated billing)?
- What offerings are relevant to a particular customer and/or household (for smart targeting)?
The CIF must provide facilities to define how corporate data relates to a customer, rules for integration (data modeling) and the means to document these relationships/rules (meta data).
- Operational systems, by and large, were not designed with integration of data in mind. They were built to perform a specific function, regardless of what other operational systems contain for data. Again this is handed down to the Corporate Information Factory where it is somewhat rectified through the integration and transformation process.
- A related problem with operational systems concerns the weak linkages between the systems. At best, the systems pass a very limited amount of data between them, the bare minimum needed to satisfy the receiving system. The Corporate Information Factory uses massive amounts of data from all of the operational systems and must synthesize, clean and integrate the data before it is usable.
- Operational systems are not able to handle large amounts of history. By their nature, they should be responsible for only the most current state of affairs. That is what they were designed to do, and they do that quite well. However, there is much to be learned from our historical data. Therefore, the Corporate Information Factory must act as the historian for the entire enterprise.
Integration and Transformation
Integration and transformation consists of the processes to capture, integrate, transform, cleanse, reengineer and load source data into the data warehouse or operational data store (see Figure 3).
Figure 3: Integration and Transformation Process
Integration and transformation is one of the most important processes in the Corporate Information Factory producer role. It has the critical job of converting the chaos in the operational world to the ordered world of information. This process assimilates the data from the operational environment of heterogeneous technologies into the integrated, consistent world of the CIF, where it is suitable for consumption by the decision support processes (i.e., the consumers).
It is the responsibility of this process to prepare and load data into the data warehouse and/or operational data store. In doing so, you should consider the following:
- Where possible, leverage the horsepower of the mainframe to preprocess the operational data. If this preprocessing can be performed throughout the day rather than waiting for the batch window, it can greatly speed up the integration and loading process.
- An integration and transformation process that "pulls" data from the operational systems, rather than having those systems "push" data to it, provides better control.
- The greatest challenge for integration and transformation is when data is received from sources that have organized data around different keys. For example, one system manages data by a demographic code, another system manages data by account number and yet another manages data by invoice ID. In the end, all three sources need to be integrated to provide a complete view of a customer. This process can involve sophisticated matching (fuzzy logic) rules and name/address normalization and standardization to determine what data belongs to what customer.
- Before you begin the process of producing data to be loaded into the data warehouse, you should develop protocols for configuration management and the scheduling of these processes.
- The level of effort needed for integration and transformation is greatly affected by the level of understanding you have of the source data. The more familiar you are with the operational data and its creation, the easier the integration and transformation process will be.
- Once the integration and transformation piece is delivered, the good news is that it is relatively stable and predictable. There will always be changes occurring to these processes and programs simply due to the fluidity of the operational environment and the need for new information as the end users begin to explore new DSS possibilities. However, these should be handled by standard change request procedures.
- It is a wise producer that develops an audit strategy up front rather than after the integration and transformation process is in place. You must be able to confirm that the conversions, integrations, transformations, etc., are performing as expected and planned. (Please see my April through June 1999 DM Review columns, entitled "Measure Twice, Cut Once" for more information on the audit processes.)
- One of the functions this producer can perform is in data preparation for the consumers. It is reasonable and prudent to create summarizations, derivations and even start star schema tables that can then be used easily by data delivery. Otherwise, the burden of calculating, deriving and setting up dimensions falls on the hardy data delivery piece.
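The key-matching challenge described above can be illustrated with a minimal sketch. It assumes three source records keyed by account number, invoice ID and demographic code, and it stands in for "fuzzy logic" matching with a simple string-similarity ratio from Python's standard library. The abbreviation table, the 0.85 threshold and the record layout are all hypothetical; production matching rules are typically far more sophisticated.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table used for address standardization.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "suite": "ste"}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and standardize common address words."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def similarity(a: dict, b: dict) -> float:
    """Score two records on normalized name plus address (0.0 to 1.0)."""
    left = normalize(a["name"] + " " + a["address"])
    right = normalize(b["name"] + " " + b["address"])
    return SequenceMatcher(None, left, right).ratio()


def match_customers(records: list[dict], threshold: float = 0.85) -> list[list[dict]]:
    """Greedily group records that resemble the first record of an existing group."""
    groups: list[list[dict]] = []
    for record in records:
        for group in groups:
            if similarity(group[0], record) >= threshold:
                group.append(record)
                break
        else:
            groups.append([record])  # no group matched: start a new customer
    return groups


# Each source system identifies the same customer by a different key.
records = [
    {"key": "ACCT-1001", "name": "Jane Q. Smith", "address": "12 Main Street"},
    {"key": "INV-77", "name": "jane q smith", "address": "12 Main St."},
    {"key": "DEMO-B4", "name": "Robert Jones", "address": "9 Oak Avenue"},
]
groups = match_customers(records)
```

Here the account and invoice records normalize to the same string and fall into one customer group, while the third record starts a group of its own.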
The data warehouse (DW) is a subject-oriented, integrated, time-variant (temporal) and non-volatile collection of data used to support the strategic decision-making process for the enterprise or business intelligence.
The data warehouse acts as the central point of data integration, the first step toward turning data into information. It serves the following purposes:
- The data warehouse delivers a common view of enterprise data, regardless of how it may later be used by the consumers.
- Since it is the generic "foodstuff" for the consumers, it supports the flexibility in how the data is later interpreted (consumed). The data warehouse produces a stable source of historical information that is constant, consistent and reliable for any consumer.
- Because the enterprise as a whole has an enormous appetite for information, the data warehouse can grow to huge proportions (one to 20 terabytes or more!).
- The data warehouse is set up to serve many rather than a few in terms of consuming information. That is, many data marts can be created from the data contained in the data warehouse, rather than each data mart serving as its own producer and consumer.
Because of this central role for the data warehouse, there are several considerations that IS developers should remember:
- Usage of the data warehouse by the ultimate consumers (the business community) may be restricted. You may find that access to this producer should be limited to the data delivery process rather than opening it up to all consumers. This will allow you to maintain your focus on data loading and management.
- Because this data warehouse producer must focus its energy on holding the corporation's history and producing information to be consumed later, little or no transaction processing occurs within its database. These activities are far better suited for other producers such as the operational systems or the operational data store.
- Due to the lack of transactional processing and the large volume of data that these databases contain, you may want to limit the number of refreshes to a minimum (e.g., perhaps a weekly or even monthly refresh).
- Finally, due to the size of these constructs, they are generally found on relational and high performance technologies such as MPP or SMP platforms.
Operational Data Store
The operational data store (ODS) is a subject-oriented, integrated, current and volatile collection of data used to support the tactical decision-making process for the enterprise or business management.
Just as the data warehouse is the central point of integration for business intelligence, the operational data store becomes the central point of data integration for business management. It is a perfect complement to the strategic decision-making processes provided through the data warehouse/data mart constructs.
The operational data store has the following roles:
- It delivers the common view of enterprise data for operational processing. By being the point of integration for operational data, the operational data store produces the "foodstuffs" for the tactical decision-makers of the corporation.
- The operational data store supports the actions resulting from business intelligence activities by supplying the current, integrated enterprise-oriented data. The ability to act upon the result sets generated from data marts is critical in balancing the ecosystem to support "planning" and "action" activities of the business.
- The operational data store is relatively straightforward to deploy. However, deployment becomes increasingly difficult as the demands for currency of data grow.
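The difference between the two definitions (time-variant and non-volatile for the data warehouse, current and volatile for the operational data store) can be shown with a small sketch. The class names, the customer balance example and the in-memory storage are assumptions made purely for illustration; they are not part of the CIF itself.

```python
from datetime import date


class DataWarehouseTable:
    """Non-volatile and time-variant: every load appends a dated snapshot."""

    def __init__(self) -> None:
        self.rows: list[tuple[date, str, float]] = []

    def load(self, as_of: date, customer_id: str, balance: float) -> None:
        self.rows.append((as_of, customer_id, balance))  # never updated in place

    def history(self, customer_id: str) -> list[tuple[date, float]]:
        return [(d, b) for d, c, b in self.rows if c == customer_id]


class OperationalDataStore:
    """Volatile and current: each update overwrites the prior value."""

    def __init__(self) -> None:
        self.current: dict[str, float] = {}

    def update(self, customer_id: str, balance: float) -> None:
        self.current[customer_id] = balance


dw, ods = DataWarehouseTable(), OperationalDataStore()
for as_of, balance in [(date(1999, 1, 31), 100.0), (date(1999, 2, 28), 250.0)]:
    dw.load(as_of, "C1", balance)
    ods.update("C1", balance)
```

After both loads, the warehouse holds two dated rows for the customer, while the operational data store holds only the latest balance.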
Data management is responsible for the ongoing management of data within and across the data warehouse and operational data store. This includes archival/restoration, partitioning, movement of data between the DW and ODS, event triggering, aggregation of data, backups and recoveries, etc.
Data management can be thought of as an extension to the data warehouse database management system in that it:
- Is responsible for the application-level partitioning and segmentation of the data warehouse.
- Performs the data archival and retrieval functions from near-line storage media. This can be a particularly difficult problem as the archived data ages.
- Is responsible for disaster recovery as well as routine backups and recoveries.
- Monitors and measures the quality of the data in the data warehouse and operational data store.
- Creates standard summarizations and aggregations.
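As a rough sketch of two of these duties, application-level partitioning and archival, the following assumes one partition per calendar month and treats "archiving" as simply separating older partitions from the online set. The partition scheme, function names and row layout are hypothetical, and moving data to near-line storage is left out.

```python
from collections import defaultdict
from datetime import date


def partition_key(row_date: date) -> str:
    """Assumed scheme: one partition per calendar month, keyed as YYYY-MM."""
    return f"{row_date.year:04d}-{row_date.month:02d}"


def partition_rows(rows: list[dict]) -> dict[str, list[dict]]:
    """Assign each row to its monthly partition."""
    partitions: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        partitions[partition_key(row["date"])].append(row)
    return dict(partitions)


def archive_older_than(partitions: dict[str, list[dict]], cutoff: str):
    """Split partitions into (online, archived); zero-padded keys sort chronologically."""
    online = {k: v for k, v in partitions.items() if k >= cutoff}
    archived = {k: v for k, v in partitions.items() if k < cutoff}
    return online, archived


rows = [
    {"date": date(1998, 12, 15), "amount": 10.0},
    {"date": date(1999, 1, 5), "amount": 20.0},
    {"date": date(1999, 1, 20), "amount": 5.0},
]
partitions = partition_rows(rows)
online, archived = archive_older_than(partitions, cutoff="1999-01")
```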
Unfortunately, data management is a process that is usually not planned for at the beginning of most projects. However, soon after the data warehouse is up and running, data management quickly becomes a primary concern of the development team.
A secondary challenge is that the availability of tools in the marketplace is limited. Unfortunately, this forces corporations into the position of building these capabilities.
Consumers: Getting Information Out
Consumers gain their energy from the output of producers and manipulate it for their own purposes. In the CIF, these consumers constitute the decision support mechanisms for the corporation. The ultimate consumers in the Corporate Information Factory are members of the business community and have been classified as farmers, explorers, miners, operators and tourists. (Please see the five articles in the July/August 1999 DM Review for a discussion of the five business communities that use the CIF.)
Data delivery is a work group environment designed to allow end users (or their supporting IS group) to build and manage views of the data warehouse within their data mart.
Data delivery provides the mechanism for requesting, prioritizing and monitoring data mart creation and refinement. There are three steps in the process of creating the data mart:
- Filter: The information consumed by the data delivery process is obtained from the data warehouse. A filtering mechanism removes all information that is not needed by the data mart process.
- Format: The filtered information is then assimilated into a schema that is suitable for the secondary consumer (i.e., DSS). Usually this is in the form of a star schema or snowflake schema, a set of flat files or perhaps a normalized subset of data from the warehouse.
- Deliver: The last step in the process is to ensure that the correct information is delivered to the appropriate data mart technology in a timely manner, with the appropriate notifications to the ultimate consumers (the business community).
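The three steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the warehouse is a list of dictionaries, the filter is by a hypothetical region column, the target format is a tiny star schema with one product dimension, and the data mart is an in-memory dictionary.

```python
def filter_rows(warehouse_rows: list[dict], region: str) -> list[dict]:
    """Filter: keep only the rows this data mart needs."""
    return [r for r in warehouse_rows if r["region"] == region]


def format_star(rows: list[dict]) -> tuple[dict, list[dict]]:
    """Format: build a product dimension with surrogate keys and a fact table."""
    product_dim: dict[str, int] = {}
    facts: list[dict] = []
    for r in rows:
        # Reuse the existing surrogate key or assign the next one.
        key = product_dim.setdefault(r["product"], len(product_dim) + 1)
        facts.append({"product_key": key, "month": r["month"], "revenue": r["revenue"]})
    return product_dim, facts


def deliver(data_mart: dict, product_dim: dict, facts: list[dict]) -> str:
    """Deliver: load the mart and return a notification for the business users."""
    data_mart["dim_product"] = product_dim
    data_mart["fact_sales"] = facts
    return f"Delivered {len(facts)} fact rows"


warehouse = [
    {"region": "East", "product": "Widget", "month": "1999-01", "revenue": 100.0},
    {"region": "West", "product": "Widget", "month": "1999-01", "revenue": 80.0},
    {"region": "East", "product": "Gadget", "month": "1999-01", "revenue": 50.0},
    {"region": "East", "product": "Widget", "month": "1999-02", "revenue": 120.0},
]
east_mart: dict = {}
dim, facts = format_star(filter_rows(warehouse, "East"))
notice = deliver(east_mart, dim, facts)
```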
In creating the data delivery process, you should:
- Try to keep the process simple until the dynamics of the environment are understood and all other infrastructure components are in place.
- Build a system to manage the data mart requests first. This should include a process to prioritize and consolidate the requests. This becomes a very useful process in managing requests and promoting communications with the end users.
- Try to develop and use templates for formatting request results wherever possible. These will be invaluable as your data mart population grows. They will be used to assist in automating the format process.
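A request management system of the kind recommended above might, in its simplest form, look like the following sketch. It assumes that requests for the same subject area are consolidated into one entry and that a lower number means a higher priority. The class names and fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class MartRequest:
    priority: int                                   # lower number = more urgent
    subject: str = field(compare=False)
    requesters: list = field(default_factory=list, compare=False)


class RequestQueue:
    """Consolidates data mart requests by subject and orders them by priority."""

    def __init__(self) -> None:
        self._by_subject: dict[str, MartRequest] = {}

    def submit(self, subject: str, requester: str, priority: int) -> None:
        existing = self._by_subject.get(subject)
        if existing is not None:
            # Consolidate: one entry per subject, keeping the most urgent priority.
            existing.requesters.append(requester)
            existing.priority = min(existing.priority, priority)
        else:
            self._by_subject[subject] = MartRequest(priority, subject, [requester])

    def in_priority_order(self) -> list[MartRequest]:
        return sorted(self._by_subject.values())


queue = RequestQueue()
queue.submit("customer churn", "marketing", priority=2)
queue.submit("product sales", "finance", priority=1)
queue.submit("customer churn", "sales", priority=3)
ordered = queue.in_priority_order()
```

Consolidating by subject keeps the request list short and gives the team a natural point of contact with every group that asked for the same data.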
A data mart contains data from the data warehouse tailored to support the specific analytical requirements of a given business unit or business function.
The data mart is the recipient of the information assimilated by the data delivery process. Data marts may have either a business unit or functional view of the data warehouse data; thus data marts utilize the common corporate view of strategic data established in the data warehouse by the integration and transformation process. (Please read my column entitled, "Will the Real Data Mart Please Stand Up?" in the March 1999 issue of DM Review for more on data marts.) Some points about the data mart:
- The data mart may or may not be located on the same machine as the data warehouse. This allows consumers to select the best technology to support their particular style of decision making.
- Data marts should be conservatively implemented as an extension of the data warehouse, not as an alternative. Note: This does not mean that you should not implement a data mart as a proof of concept. Indeed, this is perhaps one of the best ways to demonstrate viability of the DSS environment. However, the long-term strategy dictates that the full Corporate Information Factory infrastructure is necessary for a healthy DSS environment.
- Data marts are the ideal constructs for classical decision support, including data mining and data visualization processes. However, you should keep in mind the tradeoff between the simplicity of design and the cost of administration of many data marts.
Decision Support Interface (DSI)
The decision support interface provides the end user with easy-to-use, intuitively simple tools to distill information from data.
DSI consists of the secondary consumers in the Corporate Information Factory. It is from these systems that analysis activities are enabled. There is much flexibility in terms of tool and technology choices, allowing the end user to match the tool to the task at hand.
Some of the considerations in this environment are:
- The data mart is the source of information for DSI while the data warehouse itself may be somewhat restricted in access.
- The types of tools may be categorized as query, reporting, multidimensional or online analytical processing, data mining, data exploration or data visualization tools.
- Prototype extensively before purchasing a DSI tool, and don't try to do too much at first. Starting small and growing helps you understand how the end users will use the tools and the information.
- You should plan on supporting several tools in each category. This can become a very resource intensive situation that may prohibit further construction of your CIF.
Transaction Interface (TrI)
The transaction interface provides the end user with an easy-to-use, intuitively simple interface to request and employ business management capabilities. It uses the operational data store as its source of data.
TrI is the catalyst (or messaging infrastructure) that provides the delivery and management of requests. It provides the presentation and functionality to prepare, submit and process requests for information. A good example of a consumer is CTI (computer telephony integration). CTI provides a very sophisticated environment for managing customer calls but lacks the information (and subsequently the knowledge) for driving the interaction with the customers. By integrating CTI (the application) with the operational data store (via TrI), customers can be routed to the appropriate business professional who has the critical information needed to deliver premier customer care.
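The CTI example can be sketched as a lookup against the operational data store followed by a routing decision. Everything specific here is an assumption for illustration: the ODS is a dictionary keyed by caller number, and the customer segments, queue names and routing rules are hypothetical.

```python
# Hypothetical ODS contents: current, integrated customer data keyed by phone number.
ODS = {
    "555-0100": {"customer_id": "C1", "segment": "premier", "open_issue": False},
    "555-0101": {"customer_id": "C2", "segment": "standard", "open_issue": True},
}


def route_call(caller_number: str, ods: dict) -> str:
    """Choose a destination queue using current customer data from the ODS."""
    customer = ods.get(caller_number)
    if customer is None:
        return "general_queue"        # unknown caller: no ODS record to act on
    if customer["open_issue"]:
        return "support_specialist"   # an unresolved issue takes precedence
    if customer["segment"] == "premier":
        return "premier_desk"
    return "general_queue"
```

For example, `route_call("555-0100", ODS)` selects the premier desk, while a number that is absent from the ODS falls through to the general queue.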
Up to this point, all components of CIF provide visibility into data to drive both business intelligence and business management activities. But this is not enough. We need meaning in order to achieve our full potential in the CIF. Meta data provides the legibility necessary to achieve meaning.
Meta data provides the necessary details to promote data legibility, use and administration. Its contents are described in terms of data about data, activities and knowledge.
Figure 4: Data Management
Meta data is a formal component of the Corporate Information Factory and should not be given short shrift. It is meta data that provides comprehension to the end users and information concerning the management of the environment to the administrators. Some of the considerations for meta data are:
- Start gathering and managing meta data from the very start of your Corporate Information Factory creation. Meta data becomes of primary interest to the end users almost immediately.
- Develop a rational versioning scheme for meta data. Determine what events or conditions constitute a new version.
- Integrate the business and technical meta data and provide views that are appropriate for each group. Incorporate robust search capabilities such as browsers. You should consider using the Internet or an intranet for meta data delivery.
- Integrate the various sources of meta data and maintain the accuracy of this information. Make sure that you can easily accommodate new requirements as your construction progresses.
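One possible versioning scheme is sketched below. It assumes that the condition constituting a new version is any change to the content of a definition, detected by comparing a fingerprint of the definition with the latest stored one. The registry class and the example table definition are hypothetical; a real scheme might also version on events such as a source system change.

```python
import hashlib
import json


class MetaDataRegistry:
    """Keeps a version history per meta data object; a content change creates a new version."""

    def __init__(self) -> None:
        # name -> list of (version, fingerprint, definition)
        self._versions: dict[str, list[tuple[int, str, dict]]] = {}

    @staticmethod
    def _fingerprint(definition: dict) -> str:
        # Sorting keys makes the fingerprint independent of key order.
        canonical = json.dumps(definition, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def register(self, name: str, definition: dict) -> int:
        """Store the definition and return its version number."""
        history = self._versions.setdefault(name, [])
        fingerprint = self._fingerprint(definition)
        if history and history[-1][1] == fingerprint:
            return history[-1][0]     # unchanged: keep the current version
        version = len(history) + 1
        history.append((version, fingerprint, definition))
        return version


registry = MetaDataRegistry()
v1 = registry.register("customer", {"id": "integer", "name": "text"})
v_same = registry.register("customer", {"name": "text", "id": "integer"})
v2 = registry.register("customer", {"id": "integer", "name": "text", "segment": "text"})
```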
Figure 5: Data Delivery
The Corporate Information Factory may be separated into two fundamental processes:
- Producers who "get data in" by placing data into context for use by the entire enterprise
- Consumers who "get information out" by using the enterprise data to deliver business intelligence and business management capabilities.
The Corporate Information Factory is a proven, robust, logical architecture for strategic and tactical decision support. It demonstrates the interaction between the various components and the processes needed to support it. Each component has a specific function and purpose and, if left out, may cause disruption in the overall efficiency and usability of the architecture.