Wherever there is a large amount of data that needs both integration and history, data warehousing has applicability. This means that data warehousing is for banks, insurance companies, manufacturers, retailers, public utilities, government agencies and telecommunication companies.
The concept of data warehousing may apply to each of these organizations, but in many cases the data warehouse that results has its own characteristics. In other words, an insurance data warehouse will not look like a banking data warehouse, a manufacturing data warehouse will not look like a retailing data warehouse and a human resources data warehouse will not look like an e-commerce data warehouse. Each has unique characteristics.
Human resources data warehouses are perhaps some of the most unique data warehouses, because they exist for a different reason than most other warehouses. Most data warehouses exist as a way to manage large volumes of data. Retailers, telecommunications, banking and finance all have data warehouses that contain or represent a huge number of transactions. When you stop and think of it, there just aren't that many transactions that occur in human resources. About the only significant amount of transaction processing that occurs is payroll processing. The reason why human resources needs a warehouse is because of the need to have an orderly and complete store of data over a lengthy period of time. In other words, human resources environments need data warehouses in order to manage history.
Human resources data warehouses will store data for 20 to 50 years, a length of time almost unheard of elsewhere. Another implication is that the human resources warehouse will not store as much data as other warehouses because very few transactions need to be kept. A third implication is that the human resources warehouse contains a lot of data structures which are time continuous. The normal kind of warehouse structure is one that is best described as a snapshot. But in a human resources warehouse, long continuous records of information are kept, centered around employees.
The insurance environment is unique because of the length of time required for a transaction to occur or for a unit of business to be complete. In order to understand this notion, contrast an insurance company with the business of a bank. A customer walks into the bank, cashes a check and takes his/her money. The entire transaction takes a few seconds. Or consider the telecommunications environment. A person dials a phone, talks to the other person, then hangs up. Or consider the retailing environment. The person goes into a store, selects an item, then pays for it. The business transaction is complete in a short amount of time.
Contrast these transactions with the kinds of transactions that occur in the insurance environment. Consider a car accident: a report is made of the accident, a claim is filed, a counterclaim is filed. The police are interviewed. An adjuster makes a report. An offer is tendered to the victim. The victim goes to court. It may take years for the settlement of the accident claim. In the insurance environment, the transaction occurs as if in slow motion. An example involves purchase of a whole life insurance policy. The policy is to be paid off at 55 and will cover the insured until death. The policy takes ten years to pay off. It may take another twenty years for collection. The length of time for the transaction to be complete is measured in decades.
Because of the very lengthy transaction time periods found in insurance warehouses, there is often considerable overlap between the insurance data warehouse and the insurance operational environment. This overlap is found nowhere else.
The retail warehouse and telecommunications warehouse are marked by the need to manage very many occurrences of data. The number of individual sales made per day, the amount of items stocked on a daily basis and the number of phone calls made per day are all transactions and occurrences which accumulate at a fantastic rate. The issues inside the retail/telco warehouse relate to how large amounts of data can be managed.
There are at least two common ways to manage this data. The first is the creation of aggregate/profile records as the data is loaded into the warehouse. Instead of loading detailed data into the warehouse, data is summarized and aggregated in order to reduce the volume of data placed inside the data warehouse. One of the disadvantages of this aggregation and profiling is the fact that a certain amount of detail is lost when the data is structured.
In order to save space and unnecessary processing, retail/telco data warehouses oftentimes create what is called rolling summary data. In a rolling summary data structure, data is cascaded from one database to another based on time. Hourly data is loaded into a daily database at the end of the day. Daily data is loaded into a weekly database at the end of the week. Weekly data is loaded into a monthly database at the end of the month, and so forth. In such a manner, data is kept over a long period of time. However, the longer the data is kept, the more summarized the data. Even with rolling summarization, data is not kept in a retail/telco warehouse for nearly as long as it is kept in an insurance or human resources data warehouse.
Increasingly, a popular alternative for the storage of very large amounts of data found in the retail/telco warehouse is to store the bulk of the data in alternative storage near-line storage or "fat" disk storage. By storing the bulk of the detail in alternative storage, the organization is able to store data at the lowest level of detail more economically. When needed for analysis, the alternative storage data is loaded into an exploration warehouse or a standard data warehouse residing on disk storage. By keeping only data that is actively used for processing in disk storage, the costs of the warehouse are reduced and the performance of the warehouse is raised significantly.
To a different degree, e-commerce data warehouses have the same problem with aggregation and profiling as retailing and telco warehouses. E-commerce data warehouses contain significant amounts of data. The difference between an e- commerce data warehouse and a retail/telco data warehouse is that aggregation and profiling are very desirable with e-commerce data. In general, the clickstream data that is collected in e-commerce is at too low a level to be useful in an unrefined state. The clickstream data needs to be conditioned before it can be entered into a data warehouse and used in an analytically effective manner. This conditioning process causes e-commerce clickstream data to be greatly reduced in volume. Undesirable units of data which are not useful for analysis and merely get in the way are removed from the clickstream before being placed in the data warehouse. Clickstream data fits very comfortably on alternative storage.
However, there is a world behind clickstream data, and that is the data that is interactive with the tables of the corporation. This data tells the tale of much of the transaction processing that is relevant to the business of e-commerce.
In some engineering environments e.g., taking data from process control in the analog environment and moving it to the digital environment tons of data is generated. Data reduction is simply a way of life in these environments and existed long before there were data warehouses. A lot of the data is simply not relevant and does not deserve to be stored in a data warehouse. What does need to be stored are the exceptions.
In a data warehouse suited to these forms of engineering, keeping track of data values lying within bands of tolerances is done very efficiently by noting where occurrences happen outside of the bands of tolerances. Of course, the engineer that specifies the bands of tolerances must be very sure what will be of interest at a later point in time, because the data that lies within the bands of tolerances is lost forever. The data that then comes from process control engineering fits into a data warehouse but with a different structure than found elsewhere.
One of the other quixotic facts of engineering data warehouses is that the value of the detail data does not degenerate over time as it does in a sales or retailing environment. In many cases, the details of manufacturing are as important today as they were ten years ago. There simply is no way that much manufacturing data can be summarized and aggregated as it ages and maintain its usefulness. The need to have a carefully controlled archival environment where details are preserved is an interesting characteristic found in many manufacturing engineering data warehouses.
Even star join database design has its own peculiarities. A typical star join consists of fact tables and dimension tables. The dimension tables contain far less data than the fact tables. The fact tables contain the vast amount of occurrences of data.
Interestingly, the fact tables contain almost exclusively numeric data. The dimension tables contain the alphabetic data. About the only alphabetic data found in the fact table are those instances where data is used as a key. The keys in the fact tables may be alphabetic. Otherwise the fact table contains purely numeric data.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access