Our situation is that we need to build a system having: stage (loaded daily)-->ODS (loaded monthly)--> DW (loaded monthly).
The stage is being fed from flat files and houses tables - a structure which is exactly same as the flat files. The DW is a dimensional structure, having common dimensions. We have no idea about the best way to model the ODS in this scenario. The ODS needs to act as stage for the DW and also takes care of operational reporting needs. Can you validate this architecture and suggest best ways to model the ODS? An example will be really helpful.
Les Barbusinski's Answer: In the real world, there are usually three main data stores in a data warehouse:
Many people confuse the function, structure and content of these data stores and, hence, get themselves into trouble.
The purpose of the operational data store is to hold current (not historical) operational information in a denormalized structure (i.e., master and transactional data "mish-mashed" together) that facilitates the efficient generation of operational reports (e.g., lists, exception reports, summary and detail reports, statements, etc.). This means that the ODS is populated by tables that look a lot like the reports they facilitate, and the same data may appear in multiple tables. Also, data that is not needed for operational reporting is not stored in the ODS.
The purpose of the analytical data store is to hold current and historical information about the enterprise in a dimensional structure that facilitates the generation of analytical reports (i.e., OLAP/BI reports with atomic and aggregated metrics measured against groups of dimensions which permit drill downs and other manipulations that help answer questions about the enterprise, and facilitate decision making). As with the ODS, the DDS does not store data it does not need for its analytical reports.
Notice that each data store has a different function, works with a different set of data, and stores its information in a different type of data structure. If you confuse these data stores, you usually get into trouble. A common mistake is to bypass the HDS completely, and have the ODS perform both functions. In this case you wind up with one of two problems: either 1) the ODS is fully normalized (i.e., 3NF) and the performance of the operational reports is abysmal or 2) the ODS is denormalized for operational reporting and the performance of extracts for the DDS is abysmal. Also, in the second case, it's difficult - if not impossible - to find atomic "master" or transaction information in the ODS because of the denormalized structure (i.e., data elements are repeated all over the place).
To illustrate this, let's use an insurance company as an example. The insurance company's operational systems generate a variety of data about products (e.g., term life insurance, whole life insurance, fixed and variable annuities, etc.), customers, branch offices, agents and brokers, policies and a host of transaction types (e.g., new contracts, agents added/deleted, premiums paid, policy cancellations, policy claims and payouts, policy "surrenders", commissions paid, etc.). All of this data is extracted and stored in flat file format in the data warehouse's Staging Area on a nightly basis. External data (e.g., demographics, stock exchange codes, census data, etc.) is also stored in the staging area.
An initial set of ETL processes then cleanses and integrates the data from the staging area, transforms it into a normalized data structure and loads it into the historical data store. The tables in the HDS would have names such as CUSTOMER (containing static information such as name, DOB, gender, ethnicity, etc.), CUSTOMER_HISTORY (containing snapshots of dynamic information such as marital status, address, occupation, income level, etc.), POLICY (containing a current snapshot of each Policy), POLICY_HISTORY (containing historical snapshots of every Policy), PREMIUM_PAYMENT (containing information about every premium billed and paid against a Policy), POLICY_CLAIM (containing information on Claims against every Policy), and so on. Because the structure is fully normalized, HDS tables are 1) rarely modified, 2) almost always formatted as a "snapshot" of some kind and 3) the data tends to be kept for years before being archived (i.e. table sizes can be huge). Thus, information in an HDS is complete and easy to find, but generating reports from an HDS is not advisable (too many joins required on too many large tables).
A second set of ETL processes then searches the HDS (or its load files) for new and updated operational data, extracts what it needs, transforms it into denormalized reporting structures and loads it into the operational data store.
Tables in the ODS would have names such as CUSTOMER_LIST (containing customer information by Branch and Agent), POLICY_LIST (containing summary Policy data by Branch and Agent), DELINQUENT_POLICY (containing exception data about Policies with delinquent premium payments), COMMISSION_DETAIL (containing individual commission payments by Branch, Agent, Policy, and Month), POLICY_STATEMENT (containing summary Policy data that accompanies the monthly premium billing), and so on. Note that the structure and content of each of these tables closely resembles that of the report(s) it feeds. Because of this characteristic, the structure of ODS tables tends to change more frequently (i.e,. as the reports they feed evolve). Also, ODS tables are created/dropped more frequently than HDS tables (as operational reports are created or dropped). Furthermore, since ODS tables contain only current information (e.g., current Agent profiles, sales for the current week or month, etc.), the data in ODS tables tends to be overlaid rather than added to ... which means that ODS tables tend to be small.
Finally, a third set of ETL processes searches the HDS (or its load files) for new and updated dimensional and transactional data, extracts what it needs, aggregates it, transforms it into a dimensional structure, and loads it into the analytical data store. Tables in the DDS would have names such as CUSTOMER_DIM, PRODUCT_DIM, POLICY_DIM, POLICY_PREMIUM_FACT, POLICY_PAYOUT_FACT, AGENT_COMMISSION_FACT, etc. Dimension tables in the DDS tend to be stable in both structure and content, while fact tables tend to appear/evolve/disappear more quickly as management's analytical needs evolve. Most dimension tables tend to be rather small, although some (like CUSTOMER_DIM and PRODUCT_DIM) can be huge. Fact tables, on the other hand, are almost always large and continuously growing.
In this manner all incoming source data is cleansed, integrated, and synchronized once for the HDS, then subsetted and reformatted from the HDS for the needs of the ODS and DDS. Note that the ODS and DDS only extract the data they need from the HDS before formatting it to their own liking. Note also that the source for the DDS is the HDS ... not the ODS because the ODS does not necessarily carry all of the data needed by the DDS. Hope this helps.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access