Sid Adelman's Answer: For someone new to the data warehouse there is a wealth of information on past questions and answers in the Question Archive section of Ask the Experts on DMReview.com. There are 23 on OLAP, 12 on project management, 24 on data quality, and 122 on DW design and methodology.
Clay Rehm's Answer:
- Fundamental OLAP architecture design focuses on used, duplicated and denormalized data. Let me explain - used data is stored in the data warehouse. This means that there is data in the data warehouse that is already somewhere else in some operational system. The data warehouse captures, stores and reformats data that has come from somewhere else. If you have data that is in the data warehouse that is not in an operational system (other than derived values), then there is a problem with your processes. Data is duplicated not only from the operational systems but also with the warehouse - the same data can be duplicated in many different types of detail or summary tables depending on how your audience needs to see the data. Data warehouses are meant for non-technical users and as such, the data needs to be in formats and structures that are easy for them to use.
- The difference between OLTP and DSS is the same difference between OLTP and DW. Data warehousing is a fairly new term for a concept that started many years ago. In the early days of data management, terms such as reporting databases and DSS surfaced to provide a repository of data and processes that focused on user's query and reporting needs. Added to that is the term Business Intelligence (BI) which in my opinion provides a bigger picture - data and processes.
- The best processes to use in a data warehouse? There are far too many to list here and I could recommend some books and methodologies. I always recommend a project management methodology (PMM) and the Project Management Institute (www.pmi.org) can help you out there. In addition, you will need a systems development methodology (SDM) that is tailored to the specifics of data warehousing. The books published by Ralph Kimball are good, as well as the book Data Warehouse Project Management by Adelman and Moss.
- If by architecture you mean hardware/network requirements, then your best bet is to have your own dedicated servers and high-speed networks. If your question pertains to how the data is structured, then you will have a mixture of fact, dimension and summary tables. Fact tables are normalized like in the OLTP world, but dimensions and summary tables are highly denormalized. In addition, your star schemas (facts and dimensions joined together) will usually be presented to the user as one table or view.
- Data quality deals with processes to keep your data of high quality. In addition there are levels of quality, and tolerance levels of what your users will accept. For example, say you are capturing customer addresses. On a very simple level, users should not be allowed to key in their own codes for States. Having a decode table (and drop down list) should provide a standard list of States the user would choose from. However this does not prevent the user from choosing a city that does not belong in that state. Your operational applications can be built to include many edits to prevent common sense errors, but there also needs to be business processes in place to prevent the entry of non-quality (bad) data. Whatever is keyed into the operational systems will be carried over to your data warehouse, and in many cases this bad data really becomes apparent. You then need to fix the source systems and/or fix the extract processes that load the data warehouse.
Les Barbusinski's Answer: You are doomed. Entire volumes have been written on the subjects that you mention, and you probably won't have time enough to prepare yourself. DBAs usually cut their teeth on small departmental data marts before graduating to terabyte data warehouse implementations.
Nevertheless, here's a partial list of essential reading that will give you an understanding of the critical issues in data warehousing with regards to data architecture, data quality, ETL processes, OLAP concepts, meta data concepts and project methodology:
- Building the Data Warehouse by Bill Inmon
- Corporate Information Factory by Bill Inmon, Claudia Imhoff, and Ryan Sousa
- The Data Warehouse Toolkit by Ralph Kimball
- The Data Warehouse Lifecycle Toolkit by Ralph Kimball
- Data Warehouse: From Architecture to Implementation by Barry Devlin
- Improving Data Warehouse and Business Information Quality by Larry English
- Data Quality for the Information Age by Thomas Redman
- Building and Managing the Metadata Repository by David Marco
- Data Warehouse Project Management by Sid Adelman and Larissa Moss
Scott Howard's Answer: I think that I answered your first question in one of this month's co-questions. For your second question, let's define OLAP as a continuum of decision support. Your question is: what is the difference between OLTP and OLAP systems? This is very similar to your fourth question so I'll try to answer both. Short of an entire course or book on the subject, all I can do here is help expose the major differences in the models and architectures.
The OLTP model is one that is intended to capture and efficiently manage the current state of your business. Short term transactions, current inventory, monitoring current manufacturing processes and the like are the focus of most OLTP applications and systems. OLAP systems represent history and need to function in a way contrary to OLTP systems. That is they need to capture everything that goes on within our business including the net business result of an OLTP transactional update or delete and represent and preserve that net meaning in a historical model. They also need to combine that with external events like promotions, special external events such as holidays, weather, manufacturing floor conditions, economic conditions, so business analysts can make sense of the changes in our business captured from the OLTP models. These external events are also not generally represented in the OLTP models. Now we don't have room or time today to expand on how to do just that, but that's what OLAP or data warehouse modeling is all about and why it differs so from OLTP models and architecture.
Let's now consider the operational differences between OLTP and OLAP systems. OLTP systems usually have a high constant transaction rate usually consisting of very simple read/write transactions. Systems administrators thus tune these systems to take most advantage of the resources available at constant rates, thus drive CPU and I/O usage as close to 100 percent as possible. This is in contrast to OLAP systems use which is inconsistent, primarily long running and complex read-only transactions. This pattern leads to peaks and valleys in resource usage that when combined with OLTP usage can cause usage spikes well over resource capacity. These spikes can result in service level violations for your OLTP system. Again this illustrates the need for very different system and data models.
Now let's consider data quality. Data quality relates to the reliability of source data regarding both content accuracy and business integrity. Addressing the issues is a simple task, recognize and fix as many anomalies as your can. Well the charge is simple; accomplishing it is a challenge that needs to be addressed individually model by model. I have a feeling that my Ask the Experts associates will expand on this.