Q: This is a simple data mart question. What is summary data? Most books and white papers state that the star schema contains aggregated and summarized data. I understand aggregated data (e.g., monthly sales), but what is summarized? Is it a subset of the detail data, or is there some type of transformation to the data to make it summarized?
A: Les Barbusinski’s Answer: In my experience, the two terms are used interchangeably most of the time. However, I have heard some people make the following distinction between the two: aggregate tables simply sum up transactional metrics against one or more dimensions, while summary tables store "synthesized" metrics. An example of this would be an accounting data warehouse where historical general ledger transactions are stored. An aggregate table would simply summarize these transactions by account, department and month. A summary table, on the other hand, would provide P&L metrics (revenue, gross margin, net income and other KPIs) by department and month.

Scott Howard’s Answer: I also have problems with this phrasing. It's something that someone penned many years ago and seemed to stick. Yes, summarization is a form of aggregation, thus the basis for our confusion. I think what was intended was an attempt to compare a typical application-oriented or OLTP model to a typical DW or data mart model. OLTP models are generally very specific, containing current detailed data. On the other hand, DW models contain very summarized historical information materialized and maintained in a way not possible in OLTP models. Now how do we get from one model to the next? Aggregation: average, sum, etc.

Chuck Kelley’s Answer: I believe that aggregate and summarized are the same thing. A synonym for aggregate is summative (according to the Thesaurus in Microsoft Word). Some people use the term summarized and others use aggregate (including me!). Some of us try to be a bit more precise and use the word "or" instead of "and" (i.e., "… contains aggregated or summarized data …"), but it doesn’t always happen. Sorry for the confusion!

David Marco’s Answer: Summarization and aggregation are the same thing.

Clay Rehm’s Answer: In its simplest form, summary data is data that has been "summarized," or aggregated. This means that some form of detailed data has been rolled up to less detail.

How this is physically stored depends on the preference of your data analysts and DBAs. Summary data can reside in a star schema, or it can reside in a single table that has been "flattened"; that is, the key elements of each dimension and the fact table are built into one "flat" table. Summarized can mean that the data was simply rolled up (added up) or that complex transformation rules were applied to summarize the data. Additionally, summary tables can be at whatever level of detail the user needs, as long as the data is in one easy-to-access place.
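To make the distinction Les Barbusinski describes a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the ledger rows, the account names, the account types ("revenue", "cogs", "opex") and the simplified P&L formulas are all hypothetical and do not come from any of the answers above. The first function builds an aggregate table by summing transactions by account, department and month; the second builds a summary table of derived ("synthesized") metrics by department and month.

```python
from collections import defaultdict

# Hypothetical general ledger rows (illustrative only):
# (account, account_type, department, month, amount)
LEDGER = [
    ("4000-sales", "revenue", "retail", "2024-01", 1200.0),
    ("4000-sales", "revenue", "retail", "2024-01", 800.0),
    ("5000-cogs", "cogs", "retail", "2024-01", 900.0),
    ("6000-rent", "opex", "retail", "2024-01", 400.0),
    ("4000-sales", "revenue", "online", "2024-01", 500.0),
    ("5000-cogs", "cogs", "online", "2024-01", 200.0),
]


def build_aggregate(ledger):
    """Aggregate table: plain sums by (account, department, month)."""
    totals = defaultdict(float)
    for account, _acct_type, dept, month, amount in ledger:
        totals[(account, dept, month)] += amount
    return dict(totals)


def build_summary(ledger):
    """Summary table: derived P&L metrics by (department, month).

    Assumed, simplified formulas:
      gross_margin = revenue - cogs
      net_income   = gross_margin - opex
    """
    by_type = defaultdict(lambda: defaultdict(float))
    for _account, acct_type, dept, month, amount in ledger:
        by_type[(dept, month)][acct_type] += amount

    summary = {}
    for key, totals in by_type.items():
        revenue = totals["revenue"]
        gross_margin = revenue - totals["cogs"]
        net_income = gross_margin - totals["opex"]
        summary[key] = {
            "revenue": revenue,
            "gross_margin": gross_margin,
            "net_income": net_income,
        }
    return summary


if __name__ == "__main__":
    for key, value in sorted(build_aggregate(LEDGER).items()):
        print("aggregate", key, value)
    for key, value in sorted(build_summary(LEDGER).items()):
        print("summary", key, value)
```

Both tables are rolled up from the same detail rows, which is why several of the experts treat the two terms as synonyms. Under this reading, the only difference is whether the stored values are straight sums or metrics computed from those sums.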
Chuck Kelley is an internationally known expert in database and data warehousing technology. He has 30 years of experience in designing and implementing operational/production systems and data warehouses. Kelley has worked in some facet of the design and implementation phase of more than 50 data warehouses and data marts. He also teaches seminars, has co-authored four books on data warehousing and has been published in many trade magazines on database technology, data warehousing and enterprise data strategies. He can be contacted at chuckkelley@usa.net.
Les Barbusinski is vice president of technology and co-founder of Digital Symmetry, LLC, a consulting firm that specializes in the design and development of data warehousing and business intelligence solutions. He has more than 20 years of experience in data warehouse and operational systems development and provides hands-on expertise in data warehouse design, development and project management. Les can be reached at dwexpert@dsym.com.
Scott Howard has been with IBM for more than 22 years. Howard’s experience includes staff and management assignments ranging from microapplications programming to mainframe and systems programming. He is an internationally recognized expert on business intelligence, data warehousing, DRDA, distributed databases and multivendor database integration, and an author and contributor to many publications. Scott is an IBM Certified Advanced Technical Expert for DB2 UDB, an IBM Certified Business Intelligence Specialist and a Certified Technical Trainer. Howard is currently with Learning Services, IBM Global Services, where he is the worldwide leader for its business intelligence and data integration curricula. He has worked with IBM’s Silicon Valley, Toronto, Rochester and Austin development labs for the past twelve years, developing client/server database and data warehousing courses.









