Q: I have a client that uses snowflake schemas to model his data marts. I would like to show him that star schemas have better performance to queries than snowflake schemas. Please compare the two schemas.
Chuck Kelley's Answer:
I do not think that you can conclusively say that snowflakes are faster the stars. I believe that when there are less joins, the performance is generally faster, hence I would say that stars are faster than snowflakes. In determining whether snowflakes are better than stars, I would look at my user community tool. For example, MicroStrategy must have a snowflake in order to be used successfully. Other tools will work better with stars. So to be fair to your client, in my opinion, I would suggest that you match the design to the tool.
Tom Haughey's Answer:
First, let's describe them. A star schema is a dimensional structure in which a single fact is surrounded by a single circle of dimensions; any dimension that is multileveled is flattened out into a single dimension. The star schema is designed for direct support of queries that have an inherent dimension-fact structure. A snowflake is also a structure in which a single fact is surrounded by a single circle of dimensions; however, in any dimension that is multileveled, at least one dimension structure is kept separate. The snowflake schema is designed for flexible querying across more complex dimension relationships. The snowflake schema is suitable for many-to-many and one-to-many relationships among related dimension levels. However, and this is significant, the snowflake schema is required for many-to-many fact-dimension relationships. A good example is customer and policy in insurance. A customer can have many policies and a policy can cover many customers. The primary justification for using the star is performance and understandability. The simplicity of the star has been one of its attractions. While the star is generally considered to be the better performing structure, that is not always the case. In general, one should select a star as first choice where feasible. However, there are some conspicuous exceptions. The remainder of this response will address these situations.
First, some technologies such a MicroStrategy require a snowflake and others like Cognos require the star. This is significant.
Second, some queries naturally lend themselves to a breakdown into fact and dimension. Not all do. Where they do, a star is generally a better choice.
Third, there are some business requirements that just cannot be represented in a star. The relationship between customer and account in banking, and customer and policy in Insurance, cannot be represented in a pure star because the relationship across these is many-to-many. You really do not have any reasonable choice but to use a snowflake solution. There are many other examples of this. The world is not a star and cannot be force fit into it.
Fourth, a snowflake should be used wherever you need greater flexibility in the interrelationship across dimension levels and components. The main advantage of a snowflake is greater flexibility in the data.
Fifth, let us take the typical example of Order data in the DW. Dimensional designer would not bat an eyelash in collapsing the Order Header into the Order Item. However, consider this. Say there are 25 attributes common to the Order and that belong to the Order Header. You sell consumer products. A typical delivery can average 50 products. So you have 25 attributes with a ratio of 1:50. In this case, it would be grossly cumbersome to collapse the header data into the Line Item data as in a star. In a huge fact table you would be introducing a lot of redundancy more than say 2 billion rows in a fact table. By the way, the Walmart model, which is one of the most famous of all time, does not collapse Order Header into Order Item. However, if you are a video store, with few attribute describing the transaction, and an average ratio of 1:2, it would be best to collapse the two.
Sixth, take the example of changing dimensions. Say your dimension, Employee, consists of some data that does not change (or if it does you do not care, i.e., Type 1) and some data that does change (Type 2). Say also that there are some important relationships to the employee data that does not change (always getting its current value only), and not to the changeable data. The dimensional modeler would always collapse the two creating a Slowly Changing Dimension, Type 2. This means that the Type 1 is absorbed into the Type 2. In some cases I have worked on, it has caused more trouble than it was worth to collapse in this way. It was far better to split the dimension into Employee (type 1) and Employee History (type 2). Thereby, in such more complex history situations, a snowflake can be better.
Seventh, whether the star schema is more understandable than the snowflake is entire subjective. I have personally worked on several data warehouse where the user community complained that in the star, because everything was flattened out, they could not understand the hierarchy of the dimensions. This was particularly the case when the dimension had many columns.
Finally, it would be nice to quit the theorizing and run some tests. So I did. I took a data model with a wide customer dimension and ran it as a star and as a snowflake. The customer dimension had many attributes. We used about 150MM rows. I split the customer dimension into three tables, related 1:1:1. The result was that the snowflake performed faster. Why? Because with the wide dimension, the DBMS could fit fewer rows into a page. DBMSs read by pre-fetching data and with the wide rows it could pre-fetch less each time than with the skinnier rows. If you do this make sure you split the table based on data usage. Put data into each piece of the 1:1:1 that is used together.
What is the point of all this? I think it is unwise to pre-determine what is the best solution. A number of important factors come into play and these need to be considered. I have worked to provide some of that thought-process in this response.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access