Q: We have a raging debate going here about whether to load null values into the data warehouse or not. One side argues that nulls give misleading query results and that we should translate nulls to some kind of 'dummy' value. The other side insists that loading artificial data violates the integrity of the DW and also carries a cost in terms of storage consumption, performance, and data management. What's the best practice approach to loading nulls into the DW?
Joe Oates' Answer:
It has been my experience that loading nulls into a data warehouse is generally not a good idea. You don't say whether the issue of nulls pertains to fact tables, dimension tables or both. I am going to assume that you are referring to fact tables.
For fact tables, there are two situations that need to be addressed. One is having nulls where you are expecting an integer or decimal value, such as the number of items sold or totals. The other is when, during surrogate key replacement, no valid value is found in the appropriate cross-reference table.
In the case of having nulls where you are expecting an integer or decimal value, such as the number of items sold or totals, substituting zero for null is no worse than storing the null and not including the row(s) in a calculation. In my opinion, it also gives more accurate results in counts and other functions that can return misleading or incorrect values when nulls are present. You can always run reports to identify rows that contain unexpected zero values. It is unusual for a customer to buy zero products, or to buy some number of products and be charged zero dollars and cents.
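To see why aggregates over nulls can mislead, here is a minimal sketch using Python's built-in sqlite3 module. The table and figures are made up purely for illustration; the point is that COUNT and AVG silently skip nulls, while substituting zero makes every row participate.

```python
import sqlite3

# Hypothetical sales fact table, used only for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (order_id INTEGER, items_sold INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 10), (2, None), (3, 5)])  # order 2 has a null quantity

# Aggregate functions silently skip NULLs:
count_all, count_col, avg = con.execute(
    "SELECT COUNT(*), COUNT(items_sold), AVG(items_sold) FROM sales"
).fetchone()
print(count_all, count_col, avg)  # 3 2 7.5 -- only 2 of 3 rows counted

# With zero substituted for NULL, all three rows participate:
avg_zero = con.execute(
    "SELECT AVG(COALESCE(items_sold, 0)) FROM sales"
).fetchone()[0]
print(avg_zero)  # 5.0
```

Whether 7.5 or 5.0 is the "right" average depends on what the business means by the measure, which is exactly why unexpected nulls are dangerous.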
If the issue is that a valid customer ID, product ID, etc. could not be found, so that proper surrogate key replacement could not be done, a strategy of providing a row in each dimension table with an ID of zero and a description field set to "unknown," or some similar value, will allow you to take these source system transaction rows into the data warehouse and allow routine queries to be run without complex logic to handle null values. Also, you can run a separate report after each ETL run showing those transactions with invalid source keys. If you don't do that, reports will show grouping headers with the title of "unknown."
Either way, the end users will be alerted to which transactions in the source system need to be corrected.
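The surrogate key fallback described above can be sketched as follows. This is a simplified in-memory illustration, not actual ETL tooling; the table names, keys, and helper function are all hypothetical.

```python
# Sketch of surrogate-key replacement during ETL, assuming the dimension
# table contains a pre-seeded row with ID 0 and description "unknown".
UNKNOWN_KEY = 0

# Cross-reference: natural (source-system) key -> surrogate key.
customer_xref = {"C100": 17, "C200": 18}

def replace_key(natural_key, xref):
    """Return the surrogate key, falling back to the 'unknown' row."""
    return xref.get(natural_key, UNKNOWN_KEY)

facts = [("C100", 250.0), ("C999", 75.0)]  # C999 is not in the xref
loaded = [(replace_key(nk, customer_xref), amt) for nk, amt in facts]
print(loaded)  # [(17, 250.0), (0, 75.0)]

# Rows that fell back to the unknown key can be reported after the run:
invalid = [nk for nk, _ in facts if nk not in customer_xref]
print(invalid)  # ['C999']
```

Every fact row loads, routine queries need no null-handling logic, and the post-run report tells the source system owners exactly which keys to fix.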
Clay Rehm's Answer:
For each field, a list of all possible values or range of values must be defined in your metadata repository/data dictionary, including what a null means for that field. Having said that, why not use a value of "No Value Available" or something to that effect? Most nontechnical users struggle with the null concept, so provide an easy-to-use value that means the same thing.
Adrienne Tannenbaum's Answer:
Both scenarios will cause equivalent problems if you cannot guarantee that your future data warehouse users (developers as well as business users) will have easy access to up-to-date and correct metadata. And of course there is a lot more to be said about where these "values" fit in the grand scheme of things. Is the specific field crucial to the data warehouse itself, or is it secondary information?
Tom Haughey's Answer:
You are not the only ones having this debate. There are several factors to consider:
- the class of the column: amount, count, quantity, percentage, etc.;
- the role of the column: primary key, foreign key, or non-key column;
- how the column is used: for summation or joining; and
- the type of system: OLTP or data warehouse.
In OLTP, it is generally better to use nulls as Codd designed them. He designed them to represent 'no value,' and they should be used for that purpose. We should, however, find some way to distinguish three cases: 1. "no value specified," when there could be a value; 2. "value not available," for those cases where it should not be specified; and 3. a zero or blank value, where that is what was actually specified. Each of these is different. Data warehouses (and I am using the term here to cover all forms of reporting, analysis, mining and OLAP) are read-only systems that query large amounts of data. So here is what I think.
Primary keys should not be null, in whole or in part. This rule is true of all systems. For them, it is best to have an explicit value. In some cases, this may mean putting an artificial value in a dimension table that signifies something out of range, otherwise invalid, and the like. If a foreign key is optional, because the relationship is optional, I would leave it null.
The problem with null primary and foreign keys has to do with the join, or equi-join. Instances are included in the result set only when the value appears in the join column of all joined tables. The outer join solves this but is significantly less efficient. Though dummy values are artificial, they produce a more efficient join, so for querying purposes I would prefer an explicit value, even if it is a dummy value. Your choice of technology can affect this decision: a robust parallel platform may be able to do the outer join very efficiently, in which case I would stick with nulls. We used this approach in one DW on a fast IBM parallel platform to find which orders had no deliveries and which deliveries had no orders.
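The equi-join vs. outer-join trade-off can be demonstrated with a small sqlite3 sketch. The orders/deliveries tables here are hypothetical stand-ins for the scenario described above.

```python
import sqlite3

# Illustrative orders/deliveries tables with one null foreign key.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, delivery_id INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 100), (2, None)])  # order 2 has no delivery yet
con.execute("CREATE TABLE deliveries (delivery_id INTEGER, status TEXT)")
con.execute("INSERT INTO deliveries VALUES (100, 'shipped')")

# The equi-join silently drops the row with the NULL foreign key:
inner = con.execute(
    "SELECT o.order_id FROM orders o "
    "JOIN deliveries d ON o.delivery_id = d.delivery_id").fetchall()
print(inner)  # [(1,)] -- order 2 vanishes from the result

# The outer join keeps it, at extra cost on many platforms:
outer = con.execute(
    "SELECT o.order_id, d.status FROM orders o "
    "LEFT JOIN deliveries d ON o.delivery_id = d.delivery_id").fetchall()
print(outer)  # [(1, 'shipped'), (2, None)]
```

Pointing the null foreign key at a dummy "no delivery" dimension row would let the cheaper equi-join return both orders.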
As a general rule, leave monetary columns null. The reasoning is that an unknown dollar amount is different from a dollar amount of zero. It is probably necessary to deal with the unknowns wherever and whenever the amounts are shown to the end user. Remember, this is not just a data administration question; it is a business question. You may have to take some of these questions to the end users and get them to tell you what they prefer: what they want the defaults to be, whether they want to skip the data, and how they want the reports to look.
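A short sketch of why unknown and zero must not be conflated for monetary columns; the figures are invented for illustration.

```python
# One of three invoice amounts is genuinely unknown (None), not zero.
amounts = [120.00, None, 80.00]

# Average over the amounts we actually know:
known = [a for a in amounts if a is not None]
avg_ignoring_unknowns = sum(known) / len(known)  # (120 + 80) / 2 = 100.0

# Average after treating the unknown as zero -- a different business answer:
avg_treating_null_as_zero = sum(a or 0 for a in amounts) / len(amounts)

print(avg_ignoring_unknowns)      # 100.0
print(avg_treating_null_as_zero)  # about 66.67
```

Neither number is wrong in the abstract; which one the business wants is exactly the question to put to the end users.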
For text fields, it may be better to convert to ' '. OLAP likes that better. I have seen some people convert nulls to 'BLANK'. Some business preference can be involved in this decision.
Non-key dates can stay null. A null date is easy to deal with.
I do not believe that the "cost in terms of storage consumption, performance, and data management" is the real factor in favor of nulls. Actually, some performance is degraded by nulls, as noted above: real values join better than null values, at least in terms of the equi-join vs. the outer join. The DW is for querying, so which method works better for querying is the major question. Which one will give you the best results in terms of your reporting and analysis needs? Some personal DBA preference and experience also enters into this decision, depending on whether the DBAs have mostly supported OLTP or analytics.
I have seen good DWs handle nulls either way.