I'm designing a star schema for a client for reporting on the use of several systems. Every single usage of a system is logged and modeled as events in the fact-table QUERIED with an attribute called query=1. In combination with two dimensions User and System we can report on the number of queries a user fired on a system. Here's my question: I also want to report on the answers the user got from the system he queried. So, the star schema has to incorporate both event data and all the answers related to the events. I currently modeled it this way:

  • fact table QUERIED with attributes: foreign keys to the dimensions query=1
  • dimension table USER key id etc.
  • dimension table SYSTEM key id etc.
  • dimension QUESTION
  • dimension ANSWER

QUESTION and ANSWER are related to the fact table QUERIED and must provide the actual questions and answers causing the event to happen. Is this the way to model it or are there any other ways?
Joyce Bischoff’s Answer: It appears that you are trying to duplicate the functionality of several software tools that are designed to track data usage in the warehouse environment, such as Pine Cone, Teleran and others. If you actually mean that you are going to capture the data returned by a query, the overhead will be impossible. If you want to track the SQL generated by each query and you have a significant number of users, your ANSWER table will become very large and the overhead will also be significant. I suggest that you look into the available software products rather than attempt to reinvent the wheel with you own software.

Scott Howard’s Answer: Proper design here must be based on a good understanding of the user query patterns and their scope of analysis, neither of which I possess. Now that I’ve established that I’m not qualified to answer your question, let’s try; as that hasn’t stopped me before ;-). You’ve designed a factless fact table, which should be avoided unless the typical scope of analysis is very random or indeterminate. That’s a generalization based on experience, so let me be more specific. If you have data that will be returned for the majority of your queries or joining from the dimension to the fact is a performance problem, that data should exist only in the fact table. The curious thing about your example is your maintaining a history of the answer sets. I assume that these answer sets can be large, so it may not make sense to keep them in the fact table even if they are frequently required for the typical query. You’ll have to base your decision on your own experiences with the BI tools and RDBMS with which you deal. However, your QUESTION dimension appears to be a major scope of analysis and probably can be incorporated into your fact table as even large queries can be easily stored as a single attribute. Doing so would eliminate one dimension. I recommend this based on the assumption that the most likely query will be what question(s) did a specific user ask on a specific system or a specific day. A secondary query may be what was the answer set. I don’t know if a query of who asked a question that returned X would be too common (just an assumption based on not being intimate with your application). If my assumptions were wrong, use the following advice to help construct the proper star: the dimensions should represent the questions that you ask of the fact. In your case; who, the USER; when, TIME; where, the SYSTEM; and finally what which can be both QUESTION or ANSWER. Notice that you have two whats here, the basis of our quandary. The fact should represent the major measurement or the reason for analysis. Is that reason the QUESTION or the ANSWER? This will depend on whether your system allows ad-hoc QUESTIONS or users are restricted to a finite set of canned or predetermined queries. I recommend that you keep your dimensions to a minimum and your reason for analysis in the fact table.

Chuck Kelley’s Answer: I am not clear from your question what QUERY=1 means. From my best guess at what you are trying to do, you could create a new dimension that would be the equivalent of the QUERY=1 with all the data associated with it and tie that back to your fact table.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access