Fact Tables and Dimension TablesThe logical foundation of dimensional modelingContinued from Page 1 Logically, a fact table doesn't need a primary key because, depending on the information available, two different legitimate observations could be represented identically. Practically speaking, this is a terrible idea because normal SQL makes it very hard to select one of the records without selecting the other. It would also be hard to check data quality if multiple records were indistinguishable from each other. Relating the Two Modeling WorldsDimensional models are full-fledged relational models, where the fact table is in third normal form and the dimension tables are in second normal form, confusingly referred to as denormalized. Remember that the chief difference between second and third normal forms is that repeated entries are removed from a second normal form table and placed in their own "snowflake." Thus the act of removing the context from a fact record and creating dimension tables places the fact table in third normal form. I resist the urge to further snowflake the dimension tables and am content to leave them in flat second normal form because the flat tables are much more efficient to query. In particular, dimension attributes with many repeated values are perfect targets for bitmap indexes. Snowflaking a dimension into third normal form, while not incorrect, destroys the ability to use bitmap indexes and increases the user-perceived complexity of the design. Remember that in the presentation system in the data warehouse, you don't have to worry about enforcing many-to-one data rules in the physical table design by demanding snowflaked dimensions. The staging system has already enforced those rules. Declaring the GrainAlthough theoretically any mixture of measured facts could be shoehorned into a single dimension table, a proper dimensional design allows only facts of a uniform grain (the same dimensionality) to coexist in a single fact table. Uniform grain guarantees that all the dimensions are used with all the fact records (keeping in mind the No Store example), and it greatly reduces the possibility of application errors due to combining data at different grains. For example, it's usually meaningless to blithely add daily data to yearly data. When you have facts at two different grains, you place the facts in separate tables. Additive FactsAt the heart of every fact table is the list of facts that represent the measurements. Because most fact tables are huge, with millions or even billions of rows, you almost never fetch a single record into your answer set. Rather, you fetch a very large number of records, which you compress into digestible form by adding, counting, averaging, or taking the min or max. But for practical purposes, the most common choice, by far, is adding. Applications are simpler if they store facts in an additive format as often as possible. Thus, in the grocery example, you don't need to store the unit price. You merely compute the unit price by dividing the dollar sales by the unit sales whenever necessary. Some facts, like bank balances and inventory levels, represent intensities that are awkward to express in an additive format. You can treat these semiadditive facts as if they were additive but just before presenting the results to the end user, divide the answer by the number of time periods to get the right result. This technique is called averaging over time. Some perfectly good fact tables represent measurements that have no facts! This kind of measurements is often called an event. The classic example of such a factless fact table is a record representing a student attending a class on a specific day. The dimensions are Day, Student, Professor, Course, and Location, but there are no obvious numeric facts. The tuition paid and grade received are good facts but not at the grain of the daily attendance. Degenerate DimensionsIn many modeling situations where the grain is a child, the natural key of the parent winds up as an orphan in the design. In the grocery example, the grain is the line item on a sales ticket, but the ticket number is the natural key of the parent ticket. Because you have systematically stripped off all the ticket context as dimensions, the ticket number is left exposed without any attributes of its own. You model this reality by placing the ticket number by itself, right in the fact table. We call this key a degenerate dimension. The ticket number is useful because it's the glue that holds the child records together. In the next issue, the sixth column in this Fundamentals series will detail the latest thinking on how to handle slowly changing dimensions. Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including The Data Warehouse Toolkit, Second Edition (Wiley, 2002). He teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach him through his Web site, www.ralphkimball.com. RESOURCESThis is Ralph Kimball's fifth column in his Fundamentals series. The previous four are: Part 1: "An Engineer's View," July 26, 2002 Part 2: "Design Constraints and Unavoidable Realities," Sept. 3, 2002 Part 3: "Two Powerful Ideas," Sept. 17, 2002 Part 4: "Divide and Conquer," Oct. 30, 2002
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|











