|
March 30, 1999 Volume 2 Number 5
Fundamental GrainsAlmost all data warehouses sit on one of three bedrock designsBy Ralph Kimball
Our duty as data warehouse designers is to publish these measurements for all the managers and analysts in our organization so they can make sense of the pattern of transactions. Ultimately, revenue, costs, profit, and much of the important customer behavior should be understandable from this series of measurements.
Having made this tidy argument for everything consisting of measured transactions, I might have you wondering: If transactions drive everything, has data warehousing been reduced to storage and performance issues? Even better, maybe we can just defer many of the hard data warehouse implementation issues because, we think, technology will take care of us in a year or two. As compelling as the transactional view of the world is, theres more to life than transactions. Although we can always reconstruct a complete picture of a business from the transaction history, it is often grossly impractical to do so. Often we need to augment the transactions with an alternate view of the world: the snapshot.
A snapshot is a measurement of status at a specific point in time. In the data warehouse world, there are two different snapshots of status, both equally important. The periodic snapshot is a regular, predictable statement of status. The monthly snapshot, the most common type, may contain a whole set of measurements. Some of these snapshots are instantaneous measurements at the end of the period, and some are measurements accumulated throughout the period. The design of this snapshot focuses on the activity that occurred during the time interval between snapshots. The other important kind of snapshot is the accumulating snapshot, which shows us the status at any given moment. If we ask for the current status an hour from now, it may be different. The accumulating snapshot often combines the most recent volatile status with measures that accumulate from the beginning of history. The three kinds of measurements transaction, periodic snapshot, and accumulating snapshot are the choices for the grain of any given fact table. All three are useful, and you often need a minimum of two to get a complete picture of a business. In Figure 2, Ive lined up all three next to each other to point out their similarities and differences. The first thing to note is that many of the dimensions (indicated by the key fields in Figure 2) are the same across the three fact tables. Whenever the same key is listed in separate fact tables, we insist that the definitions of the keys are exactly the same and that they point to exactly the same dimension tables. Ive discussed this requirement for conformed dimensions many times in this column, and it is the secret behind building separate data marts that you can use together and that arent isolated, incompatible stovepipes. Although the dimensions are similar, the administration and rhythms of the three tables are different. The transaction-grained fact table represents an atomic action that occurs at an instantaneous point in time. There is often no guarantee that a record for a given policy or customer exists in the transaction fact table; a record exists only if a transaction occurs. Conversely, there is no upper limit to the number of records for a given policy or customer. The date/timestamp on the transaction record may be accurate to the individual day, or it may be accurate to the hour and minute. Normally there is only a single fact generically called amount. Its meaning depends on the value of the transaction key. After posting it, we dont usually revisit a transaction-grained fact record for any type of update. The periodic-snapshot-grained fact table represents a predefined time span. There is often a strong guarantee that a record exists in the fact table as long as the policy is in effect or as long as any claims against the policy are open. Usually, exactly one record exists for each combination of significant keys. The timestamp on the periodic snapshot is only the name of the period and usually denotes the end of the period. Periodic snapshots can have any number of facts, depending on what measures are possible or useful to calculate. Some of these facts may be extraordinarily difficult to calculate directly from the transactions. In Figure 2, earned premium is the fraction of the total policy premium that the insurance company can book as revenue during the particular reporting period. The calculation of earned premium may be very complex. In such cases, it is often better to let the OLTP system provide this calculation as part of its production-reporting capability, and to treat this result as data simply to be loaded into the data warehouse. Finally, once its posted, we do not normally revisit a periodic-snapshot fact record for any type of update unless we decide to add a completely new periodic-snapshot fact to the entire table. The accumulating-snapshot-grained fact table represents an indeterminate time span, covering the entire history starting when the collision coverage was created for the car in our example and ending with the present moment. There is a strong guarantee that a single record exists in the fact table if the customers car was ever covered. Accumulating-snapshot fact tables almost always have multiple time stamps. One time stamp represents the last time the snapshot was updated, and the others represent generic or predictable events that may have taken place in the life of the coverage. Note that some of these time stamps may have to handle null values because, for example, the first_claim_date will not have a real value unless a claim has been filed. The correct treatment of null date keys requires using integer-valued surrogate keys. Accumulating snapshots, such as the periodic ones, can have any number of facts depending on what measures are possible or useful to calculate. In dramatic contrast to the other fact-table types, we frequently revisit accumulating-snapshot fact records to update the facts. Remember that in this table there is generally only one fact record for the collision coverage on a particular customers car. As history unfolds, we must revisit the same record several times to revise the accumulating status.
The transaction-grained fact table is the only table that can answer detailed questions about timing and questions about unpredictable, repetitive behavior. In our insurance example, the transaction-grained fact table can tell us the length of time between an agents preliminary approval of a coverage and the underwriters final signoff. This time period is a useful measure of the efficiency of our insurance policy processing. Other transaction-grained fact tables can tell us about customer behavior. The periodic-snapshot fact table is often the only table that can easily generate a regular, predictable view of the important measures of a business. Calculating basic measures, such as revenue and cost, may be impractical if starting from any sort of transactional view of the business. So we publish such measures only once each month at a low level of granularity and let all the companys managers and analysts roll these measures upward into different combinations from our periodic-snapshot table. The accumulating-snapshot fact table is appealing when we are tracking items that have a finite lifetime, such as insurance policy coverages or line items on orders. If the coverage or the order line item exists for a finite period of time, there is a flurry of updating activity on those records when they are open, and then the record quiets down. It remains in the database as a summary of the activity surrounding that coverage or line item. The advantage of the accumulating snapshot is that we can do a great deal of useful reporting without complex constraints on the time dimension and without double counting across the multiple records of a periodic snapshot. It would be rare to implement all three fact tables for a single business process. In my experience, a fact table representing one of the two snapshot styles is the first and most urgently needed. This generally keeps the boss happy because revenue and cost summaries often fall out of either style of snapshot table. We can add a transaction-grained table to the data warehouse later, when we need detailed analyses of timing and behavior. In some cases, when both a periodic- snapshot table and a transaction-grained table are available, we can calculate cumulative measures at arbitrary times during the month by starting with the snapshot and rolling forward with the transactions. Although there is some theoretical information redundancy between the transaction and snapshot tables, we dont object to such redundancy as data warehouse providers. Our mission is to publish data effectively for the rest of our organization, and these separate types of fact tables speak in their own ways to different audiences.
Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at www.ralphkimball.com. |
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
| ||||||||||||||||||||||||||||||||









