Describing the Past Accurately: Data Warehouse Designers Unique Challenge |
|
Many Alternate Realities |
|||||
|
|||||||
Data warehouse designers take an implicit oath to represent the past correctly. We promise our users that if the definition of a customer or product changes slowly over time, we will be very careful to keep the old definitions in the database and apply them to history correctly. If we work for an insurance company and we are looking backward in time to see why we approved insurance for a particular cluster of customers, we must have the correct descriptions of those customers at the moments in the past when we approved their insurance, not the descriptions that are valid in the present. These customers are now older and presumably wealthier, wiser, and have larger families. We dont want these revised descriptions when we try to evaluate an old decision. Data warehouse designers have long recognized the need for describing the past accurately. In the dimensional data warehouse, the world divides into fact tables and dimension tables. Fact table data, by its nature, represents a time series of measurements and is always augmented with an explicit time dimension. Finding old fact-table data is easy and is one of the standard queries in the data warehouse. Just constrain the time dimension to the appropriate previous time interval.
But dimension-table data requires more thought. Dimensions dont change in predictable ways. Individual customers and products evolve slowly and episodically. Some of the changes are true physical changes. Customers change their addresses because they move. A product is manufactured with different packaging. Other changes are actually corrections of mistakes in the data. And finally, some changes are changes in how we label a product or customer and are more a matter of opinion than physical reality. We call all these variations slowly changing dimensions (SCDs).
Over the last decade, data warehouse designers have sorted out three major approaches to SCDs. We call these Type 1, Type 2, and Type 3. Briefly:
A Type 1 SCD is an overwrite of a dimensional attribute. History is definitely lost. We overwrite when we are correcting an error in the data or when we truly dont want to save history.
A Type 2 SCD creates a new dimension record and requires a generalized or surrogate key for the dimension. We create surrogate keys when a true physical change occurs in a dimension entity at a specific point in time, such as the customer address change or the product packing change. We often add a timestamp and a reason code in the dimension record to precisely describe the change.
A Type 3 SCD adds a new field in the dimension record but does not create a new record. We add a new field when we have a new label for the customer or the product that is a matter of opinion rather than physical reality. Maybe we change the designation of the customers sales territory because we redraw the sales territory map, or we arbitrarily change the category of the product from Confectionery to Candy. In both cases, we augment the original dimension attribute with an old attribute so we can switch between these alternate realities. Type 3 is distinguished from Type 2 because we could regard both the old and new descriptions of a Type 3 change as true simultaneously. In the product category example, the addition of an Old Category field gives us exactly one alternate category reality to switch to.
These three types of slowly changing dimensions handle most of the situations faced by the data warehouse designer. But a few weird situations remain. Several times in my design consulting I have encountered a situation that seems to be a hybrid of Type 2 and Type 3. I call this situation many alternate realities.
Predictable Multiple Realities
Consider a situation where a sales organization continually revises the map of its sales districts. Perhaps it adjusts its sales districts each year to try to adapt to changing market conditions. Over a 10-year period the sales organization accumulates no fewer than 10 different maps. On the surface, this organization would seem like a candidate for the Type 2 slowly changing dimension. But as the database designer, you discover during the user interviews that this sales organization has a much more complex set of requirements. It wants to:
Report each years sales using the approved district map for that year
Report each years sales using a district map from an arbitrary different year
Report an arbitrary span of years sales using a single district map from any chosen year.
The most common version of the third requirement would be to report the complete span of years using todays map.
You cannot serve this set of requirements with the Type 2 model because Type 2 perfectly partitions history and a year can only be reported using its assigned unique map. The requirements cannot be met with a Type 3 model because Type 3 only allows a single alternate reality, and in this case, we have 10 alternate realities.
In this example, we can take advantage of the regular nature of these 10 alternate realities by generalizing the Type 3 model to have not one, but 10, versions of the District attribute for each sales team. The sales team dimension would then look something like:
Sales Team Key
Sales Team Name
Sales Team Physical Address (stays constant)
District1999 (the district assignment for the team in 1999)
District1998 (the district assignment for the team in 1998)
District1997
District1996
District1995
District1994
District1993
District1992
Distric t1991
District1990
plus other unrelated sales team attributes.
Each sales team record would have all 10 district interpretations, and the end user could choose to roll up all sales teams with any of the 10 district maps. In this design, there is one record for each sales team, and the addition of more realities merely requires the addition of more district attributes in the original records.
Unpredictable Multiple Realities
Now lets make the problem harder. Suppose that the assignment of each sales team to a district is not synchronized to calendar years but occurs at random and unpredictable times, and each sales team is different. Now we will restate the reporting requirements a little. The users want to:
Report district sales at any past instant in time using the assignments that were valid at that instant
Report all sales over all time using whatever todays district map may be
Report all sales over all time using a selected obsolete district map.
This design requires a hybrid of Type 2 and Type 3. We issue a new record for a sales team whenever its district assignment changes, but we also carry along a Current District attribute in all the versions of each sales team record that is overwritten whenever the current district map is altered. The sales team record now looks like:
Sales Team Key
Sales Team Name
Sales Physical Address (remains unchanged)
District (the district assignment valid between the following dates)
Begin Effective Date (the first date this record is valid)
End Effective Date (the last date this record is valid)
Obsolete District1 (a selected obsolete definition)
Obsolete District2 (a different obsolete definition)
Current District (the most current district assignment; periodically overwritten)
plus other unrelated sales team attributes.
We administer this design differently. When a sales team is deemed to be part of a new district, we issue a new record for that sales team. This is a straightforward Type 2 SCD response. We bookkeep the Begin and End dates correctly. We assume here that the End Effective Date for one assignment of a district is exactly one day less than the Begin Effective Date of the next assignment. We keep the same value in a given obsolete definition across all the records that describe a specific sales team. Finally, we also sweep back through all previous instances of this sales teams records and overwrite the Current District attribute. The Obsolete and Current attributes are a variation of the Type 3 SCD. Now we can meet all of the reporting objectives given to us by the end users. When we report all district sales at a particular instant in time, we must constrain the query in the following way:
Reporting_date >= Begin_effective_date and
Reporting_date <= End_effective_date.
You might have been tempted to write:
Reporting_date BETWEEN Begin_ effective_date and End_effective_date
but amazingly, this code is not SQL! SQL allows field-between-values, not value-between-fields. Fortunately, Oracle and other DBMSs do support nonstandard extensions to SQL to let you use this intuitive construction without getting scolded by the syntax checker.
Complex Political and Geographic Dimensions
Ive described this last example in terms of a sales organization in order to make the explanation as clear as possible. A few sales organizations actually are managed this way, especially when the definition of sales districts are confounded with the definition of sales channels.
An interesting real-world example of this hybrid Type 2/3 dimension is a dimension describing geographic entities over a period of time. Imagine a map of Europe dating from about 1500. On this map we identify hundreds of small city states, districts, and portions of countries. Each of these small geographic entities is made into a record. Then we administer this geographic entity dimension in the same way as the last sales organization example. This way, we can assemble the little geographic entities into countries at any point in time from 1500 to the year 2000. By constraining to a particular reporting date, we can instantly snap in a logical map of Europe at that time. By using the Obsolete fields, we can track the Hapsburg Empire from its origins in 1500 to the modern day, even though it does not exist today. By using the Current field, we can track the contemporary definition of the European Union similarly.
We can even augment the geographic entity record with a field that links either to a simple JPEG file that is the correct Europe map, or links to a geograhic information system (GIS) representation of that same map. To make this work best, we would need a contiguous set of maps spanning 1500 to 2000 for all the geopolitical changes that occurred.
This kind of geographic dimension is personally fascinating to me because of the complexity of representing the complicated changes in the definitions of states and countries over the period from 1500 to 2000. If this interests you as well, then check out a wonderful atlas devoted to this perspective with dozens of historical maps called The Times Atlas of World History (Hammond, 1989).
Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick
Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him
through his Web page at www.ralphkimball.com.
|
|
|
|
|











