August 03, 1999, Volume 2 - Number 11
When the changes become too rapid, give these SCDs a breather with some advanced design
techniques
When A Slowly Changing Dimension Speeds Up
Ralph Kimball
Slowly changing dimensions are one of the hallmarks of dimensional modeling. They have become
so well recognized as a standard data warehousing design problem that they now have their own acronym.
Recently, I saw a discussion on the dwlist list server about SCDs, and I was wondering
what an SCD was. Then I realized they were talking about slowly changing dimensions.
A slowly changing dimension is exactly what it sounds like. Any dimension can be slowly changing. A
record in a customer dimension intuitively represents a customer. A record in a product dimension
intuitively represents a product. A subtle assumption in all dimensions is that they are independent of
time. But what do we do when a customer changes? Or a product changes?
The data warehouse has a responsibility to track history accurately, so we cant afford to overwrite
the prior customer description with the new one. Similarly, we cant overwrite the old product
description. So whenever a dimension record undergoes a change, we issue a new dimension
record. This is the heart of the SCD logic, and it has significant implications. The grain of our
dimension is no longer the individual customer or the individual product; rather, its the
individual customer snapshot or the individual product snapshot. Also, we are forced to generalize the
dimension key if we havent already done so. Obviously, we cannot use the simple customer ID or
product ID as the physical key for the dimension. This situation drives the use of surrogate
keys, which I have written about extensively in the past.
Now that data warehousers have been using SCDs for several years, what have we learned? What are the
refinements of this technique, and what are the limits of this approach? What happens when a slowly
changing dimension isnt slow after all?
Date Stamps in Slowly Changing Dimensions
It certainly feels natural to include a date (or time) stamp in an SCD record. Date stamps represent the
moment in time when the current description in the record becomes valid. It is important to realize,
however, that the stamp does not participate as a normal constraint on the dimension when it is connected
to a fact table. Suppose, for example, we are talking about a product dimension record. The date stamp
refers to the moment when the product description became valid, but this stamp may have little to do with
the date and time dimensions attached to the main fact table. Its often incorrect to constrain the
date stamp in the dimension record to coordinate with the date and time in the fact table. A company may
sell an old product well after a new version has superceded it. In that case, the fact table date has
little to do with the date stamp in the dimension.
Nevertheless, the date stamp in the dimension has
its own uses, and using such stamps in a dimension is an advanced SCD technique. It is possible, with a
little careful SQL, to perform very precise time slicing of the dimension by itself. If, for example, you
are asking for all the customer dimension records as of September 13, 1999, at 1:00 p.m., you must ask
for all the records that have distinct customer IDs and whose date stamps are the latest ones less than
or equal to September 13, 1999, at 1:00 p.m. Ill leave it to you to work out the SQL. Send an email
to my fellow columnist Joe Celko if you have any questions.
Precision time slicing of a dimension using the technique just described is particularly powerful in
human resources applications, in which you may want to ask such questions as, how many employees of
a particular job grade did we have at a precise moment in the past?
SCDs That Arent Slowly Changing
SCDs live on a continuum. The records in some product dimensions change less than once per year. A small
dimension of less than 100,000 records that seldom change less than once a year is an ideal
candidate for the SCD approach. But other dimensions, such as customer dimensions or organization entity
dimensions, may change several times per year. The worst combination for an SCD approach is a very large
dimension (such as a multimillion-row customer dimension) where the average record changes many times per
year.
There is no hard and fast rule that says the SCD approach of creating a new dimensional record is no
longer practical. Its just a matter of to what degree. When a dimension table gets too large and is
changing with such frequency that administration and query performance are getting ridiculous, then it is
time to do something.
But before we throw the baby out with the bath water, lets remember why the SCD approach is
good:
Keeping the dimension as a flat, denormalized table is desirable because it simplifies the
user interface presentation in all query tools, and it allows symmetric cross browsing among all the
attributes in the dimension.
The SCD approach naturally hides the existence of multiple versions of the customer (for example)
until the end user asks for an attribute that has changed. This is what the user expects. So if we have
two records for a customer, one showing single and one showing married, reports
using all the other unchanged attributes will not be able to tell that there are two underlying records
for the customer. Only if the end user uses the marital status field does the existence of two records
become apparent.
The SCD approach remains completely consistent with the basic structure of dimensional modeling, in
which the fact table of measurements is surrounded by a set of dimension tables, all connected by a
foreign key: primary key relationships. The entity-relation (E/R) modeling approach does not specifically
address slowly changing entities.
If a dimension table is changing too rapidly, then one powerful technique is to break off and make one or
more new dimensions out of these rapidly changing parts. For example, if we have a large customer
dimension in which we have embedded various behavior scores or behavior classifications, we may find that
these behavior metrics are driving the dimension crazy. Each customers behavior metrics might
change every month. This is a pretty obvious example of needing to take these metrics out of the
dimension. If the metrics are largely textual (such as premium customer, new customer, chronic product
returner, or overdrawn customer) and are usually the basis for constraints and report breaks, then these
metrics need to be in their own dimension.
We collect all the combinations of these textual behavior metrics into an abstract behavior
dimension, make a surrogate key for this new dimension, and attach the new key to each fact table, where
we also have a regular dimension key. The compromise, of course, is that in order to associate a customer
with his or her behavior labels, you must have a fact table context in which you can expose both customer
and behavior keys simultaneously. But the alternative is likely to be unthinkable. The biggest customer
dimensions built by national retailers and Web businesses routinely have more than 100 million distinct
customers, so you dont want to multiply this dimension by anything.
If the behavior metrics are not textual and discrete, but are numeric and continuous, then you should
place them in a fact table. The kinds of operations we perform on these numeric measurement-like fields
are quite different from the constraints and report breaks we do on dimension tables. For numeric,
continuously valued facts:
We trend the facts numerically over time. We may graph them, average them over time, and use
predictive techniques to guess the next few measurements we will encounter in the future.
We count the number of facts falling within various ranges of values. We call this a value banding
report.
We almost never constrain on a specific value, especially if the fact is truly
continuously valued.
We often have a family of similar measures. We may have a shopping behavior score, a credit
behavior score, a payment behavior score, a product support behavior score, and a product return behavior
score.
If you have a business in which you send your customers a monthly statement, then you already have a fact
table tailor-made to hold these numeric behavior scores. Take them out of your rapidly changing SCD, and
just append them to your monthly statement fact table if you are comfortable with monthly snapshots of
these scores.
If you do not have a reliable, periodic fact record waiting to receive these scores, then consider making
a special behavior tracking fact table that will store and publish these behavior scores. A monthly
snapshot of every customer that is complete is attractive from an applications point of view because you
can profile your entire customer base without having to email Joe Celko.
Whether you have rapidly changing textual attributes or rapidly changing numeric attributes, the
techniques discussed in this column will give your big, important, slowly changing dimensions a
well-deserved breather. Somehow, somewhere, you have to draw a line between your SCDs and your rapidly
changing dimensions (RCDs).
Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick
Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him
through his Web page at rkimball.com.