Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Digital Library
Subscribe
Home


August 03, 1999, Volume 2 - Number 11

When the changes become too rapid, give these SCDs a breather with some advanced design techniques

When A Slowly Changing Dimension Speeds Up


Ralph Kimball

Slowly changing dimensions are one of the hallmarks of dimensional modeling. They have become so well recognized as a standard data warehousing design problem that they now have their own acronym. Recently, I saw a discussion on the dwlist list server about “SCDs,” and I was wondering what an SCD was. Then I realized they were talking about slowly changing dimensions.

A slowly changing dimension is exactly what it sounds like. Any dimension can be slowly changing. A record in a customer dimension intuitively represents a customer. A record in a product dimension intuitively represents a product. A subtle assumption in all dimensions is that they are independent of time. But what do we do when a customer changes? Or a product changes?

The data warehouse has a responsibility to track history accurately, so we can’t afford to overwrite the prior customer description with the new one. Similarly, we can’t overwrite the old product description. So whenever a dimension record undergoes a change, we issue a new dimension record. This is the heart of the SCD logic, and it has significant implications. The grain of our dimension is no longer the individual customer or the individual product; rather, it’s the individual customer snapshot or the individual product snapshot. Also, we are forced to generalize the dimension key if we haven’t already done so. Obviously, we cannot use the simple customer ID or product ID as the physical key for the dimension. This situation drives the use of surrogate keys, which I have written about extensively in the past.

Now that data warehousers have been using SCDs for several years, what have we learned? What are the refinements of this technique, and what are the limits of this approach? What happens when a slowly changing dimension isn’t slow after all?

Date Stamps in Slowly Changing Dimensions

It certainly feels natural to include a date (or time) stamp in an SCD record. Date stamps represent the moment in time when the current description in the record becomes valid. It is important to realize, however, that the stamp does not participate as a normal constraint on the dimension when it is connected to a fact table. Suppose, for example, we are talking about a product dimension record. The date stamp refers to the moment when the product description became valid, but this stamp may have little to do with the date and time dimensions attached to the main fact table. It’s often incorrect to constrain the date stamp in the dimension record to coordinate with the date and time in the fact table. A company may sell an old product well after a new version has superceded it. In that case, the fact table date has little to do with the date stamp in the dimension.

Nevertheless, the date stamp in the dimension has its own uses, and using such stamps in a dimension is an advanced SCD technique. It is possible, with a little careful SQL, to perform very precise time slicing of the dimension by itself. If, for example, you are asking for all the customer dimension records as of September 13, 1999, at 1:00 p.m., you must ask for all the records that have distinct customer IDs and whose date stamps are the latest ones less than or equal to September 13, 1999, at 1:00 p.m. I’ll leave it to you to work out the SQL. Send an email to my fellow columnist Joe Celko if you have any questions.

Precision time slicing of a dimension using the technique just described is particularly powerful in human resources applications, in which you may want to ask such questions as, “how many employees of a particular job grade did we have at a precise moment in the past?”

SCDs That Aren’t Slowly Changing

SCDs live on a continuum. The records in some product dimensions change less than once per year. A small dimension of less than 100,000 records that seldom change — less than once a year — is an ideal candidate for the SCD approach. But other dimensions, such as customer dimensions or organization entity dimensions, may change several times per year. The worst combination for an SCD approach is a very large dimension (such as a multimillion-row customer dimension) where the average record changes many times per year.

There is no hard and fast rule that says the SCD approach of creating a new dimensional record is no longer practical. It’s just a matter of to what degree. When a dimension table gets too large and is changing with such frequency that administration and query performance are getting ridiculous, then it is time to do something.

But before we throw the baby out with the bath water, let’s remember why the SCD approach is good:

•Keeping the dimension as a flat, denormalized table is desirable because it simplifies the user interface presentation in all query tools, and it allows symmetric cross browsing among all the attributes in the dimension.

•The SCD approach naturally hides the existence of multiple versions of the customer (for example) until the end user asks for an attribute that has changed. This is what the user expects. So if we have two records for a customer, one showing “single” and one showing “married,” reports using all the other unchanged attributes will not be able to tell that there are two underlying records for the customer. Only if the end user uses the marital status field does the existence of two records become apparent.

•The SCD approach remains completely consistent with the basic structure of dimensional modeling, in which the fact table of measurements is surrounded by a set of dimension tables, all connected by a foreign key: primary key relationships. The entity-relation (E/R) modeling approach does not specifically address slowly changing entities.

If a dimension table is changing too rapidly, then one powerful technique is to break off and make one or more new dimensions out of these rapidly changing parts. For example, if we have a large customer dimension in which we have embedded various behavior scores or behavior classifications, we may find that these behavior metrics are driving the dimension crazy. Each customer’s behavior metrics might change every month. This is a pretty obvious example of needing to take these metrics out of the dimension. If the metrics are largely textual (such as premium customer, new customer, chronic product returner, or overdrawn customer) and are usually the basis for constraints and report breaks, then these metrics need to be in their own dimension.

We collect all the combinations of these textual behavior metrics into an abstract “behavior” dimension, make a surrogate key for this new dimension, and attach the new key to each fact table, where we also have a regular dimension key. The compromise, of course, is that in order to associate a customer with his or her behavior labels, you must have a fact table context in which you can expose both customer and behavior keys simultaneously. But the alternative is likely to be unthinkable. The biggest customer dimensions built by national retailers and Web businesses routinely have more than 100 million distinct customers, so you don’t want to multiply this dimension by anything.

If the behavior metrics are not textual and discrete, but are numeric and continuous, then you should place them in a fact table. The kinds of operations we perform on these numeric measurement-like fields are quite different from the constraints and report breaks we do on dimension tables. For numeric, continuously valued facts:

•We trend the facts numerically over time. We may graph them, average them over time, and use predictive techniques to guess the next few measurements we will encounter in the future.

•We count the number of facts falling within various ranges of values. We call this a value banding report.

•We almost never constrain on a specific value, especially if the fact is truly continuously valued.

•We often have a family of similar measures. We may have a shopping behavior score, a credit behavior score, a payment behavior score, a product support behavior score, and a product return behavior score.

If you have a business in which you send your customers a monthly statement, then you already have a fact table tailor-made to hold these numeric behavior scores. Take them out of your rapidly changing SCD, and just append them to your monthly statement fact table if you are comfortable with monthly snapshots of these scores.

If you do not have a reliable, periodic fact record waiting to receive these scores, then consider making a special behavior tracking fact table that will store and publish these behavior scores. A monthly snapshot of every customer that is complete is attractive from an applications point of view because you can profile your entire customer base without having to email Joe Celko.

Whether you have rapidly changing textual attributes or rapidly changing numeric attributes, the techniques discussed in this column will give your big, important, slowly changing dimensions a well-deserved breather. Somehow, somewhere, you have to draw a line between your SCDs and your rapidly changing dimensions (RCDs).



Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at rkimball.com.





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







InformationWeek Business Technology Network
InformationWeekInformationWeek 500InformationWeek 500 ConferenceInformationWeek AnalyticsInformationWeek CIO
InformationWeek EventsInformationWeek ReportsInformationWeek MagazinebMightyByte and SwitchDark Reading
Digital LibraryIntelligent EnterpriseInternet EvolutionNetwork ComputingNo JitterPlug Into The Cloud
space
Techweb Events Network
InteropVoiceConWeb 2.0 ExpoWeb 2.0 SummitEnterprise 2.0 ConferenceMobile Business ExpoSoftware ConferenceCSI - Computer Security Institute
Black HatGTECEnergy CampMashup CampStartup Camp
space
Light Reading Communications Network
Light ReadingLight Reading EuropeUnstrungLight Reading's Cable Digital NewsConstantinopleInternet EvolutionPyramid Research
Heavy ReadingLight Reading Live!Light Reading InsiderEthernet ExpoOptical ExpoTeleco TVTower Technology Summit
space
Financial Technology Network
Advanced TradingBank Systems & TechnologyInsurance & TechnologyWall Street & TechnologyAccelerating Wall StreetBank Systems & Technology Executive SummitBuyside Trading SummitInsurance & Technology Executive Summit
space
Microsoft Technology Network
MSDN MagazineTechNetThe Architecture Journal
space