Making sense of rogue fields in your fact table |
|
Mystery Dimensions |
||||||
|
||||||||
Often, we data warehouse architects drive a fact tables design from a specific data source. A typical complex example might be a set of records describing investment transactions. A recent example I studied had more than 50 fields in the raw data, and the end users assured me that all the data was relevant and valuable.
Because every record in the data represented an investment transaction, and all the investments were somewhat similar, I hoped that the data source would generate only one fact table, where the grain was the individual transaction. But the 50 fields intimidated me. What on earth was all that stuff?
Investment transactions are good examples of complex, messy data. The complexity isnt the database designers fault. These transactions are complex because there are a lot of context parameters and many special parameters describing modern financial investments. When a design challenge such as this confronts me, I try to stand back from the details and perform a kind of triage.
Find the Obvious Dimension-Related Fields
For the first step of triage, I find fields in the source data that are obviously parts of dimensions. Timestamps are straightforward. Maybe four separate timestamps describe our investment transaction. Each of these can be a time dimension, where we ask a single underlying calendar dimension to play four roles. We can accomplish this task by creating four views on the single underlying calendar table. I discussed the details of this approach in my DBMS column Data Warehouse Role Models (August 1997; see Resources).
Other straightforward dimension-related fields in our investment transaction include account numbers, account types, portfolio numbers, transaction types and codes, customer names and numbers, broker names and numbers, and location-specific information. A typical raw source data record is likely to be a kind of flat record containing both keys for these entities as well as descriptive text such as account type and customer name.
In the case of the 50 investment transaction fields Ive described, we can quickly identify no fewer than 20 of the fields as dimension related. We need to place a lot of redundant textual information in conventional dimensions, but after the dust clears, there are still 12 independent dimensions, of which four were roles the time dimension played. (See Figure 1, page 34.)
Find the Fact- Related Fields
The second step of the triage is to look for the numeric measurements. Anything that is a floating-point number or a scaled integer (such as a currency value) is likely to be a measurement. If the value varies in a seemingly random way between records and takes on a very large number of different values, then it is almost surely a measurement.
In the case of the 50 investment transaction fields, 20 of the fields clearly fit the characteristics of measurements. But five of the fields turned out to be cumulative measures that are not appropriate to the grain of an individual transaction. We excuse these five fields from the design and keep the remaining 15 fields, which we model as facts.
Now you may be thinking, what kind of weird transaction could possibly have 15 simultaneous facts? Thats a good question, because none of the source data transaction records actually had all 15 facts. Certain kinds of transactions gave rise to one set of facts and other transactions gave rise to an overlapping, but different set of facts. Most important, there was no disjoint partitioning of all the transactions that would separate the clumps of facts into nice groups. There were many transaction types and many investment account types; the pattern of measurements across these types and accounts was too complex to describe or neatly segment. In this sense, we could vindicate the raw datas design because it had to be flexible enough to handle many different investment transactions situations, including future types of investments that the records had not yet described.
Decide What to Do With the Rest
So far we have accounted for 40 of the 50 fields in our original data. But there are still 10 mystery fields left over. (See Figure 1.) These fields arent obvious textual dimension attributes or obvious foreign key values, so they may not feel like dimensions. The fields do not appear to be numeric measurements. When the fields are present, they seem to take on a small range of discrete values. Some of them are designated as codes, but no one is entirely sure of their significance. At this point, I ask an obligatory, but pointless question: If we dont know what the field means, why dont we leave it out of the design? The answer, of course, is that someone may need it, so we will leave it in. Actually, in spite of this frustrating third step of the triage, we have proceeded correctly. The value of the triage approach is to quickly identify the easy choices (in this case the obvious dimensions and facts), and to isolate a hopefully small subset of difficult data elements that require individual attention.
FIGURE 1 The logical progression of transforming a complex single data source into its
corresponding dimensional model.
Also, perhaps at this point you are thinking that if we had a proper enterprise data model, then all
these problems would have been sorted out and we wouldnt have to pursue such an ad hoc approach.
Well, I couldnt agree more. If an enterprise data model is a model of real data, then I am its
biggest fan. In that case, this article probably describes a specific episode in building that very
useful enterprise data model. But if the enterprise data model describes a kind of abstract, ideal data
world, describing how data should be if only it were designed correctly, then I have very little
patience. Idealized enterprise data models are only of marginal use when we try to take real data and
deliver it to end users on a tight budget and time frame. Idealized enterprise data models arent
populated with data.
Transform Mystery Fields Into Mystery Dimensions
Returning to our problem of 10 rogue fields that seem to be neither dimensions nor facts, we may be tempted to just leave them in the fact table. This is almost certainly a bad idea. Our goal should be to make these fields into dimensions. Many of the codes or alphanumeric fields would otherwise take up too much room, and we could drastically compress them if we could make them into dimensions.
Another easy approach is to just make 10 more dimensions, one for each mystery field. While this does place these low cardinality codes and textual values in dimension tables where we can easily index and constrain upon them, we now have 22 dimensions in our design, and that should raise a warning flag.
More important, many of these fields may be mildly or strongly correlated with each other, even if we are not completely sure what they mean. We need to make a significant effort to find the correlated mystery fields and group them together into a smaller number of new dimensions. Grouping correlated fields together has a couple of attractive benefits. First, it will be interesting to browse the correlated fields against each other. Maybe some of them will turn out to have hierarchical relationships. These relationships can be revealed when the fields are compressed into dimension tables where only the unique combinations are presented. Second, the number of new dimensions required will be reduced.
When Should Two Fields Be in a Single Dimension?
We are almost within reach of our final goal. We have separated off the obvious original dimensions and facts with our triage decisions. We have decided that all that remains is to package the rest of the mystery fields into a few more dimensions.
Should we just make one huge mystery dimension for all these remaining fields? That would seem to solve a number of problems. All the fields would go away to be replaced by a single key. But this approach is likely to produce a dimension with as many records as the fact table itself. If the dimension contains several uncorrelated fields, then there will be very few repeated values for the whole dimension record, and every transaction would produce a new mystery dimension record.
The secret to this last step of the design is to group the mystery fields together into correlated groups. Each of these correlated groups becomes a new dimension. It is wise to be flexible when searching for these correlations. Suppose FieldX has 100 discrete values and FieldY has 1,000 discrete values. The key question is: How many unique FieldX + FieldY combinations exist in the data? If there are exactly 1,000 such combinations, then FieldX is a hierarchical parent of FieldY and they should absolutely be in the same dimension table. If the number of FieldX + FieldY combinations approaches 100,000, then the two fields are virtually independent and we would gain very little by placing them in the same dimension. But the situation is rarely so extreme. The number of FieldX + FieldY combinations might be 5,000 or 10,000. Even this correlation is pretty interesting, and the two fields should be part of the same dimension. To discover this case, you may have to comb the data, counting combinations of values in order to figure out what to do.
Finally, try to keep perspective. If you have five uncorrelated fields, but they each have only three values, then it would be reasonable to package them all in a single mystery dimension. Yes, we end up with the Cartesian product of the fields, but there are only 35 = 243 possible combinations, a small and convenient mystery dimension. Ultimately, you should not be striving for mathematical elegance; rather, you should be making pragmatic packaging decisions that best fit your data and your tools.
Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick
Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him
through his Web page at www.ralphkimball.com.
RESOURCESRalphs DBMS column:www.dbmsmag.com/9708d05.html |
|
|
|
|
|











