Completing the 20 criteria for rating your dimensional data warehouse |
|
Is Your Dimensional Data Warehouse Expressive? |
|||||||
|
|||||||||
In the April 28 issue, I proposed 20 criteria for judging a dimensional data warehouse system. (See Table 1) My intent with these criteria is to establish an objective standard for how well a system supports the dimensional view of data warehousing, and to set the bar high so that vendors have a target for improving their systems. Remember that the intended way to use this list is to rate your system on each criterion with a simple 0 or 1. A system qualifies for a 1 only if it meets the full definition of support for a criterion. For example, a system that offers aggregate navigation (the fourth criterion) that is available only to a single front-end tool gets a zero because the aggregate navigation is not open. There is no partial credit, and no creative interpretation of a criterion to make a square peg fit into a round hole.
|
Architecture 1. Explicit declaration 2. Conformed dimensions and facts 3. Dimensional integrity 4. Open aggregate navigation 5. Dimensional symmetry 6. Dimensional scalability 7. Sparsity tolerance Administration 8. Graceful modification 9. Dimensional replication 10. Changed dimension notification 11. Surrogate key administration 12. International consistency Expression 13. Multiple-dimension hierarchies 14. Ragged-dimension hierarchies 15. Multiple-valued dimensions 16. Slowly changing dimensions 17. Roles of a dimension 18. Hot-swappable dimensions 19. On-the-fly fact range dimensions 20. On-the-fly behavior dimensions |
| TABLE 1 The 20 expression criteria for comparing systems.
|
The final eight expression criteria are analytic capabilities that are needed in common real-life situations. The end-user community experiences all expression criteria directly. These expression criteria for dimensional systems are not the only features users look for in a data warehouse, but they are all capabilities that we need to exploit the power of a dimensional system.
Here are the eight expression criteria for dimensional systems:
Multiple-dimension hierarchies. The system allows a single dimension to contain multiple independent hierarchies. No practical limit exists to the number of hierarchies in a single dimension. Hierarchies may be complete (encompassing all the members of a dimension) or partial (encompassing only a select subset of the members of a dimension). Two hierarchies do necessarily have common levels or common attributes (fields), and may have different numbers of levels. Two hierarchies may also share one or more common levels but otherwise have no correlation.
Ragged-dimension hierarchies. The system allows dimension hierarchies of indeterminate depth, such as organization charts and parts explosions, where records in the dimension can play the roles of parents as well as children. Using this terminology, a parent may have any number of children, and these children may have other children, to an arbitrary depth limited only by the number of records in the dimension. A child may have multiple parents, where these parents total ownership of the child is explicitly represented and adds up to 100 percent. With a single command the system must be able to summarize a numeric measure from a fact table (or cube) on a ragged hierarchy for all members:
1. Starting with a specified parent and descending to all the lowest possible levels summarizing all intermediate levels
2. Starting with a specified parent and summarizing only children exactly n levels down from the parent or n levels up from the lowest child of any branch of the hierarchy, where n is equal to or greater than zero
3. Starting with a specified child and summarizing all the parents from that child to the supreme parent in that childs hierarchy
4. Starting with a specified child and summarizing all the parents exactly n levels upward in the hierarchy from that child
5. Starting with a specified child and summarizing only that childs unique supreme parent. A given ragged-dimension hierarchy may contain an arbitrary number of independent families (independent supreme parents with no common children). Conversely, independent supreme parents may share some children as I stated previously when discussing of total ownership.
Multiple-valued dimensions. A single atomic measure in a fact table (or cube) may have multiple members from a dimension associated with that measure. If more than one member from a dimension is associated with a measure, then an explicit allocation factor is provided that optionally lets the numeric measure spread across the dimensions associated members. In such a case, the allocation factors for a given atomic measure and a given multivalued dimension must add up to 100 percent.
Slowly changing dimensions. The system must explicitly support the three basic types of slowly changing dimensions: Type 1, where a changed dimension attribute is overwritten; Type 2, where a changed dimension attribute causes a new dimension member to be created; and Type 3, where a changed dimension attribute causes an alternate attribute to be created so that both the old and new values of the attribute are simultaneously accessible in the same dimension member record. Support for slowly changing dimensions must be systemwide, as the following requirements imply:
Changes to a dimension that invalidate any physically stored aggregate must automatically disqualify that aggregate from use.
A Type 2 change must trigger the automatic assignment of a new surrogate key for the new dimension member, and that key must apply for all concurrent fact records loaded into the system. In other words, the creation of a new Type 2 dimension member must automatically link to the associated concurrent facts without the user or application developer needing to bookkeep beginning and ending effective dates.
If the system supports ragged-hierarchy dimensions and/or multiple-valued dimensions, then these types of dimensions must support all three types of slowly changing dimensions.
Roles of a dimension. A single dimension must be associative with a set of facts via multiple roles. For instance, a set of facts may have several independent timestamps that you can simultaneously apply to the facts. In this case, a single underlying time dimension must be able to attach to these facts multiple times, where each instance is semantically independent. A given set of facts may have several different kinds of dimensions, each playing multiple roles.
Hot-swappable dimensions. The system must allow an alternate instance of a dimension to swap in at query time. For example, if two clients of an investment firm wish to view the same stock market data through their own proprietary stock ticker dimensions, then the two clients must be able to use their versions of the dimension at query time, without requiring the fundamental fact table (or cube) of stock market facts to duplicate. Another example of this capability would let a bank attach an extended account dimension to a specific query if the user restricts the query to a cluster of accounts of the same type.
On-the-fly fact range dimensions. The system provides direct support for dynamic value banding queries on numeric measures in a fact table (or cube). In other words, at query time the user can specify a set of value ranges and use these ranges as the grouping criteria in a query. All the normal summarizing functions (count, sum, min, max, and average) can apply within each group. The sizes of the value bands neednt be equal.
On-the-fly behavior dimensions. The system supports constraining a dimension via a simple list of that dimension. For the sake of vocabulary, call such a list of members a behavior dimension. The support of behavior dimensions must be systemwide, as the following requirements imply:
A behavior dimension can be captured from a report showing on the users screen from: a list of keys or attributes appearing in a file extracted from a production source; directly from a constraint specification; or from a union, intersection, or set difference of other behavior dimensions.
A user may have a library of many behavior dimensions and can attach a behavior dimension to a fact table (or cube) at query time.
The use of a behavior dimension in a query restricts the fact table (or cube) to the members in the study but in no way otherwise limits the ability to select and constrain attributes of any regular dimension, including the one the behavior dimension affects directly.
A behavior dimension may be of unlimited size.
A behavior dimension may have an optional date-stamp associated with each element of the list in such a way that two behavior dimensions can merge so that membership in the combined behavior dimension requires a specific time ordering.
Are You Dimensional?
These eight expression criteria complete the list describing a full-fledged dimensional data warehouse. A system that supported most or all these dimensional criteria would be adaptable, easier to administer, and able to address many real-world applications challenges. The whole point of dimensional systems is that they are business-issue and end-user driven.
I urge you to apply these criteria against your data warehouse to see how it does. Although I am loathe to tinker with the criteria definitions because I dont want to create a moving target, I would be interested in comments that help me make the criteria more understandable and more actionable.
Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including The Data Webhouse Toolkit (Wiley, 2000). Ralph teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach Ralph through www.ralphkimball.com.
|
|
|
|
|











