Asymmetric AggregationWhat to do if your data mart isn't a shiny dimensional diamondBy Gabriel Tanase Continued from Page 1 If the external feed E is manually produced and subject to repeated amendments until declared ready, it is likely that those manual changes must be logged for approval and audit purposes. Figure 1 presents a structure enabling such a process. For each aggregated table (AGG_Final) where the final measure values are stored, you need three other tables to store the same measure set at different steps in the amendment process:
This structure's design enables the application of constant amendments to aggregated measures over more than one time period. Supposing that the target aggregated table is at a week time grain; this structure would enable the entry of one set of amendment values applicable for several weeks. This could be necessary when estimates in lieu of missing data must to be kept constant over several weeks, until real data comes in. In this case users would not want to reenter the same values each week. The system loads data mart aggregates from available transactions into the AGG_ Sys table. This table has key columns (symbolized as Key_S in the figure) per its specific combination of dimensions and their grains. The summary-level data that is available only via manual entry that is, the amendments (at the same grain as in AGG_Sys) is initially stored in the AGG_Manual table via a custom user interface. The set of key columns for the AGG_Manual table must include, in addition to the keys in Key_S (except the time key) the following:
Once an amendment set is flagged as "Ready/approved," a process will clone it into the AGG_Applicable table in as many copies as time periods to which the set is to be applied, differentiated only by the time key. The AGG_Applicable table has exactly the same columns in Key_A as AGG_Sys and AGG_Final do in Key_S and Key_F, respectively. At the end of each relevant time period, a subprocess of the extraction, transformation, and loading batch would simply add up the manually entered measures held in AGG_Applicable to the system-aggregated measures held in AGG_Sys, storing the results into AGG_Final. If there were no data in AGG_Applicable for a certain time period, the values in AGG_Final would be identical with the system-aggregated values from AGG_Sys. Question: If some aggregations are nonlinear complex calculations, how should their parameters be stored? Should they be fully and exclusively included in the dimensional database design or kept completely separate in a fully normalized area of the data mart? Answer: No surprises here, the answer again being, "It depends." First, you generalize Figure 1 by allowing the results of the nonlinear aggregation process to enter your data mart design at the same place as the manual aggregations discussed in the previous section. But there must be additional underlying tables (not shown in the figure) designed to store parameters that drive the nonlinear calculations. Again, you probably should populate these tables via a custom user interface. These driver tables will have to support versioning, including timestamps and an audit of changes. If there are relationships that have to be enforced among calculation parameter entities, then a third-normal-form, history-aware model may be the best choice. However, it is likely that some of these parameters that drive the nonlinear aggregations, especially non-numeric ones (such as flags and codes), will also happen to be attributes of some dimensions. Even if this were not the case, it would still be wise to collect them all into a "Calculation Parameters" dimension so that users can analyze the results while having access to underlying parameters in the familiar dimensional framework. I have tried in these two articles to describe some typical issues that arise in complex data mart environments. The asymmetric aggregate architecture illustrated in Figure 1 is flexible. You should be able to support all three usage styles listed at the beginning of this column, and at the same time allow linear aggregations and nonlinear aggregations to coexist in a way that the end user can understand. Gabriel Tanase [gabriel@gabrieltanase.com] is a system designer based in Ireland. He has worked on several business intelligence projects for a leading European insurance provider.
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
| |||||||||||||||||||||||||||||||





















