Aggregate ImprovementsAre you taking full advantage of the ability to presummarize your data?By Lawrence Corr Continued from Page 1 Full AdvantageOnce we start to see aggregates in their true light, we can start planning to take full advantage of them. Although aggregates can profoundly improve query performance, their advantage must be balanced against the cost of storing, building, maintaining, and administering them. The legendary retail data warehouse that collapsed under the weight of more than 2,500 aggregate tables and that took more than 24 hours to refresh is a particularly dramatic example of imbalance. Guidelines can help decide the right number of aggregates for each situation. Set an overall aggregate storage limit. Consider setting aside a reasonable amount of storage for aggregates, increasing the overall storage by a factor of two. This limit initially might seem high, but we are willing to consider this kind of overhead for traditional indexes, so why not for summary indexes? Significantly undershooting this figure leads to more money spent on additional processing power and memory to speed up queries. Significantly overshooting it leads to diminishing returns from your storage investment because you lack the time to maintain so many additional tables. We must let the DBA provide a dynamic portfolio of aggregates that can be adjusted to match changing demands as the data warehouse matures. Define aggregates that are small enough. Each aggregate should be 10 to 20 times smaller than the fact table or existing aggregate on which it is based, while still containing enough information to answer common queries. An aggregate's size determines its performance improvement; in the final analysis, it is the reduced I/O activity that provides the performance gain. If the aggregate table is only two or three times smaller than its base table, the limited improvement will not justify the cost of maintaining it. Unfortunately, it is very easy to define aggregates that are much larger than you anticipate. How many times smaller would a monthly product sales aggregate be than a daily product sales fact table? If your answer is 30 times, you are forgiven but you are likely to be wrong. The reason is "sparsity failure." Is every product sold every day? Probably not. Is every product sold at least once a month? Much more likely (or we stop offering that product), in which case the monthly aggregate may not even be 10 times smaller. Spread aggregates. Our goal should be to accelerate a broad range of queries, including a few that will be 1,000 times faster. To meet this goal we need to spread our aggregates. We don't just want 10 aggregates that offer a 10-fold improvement; we need a few relatively large aggregate tables that offer 10X improvement and a larger number of small tables that are 100 or 1,000 times faster. It is useful to plot the performance improvement factors of your aggregates against the amount of space used. To visualize aggregate spread, I use an Excel pivot table, so I can see the aggregate space used for the whole warehouse or drill down on specific fact tables. (See Figure 1.) This example shows a poor spread - with less than half the allocated space for aggregates in use - and of that, far too much taken by aggregates that are less than 10 times faster (smaller) than the base fact tables. I need to work on increasing the two green slices. Think of aggregates as indexes. Planning a good spread helps enormously, but in many instances aggregates are still not as easy to define or maintain as traditional indexes, even with the latest DBMS features. The good news is that if you have adopted dimensional models for your atomic detail data, it becomes easier to recognize the opportunities to use aggregates and to then describe them. Guest columnist Lawrence Corr [lcorr@decisionone.co.uk] designs large data warehouses through DecisionOne Consulting, based in the United Kingdom.
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
| ||||||||||||||||||||||||||||||||









