CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





October 4, 2001



Aggregate Improvements

Are you taking full advantage of the ability to presummarize your data?

By Lawrence Corr

"The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records ... in some cases speeding queries by a factor of 100 or even 1,000. No other means exist to harvest such spectacular gains."

Those are Ralph Kimball's words from "Aggregate Navigation With (Almost) No Metadata" (DBMS magazine, August 1996). If the results are so spectacular - and I don't hear anyone arguing they aren't - why then is aggregation so underused?

First, I believe that the answer lies in our relational database culture or folklore. We were all taught not to redundantly store what could be calculated. This, by the way, is a restriction that users of multidimensional databases have happily ignored. Second, many of us are not clear on what constitutes a good set of aggregates.

One way to get more comfortable with aggregate tables, and to see how mandatory they are, is to think of them as indexes. We would not dream of implementing any significant OLTP or data warehouse database without indexes. These traditional indexes usually duplicate the information content of indexed columns, yet we don't disparage this duplication as "redundancy," because of the benefits.

Traditional indexes get us very quickly to a small number of qualifying records - enough to fill a screen in an OLTP application or satisfy an operational report. However, in the data warehouse, the vast majority of queries ask for millions of records to be summarized, in many cases causing the database to bypass indexes altogether and perform table scans. To speed up these queries, we need another kind of index that can quickly and logically get us to millions of records. I say "logically" because we need only the summarized results of these millions of fact records, as no one but the data miner wants to see the individual records.

Fresh Perspective

Once we start thinking of aggregates as summary indexes we get two tremendous benefits:

First, we stop talking about data marts as containers of only departmental, summarized data and the data warehouse as (the slow) holder of atomic, detailed transaction data. Instead, we naturally think of the summary indexes (aggregates) belonging in the same database as the low-level atomic data that is indexed. We can then reclaim the term data mart to more usefully describe the logically complete building block that we use to incrementally deliver high-performance enterprise data warehouses one subject area at a time.

Second, we become very clear about how aggregates should be used. We don't expect to tell the database to use a particular index when we construct a query, barring the occasional use of hints in the hand-tuned SQL within production systems. What we do expect is for the database optimizer to quietly choose the best combination of indexes behind the scenes. Such transparent use of indexes allows a DBA to drop unused indexes and rebuild better ones.

Ideally, aggregates should be used and administered in exactly the same way. We should not point our queries directly at aggregates. Instead, we should take advantage of aggregate navigation that automatically rewrites queries to access the best presently available aggregates. For a detailed description of how aggregate navigation might work, take a look at Ralph Kimball's aforementioned article. We can now view aggregation navigation as another form of optimization that should be offered by database query optimizers or intelligent middleware.

If we continue to view data marts as mere departmental summaries, we fall into the trap of hard-coding access to these aggregates in queries written against the specific marts; it then becomes unclear how they might drill through automatically to detail data in a different structure in a more atomic database. This obsolete take on data marts is an example of design that hasn't kept pace with technology as aggregate navigation has become more widely available.






IE Weekly Newsletter
Subscribe to the newsletter
    Email Address