Realtime PartitionsBuild a special extension of each fact table to complement your static data warehouseContinued from Page 1 Attach the realtime partition to your existing applications by drilling across from the static fact table to the realtime partition. Time series aggregations (for example, all sales for the current month) will need to send identical queries to the two fact tables and add them together. In a relatively large retail environment experiencing 10 million transactions per day, the static fact table would be pretty big. Assuming that each transaction grain record is 40 bytes wide (seven dimensions plus three facts, all packed into four byte fields), you accumulate 400MB of data each day. Over a year, this would amount to about 150GB of raw data. Such a fact table would be heavily indexed and supported by aggregates. But the daily slice of 400MB (the realtime partition) could be pinned in memory. Forget indexes, except maybe a B-tree index on the primary key to support rapid data loading! Forget aggregations! Your realtime partition can remain biased toward very fast loading performance but at the same time provide speedy query performance. Because you send identical queries to the static fact table as well as the realtime partition, you can relax and let the aggregate navigator sort out whether either of the tables has supporting aggregates. In the case just described, only the big static table needs them. PERIODIC SNAPSHOT REALTIME PARTITIONIf the static data warehouse fact table has a periodic grain (say, monthly), then the realtime partition can be viewed as the current hot rolling month. Suppose you are a big retail bank with 15 million accounts. The static fact table has the grain of account by month. A 36-month time series would result in 540 million fact table records. Again, this table would be extensively indexed and supported by aggregates in order to provide good performance. The realtime partition, on the other hand, is just an image of the current developing month, updated continuously as the month progresses. Semi-additive balances and fully additive facts are adjusted as frequently as they are reported. In a retail bank, the "core" fact table spanning all account types is likely to be quite narrow, with perhaps four dimensions and four facts, resulting in a realtime partition of 480MB. The realtime partition again can be pinned in memory. Query applications drilling across from the static fact table to the realtime partition have slightly different logic compared to the transaction grain. Although account balances and other measures of intensity can be trended directly across the tables, additive totals accumulated during the current rolling period may need to be scaled upward to the equivalent of a full month to keep the results from looking anomalous. Finally, on the last day of the month, the accumulating realtime partition can, with luck, just be loaded onto the static data warehouse as the most current month, and the process can start again with an empty realtime partition. ACCUMULATING SNAPSHOT REALTIME PARTITIONAccumulating snapshots are used for short-lived processes like orders and shipments. A record is created for each line item on the order or shipment. In the main fact table, this record is updated repeatedly as activity occurs. You create the record for a line item when the order is first placed, then update it whenever the item is shipped, delivered to the final destination, paid for, and maybe returned. Accumulating snapshot fact tables have a characteristic set of date foreign keys corresponding to each of these steps. In this case it is misleading to call the main data warehouse fact table static because this is the one fact table type that is deliberately updated, often repeatedly. But let's assume that for query performance reasons, this update occurs only at midnight when the users are offline. In this case, the realtime partition will consist of only those line items that have been updated today. At the end of the day, the records in the realtime partition will be precisely the new versions of the records that need to be written onto the main fact table either by inserting the records if they are completely new or overwriting existing records with the same primary keys. In many order and shipment situations, the number of line items in the realtime partition will be significantly smaller than the first two examples. For example, the biggest dog and cat food manufacturer in the United States processes about 60,000 shipment invoices per month. Each invoice may have 20 line items. If an invoice line has a normal lifetime of two months and is updated five times in this interval, then we would see about 7,500 line items updated on an average working day. Even with the rather wide 80-byte records typical of shipment invoice fact tables, we only have 600KB of data in our realtime partition. This will obviously fit in memory. Forget indexes and aggregations on this realtime partition. Queries against an accumulating snapshot with a realtime partition need to fetch the appropriate line items from both the main fact table and the partition and can either drill across the two tables by performing a sort merge (outer join) on the identical row headers or perform a union of the rows from the two tables, presenting the static view augmented with occasional supplemental rows in the report representing today's hot activity. In this column I have made what I hope is a strong case for satisfying the new realtime requirement with specially constructed, but nevertheless familiar, extensions of existing fact tables. If you drop nearly all the indexes and aggregations on these special new tables and pin them in memory, you should be able to get the combined update and query performance that you need. Ralph Kimball coinvented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including The Data Webhouse Toolkit (Wiley, 2000). He teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach him through his Web site, www.rkimball.com.
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
| ||||||||||||||||||||||||||||||||









