Principles of data processing seem as incontrovertible as gravityWhat Didnt HappenRalph Kimball
We are very good at formulating queries to find out what did happen. These are the so-called directed queries that constitute nearly all our data warehouse use. But how do we ask our data warehouses what didnt happen? How do we ask our data warehouse what didnt sell today or which combinations of products didnt sell together in the market basket? How do we ask for mixtures of events that did happen together with events that didnt happen, such as what was on promotion that didnt sell or which products had options 1, 5, and 9 but not 2, 7, or 29? Modeling the nonexistence of events is interesting because, in some cases, we can explicitly make records for the nonevents. But in many cases, its ridiculous to make records for things that didnt happen. If we are tracking student classroom attendance, its reasonable to make records both for students who do attend as well as for students who dont. Presumably, the no-shows are a small fraction of the total list of students registered for the class. But if we are tracking frequent flyer flights on an airline, we certainly dont build records for each of the flights not taken by each frequent flyer. Lets examine the main techniques for modeling what didnt happen.
Coverage TablesA classic sales fact table in a retail environment might have foreign keys representing the calendar day, product, store, and promotion. We assume that the grain of the sales fact table is daily sales totals for each product in each store. We place a record in this fact table at the end of each day only if a particular product has been sold. We almost never put records in the fact table for zero sales of a product on a day because, in most retail settings, only a small fraction of the total product portfolio sells in each store each day. When we ask, What products didnt sell? we have to decide whether we are asking, What products were in the store that didnt sell? or, What products should have been in the store but are out of stock? To answer either of these questions, we need an accurate inventory measure for each product in each store on that day. We may not have a physical inventory table for each product in each store on each day. Perhaps we must calculate daily inventory levels by starting with a known physical inventory at the start of the month and then working forward by taking into account sales, shrinkage, and deliveries. In any case, we end up with either a real or virtual inventory fact table with the dimensions of Calendar Day, Product, and Store. To answer the question What products were in the store but didnt sell? we must perform a set difference between this inventory table and the sales table. This takes the form:
{select all products with nonzero inventory today} MINUS {select any products that sold today}.
The MINUS operation in relational systems is simply the set difference operation. In this application, the inventory table plays the role of a coverage table because its effectively a foundation for all events that could take place on that inventory. In this case, we have subtracted the sales events to answer the question of what products didnt sell. Another question we might ask is, What products were on promotion today but didnt sell? Again, we need a coverage table to work against. Lets say the store manager decides on the promotions to run each week. On Saturday night, the store manager updates the promotion coverage table with each product scheduled for promotion the following week. The dimensions of this coverage table are Week, Product, Store, and Promotion. Again, we can answer our question by constructing a set difference query:
{select all products on promotion today} MINUS {select any products that sold today}.
Explicit Records for NonbehaviorThe coverage tables we built in the previous examples were separate from the main fact tables that recorded actual events. But in some cases, it is appropriate to record nonevents in the same table as events. This is a matter of designer discretion, but usually this approach is appropriate if the final fact table does not grow alarmingly and if the nonevents have the same dimensionality as the events. A good example is student attendance. If we are tracking student attendance in each course each day, then the dimensions of each event record in the attendance fact table might be Calendar Day, Student, Professor, Course, and Location. It might be hard to think of many additive numeric facts that you would measure at this grain. However, if we make a fact called Attendance, the value of which is either one or zero, then we can easily ask which students didnt attend the class. This approach is appropriate in this case because the extra records showing what didnt happen are (you hope) a small fraction of the total. Now, our query asking which students didnt attend the class is a simple single query: {select student_name where Attendance = 0}.
Searching for Nonexistent Facts with NOT EXISTSBoth of the previous approaches required a degree of preplanning. Either we built a coverage table describing inventory levels or promotion coverage, or we explicitly put zeros into an attendance fact. But usually, we cant anticipate the What Didnt Happen query. Our analysts may come up with all sorts of What Didnt Happen queries. In some sense, there are a lot more of these questions than What Did Happen questions. The NOT EXISTS construct in SQL is a powerful, general-purpose mechanism for identifying records that dont exist in a database. However, its still not a panacea. Even though we can ask what doesnt exist, paradoxically we must ask very specifically what doesnt exist by framing the NOT EXISTS within a larger query. We have no good way to ask what things unknown to us dont exist. To see this general statement illustrated, study the following example. To generate a single query using NOT EXISTS to determine what products were on promotion in the San Antonio main outlet on January 15, 1999, that didnt sell, you would have to issue SQL such as:
SELECT P1.PRODUCT_DESCRIPTION FROM SALES_FACT F1, PRODUCT P1, STORE S1, CALENDAR_DATE D1, PROMOTION R1 WHERE F1.PROD_KEY = P1.PROD_KEY AND F1.STORE_KEY = S1.STORE_KEY AND F1.DATE_KEY = D1.DATE_KEY AND F1.PROMO_KEY = R1.PROMO_KEY AND S1.STORE_LOCATION = San Antonio Main Outlet AND D1.MONTH = January, 1999
AND NOT EXISTS (SELECT R2.PROMO_KEY FROM SALES_FACT F2, PROMOTION R2, CALENDAR_DATE D2 WHERE F2.PROMO_KEY = P2.PROMO_KEY AND F2.PROD_KEY = F1.PROD_KEY AND F2.STORE_KEY = F1.STORE_KEY AND F2.DATE_KEY = F1.DATE_KEY AND F2.DATE_KEY = D2.DATE_KEY AND R2.PROMOTION_TYPE = Active Promotion AND D2.FULL_DATE = January 15, 1999)
The first six lines in the WHERE clause define the products that were on sale in the San Antonio main outlet in January 1999. The next nine lines specifying the query inside the NOT EXISTS construct asks for each of those products sold on promotion on January 15, 1999. The ones that dont exist (that is, that were sold sometime in January but not sold on promotion on January 15) are returned by the overall query. This is a SQL correlated subquery. Although this approach avoids the necessity for an inventory or promotions coverage table, it may still miss some products that didnt sell at all in January, and the query probably runs slowly because of its complexity.
Using NOT EXISTS to Find NonattributesA final challenging variation of the What Didnt Happen problem is finding a set of products that have some options but dont have other options, where each products option list is open ended. Suppose we sell cars, each of which has many options drawn from a list of 100 possibilities. How do we ask for all the cars with options 1, 11, and 21 but not options 2, 12, 22, or 32? We start by building a fact table with one record for each option on each car. The dimensions of this fact table are Car and Option. For simplicity, we will assume a CAR_KEY identifies a unique car, and an OPTION_KEY is the option number. Using NOT EXISTS, we can ask the question posed in the previous paragraph:
SELECT F1.CAR_KEY FROM FACT F1 WHERE (SELECT COUNT(F2.CAR_KEY) FROM FACT F2 WHERE F2.CAR_KEY = F1.CAR_KEY AND F2.OPTION_KEY IN (1, 11, 21)) = 3 AND NOT EXISTS (SELECT * FROM FACT F3 WHERE F3.CAR_KEY = F1.CAR_KEY AND F3.OPTION_KEY IN (2, 12, 22, 32))
In this case, we can use the MINUS operation to write the equivalent logic: {select cars with all the options 1, 11, and 21 as in the first clause of the preceding query} MINUS {select cars with any of the options 2, 12, 22, or 32}.
Notice that asking for cars with all the options 1, 11, and 21 is a little tricky. You cannot just ask for the cars with options 1, 11, or 21 because that wont get the cars with all the options simultaneously. In this article, we have explored four different techniques for answering What Didnt Happen. These examples span all of the challenging situations I have encountered. But What Didnt Happen takes on some tricky forms. I would be interested in hearing about other situations or approaches you may have encountered beyond these described here. Ralph Kimball, Ph.D., works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at www.rkimball.com. |





















