Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Whitepapers
Subscribe
Home


Principles of data processing seem as incontrovertible as gravity

What Didn’t Happen

Ralph Kimball


Our data warehouses are filled with data telling us what happened in our businesses. Frequently the lowest level of each of our data marts is the most atomic data we can gather from our production systems. Each button click, transaction, and product sale is a record. When an event takes place, we make a record.

We are very good at formulating queries to find out what did happen. These are the so-called directed queries that constitute nearly all our data warehouse use. But how do we ask our data warehouses what didn’t happen?

How do we ask our data warehouse what didn’t sell today or which combinations of products didn’t sell together in the market basket? How do we ask for mixtures of events that did happen together with events that didn’t happen, such as what was on promotion that didn’t sell or which products had options 1, 5, and 9 but not 2, 7, or 29?

Modeling the nonexistence of events is interesting because, in some cases, we can explicitly make records for the nonevents. But in many cases, it’s ridiculous to make records for things that didn’t happen. If we are tracking student classroom attendance, it’s reasonable to make records both for students who do attend as well as for students who don’t. Presumably, the no-shows are a small fraction of the total list of students registered for the class. But if we are tracking frequent flyer flights on an airline, we certainly don’t build records for each of the flights not taken by each frequent flyer.

Let’s examine the main techniques for modeling what didn’t happen.

Coverage Tables

A classic sales fact table in a retail environment might have foreign keys representing the calendar day, product, store, and promotion. We assume that the grain of the sales fact table is daily sales totals for each product in each store. We place a record in this fact table at the end of each day only if a particular product has been sold. We almost never put records in the fact table for zero sales of a product on a day because, in most retail settings, only a small fraction of the total product portfolio sells in each store each day.

When we ask, “What products didn’t sell?” we have to decide whether we are asking, “What products were in the store that didn’t sell?” or, “What products should have been in the store but are out of stock?” To answer either of these questions, we need an accurate inventory measure for each product in each store on that day.

We may not have a physical inventory table for each product in each store on each day. Perhaps we must calculate daily inventory levels by starting with a known physical inventory at the start of the month and then working forward by taking into account sales, shrinkage, and deliveries. In any case, we end up with either a real or virtual inventory fact table with the dimensions of Calendar Day, Product, and Store.

To answer the question “What products were in the store but didn’t sell?” we must perform a set difference between this inventory table and the sales table. This takes the form:


{select all products with nonzero inventory today}

MINUS

{select any products that sold today}.


The MINUS operation in relational systems is simply the set difference operation. In this application, the inventory table plays the role of a coverage table because it’s effectively a foundation for all events that could take place on that inventory. In this case, we have subtracted the sales events to answer the question of what products didn’t sell.

Another question we might ask is, “What products were on promotion today but didn’t sell?” Again, we need a coverage table to work against. Let’s say the store manager decides on the promotions to run each week. On Saturday night, the store manager updates the “promotion coverage” table with each product scheduled for promotion the following week. The dimensions of this coverage table are Week, Product, Store, and Promotion. Again, we can answer our question by constructing a set difference query:


{select all products on promotion today}

MINUS

{select any products that sold today}.


Explicit Records for Nonbehavior

The coverage tables we built in the previous examples were separate from the main fact tables that recorded actual events. But in some cases, it is appropriate to record nonevents in the same table as events. This is a matter of designer discretion, but usually this approach is appropriate if the final fact table does not grow alarmingly and if the nonevents have the same dimensionality as the events. A good example is student attendance.

If we are tracking student attendance in each course each day, then the dimensions of each event record in the attendance fact table might be Calendar Day, Student, Professor, Course, and Location. It might be hard to think of many additive numeric facts that you would measure at this grain. However, if we make a fact called “Attendance,” the value of which is either one or zero, then we can easily ask which students didn’t attend the class. This approach is appropriate in this case because the extra records showing what didn’t happen are (you hope) a small fraction of the total.

Now, our query asking which students didn’t attend the class is a simple single query:

{select student_name where Attendance = 0}.

Searching for Nonexistent Facts with NOT EXISTS

Both of the previous approaches required a degree of preplanning. Either we built a coverage table describing inventory levels or promotion coverage, or we explicitly put zeros into an attendance fact. But usually, we can’t anticipate the What Didn’t Happen query. Our analysts may come up with all sorts of What Didn’t Happen queries. In some sense, there are a lot more of these questions than What Did Happen questions.

The NOT EXISTS construct in SQL is a powerful, general-purpose mechanism for identifying records that don’t exist in a database. However, it’s still not a panacea. Even though we can ask what doesn’t exist, paradoxically we must ask very specifically what doesn’t exist by framing the NOT EXISTS within a larger query. We have no good way to ask what things unknown to us don’t exist. To see this general statement illustrated, study the following example.

To generate a single query using NOT EXISTS to determine what products were on promotion in the San Antonio main outlet on January 15, 1999, that didn’t sell, you would have to issue SQL such as:


SELECT P1.PRODUCT_DESCRIPTION

FROM SALES_FACT F1, PRODUCT P1, STORE S1, CALENDAR_DATE D1, PROMOTION R1

WHERE F1.PROD_KEY = P1.PROD_KEY

AND F1.STORE_KEY = S1.STORE_KEY

AND F1.DATE_KEY = D1.DATE_KEY

AND F1.PROMO_KEY = R1.PROMO_KEY

AND S1.STORE_LOCATION = ‘San Antonio Main Outlet’

AND D1.MONTH = ‘January, 1999’

AND NOT EXISTS

(SELECT R2.PROMO_KEY

FROM SALES_FACT F2, PROMOTION R2, CALENDAR_DATE D2

WHERE F2.PROMO_KEY = P2.PROMO_KEY

AND F2.PROD_KEY = F1.PROD_KEY

AND F2.STORE_KEY = F1.STORE_KEY

AND F2.DATE_KEY = F1.DATE_KEY

AND F2.DATE_KEY = D2.DATE_KEY

AND R2.PROMOTION_TYPE = ‘Active Promotion’

AND D2.FULL_DATE = ‘January 15, 1999’)


The first six lines in the WHERE clause define the products that were on sale in the San Antonio main outlet in January 1999. The next nine lines specifying the query inside the NOT EXISTS construct asks for each of those products sold on promotion on January 15, 1999. The ones that don’t exist (that is, that were sold sometime in January but not sold on promotion on January 15) are returned by the overall query. This is a SQL “correlated subquery.” Although this approach avoids the necessity for an inventory or promotions coverage table, it may still miss some products that didn’t sell at all in January, and the query probably runs slowly because of its complexity.

Using NOT EXISTS to Find Nonattributes

A final challenging variation of the What Didn’t Happen problem is finding a set of products that have some options but don’t have other options, where each product’s option list is open ended. Suppose we sell cars, each of which has many options drawn from a list of 100 possibilities. How do we ask for all the cars with options 1, 11, and 21 but not options 2, 12, 22, or 32?

We start by building a fact table with one record for each option on each car. The dimensions of this fact table are Car and Option. For simplicity, we will assume a CAR_KEY identifies a unique car, and an OPTION_KEY is the option number. Using NOT EXISTS, we can ask the question posed in the previous paragraph:


SELECT F1.CAR_KEY

FROM FACT F1

WHERE

(SELECT COUNT(F2.CAR_KEY)

FROM FACT F2

WHERE F2.CAR_KEY = F1.CAR_KEY

AND F2.OPTION_KEY IN (1, 11, 21))

= 3

AND NOT EXISTS

(SELECT *

FROM FACT F3

WHERE F3.CAR_KEY = F1.CAR_KEY

AND F3.OPTION_KEY IN (2, 12, 22, 32))

In this case, we can use the MINUS operation to write the equivalent logic:

{select cars with all the options 1, 11, and 21 as in the first clause of the preceding query}

MINUS

{select cars with any of the options 2, 12, 22, or 32}.


Notice that asking for cars with all the options 1, 11, and 21 is a little tricky. You cannot just ask for the cars with options 1, 11, or 21 because that won’t get the cars with all the options simultaneously.

In this article, we have explored four different techniques for answering What Didn’t Happen. These examples span all of the challenging situations I have encountered. But What Didn’t Happen takes on some tricky forms. I would be interested in hearing about other situations or approaches you may have encountered beyond these described here.

Ralph Kimball, Ph.D., works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at www.rkimball.com.





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







InformationWeek Business Technology Network
InformationWeekInformationWeek 500InformationWeek 500 ConferenceInformationWeek AnalyticsInformationWeek CIO
InformationWeek EventsInformationWeek ReportsInformationWeek MagazinebMightyByte and SwitchDark Reading
Digital LibraryIntelligent EnterpriseInternet EvolutionNetwork ComputingNo Jitter
space
Techweb Events Network
InteropVoiceConWeb 2.0 ExpoWeb 2.0 SummitEnterprise 2.0 ConferenceMobile Business ExpoSoftware ConferenceCSI - Computer Security Institute
Black HatGTECEnergy CampMashup CampStartup Camp
space
Light Reading Communications Network
Light ReadingLight Reading EuropeUnstrungLight Reading's Cable Digital NewsConstantinopleInternet Evolution
Heavy ReadingLight Reading Live!Light Reading InsiderEthernet ExpoOptical ExpoTeleco TVTower Technology Summit
space
Financial Technology Network
Advanced TradingBank Systems & TechnologyInsurance & TechnologyWall Street & TechnologyAccelerating Wall StreetBank Systems & Technology Executive SummitBuyside Trading SummitInsurance & Technology Executive Summit
space
Microsoft Technology Network
MSDN MagazineTechNetThe Architecture Journal
space