Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Digital Library
Subscribe
Home


April 20, 1999, Volume 2 - Number 6

Peeking into your customer's market basket can ease decision making

The Market Basket Data Mart


Ralph Kimball


Data warehouse architects have long been recording retail store purchases in large atomic fact tables. With the increasing use of customer loyalty cards and in-house credit cards, these fact tables would seem to contain everything our marketing and merchandising managers need to know about customer behavior.

It is certainly true that our atomic fact table tells us in exquisite detail what the stores sold, which stores sold it, when they sold it, and who the customer was. The large atomic fact table is great for promotion analysis. But one of the most interesting analyses of customer behavior — market-basket analysis — is served very poorly by this table.

The most basic form of market-basket analysis seeks to understand what meaningful product combinations were sold together in individual market baskets. When the store manager or headquarters merchandising manager understands the meaningful market baskets, that manager can make some interesting decisions, including:

• Placing products near each other in the store for shopper convenience and to increase the likelihood that customers will purchase multiple products

• Conversely, placing products far apart from each other to force the customer to traverse certain aisles or go past certain displays

• Pricing and packaging combinations of products, perhaps to piggyback a new brand with an established brand.

The primary goal of market-basket analysis is to understand the “meaningful combinations” of products sold together. On closer examination, however, this goal often becomes much more complicated. More sophisticated forms of the market basket goal include:

• Understanding combinations of brands, subcategories, and categories, rather than focusing just on lowest-level SKUs (stock-keeping units at the bar-coded level).

• Looking for mixed aggregate results that are more meaningful than results at the same hierarchical level. For instance, maybe an interesting result is that Coca-Cola’s 12-ounce cans sell well with frozen pasta dinners. In this case, we have matched an SKU with a subcategory.

• Looking for groups of three — or even more — products sold together.

• Understanding which items do not sell well together. This is the so-called missing market basket.

• Looking for products (or brands or categories) that characteristically sell together, but not always in the same specific market basket. In this case, we continue to focus on the individual customer, but we relax the definition of the market basket to include purchase events within a span of time. If the customer buys peanut butter, we would expect to see a jelly purchase at some point in time close by. At this point, we are asking for full-fledged affinity grouping, as described by data mining specialists.

We cannot easily use the basic atomic fact table describing purchases to perform any of these analyses. Part of the problem is the near impossibility of constraining and grouping across records in SQL. Each item in a given basket is in a separate record, and SQL was never designed to express cross-record constraints. Another problem is the combinatorial explosion of products. If a big retail store has 100,000 low-level products, then just enumerating all the two-way combinations involves 10 billion possibilities.

The solution for the market-basket analysis problem is to create a new fact table (see the right side of Figure 1) and to populate this fact table in a very specific, disciplined way. We will see that much of the hard work of market-basket analysis takes place in the data staging area during the extract, transform, and load (ETL) processes, thus simplifying the query and presentation phases of the analysis.

FIGURE 1 Market-basket data mart fact tables.

The market-basket fact table in Figure 1 is a periodic snapshot representing the important pairs of products sold in various market baskets. (I described periodic-snapshot fact tables in my March 30 article, “Fundamental Grains.”) The time key represents the end-of-period marker. The facts (basket counts and various dollars and units sold) represent the total activity during the period.

There are two generalized product keys in this design that must be able to represent individual products (SKUs), brands, subcategories, categories, and departments — all in the same fact table. These choices would come from the traditional merchandise hierarchy in the retail business. Other generalized product key choices could represent flavors or package types, although in my experience, most market-basket analysis takes place only across the levels of the traditional merchandise hierarchy.

The facts include the total number of market baskets with the named combinations of products: the total number of dollars and units sold of product one in this set of baskets, and the total number of dollars and units sold of product two in this set of baskets.

Now if this table in Figure 1 were magically filled in for all possible combinations of generalized products, we could navigate this table to answer most of the questions we posed earlier. We would probably first look for records with high basket counts. After all, these are the frequently observed baskets and should be the most meaningful. Second, we would look for baskets where the dollars or units were in reasonable balance; finding baskets where the dollars or units were far out of balance would be rather dull because all we’ve done is find high-selling products coupled with insignificant second products. We probably wouldn’t base major merchandising or promotion decisions around such findings.

Realistically, we can’t expect the market-basket table in Figure 1 to be magically filled in. We still haven’t dealt with the combinatorial explosion. Somehow, we have to fill in the market-basket fact table, generating all the records we hoped to see in the previous paragraph, but nimbly avoiding the generation of billions of bogus records we never want to analyze.

The Progressive Pruning Algorithm

The solution is a top-down progressive pruning algorithm, implemented during the ETL phase in the back room. We start from the atomic sales fact table shown on the left of Figure 1 and run a series of queries that extract data into the market-basket fact table shown on the right. The logic goes like this:

1. Start at the top of the merchandise hierarchy by enumerating all the combinations of department-to-department market baskets. If there are 20 departments, this first step yields 400 records.

2. Prune this list severely by choosing only records with a threshold number of baskets and where the dollars or units are within a specified balance. Experimentation will tell you what the threshold should be and what the criteria for the specified balance should be. (I recommend starting with 20 percent of the total basket count and with products that are balanced within 30 percent of each other.) Stop the algorithm if the pruning step removes all the records or if you run out of storage space or patience.

3. Drill down on product one by descending the merchandise hierarchy one step and enumerating all the new combinations, say category (product one) by department (product two). Repeat similarly by drilling one step down on product two, as long its level is equal to or higher than the level of product one. Go to step 2.

This algorithm is guaranteed to stop because each step that descends the hierarchy necessarily produces records with smaller basket counts. Eventually, the system finds no records with basket counts greater than the minimum threshold for relevance. Also, it is permissible to stop at any time. At each point, the records found thus far are the records that could have the most impact on the analysis. Notice that this top-down approach is superior to a bottom-up approach in several ways. A bottom-up approach starting with individual atomic products could easily miss important patterns. If a store frequently sells Coca-Cola 12-ounce cans with the pasta dinner subcategory, we probably would miss this result if we paired Coca-Cola directly with individual pasta dinner SKUs. And as I stated earlier, the top-down progressive pruning algorithm starts finding relevant results immediately without generating an astronomical number of records. Running the pruning algorithm progressively provides more focus to already relevant results.

Perhaps the most subtle design step in setting up this market-basket data mart is choosing the generalized product keys. If we have set up the original sales fact table in Figure 1 with standard dimensional modeling techniques, the product keys are integer (surrogate) keys with no structure and no recognizable semantics. Aggregate fact tables built for query performance reasons accompanying this sales fact table would have one or more shrunken dimensions representing the degrees of aggregation. Typically, a shrunken product dimension would exist at the category level — and possibly other levels as well. The interesting twist to the design of the market-basket data mart is that for the first time in our data warehouse, we have to build a single product dimension that contains entries at multiple levels of the hierarchy. This procedure requires creating a special variant of the product dimension table with a smaller number of rather generic attributes (because, for example, none of the most detailed product attributes would exist except at the lowest SKU level) and assigning the keys from the various levels so that they do not overlap.

We can generalize the progressive pruning algorithm to handle three items in the basket in a straightforward way. The market-basket fact table now needs three generalized product keys, three-dollar sums, and three-unit sums. If the first step of the algorithm is computed with the same rules, it creates 8,000 department combinations. We handle the pruning steps in a way analogous to the recipe given earlier.

We can also generalize the progressive pruning algorithm to handle market baskets that extend over several purchase events: You only need to process all the baskets for a given customer within a specified time span as if these baskets actually were one big basket.

It is difficult to use the techniques described in this column to answer what was not in the market basket. Much of the problem is the sheer logical weirdness of the question. After all, every market basket is missing most of the items in the store. Also, if an item is missing, it doesn’t contribute to the dollar impact or the units’ impact of the basket, making it difficult to use our progressive pruning logic. However, there are a few variations on the question of what didn’t happen that are probably worth answering with special queries on the sales fact table. These include: What items are sold most often in single-item market baskets? In two-item market baskets? Which combinations of items occur less often in a statistically significant way than you would expect from their overall presence in all possible market baskets? For other thoughts on this tricky “what didn’t happen” question, see my column by that name in the February 16 issue.

Although the preparation and extract processing for the market-basket data mart are significant steps, the resulting market-basket fact table and the associated special product dimension lets us perform nearly all the interesting and useful types of market-basket analysis in a bounded and controllable way.



Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at www.ralphkimball.com.





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







InformationWeek Business Technology Network
InformationWeekInformationWeek 500InformationWeek 500 ConferenceInformationWeek AnalyticsInformationWeek CIO
InformationWeek EventsInformationWeek ReportsInformationWeek MagazinebMightyByte and SwitchDark Reading
Digital LibraryIntelligent EnterpriseInternet EvolutionNetwork ComputingNo JitterPlug Into The Cloud
space
Techweb Events Network
InteropVoiceConWeb 2.0 ExpoWeb 2.0 SummitEnterprise 2.0 ConferenceMobile Business ExpoSoftware ConferenceCSI - Computer Security Institute
Black HatGTECEnergy CampMashup CampStartup Camp
space
Light Reading Communications Network
Light ReadingLight Reading EuropeUnstrungLight Reading's Cable Digital NewsConstantinopleInternet EvolutionPyramid Research
Heavy ReadingLight Reading Live!Light Reading InsiderEthernet ExpoOptical ExpoTeleco TVTower Technology Summit
space
Financial Technology Network
Advanced TradingBank Systems & TechnologyInsurance & TechnologyWall Street & TechnologyAccelerating Wall StreetBank Systems & Technology Executive SummitBuyside Trading SummitInsurance & Technology Executive Summit
space
Microsoft Technology Network
MSDN MagazineTechNetThe Architecture Journal
space