December 21, 1999, Volume 2 - Number 18
The Perfect Handoff
What characteristics of individual customers predict whether they will be good or bad customers
Whether the way people tour a site reveals their interest in your products
How to dynamically modify the Web site experience, based on the webhouse's assessment of the viewer's behavioral similarity to thousands of other customers
What pages on the site attract visitors or are session killers.
Although data warehousing and data mining have coexisted in various forms since the mid-1980s, the two communities have not frequently collaborated. All too often, the data miners have end-run the data warehouse and sourced the data directly, usually because they want extremely granular data, which they call observations. Any aggregations data warehousers built were poison to data mining.
For some time, the data warehouse, and especially the webhouse, has routinely stored the lowest-level transactions and behavior. After all, it is this low-level data that is the most dimensional.
FIGURE 1 A template for a handoff from a data webhouse to a data mining application.
We webhouse teams must stand up and say we are the source of data for all forms of reporting, analysis, forecasting, scoring, and data mining. We have teased the data out of its hiding places, built data extract and transformation pipelines, know how to conform data from multiple sources, and are a professional platform for storing data used for all kinds of purposes.
The data mining community needs to use our data rather than source it independently. The data miner's job is analysis; we are data wranglers.
The Perfect Observation
Consider the meta-SQL shown in Figure 1. Imagine that this data specification was able to deliver millions of records with this content. Each record is an elaborate description of customer behavior. There is one record for each customer.
Most data miners would kill for a set of observations with this content. Believe me, a data miner would much rather analyze this data than prepare it! In my experience, data miners often do very limited data extracts by data warehouse standards, ending up with much less than Figure 1 shows.
Study Figure 1 closely. This data couldn't possibly come from one source; it must be conformed from multiple sources, probably by implementing fuzzy matches on the customer's name, address, and other fields, and combining multiple data sources. Even when you conform the separate sources of data, the resulting drill-across application runs too slowly for a data miner, who wants up to thousands of these observations per second!
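As a rough illustration of the fuzzy matching mentioned above, the following sketch compares name and address strings from two sources using Python's standard difflib module. The record layout, the helper names, and the 0.85 threshold are assumptions made for this example only; a production conforming step would normally use more careful address standardization and more fields.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(kept.split())

def similarity(a, b):
    """Return a 0..1 similarity ratio between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def same_customer(rec_a, rec_b, threshold=0.85):
    """Assumed rule: name AND address must both score at or above the threshold."""
    return (similarity(rec_a["name"], rec_b["name"]) >= threshold
            and similarity(rec_a["address"], rec_b["address"]) >= threshold)

# Hypothetical records for the same person from two different source systems.
web = {"name": "Jon A. Smith", "address": "12 Main St., Springfield"}
crm = {"name": "John A Smith", "address": "12 Main St Springfield"}
other = {"name": "Maria Lopez", "address": "98 Oak Avenue, Portland"}

print(same_customer(web, crm))    # the two spellings are close enough to match
print(same_customer(web, other))  # an unrelated customer does not match
```

Records that match would then be given the same conformed customer key, so that every fact table can later be combined under that key.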
We are now beginning to see our destiny as webhousers.
The purpose of the webhouse is to gather, store, and present data in the best possible way to the data-mining tool, and not to actually perform the data mining. Data mining is more of an analytic application than a database. Historically, we have duplicated too much effort and lost too many opportunities because the responsibilities of the webhouse and data-mining operation were inadequately defined.
In the overall flow of data from its original source to the final step of data mining, I recommend the following division of responsibilities. (Key steps are highlighted in bold.)
The webhouse data-mining responsibilities are:
Original extraction from all internal legacy sources and third-party sources
Data content validation and cleaning
Combining of disparate data sources into fact and dimension tables of uniform granularities
Creation of derived facts and attributes of interest to data mining tools
Assignment of all foreign and primary keys in fact and dimension tables
Creation of complex drill-across reports that are ready-to-go observation sets
Storage of the ready-to-go observation sets for high-performance access by the data-mining tools
Optionally accepting and storing the results of data mining tool runs.
The data-mining tool responsibilities are:
Reading the ready-to-go observation sets, perhaps repetitively, directly into the data-mining tools
Providing on-the-fly data transformation steps where not provided by the webhouse
Performing the data-mining analyses
Handing the results of the data-mining tool runs off to the webhouse for storage
The creation of the complex drill-across reports is the most valuable step in the process, because it draws upon the strengths of the webhouse and is what the data miners are least prepared to do.
Many data warehouse developers are inwardly focused on data available from their organization's production systems, perhaps unaware of the rich sources of data available from third-party data providers. With the increased focus on customer behavior and customer demographics, the webhouse team needs to become more familiar with the data sources and the companies providing this data. It would be a mistake to turn this data sourcing over to the data mining group because that leaves all the issues of sourcing data, conforming keys, combining tables, representing time series, and providing data access to the end users whose main interest is analyzing the data. These jobs, and hence demographics data acquisition, belong to the webhouse team. Jesus Mena has an excellent discussion of the third-party demographics data industry in his book, Data Mining Your Website, which is well worth reading even though his discussion of the data warehouse is very abbreviated.
Referring back to Figure 1, you can see that the various behavioral measures come from many different webhouse tables and are expressed in different granularities. A webhouse application producing this set of observations might consist of more than a dozen separate queries to different fact tables, all of which are combined under the customer identifier row heading. The webhouse could not provide this set of observations directly from the original data sources with the speed the data-mining tool requires.
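The drill-across idea can be sketched in a few lines: each fact table is summarized separately to the customer grain, and the results are combined under the conformed customer identifier. The table names, column names, and sample rows below are hypothetical and are not taken from Figure 1; a real application would issue separate queries against each fact table rather than loop over in-memory lists.

```python
from collections import defaultdict

# Hypothetical fact rows at three different grains, all carrying the same
# conformed customer key.
page_events = [  # one row per page view
    {"customer_id": 1, "session_id": "a", "page": "home"},
    {"customer_id": 1, "session_id": "a", "page": "product"},
    {"customer_id": 1, "session_id": "b", "page": "home"},
    {"customer_id": 2, "session_id": "c", "page": "home"},
]
orders = [  # one row per order
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 1, "amount": 25.0},
    {"customer_id": 3, "amount": 15.0},
]
service_calls = [  # one row per call
    {"customer_id": 2, "minutes": 12},
]

def drill_across(page_events, orders, service_calls):
    """Summarize each fact table by customer, then combine into one row each."""
    obs = defaultdict(lambda: {"page_views": 0, "sessions": 0,
                               "order_total": 0.0, "service_calls": 0})
    sessions = defaultdict(set)
    for event in page_events:
        obs[event["customer_id"]]["page_views"] += 1
        sessions[event["customer_id"]].add(event["session_id"])
    for cid, seen in sessions.items():
        obs[cid]["sessions"] = len(seen)
    for order in orders:
        obs[order["customer_id"]]["order_total"] += order["amount"]
    for call in service_calls:
        obs[call["customer_id"]]["service_calls"] += 1
    # One observation per customer, ordered by customer key.
    return {cid: dict(row) for cid, row in sorted(obs.items())}

observations = drill_across(page_events, orders, service_calls)
for cid, row in observations.items():
    print(cid, row)
```

Customers missing from one fact table still get a row, with zeros for the measures that table would have supplied, which is the behavior a data miner usually expects from an observation set.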
The webhouse needs to produce this set of observations once and then store it for high-performance, repeated access by the data-mining tools. A decision tree or a neural net might read the data only once, but a memory-based reasoning tool may want to read it repeatedly.
The highest-performance access may well be through a flat file. It would be reasonable for the webhouse to hand off the ready-to-go observation set as one or more flat files, which can be read repeatedly. The webhouse then steps back and lets the data-mining tool process the observations at high speed.
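A minimal sketch of this handoff, assuming a comma-separated file with a header row (the column names and sample values are made up for illustration): the webhouse writes the observation set once, and the data-mining side streams it as many times as it needs.

```python
import csv
import os
import tempfile

# Hypothetical column layout for the observation set.
FIELDS = ["customer_id", "page_views", "order_total"]

def write_observations(path, rows):
    """Webhouse side: write the ready-to-go observation set once."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

def read_observations(path):
    """Data-mining side: stream the observations; safe to call repeatedly."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row  # note: values come back as strings

rows = [
    {"customer_id": 1, "page_views": 3, "order_total": 65.0},
    {"customer_id": 2, "page_views": 1, "order_total": 15.0},
]
path = os.path.join(tempfile.mkdtemp(), "observations.csv")
write_observations(path, rows)

# A memory-based reasoning tool might make several passes over the same file.
passes = [list(read_observations(path)) for _ in range(3)]
print(len(passes), "passes,", len(passes[0]), "observations each")
```

Because the file is read as a stream, the reader never needs the whole observation set in memory, which matters when the set runs to millions of records.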
All data mining is a repetitive cycle of cutting and trying. It would be very typical of the data mining project to want more behavioral data measurements, or desire numerical or categorical transformations of existing measurements. In some cases, the data-mining tool will provide the final transformations efficiently, but it is very likely the webhouse's data-delivery environment will augment and extend the observation sets. Most webhouse tool suites can easily take the flat-file outputs described previously and augment them with further columns of data for each customer.
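Augmenting a flat-file observation set can be as simple as copying it while appending a derived column. The sketch below assumes a comma-separated file with a header row; the column names, the spend_band transformation, and its cutoff of 50 are hypothetical choices for the example.

```python
import csv
import io

def augment(src, dst, new_column, derive):
    """Copy a flat-file observation set, appending one derived column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + [new_column])
    writer.writeheader()
    for row in reader:
        row[new_column] = derive(row)
        writer.writerow(row)

def spend_band(row):
    """Assumed categorical transformation of a numeric measurement."""
    return "high" if float(row["order_total"]) >= 50 else "low"

# In-memory stand-ins for the original and augmented flat files.
original = io.StringIO("customer_id,order_total\n1,65.0\n2,15.0\n")
augmented = io.StringIO()
augment(original, augmented, "spend_band", spend_band)
print(augmented.getvalue())
```

The same function works on real files opened with newline="", so each round of the cut-and-try cycle can add a column without rebuilding the whole drill-across result.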
The webhouse team can work with the data miners to reduce the amount of data handed across. After all, not every demographic indicator provides useful insight or has useful predictive value. Some data inputs the webhouse provides may be expensive to compute or buy. It would be helpful to eliminate these variables. By using a neural network tool in auto-associative mode, the data miners can test to see whether the data inputs describing the customer can predict themselves.
This technique can eliminate some data elements because they literally are not consistent with the rest of the customer profile information. Similarly, a neural network tool can eliminate other variables when it is configured in the normal mode of predicting or recognizing desired output variables from the input variables. In this case, the data miner compares the changes in neuron weights from the beginning to the end of the neural network training phase. Input variables with neuron weights that change very little in the training phase clearly have not affected the model very much and the data miner may choose to drop them from consideration.
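The weight-change heuristic can be illustrated with a deliberately tiny stand-in for a neural network: a single sigmoid neuron trained by batch gradient descent. This is a sketch under stated assumptions, not the procedure of any particular data-mining tool. The input names, the toy data, the learning rate, and the 0.5 cutoff are all invented for the example, and the target is constructed to depend only on the first input.

```python
import math

def train_neuron(rows, targets, weights, rate=0.5, epochs=200):
    """Batch gradient descent for one sigmoid neuron; returns the final weights."""
    w = list(weights)
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(rows, targets):
            z = bias + sum(wi * xi for wi, xi in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # prediction minus target
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - rate * g / n for wi, g in zip(w, grad_w)]
        bias -= rate * grad_b / n
    return w

# Hypothetical inputs, coded as -1/+1. The target follows recent_visits only.
names = ["recent_visits", "noise_flag"]
rows = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
targets = [0, 0, 1, 1]

start = [0.1, 0.1]
end = train_neuron(rows, targets, start)

# Compare each weight at the beginning and end of training.
change = {name: abs(e - s) for name, s, e in zip(names, start, end)}
drop_candidates = [name for name in names if change[name] < 0.5]  # assumed cutoff
print(change)
print(drop_candidates)
```

In this toy case the weight on the irrelevant input barely moves while the weight on the informative input grows substantially, so only the irrelevant input is flagged. On real data the cutoff is a judgment call, and inputs should be on comparable scales before their weight changes are compared.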
Implications for Database Architecture
I used to think that database vendors would absorb data mining by providing data mining in the inner loop of the DBMS answer set generator, but have since changed my mind. Figure 1 is not the inner loop of a query, but the final result of a complex drill-across application, probably generated well above the DBMS's inner loop. Maybe the way to say it is that detailed behavior needs to be described within a comprehensive context, not within a narrow query. In any case, I think the architecture is shaking out to be more of a handoff. The webhouse produces the observations at a very granular level and embellishes them with detail, then hands them off to the data miner as a flat file for high-performance, repeated access.
Learning More
Be sure to check out Michael Berry and Gordon Linoff's new book, Mastering Data Mining Techniques (Wiley, 2000). As in their first book, they really focus on the delivered value of data mining rather than the details of the individual tools. I reviewed it recently, and I think it will become the seminal book on data mining. It also meshes very nicely with my view of the handoff described in this column.
Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He also wrote the best-selling books The Data Warehouse Toolkit (Wiley, 1996) and The Data Warehouse Lifecycle Toolkit (Wiley, 1998). Ralph teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach Ralph through his Web site at www.ralphkimball.com.