Intelligent Enterprise

Better Insight for Business Decisions



December 21, 1999, Volume 2 - Number 18

The Perfect Handoff

Mined behavioral data is strategically beneficial to webhousing

By Ralph Kimball



The new name I use for the data warehouse, data webhouse (or just “webhouse”), reflects the profound effect the Web is having on the content and structure of the data warehouse. For instance, the Web provides many new data sources, mostly related to customer behavior; such information is essential to customer relationship management (CRM), which is gaining popularity. Behavioral data is also important to consider when designing a Web interface, and we’re increasingly required to create Web-based back- and front-room interfaces for our webhouses so that employees, business partners, and customers can access the same interface from anywhere in the world. The behavior patterns that so many new webhouse applications record come in many flavors, including residential and customer shopper, end-user analyst, and Web visitor. All these types denote how and in what sequence people buy products over time, how they tour and use a Web site, and how they react to a customized experience.

Behavior analysis is the province of data mining, which is essentially a complex analytic client that seeks patterns in behavioral data. You can discover such things as:

• What characteristics of individual customers predict whether they will be good or bad customers

• Whether the way people tour a site reveals their interest in your products

• How to dynamically modify the Web site experience, based on the webhouse’s assessment of the viewer’s behavioral similarity to thousands of other customers

• What pages on the site attract visitors or are session killers.

Although data warehousing and data mining have coexisted in various forms since the mid-1980s, the two communities have not frequently collaborated. All too often, the data miners have end-run the data warehouse and directly sourced the data — usually because they want extremely granular data, which they call “observations.” Any aggregations data warehousers built were poison to data mining.

For some time, the data warehouse, and especially the webhouse, have routinely stored the lowest-level transactions and behavior. After all, it is this low-level data that is the most dimensional.



FIGURE 1 A template for a handoff from a data webhouse to a data mining application.


We webhouse teams must stand up and say we are the source of data for all forms of reporting, analysis, forecasting, scoring, and data mining. We have teased the data out of its hiding places, built data extract and transformation pipelines, know how to conform data from multiple sources, and are a professional platform for storing “used data” for all kinds of purposes.

The data mining community needs to use our data rather than source it independently. The data miner’s job is analysis; we are data wranglers.

The Perfect Observation

Consider the meta-SQL shown in Figure 1. Imagine that this data specification was able to deliver millions of records with this content. Each record is an elaborate description of customer behavior. There is one record for each customer.

Most data miners would kill for a set of observations with this content. Believe me, a data miner would much rather analyze this data than prepare it! In my experience, data miners often do very limited data extracts by data warehouse standards, ending up with much less than Figure 1 shows.

Study Figure 1 closely. This data couldn’t possibly come from one source; it must be conformed from multiple sources, probably by implementing fuzzy matches on the customer’s name, address, and other fields, and combining multiple data sources. Even when you conform the separate sources of data, the resulting drill-across application runs too slowly for a data miner, who wants up to thousands of these observations per second!
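The fuzzy matching mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the sample records, the `normalize`, `similarity`, and `best_match` helpers, and the 0.8 threshold are all hypothetical, and a production conforming step would use far richer matching rules than a single string-similarity score.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Return a similarity score between 0 and 1 for two raw strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(record, candidates, threshold=0.8):
    """Return the candidate most similar to record, or None if none reaches threshold."""
    if not candidates:
        return None
    winner = max(candidates, key=lambda c: similarity(record, c))
    return winner if similarity(record, winner) >= threshold else None


# Hypothetical name-and-address strings from two separate sources.
billing = "John Smith, 12 Main St."
web_profiles = ["Jon Smith 12 Main Street", "Maria Garcia 98 Oak Avenue"]
print(best_match(billing, web_profiles))
```

Returning `None` below the threshold leaves doubtful pairs unmatched rather than merging two different customers, which is usually the safer failure for a conformed customer dimension.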

We are now beginning to see our destiny as webhousers.

The purpose of the webhouse is to gather, store, and present data in the best possible way to the data-mining tool, and not to actually perform the data mining. Data mining is more of an analytic application than a database. Historically, we have duplicated too much effort and lost too many opportunities because the responsibilities of the webhouse and data-mining operation were inadequately defined.

In the overall flow of data from its original source to the final step of data mining, I recommend the following division of responsibilities. (The key steps are the creation and storage of the ready-to-go observation sets.)

The webhouse data-mining responsibilities are:

• Original extraction from all internal legacy sources and third-party sources

• Data content validation and cleaning

• Combining of disparate data sources into fact and dimension tables of uniform granularities

• Creation of derived facts and attributes of interest to data mining tools

• Assignment of all foreign and primary keys in fact and dimension tables

• Creation of complex drill-across reports which are “ready-to-go observation sets”

• Storage of the ready-to-go observation sets for high-performance access by the data mining tools

• Optionally accepting and storing the results of data mining tool runs.

The data-mining tool responsibilities are:

• Reading the ready-to-go observation sets, perhaps repetitively, directly into the data-mining tools

• Providing on-the-fly data transformation steps where not provided by the webhouse

• Performing the data-mining analyses

• Handing the results of the data-mining tool runs off to the webhouse for storage

The creation of the complex drill-across reports is the most valuable step in the process, because it draws upon the strengths of the webhouse and is what the data miners are least prepared to do.

Many data warehouse developers are inwardly focused on data available from their organizations’ production systems, perhaps unaware of the rich sources of data available from third-party data providers. With the increased focus on customer behavior and customer demographics, the webhouse team needs to become more familiar with the data sources and the companies providing this data. It would be a mistake to turn this data sourcing over to the data mining group because that leaves all the issues of sourcing data, conforming keys, combining tables, representing time series, and providing data access to the end users whose main interest is analyzing the data. These jobs, and hence demographics data acquisition, belong to the webhouse team. Jesus Mena has an excellent discussion of the third-party demographics data industry in his book, Data Mining Your Website, which is well worth reading even though his discussion of the data warehouse is very abbreviated.

Referring back to Figure 1, you can see that the various behavioral measures come from many different webhouse tables and are expressed in different granularities. A webhouse application producing this set of observations might consist of more than a dozen separate queries to different fact tables, all of which are combined under the customer identifier row heading. The webhouse could not provide this set of observations directly from the original data sources with the speed the data-mining tool requires.
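A drill-across of this kind can be sketched as separate queries, one per measure, whose answer sets are merged on the customer identifier. The tables, columns, and sample rows below are hypothetical stand-ins (they are not the contents of Figure 1), and an in-memory SQLite database stands in for the webhouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchase_fact (customer_id INTEGER, amount REAL);
    CREATE TABLE page_view_fact (customer_id INTEGER, page TEXT);
    INSERT INTO purchase_fact VALUES (1, 20.0), (1, 35.5), (2, 10.0);
    INSERT INTO page_view_fact VALUES
        (1, '/home'), (2, '/home'), (2, '/cart'), (3, '/home');
""")

# One query per behavioral measure, each aggregated to the customer grain.
QUERIES = {
    "total_spend":
        "SELECT customer_id, SUM(amount) FROM purchase_fact GROUP BY customer_id",
    "purchase_count":
        "SELECT customer_id, COUNT(*) FROM purchase_fact GROUP BY customer_id",
    "page_views":
        "SELECT customer_id, COUNT(*) FROM page_view_fact GROUP BY customer_id",
}


def drill_across(conn, queries):
    """Merge per-customer answer sets into one observation row per customer."""
    observations = {}
    for column, sql in queries.items():
        for customer_id, value in conn.execute(sql):
            # Customers missing from a fact table default to 0 for that measure.
            row = observations.setdefault(customer_id, dict.fromkeys(queries, 0))
            row[column] = value
    return observations


observations = drill_across(conn, QUERIES)
for customer_id in sorted(observations):
    print(customer_id, observations[customer_id])
```

Each query is aggregated to the customer grain before the merge, so fact tables of different granularities never have to be joined to one another directly. Defaulting an absent measure to 0 is a simplifying assumption; a real observation set might need to distinguish "no activity" from "unknown".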

The webhouse needs to produce this set of observations once and then store it for high-performance, repeated access by the data-mining tools. A decision tree or a neural net might read the data only once, but a memory-based reasoning tool may want to read it repeatedly.

The highest-performance access may well be through a flat file. It would be reasonable for the webhouse to hand off the ready-to-go observation set as one or more flat files, which can be read repeatedly. The webhouse then steps back and lets the data-mining tool process the observations at high speed.
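As a rough sketch of the flat-file handoff, the webhouse side writes the observation set once and the mining side re-reads it as many times as it needs. The column names, sample rows, and file name are hypothetical, and CSV is only one possible flat-file layout.

```python
import csv
import os
import tempfile

FIELDS = ["customer_id", "total_spend", "page_views"]  # hypothetical columns


def write_observations(path, rows):
    """Write the observation set once: a header plus one line per customer."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_observations(path):
    """Stream observations from the flat file; call again for each new pass."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row


rows = [
    {"customer_id": 1, "total_spend": 55.5, "page_views": 1},
    {"customer_id": 2, "total_spend": 10.0, "page_views": 2},
]
path = os.path.join(tempfile.mkdtemp(), "observations.csv")
write_observations(path, rows)

# A tool that needs several passes over the data simply re-reads the file.
passes = [sum(1 for _ in read_observations(path)) for _ in range(3)]
print(passes)
```

Streaming the file row by row keeps memory use flat however large the observation set grows, and no query runs against the webhouse during the mining passes.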

All data mining is a repetitive cycle of cutting and trying. A data-mining project will typically want more behavioral measurements, or numerical or categorical transformations of existing measurements. In some cases, the data-mining tool will provide the final transformations efficiently, but it is very likely the webhouse’s data-delivery environment will augment and extend the observation sets. Most webhouse tool suites can easily take the flat-file outputs described previously and augment them with further columns of data for each customer.
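Augmenting a flat file with further columns might look like the following sketch, which appends one numerical and one categorical transformation to each customer row. The column names, the logarithmic transform, and the spend cutoff of 50 are hypothetical choices made only for illustration.

```python
import csv
import io
import math


def augment(row):
    """Return a copy of row with two derived columns appended."""
    spend = float(row["total_spend"])
    out = dict(row)
    out["log_spend"] = round(math.log1p(spend), 4)          # numerical transformation
    out["spend_band"] = "high" if spend >= 50 else "low"    # categorical transformation
    return out


def augment_file(src, dst):
    """Copy a flat file from src to dst, adding the derived columns to every row."""
    reader = csv.DictReader(src)
    fields = reader.fieldnames + ["log_spend", "spend_band"]
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    for row in reader:
        writer.writerow(augment(row))


# In-memory files stand in for the flat files handed to the data miners.
src = io.StringIO("customer_id,total_spend\n1,55.5\n2,10.0\n")
dst = io.StringIO()
augment_file(src, dst)
print(dst.getvalue())
```

Because the original columns pass through unchanged, the data miners can keep using earlier observation sets while experimenting with the new columns.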

The webhouse team can work with the data miners to reduce the amount of data handed across. After all, not every demographic indicator provides useful insight or has useful predictive value. Some data inputs the webhouse provides may be expensive to compute or buy. It would be helpful to eliminate these variables. By using a neural network tool in “auto-associative mode,” the data miners can test to see whether the data inputs describing the customer can predict themselves.

This technique can eliminate some data elements because they literally are not consistent with the rest of the customer profile information. Similarly, a neural network tool can eliminate other variables when it is configured in the normal mode of predicting or recognizing desired output variables from the input variables. In this case, the data miner compares the changes in neuron weights from the beginning to the end of the neural network training phase. Input variables with neuron weights that change very little in the training phase clearly have not affected the model very much and the data miner may choose to drop them from consideration.
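The weight-change comparison can be illustrated with a deliberately tiny stand-in: a single logistic neuron trained by batch gradient descent, not a full neural network tool. The data, the starting weights, the input names, and the 0.2 cutoff are all hypothetical; the point is only to show how one might compare weights before and after training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, weights, epochs=200, lr=0.5):
    """Batch gradient descent on one logistic neuron; returns the final weights."""
    w = list(weights)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            error = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += error * xj / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Hypothetical data: input_0 determines the label, input_1 is unrelated to it.
rows = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
labels = [1, 1, 0, 0]
names = ["input_0", "input_1"]
initial = [0.1, 0.1]

final = train(rows, labels, initial)
change = [abs(f - i) for f, i in zip(final, initial)]

# Inputs whose weight barely moved during training are candidates to drop.
drop_candidates = [name for name, c in zip(names, change) if c < 0.2]
print(change, drop_candidates)
```

In this toy case the weight on `input_0` grows steadily while the weight on `input_1` only drifts toward zero, so `input_1` is flagged. In a multilayer network each input feeds many neurons, so the comparison would have to summarize several weights per input.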

Implications for Database Architecture

I used to think that database vendors would absorb data mining by providing data mining in the inner loop of the DBMS answer set generator, but have since changed my mind. Figure 1 is not the inner loop of a query, but the final result of a complex drill-across application, probably generated well above the DBMS’s inner loop. Maybe the way to say it is that detailed behavior needs to be described within a comprehensive context, not within a narrow query. In any case, I think the architecture is shaking out to be more of a handoff. The webhouse produces the observations at a very granular level and embellishes them with detail, then hands them off to the data miner as a flat file for high-performance, repeated access.

Learning More

Be sure to check out Michael Berry and Gordon Linoff’s new book, Mastering Data Mining (Wiley, 2000). As in their first book, they really focus on the delivered value of data mining rather than the details of the individual tools. I reviewed it recently, and I think it will become the seminal book on data mining. It also meshes very nicely with my view of the handoff described in this column.



Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He also wrote the best-selling books The Data Warehouse Toolkit (Wiley, 1996) and The Data Warehouse Lifecycle Toolkit (Wiley, 1998). Ralph teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach Ralph through his Web site at www.ralphkimball.com.




