Validation Victory

Which of many data mining algorithms and options are best for your solution? With validation techniques you can make a solid decision.

Continued from Page 1

Interestingly, a new company, Sightward Inc., offers software that automatically uses this method to pick the best algorithm for several applications from a wide range of popular data mining algorithms. Because of validation methods, the company is so confident it can improve models that it audaciously guarantees an increase in its clients' ROI.

Narrowing It Down to One Model

The output of some popular algorithms is really not one model at all, but a range of models. Figure 2 shows an example of this in a decision-tree process. Using the primary data set, consecutively selected independent variables are sliced at break points to produce increasingly focused definitions of who bought product A vs. who didn't. This slicing can easily go well beyond reality, producing rules that fit the points of the primary data set well but generalize poorly. The set of business rules up to any point can be considered a model. So where do you stop?

In Figure 2, the numbers in brackets represent acceptance rates: the left for the generating data and the right for a validation sample. By looking at acceptance rates in the validation sample, you can see that, as of Rule 3, the rules start becoming less general: acceptance rates fall from 37 percent to 29 percent. The apparent advantages of Rule 3 are only happenstance, unique to the generating (primary) data. Validation indicates that a superior model would include Rules 1 and 2, but not 3.

As another example, memory-based reasoning techniques can in fact produce an infinite number of models. Figure 3 illustrates a fundamental, perplexing issue in learning and modeling: what does it mean for two events to be like each other? The diagram shows a primary data set mapped into a space defined by the independent variables (W, X).
The white Ys and Ns represent known successes and failures, respectively. A forecast for a new case is determined by its proximity to these values. The concentric ovals indicate the wide range of ways to define proximity. With the addition of a validation sample (the lowercase Ys), the light-green oval becomes the extension of choice from the white Ys.

The Right Choice

This situation is no trivial matter: Actions based on correct models can be highly lucrative, and conversely, actions based on incorrect views of the world can result in bankruptcy. One of the reasons for the vast diversity of CRM initiative results (and, therefore, opinions about their value) is analysts' differing ability to understand the implications of options such as those I've mentioned. Data mining can rapidly generate intricate models but, unfortunately, way too many of them. Which is right? Validation techniques offer a set of powerful methods to help researchers make the right choices.

Barry Grushkin is chairman and CTO of The Machine Intelligence Development Co., a group specializing in sophisticated data mining and constantly improving data mining techniques and methods.

RESOURCES

SAS Institute Inc.: www.sas.com
Sightward Inc.: www.sightward.com
Silicon Graphics Inc.: www.sgi.com
SPSS Inc.: www.spss.com

Related Articles at IntelligentEnterprise.com:

"Connect the Dots," March 1, 2000: www.intelligententerprise.com/000301/decision.jhtml
"The Quest for Speed," Nov. 12, 2001: www.intelligententerprise.com/011112/417decision1_1.jhtml
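
A rough sketch of the decision-tree stopping rule discussed under "Narrowing It Down to One Model" may help make it concrete. It keeps adding rules only while the acceptance rate in the validation sample keeps rising. The function name and the data layout are assumptions made for illustration, and only the 37 percent and 29 percent validation figures come from the article; every other number is a hypothetical placeholder, not a value read from Figure 2.

```python
from typing import List, Tuple


def best_rule_count(rates: List[Tuple[float, float]]) -> int:
    """Return how many leading rules to keep.

    rates[i] holds (generating_rate, validation_rate) for the model made of
    rules 1 through i + 1. Rules are kept while the validation rate keeps
    improving; the first rule that fails to improve it stops the search.
    The generating rate is deliberately ignored when deciding where to stop.
    """
    keep = 0
    best = float("-inf")
    for _generating_rate, validation_rate in rates:
        if validation_rate <= best:
            break  # this rule fits the primary data only, so stop here
        best = validation_rate
        keep += 1
    return keep


# Hypothetical acceptance rates. Only the validation values 0.37 (Rule 2)
# and 0.29 (Rule 3) are taken from the article; the rest are invented.
rates = [
    (0.30, 0.31),  # Rule 1
    (0.38, 0.37),  # Rules 1-2
    (0.45, 0.29),  # Rules 1-3: better on generating data, worse on validation
]

print(best_rule_count(rates))  # 2
```

The same pattern would apply to the memory-based reasoning case: each candidate definition of proximity is scored on the validation sample, and the best-scoring one is kept.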