Validation Victory

Which of many data mining algorithms and options are best for your solution? With validation techniques you can make a solid decision.

You want the past to inform the future, right? But each expert makes a different claim about what wisdom you can glean from that past. The same dilemma holds in data mining. You want the data you collect to make your next actions smarter. But don't fool yourself. Data mining uses learning algorithms and models that, just like any expert, project forward differently depending on how past experiences are understood, selected, categorized, organized, and encapsulated. Many data mining packages, such as those offered by SAS Institute Inc., SPSS Inc., and Silicon Graphics Inc., provide many different algorithms for any potential application, offer many dialog box options within each algorithm, and include algorithms that can generate a wide range of differing models from which you have to select. So how do you pick the best one for any given solution?

Selecting the Best Model

In the best of all possible worlds, a well-delineated and tested theory such as Newtonian physics is the strongest basis for a choice. Newtonian physics, for example, specifically tells you to use a linear model to plot the relation between distance and time for an object moving at a constant velocity. But rarely in business are things so easy. Solid knowledge of real-world issues, and of how they bear on modeling options implicitly or explicitly, is an ideal second choice. Outside a few areas such as finance and engineering, however, you seldom have anything near the full picture.

Lacking that full picture, validation samples can come to the rescue. A validation set is a data sample, different from the one used to generate the models, that helps you pick the better model or model-generating algorithm. For example, Figure 1 shows two competing models of the same primary data set (green dots).
The squiggly blue line is a sixth-degree polynomial. The straighter black line is a third-degree polynomial. Which is right? Checked against another data sample, the validation set (yellow squares), the black line is clearly the better explainer.

Selecting the Best Algorithm

But how do you choose among the many algorithms offered to classify data (such as decision trees and neural nets), or even among the many options within a single algorithm (such as the number of hidden nodes in a neural net)? Every choice represents a substantially different modeling form, often more distinct than the black and blue lines in Figure 1. These seemingly arbitrary choices can result in models that recommend vastly different actions in real-world situations. How can you decide which of the many data mining algorithms or options has the best chance of producing the best forecasting model?

Figure 1 shows how you can use a validation sample to decide between two models. But to decide among a number of model-generating algorithms, I recommend a technique called N-fold cross validation (the form described here is also commonly known as leave-N-out cross validation). In this method, each rival algorithm builds a model from all but N of the data points, and that model is then scored on the N points that were held out. You repeat this process for every possible selection of N points. The procedure gives you a rich set of examples of each algorithm's forecast performance, so you can choose the best one for your solution needs.
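The cross validation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular data mining package: the two rival "algorithms" (`fit_mean` and `fit_line`), the helper `leave_n_out_error`, and the sample `data` are all hypothetical names and made-up values chosen to keep the example small. Each algorithm is trained on all but N points, scored by squared error on the N held-out points, and the process is repeated for every selection of size N.

```python
from itertools import combinations
from statistics import mean


def fit_mean(points):
    """Hypothetical algorithm A: ignore x and always predict the average y."""
    avg = mean(y for _, y in points)
    return lambda x: avg


def fit_line(points):
    """Hypothetical algorithm B: ordinary least-squares straight line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def leave_n_out_error(fit, points, n):
    """Mean squared error on held-out points, over every subset of size n."""
    total, count = 0.0, 0
    for held_out in combinations(range(len(points)), n):
        held = set(held_out)
        # Train on everything except the held-out points.
        train = [p for i, p in enumerate(points) if i not in held]
        model = fit(train)
        # Score the resulting model on the points it never saw.
        for i in held_out:
            x, y = points[i]
            total += (model(x) - y) ** 2
            count += 1
    return total / count


# Made-up illustrative data, roughly y = 2x + 1 with small deviations.
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8),
        (4, 9.1), (5, 11.0), (6, 12.9), (7, 15.2)]

algorithms = {"mean": fit_mean, "line": fit_line}
scores = {name: leave_n_out_error(fit, data, n=2)
          for name, fit in algorithms.items()}
best = min(scores, key=scores.get)
print(scores)
print("Lowest cross-validated error:", best)
```

Because the number of selections of size N grows combinatorially with the size of the data set, exhaustive enumeration like this is only practical for small samples; with larger data you would typically score a limited number of held-out groups instead.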











