
Intelligent Enterprise

Better Insight for Business Decisions





March 8, 2002


To Believe or Not to Believe

You can smoke out false data mining prophets with a solid empirical approach

By Barry Grushkin

Continued from Page 1

You then collect data to test that hypothesis. Well-established statistical methods let researchers decide whether the measured data confirms or rejects the initial hypothesis. The mathematics also indicates how often the procedure will reject a hypothesis that is actually true. That error rate is technically called the "level of significance."
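To make the idea of a significance level concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the two-sided z-test with a known standard deviation, the 1.96 cutoff (a 5 percent level), the sample size, and the helper name rejects_null are not from the article. Every simulated sample is drawn so that the hypothesis is true, so each rejection is a wrong conclusion, and the observed rejection rate should land near 5 percent.

```python
import math
import random

def rejects_null(sample, null_mean, sigma, z_crit=1.96):
    """Two-sided z-test with known sigma; True means 'reject the hypothesis'."""
    standard_error = sigma / math.sqrt(len(sample))
    z = (sum(sample) / len(sample) - null_mean) / standard_error
    return abs(z) > z_crit

rng = random.Random(42)  # fixed seed so the run is repeatable
trials = 4000
false_rejections = 0
for _ in range(trials):
    # The hypothesis "mean = 0" is true for every simulated sample.
    sample = [rng.gauss(0.0, 1.0) for _ in range(30)]
    if rejects_null(sample, null_mean=0.0, sigma=1.0):
        false_rejections += 1

rejection_rate = false_rejections / trials
print(f"rejected a true hypothesis in {rejection_rate:.1%} of trials")
```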

Ideally, models should be subjected to the same sort of hypothesis testing. After all, a model is a hypothesis about a set of relations you believe exist between variables. For example, rather than stating sales as a fixed value, a model might hypothesize a linear relationship between sales and the total number of hours the sales force is on the road. Usually an error term is added to represent what the model cannot account for.
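The sales example can be sketched as a small least-squares fit. The data here is synthetic and the numbers (a base of 50, 2 units of sales per hour, noise with a standard deviation of 20) are invented purely for illustration, as are the names fit_line, hours, and sales. The residuals list plays the role of the error term: it is whatever the linear hypothesis leaves unexplained.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical data: hours on the road, and sales generated as
# sales = 50 + 2 * hours + random error.
hours = [rng.uniform(100.0, 500.0) for _ in range(200)]
sales = [50.0 + 2.0 * h + rng.gauss(0.0, 20.0) for h in hours]

a, b = fit_line(hours, sales)
# The error term: what the fitted line cannot account for.
residuals = [s - (a + b * h) for h, s in zip(hours, sales)]
print(f"sales = {a:.1f} + {b:.3f} * hours")
```

Plotting or summarizing residuals against hours is the usual way to look for the patterns that hint at a missing variable or a different functional form.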

You can read a lot from the error terms, such as whether another modeling form is warranted or whether a variable should be added. Although a suggestion for improving a model can be valuable, acting on it can be dangerous if not done right.

If you start with competing modeling hypotheses, such as whether a linear or quadratic model is better, you can collect data to perform a comparative test. However, in practice, statistical models are often chosen by first looking at the data. Only after looking at the errors is a different modeling form tried. In fact, looking at the data first is also what data mining does: An algorithm uses data to create a model. But don't let the fancy exterior fool you: Data-mining algorithms are still only methods for picking modeling hypotheses! Furthermore, the complexity of the algorithms limits opportunities to verify assumptions (both in the model creation process and in the model itself) externally, against known properties of the data and the real world. Very often, different algorithm options and variations are tried as well.

WHICH TO CHOOSE?

In both the statistical and data mining variations, you end up with sets of hypothesis candidates. You still need to choose which hypothesis to test.

If you base your choice on the data set that generated the model, you have no way of knowing which model is the better generalization about reality. If you run numerous model variations on a single data set, you may simply find the one that best fits that particular set.

Generally, the best thing to do is use a few previously unused data samples to pick which model to recommend. This is called the validation step.
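The validation step can be sketched as follows, under loudly stated assumptions: the data is synthetic (a truly linear relation plus noise), the candidate hypotheses are polynomials of degree 1 through 5, a single held-out validation sample is used where the text suggests a few, and the helper names fit_poly, mse, and make_sample are hypothetical. Error on the model-building sample can only shrink as the polynomial gets more flexible, so it cannot tell you which candidate generalizes. The choice is made on data the models have never seen.

```python
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.
    Returns coefficients, lowest power first."""
    m = degree + 1
    # Augmented matrix [X^T X | X^T y] for the monomial basis.
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]
    return [A[i][m] / A[i][i] for i in range(m)]

def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial model on a sample."""
    def predict(x):
        return sum(c * x ** k for k, c in enumerate(coeffs))
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_sample(rng, n):
    """Hypothetical data source: y = 1 + 2x + noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(7)  # fixed seed so the run is repeatable
build_x, build_y = make_sample(rng, 15)    # sample used to create the models
valid_x, valid_y = make_sample(rng, 200)   # previously unused sample

degrees = [1, 2, 3, 4, 5]
models = {d: fit_poly(build_x, build_y, d) for d in degrees}
build_err = {d: mse(models[d], build_x, build_y) for d in degrees}
valid_err = {d: mse(models[d], valid_x, valid_y) for d in degrees}

# Pick the candidate on the validation sample, not the building sample.
chosen = min(degrees, key=lambda d: valid_err[d])
for d in degrees:
    print(f"degree {d}: build MSE {build_err[d]:.3f}, validation MSE {valid_err[d]:.3f}")
print("chosen degree:", chosen)
```

Note that the chosen model's validation error is itself now an optimistic estimate, because the validation sample was used to make the choice. That is why a separate test sample is still needed afterward.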




Once you have a final candidate, you still need additional test samples to do a real test. And because model development is commonly an iterative process, remember that if you decide to tweak or improve your model after the test, you will need still more fresh data to test the revised version.

So if you really want confidence in your model and a solid foundation for your methods, you need to plan. This is particularly true in data mining, where external validation of the concepts is much harder to come by. You need to set aside production, validation, and testing samples for each expected cycle of the process.
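One way such a data plan might look in code is sketched below. The function name plan_samples, the 60/20/20 split, and the choice of three cycles are all assumptions for illustration, not recommendations from the article. The point is only that the records are partitioned up front, so every cycle of model revision gets production, validation, and testing samples that no earlier cycle has touched.

```python
import random

def plan_samples(records, cycles, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle once, cut into `cycles` disjoint blocks, and split each
    block into production, validation, and testing samples.
    fractions[2] is implied: testing gets whatever remains of the block."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    block = len(shuffled) // cycles  # leftover records are simply unused
    n_prod = int(block * fractions[0])
    n_val = int(block * fractions[1])
    plan = []
    for c in range(cycles):
        chunk = shuffled[c * block:(c + 1) * block]
        plan.append({
            "production": chunk[:n_prod],
            "validation": chunk[n_prod:n_prod + n_val],
            "testing": chunk[n_prod + n_val:],
        })
    return plan

# Hypothetical usage: 1,000 record IDs, three expected modeling cycles.
plan = plan_samples(range(1000), cycles=3)
for number, cycle in enumerate(plan, start=1):
    sizes = {name: len(sample) for name, sample in cycle.items()}
    print(f"cycle {number}: {sizes}")
```

A random shuffle is the simplest choice. If the records are ordered in time, or if some groups are rare, a time-based or stratified split may be more appropriate.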

Companies have big data sets. Now is the time to use them to do data mining the right way. Why make decisions based on garbage models when with a little data planning you can have confidence that you are modeling the future interaction, not just stating in mathematical terms the arbitrary happenstance of the past? With or without voices from the heavens, planning is always a key ally to mortal success.


Barry Grushkin is chairman and CTO of The Machine Intelligence Co., specializing in sophisticated data mining and comparative analysis of business intelligence technologies.







