Erik Thomsen
December 1998, Volume 1 Number 3
It's an Uncertain World
The ability to manage uncertainty goes a long way toward intelligent decision making
Possessing accurate information (facts, beliefs, knowledge, or reports) is one of the essential requirements for good decision making. While philosophers from Aristotle to Wittgenstein have pondered the degree to which we can accurately know anything and the assumptions upon which any notions of accuracy are based, statisticians have attempted to lay out rules for quantifying the uncertainties of different types of measurements and calculations. Predicted measures of response for direct mailings, estimates of default rates for loans, and other data mining-based derivations have associated uncertainties. This is common knowledge. No analyst would consider presenting a data mining-based relationship in a decision-support context without a measure of uncertainty. In a decision tree, uncertainty is usually estimated in terms of the ratio of misclassified records to the total number of records; for regression-based models, it may be the correlation coefficient. But what about other beliefs not based on data mining; are they uncertain, too?
Where does uncertainty fit into a DSS framework? Anywhere there's data. Consider a typical sales report. How many times have you seen one without any further qualification assigned to the numbers? After all, we're just looking at numbers, and numbers wouldn't lie. If this were your report and your boss asked whether you were willing to bet your life, or your job, on the veracity of the numbers, and if all you had done was create the query, would you answer yes? I hope not. If it were me, the doubts would start to seep into my brain. What if just one of our 250 order-entry operators had made an error in recording even one sale? What if one of the local retail operations' point-of-sale data didn't upload to the regional warehouse? What if any missing data in the warehouse was erroneously treated as a zero? What if there was an error in the OLAP aggregations?
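To make the missing-data question concrete, here is a minimal Python sketch of how one unreported store changes an average depending on whether the gap is treated as a zero or as an unknown. The store names and sales figures are invented for illustration, and `summarize` is a hypothetical helper, not part of any OLAP product.

```python
# Hypothetical daily sales by store; None marks a store whose
# point-of-sale upload never reached the warehouse.
store_sales = {
    "store_a": 1200.0,
    "store_b": 950.0,
    "store_c": None,
    "store_d": 1100.0,
}


def summarize(sales):
    """Compare the two treatments of missing values in an average."""
    reported = [value for value in sales.values() if value is not None]
    if not reported:
        raise ValueError("no store reported any sales")
    total_known = sum(reported)
    return {
        # Fraction of stores that actually reported: a crude quality indicator.
        "coverage": len(reported) / len(sales),
        # Missing treated as zero: the gap silently drags the average down.
        "mean_missing_as_zero": total_known / len(sales),
        # Missing treated as unknown: average over reporting stores only.
        "mean_reported_only": total_known / len(reported),
    }


summary = summarize(store_sales)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Reporting the coverage figure next to the aggregate is one simple way to tell the reader how much of the number rests on data that actually arrived.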
It's one thing to doubt the truth of these sales numbers; it's another thing to quantify that doubt. You need to be able to measure the uncertainty in order to make intelligent choices regarding data quality improvement. If you can't measure the uncertainty, it's hard to be anything other than a skeptic or a naïve realist.
Sources of Uncertainty
There are two different sources of data uncertainty: predictions and historical measures. (The present is an illusion.) Because there is more awareness of the uncertainties associated with data mining-style predictions, I will focus on the uncertainties associated with other common predictions, typically found in budgets and plans, for which uncertainty estimates are rarely supplied. With respect to historical measures, I will look at the uncertainties that arise from the data and the processing.
Point Estimates of the Future
In many budgeting, marketing, and other typical analytical activities routinely performed with spreadsheets or OLAP tools, uncertainties about the future are incorporated into the analytical models as point (single-valued) estimates. For example, a budgeting exercise may include a single estimate for next year's unit sales or headcount. A financial application may include a point estimate for interest or unemployment rates. A healthcare application may include point estimates for physician productivity. Consider a typical effort. Imagine you want to run a promotion on razors, and you need to estimate which of three possible promotional strategies is best: radio, newspaper, or direct mail. Assume you incorporated differences in unit cost as a function of how many products you would purchase as well as the cost of under- and overstocking. A simple promotional strategy model would point to newspaper ads as the most profitable strategy. But what about the competition?
Surely, the effectiveness of the promotional campaign is partly a function of whether your competitors are promoting razors at the same time. You don't know whether or not there will be competing promotions, but you can assign a probability and estimate how well you would do under three different scenarios: low competition, normal competition, and high competition. The way to incorporate the possibilities of different amounts of competition is by adding some probabilistic notion to your model, and the most flexible way to add this is by using a decision analysis tool (especially for more realistic and complicated models with many sources of probabilistic uncertainties). Figure 1 shows an influence diagram for the promotional decision-making process. You're trying to choose a promotional strategy that maximizes the profits of the promotion. Profit is a function of the number of units sold, the cost of those units (including any returns or rain checks), the price charged per product, and the cost of the promotion. The major unknown left out of the spreadsheet model, which you're incorporating in the influence diagram, is the likelihood of competing promotions and their effect on sales, which is represented in Figure 2 by the competition node. When you have assigned the likely results of each promotional strategy as a function of the level of competition, a decision analysis tool will simulate a number of scenarios in accordance with the probabilities you've specified and determine the decision that has the greatest likelihood of maximizing the value of the output variable (in this case, the most likely profit). Figure 3 shows that by taking the probabilities of differing levels of competition into account, the best decision proves to be the direct mail strategy. Thus, you can improve the accuracy of your predictions and the quality of the decisions you base upon those predictions by replacing single-valued estimates of the future with probabilities or distributions.
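The contrast between a point estimate and a probability-weighted estimate can be sketched in a few lines of Python. Every number below is invented for illustration and is not taken from the figures; the function names are hypothetical. The sketch also computes an exact expected profit rather than simulating scenarios the way a decision analysis tool would, which is a simpler stand-in for the same idea.

```python
# Hypothetical probabilities for each level of competition (must sum to 1).
competition_probability = {"low": 0.3, "normal": 0.4, "high": 0.3}

# Hypothetical profit, in thousands of dollars, for each strategy
# under each level of competition.
profit = {
    "radio": {"low": 50, "normal": 40, "high": 20},
    "newspaper": {"low": 70, "normal": 60, "high": 10},
    "direct mail": {"low": 60, "normal": 55, "high": 45},
}


def best_by_point_estimate(profit, assumed_level="normal"):
    """Pick the strategy that wins if only one competition level is considered."""
    return max(profit, key=lambda strategy: profit[strategy][assumed_level])


def expected_profit(profit, probability):
    """Weight each strategy's profit by the probability of each competition level."""
    return {
        strategy: sum(probability[level] * value for level, value in outcomes.items())
        for strategy, outcomes in profit.items()
    }


def best_by_expected_profit(profit, probability):
    """Pick the strategy with the highest probability-weighted profit."""
    expected = expected_profit(profit, probability)
    return max(expected, key=expected.get)


print("Point estimate picks:", best_by_point_estimate(profit))
print("Expected profit:", expected_profit(profit, competition_probability))
print("Probability-weighted pick:", best_by_expected_profit(profit, competition_probability))
```

With these made-up inputs, newspaper ads look best when only normal competition is assumed, while direct mail comes out ahead once the chance of heavy competition is weighed in, which mirrors the reversal described above.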
Historical Measures
So much for representing uncertainty about the future. How about uncertainty about the past? Uncertainty enters the picture from the past both in the way that data is processed and in the data itself.
Application Processing
Many of you have read about the ways that data quality can deteriorate within legacy systems, as well as within data warehouses that don't address data quality issues explicitly. For example, the same customer information is encoded in different ways in different systems, which results in the appearance of more customers than there really are and, thus, in the inability to analyze customers properly. Or the same customer has his or her name spelled different ways with each order, again making one person look like multiple persons. Or data may be replicated from one system to another incorrectly. We typically place this basket of issues under the category of data transformation, cleansing, or reengineering. While there are many products that can help improve the quality of your data, there is still a lack of accepted methods for adequately describing the quality of the existing data. You could use a quality indicator, for example, to estimate the difference in effectiveness between running a data mining routine on poorer quality data vs. higher quality data. Such an indicator would let you perform a cost-benefit analysis on the value of cleaning the data in terms of, say, changes to the response rate of a promotion based on cleansed vs. uncleansed data. Other issues related to application processing uncertainty that cannot be addressed through data reengineering stem from analytical calculations. Everything from estimating regional sales to allocating costs across business units to calculating profitability by product category is subject to uncertainty. We may calculate the variables differently from place to place, and the formulas may not reflect them correctly.
The application may treat missing sales values as zeros, which would affect aggregate sales values. Addressing these areas requires some form of application auditing capability, which means being able to trace calculations. It also implies that you have some way of grading derived data in terms of its expected quality. Processing-related uncertainty also stems from the use of sampling to speed up query performance. I recently performed an architectural review of a relational OLAP-style product, Metacube from Informix, that is unique because it offers sampling capabilities. When it is in sampling mode, the product refers to a sample table rather than a source table to answer queries and returns the error associated with each value as a function of the ratio in size between the sample table and the original table. You might see, for example, that sales for New England were $100,000 ± $5,000. The system administrator gets to pick the size of the sample table, which in turn determines the speedup in the queries and the size of the errors. There is a glitch, however. The sampling algorithm assumes that the errors are distributed evenly across all combinations of dimension elements. Unfortunately, this assumption is not warranted. In other words, just because the variance for all products for all times for all stores is 10 percent doesn't mean that the variance for any particular store-product combination is going to be 10 percent. In fact, given the way that sampling works, there may be some store-product combinations for which there are no samples. The company is now working to factor in all the conditional probabilities (that is, the probabilities across all combinations of dimension elements) without creating such a combinatorial explosion that it would be more efficient to go after the source tables. At least one OLAP vendor is addressing it.
Data Collection
Finally, there are many data collection-based errors that creep into a DSS.
Some of these errors, such as putting the right data (such as a name) in the wrong field (such as an address field), are addressed by the same crop of data reengineering tools that handle problems in data processing. The same kinds of data transformation and reengineering can't address other data collection problems, however. How, for example, would you determine just by looking at the data that the sales amount for a cash transaction was incorrect, the response card filled out by a customer had intentional errors, a single product ID in an order containing many products was an ID for a different product, or the number of units listed on an invoice for supplies received did not match what was actually received? You couldn't. At some point in the data quality testing process, you need to make an appeal to reality. Such a data collection audit helps establish the mean error rate with which you collect different types of data, from customer orders to shipping information. Then, you assign quality indicators to your measurements. Although traditional DSSs do not provide robust uncertainty management yet, in the environmental management and policy research world (a very complex DSS arena), a number of people are working on this challenge. I look forward to the day when the fruits of these and other research efforts reach the world of finance, sales and marketing, and other more common DSSs, which would give me some measure of confidence to associate with every data element in my system and an idea of the likelihood that any reported number is off by more than some threshold amount.
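As a rough illustration of what a data collection audit might yield, the following Python sketch turns audit counts into an estimated error rate with a 95 percent interval. The audit counts are invented, `wilson_interval` is a hypothetical helper name, and the Wilson score interval is just one reasonable choice of method; the article does not prescribe one.

```python
import math


def wilson_interval(errors, audited, z=1.96):
    """Wilson score interval for an error rate; z=1.96 gives roughly 95% coverage."""
    if audited <= 0:
        raise ValueError("audited must be positive")
    if not 0 <= errors <= audited:
        raise ValueError("errors must be between 0 and audited")
    rate = errors / audited
    z2 = z * z
    denominator = 1 + z2 / audited
    center = (rate + z2 / (2 * audited)) / denominator
    half_width = (
        z * math.sqrt(rate * (1 - rate) / audited + z2 / (4 * audited * audited))
    ) / denominator
    return center - half_width, center + half_width


# Hypothetical audit: 400 customer orders checked against reality, 12 found wrong.
audited_orders = 400
wrong_orders = 12

observed_rate = wrong_orders / audited_orders
low, high = wilson_interval(wrong_orders, audited_orders)
print(f"Observed error rate: {observed_rate:.1%}")
print(f"Approximate 95% interval: {low:.1%} to {high:.1%}")
```

An interval like this could serve as the quality indicator attached to each type of collected data, and a larger audit narrows it.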
Erik Thomsen is an author, lecturer, researcher, and consultant focusing on OLAP and decision-support applications. He is cofounder of the Cambridge, Mass.-based consultancy Dimensional Systems and author of the book OLAP Solutions (John Wiley & Sons, 1997). He wrote the Decision Support column for Database Programming & Design. You can reach him via email at erik@dimsys.com.