Is Your Data Correct?Simple statistical techniques can help you ensure that users have accurate information at their fingertipsby Ralph KimballOne common problem in the data warehouse back room is verifying that the data is correct before you release it to your users. Is the warehouse an accurate image of the production system? Was this morning's download complete? Could some of the numbers be corrupted? There is no single technique for validating a data load because so much variability exists in the data sources. If you are downloading an unchanged image of a production source (preserving the original granularity), you can probably run a "flash report" on the production system with up-to-the minute totals, then recapitulate the same report on the data warehouse. In this case, you "know" the answer before running the report, and the two results should match to the last decimal place. But more often you will not have a known baseline of data. For example, perhaps you are receiving the individual sales transactions from 600 retail stores every night. You can certainly perform a gross count on the number of stores reporting, but how can you apply some additional judgment to determine the probability that the data is correct? Continuing with the 600 stores as an example, let's look at the sales totals for each department within each store for each morning and ask if today's new numbers are reasonable. We will decide that today's sales total is reasonable if it falls within three standard deviations of the mean of the previous sales totals for that department in that store. I chose three standard deviations because in a "normal" distribution, 99 percent of the values lie within three standard deviations above or below the mean. If all the data values are valid, approximately one percent of them will still fail our test. Perhaps after reviewing these outliers, we will decide that the overall data load seems reasonable. I'll describe the process of checking the data using this simple technique just in words. After that I'll include some SQL, but you can skip the SQL and still get the basic idea. In order to make this process run quickly, you want to avoid looking at the complete time history of old data when you are calculating the standard deviation. You can do this by keeping three accumulating numbers for each department in each store in a special table used only in the data validity pass. You need to keep the number of days you are accumulating, the accumulating sum of each day's sales (by department in each store) and the accumulating sum of the Using this accumulating department table, look at all 12,000 department totals in the morning's data load, and kick out the morning's numbers that are more than three standard deviations from the mean. You can choose to examine the specific numbers with unusual values if there aren't too many, or you can reject the entire load if you see more than one percent of the data values being flagged as out-of-bounds. If the morning's load passes muster, release the data to your end users and update the accumulating department table to get ready for the next day's load. Here's some untested SQL that might work in this scenario. Recall that the standard deviation is the square root of the variance. The variance is the sum of the squares of the differences between each of the historical data points and the mean of the data points, divided by N21, where N is the number of days of data. Normally the calculation of the variance requires us to look at the entire time history of sales, which although possible, makes the computation unattractive. But if you have been keeping track of So if you abbreviate your variance formula with "VAR," then your data validity check looks like this:
First, take care of joins between tables:
Then, constrain the time to today to get the newly loaded data:
Expand Embellishments on this scheme could include running two queries: one for the sales more than three standard deviations above the mean, and another for sales less than three standard deviations below the mean. Maybe there is a different explanation for these two situations. Running two queries would also get rid of the ABS function if your SQL doesn't like this in the If you normally have significant daily fluctuations in sales (let's say Monday and Tuesday are very slow compared to Saturday), you could add a Judging Data Quality With No HistoryThere are some special situations in which you can calculate the expected mean and variance of incoming data even if you have not been accumulating history. Suppose you are collecting data from a large number of individual respondents within the age range of 40 to 49 years. If you have no reason to believe that there is a bias that preferentially selects one gender over another, you can use the ratio of males in the population at large (assume 47/100) and the ratio of females (therefore 53/100) as the basis of your statistics. A statistician would say that your data represents a set of Bernoulli trials, and the resulting data should fit a binomial distribution. Consulting your statistics textbooks, you know that if you have recorded 1,000 respondents, you would expect the mean number of males to be N*P where N= 1,000 and P=47/100. This works out to 470 males in your sample. Seems reasonable. From the textbooks, the variance is N*P*(12P), which is 1,000* 0.47*0.53=249.1. The standard deviation is the square root of the variance, which in this case is 15.78. Again, applying the three standard deviation criteria, you will worry about any data element that is more than three standard deviations away from the mean. In other words, if you see that the number of males reported in a sample of 1,000 is less than 47023*15.78=422.6 or is greater than 47013*15.78=517.3, stop to ask if the reported data is reasonable. Consider whether you are comfortable assuming that your data is generated with a constant probability. If you are, you can use this technique without accumulating a complex past history. Other examples in which the constant probability assumption is plausible might include the yield percentages of manufacturing processes such as wafer fabs, paper production, or steel mills. Compensating for Predictable ChangesMany of the measures (facts) that occur in a data warehouse come from marketplaces that exhibit predictable seasonal fluctuations. The holiday season and the summer are predictable, high-volume periods. Remove this seasonality as a source of variance when you judge your incoming data. Similarly, maybe business has been growing at an accelerating rate for the past several years; this acceleration could even be nonlinear. Finally, there may be specific days in the week or month during which you know your business volume is predictably high or low. You can remove these sources of variance in your data in order to judge whether your data is reasonable, but you need some high-powered statistical help. You will want to project your previous data forward in time to today, then compare this projection with the data that has just come in. Professional statisticians for many years have used a technique known as X-11-ARIMA to remove these effects from time series data. ARIMA is an acronym standing for auto-regressive integrated moving average. The X-11 algorithm takes your existing data and projects it forward. In other words, it tells you what it expects today's data value to be. The venerable X-11 algorithm, which statisticians have been using in various forms for more than 30 years, has been superseded recently by X-12-ARIMA, which offers more flexibility for dealing with constant (nonseasonal) trends and short term effects such as trading day fluctuations. All the serious statistics software vendors, such as SAS, SPSS, SYSTAT, and others, will have X-12 modules that you can use to filter out unreasonable data values. Most of the serious extract-transform-load (ETL) vendors such as Informatica Corp., Sagent Technology Inc., Ardent Software Inc. (now part of Informix), and Evolutionary Technologies International (ETI) offer "transform" modules with advanced statistics processing, including X-11-ARIMA and X-12-ARIMA. Have a talk with your ETL vendor and ask them how to attach one of the serious statistics packages to your backroom data flow. Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including the newly released The Data Webhouse Toolkit (Wiley, 2000). Ralph teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach Ralph through his Web site at www.ralphkimball.com. |
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|











