Let's be levelheaded about data mining simultaneously at many levels. (By "levels," I mean levels in a database hierarchy.) Many businesses know that the massive data sets they collect can be mined for business gems of great value. But so much business data these days is hierarchical in nature - with dimensions, for example, depicting levels of time (year, month, and day), geography (country, state, and city), nested categories of customers, and different levels of detail in product descriptions.
In contrast, most data-mining methods, even sophisticated ones, generally require flat files in which each record is a case. This mismatch is not as much of a problem as you might think. In fact, there is a lot of potential to bring these data sets and mining methods together, if you can sort through the large number of possibilities. One of the many options is to mine at many levels at the same time.
In the past I have written about other related issues and options. In "Predicting Movement" (August 3, 1999) I gave an example of a fairly common problem - the level of aggregation affects your understanding of the data: You get different results, fit curves differently, or discover decision rules differently. You can, for example, think of online analytical processing (OLAP) or multidimensional queries as offering options for preaggregation before data mining starts. In "Tale of Two Worlds" (August 24, 1999), however, I cited a number of examples of the reverse - where data mining results can help you construct cube dimensions that look like promising lenses for discovery.
Here, let's look at half a dozen solutions where the interaction between data mining and database cubes is even more integral - where information at different levels is combined or mined at the same time. Data mining's reputation for dealing only with flat files made of sets of n-tuples or case records starts to look less constraining than you might at first expect. Nearly every way you can pivot a database cube (DB-cube) reveals a data mining cube (DM-cube).
A DB-cube consists of a meaningful combination of sets of hierarchical dimensions in a database, the kinds that could, for example, make up an OLAP cube. A DM-cube, by contrast, is the rectilinear subset of Euclidean space (dimensions without explicit hierarchies) on which many data-mining algorithms are thought to operate.
(See the sidebars "Mental Block" and "One-to-Many Data Mining" in the online version of this column at www.IntelligentEnterprise.com for more about the relationship between DM-cubes and DB-cubes.)
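To make the DB-cube and DM-cube distinction concrete, here is a minimal sketch of one way to turn hierarchical data into flat cases without discarding the hierarchy: each leaf record carries the totals of its ancestors as extra columns. The geography hierarchy, the `FACTS` rows, and the function names are all invented for illustration and do not come from any product discussed in this column.

```python
from collections import defaultdict

# Hypothetical fact rows: (country, state, city, sales). All values are invented.
FACTS = [
    ("USA", "CA", "Los Angeles", 120.0),
    ("USA", "CA", "San Diego", 80.0),
    ("USA", "NY", "New York", 150.0),
    ("Canada", "ON", "Toronto", 90.0),
]

LEVELS = ("country", "state", "city")  # coarse to fine


def roll_up(facts, depth):
    """Aggregate sales to the first `depth` levels of the hierarchy (DB-cube view)."""
    totals = defaultdict(float)
    for *path, sales in facts:
        totals[tuple(path[:depth])] += sales
    return dict(totals)


def flatten_all_levels(facts):
    """Emit one flat case per leaf row, with ancestor totals as columns (DM-cube view)."""
    by_depth = {d: roll_up(facts, d) for d in range(1, len(LEVELS))}
    cases = []
    for *path, sales in facts:
        case = dict(zip(LEVELS, path))
        case["sales"] = sales
        # Attach the total of each ancestor level, e.g. country_sales, state_sales.
        for d in range(1, len(LEVELS)):
            case[f"{LEVELS[d - 1]}_sales"] = by_depth[d][tuple(path[:d])]
        cases.append(case)
    return cases


cases = flatten_all_levels(FACTS)
```

Each entry in `cases` is an ordinary flat record that a conventional mining algorithm could consume, yet it still carries information from every level above it.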
A Half-Dozen Solutions
Following are examples of six kinds of data mining and hierarchy integration. There is a lot more being done than you might at first imagine.
1. Optimization at many levels simultaneously. PolyVista Inc.'s PolyVista is a cube, data mining, and visualization software package that communicates with Microsoft SQL Server via MDX. What makes it particularly interesting to this discussion is that it contains a multilevel optimizer.
This optimizer can search over all levels until it finds cell partitions of interest. It can look for views that most clearly separate profitable from unprofitable geographic regions, periods, customers, and so forth, at any level of the hierarchy where the differences might be most pronounced. It draws your attention to significant differences that deserve further investigation and that could very well be missed during manual inspection of standard pivot table variations.
PolyVista also contains a built-in decision-tree algorithm that lets you watch how measures change as splits in existing hierarchies and variables are used to optimize some objective, such as differentiating loyal customers from short-term customers. Many data-mining programs offer decision-tree algorithms. InforSense Ltd.'s Kensington Space, for example, shows in highly polished graphical form how measures change as a decision tree evolves. But PolyVista is the only product I know specifically designed to operate from a DB-cube rather than a DM-cube. Note that Microsoft's next-generation OLAP product will do the opposite - allow amending a DB-cube with the results of a decision tree.
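The multilevel search described for solution 1 can be illustrated with a rough sketch. This is not PolyVista's algorithm, which the column does not describe; it is one simple assumption about how such a search might work: score every cell at every level by how far its mean profit sits from its parent's mean, weighted by the square root of its record count, and then rank cells across all levels together. The records, the scoring rule, and the function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical leaf records: (region, state, city, profit). Values are invented.
RECORDS = [
    ("West", "CA", "Los Angeles", 10.0),
    ("West", "CA", "San Diego", 12.0),
    ("West", "WA", "Seattle", -30.0),
    ("West", "WA", "Tacoma", -28.0),
    ("East", "NY", "New York", 9.0),
    ("East", "NY", "Albany", 11.0),
    ("East", "MA", "Boston", 8.0),
    ("East", "MA", "Salem", 8.0),
]


def cell_stats(records, depth):
    """Mean profit and record count for each cell at the given hierarchy depth."""
    sums, counts = defaultdict(float), defaultdict(int)
    for *path, profit in records:
        key = tuple(path[:depth])
        sums[key] += profit
        counts[key] += 1
    return {k: (sums[k] / counts[k], counts[k]) for k in sums}


def rank_cells(records, n_levels=3):
    """Rank cells from all levels by their size-weighted deviation from the parent cell."""
    stats = {d: cell_stats(records, d) for d in range(n_levels + 1)}
    scored = []
    for d in range(1, n_levels + 1):
        for key, (mean, n) in stats[d].items():
            parent_mean, _ = stats[d - 1][key[:-1]]
            # Assumed scoring rule: large, well-populated deviations rank highest.
            scored.append((abs(mean - parent_mean) * sqrt(n), key))
    scored.sort(reverse=True)
    return scored


ranking = rank_cells(RECORDS)
```

With these invented figures, the two highest-ranked cells are the states within the West region, which is the kind of mid-level difference that is easy to miss when inspecting only region totals or only city detail.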
2. Data mining at one level being informed by another. Using Visual Insights' Visual Insight Advisor 2000, you can perform a regression at any level of a cube. Take, for example, a country-level analysis with a point for each state plotted on the two axes "sales" and "profit margin." If you notice an outlier, you can click on it to drill down and investigate. For instance, if California is way off the line, you can look at the statewide totals and then drill down further to find out which city within it is driving the deviation.
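The regress-then-drill-down workflow in solution 2 might look something like the following sketch. It fits an ordinary least-squares line to state-level points, flags the state with the largest residual, and then ranks that state's cities by a sales-weighted gap from the margin the line predicted. The figures are invented, and the drill-down scoring rule is my own assumption rather than anything the product is known to do.

```python
# Hypothetical state-level points: state -> (sales, profit margin). Invented figures.
STATES = {
    "CA": (300.0, 0.05),
    "NY": (400.0, 0.18),
    "TX": (500.0, 0.20),
    "WA": (200.0, 0.14),
    "OR": (100.0, 0.12),
}

# Hypothetical city-level detail one level down. Invented figures.
CITIES = {
    "CA": {
        "Los Angeles": (150.0, 0.15),
        "San Diego": (100.0, 0.14),
        "Fresno": (50.0, -0.43),
    },
}


def fit_line(points):
    """Ordinary least squares for margin = a + b * sales; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b


def find_outlier(states):
    """Return the state farthest from the fitted line and its predicted margin."""
    a, b = fit_line(list(states.values()))
    residuals = {s: m - (a + b * x) for s, (x, m) in states.items()}
    worst = max(residuals, key=lambda s: abs(residuals[s]))
    return worst, a + b * states[worst][0]


def drill_down(cities, expected_margin):
    """Assumed rule: the city with the largest sales-weighted margin gap is the driver."""
    gaps = {c: (m - expected_margin) * x for c, (x, m) in cities.items()}
    return max(gaps, key=lambda c: abs(gaps[c]))


state, expected = find_outlier(STATES)
driver = drill_down(CITIES[state], expected)
```

The model is fit at one level only; the lower level is consulted just for the cell that the higher-level model could not explain.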
3. Methods for integrating models or decision rules at different levels that may not all agree. The last presidential election was full of strange disparities between the results a popular-vote decision rule would have brought vs. the Electoral College decision method. Vote counting rules at the county level affected the balance, with a lot of legal wrangling required to decide the final rules at all levels.
In a business mining effort, if results are similarly disparate between the forest and tree levels, added rules may be needed for reconciliation. For example, it may be that frontline salespeople see increasing interest at the same time economic forecasters predict an economic slowdown.
This example brings to mind something that you can call "vertical data mining" - combining different and potentially conflicting forecasts and indicators at various levels. You can do far better than some simple-majority voting rule. For example, if you have historical information that includes the right answer as well as what the individual models at different levels suggested at different times, you could create a super model more accurate than any of the components. Last year in this column, I spent considerable time discussing models that optimize models, in a series culminating with "More for Less" (May 15, 2000).
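As a sketch of the "super model" idea, suppose you have a history of what a frontline-level model and an economy-level model each forecast, along with what actually happened. A least-squares fit of weights over that history gives a combination whose in-sample error can be no worse than either component alone. The series below are invented, the two-model setup is a simplification, and out-of-sample performance would need separate checking.

```python
# Hypothetical history. All numbers are invented for illustration.
ACTUAL = [10.0, 12.0, 9.0, 14.0, 11.0, 13.0]
FRONTLINE = [12.0, 13.0, 11.0, 16.0, 12.0, 15.0]  # tends to run optimistic
MACRO = [8.0, 11.0, 8.0, 12.0, 9.0, 12.0]         # tends to run pessimistic


def fit_weights(a, b, truth):
    """Solve the 2x2 normal equations for truth ~ wa * a + wb * b (no intercept)."""
    saa = sum(x * x for x in a)
    sbb = sum(y * y for y in b)
    sab = sum(x * y for x, y in zip(a, b))
    sat = sum(x * t for x, t in zip(a, truth))
    sbt = sum(y * t for y, t in zip(b, truth))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("forecasts are collinear; weights are not identifiable")
    # Cramer's rule on [[saa, sab], [sab, sbb]] @ [wa, wb] = [sat, sbt].
    return (sat * sbb - sab * sbt) / det, (saa * sbt - sab * sat) / det


def sse(forecast, truth):
    """Sum of squared forecast errors."""
    return sum((f - t) ** 2 for f, t in zip(forecast, truth))


wa, wb = fit_weights(FRONTLINE, MACRO, ACTUAL)
combined = [wa * f + wb * m for f, m in zip(FRONTLINE, MACRO)]
```

Comparing `sse(combined, ACTUAL)` with the error of each component shows the gain on the history used for fitting; a held-out period would be the fairer test.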
Another way models at differing levels might not line up is in the timeframe of the forecasts. In a financial forecasting application, I had tick-by-tick data from the exchanges for numerous kinds of financial instruments and created separate forecast models for intervals into the future based on similar interval data from the past - hourly, half-daily, daily, weekly, biweekly levels, and so forth. In this case, I used an optimized combination of these models to create a graph of the expected path of price movement to some future date. This was very useful to traders, as it not only gave them the ability to pick longer-term winners, but also gave short-term execution rules, which can be essential to whether you can profit from longer-term understandings.











