August 24, 1999, Volume 2 - Number 12
Tale of Two Worlds
Bringing OLAP and data mining together can produce something worth more than the sum of its parts
For at least as long as I've been working in the multivariate statistical world (20 years now), the files have been flat but the process names colorful. In data mining, an heir to this world, the files are still flat, and now the process names vividly suggest intelligent life on this planet: neural nets, decision trees, and memory-based reasoning, for example.
In contrast, the online analytical processing (OLAP) world, created in a different technological big bang, contains database files rich in complex hierarchies, intriguing relations, and multimembered dimensions, along with tables laden with bountiful facts to aggregate. But, alas, the analytic methods generally seem flat. I can find only sums and averages on most of the space-time intersections. Today, the vast majority of data analysis applications use either data mining or OLAP, not both. But bringing the two worlds together well can make it easy to solve and understand complex business and research problems that neither approach can handle on its own. Why not have the best of both worlds?
To indicate some of the power that full OLAP and data-mining integration could offer for solving complex business problems, I will discuss various categories of data-mining and multivariate statistical methods, look at how they might marry with the OLAP world, and indicate a valuable business application. To give you a preview, I see four main types of data analytic methods: tossing, dicing, slicing, and fitting, or, more technically, dimensional reorganization, subset discovery, dimensional reduction, and function fitting.
Tossing: Dimensional Reorganization
As Web portals offer OLAP-like information to large groups of subscribers, displaying information in the categories or concepts most sensible to the user can offer the most beneficial interactions. I do not mean just keeping statistics on the predefined dimensions commonly invoked by a user, but rather creating dimensions. Data-driven categorizations and organizations have vast potential. The data can come from customer segments, use patterns, buying patterns, life choices, and so on.
To illustrate how data-driven categorizations differ from the definitions of the experts, I looked at how analysts at major brokerage houses organize stocks into industry categories. Many analysts appear to disagree; the category to which a stock belongs depends on whether you talk to Paine, Merrill, or Barney. What good are sector forecasts if you cannot even know what sector your investment is really in? To find an answer, I turned to information imputed from customer data, the customers being anyone who trades. I generated hierarchical clusters based on covariances of stock movements. Let's see how the investors really think! Many analysts talk about a chemicals group. But investors do not seem to think there is one. These groupings based on investor-driven stock-value changes show that oil companies break down into subgroupings of service, exploration, and small and large companies, an important distinction most analysts glossed over. Investors even had a conglomerate grouping, a category the analysts missed totally. IBM showed a dual identity: a big company and a computer company. Many more relations did not appear in analysts' writings. You could use such data-driven hierarchies as dimensions in an OLAP environment to look at past or predicted, segmented market revaluation over chosen time periods, perhaps crossing them with selected valuation criteria. You could also easily generate other categorizations attuned to still other user characteristics. Data mining gives you the correct breakdowns and even the forecasts. OLAP gives you the flexibility to view the implications and take action.
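To make the idea of investor-driven groupings concrete, here is a minimal sketch of agglomerative (bottom-up) hierarchical clustering on stock co-movement. It is an illustration under stated assumptions, not the procedure used for the study described above: the tickers and returns are synthetic, the distance is 1 minus the correlation (a normalized covariance), and average linkage is one choice among several.

```python
from math import sqrt
from statistics import mean

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster(returns, n_groups):
    """Average-linkage agglomerative clustering on distance 1 - correlation.

    Returns the final groups and the merge history, which is the hierarchy.
    """
    names = list(returns)
    dist = {(a, b): 1.0 - correlation(returns[a], returns[b])
            for a in names for b in names if a != b}
    groups = [[n] for n in names]
    merges = []
    while len(groups) > n_groups:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = mean([dist[(a, b)] for a in groups[i] for b in groups[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((groups[i], groups[j], d))
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups, merges

# Synthetic period returns: two shared factors (assumed, not market data).
oil_factor = [0.02, -0.01, 0.03, -0.02, 0.01, 0.00, -0.03, 0.02]
tech_factor = [-0.01, 0.02, 0.01, 0.03, -0.02, 0.01, 0.02, -0.03]

def with_wiggle(factor, k):
    """Add a small deterministic stock-specific wiggle to a shared factor."""
    return [f + 0.001 * ((i * k) % 3 - 1) for i, f in enumerate(factor)]

# Hypothetical tickers, named only for illustration.
returns = {
    "OIL_SERVICE_1": with_wiggle(oil_factor, 1),
    "OIL_SERVICE_2": with_wiggle(oil_factor, 2),
    "OIL_EXPLORE_1": with_wiggle(oil_factor, 3),
    "COMPUTER_1": with_wiggle(tech_factor, 1),
    "COMPUTER_2": with_wiggle(tech_factor, 2),
    "COMPUTER_3": with_wiggle(tech_factor, 3),
}

groups, merges = cluster(returns, n_groups=2)
for g in groups:
    print(sorted(g))
```

The merge history is the hierarchy itself: cutting it at different depths yields coarser or finer groupings, which is what would let such a clustering serve as a drillable OLAP dimension.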
Dicing: Subset Discovery
In my August 3 column ("Predicting Movement"), I illustrated how you could construct strategic action classes to optimize a corporate goal. There, we found subsets of a one-dimensional independent variable (a last-period yen-dollar trend) that told under exactly what circumstances we should buy, sell short, or not hold a position in a security so as to optimize trading profits, a measure that took into account costs, revenue, and risks. Many corporate processes and problems are far more complex: more variables to combine in order to state goals, and many more issues affecting these components. For example, a major auto manufacturer deciding which and how many 2001 model-year cars to send to which countries may have a rich store of information: income and wealth statistics, past patterns of related product and service purchases, costs, suppliers, sales, cultural issues, and currency changes, to name a few. However, the search for geographic aggregates that represent what actions to take with respect to model classes so as to maximize profits has a very similar problem structure. The complexity and size of the data require an OLAP context, but the search for the right regional aggregates would be greatly facilitated by the use of data mining or optimization methods.
Decision trees are another tool for identifying subsets. They generate the break points of the independent variables that best forecast a dependent variable. Michael Berry, in "Mining the Wallet" (Decision Support, June 22), demonstrated the use of decision trees to forecast customers' most likely next buy: what bank services an existing account holder is most likely to want, for example. The decision-tree hierarchies that display the customer-grouping rules offer a valuable additional dimension for an OLAP cube. When the algorithm shows that family size is the key forecaster of the need for personal loans, it would be very convenient to immediately have at hand a cube view with family size as one breakdown against which you can judge other variables: savings, credit card debt, speed of payment, and so on. Next-tier key explanatory variables such as age and income should continue to be interesting as you use further decision-tree variables and breaks to create OLAP dimensions. Data mining illuminates the key issues. OLAP lets you fully negotiate details, large and small, that a completely automated analysis will miss.
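As a rough illustration of how a tree finds a break point and how that break can become a cube breakdown, the sketch below searches for the single best split (the root of a tree) by Gini impurity and then rolls up another measure by the resulting buckets. The customer records, the field names, and the use of Gini impurity are all assumptions made for the example; they do not come from the column cited above.

```python
from statistics import mean

def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, features, target):
    """Try every midpoint of every feature; keep the lowest weighted impurity."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[target] for r in rows if r[f] <= t]
            right = [r[target] for r in rows if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

def rollup(rows, feature, threshold, measure):
    """Average a measure within the two buckets defined by the split."""
    low = [r[measure] for r in rows if r[feature] <= threshold]
    high = [r[measure] for r in rows if r[feature] > threshold]
    return {"low": mean(low), "high": mean(high)}

# Hypothetical account holders (synthetic data, income in thousands).
rows = [
    {"family_size": 1, "age": 45, "income_k": 80, "took_loan": 0},
    {"family_size": 2, "age": 30, "income_k": 40, "took_loan": 0},
    {"family_size": 2, "age": 52, "income_k": 65, "took_loan": 0},
    {"family_size": 3, "age": 38, "income_k": 90, "took_loan": 0},
    {"family_size": 4, "age": 29, "income_k": 85, "took_loan": 1},
    {"family_size": 5, "age": 47, "income_k": 45, "took_loan": 1},
    {"family_size": 5, "age": 33, "income_k": 70, "took_loan": 1},
    {"family_size": 6, "age": 55, "income_k": 60, "took_loan": 1},
]

feature, threshold, impurity = best_split(
    rows, ["family_size", "age", "income_k"], "took_loan")
print(feature, threshold, impurity)

# Use the discovered break as a breakdown for another variable.
income_by_bucket = rollup(rows, feature, threshold, "income_k")
print(income_by_bucket)
```

In this toy data the split lands on family size, and the two buckets then act like members of a new dimension against which income (or any other measure) can be compared. A full tree would repeat the search inside each bucket to produce the next tier of breaks.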
You can use a similar OLAP negotiation method to study the nature of the triggering subsets of neural nets (a hybrid of dicing and fitting) so the solutions that these useful, but mysterious, black boxes give can more easily be understood. Finally, canonical analysis breaks a many-dimensional space into two predictive subset classes by generating a best splitting dimension. You can use this splitting dimension in a cube to break out a forecast of whether each customer will respond to a solicitation. Other dimensional choices tell you more about who these individuals are.
Slicing: Dimensional Reduction
Many multivariate statistical methods work by finding a subspace of a large number of independent variables so you can home in on a few highly relevant dimensions. For example, principal component analysis (PCA) transforms the full set of input dimensions to find a reduced set of dimensions that explain the maximum amount of the sample variance: Is it season, distance to a highway, neighborhood characteristics, average rainfall, or a combination of those variables that best explains profit margin variations of a retail furniture chain's outlets? Implemented in an OLAP environment, PCA could directly transform a cube so that you first view the three transformed dimensions that most explain the profit margin variations.
You discover that the generated variable "median neighborhood income plus population size" is the most explanatory dimension, with "rainfall plus distance to a highway" taking second place. You can construct hierarchies on ranges of the transformed variables (for example, low, medium, or high), allowing the analyst to drill down and up to look for the most meaningful breakdowns. Paging on products (that is, viewing sheets that represent movement in this third dimension), you might notice which furniture classes are the largest contributors to these profit-margin variations. Often the obvious jumps out to remind you to take action on what you already knew, such as focusing on upscale loft furniture for the city in the fall and patio furniture for the suburbs in the summer. You might even use this information to decide where to open new stores.
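A generated variable of this kind can be sketched as follows. The example standardizes three input variables, builds their correlation matrix, and extracts the first principal component by power iteration. The outlet figures are synthetic and the variable names are assumptions. Note also that PCA as sketched here finds the direction of greatest variance among the inputs themselves; relating that direction to profit margin would be a further step.

```python
from math import sqrt
from statistics import mean, pstdev

def standardize(col):
    """Rescale a column to zero mean and unit (population) variance."""
    m, s = mean(col), pstdev(col)
    return [(v - m) / s for v in col]

def correlation_matrix(columns):
    """Correlation matrix of the given columns (list of equal-length lists)."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def first_component(matrix, iterations=200):
    """Leading eigenvector and eigenvalue by power iteration."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: v is unit length, so this is the eigenvalue estimate.
    eigenvalue = sum(
        v[i] * sum(m * x for m, x in zip(matrix[i], v)) for i in range(len(v)))
    return v, eigenvalue

# Synthetic figures for six hypothetical outlets.
names = ["median_income", "population", "rainfall"]
income = [30, 40, 50, 60, 70, 80]
population = [12, 15, 21, 24, 30, 33]
rainfall = [5, 9, 4, 8, 6, 7]

corr = correlation_matrix([income, population, rainfall])
loadings, eigenvalue = first_component(corr)
# The trace of a correlation matrix equals the number of variables.
share = eigenvalue / len(names)

for name, weight in zip(names, loadings):
    print(f"{name:15s} {weight:+.2f}")
print(f"share of variance explained: {share:.2f}")
```

With this data, income and population receive large loadings of the same sign while rainfall contributes little, so the first component reads as an "income plus population" dimension. Binning each outlet's score on that component into low, medium, and high ranges would give the kind of hierarchy described above.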
Fitting: Function Fitting
Function fitting is a tried and true statistical technique. As I indicated last time, a function often fits best at a given level of aggregation, and linear functions may not always be the best way to go. One could imagine an integrated environment where any of a number of chosen classes of functions are tried against any of a number of aggregations and variable transformations until the ones with the greatest significance in the viewable dimensions are found. For example: "Find me a function that best displays sales margin variations for different cars in different aggregations of countries, using the most important combination of tax, income, fuel costs, shipping costs, and currency risk issues. Display the results as a 3D surface over a horizontal 2D view of a cube." You might find Eastern Europe shows a bump in sales margin for mid-sized cars while Western Europe shows a bump for small cars, for reasons of tax, income, and fuel cost.
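The search over function classes and aggregations might look something like the following sketch. It tries three transformations of one explanatory variable at two aggregation levels and ranks the straight-line fits by R-squared. Everything here is an assumption for illustration: the data is synthetic, the country and region labels are hypothetical, and R-squared stands in for whatever significance measure a real environment would use.

```python
from math import log, sqrt
from statistics import mean

def r_squared(x, y):
    """R-squared of the least-squares line y = a + b*x (squared correlation)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def aggregate(rows, level):
    """Average income and margin within each member of the chosen level."""
    groups = {}
    for r in rows:
        groups.setdefault(r[level], []).append(r)
    keys = sorted(groups)
    x = [mean([r["income"] for r in groups[k]]) for k in keys]
    y = [mean([r["margin"] for r in groups[k]]) for k in keys]
    return x, y

# Candidate variable transformations (the "classes of functions" tried).
TRANSFORMS = {"identity": lambda v: v, "sqrt": sqrt, "log": log}

# Synthetic data: six hypothetical countries in three hypothetical regions.
rows = [
    {"country": "A", "region": "R1", "income": 1, "margin": 1.0},
    {"country": "B", "region": "R1", "income": 2, "margin": 2.0},
    {"country": "C", "region": "R2", "income": 4, "margin": 3.0},
    {"country": "D", "region": "R2", "income": 8, "margin": 4.0},
    {"country": "E", "region": "R3", "income": 16, "margin": 5.0},
    {"country": "F", "region": "R3", "income": 32, "margin": 6.0},
]

# Try every aggregation level against every transformation.
results = []
for level in ("country", "region"):
    x, y = aggregate(rows, level)
    for name, f in TRANSFORMS.items():
        results.append((r_squared([f(v) for v in x], y), level, name))

best_r2, best_level, best_transform = max(results)
for r2, level, name in sorted(results, reverse=True):
    print(f"{level:8s} {name:9s} R^2 = {r2:.3f}")
```

In this toy data the logarithmic transformation fits best at both levels. A fit over three regional averages is weak evidence on its own, which is one reason a real search would rank candidates by significance rather than raw fit, and why the analyst would still want to inspect the winning view in the cube.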
Once you recognize that both OLAP and data mining offer operations on cubes, a whole realm of valuable interoperability opens. Data mining offers intelligent and automated restructurings of variables, forecasts, views, and breakdowns. OLAP offers the ability to rapidly operate on, organize, and negotiate user-defined views of complex corporate databases. So why bridge these two seemingly disparate worlds? Simply to increase the smarts of the intelligent enterprise astronomically.
For more information, see "A case study: Data mining and OLAP combined to facilitate a cross sell campaign" at www.dsslab.com
Guest author Barry Grushkin is a researcher at the DSS Lab (www.dsslab.com) in Cambridge, Mass. You can reach him at bgrushkin@dimsys.com