Intelligent Enterprise

 
BARRY GRUSHKIN

 


 


Decision Support
 
 

 
 
August 24, 1999, Volume 2 - Number 12


Tale of Two Worlds


Bringing OLAP and data mining together can produce something worth more than the sum of its parts

For at least as long as I’ve been working in the multivariate statistical world — 20 years now — the files have been flat but the process names colorful. In data mining, an heir to this world, the files are still flat and now the process names vividly suggest intelligent life on this planet: neural nets, decision trees, and memory-based reasoning, for example.

In contrast, the online analytical processing (OLAP) world, created in a different technological big bang, contains database files rich in complex hierarchies, intriguing relations, and multimembered dimensions, along with tables laden with bountiful facts to aggregate. But, alas, the analytic methods generally seem flat. I can find only sums and averages on most of the space-time intersections. Today, the vast majority of data analysis applications use either data mining or OLAP, not both. But bringing the two worlds together well can make it easy to solve and understand complex business and research problems that neither approach can handle on its own. Why not have the best of both worlds?

To indicate some of the power that full OLAP and data-mining integration could offer for solving complex business problems, I will discuss various categories of data-mining and multivariate statistical methods, look at how they might marry with the OLAP world, and indicate a valuable business application. To give you a preview, I see four main types of data analytic methods — tossing, dicing, slicing, and fitting — or, more technically, dimensional reorganization, subset discovery, dimensional reduction, and function fitting.

Tossing: Dimensional Reorganization

As Web portals offer OLAP-like information to large groups of subscribers, displaying information in the categories or concepts most sensible to the user can offer the most beneficial interactions. I do not mean just keeping statistics on the predefined dimensions commonly invoked by a user, but rather creating dimensions. Data-driven categorizations and organizations have vast potential. The data can come from customer segments, use patterns, buying patterns, life choices, and so on.

To illustrate how data-driven categorizations differ from the definitions of the “experts,” I looked at how analysts at major brokerage houses organize stocks into industry categories. Many analysts appear to disagree; the category to which a stock belongs depends on whether you talk to Paine, Merrill, or Barney. What good are sector forecasts if you cannot even know what sector your investment is really in? To find an answer, I turned to information imputed from customer data, the “customers” being anyone who trades. I generated hierarchical clusters based on covariances of stock movements.

Let’s see how the investors really think! Many analysts talk about a chemicals group. But investors do not seem to think there is one. These groupings based on investor-driven stock-value changes show that oil companies break down into subgroupings of service, exploration, and small and large companies — an important distinction most analysts glossed over. Investors even had a “conglomerate” grouping, a category the analysts missed totally. IBM showed a dual identity: a big company and a computer company. Many more relations did not appear in analysts’ writings. You could use such data-driven hierarchies as dimensions in an OLAP environment to look at past or predicted, segmented market revaluation over chosen time periods, perhaps crossing them with selected valuation criteria. You could also easily generate other categorizations attuned to still other user characteristics. Data mining gives you the correct breakdowns and even the forecasts. OLAP gives you the flexibility to view the implications and take action.
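A minimal Python sketch of the clustering idea, under stated assumptions: the tickers and return series are invented, and the choice of average linkage on a distance of one minus correlation is an assumption, since the column says only that the clusters were built from covariances of stock movements. It is not the analysis behind the groupings described above.

```python
import math

# Hypothetical daily returns for four made-up tickers (illustrative data only).
returns = {
    "OIL_A":  [0.02, -0.01, 0.03, -0.02, 0.01],
    "OIL_B":  [0.03, -0.02, 0.04, -0.03, 0.02],
    "TECH_A": [-0.01, 0.04, -0.02, 0.03, 0.00],
    "TECH_B": [-0.02, 0.05, -0.01, 0.04, -0.01],
}

def correlation(x, y):
    """Pearson correlation of two equal-length return series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def cluster(series):
    """Average-linkage agglomerative clustering on distance = 1 - correlation.

    Returns the merge history as (members_a, members_b, distance) tuples;
    read bottom-up, the history is the hierarchy.
    """
    names = list(series)
    dist = {
        frozenset((a, b)): 1.0 - correlation(series[a], series[b])
        for i, a in enumerate(names) for b in names[i + 1:]
    }
    clusters = [(name,) for name in names]
    merges = []

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the closest pair of clusters at each step.
        d, c1, c2 = min(
            ((linkage(c1, c2), c1, c2)
             for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, round(d, 3)))
    return merges

merges = cluster(returns)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Each level of the merge history could serve as one level of a data-driven hierarchy, with the leaf-level stocks rolling up into whatever groups the co-movement suggests.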

Dicing: Subset Discovery

In my August 3 column (“Predicting Movement”), I illustrated how you could construct strategic action classes to optimize a corporate goal. There, we found subsets of a one-dimensional independent variable (a last-period yen-dollar trend) that told us under exactly what circumstances we should buy, sell short, or not hold a position in a security so as to optimize trading profits, a measure that took into account costs, revenue, and risks. Many corporate processes and problems are far more complex: There are more variables to combine in order to state goals, and many more issues affecting these components. For example, a major auto manufacturer deciding which and how many 2001 model-year cars to send to which countries may have a rich store of information: income and wealth statistics, past patterns of related product and service purchases, costs, suppliers, sales, cultural issues, and currency changes, to name a few. However, the search for geographic aggregates that represent what actions to take with respect to model classes so as to maximize profits has a very similar problem structure. The complexity and size of the data require an OLAP context, but data mining or optimization methods would greatly facilitate the search for the right regional aggregates.
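The one-dimensional case can be sketched as a brute-force search for two cut points on the trend variable. This is an illustration only, not the method from the earlier column: the data, the flat per-trade `cost`, and the function name `best_action_zones` are all hypothetical.

```python
from itertools import combinations

# Hypothetical observations: last-period trend and the following period's return.
trend = [-3, -2, -1, 0, 1, 2, 3]
next_return = [-0.02, -0.015, 0.002, 0.0, -0.001, 0.012, 0.02]

def best_action_zones(trend, next_return, cost=0.001):
    """Find cut points (lo, hi) on `trend` that maximize total profit.

    Below `lo`: sell short. Above `hi`: buy. In between: hold no position.
    `cost` is a flat per-trade charge (an illustrative assumption).
    Returns (profit, lo, hi).
    """
    cuts = sorted(set(trend))
    best = (float("-inf"), None, None)
    for lo, hi in combinations(cuts, 2):   # guarantees lo < hi
        profit = 0.0
        for t, r in zip(trend, next_return):
            if t < lo:
                profit += -r - cost        # a short position gains when the price falls
            elif t > hi:
                profit += r - cost         # a long position gains when the price rises
        if profit > best[0]:
            best = (profit, lo, hi)
    return best

profit, lo, hi = best_action_zones(trend, next_return)
print(f"sell short below {lo}, buy above {hi}, otherwise stay out")
```

With many variables and regional aggregates, exhaustive search like this stops being practical, which is where optimization or data-mining methods would come in.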

Decision trees are another tool for identifying subsets. They generate the break points of the independent variables that best forecast a dependent variable. Michael Berry, in “Mining the Wallet” (Decision Support, June 22), demonstrated the use of decision trees to forecast customers’ most likely next buy — what bank services an existing account holder is most likely to want, for example. The decision-tree hierarchies that display the customer-grouping rules offer a valuable additional dimension for an OLAP cube. When the algorithm shows that family size is the key forecaster of the need for personal loans, it would be very convenient to immediately have at hand a cube view with family size as one breakdown against which you can judge other variables: savings, credit card debt, speed of payment, and so on. Next-tier key explanatory variables such as age and income should continue to be interesting as you use further decision-tree variables and breaks to create OLAP dimensions. Data mining illuminates the key issues. OLAP lets you fully negotiate details, large and small, that a completely automated analysis will miss.
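As a rough sketch of how a tree's first break point could become a cube dimension, the Python below finds the single threshold on family size that best separates loan takers, then uses it to break out average savings. The customer records are invented, and a one-level split scored by Gini impurity is an assumption standing in for a full decision-tree algorithm.

```python
# Hypothetical customers: (family_size, took_personal_loan, savings).
customers = [
    (1, 0, 12000), (2, 0, 9000), (2, 0, 15000), (2, 1, 4000),
    (3, 1, 5000), (3, 1, 3000), (4, 1, 2000), (5, 1, 6000),
]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return the threshold t minimizing weighted Gini for the split value <= t."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(values))[:-1]:     # the largest value cannot split anything
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

sizes = [c[0] for c in customers]
loans = [c[1] for c in customers]
threshold = best_split(sizes, loans)

# Use the discovered break as a dimension and aggregate another measure over it.
buckets = {}
for size, _, savings in customers:
    key = f"family_size <= {threshold}" if size <= threshold else f"family_size > {threshold}"
    buckets.setdefault(key, []).append(savings)
avg_savings = {key: sum(vals) / len(vals) for key, vals in buckets.items()}
print(avg_savings)
```

Repeating the split inside each bucket, on age or income for example, would give the next tier of the hierarchy.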

You can use a similar OLAP negotiation method to study the nature of the triggering subsets of neural nets (a hybrid of dicing and fitting) so the solutions that these useful, but mysterious, black boxes give can more easily be understood. Finally, canonical analysis breaks a many-dimensional space into two predictive subset classes by generating a best splitting dimension. You can use this splitting dimension in a cube to break out a forecast of whether each customer will respond to a solicitation. Other dimensional choices tell you more about who these individuals are.

Slicing: Dimensional Reduction

Many multivariate statistical methods work by finding a subspace of a large number of independent variables so you can home in on a few highly relevant dimensions. For example, principal component analysis (PCA) transforms the full set of input dimensions to find a reduced set of dimensions that explain the maximum amount of the sample variance: Is it season, distance to a highway, neighborhood characteristics, average rainfall, or a combination of those variables that best explains profit margin variations of a retail furniture chain’s outlets? Implemented in an OLAP environment, PCA could directly transform a cube so that you first view the three transformed dimensions that most explain the profit margin variations.

You discover that the generated variable “median neighborhood income plus population size” is the most explanatory dimension, with “rainfall plus distance to a highway” taking second place. You can construct hierarchies on ranges of the transformed variables — for example, low, medium, or high — allowing the analyst to drill down and up to look for the most meaningful breakdowns. Paging on products (that is, viewing sheets that represent movement in this third dimension), you might notice which furniture classes are the largest contributors to these profit-margin variations. Often the obvious jumps out to remind you to take action on what you already knew, such as focusing on upscale loft furniture for the city in the fall and patio furniture for the suburbs in the summer. You might even use this information to decide where to open new stores.
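A small Python sketch of the transformation step, assuming invented store attributes and using power iteration to extract only the first principal component of the standardized inputs. A production system would use a numerical library and compute several components; the variable names here are hypothetical.

```python
import math

# Hypothetical outlets: (median income in $K, population in K, annual rainfall in inches).
stores = [
    (40, 20, 30), (55, 30, 28), (70, 42, 35),
    (85, 50, 27), (100, 61, 33), (60, 33, 31),
]

def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    out_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        out_cols.append([(v - mean) / sd for v in col])
    return [list(r) for r in zip(*out_cols)]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, k = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(k)]
            for i in range(k)]

def first_component(cov, iters=200):
    """Power iteration: returns (unit eigenvector, eigenvalue) of the top component."""
    k = len(cov)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(k)) for i in range(k))
    return v, eigenvalue

z = standardize(stores)
cov = covariance(z)
loadings, eigenvalue = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Score each store on the new dimension; ranges of these scores could be
# binned into low, medium, and high to form a drillable hierarchy.
scores = [sum(l * x for l, x in zip(loadings, row)) for row in z]
print("loadings:", [round(l, 2) for l in loadings])
print("share of variance explained:", round(explained, 2))
```

In this toy data the first component loads mostly on income and population together, which mirrors the kind of combined variable described above.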

Function Fitting

Function fitting is a tried and true statistical technique. As I indicated last time, a function often fits best at a given level of aggregation, and linear functions may not always be the best way to go. One could imagine an integrated environment where any of a number of chosen classes of functions are tried against any of a number of aggregations and variable transformations until the ones with the greatest significance in the viewable dimensions are found. For example: “Find me a function that best displays sales margin variations for different cars in different aggregations of countries, using the most important combination of tax, income, fuel costs, shipping costs, and currency risk issues. Display the results as a 3D surface over a horizontal 2D view of a cube.” You might find Eastern Europe shows a bump in sales margin for mid-sized cars while Western Europe shows a bump for small cars, because of tax, income, and fuel cost issues.
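The "try several function classes and keep the best" idea can be sketched in a few lines of Python. The data are invented, the three candidate transformations are arbitrary choices, and ranking by R-squared is an assumption; the sketch also covers only the choice of function class, not the search over aggregation levels.

```python
import math

# Hypothetical country-level data: income index and sales margin (%).
income = [10, 20, 40, 80, 160]
margin = [8.9, 11.0, 13.1, 15.1, 17.2]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot

# Candidate function classes, expressed as transformations of the input variable.
candidates = {
    "linear": lambda x: x,
    "log": math.log,
    "square": lambda x: x * x,
}

results = {
    name: fit_line([f(x) for x in income], margin)
    for name, f in candidates.items()
}
best = max(results, key=lambda name: results[name][2])

for name, (a, b, r2) in results.items():
    print(f"{name:>6}: margin = {a:.2f} + {b:.4f} * f(income), R^2 = {r2:.3f}")
print("best fit:", best)
```

An integrated environment would run the same loop over each aggregation of countries as well, then surface only the combinations that fit best.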

Once you recognize that both OLAP and data mining offer operations on cubes, a whole realm of valuable interoperability opens. Data mining offers intelligent and automated restructurings of variables, forecasts, views, and breakdowns. OLAP offers the ability to rapidly operate on, organize, and negotiate user-defined views of complex corporate databases. So why bridge these two seemingly disparate worlds? Simply to increase the smarts of the intelligent enterprise astronomically.



For more information, see "A case study: Data mining and OLAP combined to facilitate a cross sell campaign" at www.dsslab.com

Guest author Barry Grushkin is a researcher at the DSS Lab (www.dsslab.com) in Cambridge, Mass. You can reach him at bgrushkin@dimsys.com

 

Copyright © 2004 CMP Media Inc. ALL RIGHTS RESERVED
No Reproduction without permission
   

