CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

April 5, 2003

Data Mining: A Call To Action

Businesses can no longer afford to let data warehouse teams serve as passive onlookers in the data mining process

by Michael L. Gonzales

We've all heard and read about business intelligence (BI). And we all agree that traditional data warehousing concepts such as data marts, cubes, star schemas, and atomic level warehouses all play an important part in the BI process. The operative word is part: BI is much bigger than just traditional warehousing. It spans concepts such as portals, dashboards, spatial analysis, and especially, data mining.

Data mining is uniquely qualified to inspire informational insight from massive amounts of detailed data, not unlike that found in atomic-level warehouses. For that reason, it's incumbent on warehouse planners and data architects to ensure that they're active participants in the mining effort, as opposed to passive onlookers. If the warehouse simply dishes up raw data to mining teams, then your organization loses in at least two important respects: timeliness of decision making and maximum return on investment from corporate information assets.

Doling out raw data to mining projects ensures that substantial cleansing and transformation will be required before mining can occur, causing significant delays before any actual results are found and implemented. Moreover, because the mining effort is outside the mainstream of warehousing, mining results are seldom fed back into the warehouse. Consequently, critical insight found in your warehoused data is never made known to warehouse users. For example, if you score customers as high or low risk, then that score must be made part of your warehoused customer data so that it can be analyzed with any warehouse-centric tool, including ad hoc reports and online analytical processing (OLAP).
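To make the feedback loop concrete, here is a minimal sketch of writing risk scores back onto customer rows. It uses an in-memory SQLite database as a stand-in for the warehouse; the `customer` table, its columns, and the `write_back_scores` helper are all hypothetical names chosen for illustration, not part of any particular product.

```python
import sqlite3


def write_back_scores(conn, scores):
    """Store mining scores on the warehoused customer rows.

    `scores` maps customer_id -> risk label ("high" or "low").
    """
    conn.executemany(
        "UPDATE customer SET risk_score = ? WHERE customer_id = ?",
        [(label, cid) for cid, label in scores.items()],
    )
    conn.commit()


# Hypothetical in-memory stand-in for a warehouse customer table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, region TEXT, risk_score TEXT)"
)
conn.executemany(
    "INSERT INTO customer (customer_id, region) VALUES (?, ?)",
    [(1, "East"), (2, "West"), (3, "East")],
)

# Scores as they might come back from a mining run (made-up values).
write_back_scores(conn, {1: "high", 2: "low", 3: "low"})

# Once stored, the score is just another attribute for ad hoc queries.
risk_by_region = conn.execute(
    "SELECT region, risk_score, COUNT(*) FROM customer "
    "GROUP BY region, risk_score ORDER BY region, risk_score"
).fetchall()
```

Once the score sits alongside the other customer attributes, any reporting or OLAP tool that can read the table can slice by it without knowing how it was produced.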

Data Mining: A Quick Tour

Data mining is complemented by a variety of preprocessing functions, statistics, and data visualization tools. As the technology has become easier to use, it has moved from esoteric applications driven by a select few to a broader user community that benefits from a variety of mining applications.

On a very high level, each data mining effort includes the following steps:

1. Define the business issue precisely. This definition should include:

  • A clear description of the problem
  • An understanding of the relevant data needed
  • A vision for how you're going to use the mining results.

2. Map issues to mining data. As with virtually all warehouse iterations, data mining requires you to map your business requirement to data sources necessary to address the issue.

3. Source and preprocess data. This step is not unlike that in the warehouse extract, transform, load process. The four objectives here are to:

  • Identify the necessary data
  • Collect the identified data
  • Filter the collected data
  • Transform the data as necessary.

4. Explore and evaluate data. This step often begins by browsing the data with visualization tools. The objectives are to verify data completeness, exactness, and relevance.

5. Choose your mining technique. In this step you must decide on the technique to be applied. Several mining techniques are available, including clustering, classification, association, and value prediction.

6. Interpret results. You want to translate the mining results into a business context with the assistance of subject matter experts.

7. Deploy results. To fully exploit the value of data mining, the results must be deployed to the broadest audience. If you see data mining only as an analytic tool, you'll fail to realize its full potential. Data mining provides deeper insight into your business that can be deployed in other business processes — for example, your CRM systems.
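Steps 3 and 4 above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the field names, the "active customers only" filter, and the helper functions `preprocess` and `profile_completeness` are hypothetical, and a real project would read from warehouse or production sources rather than a hard-coded list.

```python
def preprocess(source_rows, needed_fields, keep, transform):
    """Step 3: collect the identified fields, filter rows, then transform."""
    prepared = []
    for row in source_rows:
        # Collect only the fields identified as necessary.
        record = {field: row.get(field) for field in needed_fields}
        # Filter out rows that are not relevant to the business issue.
        if not keep(record):
            continue
        # Transform the surviving record as necessary.
        prepared.append(transform(record))
    return prepared


def profile_completeness(rows, fields):
    """Step 4: per field, the fraction of rows holding a non-missing value."""
    total = len(rows)
    report = {}
    for field in fields:
        present = sum(1 for row in rows if row.get(field) not in (None, ""))
        report[field] = present / total if total else 0.0
    return report


# Made-up source rows; "fax" is deliberately not among the needed fields.
source_rows = [
    {"customer_id": 1, "status": "active", "monthly_spend": "42.50", "age": 34, "fax": "n/a"},
    {"customer_id": 2, "status": "closed", "monthly_spend": "10.00", "age": 51, "fax": "n/a"},
    {"customer_id": 3, "status": "active", "monthly_spend": "7.25", "age": None, "fax": "n/a"},
]

prepared = preprocess(
    source_rows,
    needed_fields=["customer_id", "status", "monthly_spend", "age"],
    keep=lambda r: r["status"] == "active",
    transform=lambda r: {**r, "monthly_spend": float(r["monthly_spend"])},
)

completeness = profile_completeness(prepared, ["customer_id", "monthly_spend", "age"])
```

A completeness figure well below 1.0 for a field, as with `age` here, is the kind of finding that should be resolved before a mining technique is chosen.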

As with any BI project, each step must be supported by strong project management as well as a conscious approach to knowledge transfer.

But as critical as mining is in BI efforts, most warehouse teams seem unable or unwilling to support this aspect of BI. To the detriment of their organization and its information assets, these teams seem intent only on providing a passive repository serving up data to be mined by other teams, departments, or even third-party vendors.

Mining And The BI Environment

Today's BI solutions must grapple with the rising flood of data, in terms of both the number of records and their size. For example, not only do businesses keep information about existing customers, but more and more they also keep information about previous customers for win-back campaigns and about prospective customers for acquisition models. Many businesses are attempting to analyze incredibly detailed data, as well. For example, telecommunications providers need to analyze all their call detail records. These logs are not unlike Web logs in their sheer volume and messiness. Then there's the growth of the records themselves: not only in number, but in attributes per record.

Although the rising volume and size of data sets definitely compromise the effectiveness of many warehouse-centric tools, it's a characteristic that mining is uniquely suited to address. For example, OLAP often preaggregates data in order to deal with large data sets. With each aggregation, important detail is lost. Mining, by contrast, feasts on the detail data and can crunch through mountains of it. Mining thrives in this environment, which is one of the critical reasons that mining must be an integral part of your BI process.
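A small worked example shows how aggregation hides detail. The call records below are made up: two customers have identical total minutes, so a preaggregated total cannot tell them apart, while the detail rows reveal very different calling behavior.

```python
from collections import defaultdict

# Hypothetical call detail records: (customer, call length in minutes).
calls = [
    ("A", 30), ("A", 30), ("A", 30),
    ("B", 1), ("B", 1), ("B", 88),
]

# Preaggregated view: total minutes per customer.
totals = defaultdict(int)
for customer, minutes in calls:
    totals[customer] += minutes

# Detail-level view: longest single call per customer.
longest = defaultdict(int)
for customer, minutes in calls:
    longest[customer] = max(longest[customer], minutes)
```

Both customers total 90 minutes, yet customer A makes steady 30-minute calls while customer B's usage is dominated by a single 88-minute call. Only the detail rows carry that distinction.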

Finally, data mining must be integrated into the business process to let information flow across the organization. The integration of data mining into a warehouse gives the analyst fast and efficient access to the data. The integration of data mining into end-to-end solutions, such as CRM, makes data mining results quickly available to a much wider group of knowledge users.

Work Distribution Of Mining Efforts

With this massive data growth challenge in mind, one aspect of the mining effort in particular stands out: Data preprocessing, acquisition, and cleansing represent 80 percent of the overall mining project, which is an extraordinary amount of time spent just preparing the right data (see Figure 1). Fortunately, this aspect of data mining can be actively supported by the warehouse team.

Traditionally, mining teams must go directly to production systems for source data if the warehouse doesn't contain the necessary data. But what's especially alarming is that even when the warehouse has the needed data, mining teams must still perform excessive transformation and cleansing to prepare it, for two reasons. First, data mining, not unlike other BI applications, often requires special transformation of the data. For example, instead of just having the customer's age in the record, miners might want to store a value signifying Middle Aged as well. Second, and unfortunately, much of the data stored in warehouses is of poor quality. Consider, for instance, how some warehouse teams address sparsity: It's not unusual to see NA, 0, or 99999 stuffed into a field during the transformation process, leaving later users unable to tell whether the field should have been NULL or the data was actually missing.
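Both problems lend themselves to a short sketch. The code below assumes, purely for illustration, that NA, 0, 99999, and the empty string are the placeholder values in an age field, and that the age bands fall at 35 and 60; neither the sentinel list nor the cut-offs come from any standard. The helper names `clean_age` and `age_band` are hypothetical.

```python
# Placeholder values assumed to mean "no usable age" (an assumption).
SENTINELS = {"NA", "0", "99999", ""}


def clean_age(raw):
    """Return the age as an int, or None for missing or placeholder values."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text in SENTINELS:
        return None
    try:
        return int(text)
    except ValueError:
        # Anything that is not a whole number is treated as missing.
        return None


def age_band(age):
    """Derive a categorical band from a cleaned age (cut-offs are illustrative)."""
    if age is None:
        return None
    if age < 35:
        return "Young"
    if age < 60:
        return "Middle Aged"
    return "Senior"


raw_ages = ["47", "NA", "99999", " 28 ", None]
bands = [age_band(clean_age(value)) for value in raw_ages]
```

Keeping missing values as `None` (NULL in the warehouse) rather than as a stuffed placeholder means a derived attribute such as the age band stays empty instead of silently landing in the wrong category.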

In essence, as opposed to being an active participant in the mining effort, most warehouse environments simply serve up what can only be defined as raw mining data. The question we data architects must ask ourselves is "Why?"






