Analysts' Darling
Under SPSS, Clementine continues to earn favor
By Greg James
SPSS Inc.'s Clementine is the most popular data mining package on the market, according to the KDnuggets Web site (www.kdnuggets.com). In its early days (pre-1998) Clementine's interface pioneered the visual programming approach to data mining. This modality proved so intuitive and successful that both IBM and SAS Institute adopted it for their data mining tools as well.

Integral Solutions Ltd. (ISL) in the United Kingdom originally launched Clementine in 1994 to be a reusable data mining workbench. It was designed to provide easy access to sophisticated data mining algorithms along with necessary support functions such as data access, preprocessing, graphing, and reporting. SPSS purchased ISL on December 31, 1998, and has enhanced Clementine with additional analytic functionality, scalability, integration, and support. I examined the most recent major release, version 6.0, for this review.

A QUICK TOUR
Clementine launches into a desktop that includes the ubiquitous row of pull-down menus at the top, a project workspace in the middle, and a set of "palette" windows on the bottom. (See Figure 1.) The available data processing and analysis operations are represented as icons in the palette windows. There are icons for defining data sources, performing record and field operations, graphing, modeling, and generating a variety of reports and outputs. The palette windows are scrollable, so some of the available operations are not visible in Figure 1. To construct a data mining procedure, simply select the appropriate icons from the palettes, place them in the project workspace, connect them together with data flows, and set execution parameters within pop-up property windows. Clementine calls these diagrams "streams."

You can gain an appreciation for Clementine's full life-cycle support by examining the Record and Field Operations palettes. Record operations include select, select distinct, merge, sort, aggregate, append, sample, and balanced sample.
Field operations include define, select, derive, replace (as in replacing missing values), create rolling aggregates (as in calculating year-to-date fields), and derive Boolean flags from categorical fields (as in creating dummy variables for regression).

Other features of Clementine's interface are not as obvious. There are many interactive capabilities sprinkled throughout the product. For example, given a histogram, Clementine lets you interactively define regions from which it will automatically generate selection operations. The same is true for scatter plots, where you can define rectangular regions for the same purpose. Web diagrams let you select sets of relationships or variables, the rule and tree browsers let you choose rules and subtrees, and so on. Altogether, these interactive capabilities translate into significant productivity gains.

The process of creating a model is quite simple. Select and define the appropriate data access, processing, and modeling operations; define the sequence and data flows through them; and direct Clementine to execute the stream up to, and including, the modeling node. Clementine will place a new, diamond-shaped icon in the Generated Models panel to the right of the project workspace when the entire process is complete.

Interpreting the structure and characteristics of a model is a fundamental activity of data analysis. In many cases it is the whole point of the exercise! Clementine displays a model's structure when you select the properties item of its icon. This action brings up a "model browser" that displays the model's internal structure or logic. For example, it displays decision trees as hierarchical outlines. Like their graphical counterparts, Clementine's model browsers are interactive. You can collapse or expand decision trees and association rule outlines, and you can select portions of the model for generating submodels.
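To make the outline view of a decision tree concrete, here is a minimal Python sketch of rendering a tree as an indented outline with collapsible branches. The tree contents, the field names, the `[+]` marker, and the `render_outline` helper are all hypothetical illustrations; this is not Clementine's own display format or code.

```python
# Hypothetical decision tree: each node is a (label, children) pair.
tree = ("all records", [
    ("income <= 30000", [
        ("age <= 25 => high risk", []),
        ("age > 25 => medium risk", []),
    ]),
    ("income > 30000 => low risk", []),
])

def render_outline(node, max_depth=None, depth=0):
    """Return the outline as a list of lines.

    Branches at or below max_depth are collapsed and marked with [+].
    With max_depth=None the whole tree is expanded.
    """
    label, children = node
    lines = ["    " * depth + label]
    if children and max_depth is not None and depth >= max_depth:
        lines[-1] += " [+]"  # collapsed: children are hidden
        return lines
    for child in children:
        lines.extend(render_outline(child, max_depth, depth + 1))
    return lines

expanded = render_outline(tree)
collapsed = render_outline(tree, max_depth=1)

print("\n".join(expanded))
print()
print("\n".join(collapsed))
```

Collapsing hides a subtree without discarding it, which is what makes an outline practical for browsing large trees.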
Other models, such as neural networks, do not lend themselves to direct interpretation or manipulation. Even so, Clementine provides a complete rundown of each model's performance metrics. Optional procedures, such as sensitivity analysis in the case of neural networks, are present for added insight.

Generated models are available for use within a data stream. The Generated Models panel can be thought of as a palette of prediction or pattern recognition nodes. When a generated model is placed in a data stream, the model's logic is applied to each incoming data record. The results of the model's logic, or scores, are appended to each outgoing data record. Model building (or training), testing, and validation are easily specified through judicious construction of data streams and operations.

MODELING TECHNIQUES
Models are pattern recognition or prediction logic created by statistical and machine learning algorithms. The exact form of a particular model depends upon which algorithm was used and, in some cases, on modeling parameters the analyst sets. Clementine's two association rule nodes, Generalized Rule Induction and Apriori, are straightforward implementations of the classical algorithms, so it is easy to research their operation in the literature. Clementine's rule generators produce "unrefined" rule sets that may contain rules for multiple outputs. Unrefined rule sets are good for discovering association patterns, but not appropriate for making predictions. Clementine can also generate "refined" rule sets for predicting single outputs, and these rule sets can be used in data mining streams.

For clustering, Clementine offers the traditional Kohonen and K-means techniques along with a new TwoStep technique.
TwoStep is a hierarchical clustering algorithm that begins by automatically generating a set of low-level subclusters, then recursively merging them into larger, more generalized clusters until the process can no longer be performed without sacrificing the internal cohesiveness of the higher-level cluster.
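The two phases described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not SPSS's implementation: the Euclidean distance, the fixed `radius` and `max_merge_dist` thresholds, the function names, and the sample records are all assumptions made for the example.

```python
import math

def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def precluster(records, radius):
    """Step 1: a single pass that builds many small subclusters.

    A record joins the nearest subcluster if its centroid is within
    `radius`; otherwise the record starts a new subcluster.
    """
    subclusters = []
    for rec in records:
        best, best_dist = None, None
        for sub in subclusters:
            d = math.dist(rec, centroid(sub))
            if best_dist is None or d < best_dist:
                best, best_dist = sub, d
        if best is not None and best_dist <= radius:
            best.append(rec)
        else:
            subclusters.append([rec])
    return subclusters

def merge_subclusters(subclusters, max_merge_dist):
    """Step 2: merge the two closest clusters repeatedly.

    Merging stops when the closest pair of centroids is farther apart
    than `max_merge_dist`, a stand-in here for "merging would sacrifice
    the cohesiveness of the resulting cluster".
    """
    clusters = [list(s) for s in subclusters]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_merge_dist:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Made-up two-dimensional records forming two loose groups.
records = [(1.0, 1.0), (1.2, 0.9), (2.0, 1.1), (2.1, 1.0),
           (8.0, 8.0), (8.2, 7.9), (9.0, 8.1)]

subclusters = precluster(records, radius=0.5)
clusters = merge_subclusters(subclusters, max_merge_dist=2.0)
print(len(subclusters), "subclusters ->", len(clusters), "clusters")
```

On this sample data the first pass yields four subclusters, which the merge phase reduces to two clusters. The appeal of the two-phase design is that the expensive pairwise merging runs over a small number of subclusters rather than over every record.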