CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions






October 4, 2001



Hard Core Mining

This data mining tool is not for the meek or frail

By Greg James



    PRODUCT SPEC SHEET

    Enterprise Miner 4.1

    SAS Institute Inc.
    SAS Campus Dr.
    Cary, NC 27513-2414
    800-727-0025
    www.sas.com

    PRICING: Roughly $100,000 to $400,000 the first year for a server with five included clients, depending on configuration. Available in client/server and desktop configurations.

    MINIMUM REQUIREMENTS: OS - (for client) Microsoft Windows 98, 2000, or NT, (for server) Windows NT or 2000, MVS ESA or prior releases including all OS/390 releases, Intel Linux, or 32- or 64-bit versions of Sun Solaris, IBM AIX, Compaq Tru64, or HP-UX. Disk space - 95MB. Memory - (for server) 512MB, (for Windows client) 48MB.

    There is little debate that SAS Institute Inc., now celebrating its 25th anniversary, is a widely recognized and respected business intelligence and data analysis vendor. The company's core statistical products are a de facto industry standard that many organizations rely on for decision support and mission-critical tasks. It was only a matter of time before it emerged as a major player in the data mining field, too.

    For many years, SAS didn't offer important nonstatistical algorithms that many data miners take for granted. It also lacked an integrated development environment to support the data mining process. That has all changed. In the mid-1990s SAS saw its opportunity: by 1997 it was extensively beta testing Enterprise Miner, and it officially released the product in early 1998. The current release, 4.1, includes enhancements to its core functions as well as exciting new modules that greatly extend its reach into new data mining applications.

    Lay of the Land

    Enterprise Miner is one member of a family of data analysis packages available from SAS. It plugs into SAS's top-level Display Manager and launches into a completely integrated child window. The Enterprise Miner interface has three main components: a Tool Bar across the top, a tree-oriented Project Navigator down the left, and a Diagram Workspace in the remaining area (see Figure 1).

    Enterprise Miner's myriad data mining operations are encapsulated as nodes. The Tool Bar, which is customizable, includes icons for the most common data mining project operations. Selecting the "Tools" tab at the bottom of the Project Navigator brings up a complete list of all available operations. To build a data mining project, you select nodes from the Tools List (or Tool Bar), drag them onto the Diagram Workspace, connect them into the desired sequence, and specify operational parameters within pop-up property windows.

    The Tools List categorizes the majority of Enterprise Miner's operations according to SAS's SEMMA data mining methodology. SEMMA is an acronym for Sample, Explore, Modify, Model, and Assess. The remaining operations fall into Scoring and miscellaneous Utility functions. SEMMA is an intuitive, functional decomposition of the typical data processing and analytic activities a data miner will perform. It is not, however, a comprehensive project management template like CRISP-DM.

    Data Mining Operations

    Enterprise Miner's functions are so thoroughly oriented around SEMMA that attempting to isolate and discuss just its modeling algorithms would be a mistake. Unfortunately, sometimes this categorization is a bit forced and unintuitive. Thus, finding what you want will require excursions through the online Enterprise Miner reference manual until you get the knack for the SEMMA breakdown.

    "Sample" functions include Input Data Source, Sampling, and Data Partition. The Sampling node performs simple random sampling, nth-observation sampling, stratified sampling, first-n sampling, and cluster sampling. The Data Partition node will take the target data set and split it into "train," "test," and "validate" subsets - a routine operation necessary for many modeling techniques. Combined with the Input Data Source node, these nodes provide an easy-to-use array of data access operations.
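
To make the sampling and partitioning ideas concrete, here is a minimal Python sketch of nth-observation sampling, stratified sampling, and a three-way split. It is a generic illustration, not Enterprise Miner's implementation; the function names, the default split proportions, and the seed parameter are all assumptions made for this example.

```python
import random
from collections import defaultdict

def nth_observation_sample(rows, n):
    """Keep every n-th row, starting with the first."""
    return rows[::n]

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction at random from each stratum.

    key(row) names the stratum; at least one row is kept per stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def partition(rows, train=0.4, validate=0.3, seed=0):
    """Shuffle a copy of rows and split it into train/validate/test subsets.

    The proportions are illustrative defaults; whatever is left after the
    train and validate shares goes to the test subset.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validate = round(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validate],
        shuffled[n_train + n_validate:],
    )
```

Fixing the seed keeps the partition reproducible, which matters when several models are to be compared on the same "validate" and "test" subsets.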

    "Explore" functions include the interactive Distribution Explorer, SAS Insight, and the new (experimental) Link Analysis nodes. Noninteractive Explore functions include Multiplot, Association, and Variable Selection. Multiplot is a simple, noninteractive graphics node that generates histograms and bar charts. The Association node generates traditional association rules and sequence chain rules. Variable Selection lets you choose input variables manually or automatically. It calculates R-square and chi-square statistics to identify the most important input variables automatically when you build models that predict interval or binary targets.
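
As a rough picture of what R-square screening for an interval target involves, the sketch below scores each candidate input by its squared correlation with the target and keeps those above a cutoff. It is a simplified stand-in, not the Variable Selection node's actual procedure; the function names and the `min_r2` cutoff are hypothetical.

```python
def r_square(xs, ys):
    """Squared Pearson correlation between one input and the target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return (sxy * sxy) / (sxx * syy)

def rank_inputs(inputs, target, min_r2=0.005):
    """Return (name, r_square) pairs at or above min_r2, best first.

    inputs maps a variable name to its list of values; min_r2 is an
    illustrative cutoff chosen for this example.
    """
    scored = [(name, r_square(values, target)) for name, values in inputs.items()]
    kept = [pair for pair in scored if pair[1] >= min_r2]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

A screen like this only sees one-variable linear relationships, so it is best treated as a first pass before modeling rather than a final verdict on which inputs matter.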

    The Distribution Explorer generates multidimensional histograms. It is optimized to handle large data sets efficiently and lets 3D charts be rotated and moved interactively. The Insight node is a link to the SAS Insight product. SAS Insight is an interactive data exploration and analysis tool that comes with its own 577-page manual! The two nodes are complementary: Use Distribution Explorer to explore very large data sets with a limited graphical palette and Insight to dig deeply into smaller, more refined subsets of data.

    "Modify" functions include the ability to maintain Data Set Attributes, Transform and Replace Variables, Filter Outliers, create cluster variables using the Cluster or SOM/Kohonen nodes, or transform transactional data into time series data using the new (experimental) Time Series node. The Data Set Attributes node is your window into the composition and use of a data mining data set. It maintains the metadata "glue" that all of Enterprise Miner's modules use. The Transform node is where you create new variables out of existing ones, and the Replace node is where you convert existing values or replace missing values. The Filter Outliers node handles both categorical and interval variables and provides automatic and manual methods for outlier removal.
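
Two of these operations, replacing missing values and filtering outliers on an interval variable, can be sketched in a few lines of Python. This is one common approach (mean replacement and a standard-deviation cutoff), assumed here for illustration; it does not describe the methods the Replace and Filter Outliers nodes actually offer, and the function names and `max_sd` parameter are hypothetical.

```python
import statistics

def replace_missing(values, fill=None):
    """Replace None entries with fill, or with the mean of the present values."""
    present = [v for v in values if v is not None]
    if fill is None:
        fill = statistics.mean(present)
    return [fill if v is None else v for v in values]

def filter_outliers(values, max_sd=3.0):
    """Drop values more than max_sd standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return list(values)  # no spread, so nothing counts as an outlier
    return [v for v in values if abs(v - mean) <= max_sd * sd]
```

Order matters when the two are combined: an extreme value left in place will pull the replacement mean toward itself, so filtering before replacing is often the safer sequence.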

    SEMMA treats clustering as a "modify" function, not a "modeling" function. Perhaps the rationale is that clustering is used most often to segment a data set into groups that are, in turn, subjected to further analysis. Enterprise Miner provides two nodes for clustering: the recommended Cluster node and the SOM/Kohonen node. The Cluster node is merely a link to the SAS FASTCLUS procedure, which efficiently produces mutually exclusive clusters over very large data sets. The SOM/Kohonen node is principally used for feature selection and dimension reduction, especially when the relationships among variables are highly nonlinear.
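
For readers who have not seen how a procedure arrives at mutually exclusive clusters, the following is a bare-bones k-means sketch in Python: each point is assigned to its nearest centroid, centroids are recomputed, and the loop repeats until assignments stop changing. It is a textbook illustration rather than the FASTCLUS algorithm itself; the function names, the caller-supplied seeds, and the iteration cap are assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, seeds, max_iter=20):
    """Assign each point to exactly one cluster; return (assignment, centroids).

    seeds are the starting centroids supplied by the caller.
    """
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids
```

The resulting cluster index is exactly the kind of derived variable the article describes: once attached to each row, it can be used to segment the data set for further analysis.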






