Hard Core Mining
This data mining tool is not for the meek or frail
By Greg James
There is little debate that SAS Institute Inc., now celebrating its 25th anniversary, is a widely recognized and respected business intelligence and data analysis vendor. The company's core statistical products are a de facto industry standard that many organizations rely upon for decision support and mission-critical tasks. It was only a matter of time before it emerged as a major player in the data mining field, too.
For many years, SAS didn't offer important, non-statistical algorithms that many data miners take for granted. It also lacked an integrated development environment to support the data mining process. That's all changed. In the mid-1990s, SAS saw its opportunity. By 1997, it was extensively beta testing Enterprise Miner, which it officially released in early 1998. The current release, 4.1, includes enhancements to its core functions as well as new modules that greatly expand its use into new data mining applications.
Lay of the Land
Enterprise Miner is one member of a family of data analysis packages available from SAS. It plugs into SAS's top-level Display Manager and launches into a completely integrated child window. The Enterprise Miner interface has three main components: a Tool Bar across the top, a tree-oriented Project Navigator down the left, and a Diagram Workspace in the remaining area (see Figure 1).
Enterprise Miner's myriad data mining operations are encapsulated as nodes. The Tool Bar, which is customizable, includes icons for the most common data mining project operations. Selecting the "Tools" tab at the bottom of the Project Navigator brings up a complete list of all available operations. To build a data mining project, you select nodes from the Tool List (or Tool Bar), drag them onto the Diagram Workspace, connect them into the desired sequence, and specify operational parameters within pop-up property windows.
The Tool List categorizes the majority of Enterprise Miner's operations according to SAS's SEMMA data mining methodology. SEMMA is an acronym for Sample, Explore, Modify, Model, and Assess. The remaining operations fall into Scoring and miscellaneous Utility functions. SEMMA is an intuitive, functional decomposition of the typical data processing and analytic activities a data miner will perform. It is not, however, a comprehensive project management template like CRISP-DM.
Data Mining Operations
Enterprise Miner's functions are so thoroughly oriented around SEMMA that attempting to isolate and discuss just its modeling algorithms would be a mistake. Unfortunately, the categorization is sometimes a bit forced and unintuitive, so finding what you want will require excursions through the online Enterprise Miner reference manual until you get the knack for the SEMMA breakdown.
"Sample" functions include Input Data Source, Sampling, and Data Partition. The Sampling node performs simple random sampling, nth-observation sampling, stratified sampling, first-n sampling, and cluster sampling. The Data Partition node takes the target data set and splits it into "train," "test," and "validate" subsets, a routine operation necessary for many modeling techniques. Combined with the Input Data Source node, these nodes provide an easy-to-use array of data access operations.
"Explore" functions include the interactive Distribution Explorer, SAS Insight, and the new (experimental) Link Analysis nodes. Noninteractive Explore functions include Multiplot, Association, and Variable Selection. Multiplot is a simple, noninteractive graphics node that generates histograms and bar charts. The Association node generates traditional association rules and sequence chain rules. Variable Selection lets you choose input variables manually or automatically. It calculates R-square and chi-square statistics to automatically identify the most important input variables for models that predict interval or binary targets.
The Distribution Explorer generates multidimensional histograms. It is optimized to handle large data sets efficiently and lets you rotate and move 3D charts interactively. The Insight node is a link to the SAS Insight product. SAS Insight is an interactive data exploration and analysis tool that comes with its own 577-page manual! The two nodes are complementary: Use Distribution Explorer to explore very large data sets with a limited graphical palette, and Insight to dig deeply into smaller, more refined subsets of data.
"Modify" functions include the ability to maintain Data Set Attributes, Transform and Replace Variables, Filter Outliers, create cluster variables using the Cluster or SOM/Kohonen nodes, or transform transactional data into time series data using the new (experimental) Time Series node. The Data Set Attribute node is your window into the composition and use of a data mining data set. It maintains the metadata "glue" that all of Enterprise Miner's modules use. The Transform node is where you create new variables out of existing ones, and the Replace node is where you convert existing values or replace missing values. The Filter Outliers node handles both categorical and interval variables and provides automatic and manual methods for outlier removal.
SEMMA treats clustering as a "modify" function, not a "modeling" function. Perhaps the rationale is that clustering is used most often to segment a data set into groups that are, in turn, subjected to further analysis. Enterprise Miner provides two nodes for clustering: the recommended Cluster node and the SOM/Kohonen node. The Cluster node is merely a link to the SAS FASTCLUS procedure, which efficiently produces mutually exclusive clusters over very large data sets. The SOM/Kohonen node is principally used for feature selection and dimension reduction, especially when variables exhibit a high degree of nonlinear relationships.
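To make "mutually exclusive clusters" concrete, here is a minimal sketch of generic k-means clustering in plain Python. It is not SAS code and does not reproduce the FASTCLUS procedure itself; it only illustrates the general idea of assigning every observation to exactly one cluster. The function name `kmeans`, the toy data, the choice of two clusters, the iteration count, and the random seed are all assumptions made for illustration.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: every point ends up in exactly one of k clusters."""
    rng = random.Random(seed)
    # Initial centers: k distinct observations picked at random (an assumption;
    # real procedures use more careful seed selection).
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Hypothetical toy data: two well-separated groups of 2-D observations.
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.8)]
labels, centers = kmeans(data, k=2)
print(labels)
```

The resulting label for each observation plays the role of a new cluster variable, which is why a segmentation step like this can be treated as a "modify" operation that feeds later modeling.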