
Intelligent Enterprise

Better Insight for Business Decisions





November 10, 2000




A Meeting of Minds


The Knowledge Discovery and Data Mining Conference demonstrated commercial progress in the field

by Barry Grushkin

What if, hidden somewhere in the terabytes of data your company produces daily, lay the Holy Grail of your business: information that would make your company far more efficient, profitable, and successful?

The believers are marching in: the biggest-ever Knowledge Discovery and Data Mining Conference took place this August. More than 900 industrial practitioners, data miners, computer researchers, and academics came from around the world to attend more than 100 presentations.

At the conference, known as "KDD 2000" for short, it was clear that the industrial world is using ever more complex systems for commerce and production, and simultaneously creating ever-larger mountains of data at ever-shorter intervals - data of potentially very high value.

For me, the conference highlighted four clear themes in knowledge discovery today:

  • Quantitative methods are offering real, big-dollar solutions.
  • Multiple technological fronts are advancing.
  • Highly trained practitioners are key.
  • In many areas, leaps in understanding, technology, and methods are just waiting to happen.

Real Solutions

Diverse commercial areas are cashing in on knowledge discovery techniques now, including e-commerce, manufacturing, and biotechnology. (See Resources for this article at IntelligentEnterprise.com for more details.) Here's just a taste of the applications that were presented at the conference:

  • Neural nets are predicting the temperature of pig iron blast furnaces. Previously a hard-to-control and expensive uncertainty, temperature directly affects how costly it will be to produce steel and which types of steel can be made.
  • Semi-Markov models are telling when silicon chips are "cooked" by recognizing a subtle pattern whose timing is unpredictable. If you miss this tiny window of opportunity, expensive wafers become trash. (This ability to recognize patterns even when they are stretched and distorted bears a computational relationship to human perception and is undoubtedly a useful tool for recognizing financial market patterns.)
  • Decision trees are forecasting protein functions from DNA and RNA sequences, as well as predicting when jet engines will fail.
  • A new, commercially available classification package is predicting communication network loads and system breakdowns, as well as identifying Russian tanks from their sounds.
  • A whole new technology called genetic programming (GP), which recombines pieces of program code in the way natural selection acts on DNA, is forecasting carbon dioxide emissions in waste incinerators and is already in use on many process control problems. One paper suggested that GP offers an effective mechanism for finding models that satisfy multiple, potentially conflicting objectives simultaneously, such as a marketer's desire to maximize both the response to a solicitation and sales revenue.
  • Categorization and related-word mappings are central to making knowledge management methods more effective, since faster access to information is of major economic importance. On this front, Lycos Inc. presented a novel idea: inferring additional knowledge about texts by analyzing user actions. Web sites can be categorized without performing computations on the texts or even reading them. Put simply, two sites can be considered related if a user clicks on both of them after a single query. Furthermore, two queries can be considered related if different users navigate to the same site from both of them, which yields a thesaurus-like query-to-query mapping.
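The click-based idea in the last item can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Lycos implementation: the log format (user, query, clicked site), the function names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def related_sites(click_log):
    """Count site pairs that one user clicked after a single query.

    click_log: iterable of (user, query, site) tuples (assumed format).
    """
    sites_per_search = defaultdict(set)
    for user, query, site in click_log:
        sites_per_search[(user, query)].add(site)

    pair_counts = defaultdict(int)
    for sites in sites_per_search.values():
        for a, b in combinations(sorted(sites), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)


def related_queries(click_log):
    """Count query pairs from which different users reached the same site."""
    users_by_site = defaultdict(lambda: defaultdict(set))
    for user, query, site in click_log:
        users_by_site[site][query].add(user)

    pair_counts = defaultdict(int)
    for users_by_query in users_by_site.values():
        for q1, q2 in combinations(sorted(users_by_query), 2):
            # Require at least two distinct users across the two queries.
            if len(users_by_query[q1] | users_by_query[q2]) > 1:
                pair_counts[(q1, q2)] += 1
    return dict(pair_counts)


# Hypothetical click log for illustration only.
LOG = [
    ("u1", "jaguar", "cars.example"),
    ("u1", "jaguar", "autos.example"),
    ("u2", "jaguar", "cars.example"),
    ("u3", "sports car", "cars.example"),
    ("u3", "big cats", "zoo.example"),
]

print(related_sites(LOG))    # {('autos.example', 'cars.example'): 1}
print(related_queries(LOG))  # {('jaguar', 'sports car'): 1}
```

Note that neither function looks at page text; the counts come entirely from user behavior, which is the point of the approach. A real system would presumably apply a minimum count before treating a pair as related.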

Advancing Methods

The methods supporting these results are advancing rapidly, driven by the need to quickly analyze massive data sets more precisely, efficiently, and integrally with large existing informational infrastructures.

Many data mining algorithms were first created back when data sets were small enough to fit in main memory and no one had even heard of data warehousing or online analytic processing. Now powerful revisions of such data mining methods as association rules, classification, clustering, visualization, and high-dimensional indexing are being developed for very large data infrastructures - the kind used to store, say, millions of daily bank transactions or thousands of daily online users.

For example, one paper demonstrated a decision-tree method that outperformed C4.5 (the most widely used decision-tree algorithm) and scaled far beyond the roughly 100,000 records that C4.5 can work on, reaching greater than 88 percent accuracy on 100 million records with time costs rising only linearly with the data set's size.

Applications of classification algorithms range from developing trading rules to predicting which customers will buy and which parts of a complex piece of equipment will fail. For really big data sets, one paper demonstrated the value of a classification algorithm called a "Support Vector Machine." No matter how big the data set, this method finds a very small subset of records that gives the values needed to best categorize all other records.
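To see why a small subset of records can be enough, consider a deliberately tiny special case: one numeric feature and two cleanly separable classes. The sketch below is not a real Support Vector Machine (which solves an optimization problem over many dimensions) and is not the method from the paper; the function names and data are hypothetical. It only shows that the widest-margin boundary depends on the two records nearest the gap.

```python
def fit_max_margin_1d(records):
    """Fit a one-dimensional maximum-margin threshold.

    records: list of (value, label) pairs with label -1 or +1.
    Assumes every -1 value lies below every +1 value (separable data).
    Returns (threshold, support_vectors).
    """
    highest_negative = max(v for v, label in records if label == -1)
    lowest_positive = min(v for v, label in records if label == 1)
    if highest_negative >= lowest_positive:
        raise ValueError("classes overlap; this sketch needs separable data")
    # The widest-margin boundary sits midway between the two closest
    # records of opposite classes; those records are the support vectors.
    threshold = (highest_negative + lowest_positive) / 2
    support_vectors = [(highest_negative, -1), (lowest_positive, 1)]
    return threshold, support_vectors


def predict(threshold, value):
    """Classify a value by which side of the threshold it falls on."""
    return 1 if value > threshold else -1


# Hypothetical records for illustration only.
records = [(0.5, -1), (1.0, -1), (2.0, -1), (4.0, 1), (6.5, 1), (9.0, 1)]
threshold, support_vectors = fit_max_margin_1d(records)

# Refitting on the two support vectors alone gives the same boundary.
refit_threshold, _ = fit_max_margin_1d(support_vectors)

print(threshold, support_vectors)    # 3.0 [(2.0, -1), (4.0, 1)]
print(refit_threshold == threshold)  # True
```

Adding a million more records far from the gap would leave the threshold unchanged, which is the property the paragraph above describes.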

This concept of finding minimal requirements for maximal effect was a repeated theme. Examples include finding a minimal set of the most useful association rules or most interesting patterns, or a minimal set of dimensions that efficiently indexes a high-dimensional data set while still maximally preserving record relationships or distance measures.

Another example appeared in a paper that advocated using the minimal amount of costly yet essential human time to maximal effect. In a text categorization application, a jury of classifying algorithms tries to extend the classification from a few human-provided examples. Only when there is substantial disagreement in the automated panel does it come back and ask for a human opinion. This procedure continues iteratively until the whole corpus is classified.
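The jury procedure can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the two toy jurors, the agreement threshold, the function names, and the sample documents are all assumptions made for the example.

```python
from collections import Counter


def tokens(text):
    return set(text.lower().split())


def nearest_neighbor(labeled, doc):
    """Juror 1: label of the single labeled document sharing most words."""
    words = tokens(doc)
    best = max(labeled, key=lambda item: len(words & tokens(item[0])))
    return best[1]


def word_votes(labeled, doc):
    """Juror 2: label with the largest total word overlap across documents."""
    words = tokens(doc)
    votes = Counter()
    for text, label in labeled:
        votes[label] += len(words & tokens(text))
    return votes.most_common(1)[0][0]


def classify_corpus(seed, unlabeled, committee, ask_human, min_agreement=1.0):
    """Label every document, asking a human only when the jury disagrees."""
    labeled = list(seed)
    human_queries = 0
    for doc in unlabeled:
        votes = Counter(member(labeled, doc) for member in committee)
        label, count = votes.most_common(1)[0]
        if count / len(committee) < min_agreement:
            label = ask_human(doc)  # costly step, used sparingly
            human_queries += 1
        labeled.append((doc, label))  # later documents benefit from this label
    return labeled[len(seed):], human_queries


# Hypothetical seed examples and corpus for illustration only.
seed = [
    ("stock market rally", "finance"),
    ("bond prices rally", "finance"),
    ("team wins final match", "sports"),
]
corpus = ["market rally continues", "rally after final match", "team wins again"]
human_answers = {"rally after final match": "sports"}  # stands in for a person

results, asked = classify_corpus(
    seed, corpus, [nearest_neighbor, word_votes], human_answers.__getitem__
)
print([label for _, label in results])  # ['finance', 'sports', 'sports']
print(asked)                            # 1
```

In this run the jurors split only on the second document, so the human is consulted once for three documents. Lowering `min_agreement` trades human time for a higher risk of automated mistakes.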

Other interesting papers covered using the fractal dimension to find clusters of just about any possible shape in data sets and, on the visualization front, an interactive way of exploring very large relational data sets through 3D dynamic projections (slicing and dicing at any angle and systematically shifting in real time) and textured splatting volumes (a way of seeing high-dimensional scatter plots in 3D by rendering layers of transparent, multicolored volumes).






