Intelligent Enterprise

Better Insight for Business Decisions

December 5, 2001

A Closer Look

Giving customer emails the same analytic scrutiny you give transactions could be a valuable competitive advantage

By Dan Sullivan

Continued from Page 1

The final product of the information extraction process is either a fully structured representation of an email in relational form or a semistructured XML document that can be easily parsed and loaded into a database.

Once in the database, extracted elements can be mapped to existing dimensional models and hierarchies, and aggregate measures can be calculated about the extracted facts. At that point, users are ready to pose questions such as, "How many complaints were received about consumer electronics equipment in the last three months?"
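As a rough illustration of the kind of question this enables, the sketch below counts extracted complaint facts for one product category over a three-month window. The ExtractedFact class, its fields, and the sample values are assumptions made for illustration only; in practice the facts would sit in database tables and the count would be an aggregate query against a dimensional model.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory stand-in for facts loaded into a database.
final class ComplaintCounter {
    // Assumed shape of one fact extracted from an email; not a vendor schema.
    static final class ExtractedFact {
        final String messageType;      // e.g. "complaint", "inquiry"
        final String productCategory;  // e.g. "consumer electronics"
        final LocalDate received;

        ExtractedFact(String messageType, String productCategory, LocalDate received) {
            this.messageType = messageType;
            this.productCategory = productCategory;
            this.received = received;
        }
    }

    // Counts complaints in a category received on or after the cutoff date.
    static long countComplaints(List<ExtractedFact> facts, String category, LocalDate cutoff) {
        return facts.stream()
                .filter(f -> f.messageType.equals("complaint"))
                .filter(f -> f.productCategory.equals(category))
                .filter(f -> !f.received.isBefore(cutoff))
                .count();
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2001, 12, 5);
        List<ExtractedFact> facts = Arrays.asList(
                new ExtractedFact("complaint", "consumer electronics", LocalDate.of(2001, 11, 20)),
                new ExtractedFact("complaint", "consumer electronics", LocalDate.of(2001, 6, 1)),
                new ExtractedFact("inquiry", "consumer electronics", LocalDate.of(2001, 10, 15)),
                new ExtractedFact("complaint", "appliances", LocalDate.of(2001, 11, 2)));
        long n = countComplaints(facts, "consumer electronics", today.minusMonths(3));
        System.out.println("Complaints in the last three months: " + n);
    }
}
```

With the sample data above, only the November complaint about consumer electronics falls inside the window.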

EXTRACTION TOOLS FOR EMAIL

Compared to their text analysis siblings (search, classification, and taxonomy generators), information extraction tools are far less common. Tools from three vendors — Solutions-United, Temis, and WhizBang Labs — are representative examples of this technology.

Solutions-United's MetaMarker provides a full range of lexical, syntactic, semantic, and pragmatic analysis. The product includes a set of distinct modules for individual tasks, allowing developers to choose processing features a la carte through a Java- and XML-based API. From a system architecture perspective, MetaMarker is a transformation tool that takes free-form text as input and generates a formatted data stream ready to load into the database.

MetaMarker works with three basic objects: resources, tasks, and controls. There are two types of resources: rule sets, which work on a single problem such as lexical analysis or noun phrase identification, and knowledge bases that support analysis, such as a lexicon. Tasks are sets of resources that are applied to a text in a particular order. For example, the part-of-speech tagging task is specified by:


<MTASK NAME='POS TAGGER' TYPE='POS_TAGGER'>
  <MRES NAME='LEXICON'>LEXICON.DAT</MRES>
  <MRES NAME='LEXICAL RULES'>LEXICON.RUL</MRES>
  <MRES NAME='CONTEXT RULE'>CONTEXT.RUL</MRES>
</MTASK>

Other predefined tasks include cleaner, tokenizer, sentence detector, stemmer, classifier, and formatter. You can group tasks into profiles for easier reference when processing documents. Controls specify which tasks are applied to a particular section of the email; for example, the classification task might be applied to the body of an email only, instead of including the header information as well.
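The relationship between tasks, profiles, and controls can be pictured with a small sketch. This is a simplified model rather than the MetaMarker API: the ProfileSketch class, the use of plain string transformations as tasks, and the section names are all assumptions chosen to show the idea of an ordered profile applied only to selected sections.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical model of profiles and controls; not the MetaMarker API.
final class ProfileSketch {
    // A profile is an ordered list of tasks; each task transforms the text.
    static String runProfile(List<UnaryOperator<String>> profile, String text) {
        String result = text;
        for (UnaryOperator<String> task : profile) {
            result = task.apply(result);
        }
        return result;
    }

    // A control maps a section name to the profile applied to that section;
    // sections without a control pass through unchanged.
    static Map<String, String> process(Map<String, String> sections,
                                       Map<String, List<UnaryOperator<String>>> controls) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : sections.entrySet()) {
            List<UnaryOperator<String>> profile = controls.get(e.getKey());
            out.put(e.getKey(), profile == null ? e.getValue() : runProfile(profile, e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two toy tasks standing in for real ones such as a cleaner or stemmer.
        UnaryOperator<String> cleaner = s -> s.trim().replaceAll("\\s+", " ");
        UnaryOperator<String> lowercaser = s -> s.toLowerCase();
        List<UnaryOperator<String>> core = Arrays.asList(cleaner, lowercaser);

        Map<String, String> email = new LinkedHashMap<>();
        email.put("header", "Subject:  Broken  Player");
        email.put("body", "  My DVD   player stopped working. ");

        Map<String, List<UnaryOperator<String>>> controls = new LinkedHashMap<>();
        controls.put("body", core); // apply the profile to the body only

        System.out.println(process(email, controls));
    }
}
```

Here the header is left untouched because no control names it, which mirrors the idea of restricting a task to the body of an email.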

Objects are specified at the environment level using an XML configuration file and at the document level, where the specifications are embedded in the text stream. (A typical message ready for analysis would look something like what you see in Listing 1.) This example assumes we have defined two profiles, core and classify, using the appropriate tasks, which in turn specify the resources required to complete the operation.

MetaMarker is invoked from a Java program in several steps: a document object is created, the source file is parsed, a MetaMarker object is created, and the processDocument method is invoked. The basic template is:


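// Identifier capitalization is reconstructed from the all-uppercase
// listing using Java conventions; exact casing is an assumption.
com.solutionsunited.metamarker.Document doc =
    new com.solutionsunited.metamarker.Document();
doc.fromURI(filename);   // parse the source file
com.solutionsunited.metamarker.MetaMarker mm =
    new com.solutionsunited.metamarker.MetaMarker();
mm.processDocument(doc); // apply the configured tasks
java.lang.String xmldoc = mm.getXML(); // augmented message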

The resulting string in the xmldoc object is the fully augmented message with tags such as <PRODUCTMODELID>, <PRODUCTMODEL>, and <PRODUCTMANUFACTURER>. Similar types of information extraction are available with Temis' suite of products.
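Once the augmented message is available as an XML string, the tagged values can be pulled out with a standard XML parser before loading. The following is a minimal sketch using the JDK's DOM parser; the sample message, the MESSAGE wrapper element, and the TaggedMessageReader class are assumptions for illustration, and only the three tag names come from the text above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TaggedMessageReader {
    // Returns the text of the first element with the given tag, or null if absent.
    static String firstValue(String xml, String tag) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = dom.getElementsByTagName(tag);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            // Malformed XML or parser configuration problems end up here.
            throw new IllegalArgumentException("Could not parse tagged message", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample; the MESSAGE wrapper and the values are assumptions.
        String xmldoc = "<MESSAGE>My <PRODUCTMANUFACTURER>Acme</PRODUCTMANUFACTURER> "
                + "<PRODUCTMODEL>DVD player</PRODUCTMODEL> "
                + "<PRODUCTMODELID>X100</PRODUCTMODELID> stopped working.</MESSAGE>";
        System.out.println(firstValue(xmldoc, "PRODUCTMANUFACTURER"));
        System.out.println(firstValue(xmldoc, "PRODUCTMODELID"));
    }
}
```

Values read this way could then be written to the columns of a fact or dimension table by whatever loading process the warehouse already uses.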

Temis also provides a set of tools for analyzing customer emails. Its Insight Discoverer Extractor combines morphological and syntactic analysis to produce an intermediate representation of a message, which is then further analyzed using a set of grammatical rules and thesauri stored in "skill cartridges." The skill cartridges contain both lexical (word-based) information and sentence patterns for identifying key features. The Opinions Analysis cartridge identifies particular opinions and their relationship to specific objects, such as a customer's dissatisfaction with a particular product. Industry-specific cartridges, such as the Banking Cartridge, are also available.

Although not specifically targeted at email analytics, WhizBang Labs is another vendor in the information extraction market. WhizBang's products use machine learning techniques to develop extraction rules from example texts labeled by users. In addition, extracted information is assigned confidence measures that can be used to route texts that fall below a predefined threshold to a human for verification. These tools have been used in a range of related applications, from extracting targeted information from resumes to adding XML structures to unstructured texts.
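The routing step described here amounts to a simple filter on the confidence measure. The sketch below shows one way it might look; the ReviewRouter and Extraction classes, the 0.0 to 1.0 confidence scale, and the 0.80 threshold are illustrative assumptions, not part of WhizBang's products.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of threshold-based routing; not WhizBang's API.
final class ReviewRouter {
    // One extracted value with the confidence the extractor assigned to it.
    static final class Extraction {
        final String field;
        final String value;
        final double confidence; // assumed to lie in [0.0, 1.0]

        Extraction(String field, String value, double confidence) {
            this.field = field;
            this.value = value;
            this.confidence = confidence;
        }
    }

    // Returns the extractions whose confidence falls below the threshold.
    static List<Extraction> needsReview(List<Extraction> extractions, double threshold) {
        List<Extraction> flagged = new ArrayList<>();
        for (Extraction e : extractions) {
            if (e.confidence < threshold) {
                flagged.add(e);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<Extraction> extractions = Arrays.asList(
                new Extraction("name", "J. Smith", 0.97),
                new Extraction("phone", "555-0100", 0.62));
        for (Extraction e : needsReview(extractions, 0.80)) {
            System.out.println("Route to human review: " + e.field + " = " + e.value);
        }
    }
}
```

Choosing the threshold is a trade-off: a higher value sends more texts to a person for verification, while a lower one lets more uncertain extractions through unreviewed.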



TO THE LETTER

Not all the information we need to manage customer relations is neatly packaged and ready for analysis. Customer emails can act as early indicators of trends, such as dissatisfaction with a shipping policy, that will not show up in structured databases for some time or with as much detail. Emerging information extraction tools are bridging the gap between free-form texts and analytic tools.


Dan Sullivan [dsullivan@redmontcorp.com], author of Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing and Sales (John Wiley, 2001), is CTO of Redmont Corp., a firm specializing in the design and development of content management and business intelligence applications.


Read All About It: The Right Tool for the Job

Information extraction tools are beginning to make their mark in e-business analytics, but like other contemporary text analysis tools, they have their limitations.

First, it is difficult to develop rule sets for identifying interesting patterns. Some progress has been made in applying machine-learning techniques to discover these rules (for example, in WhizBang's products), but you should assume results will still require some manual review. A second shortcoming is that pattern-matching techniques are not foolproof. Without the correct context, order numbers can be confused with account numbers, and noun phrases may not be parsed correctly. Finally, specialized thesauri and knowledge bases may be required for accurate information extraction in domains with specialized terminology.
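The order-number problem is easy to reproduce with regular expressions. In the sketch below, a pattern that looks only at the shape of the number picks up an account number as well, while a pattern that also requires the word "order" nearby does not. The eight-digit format, the sample sentence, and the OrderNumberMatcher class are assumptions for illustration; real rule sets would need to cover many more phrasings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OrderNumberMatcher {
    // Naive: any standalone run of eight digits (an assumed number format).
    static final Pattern ANY_EIGHT_DIGITS = Pattern.compile("\\b\\d{8}\\b");

    // Context-aware: eight digits introduced by the word "order".
    static final Pattern ORDER_NUMBER = Pattern.compile(
            "(?i)\\border\\s*(?:number|no\\.?|#)?\\s*:?\\s*(\\d{8})\\b");

    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            // Use the capture group when the pattern has one, else the whole match.
            hits.add(m.groupCount() > 0 ? m.group(1) : m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Please charge account 12345678 for order #87654321.";
        System.out.println(findAll(ANY_EIGHT_DIGITS, text)); // [12345678, 87654321]
        System.out.println(findAll(ORDER_NUMBER, text));     // [87654321]
    }
}
```

Even the context-aware pattern is fragile: a message that says "my last purchase, 87654321" would be missed, which is why manual review and richer linguistic analysis remain necessary.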

However, if you need to extract a relatively fixed set of information from free-form text, an information extraction tool might meet your needs. When evaluating these tools, consider these factors:

  • How are extracted features defined? Does the product provide a predefined set of features, such as product names, model numbers, and order numbers? If so, to what extent?
  • Do date and currency matching patterns meet all your international needs?
  • How are new patterns defined? Initially, information extraction programs depended on hand-crafted rules, but developers are now making progress toward automating this step with machine-learning techniques. If an automated method is available, how many training instances are required?
  • Are industry-specific modules available? Both terminology and phrase patterns will vary across industries.
  • Does the tool include a classification module? You could use such a module to categorize the message and then apply type-specific patterns.
  • Finally, applying multiple levels of analysis will take time. Will the tool meet your performance requirements?


RESOURCES

Solutions-United: www.solutions-united.com

Temis: www.temis-group.com

WhizBang Labs: www.whizbang.com






