CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions




Eye on the Competition

Why text mining is the key enabler of automated competitive intelligence

By Dan Sullivan

Mastering competitive intelligence (CI), the art of gathering information about competitors’ activities and general market trends, is now as important as tapping business intelligence techniques to understand internal operations. Why? Because e-business lowers or demolishes barriers to entry and fuels the continual emergence of new business models. To stay abreast of these changes, your organization must give its analysts timely, concise information about your competitive environment, such as changes in regulatory environments, major capital expansions by competitors, and changes in business partnerships and alliances.

However, the type of information required in CI goes beyond the numeric data typical of decision-support systems. Rather, CI analysts need to monitor news stories, government reports, press releases, patents, trademarks, public records, conference proceedings, and SEC filings. Many of these sources provide primarily textual information or a mix of textual and numeric data. Thus, the key to unlocking the CI value of these sources is text mining.

In this article, I’ll discuss the CI process with an emphasis on text-mining techniques that automate the process. Then I’ll describe the tools used in automated CI systems as well as the architecture of those applications. Finally, I’ll examine the privacy and ethical considerations you must weigh when undertaking CI operations.

Needles in the Haystack

Automated CI has three distinct parts: gathering raw information, processing the raw information, and delivering targeted, concisely organized details. Finding information is not a problem on the Internet; you could easily inundate yourself with data on virtually any topic with little effort. The real challenge in CI is focusing on a limited number of topics and filtering out as many irrelevant documents as possible.

Search engines and Web robots, or crawlers, are the basic tools for this first stage in CI. Although we are all familiar with the inaccuracy of search engines (queries often return a large number of irrelevant pages while missing many relevant ones), they are still the best starting point for general inquiries. Large, keyword-oriented search engines such as Alta Vista and Google attempt to index large portions of the Web and are good starting points for comprehensive searches. Directories such as Yahoo sacrifice the size of their index for improved precision by organizing Web sites into a classification hierarchy. These directories are useful for discovering companies and organizations involved in particular markets or business sectors. Metasearch engines, such as Metacrawler, query multiple search engines and merge the results, improving recall. (A single search engine fully indexes perhaps only 50 percent of the Web, at best.) In general, search engines are useful for discovering new data sources. Once you have found a reliable source of information, however, Web robots, which act as agents that execute specific Web-oriented tasks such as retrieving pages on a regular basis, are a more efficient mechanism for retrieving news and other updated information.
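The recall benefit of merging results can be illustrated with a minimal sketch. The engine result lists, URLs, and the set of relevant pages below are invented for illustration; a real metasearch tool would query live engines and would not know the relevant set in advance.

```python
def merge_results(*result_lists):
    """Merge ranked URL lists from several engines, dropping duplicates."""
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def recall(retrieved, relevant):
    """Fraction of the relevant pages that were retrieved."""
    return len(set(retrieved) & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the retrieved pages that are relevant."""
    return len(set(retrieved) & relevant) / len(retrieved)


# Hypothetical data: four relevant pages, two engines that each find some.
relevant = {"a.example/1", "b.example/2", "c.example/3", "d.example/4"}
engine_one = ["a.example/1", "x.example/9", "b.example/2"]
engine_two = ["b.example/2", "c.example/3", "y.example/8"]

merged = merge_results(engine_one, engine_two)
print(recall(engine_one, relevant))  # 0.5
print(recall(merged, relevant))      # 0.75
```

In this toy data the merged list finds three of the four relevant pages, where the first engine alone finds two. Merging also carries along each engine's irrelevant hits, so recall improves without any guarantee for precision.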

For example, you could stay abreast of your competitors’ activities on an ongoing basis by regularly monitoring general business and industry-specific Web sites as well as combing for new information sources with search engines. In addition to competitors’ Web sites, CI applications can find a wealth of information in the U.S. Securities and Exchange Commission’s (SEC’s) EDGAR database, from press release services such as Business Wire, from online news services, and from regulatory agencies such as the U.S. Food and Drug Administration. (See the sidebar, “Where to Look?” on page 32 for more sources.) To find new sources, Web robots, in combination with scripting languages such as Python and Perl, provide the means to query search engines regularly and retrieve the resulting pages for analysis.
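As a rough sketch of regular monitoring in Python, the function below reports which pages have changed since the previous visit by comparing content digests. The function names and the idea of keeping digests in a dictionary are assumptions made for this example; scheduling (for instance with cron) and polite request pacing are left out.

```python
import hashlib
from urllib.request import urlopen


def fetch(url):
    """Download a page body as bytes using the standard library."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def check_for_changes(urls, last_seen, fetcher=fetch):
    """Return the URLs whose content differs from the previous visit.

    last_seen maps URL -> SHA-256 digest from the prior run and is
    updated in place so the next run compares against this one.
    """
    changed = []
    for url in urls:
        digest = hashlib.sha256(fetcher(url)).hexdigest()
        if last_seen.get(url) != digest:
            changed.append(url)
            last_seen[url] = digest
    return changed
```

Passing the fetcher as a parameter keeps the comparison logic testable without network access. A page never seen before counts as changed, so the first run returns every URL.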

Because Web robots can place high demands on a Web server, many sites use the Robot Exclusion Protocol to limit the use of automated retrieval tools. In practice, Web sites place a robots.txt file in the site’s root directory specifying which areas of the site are off limits to different robots. Web robots in turn voluntarily abide by these restrictions. (A good introduction to the Robot Exclusion Protocol is available at info.webcrawler.com/mak/projects/robots/exclusion-admin.html.)
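A well-behaved robot can check these restrictions before each request. The sketch below uses the robots.txt parser in Python's standard library; the file content, host name, and robot names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A live robot would download the file
# from the site's root directory, e.g. with set_url() followed by read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ci-robot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """True if the named robot may retrieve the URL under these rules."""
    return parser.can_fetch(agent, url)


print(allowed("generic-bot", "http://www.example.com/press/"))     # True
print(allowed("generic-bot", "http://www.example.com/private/x"))  # False
print(allowed("ci-robot", "http://www.example.com/press/"))        # False
```

Here the site bars every robot from /private/ and bars the robot named ci-robot from the whole site. Compliance remains voluntary: nothing in the protocol stops a robot that skips this check.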

Gathering documents is just the first part of the process. Next, you need to reduce the documents to essential information and get it to those who need it.

Pack It Up, Ship It Off

Even with targeted retrieval programs, automated CI applications are still likely to retrieve irrelevant documents. In addition, because the documents will probably find their way into a text repository, document warehouse, or database that supports multiple CI activities, classifying document topics is an early processing step.

During the classification process, you add metadata to these documents using a standard set of terms defined in a subject hierarchy. For example, news stories and other articles are frequently classified either by their dominant theme (“merger” or “terrorism,” for example) or by a number of generic dimensions such as location, industry type, political event, and economic issue. MyCNN.com, for instance, uses Oracle Intermedia Text’s built-in knowledge base to classify stories based on frequently used words in a story. If it has a rich vocabulary, a classification tool will correctly classify documents discussing stocks and equities as stories about investment vehicles, while categorizing stories about Malaysian trade policies as articles about Southeast Asian politics and economics. Classification programs will distinguish documents to varying degrees of granularity, depending on the depth and breadth of the subject hierarchy.
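To make the idea concrete, here is a deliberately simple keyword-counting classifier. It is not how Oracle Intermedia Text or any other product works; the two-level hierarchy, the vocabularies, and the `min_hits` threshold are all assumptions chosen for the example.

```python
import re
from collections import Counter

# Hypothetical subject hierarchy: (broad topic, narrow topic) -> vocabulary.
HIERARCHY = {
    ("Finance", "Investment vehicles"):
        {"stock", "stocks", "equity", "equities", "bond", "bonds"},
    ("Politics and economics", "Southeast Asia"):
        {"malaysia", "malaysian", "trade", "tariff", "asean"},
    ("Corporate events", "Mergers"):
        {"merger", "acquisition", "acquire", "takeover"},
}


def classify(text, min_hits=2):
    """Return hierarchy paths whose vocabulary occurs at least min_hits times."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        path: sum(counts[term] for term in terms)
        for path, terms in HIERARCHY.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [path for path, score in ranked if score >= min_hits]
```

The richer the vocabulary attached to each node, the better the odds of a correct assignment, which is the point made above. The returned paths are the metadata that would be stored with the document.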

Once documents are classified, they can be routed to interested readers. Defining and maintaining user interest profiles is an essential part of the routing process. Again, subject hierarchies play a central role. Depending on the granularity and breadth of the hierarchy, users can define specific filtering criteria. For example, someone interested in political risk analysis in Asia may want stories about Malaysia’s trade, local elections, and law enforcement, but not about the country’s Olympic team. Although manual maintenance is the most easily implemented means of managing profiles, in some cases machine-learning techniques may offer a better mechanism for automated profile adjustment.
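A manually maintained profile can be as simple as a set of wanted topics and a set of unwanted ones. The sketch below assumes topics are written as hierarchy paths such as "Malaysia/Trade"; the profile names and topic labels are made up for illustration.

```python
# Hypothetical interest profiles: topics each analyst wants or rejects.
PROFILES = {
    "asia_risk_analyst": {
        "include": {"Malaysia/Trade", "Malaysia/Elections",
                    "Malaysia/Law enforcement"},
        "exclude": {"Malaysia/Sports"},
    },
    "pharma_analyst": {
        "include": {"Regulation/FDA"},
        "exclude": set(),
    },
}


def route(document_topics, profiles=PROFILES):
    """Return the sorted names of users who should receive a document.

    A document is delivered when it carries at least one included topic
    and none of the excluded ones.
    """
    topics = set(document_topics)
    return sorted(
        user
        for user, profile in profiles.items()
        if topics & profile["include"] and not topics & profile["exclude"]
    )
```

Letting an excluded topic veto delivery is one possible design choice. A more lenient rule would deliver any document with an included topic and use exclusions only to break ties.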

Filtering by classification can significantly reduce the number of documents an analyst receives, but automatic summarization can reduce the workload even more. By some estimates, the meaning of the average document can be conveyed in approximately 20 percent of the original text. There is no single algorithm for summarizing text, but in general, it can be executed by extracting sentences containing the most frequently occurring words in a document. That heuristic is usually modified to exclude commonly used words that do not help distinguish the meaning of a text, such as “the,” “but,” and “which.” Some summarization tools weight words used in titles more heavily than others and treat phrases such as “in conclusion” or “more importantly” as good indicators of important sentences. Overall, between classification and summarization, a CI analyst can effectively reduce the amount of text he or she must read to a fraction of the original document set.
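The frequency heuristic described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's algorithm: the stop-word list is tiny, sentence splitting is naive, and the title and cue-phrase weighting mentioned above is omitted.

```python
import re
from collections import Counter

# A tiny stop-word list for illustration; real tools use much longer ones.
STOP_WORDS = {"the", "but", "which", "a", "an", "and", "of", "to", "in",
              "is", "it", "that", "for", "on", "was", "were", "as", "by"}


def summarize(text, ratio=0.2):
    """Keep roughly `ratio` of the sentences, chosen by word frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    freq = Counter(word for words in tokenized for word in words)

    def score(words):
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    keep = max(1, round(len(sentences) * ratio))
    best = sorted(range(len(sentences)),
                  key=lambda i: score(tokenized[i]), reverse=True)[:keep]
    # Emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(best))
```

The default `ratio` of 0.2 mirrors the 20 percent estimate quoted above. Scoring by average rather than total frequency is an assumption of this sketch.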

When dealing with documents that mix text and numeric data, such as the U.S. Agriculture Department’s crop production reports or the SEC’s Form 10-K on publicly traded companies, standard summarization techniques alone will not suffice. Before summarizing, you should extract the numeric data, either with a custom program for the particular report type or with an XML parser if a common standard, such as Extensible Business Reporting Language (XBRL), is involved. As XML becomes more widely adopted, extracting target information will become easier, especially with regularly generated information such as mandated regulatory reports and financial statements.
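When a report arrives as XML, separating the numbers from the narrative takes only a standard parser. The fragment below is hypothetical: its element and attribute names are invented for this example and are not taken from the XBRL specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical report fragment with invented element names (not real XBRL).
REPORT = """\
<report company="Example Corp" period="FY2000">
  <narrative>Revenue grew on strong demand in Asia.</narrative>
  <fact name="Revenue" unit="USD">1250000</fact>
  <fact name="NetIncome" unit="USD">98000.50</fact>
</report>
"""


def extract_facts(xml_text):
    """Split a mixed report into numeric facts and narrative text."""
    root = ET.fromstring(xml_text)
    facts = {
        fact.get("name"): (float(fact.text), fact.get("unit"))
        for fact in root.iter("fact")
    }
    narrative = " ".join(
        n.text.strip() for n in root.iter("narrative") if n.text
    )
    return facts, narrative


facts, narrative = extract_facts(REPORT)
print(facts["Revenue"])  # (1250000.0, 'USD')
```

The numeric facts can then go to a conventional database, while only the narrative text is passed on for summarization.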

Competitive Intelligence Tools

Implementing a CI system requires several different types of tools in addition to search engines and Web robots. First, you must store the retrieved texts in a repository. The most common options are a textbase such as Lotus Notes, a database system with text support such as Oracle 8i, or a combination of the two, such as IBM’s Content Manager and DB2. Ideally, repositories should support triggers and other programmatic hooks that let you route documents, track metadata, cluster similar documents, and perform conditional text processing.
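The role of a trigger can be shown with any database that supports them. The sketch below uses SQLite through Python's standard library purely as a stand-in for the repositories named above; the table and trigger names are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id     INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    topic  TEXT NOT NULL,
    body   TEXT NOT NULL
);
CREATE TABLE routing_queue (
    document_id INTEGER NOT NULL,
    topic       TEXT NOT NULL
);
-- Hook: every newly loaded document is queued for routing by topic.
CREATE TRIGGER queue_new_document AFTER INSERT ON documents
BEGIN
    INSERT INTO routing_queue (document_id, topic)
    VALUES (NEW.id, NEW.topic);
END;
""")

conn.execute(
    "INSERT INTO documents (source, topic, body) VALUES (?, ?, ?)",
    ("press release", "merger", "Example Corp agrees to acquire Sample Inc."),
)
queued = conn.execute("SELECT document_id, topic FROM routing_queue").fetchall()
print(queued)  # [(1, 'merger')]
```

Because the trigger fires inside the repository, every loader, whatever tool it is written in, feeds the same routing queue.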

While many steps in the CI process are standard, others occur only conditionally. For example, a text may need to be translated during the load phase depending on its source language, or if the text is routed to an analyst who speaks a language that differs from the document’s language.
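A conditional translation step depends on first identifying a document's language. The sketch below guesses the language by counting common function words, which is a crude stand-in for a real language identifier; the word lists are tiny samples chosen for this example.

```python
import re

# Tiny stop-word samples per language; purely illustrative.
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "fr": {"le", "la", "et", "les", "des", "est"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
}


def identify_language(text):
    """Guess the language by counting stop-word hits; None if no hits."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in STOP_WORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


def needs_translation(text, reader_languages):
    """True when the detected language is one the reader cannot read."""
    language = identify_language(text)
    return language is not None and language not in reader_languages
```

A loader could call `needs_translation` once at load time with the repository's working language and again at routing time with each analyst's languages, matching the two cases described above.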

A complete suite of text processing and mining applications will include tools to:

• Summarize text

• Identify language

• Extract document metadata

• Extract features (company names, locations, and so on)

• Categorize and classify text

• Cluster similar documents

• Build subject hierarchies

• Translate documents.

Note that not all the tools available for these tasks are equally mature. Summarization, for example, is a well-understood problem, and even desktop tools such as Microsoft Word sport summarization features. Clustering and feature extraction also work well in most cases. However, the state of the art in building subject hierarchies and translating documents still cannot support fully automated processes, so manual intervention is usually required.
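Of the tasks listed above, clustering similar documents is easy to illustrate. The sketch below groups documents by word overlap (Jaccard similarity) in a single greedy pass. It is one simple approach among many, and the 0.3 threshold and the rule of ignoring short words are arbitrary assumptions.

```python
import re


def tokens(text):
    """Lowercased set of words longer than three letters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (equal)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(documents, threshold=0.3):
    """Greedy single-pass clustering; returns lists of document indexes.

    Each document joins the first cluster whose seed document is similar
    enough, otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed_tokens, [indexes])
    for index, text in enumerate(documents):
        words = tokens(text)
        for seed, members in clusters:
            if jaccard(words, seed) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((words, [index]))
    return [members for _, members in clusters]
```

Given two stories about the same merger and one about a drug approval, this groups the first two together and leaves the third on its own, so an analyst sees one cluster per event rather than every wire-service variant.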

Architecture of CI

The CI architecture is highly distributed. As in business intelligence and data warehousing environments, data is drawn from multiple sources: document management systems, file servers, and knowledge management systems provide internal documents; Web sites dominate external sources; and newsgroups, FTP sites, and WAIS servers are rich sources of text information as well. (See Figure 1.)

FIGURE 1 Architecture of an automated CI system.


TEXT HEAVIES
TOOLS OF THE TRADE FOR AUTOMATED CI

Building an automated CI system requires several different types of tools, including text analyzers, Web crawlers, and a scripting language to tie applications together. Here’s a sample of some tools you might want to consider:

• IBM Intelligent Miner for Text: A suite of text mining tools including language identification, summarization, and feature extraction. Also includes a Web crawler that works with DB2.

• Oracle Intermedia Text: Text processing option included in Oracle8 and 8i that supports summarization, SQL operators for text, and thematic indexing.

• Megaputer TextAnalyst: A Windows NT-based tool for summarization and keyword searches. Available as a standalone client/server application or a set of COM objects that you can integrate into other applications. The next release will feature a classification tool as well.

• Python: An open source programming language with many Web-friendly features. A high-level scripting language such as Python or Perl is essential when using several different tools together.

• Semio Taxonomy: Aids competitive analysts in the development of subject hierarchies for classifying and retrieving documents.

• Verity Tool Set: A leader in text analysis, Verity offers tools including an information server, Web crawler, document navigator, and classification tool.

• Wget: An open source Web crawler available from the GNU Project. Wget supports many command-line options allowing users to effectively manage document retrieval.

Many Web agents and related information, including reviews, are available at BotSpot (www.botspot.com).

A few points are worth noting about this architecture. First, text-mining operations are computationally demanding, and depending on load, multiple servers may be required to accomplish all text-processing operations in reasonable time. Second, a structured repository, such as a document warehouse, is required to exploit the full potential of text mining. Integrating documents and maintaining rich metadata will provide far more value than ad hoc querying over the Internet followed by on-the-fly summarization and classification. Third, we need to search for information in several ways, such as by searching HTML documents, scanning FTP directories, and querying WAIS servers. The range of demands on the document retrieval processes is yet another reason to adopt a high-level, Web-enabled programming language such as Python or Perl that can significantly reduce the time to develop retrieval applications.

Privacy and Ethical Considerations

In the New Economy, the need to protect individual privacy is a generally accepted principle in public policy circles. But do businesses and other organizations have a right to privacy? In the past, piecing together bits of information about competitors was a time-consuming, haphazard process. No longer: automated CI now provides the means to create summarized snapshots of competitive activity, the state of the market, and the regulatory environment in which businesses operate. So the question becomes, what types of CI activities are ethical?

Fortunately, that question arose long before the advent of automated CI, and organizations such as the Society of Competitive Intelligence Professionals have formulated codes of conduct for their members. These codes, which are available at www.scip.org/ci/ethics.html, are an excellent starting point for anyone delving into CI. For a negative example, we need to look no further than Oracle’s recent attempts to gather intelligence on Microsoft activities during the antitrust proceedings by poring through corporate trash. The backlash from that incident should serve as a warning that it is possible to overstep the line.

Work to Be Done

As I explained, the core resource of CI is text: The ability to collect, analyze, filter, and distribute a large number of documents determines the quality and value of CI information. The text mining and information retrieval operations I describe here are the building blocks of an automated CI application.

Although many of the pieces for implementing automated CI are in place, challenges remain. First, although the Internet is a vast resource, finding high-quality data will demand the attention of knowledgeable analysts.

Second, the principles that guide your conduct of CI are as important as those that guide your other professional dealings. Ethical standards are in place for CI as a general practice, but automated CI raises new issues, such as whether ignoring a Web site’s robot exclusions is ethical.

Third, how will we deal with the volumes of text and other content out there? Today, the Web contains about 100TB of information. Online data in businesses and other organizations totals around 1,000 petabytes (one petabyte equals 1,024TB) and additional offline information is estimated at 20 exabytes (one exabyte equals 1,024 petabytes). These figures do not include the 300 exabytes of content found in books, journals, and other nonelectronic media. Storing, searching, and indexing large volumes of text continues to be an active area of research and experimentation.
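To put those figures on a common scale, the short calculation below converts each estimate to terabytes using the conversion factors given above. The variable names are arbitrary, and the inputs are simply the estimates quoted in the text.

```python
TB_PER_PB = 1024  # terabytes per petabyte, as stated in the text
PB_PER_EB = 1024  # petabytes per exabyte, as stated in the text

web_tb = 100                                   # the Web
online_tb = 1000 * TB_PER_PB                   # online organizational data
offline_tb = 20 * PB_PER_EB * TB_PER_PB        # offline electronic data
nonelectronic_tb = 300 * PB_PER_EB * TB_PER_PB # books, journals, other media

print(online_tb)                       # 1024000
print(online_tb // web_tb)             # 10240
print(offline_tb // web_tb)            # 209715
print(nonelectronic_tb // offline_tb)  # 15
```

On these estimates, online organizational data is roughly 10,000 times the size of the Web, offline electronic data is about 200,000 times, and nonelectronic content is 15 times larger again than the offline figure.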

Last, as I described, some text-processing tools still require manual intervention. Machine translation and automated creation of subject hierarchies are two extremely difficult operations, and 100-percent automation is unlikely in the near future.

In spite of the remaining challenges, automated CI is possible using commercial and open source tools. (See sidebar, “Text Heavies.”) In fact, your competitors may be doing it right now.






Dan Sullivan (desullivan@crtinc.com) is director of data warehousing at Computer Resource Team Inc. (www.crtinc.com), a software development firm specializing in custom e-business and business intelligence application development.




 




