Intelligent Enterprise

Better Insight for Business Decisions





August 31, 2001



Five Principles of Intelligent Content Management

Developing a clear technology strategy is vital to the success of your B2B projects and initiatives

By Dan Sullivan

Continued from Page 1

PRINCIPLE 4: SUPPORT RICH SEARCHING

Efforts to improve searching have led to a variety of techniques for representing documents, enhancing user queries, and finding correlations among terms. Nevertheless, most information retrieval systems still hit a wall around the 60 to 70 percent range of precision and recall because they depend on primarily statistical techniques instead of linguistic understanding (which is not yet feasible on a general scale). Although we could continue to squeeze marginal gains from keyword searching, a better approach is to combine three techniques: keyword searching, clustering, and visualization.
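To make the two measures concrete, here is a minimal sketch of how precision and recall could be computed for a single query. The function name and the document ids are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision: share of retrieved documents that are relevant.
    recall:    share of relevant documents that were retrieved.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: the engine returned five documents, three of which
# are among the four documents that are actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4", "d5"], ["d1", "d2", "d3", "d9"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.60 recall=0.75
```

A system in the 60 to 70 percent range on both measures returns a substantial share of irrelevant hits and also misses a substantial share of the relevant documents.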

SCATTER/GATHER

The Scatter/Gather algorithm, which was first described by Xerox PARC researchers Douglass Cutting, David Karger, Jan Pedersen, and John W. Tukey in a 1992 paper, uses text clustering to group documents according to the overall similarities in their content. Scatter/gather is so named because it lets the user "scatter" documents into groups, "gather" a subset of these groups, and then rescatter them to form new groups.

The most effective keyword search techniques expand the user's query. A thesaurus automatically adds synonyms to queries, so a search for "stocks" becomes a search for "stocks or equities." "Stemmers" account for inflections and derivations, so searches for "African" will also find "Africa," and "banks" will check for "bank." Soundex and fuzzy matching are also useful for compensating for misspellings.
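The following is a rough sketch of query expansion, not a production design. The thesaurus, the vocabulary, and the one-rule stemmer are hypothetical stand-ins for real linguistic resources, and fuzzy matching is approximated with the standard library's difflib rather than Soundex.

```python
import difflib

# Hypothetical resources: a real system would load a curated thesaurus
# and take the vocabulary from its own index.
THESAURUS = {"stocks": ["equities"], "bank": ["financial institution"]}
VOCABULARY = ["stocks", "equities", "bank", "africa", "earnings"]


def crude_stem(term):
    """Strip a plural "s". A stand-in for a real stemming algorithm."""
    if term.endswith("s") and not term.endswith("ss") and len(term) > 3:
        return term[:-1]
    return term


def expand_query(term):
    """Return the term plus its stem, likely spelling fix, and synonyms."""
    term = term.lower()
    variants = {term, crude_stem(term)}
    # Fuzzy matching: pull in the closest known word, if any is close enough.
    variants.update(difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8))
    # Thesaurus: add synonyms for every variant found so far.
    for variant in list(variants):
        variants.update(THESAURUS.get(variant, []))
    return sorted(variants)


print(expand_query("stocks"))  # ['equities', 'stock', 'stocks']
```

The expanded terms would then be combined with "or" before the query reaches the search engine.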

But even with relatively high precision and recall, keyword searches can yield a seemingly unmanageable number of hits. Clustering is an effective way to address this problem.

Hierarchical clustering is the process of building a tree structure in which the root of the tree contains all documents, internal nodes contain groups of similar documents, and the size of the groups decreases as you move farther from the root until the leaf nodes contain only single documents. These clusters provide a familiar taxonomy-like structure that lets users navigate from broad collections of topics to more narrowly focused texts.
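One way to build such a tree is bottom-up: start with single documents and repeatedly merge the two most similar groups. The sketch below assumes word-set overlap (Jaccard similarity) as the similarity measure and at least one input document; both the measure and the function names are illustrative choices, not something a particular product is known to use.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (equal)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def build_tree(docs):
    """Merge the two most similar clusters until one root remains.

    docs maps a document id to its text. The result is a nested tuple:
    leaves are document ids and the root covers every document.
    """
    # Each cluster is (subtree, word_set); begin with one leaf per document.
    clusters = [(doc_id, set(text.lower().split())) for doc_id, text in docs.items()]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: jaccard(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        (tree_i, words_i), (tree_j, words_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), words_i | words_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


docs = {
    "a": "bank loan rates",
    "b": "bank loan approval",
    "c": "phone plan pricing",
    "d": "phone plan coverage",
}
print(build_tree(docs))  # (('a', 'b'), ('c', 'd'))
```

Comparing every pair on every pass is far too slow for a large repository; the point here is only the shape of the resulting tree.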

Another technique that has proven quite effective in reducing the time it takes users to find relevant content is the scatter/gather algorithm. With scatter/gather, a result set is clustered into a small, fixed number of groups. (Five seems to be a good size.) Users then select the most relevant group, and the documents within that group are clustered again into the same number of groups. The user can keep drilling down into the most relevant cluster, and each time its elements are regrouped into a number of semantically related clusters. The advantage of this approach is that users can dynamically direct the clustering process as they focus on the most relevant topics.
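The interaction loop can be sketched in a few lines. The clustering step here is a deliberately simple stand-in (pick mutually dissimilar seed documents, then assign every document to its most similar seed) and is not the clustering procedure from the original Scatter/Gather paper. All names and sample documents are hypothetical.

```python
def similarity(a, b):
    """Jaccard overlap between two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def scatter(docs, k=5):
    """Split docs (id -> text) into at most k groups of document ids."""
    words = {d: set(t.lower().split()) for d, t in docs.items()}
    ids = list(words)
    if not ids:
        return []
    seeds = [ids[0]]
    while len(seeds) < min(k, len(ids)):
        # Next seed: the document least similar to the seeds chosen so far.
        rest = [d for d in ids if d not in seeds]
        seeds.append(
            min(rest, key=lambda d: max(similarity(words[d], words[s]) for s in seeds))
        )
    groups = {s: [] for s in seeds}
    for d in ids:
        best = max(seeds, key=lambda s: similarity(words[d], words[s]))
        groups[best].append(d)
    return list(groups.values())


def gather(docs, groups, chosen):
    """Collect the documents from the groups the user selected."""
    keep = {d for index in chosen for d in groups[index]}
    return {d: t for d, t in docs.items() if d in keep}


docs = {
    "n1": "retail chain sales meeting notes",
    "n2": "retail chain sales forecast notes",
    "p1": "mobile phone plan pricing",
    "p2": "mobile phone plan coverage",
    "e1": "quarterly earnings report released",
}
groups = scatter(docs, k=3)
print(groups)                          # [['n1', 'n2'], ['p1', 'p2'], ['e1']]
focused = gather(docs, groups, [0])    # the user picks the first group
print(scatter(focused, k=3))           # [['n1'], ['n2']]
```

Each round works only on the documents the user gathered, so the groups become more specific with every selection.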

Clustering effectively groups documents based upon content, but sometimes users need to explore the areas around hyperlinked documents. For example, a member of a geographically distributed sales team might look into sales to a consumer electronics retail chain and find several types of documents ranging from meeting notes of other team members to news feeds from Comtex News, Factiva, or other business content aggregators.

Coarse-grained navigation tools, such as Inxight's Tree Studio, display hyperbolic trees where each node in the tree represents a document labeled with a title or other descriptive text. Instead of clicking through to each linked page individually, a user can quickly assess a neighborhood of hyperlinked documents and focus in on topics of particular interest.

When users are looking for targeted information, a more fine-grained navigation tool is appropriate. For example, if an account manager is looking for information about sales of mobile phone service and needs to distinguish key marketing terms among a variety of plans, then a tool such as Megaputer Intelligence Inc.'s TextAnalyst allows users to quickly focus on particular terms and discover their relationship to other terms in the text.

Of course, searching, clustering, and navigation all presume a significantly large repository of relevant content. That fact leads us to the final principle.

PRINCIPLE 5: KEEP CONTENT TIMELY, AUTOMATICALLY

Some aspects of content management should be automated to keep pace with the available supply of potentially useful content. First of all, you can use harvesters, crawlers, and file retrieval programs to gather documents for inclusion in the content repository. These programs are themselves driven by metadata about which sites to search and which directories or document management systems to scan for relevant content. In many cases, only metadata about documents and indexing detail need to be stored in the portal or document warehouse, and the documents themselves can be retrieved on an as-needed basis.
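As a rough illustration, a file retrieval program driven by source metadata might look like the sketch below. The shape of the source entries and of the stored records is an assumption made for this example; only metadata is kept, and the document itself stays where it was found.

```python
import time
from pathlib import Path


def harvest(sources):
    """Scan each configured directory and return metadata records only.

    sources is a list of dicts such as
    {"path": "...", "extensions": [".txt", ".html"]},
    a hypothetical form of the metadata that drives the scan.
    """
    records = []
    for source in sources:
        root = Path(source["path"])
        for path in sorted(root.rglob("*")):
            if path.is_file() and path.suffix.lower() in source["extensions"]:
                stat = path.stat()
                records.append({
                    "location": str(path),       # where to fetch the text later
                    "source": source["path"],    # which configured source it came from
                    "size_bytes": stat.st_size,
                    "modified": stat.st_mtime,   # last change, seconds since epoch
                    "harvested": time.time(),    # when this scan saw the file
                })
    return records
```

A call such as harvest([{"path": "/shared/sales", "extensions": [".txt", ".html"]}]), with a made-up directory, would return one record per matching file. The "location" field is what lets the portal retrieve the document on an as-needed basis.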

Automatically gathered documents may require file format or character set conversion before indexing, clustering, metadata tagging, and other text analysis tools can go to work. Automatically managing content is as difficult as, or more difficult than, the extraction, transformation, and load process in data warehousing because the structure, format, and range of topics are more varied. This process will require a series of filters, transformations, and analysis steps as text moves into the content repository.
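Such a series of steps can be modeled as a list of small functions applied in order. This is a minimal sketch: the three steps shown (character set conversion, crude markup removal, and normalization) and the list of candidate encodings are assumptions, and a real pipeline would add format-specific converters and analysis steps.

```python
import html
import re
import unicodedata


def decode(raw, encodings=("utf-8", "latin-1")):
    """Character set conversion: try each candidate encoding in turn."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if every candidate fails (latin-1 itself never does).
    return raw.decode("utf-8", errors="replace")


def strip_markup(text):
    """Crude format conversion: drop tags, then unescape entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def normalize(text):
    """Normalize Unicode forms and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


PIPELINE = [decode, strip_markup, normalize]


def prepare(raw):
    """Run raw bytes through every pipeline step in order."""
    result = raw
    for step in PIPELINE:
        result = step(result)
    return result


print(prepare("<p>Caf\u00e9 &amp; bar</p>".encode("latin-1")))  # Café & bar
```

Keeping each step separate makes it easy to insert a new filter, such as a converter for another file format, without touching the others.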

HIGH FIVE

  1. Make metadata king.
  2. Know the user.
  3. Control access to content.
  4. Support rich searching.
  5. Keep content timely, automatically.

Unlike data warehouses that tend to keep historical data, portal content should be purged. Again, metadata about document types and sources will drive this process. For example, analysts' predictions about earnings reports become irrelevant when an actual earnings report is issued (unless, of course, you want to track the accuracy of the past predictions). In other cases, we might want to keep only the summary of a text, such as a product recall notice or a competitor's press release more than two years old.
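A purge process driven by metadata could be expressed as a small rule table. The document types, field names, and the reading that the two-year limit applies to both recall notices and press releases are assumptions made for this sketch.

```python
from datetime import date, timedelta

# Hypothetical retention rules keyed by document type.
POLICY = {
    "press_release": {"summarize_after": timedelta(days=730)},
    "recall_notice": {"summarize_after": timedelta(days=730)},
    "earnings_prediction": {"purge_when_superseded": True},
}


def retention_action(meta, today):
    """Return "keep", "summarize", or "purge" for one document's metadata.

    meta needs "type" and "arrived" (a date); "superseded" is optional and
    marks, for example, a prediction whose actual report has been issued.
    """
    rule = POLICY.get(meta["type"], {})
    if rule.get("purge_when_superseded") and meta.get("superseded"):
        return "purge"
    limit = rule.get("summarize_after")
    if limit is not None and today - meta["arrived"] > limit:
        return "summarize"
    return "keep"


today = date(2001, 8, 31)
old_release = {"type": "press_release", "arrived": date(1999, 1, 15)}
print(retention_action(old_release, today))  # summarize
```

An organization that wants to track the accuracy of past predictions would simply drop the "purge_when_superseded" rule for that document type.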

Tracking when documents arrived, where they came from, who created them, and other attributes will provide the grist for a number of content management processes.

REMEMBER THE FIVE

Free-form text is often called unstructured, but that term is a misnomer. Language's rich structure succinctly represents complex concepts and relationships, but effectively accessing that information requires techniques that account for that structure and let users bridge the gap from their interests to information retrieval.

Your organization can realize intelligent content management by adhering to these five basic principles. All are based on the realization that users need to find small amounts of targeted information from sprawling repositories of enormous scale such as the aptly named World Wide Web.




As long as we use language, we will always use words with multiple meanings and express concepts with a variety of words, and we will constantly confront less-than-perfect precision and recall. Making the implicit explicit through metadata; modeling user interests; protecting access to content; supporting search, organization, and navigation tools; and keeping content up to date all chip away at the inherent structural problems of dealing with large volumes of unstructured texts.



Dan Sullivan [DSullivan@RedmontCorp.com] is CTO of Redmont Technologies, a consulting firm specializing in business intelligence and content management systems.


SEARCH ENGINES: THEY ARE NOT ALL THE SAME

Although search engines on their own are insufficient for reaching high levels of precision and recall in information retrieval systems, they are a solid starting point. Some search engines work from the most basic principle - indexing words that appear in documents - while others analyze patterns without regard to language specifics, or exploit syntactic and semantic knowledge of language to identify concepts represented in texts.

  • Verity Inc. was among the first vendors to provide full-text search capabilities. Verity's Portal One application provides standard searching as well as personalization, navigation, and classification tools.
  • Autonomy Inc. provides tools similar to Verity's but takes greater advantage of pattern recognition, Bayesian inference, and information theory. By detecting recurring patterns, much as compression programs do, Autonomy can build a model of how word patterns correlate and determine how distinctive these patterns are within a large document collection. Autonomy does not assume any language-specific features - white space delimits words, for example - so the tool is language independent.
  • Oracle Text (formerly Oracle interMedia Text) uses the Inxight LinguistX Platform and a proprietary knowledge base to support thematic or concept-based searching. Instead of just looking for patterns, Oracle Text finds distinctive terms and determines their general categories. For example, searching for "financial institution" can find documents that never mention the term but do refer to "banks."
  • Semio Corp. takes a different approach. It creates browseable taxonomies based on document similarity. Relevance ranking is used within categories, and for large taxonomies, the ability to search the taxonomy categories can reduce the time spent navigating relevant areas.


RESOURCES

Sullivan, Dan. Document Warehousing and Text Mining (Wiley, 2001). Available at the IntelligentEnterprise.com bookstore.

Autonomy: www.autonomy.com

Inxight: www.inxight.com

Klarity: www.klarity.com.au

Megaputer Intelligence: www.megaputer.com

Oracle: www.oracle.com

Semio: www.semio.com

Solutions-United: www.solutions-united.com

Verity: www.verity.com

Related Articles on IntelligentKM.com:

"Extracting Knowledge," May 7, 2001: www.intelligentkm.com/feature/010507/feat1.jhtml

"Word Wranglers," Jan. 1, 2001: www.intelligentkm.com/feature/010101/feat1.jhtml






