CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions






January 14, 2002

An Eye for the Needle

Accurately representing knowledge workers' domain expertise in a corporate portal's taxonomy is one of the greatest challenges in developing portal-based content management applications.

By Philip Russom

Like a giant haystack, countless documents and other text sources pile up on hard drives in the average office. And the words of wisdom we seek are like needles buried deep in the stack, nearly impossible to find among the irrelevant hay.

EXECUTIVE SUMMARY


Knowledge workers want content management applications to impose order on document chaos. The order imposed must model the business domain they work in. They see the taxonomy of a corporate portal as the key mechanism for managing content according to domain-relevant topics. The taxonomy — a structure for categorizing text content by topic — is the piece of the content management application that knowledge workers depend on most and, therefore, the piece they use for measuring its success.

For many knowledge workers, finding needles of knowledge in the haystack of content is a daunting but mission-critical task, one that has led them to seek state-of-the-art software tools that can draw knowledge out. Many have turned to software for content management — a broad term that includes tools and technologies for document management, knowledge management, collaboration, search, text mining, topic categorization, taxonomy generation, and so forth. What all these have in common is the assumption that so-called unstructured data — that is, text and the linguistic content it expresses — can be given structure so that knowledge workers can more easily find and retrieve relevant documents and text passages.

Many lessons have emerged from the quest for this better technology. One of the leading lessons is that the structure imposed on textual data must map directly to the functional domain in which a particular knowledge worker operates. For instance, a staff writer for a magazine needs to see text sources sorted into topics covered by the magazine, possibly topics that map straight to the "beats" of individual writers. Likewise, financial analysts need to see a structure of topics representing corporate and monetary entities that their investment firms track.

Hence, the best tool for the job is a content management application that models the domain of its intended knowledge worker and automates the categorization of content as much as possible. A focused application (not a generic tool) greatly simplifies the search, because structuring content reveals valuable needles and hides irrelevant hay.

KNOWLEDGE WORKER REQUIREMENTS

What do knowledge workers want and need from content management applications? With surprising unanimity, they articulate the same four requirements:

Impose order on chaos. Chaos prevails in the haystack, which is an undifferentiated and unordered collection of textual source documents. For most knowledge workers, the high-level goal is to impose order on this chaos, by associating each document (or passage of a document) with a topic (or subtopic) that the text is about. Before a knowledge-driven organization can catalog its content, it must design a taxonomy — a collection of relevant topics and subtopics arranged in a hierarchical structure. There are many synonyms for taxonomy, such as document directory, catalog, classification, and categorization. Whatever you call it, the taxonomy is the order imposed on document chaos, which makes it the most critical component of a content management application.
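As a rough illustration of a taxonomy as "topics and subtopics arranged in a hierarchical structure," the sketch below models each topic as a tree node that holds its subtopics and the documents filed under it. The class name `Topic`, its methods, and the sample topics and file names are all assumptions made for illustration; they do not come from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One node in the taxonomy: a topic, its subtopics, and its documents."""
    name: str
    subtopics: dict = field(default_factory=dict)   # subtopic name -> Topic
    documents: list = field(default_factory=list)   # document identifiers

    def add_path(self, path):
        """Create (or reuse) the chain of subtopics named in path; return the last one."""
        node = self
        for name in path:
            node = node.subtopics.setdefault(name, Topic(name))
        return node

    def all_documents(self):
        """Collect documents filed under this topic and every subtopic below it."""
        found = list(self.documents)
        for child in self.subtopics.values():
            found.extend(child.all_documents())
        return found


# Hypothetical topics for a magazine; not taken from any real taxonomy.
root = Topic("Magazine")
root.add_path(["Databases", "Data Warehousing"]).documents.append("dw-trends.doc")
root.add_path(["Databases", "Query Tools"]).documents.append("olap-review.doc")
root.add_path(["Portals"]).documents.append("portal-faq.txt")

databases_docs = root.subtopics["Databases"].all_documents()
```

Browsing a high-level topic then means walking its subtree, which is why `all_documents` gathers documents from every subtopic beneath the node.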

Model the knowledge worker's domain. Many organizations, long before turning to content management software, put business processes and informational structures in place that impose order on chaos to some degree. Such organizations are unlikely to discard or alter these structures for the sake of a content management application. Instead, the application must leverage this valuable work.

For instance, most market research firms publish reports that segment (by product or service) the market they research; this segmentation can easily inspire a taxonomy's design. Organizations of many types have created folder and subfolder structures in Lotus Notes or Microsoft Exchange; others store documents in a labyrinth of directories on shared network drives. These structures, too, can contribute to taxonomy design.
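One way an existing folder structure might seed a taxonomy is sketched below: each folder becomes a candidate topic and each file a document filed under it. The function name `taxonomy_from_paths`, the reserved `"_docs"` key, and the sample paths are hypothetical, and a real shared drive would need cleanup by a person before its folders could serve as topics.

```python
from pathlib import PurePosixPath


def taxonomy_from_paths(paths):
    """Turn document paths into a nested dict: folder -> subfolders, files under "_docs"."""
    tree = {}
    for raw in paths:
        path = PurePosixPath(raw)
        node = tree
        for folder in path.parent.parts:
            node = node.setdefault(folder, {})
        # "_docs" is a reserved key in this sketch; a folder with that name would collide.
        node.setdefault("_docs", []).append(path.name)
    return tree


# Hypothetical shared-drive layout for a market research firm.
shared_drive = [
    "reports/storage/disk-arrays-2001.doc",
    "reports/storage/tape-outlook.doc",
    "reports/networking/switch-forecast.doc",
]
draft = taxonomy_from_paths(shared_drive)
```

The result is only a draft: it captures the order people have already imposed, which taxonomists can then rename, merge, or split.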

Automate categorization with software. The greater the volume of documents, the harder it is for knowledge workers to keep up with categorizing them — that is, associating each with a topic in the taxonomy. In such circumstances, scaling up a content management application requires automated categorization. Tools for categorization (which can be embedded in a content management application) can parse a document or other text source, understand it semantically well enough to deduce the topics it discusses, and make entries in the taxonomy associating the source with relevant topics.
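The idea of automated categorization can be sketched with a deliberately simple stand-in: counting cue words per topic. Commercial tools rely on far richer semantic analysis, so this is only a toy under stated assumptions; the topic names, cue words, and the `min_hits` threshold are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical topics and cue words; a real deployment would derive these from the taxonomy.
TOPIC_KEYWORDS = {
    "Data Warehousing": {"warehouse", "etl", "olap"},
    "Portals": {"portal", "personalization", "taxonomy"},
}


def categorize(text, topic_keywords=TOPIC_KEYWORDS, min_hits=2):
    """Return topics whose cue words occur at least min_hits times, best match first."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        topic: sum(words[w] for w in cues)   # Counter yields 0 for absent words
        for topic, cues in topic_keywords.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [topic for topic, hits in ranked if hits >= min_hits]


sample = "The portal uses a taxonomy so each portal user sees relevant topics."
topics = categorize(sample)
```

Each returned topic would become an entry in the taxonomy that associates the source document with that topic.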

Present the application via a portal. For a variety of reasons, knowledge workers see the corporate portal as the platform of choice for content management applications:

  • The taxonomy is the piece of the application that knowledge workers rely on most, and the user interface commonly found in corporate portals includes a frame on the left side of the browser for a taxonomy.
  • A content management application aggregates information from documents that may be subject to restricted access. A corporate portal can enforce security authorization, despite the aggregation.
  • One knowledge worker's needle is another's hay. A corporate portal can personalize the taxonomy and other content presentation.
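The security and personalization points above can be illustrated with a small filter over taxonomy entries. This is a sketch, not a description of how any portal product works: the `Entry` record, the group names, and the `interests` parameter are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """A taxonomy entry: a document filed under a topic, with the groups allowed to read it."""
    topic: str
    document: str
    allowed_groups: frozenset


def visible_entries(entries, user_groups, interests=None):
    """Keep entries the user may read; optionally narrow to the user's chosen topics."""
    result = []
    for entry in entries:
        if not (entry.allowed_groups & user_groups):
            continue  # security check comes first, even in an aggregated view
        if interests is not None and entry.topic not in interests:
            continue  # personalization: hide topics that are hay for this user
        result.append(entry)
    return result


# Hypothetical entries and group names.
entries = [
    Entry("Mergers", "deal-memo.doc", frozenset({"finance"})),
    Entry("Mergers", "press-release.txt", frozenset({"finance", "editorial"})),
    Entry("Portals", "portal-faq.txt", frozenset({"editorial"})),
]
writer_view = visible_entries(entries, {"editorial"}, interests={"Mergers"})
```

Applying the access check before the interest filter means personalization can only narrow what a user is already authorized to see.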

Summing up the four requirements, knowledge workers want to impose taxonomic order on document chaos, but only if the taxonomy models their domain accurately. And they want software to assist with categorizing, as long as it respects the taxonomy they created. Plus, the content management application should be presented via a portal, largely for the sake of accessing the taxonomy. As the common concern across all requirements, the taxonomy takes on tremendous importance for an application's success.

WALK THE MIDDLE PATH

In the late 1990s, vendors claimed their text mining and automatic categorization tools were so efficient at taxonomy generation as to provide a "portal in a box." The hype has long since died, and these same vendors now admit that software for automatic taxonomy generation and document classification has nowhere near the average knowledge worker's accuracy. In a parallel development, organizations that rely on knowledge workers to manually classify text sources have found it rather difficult to scale up this human-intensive task to keep pace with burgeoning text volumes.

The failure of these two extremes has led to a best practice for content management that strikes a balance between them. Even the vendors of tools for automating document classification now recommend significant human involvement in the everyday chores of taxonomy tweaking and text tagging. And companies with an army of librarians and taxonomists devoted to manual classification also bring to bear automatic classification software in the battle against infoglut.

The new best practice starts with knowledge workers (or more specialized personnel, such as librarians and taxonomists) who create high-level topics and subtopics, arranged in a hierarchy. Many content management tools today help workers discover topics (via search, query, clustering, or mining technologies), then convert the findings directly into topics in a taxonomy. Editing tools enable knowledge workers to fine-tune topics and establish business rules for how text is classified under them.
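A business rule of the kind a knowledge worker might author can be sketched as a set of required and excluded phrases per topic, with unmatched documents handed back to a person. The `Rule` class, the `classify` function, and the sample phrases are hypothetical; actual tools define their own rule formats.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A worker-authored rule: file under `topic` when all required phrases appear
    and no excluded phrase does."""
    topic: str
    required: tuple
    excluded: tuple = ()

    def matches(self, text):
        lowered = text.lower()
        return (all(p in lowered for p in self.required)
                and not any(p in lowered for p in self.excluded))


def classify(text, rules):
    """Apply every rule; flag the text for manual review when no rule matches."""
    topics = [rule.topic for rule in rules if rule.matches(text)]
    return {"topics": topics, "needs_review": not topics}


# Hypothetical rules; the phrases are illustrative only.
rules = [
    Rule("Data Warehousing", required=("warehouse",), excluded=("furniture",)),
    Rule("Portals", required=("portal",)),
]
filed = classify("Building a corporate portal on a data warehouse", rules)
queued = classify("New warehouse for office furniture opens", rules)
```

The `needs_review` flag reflects the balance described above: software handles the texts the rules cover, and people handle the rest.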






