Your enterprise now has millions of documents. The web has billions. Being able to easily find what you need when you need it is the knowledge-management issue of the century. The race is on. Hundreds of companies are competing to produce better search methods as solutions to this information explosion. And if they want to test how they stack up, each year the National Institute of Standards and Technology sponsors a kind of search engine Olympics: the annual Text Retrieval Conference (TREC). At TREC, companies can have their engines compared on a standardized set of queries and text bases (a new set each year). Entries are rated by their level of precision (the percentage of retrieved documents that are on topic) at various levels of recall (the percentage of all on-topic documents that are retrieved). However, despite the wide range of retrieval strategies, including Vector Space, Probabilistic, Inference Networks, Extended Boolean, Latent Semantic Indexing, Neural Nets, Genetic Algorithms, and Fuzzy Set, there seems to be a wall somewhere in the high 60th percentile for average precision -- no one seems able to get past it. Is this the four-minute mile? What is going on?
David A. Grossman has an idea. He is one of the main forces behind TREC and now a professor at Illinois Institute of Technology. He says the next step has to be in "text utilities." (See his and Ophir Frieder's book, Information Retrieval: Algorithms and Heuristics, Kluwer Academic Publishers, 1998, to learn all about the retrieval strategies I just mentioned.) But the term "utilities" is too modest. It just does not do justice to the complexities inherent in this technology's purpose. Those complexities include at least the following:
Text Isn't Knowledge
One barrier to better precision in search engines is that text itself is not knowledge; it's a residual of knowledge. Case in point: Three major intellectual disciplines, hermeneutics, exegesis, and semiotics, offer thousands of books on the complex processes by which text gains meaning. Like stone tools in a museum without provenance, isolated from their context and perhaps even from the user's language, texts can be hard to interpret. As a seed is only half of what's required to make a tree -- it is specifically designed to unpack and grow only in the presence of certain nutrients and temperatures -- text is only half of knowledge. A person must understand the language, discipline, specialty, issues, and approaches associated with a subject to make sense of text about it. Forgetting the immense matrix of prerequisites in the middle of which a given piece of text acquires meaning -- the hidden 90 percent of the iceberg -- may be one of the biggest sources of confusion in 20th century computer science. Knowledge exists way beyond what is in texts. (This becomes strikingly apparent when you consider whether you would go into the operating room of a brain surgeon who says he learned his trade on the Internet.)
The retrieval approaches I listed before (Vector Space and so on) primarily deal with formal calculations on text distributions. Without bringing in other sources of knowledge, they may never get better results than they do now. It is like having three variables and only two equations: no mathematical method, no matter how creative, is going to solve the problem.
Some knowledge management (KM) systems obliquely claim they bring in that necessary context. They bring in some, but are light years away from bringing in all the context and the depth of meaning that humans can gain from text.
The most commonly understood paradigm for text search might be described as follows: A person has an idea of what he wants. He types a few words into a query box. Algorithms whirl. Then, out comes a list, which should be ranked so the most useful texts are on top.
But where is the knowledge? In this scenario, knowledge resides in two isolated outposts: with the researcher who's searching for the texts, and with the writers who created the text. Between them are only indicators or representatives of knowledge (written symbols). Additionally, there remains the potential obstacle that the researcher may initially not even have a good concept of what he seeks.
There are two very rough proxies for knowledge in the process I just described: the few words of the query representing the person's interests, and the residual text the algorithms work with. (The algorithms do not work directly with the authors' ideas.) Thus there are three directions in which to advance:
* Getting at more of the intended meaning and interests of the user
* Getting more of the meanings and informational relations intended by the author
* Bringing the first two together.
The Seeker
There are at least two main tacks used so far to get at the knowledge seeker's intended meaning:
1. Using personal information, including biographical details, previous research directions, and saved texts and emails. (I will elaborate on this subject another time.)
2. Creating a highly flexible interactive interface. Because a person's interests are dynamic, incompletely determined, and can change as the person interacts with the search process, interfaces are more valuable if they are flexible, intuitive, and easy to use. A good interface lets users' thinking evolve, keeps track of their search processes, and records results obtained.
Such an interface was the goal of Marti Hearst at the University of California at Berkeley when she created her Cat-a-Cone project, depicted in Figure 1. A ConeTree displays multiple category hierarchies that you can expand and contract in 3D so you can observe multiple paths (dimensions) simultaneously and understand their collective meaning. It does not place documents under the nodes of the hierarchy, as most document classification systems do; instead, it leaves them free to be defined by multiple dimensions of the hierarchies. You can select categories of interest via "Boolean paint": Categories painted the same color are considered combined by an OR; if different colors, combined by an AND. A free-text box lets you also select on given words.
You retrieve Cat-a-Cone documents into a book, as seen in the middle lower right area of Figure 1. The left-hand page shows the title and category labels associated with the document. Clicking on a label rotates the corresponding tree to the foreground. The right-hand page shows the abstract. When you turn the pages, the subtrees rotate, expand, and contract appropriately to describe the document on display. You can label books to be saved on the bookshelf at left.
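To make the "Boolean paint" rule concrete, here is a minimal sketch in Python. It is not Cat-a-Cone's actual code: the function name, the data layout, and the sample documents are all assumptions made up for illustration. Categories sharing a color are ORed together, and the resulting color groups are ANDed.

```python
from collections import defaultdict

def boolean_paint(painted, doc_categories):
    """Return the ids of documents that satisfy a Boolean-paint selection.

    painted:        dict mapping category name -> paint color (hypothetical layout)
    doc_categories: dict mapping document id -> set of category names
    """
    # Group the painted categories by color: same color means OR.
    groups = defaultdict(set)
    for category, color in painted.items():
        groups[color].add(category)

    matches = []
    for doc_id, categories in doc_categories.items():
        # Different colors mean AND: the document must hit every color group.
        # With nothing painted, every document matches.
        if all(categories & group for group in groups.values()):
            matches.append(doc_id)
    return sorted(matches)

# Illustrative data only.
docs = {
    "d1": {"Medieval", "England", "Art"},
    "d2": {"Renaissance", "England"},
    "d3": {"Medieval", "France"},
}
paint = {"England": "red", "France": "red", "Medieval": "blue"}
print(boolean_paint(paint, docs))
```

Here the selection reads as (England OR France) AND Medieval, so a document needs at least one red category and at least one blue one.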
The Writer
Getting at the meanings intended by the author is not a trivial problem. Nonetheless we can start getting a handle on issues by bringing in interpretations (implicit and explicit) of topical organization, meaning relations, and judgments of value from sources exterior to the text.
There are many projects to create taxonomies for a wide range of collections. The fact that they are often incompatible indicates the complexity of the multidimensionality of knowledge, an issue I discuss in the next section. Examples of explicit, in-house, specialist taxonomical work include the categorization done at Yahoo (for the Web), the Association for Computing Machinery (for an internal document store), and NorthernLight (for both). Extensive explicit categorization by a community of specialists is the method espoused by the Open Directory Project. There, any Web user can become a specialist and contribute to defining sites. Its categories are now used by popular search engines including Google, Yahoo, Netscape, Lycos, and HotBot.
An example of pulling implicit site value information from the actions of many can be found in PageRank, the algorithm that ranks hits in the Google search engine. Web sites implicitly tell us what they think is most important by where they send their visitors. Each Web site receives a "vote" each time another site links to it. PageRank weights each vote by the standing of the site that casts it, which is determined in turn by how many other sites link to that site -- and so on, recursively.
This technique is closely related to the idea of collaborative filtering, which is the process of combining the preferences of individuals or experts. Collaborative filtering takes advantage of the total set of recommendations. For example, Paul Kantor and others at Rutgers University are working on ways for individuals to feed back the relevance and value of results of their queries so that others with similar queries can take advantage of these judgments. (See "Capturing Human Intelligence in the Net," Communications of the ACM, August 2000.)
Adding information that helps establish rank or category hardly exhausts the concept of bringing in more information so that the structure of the management follows the structure of the knowledge. You can, for example, bring in information from other articles by a work's authors to provide context, construct concept nets, and in general look for many other ways of representing content so that it becomes available for automated analysis and comparison.
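The recursive link-voting idea behind PageRank, described earlier in this section, can be sketched in a few lines of Python. This is a simplified textbook-style iteration, not Google's production algorithm. The damping factor of 0.85, the fixed iteration count, the handling of pages with no outgoing links, and the tiny link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from a link graph by repeated vote passing.

    links: dict mapping page -> list of pages it links to (hypothetical data).
    damping and iterations are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank among the pages it links to,
                # so votes from highly ranked pages count for more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Illustrative graph: a and b both link to c, and c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, c collects votes from two pages and ends up on top, while a outranks b because its single vote comes from the well-regarded c.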
The Multidimensional Nature of Knowledge
There are so many ways to segment, taxonomize, structure, and organize knowledge that it is unclear how a large set of these ways can be optimally combined. Search engines that offer taxonomies of search results often start making ever more frilly hierarchies -- an inefficient way to represent what may be just intersecting dimensions. For example, under both the subjects England and France, we might have a redundant History sub-branch. Using dimensions that intersect in the ways familiar to structured databases has many advantages. However, because there are so many ways to categorize information, this approach makes sense only if the tool is flexible enough to let the user pick the organizational method that makes the most sense (as Cat-a-Cone does).
Say we try to define Europe with the following dimensions:
Dimension 1: Travel, Economics, Politics, Technology, Arts
Dimension 2: Current Events, 20th Century, Renaissance, Medieval, Prehistory
Dimension 3: England, France, Spain, Italy....
These dimensions yield an interesting cube, with documents assigned to any of a number of intersections. Allowing users of KM systems to create topical cubes from many options may be a good way to go. A query for medieval English art, for instance, could be set up quite functionally as a three-topic query with documents at the intersection of these three areas of interest: Medieval, England, and Art. Remember, topics are categories, and their intersection could give results very different from, and even more exact than, a query pulling texts that contain these three words.
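A topical cube of this kind can be sketched as a small inverted index keyed by (dimension, value) pairs. The dimension names "subject", "period", and "place", the function names, and the sample documents below are hypothetical, introduced only to illustrate the three-topic query described above.

```python
def build_cube_index(doc_topics):
    """Map each (dimension, value) cell label to the documents assigned to it.

    doc_topics: dict of doc id -> {dimension name: set of values}.
    A document may sit at several intersections at once.
    """
    index = {}
    for doc_id, coordinates in doc_topics.items():
        for dimension, values in coordinates.items():
            for value in values:
                index.setdefault((dimension, value), set()).add(doc_id)
    return index

def cube_query(index, **wanted):
    """Return documents lying at the intersection of the requested topics."""
    result = None
    for dimension, value in wanted.items():
        docs = index.get((dimension, value), set())
        result = set(docs) if result is None else result & docs
    return sorted(result or [])

# Illustrative assignments only.
doc_topics = {
    "doc1": {"subject": {"Arts"}, "period": {"Medieval"}, "place": {"England"}},
    "doc2": {"subject": {"Arts", "Politics"}, "period": {"Renaissance"},
             "place": {"England", "France"}},
    "doc3": {"subject": {"Arts"}, "period": {"Medieval"}, "place": {"France"}},
}
index = build_cube_index(doc_topics)
print(cube_query(index, subject="Arts", period="Medieval", place="England"))
```

Leaving a dimension out of the query simply removes that constraint, so the same index also answers two-topic questions such as "Arts in England, any period."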
Marti Hearst's Scatter/Gather system is a sort of navigation system that allows many options for potential organization to arise during the process of clustering the data. (See Figure 2.) A standard text query, say on the word "star," retrieves a set of documents that, when clustered, give five groupings that appear variously on: patriotic music and literature, film and TV actors, galactic entities, astronomy, and star-shaped life forms.
The user picks the subsets that focus on topics of interest, for example galactic entities and astronomy, and can recombine and recluster them to perhaps get documents on constellations, astrophysics, galaxies, and stars. The fact that perturbations or recombinations of lists lead to different topics highlighted in the clusters gives you a hint of how complex the set of all taxonomies might be. Many disparate yet meaningful and reasonable partitions can be overlaid on the same fixed set. Furthermore, what clusters you get depends on the measures of intertextual distances you use. Options include various forms of weighted word similarities, co-citation, co-linking, maximal number of similar topics, maximal sharing of topical vocabularies, maximal sharing of internal word relationships, and maximal overlap of metadata (such as date of creation, place of creation, and so forth). All together, these options hint at the need for flexible, iterative search interfaces that could include topics boxes, search word boxes, clusters on various metrics, and several ways of creating lists: combining clusters, restricting by a given word, expanding to related topics, moving toward a topic's centroid, and then reiterating with any of the preceding. But to really make knowledge seeking efficient, methods that mimic the feedback in the human question-and-answer process need to be invoked. ("I have documents of this sort, will they do?" "No, I was thinking more of ...") Having ways to typify the collection you retrieve and redirecting it at a high conceptual level is part and parcel of the ideal process.
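As one concrete instance of an intertextual distance measure, the sketch below computes a plain word-count cosine similarity between two texts. It is only one of the weighted-word-similarity options mentioned above, and it is not claimed to be the measure Scatter/Gather uses. The function name and the sample strings are illustrative assumptions.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts based on raw word counts (1.0 = same mix)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[word] * counts_b[word]
              for word in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Illustrative snippets standing in for documents retrieved for "star".
astronomy = "star galaxy astronomy"
telescope = "galaxy astronomy telescope"
film = "film star actor"
print(cosine_similarity(astronomy, telescope) > cosine_similarity(astronomy, film))
```

Swapping in a different measure, such as co-citation or shared metadata, would change which documents count as neighbors and therefore which clusters emerge.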
Heterogeneity of Discrete Sources
Essential bits of the information you need might be scattered in many locations even within a single document. In the top left of Figure 3, you can see a three-word query entered on three separate lines: osteoporosis, prevention, and research. To the left of the top-ranked query results in the main window are TileBars (also a creation of Marti Hearst). The width of each bar is proportional to the size of the document it represents. Each column represents a consecutive segment of the document (separated using an algorithm called TextTiling, which gives more topically homogeneous segments, or alternatively by paragraphs or pages). Each column has three rows, one for each query word. The darkness of the intersection gives the relative number of mentions of the query word. You can see which segments of which documents have the preponderance of the desired information and click them to go directly to those segments. For example, the second two documents are short but have sections where all three query words appear.
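The grid behind a TileBar can be approximated with a few lines of Python. This sketch is not Hearst's implementation: it splits on blank lines (the paragraph alternative mentioned above) instead of running TextTiling, and the function names, the shading characters, and the sample text are assumptions for illustration.

```python
import re

def tilebar_counts(document, query_terms):
    """Return one row per query term, one column per segment, holding hit counts.

    Segments are paragraphs separated by blank lines (an assumption; the
    original system can also segment with TextTiling).
    """
    segments = [seg for seg in document.split("\n\n") if seg.strip()]
    grid = []
    for term in query_terms:
        pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
        grid.append([len(pattern.findall(seg.lower())) for seg in segments])
    return grid

def render(grid, query_terms):
    """Draw the grid as text, using denser characters for more mentions."""
    shades = " .:#"  # illustrative scale: 0, 1, 2, and 3 or more mentions
    lines = []
    for term, row in zip(query_terms, grid):
        cells = "".join(shades[min(count, len(shades) - 1)] for count in row)
        lines.append(f"{term:<14}|{cells}|")
    return "\n".join(lines)

# Illustrative three-paragraph document.
sample = (
    "Osteoporosis research is growing.\n\n"
    "Prevention of osteoporosis matters; osteoporosis prevention works.\n\n"
    "Unrelated text."
)
terms = ["osteoporosis", "prevention", "research"]
grid = tilebar_counts(sample, terms)
print(render(grid, terms))
```

A column that is dark in every row marks a segment where all the query words occur together, which is the kind of passage a reader would want to jump to first.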
An obvious question is: Are we indexing knowledge by the right units? What level is right: Web sites, Web pages, groups of documents, individual documents, or key meaningful segments? Indexing by meaningful segments produces the greatest number of elements. But these elements are easiest to organize precisely and link directly. And using them reduces human scanning time and increases precision.
Of course, an even greater goal would be to have methods that do more than just retrieve documents: They would construct the right kind of information to answer the user's needs based on existing information. But this vision is the long-range challenge. Until then, we have an excellent goal: Create something that puts together in a booklet the best paragraphs, as indicated by a community of experts, with text highlighted by experts, with easy and flexible navigation to larger wholes and topical collections.











