Divide and ConquerPartitioning a document collection to improve retrieval accuracyBy David Grossman and Ophir FriederEdited by Erik Thomsen Most large organizations now have a corporate intranet, which requires some type of document search capability. This need is often an afterthought; many IT shops mistakenly believe that simply adopting something like Google or Alta Vista is the silver bullet. At a cost of less than $2,500, they can download and install several search engines. In no time, a systems administrator is likely to send a global email about the now "searchable" corporate intranet. But this solution is a massive oversimplification of the problems inherent in document retrieval. Quickly finding documents is indeed easy. Finding relevant documents, however, is a challenge that information retrieval (IR) researchers have been addressing for more than 40 years. The numerous ambiguities inherent in natural language make this search problem incredibly difficult. For example, a query about "China" can refer to either a country or dinnerware. THE PROS AND CONS OF A THESAURUSIn magazine articles "explaining" text search to the database crowd, experts often dismiss ambiguity problems by saying that search engines often use a thesaurus. "Ah hah!" the reader thinks, "The thesaurus will solve my ambiguity problems." Unfortunately, no one has ever scientifically proven that using a thesaurus actually improves accuracy. Why doesn't a thesaurus work? Consider a query about dogs. Looking up "dogs" in a thesaurus yields a more specific, nonambiguous term like "canine." If you add "canine" to the query, you may find additional relevant documents. But in the thesaurus, you might also find the synonym "pet," which will retrieve irrelevant documents on cats, goldfish, rabbits, and so on. This phenomenon was repeatedly shown using the most commonly used benchmark collection from the Text Retrieval Conference. The game of text retrieval is about more than simply speed. A SQL query has no notion of result correctness. The result either meets the specification of the SQL request, or a fundamental bug is in the DBMS software. With text retrieval, you can't absolutely define the set of result documents that a given query should retrieve. An estimate is all that is possible, and the game becomes about improving the estimate. THE SHALLOW AND THE DEEPSeveral models to improve this estimate have achieved prominence, namely the vector space and the probabilistic models. In our IR lab at the Illinois Institute of Technology, we have repeatedly found that these two strategies (as well as a few others) frequently return the same set of documents when presented with the same query. This occurrence leads to the question, "Given the commonality among retrieval models, what can be done to improve query accuracy?" A number of large companies are starting to realize the effect on the retrieval accuracy by separating the shallow from the deep part of their corporate intranet. The notion of a shallow and a deep Web is not new. The general idea is that when querying a Web search engine for a very detailed piece of information, you tend to get either irrelevant results or thousands of results. With an information collection that is miles wide but not even an inch deep, you have documents about everything. Consider a hypothetical case of marketing personnel at United Airlines asking for "minutes from focus groups that discuss why some customers who flew United are now flying Southwest." Such requests are likely to yield numerous documents when asked of a general-purpose intranet. Documents about both the weather in the Southwest and the recent decision to change the wording of "united" in a corporate logo are likely to be identified. "Focus groups" is a very nondiscriminating term in the entire document collection. So what should you do? You can partition the documents into subcollections. Once accomplished, each subcollection can have a specialized search engine. You might discover that you can use the same search engine for each subcollection. However, if a particular search engine actually performs better for certain subject domains, use that search engine for that particular domain. Now, you have a nice "focus group" subcollection of documents. For example, assume a document collection has 1,000,000 documents, and a "focus group" partition has 1,000 documents. Table 1 lists sample document frequencies (DFs). The DF of a given term or phrase is the number of documents that contains the term or phrase. As you can see, the term "focus group" is in 2,000 documents globally (.2 percent) and 800 (80 percent) of the documents in the partitioned collection. The "focus group" query can find up to 1,200 documents outside of the specialized subcollection. Similarly, the terms "united" and "southwest" occur 1,000 and 500 times in the general collection, respectively. A query that includes both of these terms could retrieve as many as 1,500 documents. To avoid such a large number of potentially spurious documents, a local "focus group" collection is used. The same query now obtains at most 70 documents, and in all likelihood may contain even fewer documents. This example relies on the assumption that the ranking algorithm uses
an Users may start with a simple query box like most search engines, or with a GUI that shows all the possible subcollections. The user can then select the "focus group" collection prior to issuing the query. The query now has a chance of finding something very meaningful to the user.
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
| |||||||||||||||||||||||||||||||
























