Divide and Conquer
Partitioning a document collection to improve retrieval accuracy
By David Grossman and Ophir Frieder
Edited by Erik Thomsen

Continued from Page 1

THE REAL WORLD VS. THE ACADEMIC

The research world is fascinated with automatically identifying these subcollections (often referred to as clustering). But automatic clustering is very computationally expensive (you must logically compare each document to every other document) and may be inaccurate: A document unrelated to focus groups might be placed in the focus group cluster.

The alternative is to use your organizational knowledge to partition the document collection manually. Many data warehouses are now operational and partitioned into reasonable subject areas, and you can use the experience gained from identifying and cleaning up these subject areas to partition a document collection. Most departments or projects track very domain-specific documents: Simply placing them into meaningful subcollections can have a dramatic influence on both retrieval accuracy and efficiency. Users or an automated technique can forward documents to the right domain, which reduces both the retrieval of irrelevant documents and the size of the collection to be searched.

The downside is that not all products can easily partition document collections and then choose the appropriate collection. Automatically selecting the right collection is still a research problem, but if you can manually partition and build a simple GUI to let users select the deep collection on their own, you will reap some quick rewards.

Finally, some simple heuristics can help you select the appropriate collection, especially if it must be done automatically. Because term distribution is often domain specific, you can use the most frequently occurring terms within a domain as indicators of what the domain is about. For the focus group domain, you can assume that "focus group" will occur very frequently.
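The heuristic above could be sketched roughly as follows. This is a minimal illustration in Python, not the authors' implementation: the function names (build_indicators, guess_collection), the tiny stopword list, the sample documents, and the value of top_k are all assumptions made for the example. It also scores single words rather than phrases such as "focus group," which a real deployment would probably want to handle.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_indicators(collections, top_k=3):
    """Map each domain name to the set of its top_k most frequent terms."""
    indicators = {}
    for domain, docs in collections.items():
        counts = Counter()
        for doc in docs:
            counts.update(tokenize(doc))
        indicators[domain] = {term for term, _ in counts.most_common(top_k)}
    return indicators


def guess_collection(query, indicators):
    """Return the domain whose indicators overlap the query most, or None."""
    terms = set(tokenize(query))
    best_domain, best_score = None, 0
    for domain, domain_terms in indicators.items():
        score = len(terms & domain_terms)
        if score > best_score:
            best_domain, best_score = domain, score
    return best_domain


# Made-up sample subcollections, for illustration only.
collections = {
    "focus_groups": [
        "Focus group results for the new product launch",
        "The focus group participants disliked the packaging",
        "Focus group schedule and moderator notes",
    ],
    "finance": [
        "Quarterly budget forecast and revenue report",
        "Budget variance report for the third quarter",
        "Revenue forecast by region",
    ],
}

indicators = build_indicators(collections)
print(guess_collection("what did the focus group say about packaging", indicators))
print(guess_collection("budget forecast", indicators))
```

When no indicator term appears in the query, guess_collection returns None, which a portal could treat as a signal to search every collection or to ask the user to choose one.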
You can implement a code segment that takes the query, matches it against the set of indicators, and makes a guess at the best search engine. Perfection in this regard, however, remains a research problem.

The bottom line: Think about both efficiency and accuracy when deploying a portal, and consider partitioning your portal into domains just as you partitioned your data warehouse into reasonable subject areas.

David Grossman [grossman@iit.edu] is an assistant professor of computer science and Ophir Frieder [frieder@iit.edu] is the IITRI professor of computer science at the Information Retrieval Laboratory, Illinois Institute of Technology.

RESOURCES

Grossman, David and Ophir Frieder. Information Retrieval: Algorithms and Heuristics, Kluwer Academic Press, 1998

Text Retrieval Conference (TREC): trec.nist.gov

Related Articles on IntelligentEnterprise.com:

"Context Dependency," September 29, 2000: www.intelligententerprise.com/000929/decision.jhtml

"This Year Brought to You by the Letter E," January 1, 2000: www.intelligententerprise.com/000101/decision.jhtml

"Now It's Personal," November 10, 2000: www.intelligententerprise.com/001110/feat1_3.jhtml










