Intelligent Enterprise

Better Insight for Business Decisions





August 10, 2001



Divide and Conquer

Partitioning a document collection to improve retrieval accuracy

By David Grossman and Ophir Frieder
Edited by Erik Thomsen

Continued from Page 1

THE REAL WORLD VS. THE ACADEMIC

The research world is fascinated with identifying these subcollections automatically, a process often referred to as clustering. But automatic clustering is computationally expensive (you must logically compare each document to every other document) and may be inaccurate: for example, a document unrelated to focus groups might be placed in the focus group cluster.
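To make the cost concrete, here is a minimal Python sketch of the "compare each document to every other document" step. The function names and the Jaccard term-overlap measure are illustrative assumptions, not something the article prescribes; a real clustering system would likely use a different similarity measure and shortcuts to avoid comparing every pair.

```python
from itertools import combinations


def pairwise_comparisons(num_docs: int) -> int:
    """Number of unordered document pairs: n * (n - 1) / 2."""
    return num_docs * (num_docs - 1) // 2


def jaccard(a: set, b: set) -> float:
    """Term-overlap similarity, used here only as a stand-in measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def all_pair_similarities(docs: dict) -> dict:
    """Compare every document with every other document (the costly step)."""
    term_sets = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    return {
        (a, b): jaccard(term_sets[a], term_sets[b])
        for a, b in combinations(sorted(term_sets), 2)
    }
```

Under these assumptions, a collection of 10,000 documents already requires 49,995,000 comparisons, and the count grows with the square of the collection size.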

The alternative is to use your organizational knowledge to partition the document collection manually. Many data warehouses are now operational and partitioned into reasonable subject areas, and you can use the experience gained from identifying and cleaning up these subject areas to partition a document collection. Most departments or projects track very domain-specific documents: Simply placing them into meaningful subcollections can have a dramatic influence on both retrieval accuracy and efficiency. Users or an automated technique can forward documents to the right domain, reducing the retrieval of irrelevant documents and the size of the collection to be searched.

The downside is that not all products can easily partition document collections and then choose the appropriate collection. Automatically selecting the right collection is still a research problem, but if you can partition manually and build a simple GUI that lets users select the collection to search on their own, you will reap some quick rewards.

Finally, some simple heuristics can help you select the appropriate collection, especially if it must be done automatically. Because term distribution is often domain specific, you can use the most frequently occurring terms within a domain as indicators of what the domain is about. For the focus group domain, you can assume that "focus group" will occur very frequently. You can implement a code segment that successively takes the query, matches it against the set of indicators, and finally makes a guess at the best collection to search. Perfection in this regard, however, remains a research problem.
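The heuristic described above might be sketched as follows in Python. This is an illustration under stated assumptions, not the authors' implementation: the names `build_indicators` and `guess_collection`, the small stopword list, the `top_n` cutoff, and the scoring by simple term overlap are all hypothetical choices. For simplicity it treats single words as indicators rather than phrases such as "focus group".

```python
from collections import Counter

# Assumed, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "from", "in", "of", "on", "the", "to"}


def build_indicators(collections: dict, top_n: int = 5) -> dict:
    """For each domain, keep its top_n most frequent non-stopword terms."""
    indicators = {}
    for domain, docs in collections.items():
        counts = Counter(
            term
            for doc in docs
            for term in doc.lower().split()
            if term not in STOPWORDS
        )
        indicators[domain] = {term for term, _ in counts.most_common(top_n)}
    return indicators


def guess_collection(query: str, indicators: dict, default=None):
    """Guess the domain whose indicator terms overlap most with the query."""
    query_terms = set(query.lower().split())
    best, best_score = default, 0
    for domain, terms in indicators.items():
        score = len(query_terms & terms)
        if score > best_score:
            best, best_score = domain, score
    return best  # falls back to default when no indicator matches
```

A query that matches no indicator returns the default, which a portal could treat as a cue to search every collection or to ask the user to pick one.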




The bottom line: Think about both efficiency and accuracy when deploying a portal, and consider partitioning your portal into domains just as you partitioned your data warehouse into reasonable subject areas.

David Grossman [grossman@iit.edu] is an assistant professor of computer science and Ophir Frieder [frieder@iit.edu] is the IITRI professor of computer science at the Information Retrieval Laboratory, Illinois Institute of Technology.


RESOURCES

Grossman, David and Ophir Frieder. Information Retrieval: Algorithms and Heuristics, Kluwer Academic Publishers, 1998.

Text Retrieval Conference (TREC): trec.nist.gov

Related Articles on IntelligentEnterprise.com:

"Context Dependency," September 29, 2000: www.intelligententerprise.com/000929/decision.jhtml

"This Year Brought to You by the Letter E," January 1, 2000: www.intelligententerprise.com/000101/decision.jhtml

"Now It's Personal," November 10, 2000: www.intelligententerprise.com/001110/feat1_3.jhtml






