Intelligent Enterprise

Better Insight for Business Decisions

October 24, 2001

Integrating Structured Data and Text: Part 2

Build relationally integrated systems to fully leverage your warehouse investments

By David Grossman and Ophir Frieder
Edited by Erik Thomsen

TAND

At times users want a threshold AND (TAND) in a search system to improve selectivity. Consider a case where the user identifies numerous terms associated with recession, such as slowdown, unemployment, bankruptcy, and so on. A TAND query lets the user enter the entire list and then specify that only some minimum share of the list must appear in a document for that document to be retrieved. Requiring that any one term appear is too broad, and requiring that all terms appear is too stringent. A TAND query offers a middle ground; in the SQL, the threshold is expressed as the minimum number of matching terms. This functionality is provided as follows.


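-- <tand value>: minimum number of query terms a document must contain
-- INDEX is a reserved word in most DBMSs; quote or rename the table as needed
SELECT a.docid
FROM index a, query b
WHERE a.term = b.term
GROUP BY a.docid
HAVING COUNT(*) >= <tand value>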
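
As a rough illustration of the TAND query, the sketch below runs it against an in-memory SQLite database from Python. The helper name `tand_search`, the sample rows, and the conversion of a fraction into a term count are assumptions made for the example, not part of the original system; the table names are quoted because INDEX is a reserved word in SQLite.

```python
import math
import sqlite3


def tand_search(conn, fraction):
    """Return docids that contain at least `fraction` of the query terms."""
    (n_terms,) = conn.execute('SELECT COUNT(*) FROM "query"').fetchone()
    # Turn the requested share of the list into a minimum term count.
    threshold = max(1, math.ceil(fraction * n_terms))
    rows = conn.execute(
        '''SELECT a.docid
           FROM "index" a, "query" b
           WHERE a.term = b.term
           GROUP BY a.docid
           HAVING COUNT(*) >= ?
           ORDER BY a.docid''',
        (threshold,),
    )
    return [docid for (docid,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE "index" (docid INTEGER, term TEXT, PRIMARY KEY (docid, term));
    CREATE TABLE "query" (term TEXT PRIMARY KEY);
''')
# Hypothetical sample data: one row per (document, term) pair.
conn.executemany('INSERT INTO "index" VALUES (?, ?)', [
    (1, "slowdown"), (1, "unemployment"), (1, "bankruptcy"),
    (2, "slowdown"), (2, "growth"),
    (3, "unemployment"), (3, "bankruptcy"),
])
conn.executemany('INSERT INTO "query" VALUES (?)', [
    ("slowdown",), ("unemployment",), ("bankruptcy",), ("layoffs",),
])

# Require at least half of the four query terms (that is, two of them).
matches = tand_search(conn, 0.5)
print(matches)
```

The sketch relies on each (docid, term) pair appearing at most once in the index; otherwise COUNT(*) would count repeated rows rather than distinct matching terms.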

As we said earlier, any of these queries may be joined with structured data. If we want to find only documents written by employees, we can join the selected document list with the employee table. Assume that we have an EMPLOYEE table with columns LAST_NAME and FIRST_NAME. We also add last name and first name columns to the DOCUMENT table to identify the first author of each document (yes, we are simplifying, and we know that document-author is really a multivalued relationship).

SELECT a.docid
FROM index a, query b, employee c, document d
WHERE a.term = b.term
AND a.docid = d.docid
AND d.last_name = c.last_name
AND d.first_name = c.first_name
GROUP BY a.docid
HAVING COUNT(*) >= <tand value>
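
To make the join with structured data concrete, here is a minimal sketch that runs the same query in SQLite from Python. The sample authors, documents, and the threshold of two terms are invented for illustration; the schema follows the simplified single-author assumption described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE "index" (docid INTEGER, term TEXT, PRIMARY KEY (docid, term));
    CREATE TABLE "query" (term TEXT PRIMARY KEY);
    CREATE TABLE document (docid INTEGER PRIMARY KEY,
                           last_name TEXT, first_name TEXT);
    CREATE TABLE employee (last_name TEXT, first_name TEXT,
                           PRIMARY KEY (last_name, first_name));
''')
# Hypothetical sample data.
conn.executemany('INSERT INTO "index" VALUES (?, ?)', [
    (1, "slowdown"), (1, "unemployment"),
    (2, "slowdown"), (2, "bankruptcy"),
    (3, "slowdown"),
])
conn.executemany('INSERT INTO "query" VALUES (?)', [
    ("slowdown",), ("unemployment",), ("bankruptcy",),
])
conn.executemany('INSERT INTO document VALUES (?, ?, ?)', [
    (1, "Smith", "Ann"),   # written by an employee
    (2, "Jones", "Bob"),   # not an employee
    (3, "Smith", "Ann"),   # employee, but matches only one term
])
conn.execute('INSERT INTO employee VALUES (?, ?)', ("Smith", "Ann"))

tand_value = 2  # minimum number of matching query terms
rows = conn.execute(
    '''SELECT a.docid
       FROM "index" a, "query" b, employee c, document d
       WHERE a.term = b.term
         AND a.docid = d.docid
         AND d.last_name = c.last_name
         AND d.first_name = c.first_name
       GROUP BY a.docid
       HAVING COUNT(*) >= ?''',
    (tand_value,),
)
employee_docs = [docid for (docid,) in rows]
print(employee_docs)
```

The primary key on the employee name columns matters here: if two employees shared a name, the join would duplicate rows and inflate COUNT(*).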

Relevance Ranking

Multiterm queries work, and existing search engines use them, but they do not incorporate term weights. Some terms are simply more important than others. To ensure that unimportant terms don't dominate retrieval, most Web search engines use queries with term weights.

Search engines rank documents using a measure that computes the relevance of a given document to a given query. A common document ranking strategy, known as the vector space model, represents each document and query as a vector and ranks the documents according to how close the document vector is to the query vector. One means of computing this closeness is to take the inner product of the two vectors; a larger inner product means a closer match. Various weights may be used for each term; assume each term has a weight that is stored in the TERM table. This weight indicates the strength and frequency of the term across the entire collection. Also assume that another weight attribute exists in the INDEX relation. This weight identifies the strength and frequency of a term in a given document. The following query implements the inner product measure:


SELECT a.docid, SUM(a.weight * b.weight)
FROM index a, query b
WHERE a.term = b.term
GROUP BY a.docid
ORDER BY 2 DESC

This is a very simplistic similarity measure. For a survey of other ranking strategies and the SQL to support them, check out our book, Information Retrieval: Algorithms and Heuristics, or our Web site at www.ir.iit.edu.
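
The inner product query can be tried out with a small sketch like the one below, again using SQLite from Python. The weights are made-up numbers chosen only to make the arithmetic easy to follow, and the example assumes the QUERY relation carries a weight column, as the query's use of b.weight implies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE "index" (docid INTEGER, term TEXT, weight REAL,
                          PRIMARY KEY (docid, term));
    CREATE TABLE "query" (term TEXT PRIMARY KEY, weight REAL);
''')
# Hypothetical document-term weights.
conn.executemany('INSERT INTO "index" VALUES (?, ?, ?)', [
    (1, "slowdown", 2.0), (1, "unemployment", 1.0),
    (2, "slowdown", 1.0),
    (3, "bankruptcy", 3.0),
])
# Hypothetical query-term weights.
conn.executemany('INSERT INTO "query" VALUES (?, ?)', [
    ("slowdown", 1.0), ("bankruptcy", 0.5),
])

# Sum the products of matching term weights per document, best score first.
ranked = conn.execute(
    '''SELECT a.docid, SUM(a.weight * b.weight)
       FROM "index" a, "query" b
       WHERE a.term = b.term
       GROUP BY a.docid
       ORDER BY 2 DESC'''
).fetchall()

for docid, score in ranked:
    print(docid, score)
```

Document 1 scores 2.0 (2.0 x 1.0), document 3 scores 1.5 (3.0 x 0.5), and document 2 scores 1.0; the term "unemployment" contributes nothing because it is not in the query.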



TUNING

As with any other kind of application, getting it to run correctly is not the same as getting it to run well. So expect to spend some time tuning your text applications once you have them running.

DBMSs that permit a clustered index on term in the INDEX relation provide better performance. A clustered index almost precisely simulates an inverted index, as only one entry exists for a term, followed by a pointer to a list of documents that contain the term.

General optimization approaches are likewise supported by the relational approach. For example, long documents can be filtered out; otherwise, it is highly likely that a long document will be ranked as relevant for any query. Duplicate documents can also be removed; many algorithms exist for removing exact or near duplicates (see the paper by Chowdhury, et al. in Resources). Finally, for very common terms, the list of documents that contain the term can be truncated. Consider the term "near"; it will occur in millions of documents and is unlikely to be very useful. By removing occurrences below a frequency threshold, you will also remove numerous rows from the INDEX relation. We have shown that this approach does not have a significant effect on retrieval accuracy, but does have a dramatic effect on performance.
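
One possible reading of the truncation step is sketched below in Python with SQLite. The function name `prune_common_terms` and its parameters `max_docs` and `min_weight` are hypothetical, as are the sample rows. The table is declared WITHOUT ROWID with a primary key that starts with term, which is SQLite's way of storing rows in key order and only roughly approximates the clustered index discussed above.

```python
import sqlite3


def prune_common_terms(conn, max_docs, min_weight):
    """Drop low-weight rows for terms that occur in more than max_docs documents."""
    cur = conn.execute(
        '''DELETE FROM "index"
           WHERE weight < ?
             AND term IN (SELECT term FROM "index"
                          GROUP BY term
                          HAVING COUNT(*) > ?)''',
        (min_weight, max_docs),
    )
    conn.commit()
    return cur.rowcount  # number of rows removed


conn = sqlite3.connect(":memory:")
conn.execute('''
    CREATE TABLE "index" (term TEXT, docid INTEGER, weight REAL,
                          PRIMARY KEY (term, docid)) WITHOUT ROWID
''')
# Hypothetical data: "near" is common, "recession" is rare.
conn.executemany('INSERT INTO "index" VALUES (?, ?, ?)', [
    ("near", 1, 0.1), ("near", 2, 0.2), ("near", 3, 0.9), ("near", 4, 0.05),
    ("recession", 1, 0.1),
])

removed = prune_common_terms(conn, max_docs=3, min_weight=0.5)
remaining = conn.execute(
    'SELECT term, docid FROM "index" ORDER BY term, docid'
).fetchall()
print(removed, remaining)
```

Only the common term is pruned: the low-weight row for "recession" survives because that term appears in a single document.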


David Grossman [grossman@iit.edu] is an assistant professor of computer science and Ophir Frieder [frieder@iit.edu] is the IITRI professor of computer science at the Information Retrieval Laboratory, Illinois Institute of Technology.


RESOURCES

Chowdhury, A., O. Frieder, D. Grossman, and M. McCabe. "Collection Statistics for Fast Duplicate Document Detection." To appear in ACM Transactions on Information Systems (TOIS).

Frieder, O., A. Chowdhury, D. Grossman, and M. C. McCabe. "On the Integration of Structured Data and Text: A Review of the SIRE Architecture." DELOS Workshop on Information Seeking, Searching, and Querying in Digital Libraries, Zurich, Switzerland, December 2000.

Grossman, D. and O. Frieder. Information Retrieval: Algorithms and Heuristics. Kluwer Academic Publishers, 1998.

Grossman, D., D. Holmes, O. Frieder, and D. Roberts. "Integrating Structured Data and Text: A Relational Approach." Journal of the American Society for Information Science, February 1997.






