CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





October 24, 2001

Integrating Structured Data and Text: Part 2

Build relationally integrated systems to fully leverage your warehouse investments

David Grossman and Ophir Frieder
Edited by Erik Thomsen

We expect that as portals grow, there will be a greater demand to integrate both structured data and text. The knee-jerk reaction is to buy two systems - one for each - but our solution is to use a single relational system to attack the problem. The advantage is that you not only get text retrieval, but you can integrate easily with existing data in the warehouse. Yes, you pay a small price in overhead because you are using a general-purpose tool, but the benefits in integration and query functionality often outweigh the cost of computational resources.

In our last column ("Integrating Structured Data and Text," Sept. 18, 2001), we presented an approach to integrating structured data and text by modeling the text as a relational application. Our presentation was limited to only simple functionality such as a single keyword search. We now extend our discussion to include multiple keyword searches, threshold searches (TAND), and relevance ranking. Recall our two sample documents and our ability to model the multivalued relationship between terms and documents with a TERM, DOCUMENT, and INDEX relation. Essentially, a row is stored for each document in the DOCUMENT table, a row is stored for each term in the TERM table, and the INDEX table contains a row when a given term appears in a given document.

D1: The GDP increased 2 percent this quarter.

D2: The economic slowdown continued this quarter.

MultiTerm Queries

A query such as "Find all documents with the terms slowdown or recession" is an example of a multiterm query. A query that integrates both structured data and text would ask to "identify those documents with either slowdown or recession and occur in a year with a significant change in salaries." This query requires a multiple term search of the unstructured document data as well as a structured search of the human resource (HR) data. To show how the text part of this is done (the structured portion is just a straightforward join on year to the HR tables), you would place all query terms in a QUERY table with a single column, term. You can accomplish an OR query, namely the selection of all documents that contain ANY of the terms, as follows:


SELECT docid
FROM index
WHERE term IN (SELECT term FROM QUERY)

Although you could construct an AND query by defining a self-join on the index relation, you would need to compute a self-join for each of the query terms (actually N-1 of them where N is the number of terms). Such a computation is expensive and eventually exceeds the limits of existing commercial systems.

A two-term query using this approach looks like this:


SELECT a.docid
FROM Index a, Index b
WHERE a.docid = b.docid AND
a.term = 'economic' AND
b.term = 'slowdown'

Thus, we use the following alternative:


SELECT a.docid
FROM index a, query b
WHERE a.term = b.term
GROUP BY a.docid
HAVING COUNT(*) = (SELECT COUNT(*) FROM QUERY)

For many queries, when a clustered index on the INDEX relation is maintained, the performance is quite reasonable. The trick is counting the number of matches between a document and the query, ensuring that the number of matches equals the number of terms in the query. The key difference in this query and others is that you populate a new QUERY table with a row for each word in the query. The QUERY table for economic slowdown will appear as:

QUERY
term
economic
slowdown

Note that this approach assumes that only one entry for an instance of a term in a document is stored in the INDEX relation. Terms in the INDEX relations include a term frequency attribute, so if a term appears multiple times, only a single row exists in the INDEX relation. For a query with 100 words, the same SQL is used. Instead of two rows in the QUERY table, there would now be 100 rows. This certainly is better than the 99 self-joins that would otherwise be required.







IE Weekly Newsletter
Subscribe to the newsletter
    Email Address