CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





September 18, 2001



Integrating Structured Data and Text

Treating text as a relational application is a viable alternative for many data warehouses

By David Grossman and Ophir Frieder
Edited by Erik Thomsen

Vast volumes of both structured and unstructured data commonly reside in a data warehouse. Data items such as name, address, and phone number are inherently structured and repetitive in nature. Thus, storing such data in relational platforms is easily accomplished. Many organizations have vast repositories of unstructured data such as email, corporate policy documents, internal system documentation, and procedures for dealing with customers. Users want this data in the warehouse, too.

But incorporating these text documents into a data warehouse typically poses a problem: Previously, accessing structured data and text with a single SQL query was not an option - relational systems were too slow to easily access text. The conventional concern with incorporating text into the warehouse typically stems from the belief that relational databases were never designed for text. Thus, many of today's applications incorporate text into a warehouse via Character Large Objects (CLOBs) or by implementing user-defined data types.

The primary difficulty with storing data in CLOBs is that these structures typically support only string-matching operations. Conventional search queries such as "find documents that describe all types of cars" will not find a document about a "Nissan Sentra" unless the document also lists the word "car." User-defined operators, on the other hand, lack standardization across platforms and applications, limiting query optimization. Query optimizers know quite a bit about in-house functions, but training the optimizer to support newly developed, special purpose operators is a nontrivial task.

Given this reality, our Information Retrieval Lab followed the development of relational systems and took a fresh look at the problem in the early 1990s. After several years of testing in the lab, we deployed numerous applications through our consulting relationships using this technology. Most notably, in 1999, a Web search engine running entirely on top of an Oracle database was deployed for the National Center for Complementary and Alternative Medicine. With this system, users are able to combine structured data and text in a single SQL query, making a separate text search system unnecessary.

How It Works

Text is first parsed into a set of tokens or terms. Each term is then stored in a structure called an "inverted index." This structure indicates the documents that contain each term. Consider the following example:

D1: The GDP increased 2 percent this quarter.

D2: The economic slowdown continued this quarter.

For these two documents, a typical search engine would build an inverted index that contains a list of terms and a list of which documents contain each term. A possible inverted index implementation for the previous example is the following:

continued > [D2]

economic > [D2]

GDP > [D1]

increased > [D1]

percent > [D1]

quarter > [D1] > [D2]

slowdown > [D2]

the > [D1] > [D2]

this > [D1] > [D2]

two > [D1]

Typically, you eliminate "stop-words" - words that appear in almost all the documents such as "the" and "this" - from the index because they don't differentiate among the documents. We kept them for illustrative purposes. In our example, the term "quarter" also appears in both documents and thus contains a pointer to a list that contains both D1 and D2.

Essentially, the inverted index is just a many-to-many relationship between terms and documents. One term can be in many documents, and certainly one document can contain many terms. The ER diagram shown in Figure 1 illustrates this relationship.

The "document relation" in the ER diagram contains an entry for each document. This entry is usually a unique document identifier, such as the author, creation date, or title. The "term relation" stores information about each term, such as weights. The weights are ultimately used to rank documents in the order of a computed estimate of their relevance to the query. Numerous weights exist, but for now you only need to know that they can be stored in the term relation.

Once you have this ER diagram, you can build tables and start running queries. A possible keyword search to find all documents that contain a term is the following:

SELECT docid

FROM index

WHERE term = < fill in your favorite keyword>







IE Weekly Newsletter
Subscribe to the newsletter
    Email Address