CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





November 10, 2000




Fact Tables for Text Document Searching


Using a similarity metric as a measured fact


Ralph Kimball

In the October 20, 2000 issue, I dipped my toe into the large lake of text string searching. I described two approaches for handling a list of keywords describing an archive of documents. But a keyword-based approach to accessing a large number of documents makes some strong assumptions. A really good set of keywords may require human reviewers, a fact that certainly restricts the scale of the document archive. There aren't enough human reviewers to index millions of documents. Some interesting progress has been made in automatically generating keywords for documents, much like you can automatically generate book indexes. But even if you have a pretty good set of keywords describing documents, your end users are still left with some tricky query issues.

The problem with keyword lookups and with most Web search engines is that the end user only types in a few target words to initiate the search. In both relational databases and Web search engines, the hits are generated with a far-too-literal lookup of the target words. Indeed, even my favorite search engine, Google (www.google.com), doesn't know about plurals and alternate word endings - it either matches words exactly or misses them.

The use of a small number of target words to initiate the lookup leaves the end user with two very serious problems that, in my opinion, more clever search engines cannot overcome. First, the short list of words simply doesn't carry enough context by itself to really say what the user wants. And second, merely finding some or all of the words in the target document carries no guarantee that the subject matter is relevant.

Similarity Metrics

Many researchers have recognized these problems with keyword-based systems and have been investigating more powerful techniques for allowing an end user to search very large collections of documents. Although some of the early seminal papers (for example, that of Cornell's Professor Salton) date back to the 1960s, the advent of the Web has more recently lit a brush fire under the text searching community. Text searching is rapidly moving out of the academic realm and into the commercial realm.

In my opinion, the most promising approaches to searching large document archives are based on measuring the similarity between two documents. If you can say that two documents are "similar," and if you can quantify this similarity, then you can avoid more serious problems with keywords. Imagine that one "document" is really the end user's request for information. This request document can be a few sentences long or it can be much longer. Of course, there is probably an optimal length for the request document, so that it doesn't contain too many separate subjects that would make the search unfocused.

You can regard the second document as the target. If you have an accurate quantitative measure of the similarity between the request document and the target document, and if you can rapidly process a large archive of target documents with the same request document, then you are well on your way to building an effective text-retrieval system.

The creation of robust document similarity metrics is an active research topic today, in both academic and commercial settings. To sample a few of the many approaches, go to Google and search for "text distance metric," "topic spotting metric," "document relevance," and especially, "latent semantic analysis." I hope you appreciate the irony of this recommendation. Your search would be more fruitful if you had a similarity-based search engine!

Latent semantic analysis (LSA) is an approach that analyzes word-word, sentence-sentence, and passage-passage relationships. LSA generalizes the exact words and sentences in a candidate document precisely in order to avoid the problems of literal word matching. LSA is an interesting blend of a cognitive model of how humans think, and rigorous mathematical matrix techniques that capture the "dimensionality" of free-form text. A companion technique, latent semantic indexing (LSI), can generate exactly the kind of document similarity measurements that I have thus far discussed in this article.

LSA is probably not the final word on text searching, but it is likely that developers will build the next generation of text-searching systems on document-similarity measures like LSA, rather than on keyword searches.

Fact Tables for Similarity Measures

If you have access to a large archive of documents, then for certain applications you can build a very powerful document search system. Suppose you have millions of medical records describing symptoms, treatments, and outcomes for a large number of patients. This archive of medical records may be extremely complex. Many of the patients may have multiple diagnoses, multiple treatments, or multiple environmental and lifestyle factors. An interesting approach to querying this archive would be to define a few hundred single-subject queries (request documents) that would individually generate similarity scores for every patient record. You could store the results of testing the candidate request queries against the archive in a fact table whose grain is:

target document X request subject X
request document.

In other words, the fact table contains one record for each patient record (target document) for each possible request subject and for each request document that "implements" the query subject. Adding the request document to the grain allows the fact table to contain more than one request document for the same request subject.

Using this grain and generalizing the field names so as not to be medical-records specific, you only need two keys:

target document key
request document key

The target document dimension contains the title, author, creation date, and storage location of each target document. The request document dimension contains the name of the request subject and possibly some hierarchical grouping labels (for example, diseases, treatments, or environmental factors) to help organize and search for request subjects. This dimension table also contains the specific title, creation date, author, and storage location of each request document. Note that there may be more than one request document for a specific request subject. This would allow you to use more than one similarity methodology for document matching.

The numeric measured facts include:

similarity score
target document length

Keep in mind that although this design looks very symmetric between target documents and request documents, you can assume that there are a much larger number of target documents (for example, medical records) than request documents, and these target documents may have very heterogeneous contents. You can also assume that there is a smaller set of request documents, and that these documents have been carefully prepared to adequately express the intended requests.

Although you have prepared the request documents and calculated the similarity scores in advance, you hope to preserve considerable ad hoc flexibility so that the end user can pose complex and unanticipated requests. In my last column, ("The Keyword Dimension," October 20, 2000) I described the AND/OR dilemma that is typical of these kinds of requests. The user may want to see all the patients who have had Treatment A AND Treatment B, or conversely, Treatment A OR Treatment B.

The OR query is the simpler of the two because we don't have to jointly constrain two records from our fact table. Let's assume that a document passes the similarity test if the similarity score is greater than 0.8 on a scale of 0 to 1. The SQL looks something like:

Select T.target_doc_title, T.target_doc_
location
From archive_fact F, target_doc T,
request_doc R
Where F.target_doc_key = T.target_doc_
key
And F.request_doc_key = R.request_
doc_key
And ((R.subject = Treatment A' and
F.similarity > 0.8)
OR 
(R.subject = Treatment B' and
F.similarity > 0.8))

The AND query is more complex:

SELECT T.target_doc_title, T.target_doc_
location
FROM archive_fact F, target_doc T
WHERE F.target_doc_key = T.target_doc_
key
AND ((SELECT COUNT(G.target_key)
FROM archive_fact G, request_
doc R
WHERE G.target_doc_key =
R.target_doc_key
AND G.target_key = F.target_key
AND ((R.Subject = 'Treatment A'
AND G.similarity > 0.8)
OR (R.Subject = 'Treatment B' AND
G.similarity > 0.8)) 
= 2)

Don't be confused by the OR in between the requests for Treatment A and Treatment B in this code. Here you goal is to look for target documents having a count of 2 for the two joint requests you are making. This code finds the target documents correlating with both Treatment A and Treatment B!

Powerful Applications

Although I have descended into the implementation details in this article, it is worthwhile to step back and remind yourselves of the powerful applications that you can support with this approach. Because this approach depends on "precategorizing" and preprocessing the requests, you can look for situations where this is appropriate. Applications that come to mind include:

  • Medical records analysis where the established categories of requests include diagnoses, treatments, life style, and environmental factors
  • Customer responses, customer requests, customer emails, and any kind of questionnaire that includes freeform responses where the established categories include product interest, quality complaints, and payment issues
  • Technical support archives where an end user's questions can be matched against a large body of other users' experiences
  • Research projects, case studies, laboratory results, and environmental impact reports of all kinds where you know the research topics in advance and where there is value in examining many different systematic and correlated requests.

The design I have just described is very extensible. You can add new target documents, new request documents, and new similarity measurement methodologies in a graceful fashion that extends the power of the archive but does not invalidate previous lookups or require the fact table to be dropped or reloaded. A search engine like Google whose spider actually saves all the target documents in readable form (see Google's "cache" command) offers the interesting possibility of being able to continuously compute new similarity measures for new requests in a background process without needing to go back to the remotely stored documents.

 



Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including the newly released The Data Webhouse Toolkit (Wiley, 2000). Ralph teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach Ralph through his Web site at www.ralphkimball.com.






IE Weekly Newsletter
Subscribe to the newsletter
    Email Address