The Keyword Dimension
Using keywords in a data warehouse of historical letters
Ralph Kimball
I recently had the opportunity to design a data warehouse for a massive archive of historical letters. The archive was fascinating, consisting of hundreds of thousands of letters from the 1500s through World War II. The archive mostly consisted of well-preserved envelopes, showing the origins and destinations of this "postal history." But perhaps 15 percent of the letters actually retained their original contents, ranging from simple business correspondence, to family letters, to mail from soldiers on the front lines.
At first glance, this archive did not seem to be an obvious candidate for a data warehouse, especially using the familiar dimensional approaches. But, after some study, a number of compelling similarities among all the letters emerged that began to look like descriptive dimensions and measured facts. The list of candidate dimensions included:
The natural grain for this postal history archive was one fact-table record per letter. Because a letter could go through a number of stages (acquisition, appraisal, restoration, presentation, and divestment) the fact table's most natural form was an accumulating line-item grain. The previous list of dimensions clearly shows the characteristic "accumulating" list of dates corresponding to the various stages of the item being described on the fact record. The accumulating line-item table is one of the three basic types of fact tables; the other two types are the transaction grain and the periodic snapshot grain. I described the three types of fact tables grains in my March 30, 1999 column, "Fundamental Grains."
The seven dates associated with each letter represent a single underlying date table, Plays. The seven dates are independent dimensions in their own right, but as I have remarked many times in this column, you don't need seven date tables. One date table will do, as long as you connect it to the fact table with seven separate views. (See my August 1997 DBMS column "Data Warehouse Role Models" for more information on roles.)
Note that a number of the dates may be undefined for a given letter. You may only have estimates for the posting date and the delivery date. You give these dates real values in the date table, but the date record itself needs to be of a special type like "estimated to the nearest year." Other dates like the restoration date, the presentation date, and the divestment date may be inapplicable, and these date keys in the fact table need to point to special "inapplicable" records in the date dimension table, not records representing actual dates.
Designing the Keyword Dimension
As it turns out, the last item in the list of dimensions is the most interesting part of the design. Each letter had one or more keywords that an expert examiner had added. The keywords described much of the interesting content of the postal history archive. Because the letters ranged over thousands of situations and subjects, there was no significant predictability or structure to the keywords. Some keywords described the subjects of the letter (Civil War, Charleston, or traveling in a covered wagon), while other keywords described special markings on the letter (posted at sea or delayed in train wreck). The expert examiners were good at applying keywords in a disciplined way, but literally hundreds of interesting keywords drawn from different domains served as targets for queries in the final database.
Because each letter could have a variable number of keywords, the keyword dimension seemed like a good candidate to be a multivalued dimension. (See my August 1998 DBMS column, "Help for Dimensional Modeling.") I show the simplest form of a multivalued dimension for handling the keywords in Figure 1. The tricky part of a multivalued dimension design is the many-to-many join between the fact table and the dimension table. The Keyword Group key appears many times in the fact table and dimension table. In the dimension table, the Keyword Group key is repeated for each keyword in the particular list of keywords for a given letter. Some data modeling tools balk at letting you define this many-to-many join, and these tools may try to force you to place an associative "bridge" table in between the fact table and the keyword dimension table.
FIGURE 1 The Fact table for the letter archive data warehouse showing the keyword dimension table containing a separate record for each
keyword and each letter.
Assuming you have worked around your data modeling tool and built the schema as I've shown in Figure 1, you still will be left with a serious query problem.
The AND/OR Dilemma
User requests against the keywords in the postal history archive fell into two equally important categories. The OR queries (Civil War OR Spanish American War) could be satisfied by a simple OR constraint on the Keyword field in the dimension table. But AND queries (Civil War AND train wreck) were difficult because an AND constraint is really a constraint across two records in the keyword dimension table. SQL is notoriously poor at handling constraints across records. Using this design, probably the best you could do is issue a SQL constraint such as:
SELECT K1.kwgroupkey
FROM key wordtable K1
WHERE (select COUNT(K2.kwgroup key)
FROM keywordtable K2
WHERE K1.kwgroupkey = K2.kwgroupkey
AND K2.key word in ('Civil War', 'train wreck')
GROUP BY K2.kwgroupkey)
= 2.
The final "2" in the query is the count of the number of target keywords in this particular search. This approach is awkward and probably slow. It certainly requires a custom user interface that hides the complex SQL from the end user.
Searching for Substrings
It is possible to simultaneously remove the many-to-many join and the complex embedded SELECT constraint by changing the design of the keyword dimension to the simpler form I show in Figure 2. Now the records in the keyword dimension each contain one big text string with all the keywords for that letter, one after the other. You should use a special delimiter character like backslash at the beginning of the keyword field and after every keyword in the list. Thus the keyword text string containing "Civil War" and "train wreck" would look like:
\Civil War\train wreck\
Those of you more detail oriented and experienced in the ways of text string searching may be worrying about the ambiguity of searching on upper and lower case. Is it "Civil War" or "civil war"? You can resolve this either by morphing all the keywords in the database to one case or by using a special database text string search function that is case insensitive.
With the design I've shown in Figure 2, the AND/OR dilemma goes away. The OR constraint looks like:
kwlist like '%\Civil War\%' OR keyword
like '%\Spanish American War\%'
while the AND constraint has exactly the same structure:
kwlist like '%\Civil War\%' AND keyword like '%\train wreck\%'.
The % symbol is a wild-card pattern-matching character defined in SQL that matches zero or more characters. We use the backslash delimiter character explicitly in these constraints to exactly match the desired keywords and not get bogus matches like "uncivil warfare."
FIGURE 2 The same Fact table for the letter archive data warehouse showing the keyword dimension table containing a record for keyword list. The keyword list is a variable-length text string.
High-Performance Substring Indexes
The keyword list approach shown in Figure 2 will work in any relational database because it is based on standard SQL. But leading wildcard text searches like these are notorious for being slow when the keyword dimension table gets large. If performance becomes objectionable, you can pursue two approaches, if your database allows them. First, you can pin the keyword dimension table in memory so that although the constraint may invoke an exhaustive search of the dimension, it may be pretty fast. Second, you can build a special "pattern index" on the keyword field that provides an index lookup to every conceivable substring. The ability to build a pattern index opens up a whole set of serious keyword-oriented applications in medium- to large-sized data warehouses.
To determine whether your database supports pattern indexes or their equivalent, ask your local DBMS vendor systems engineer if you can build an efficient index that supports leading wild-card text searches. She will know what you are talking about. (We built pattern indexes in the original version of Red Brick in 1989, but I believe the current version marketed by Informix does not support them.)
For an idea of what a pattern index looks like, read about "Patricia Trees" in Don Knuth's Art of Computer Programming, Volume 3: Semi-Numerical Algorithms (Addison Wesley, 1998)
Serious Text Searching With a Dimensional Flavor
I have only dipped my toe into the large lake of text string searching. All of you are familiar with the amazing Web-based text search engines that index hundreds of millions of freeform documents. Is it possible to marry these text search engines to your more predictable and systematic dimensional-design techniques? For which applications is this approach appropriate? Tune in next issue.
Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including the newly released The Data Webhouse Toolkit (Wiley, 2000). Ralph teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach Ralph through his Web site at www.ralphkimball.com.