Data warehousing, along with e-commerce and business intelligence, now dominates the IT agenda in most organizations. Its success has many origins, including the availability of reliable, scalable database servers; a range of extract-transform-load (ETL) tools; and ad hoc query and online analytic processing (OLAP) applications. But perhaps the most important factor in data warehousing's success is the existence of generalized, dimensional models for representing large volumes of data in an easily navigated way. These models (in essence, the star schema and its variations) are now so common that relational databases are optimized for them and many ad hoc query tools sport star schema-based architectures.
Dimensional modeling, like any technique, has its limits. But an area in which its full potential has yet to be tapped is an unlikely one: text management. In this article, I'll examine how dimensional modeling can support text in the data warehouse. The technique provides a logical and efficient organization to the masses of unstructured text that are potential sources of substantial business intelligence. Modeling with conceptual dimensions, by making the semantic content of text accessible, overcomes common problems of simpler techniques (such as keyword searches and string matching) that depend only on surface features of text.
Dimensions of Text
Text is often misclassified as unstructured data. This label is far from accurate; for 50 years, Noam Chomsky and other linguists studying transformational grammar have identified complex models in language's underlying structure. The results of these studies of syntax and semantics provide the basis for analyzing the structure of text in a well-defined, formal manner. Thus, in reality, the requirements of document data warehouses are similar (and in some cases identical) to those of traditional, numeric-oriented warehouses.
Like structured data, documents (which may be as simple as email messages or as complex as a new drug application to the Food and Drug Administration) must be collected and loaded into the warehouse. The information in the texts must be indexed (perhaps in several different ways) to facilitate retrieval, and aggregated and structured to support navigational searches.
In the text data warehouse, the analog to the star schema fact is the document or text summary. The document is identified by a set of dimensions corresponding to words, word categories, or themes. For example, you could index the minutes of World Trade Organization meetings along general themes such as Monetary Policy, Banking, and International Trade as well as more specific ones such as Thai Textile Import Restrictions.
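As a rough illustration of the document-as-fact idea, the sketch below represents each document as a record keyed by theme dimensions. The class name `DocumentFact`, the helper `documents_for_theme`, the sample titles, and the theme weights are all hypothetical and chosen only for the example; a real warehouse would hold this in fact and dimension tables.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DocumentFact:
    """One hypothetical fact row: a document (or its summary)."""
    doc_id: int
    title: str
    # Dimension values: theme name -> weight of that theme in the document.
    themes: Dict[str, float] = field(default_factory=dict)


def documents_for_theme(facts: List[DocumentFact], theme: str) -> List[str]:
    """Return titles of documents indexed under a theme, heaviest weight first."""
    hits = [f for f in facts if theme in f.themes]
    hits.sort(key=lambda f: f.themes[theme], reverse=True)
    return [f.title for f in hits]


# Made-up sample data for illustration only.
facts = [
    DocumentFact(1, "WTO minutes, session A",
                 {"International Trade": 0.7, "Banking": 0.2}),
    DocumentFact(2, "WTO minutes, session B",
                 {"Monetary Policy": 0.6, "International Trade": 0.3}),
]

trade_docs = documents_for_theme(facts, "International Trade")
```

Slicing by a theme here plays the role that constraining on a dimension plays in a numeric star schema.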
Although text lends itself to dimensional representation, the processes involved differ somewhat from traditional dimensional modeling. Thus, when discussing document data warehouse processes, you should consider them in terms of their analogs in conventional data warehouses, rather than expect to draw direct comparisons. For example, in a document data warehouse, dimensions are usually not predefined but rather extracted from the text, and as the set of documents grows, so do the dimensions. Furthermore, although well-defined and agreed-on operations such as aggregation are available for numeric data, no such consensus exists in the text arena about how to cluster documents, although several methods have been proposed. Even within the same clustering technique, slight variations in algorithms can yield different cluster groupings.
Building the Document Data Warehouse
The text data warehouse shares many ETL requirements with the traditional, numeric data variety. The main steps in the ETL process for documents are: filtering based on metadata criteria, extracting features such as person and organization names, creating keyword and thematic indexes, summarizing documents, and grouping documents into related clusters.
Filtering. Because text data is not as limited in structure as numeric data, the idea of identifying and cleansing erroneous values is not applicable. In the document data warehouse, the closest analog to cleansing is filtering based on metadata information or content. Metadata about each source is a primary means of filtering; for example, you could capture marketing campaign documents and mail messages en masse for archiving, protecting confidential material by excluding documents created by attorneys and negotiators. Message content provides a secondary means of filtering: You can easily exclude mail messages to or from particular parties, multiple messages to a mailing list, or messages containing particular phrases (such as memos about office social events).
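The two filtering passes described above might be sketched as follows. The record layout (`author_role`, `text`), the function name `keep_document`, and the sample messages are assumptions made for the example, and plain substring matching stands in for whatever content test a real loader would use.

```python
def keep_document(doc, excluded_roles, excluded_phrases):
    """Return True if the document passes both the metadata and content filters."""
    # Primary filter: metadata about the source (here, the author's role).
    if doc["author_role"] in excluded_roles:
        return False
    # Secondary filter: content (here, case-insensitive phrase matching).
    text = doc["text"].lower()
    return not any(phrase.lower() in text for phrase in excluded_phrases)


# Made-up incoming documents for illustration only.
incoming = [
    {"id": 1, "author_role": "marketing", "text": "Spring campaign launch plan"},
    {"id": 2, "author_role": "attorney", "text": "Draft settlement terms"},
    {"id": 3, "author_role": "marketing", "text": "Reminder: office picnic on Friday"},
]

kept = [
    d["id"]
    for d in incoming
    if keep_document(d, {"attorney", "negotiator"}, ["office picnic"])
]
```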
Metadata about document types and sources also determines the operations performed on the documents. Metadata can determine, for example, whether the warehouse should store the entire document or just a summary or URL; when the summary documents and URLs are refreshed; and whether keyword indexing is sufficient or thematic indexing is required as well. Furthermore, the use of colloquial language may lead to poor clustering with some algorithms but not others, so you may need to identify clustering methods for individual data sources.
In addition, a process called feature extraction will identify statistically significant vocabulary items such as words or phrases that provide the basis for clustering documents. Primary features are names of people and organizations, complex nouns such as online analytic processing, or abbreviations and relations such as the ownership relation in Alpha Analysis, subsidiary of Beta Software Systems.
Indexing. Two types of indexing are common in text retrieval: keyword and thematic. In the former, the location of uncommon words within documents is tracked. This process may involve either manually identifying keywords or automatically generating a list of keywords; automatically generated lists can be built by analyzing the frequency of words within a document collection and eliminating the most frequent ones because they add no discriminatory power. Alternatively, you can use a predefined list of commonly used words to identify words that carry little information. This approach works best with general collections, while generated lists are useful for domain-specific document collections.
Thematic indexing provides a more generalized version of keyword indexing. Rather than searching for particular words plus synonyms, such as aspirin and analgesic, users can search for the general category of pain reliever. As in keyword indexing, the warehouse can either use a predefined reference taxonomy or generate one from the document collection. Generating a taxonomy requires a training phase in which a training set of documents and their assigned categories are presented to a tool, such as the IBM Intelligent Miner for Text categorization tool, which then performs linguistic analysis.
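A generated keyword list could work along the following lines. This is a minimal sketch, not a production indexer: it assumes whitespace tokenization, measures frequency as the fraction of documents containing a word, and uses a hypothetical `stop_fraction` cutoff to decide which words are too common to index.

```python
from collections import Counter, defaultdict


def build_keyword_index(docs, stop_fraction=0.9):
    """Index word positions, skipping words that occur in too many documents."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(w for words in tokenized.values() for w in set(words))
    # Generated stop list: words found in more than stop_fraction of the documents.
    stop_words = {w for w, n in doc_freq.items() if n / len(docs) > stop_fraction}
    index = defaultdict(list)
    for doc_id, words in tokenized.items():
        for position, word in enumerate(words):
            if word not in stop_words:
                index[word].append((doc_id, position))
    return dict(index), stop_words


# Made-up documents for illustration only.
docs = {
    1: "the quota for the plant was raised",
    2: "the plant missed the quota",
    3: "the audit of the warehouse",
}

index, stop_words = build_keyword_index(docs)
```

In this toy collection only "the" appears in every document, so it is the only word dropped; each remaining word maps to its (document, position) pairs.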
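With a predefined taxonomy, thematic indexing reduces to mapping words to their broader categories. The sketch below assumes a tiny hand-built `TAXONOMY` dictionary and simple word matching; it does not attempt the training or linguistic analysis that a categorization tool would perform.

```python
# Hypothetical predefined taxonomy: each term maps to a broader theme.
TAXONOMY = {
    "aspirin": "pain reliever",
    "analgesic": "pain reliever",
    "ibuprofen": "pain reliever",
    "antibiotic": "anti-infective",
}


def build_thematic_index(docs, taxonomy):
    """Map each theme to the sorted ids of documents mentioning any of its terms."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # Strip trailing punctuation before the taxonomy lookup.
            theme = taxonomy.get(word.strip(".,;:"))
            if theme is not None:
                index.setdefault(theme, set()).add(doc_id)
    return {theme: sorted(ids) for theme, ids in index.items()}


# Made-up documents for illustration only.
docs = {
    1: "Trial compares aspirin with placebo.",
    2: "A new analgesic was approved.",
    3: "Quarterly sales report.",
}

thematic_index = build_thematic_index(docs, TAXONOMY)
```

A search on "pain reliever" then finds both documents, even though neither contains that phrase.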
Summarizing. Just as numeric data warehouses rely on aggregations to provide high-level assessments, document data warehouses require summaries of documents to provide core information from the collection without unnecessary detail. The size of these summaries depends on the size of the original document and the amount of information that document contains. The larger the summary, the greater the information content, but a practical upper limit on size can exist. Balancing the two factors usually requires experimentation. Some types of documents, such as emails, memos, and policy documents, may be adequately summarized by a relatively small percentage of the original text. Medical and legal documents may require a more detailed summary in order to maintain adequate information content. In general, metadata about document types should control the degree of summarization.
Because summaries are documents themselves, you can apply the same techniques applied to the original documents (such as feature extraction, indexing, clustering, and even summarization) to the summarized documents to yield a crude, but sometimes effective, hierarchy of detail.
Clustering. Many Web search engines provide "find similar documents" options along with the results of keyword searches. Unlike numeric operations such as summing, in text searches no fixed, formal definition of similarity exists, and so there is no single solution to the problem of clustering or grouping related documents. However, several approaches have been developed, three of which I'll discuss here: hierarchical clustering, binary relational clustering, and self-organizing map (SOM) clustering.
All three methods seek to maximize intracluster similarity while minimizing intercluster similarity. This task has two requirements: a feature representation of the text that is easily compared to others, and a similarity measure. Common representations include a feature vector (such as a list of words and the number of times they occur), a histogram of word categories, and weighted measures of document themes.
The similarity measure can be as simple as the difference in the number of times frequently used words occur in each document. You can make a geometric interpretation of a document collection by considering the themes or word categories of a document as the dimensions of the collection; the number of occurrences or relative weight of a theme in a document is the document's measure along that dimension. The location of the document in the multidimensional space defined by the set of dimensions is described by the feature vector $[d_1, d_2, d_3, \ldots, d_n]$, where $d_i$ is the theme weight or number of occurrences on the $i$th dimension. With such a geometric interpretation, minimizing the total Euclidean distance among documents within each cluster produces output similar to that of the popular k-means data mining algorithm. Of course, a single document may contain 10 noteworthy themes and a large collection may have thousands of them, resulting in thousands of sparsely populated dimensions. Fortunately, recent research by Teuvo Kohonen suggests that you can significantly reduce the computational load by minimizing the number of examined dimensions through a random projection of a high-dimension representation (thousands or more dimensions) into a lower-dimension representation (hundreds of dimensions) without sacrificing discriminatory power. (See Resources.)
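To make the geometric picture concrete, the sketch below computes Euclidean distances between small feature vectors and projects a vector into fewer dimensions. The Gaussian projection matrix is an assumed construction (one common way to build a random projection), not necessarily the one used in the cited research, and the three toy vectors are invented for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_projection_matrix(n_in, n_out, seed=0):
    """Build an n_out x n_in matrix of Gaussian entries scaled by 1/sqrt(n_out)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
            for _ in range(n_out)]


def project(matrix, vector):
    """Multiply the projection matrix by a feature vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Three toy documents over five themes; d1 and d2 share themes, d3 does not.
d1 = [3, 0, 1, 0, 0]
d2 = [2, 0, 1, 0, 0]
d3 = [0, 4, 0, 0, 2]

near = euclidean(d1, d2)   # documents with similar theme counts
far = euclidean(d1, d3)    # documents with different themes

# Reduce five dimensions to three before any further comparison.
matrix = random_projection_matrix(n_in=5, n_out=3)
d1_small = project(matrix, d1)
```

At realistic scale the input would have thousands of dimensions and the output hundreds; the five-to-three reduction here only shows the mechanics.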
In hierarchical clustering, a tree forms in which the root represents the entire set of documents and the leaves are singleton sets with individual documents. Hierarchical clustering merges pairs of leaves that are most similar into nonleaf nodes; thus, the features of a nonleaf node are the combination of features of the constituent nodes. The merging process continues, producing new levels of clusters until only one node remains, which is the root. One of the benefits of hierarchical clustering is that it produces an easily navigated data structure that moves from general groupings to more specific clusters.
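The bottom-up merging just described can be sketched in a few lines. This version assumes Euclidean distance as the (dis)similarity measure and combines the features of merged nodes as a size-weighted average; both are illustrative choices, and the four sample vectors are made up.

```python
import math
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_cluster(vectors):
    """Merge the closest pair of nodes until only the root remains.

    Returns the tree as nested tuples of document ids.
    """
    # Each node is (tree, feature vector, number of documents in the node).
    nodes = [(doc_id, list(vec), 1) for doc_id, vec in vectors.items()]
    while len(nodes) > 1:
        # Find the pair of current nodes whose feature vectors are closest.
        i, j = min(
            combinations(range(len(nodes)), 2),
            key=lambda p: distance(nodes[p[0]][1], nodes[p[1]][1]),
        )
        (tree_a, vec_a, n_a), (tree_b, vec_b, n_b) = nodes[i], nodes[j]
        # Combine features as a size-weighted average (an assumed choice).
        merged_vec = [(x * n_a + y * n_b) / (n_a + n_b)
                      for x, y in zip(vec_a, vec_b)]
        merged = ((tree_a, tree_b), merged_vec, n_a + n_b)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]


# Made-up feature vectors: A and B are close, C and D are close.
vectors = {"A": [1.0, 0.0], "B": [1.1, 0.0], "C": [0.0, 5.0], "D": [0.0, 5.2]}
tree = hierarchical_cluster(vectors)
```

Reading the resulting tree from the root down moves from the whole collection to the two tight pairs, which is the general-to-specific navigation the technique is valued for.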
Unlike hierarchical clustering, binary relational clustering results in a flat data structure in which each document is placed in the one cluster that best represents it. As a result, each cluster corresponds to a particular topic. As documents are added to a group, the combined feature vector changes so that documents originally added to one cluster may move to another if doing so minimizes intercluster similarity. Binary clustering algorithms need several iterations to find the optimal placement of documents.
The SOM clustering technique maps sparse, highly dimensional data into a two-dimensional representation. In the case of document clustering, dimensions can be either words or categories, and the measure along each dimension is the frequency of occurrence. SOMs are represented as neural networks in which each node in the network contains a weight vector. Each document is presented to the network as a feature vector $x(t)$, and its distance to each node $k$ with weight vector $w_k(t)$ is calculated as the squared Euclidean distance
$$d_k(t) = \lVert x(t) - w_k(t) \rVert^2$$
The best matching node, $c$, is the one with the minimum value of $d_k(t)$, and the weight vectors of the best matching node and its neighbors are adjusted to the following at the next time interval:
$$w_k(t+1) = w_k(t) + \alpha(t)\, h_{ck}(t)\, [x(t) - w_k(t)]$$
where $\alpha(t)$ is a learning rate factor and $h_{ck}(t)$ is the neighborhood function that controls how much the weight vector of node $k$ is allowed to adjust, given the best matching node $c$. The result is that each node of the network has a weighted feature vector that best describes the collection of documents associated with that node. SOMs have been used to cluster Internet documents in the WebSOM project and have worked well with documents comprising colloquial text (generally a difficult problem; see Resources).
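One training step of the update rule above might look like the following sketch. The Gaussian form of the neighborhood function, the `radius` parameter, the 2 x 2 grid, and the sample weights are all assumptions made for the example; a real SOM would also decay the learning rate and radius over time.

```python
import math


def best_matching_node(weights, x):
    """Return the index c of the node whose weight vector is closest to x."""
    def sq_dist(w):
        return sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return min(range(len(weights)), key=lambda k: sq_dist(weights[k]))


def som_step(weights, positions, x, alpha, radius):
    """Apply w_k <- w_k + alpha * h_ck * (x - w_k) to every node k."""
    c = best_matching_node(weights, x)
    updated = []
    for w, pos in zip(weights, positions):
        # Gaussian neighborhood on the 2-D grid (an assumed form of h_ck).
        grid_sq = sum((p - q) ** 2 for p, q in zip(pos, positions[c]))
        h = math.exp(-grid_sq / (2.0 * radius ** 2))
        updated.append([wi + alpha * h * (xi - wi) for xi, wi in zip(x, w)])
    return c, updated


# A 2 x 2 map with made-up starting weights over two dimensions.
positions = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

c, new_weights = som_step(weights, positions, x=[0.9, 0.1], alpha=0.5, radius=1.0)
```

The best matching node moves halfway toward the document vector, while nodes farther away on the grid move by a smaller fraction set by the neighborhood function.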
Document Retrieval
Ultimately, the goal of document data warehousing is to integrate text with numeric information within a single repository. To meet this objective, storage and retrieval tools must support both types of data. The use of binary large objects within a relational database provides the mechanism to store and manage collections of documents. SQL requires extensions to adequately support basic document retrieval, some of which are already available. For example, since the release of Oracle ConText (now part of Oracle Intermedia) in Oracle7, the Oracle RDBMS has supported text operators for fuzzy searches, word-stem searches, proximity searches, weighting, and accumulation.
With the ability to thematically index, summarize, cluster, and efficiently retrieve through text operators, document data warehousing is now a viable option for extending your organization's business intelligence efforts. See the sidebar "Text Machines" (p. 38) for information about two text-management tools for building document data warehouses.
End of Story
The time has come for document data warehouses. Documents are potentially significant resources for business intelligence operations that have, to date, been largely untapped. Leveraging the strategic value of text is possible now that tools as well as design techniques exist to support the endeavor, and many of the lessons learned in data warehousing apply to this domain.
Of course, not all questions about document data warehouses have been answered. For example, how should a dimensional model of text be tied to numeric models? What are the best techniques for indexing documents by the same dimensions used for numeric data? For example, from a star schema of production quota facts, users should be able to drill down into production control documents to discover why output is lower than expected.
Furthermore, complex data requires complex security, raising many questions. For example, while access to individual documents can be limited, what information can intruders piece together by studying the summaries of a range of generally accessible documents? By what criteria should documents enter a warehouse to begin with? How can information content be measured with regard to organizational objectives? What distinct types of metadata should be collected for text? Can the Common Warehouse Metamodel support these types, or should it be extended? These areas require further examination before text data warehouses can become fully mainstream.
Resources
Audio Mining, Dragon Systems: www.dragonsystems.com
Exploration of Full-Text Databases with Self-Organizing Maps. In Proceedings of the 6th International Conference on Neural Networks. IEEE Service Center, 1996.
Self-Organization of Very Large Document Collections: State of the Art. In Proceedings of the 8th International Conference on Artificial Neural Networks. Springer, 1998.
A Scalable Parallel Algorithm for Self-Organizing Maps with Applications to Sparse Data Mining Problems. Data Mining and Knowledge Discovery 3, no. 2 (1999), www.research.microsoft.com/datamine.
websom.hut.fi
www.ibm.com/software/data/iminer/fortext
Dan Sullivan (desullivan@crtinc.com) is director of data warehousing at Computer Resource Team, developers of custom e-business applications based in Richmond, Va. Prior to focusing on database design and development, Dan designed and developed natural-language processing systems.