Prime Time for XMLOpen up new application types with XML embedded in your relational databasesBy Steve Tracy During the last decade, data warehouse designers have successively refined dimensional modeling techniques for storage and retrieval of numbers, dates, and simple textual descriptors. But we haven't had the same impressive improvements with more complex structured text, especially in a data warehousing context. What improvements we have made in gracefully handling text storage and analysis have been mostly based on expensive advances in proprietary database storage and indexing technology, not on logical modeling insights. But there is hope. In this column, I describe one alternative approach that can be implemented today, applying XML to a dimensional modeling example. Complex Data Warehouse TextIn this column, I address the challenges of analyzing complex text attributes typically stored in data warehouse dimension tables, and not the larger issue of searching archives of unstructured documents. Legacy source systems contain text fields for comments and descriptions, often with little predictable structure. Such text isn't a "document," but is usually larger than a simple character attribute, and it is often hard to predict exactly what you'll find. Storing this text in standard dimensional attributes presents the following data modeling challenges:
These issues are probably very familiar to data warehouse modelers. Putting unstructured text into a structured world usually means making compromises. Data modelers sometimes use proprietary binary storage to house large text attributes as binary large objects (BLOBs). Alternatively, using an object storage approach ignores text as a true dimensional attribute. With both of these techniques, information becomes invisible to conventional SQL and relationships within the original text are often inaccessible. Test Example: Substituting "Tags" for TextIn his last column, "Wrangling Behavior Tags" (May 9, 2002), Ralph Kimball described an innovative method of representing patterns of customer behavior as simple character tags. The traditional approach requires sifting through large volumes of transactions arising from customer-facing processes such as call centers, Web-site visits, delivery systems, and payment reconciliation systems. In Ralph's approach, each customer's behavior is represented for 24 time periods by assigning textual tags A through H to behavior clusters: A: High-volume, repeat customer, good credit, few product returns B: High-volume, repeat customer, good credit, but many product returns C: Recent new customer, no established credit pattern D: Occasional customer, good credit E: Occasional customer, poor credit F: Former good customer, hasn't been seen recently G: Frequent window shopper, mostly unproductive H: Other. Then a series of attributes in the customer dimension effectively stores the time series of patterns as letters, based on analysis of the facts. The method assumes:
The result is an effective means of accomplishing the goal, profiling customer behavior. The time series of tags is stored efficiently and can be queried efficiently with straightforward SQL techniques. But is Text Only Text?Ralph's approach, clean as it is, contains a number of typical text issues: XML is TextRalph's customer behavior design is tailor-made for XML representation. XML is a language ideally suited to modeling and communicating structured and semi-structured text. XML lets us address many of these modeling shortcomings. I'll recast the customer behavior solution using XML and:
Extending XML into the Dimensional ModelWhy not really add meaning, depth, and ease of exchange to dimensional data? XML has two basic mechanisms for declaring text structure: XML Document Type Declarations (DTDs) and XML Schemas. XML DTDs are the original and most widely supported approach, but lack the full ability to define data types and data widths at the receiving end of a communication. Nevertheless, using an XML DTD approach gives us data and structure as shown in Listing 1.
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|











