CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





May 28, 2002

Prime Time for XML

Open up new application types with XML embedded in your relational databases

By Steve Tracy

During the last decade, data warehouse designers have successively refined dimensional modeling techniques for storage and retrieval of numbers, dates, and simple textual descriptors. But we haven't had the same impressive improvements with more complex structured text, especially in a data warehousing context.

What improvements we have made in gracefully handling text storage and analysis have been mostly based on expensive advances in proprietary database storage and indexing technology, not on logical modeling insights. But there is hope. In this column, I describe one alternative approach that can be implemented today, applying XML to a dimensional modeling example.

Complex Data Warehouse Text

In this column, I address the challenges of analyzing complex text attributes typically stored in data warehouse dimension tables, and not the larger issue of searching archives of unstructured documents. Legacy source systems contain text fields for comments and descriptions, often with little predictable structure. Such text isn't a "document," but is usually larger than a simple character attribute, and it is often hard to predict exactly what you'll find. Storing this text in standard dimensional attributes presents the following data modeling challenges:

  • Cardinality: An otherwise well-behaved dimension record is forced to contain a multivalued text attribute.
  • Repeating groups: Code descriptions and other text fields often repeat, provoking the "we have to snowflake" response from the data modeler.
  • Meaningful information (such as a comment that is virtually unstructured): The data isn't in a very useful form by which to find the information (through a query against a simple text attribute).

These issues are probably very familiar to data warehouse modelers. Putting unstructured text into a structured world usually means making compromises. Data modelers sometimes use proprietary binary storage to house large text attributes as binary large objects (BLOBs). Alternatively, using an object storage approach ignores text as a true dimensional attribute. With both of these techniques, information becomes invisible to conventional SQL and relationships within the original text are often inaccessible.

Test Example: Substituting "Tags" for Text

In his last column, "Wrangling Behavior Tags" (May 9, 2002), Ralph Kimball described an innovative method of representing patterns of customer behavior as simple character tags. The traditional approach requires sifting through large volumes of transactions arising from customer-facing processes such as call centers, Web-site visits, delivery systems, and payment reconciliation systems. In Ralph's approach, each customer's behavior is represented for 24 time periods by assigning textual tags A through H to behavior clusters:

A: High-volume, repeat customer, good credit, few product returns

B: High-volume, repeat customer, good credit, but many product returns

C: Recent new customer, no established credit pattern

D: Occasional customer, good credit

E: Occasional customer, poor credit

F: Former good customer, hasn't been seen recently

G: Frequent window shopper, mostly unproductive

H: Other.

Then a series of attributes in the customer dimension effectively stores the time series of patterns as letters, based on analysis of the facts. The method assumes:

  • A single customer dimension record with a 24-month time series of behavior tags as 24 attributes.
  • Each customer is profiled each month.
  • A new behavior tag measurement doesn't by itself generate a new record.

The result is an effective means of accomplishing the goal, profiling customer behavior. The time series of tags is stored efficiently and can be queried efficiently with straightforward SQL techniques.

But is Text Only Text?

Ralph's approach, clean as it is, contains a number of typical text issues:

  • The meaning of the tags isn't declarative within the customer dimension. A sequence of letter tags is by itself, Greek.
  • The scope and meaning of the time periods is only indirectly communicated through the database structure.
  • The order of the column attribute implies the position in the time series. The time series data structure is inflexible and not easily scaled.
  • Column tag values don't contain explicit indicators of patterns of change.
  • Detecting change patterns requires many variations of SQL statements to support the combinations across time.
  • XML is Text

    Ralph's customer behavior design is tailor-made for XML representation. XML is a language ideally suited to modeling and communicating structured and semi-structured text. XML lets us address many of these modeling shortcomings. I'll recast the customer behavior solution using XML and:

    • Embed the solution in a conventional relational database
    • Store the complex text-based information in just a single database column
    • Eliminate the need for fixed time periods
    • Understand the meaning of the text without a second layer of translation
    • Store important attributes within the time series, such as when a change occurs
    • Easily support a hierarchical representation within the time series
    • Eliminate complex and lengthy SQL to extract patterns.

    Extending XML into the Dimensional Model

    Why not really add meaning, depth, and ease of exchange to dimensional data? XML has two basic mechanisms for declaring text structure: XML Document Type Declarations (DTDs) and XML Schemas. XML DTDs are the original and most widely supported approach, but lack the full ability to define data types and data widths at the receiving end of a communication. Nevertheless, using an XML DTD approach gives us data and structure as shown in Listing 1.






    IE Weekly Newsletter
    Subscribe to the newsletter
        Email Address