Intelligent Enterprise Magazine - Data Webhouse // Stop hiding from old browsers --> Intelligent Enteprise
Guide to the TechWeb Network

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Whitepapers
Subscribe
Home




April 16, 2001




XML Will Make It Easier


Data integration and reuse possibilities are wide open but not yet very well realized


by Ralph Kimball


By now, most of us in the data warehouse community have heard of extensible markup language (XML). We have been told that it will substantially extend HTML for much of our Web communications, and it will make the content of these communications far more transparent. That all sounds fine.

But what is XML? What will it really do for us? How far are we along in realizing the promised benefits? What does a data warehouse project manager need to know about XML?

What Is XML?

XML is a language for data communication. Superficially, XML takes the form of plain text documents containing both data and metadata but no formatting information. XML is expressed with much of the same notation as HTML, but departs from the architecture of an HTML document. HTML contains data and formatting information, but no metadata.

These differences between XML and HTML are crucial to understanding how XML will affect data warehousing. XML metadata consists of tags unambiguously identifying each item in an XML document. For instance, an invoice coded in XML contains sequences such as:

<Customer Name = "Bob" Address = "123
Main Street" City = "Philadelphia" />

Here "Customer" is an XML element defined between the sending and receiving parties before the invoice is transmitted. The Customer element has been defined to contain some XML attributes, including Name, Address, City, and possibly others.

You can see that the XML coded invoice therefore contains both metadata (the element and attribute tags) as well as data (the values Bob, 123 Main Street, and Philadelphia). Notice that the XML does not tell an application such as a Web browser how to lay out the invoice on the screen. This information is deliberately omitted from the XML document.

If we had sent the same invoice coded as an HTML document, we would have produced a superficially similar looking document consisting of tags and data values, but the HTML tags would be restricted to formatting the document; they would not convey the meaning of the data. There is no way in HTML to, for example, declare the components of an invoice.

So, if you receive an XML coded invoice, how do you know what the tags mean? And how do you format it in your Web browser?

DTDs, XML Schemas, And XSLT

In XML, the standard way for two parties to define the set of possible tags is by exchanging a special document known as a document type definition (DTD). The DTD declaration for our customer example could be expressed as:

<!ELEMENT Customer (Name, Address,
City?, State?, Postalcode?)> <!ELEMENT Name (#PCDATA)>

plus similar lines for Address, City, State, and Postalcode.

Here the question marks after City, State, and Postalcode indicate that these fields are optional. The #PCDATA declaration indicates that Name is an unformatted text string.

Notice that the DTD contains somewhat less information than a SQL CREATE TABLE statement. In particular, there is no field length.

DTDs have until now been the basis for setting up a "metadata understanding" between two parties exchanging XML, and a number of industry groups have been defining standard DTDs for their subject areas. But as data warehousers, we don't get enough from DTDs to build a SQL table. To rectify this problem, the World Wide Web Consortium (W3C) standards organization has been working on an industrial strength successor to DTDs known as XML schemas. XML schemas will contain much more database-oriented information about data types and how the XML elements relate to each other - or in other words, how tables can be joined.

After participating parties agree on an XML schema and one receives XML content from the other, the recipient uses a parsing engine to render the information content via another specification called extensible stylesheet language transformations (XSLT). Actually, XSLT is a general mechanism for translating one XML document into another XML document, but its most visible use is for turning XML into HTML for final on-screen presentation.

Where XML Will Make Life Easier

XML is intended to be the "dial tone" for computers to exchange information. Although XML is independent from the Web, most of the interesting uses will piggyback on the Web's hypertext transfer protocol (HTTP). Here, in my opinion, are the places where XML will affect data warehousing: legacy data extraction, input transaction capture, direct storage of XML, and agnostic front-end information delivery.

Legacy data extraction. If a data source can expose its data in XML format, then parties sharing a common XML schema can transmit and receive the data. Although older legacy applications may need to be retrofitted with XML writers, many more modern systems will be equipped to write data in XML format. Relational databases, such as Oracle, DB2, and Microsoft SQL Server 2000 already support query output and bulk data transfer directly in XML form, with no intermediate application required.

The ability to read remote data via XML greatly simplifies data extraction, because custom parsers for the remote data source do not have to be written. This, in turn, promotes a more distributed and opportunistic approach to a spread-out "data webhouse."

Input transaction capture. In many cases, the data warehouse is focused on capturing the original transaction inputs and using them as the source of data, rather than waiting for an extract or a query from the production system. This transaction capture capability has been growing in importance with the increased interest in realtime data warehouses.

A serendipitous side effect of the growing use of XML is that the transaction flow from a data input terminal to the production system can be siphoned off so that the data warehouse receives the transactions in parallel with the production system. If the data is expressed in XML and the XML schemas are shared, then the data warehouse can update in parallel with the production system, without waiting for a conventional extract.

Direct storage of XML. One of the more interesting points of overlap between XML and relational databases is the possibility of storing XML documents directly as relations. Direct storage can take a couple of forms. An XML document can be a kind of replacement for a SQL INSERT or UPDATE statement, where the data in the XML document ends up in relational tables. Or relational systems will implement a new XML native data type. This transformation is more than cosmetic: XML content can be far more hierarchical than most relational systems support. Vendors are discussing hybrid scenarios in which data in XML form could be joined inside the database directly to data in conventional relational tables. An XML data type in a relational system would open up whole new classes of hierarchical applications.

Agnostic front-end information delivery. The widespread deployment of XML will be the final step in removing query and reporting tools from end users' machines. An XML data transfer plus an associated XSLT formatting specification may be enough to produce any desired user interface presentation on a remote browser. What makes this interesting is that different end-user clients can use different XSLT formatting specifications while drawing data from the same database server. All that the server needs to do is expose its query results through XML. All that is needed at the client end is a Web browser.

The State of XML Implementation

XML is a set of open standards, defined through W3C. Although visiting the W3C's Web site (www.w3c.org) is interesting, browsing it is like drinking from a fire hose. There are dozens of subcommittees, vendors, and industry groups working on various aspects of XML, DTDs, and myriad other standards. But be aware that many of these standards are part of the plumbing of XML, and are of more interest to software developers in the vendor community than to data warehouse project managers.



Rate This Article

Comments:

Optional e-mail address:

Honestly, as a data warehouse project leader, I don't think you should dig deeply into the alphabet soup of all these committees and their work at this time. Rather, I would continue to wait for your primary, familiar vendors to complete their XML capabilities, and then I would try to take advantage of the capabilities described in this column. Ultimately all of this should make it easier to build up your data warehouse by connecting remote, unrelated systems while at the same time doing less custom software development.

 

Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including The Data Webhouse Toolkit (Wiley, 2000). He teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach him through his Web site, www.rkimball.com.




RESOURCES

Related Articles at IntelligentEnterprise.com:

XML-based remote object access: "Untangling the Web," March 27, 2001: www.Intelligent Enterprise.com/010327/feat2.jhtml
Metadata issues: "The Naked Truth," November 10, 2000: .../001110/metaprise.jhtml
Data integration: "Pillar of the Community," August 18, 2000: .../000818/feat1.jhtml
XML data servers: "E-Commerce X-Factor," April 10, 2000: .../000410/feat3.jhtml






IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







techweb
Online Communities TechWebInformationWeekLight ReadingIntelligent EnterprisebMightyNetwork ComputingDark ReadingDigital LibraryWall Street & Technology
Byte & SwitchNo JitterInternet EvolutionLight Reading's Cable Digital NewsContentinopleUnStrungBank Systems & TechnologyAdvanced TradingInsurance & Technology
Face-to-Face Events
InteropWeb 2.0 ExpoWeb 2.0 SummitVoiceConBlack HatCSISoftwareEntrprise 2.0 ConferenceGTEC
Mobile Business Expo
InformationWeek 500 ConferenceBuy Side Trading XchangeBuy Side Trading SummitBank Executive SummitInsurance Executive SummitTelcoTVEthernet ExpoOptical Expo
Magazines  
InformationWeekWall Street & TechnologyInsurance & TechnologyBank Systems & TechnologyAdvanced TradingMSDNTechNetSmart EnterpriseThe Architecture JournalDatabase Magazine
 
Research & Analyst Services  
Heavy ReadingInformationWeek ReportsInformationWeek Analytics