Guide to the TechWeb Network

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Whitepapers
Subscribe
Home




February 1, 2002

/020201/503feat3_1.jhtml">

The Origin of Data

A plan for data lineage tracing is too often an afterthought in data warehouse design, which makes a difficult problem worse

By Ganesh Variar

Continued from Page 1

You can find the right set by inspecting the transformations involved and selectively "inverting" them to arrive at the contributing set of source records. (See the sidebar, "Data Sleuthing.") When a series of transformations is required to derive a field in the data warehouse, the task becomes even more complex. With current technology, you are faced with the unenviable task of deciphering the inverse query for every transformation involved. This is one of the major bottlenecks that make lineage tracing so complicated in the present environment. It is more than likely that the business user will require some assistance from the programmer analyst to trace data lineage completely.

A breakthrough can be achieved in data lineage tracing if we are able to drill through from the data warehouse to the source data, even when multiple transformations of arbitrary complexities are involved. Currently, this approach still occupies the realm of academic research. The database group at Stanford University is exploring problems related to data warehousing within its WareHouse Information Prototype at Stanford project (more commonly called WHIPS). The problem of tracing data lineage is part of what the project explores. Jennifer Widom and Yingwei Cui have published several papers related to the subject. (See Resources for references.) They have suggested various approaches to identify the source records based on the properties of the transformations involved and have implemented a prototype lineage tracing system based on their algorithms. If software vendors are able to adopt the techniques the academic community suggests, systems that can automatically trace warehouse data back to its source may become available in a few years' time.

Despite such possible advances, there are several practical stumbling blocks to physically isolating the contributing source records. For instance, the source data records might have been updated since the data warehouse extract. Things could be even worse in the case of historical data that is no longer significant in an operational context. The old data might have been periodically purged from the operational system. A possible solution to this problem is to store part of the raw operational data in the warehouse itself. This is an extreme measure and should be attempted only for those portions of the data for which lineage tracing is of the utmost importance.

WORK NOW, BENEFIT LATER

The capability to trace data lineage is one of the critical success factors for your data warehouse. Tracing data lineage is a challenging problem that requires both strategic and tactical initiatives. A plan for recording data lineage should be well defined in the design phase itself. Implementation of the approach requires careful planning and attention to detail. Furthermore, when the need arises, you must augment your business analyst's efforts with appropriate technical support to help the analyst understand the transformations and trace the warehouse data back to its source. If you take this path, you should be able to sit back and reap the benefits of choosing to record data lineage information at the outset.


SIDEBAR

DATA SLEUTHING — A DATA LINEAGE TRACING EXAMPLE

This simplified example traces data lineage for a customer dimension table within a data mart. The table is populated using two of the most common types of transformations found in a data warehousing environment — key lookups and banding. Consider the source table Customer_Age, which contains customer ages.

A typical data mart doesn't use production keys. Data mart-generated surrogate keys are used in their place. The mappings between the keys are stored in a lookup table, Customer_Key_Map, within the data mart.

The columns in the customer dimension table that are relevant to this example are shown in the Customer table. The customers' ages are banded into predefined ranges, which makes it easier to analyze customer behavior based on age bands rather than on individual ages.

The transformations involved in populating these columns in the customer dimension are:

1. For each record in the Customer_Age source table, look up the value of the Customer_ID column in the Customer_Key_Map table to retrieve the corresponding Customer_Key value.

2. The value of the Age column in the Customer_Age table is compared with the predefined age ranges to determine which age band the customer's age falls into.



Rate This Article

Comments:

Optional e-mail address:

Suppose you wish to trace data lineage for the record in the Customer dimension table with the Customer_Key value of 101234. You need to understand the transformations involved in populating the table and determine its inverse to arrive at the contributing set of records from the source table. The steps involved are:

1. Look up the Customer_Key value (101234) in the Customer_Key_Map table to retrieve the corresponding Customer_ID (686152). This action is sometimes referred to as an inverse lookup.

2. From the Customer_Age table, retrieve the record with the Customer_ID value of 686152. This is the source record that contributed to the customer dimension record the lineage of which you are tracing.

The data lineage for the customer dimension record with the Customer_Key value of 101234 is fully defined by the source record from the Customer_Age table with the Customer_ID value of 686152 and the transformations involved in converting it.



Ganesh Variar [ganesh_variar@yahoo.com] is a project manager at Saama Technologies, a Silicon Valley-based consulting firm. He has eight years' experience in managing and designing business intelligence solutions.


RESOURCES

Ascential Software Corp.: www.ascentialsoftware.com

Data Warehousing at Stanford University: www-db.stanford.edu/warehousing/publications.html

Informatica Corp.: www.informatica.com

Microsoft: www.microsoft.com

Object Management Group: www.omg.org

Ralph Kimball: www.ralphkimball.com

Related Article at IntelligentEnterprise.com:

"Indicators of Quality," April 10, 2000







IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







techweb
Online Communities TechWebInformationWeekLight ReadingIntelligent EnterprisebMightyNetwork ComputingDark ReadingDigital LibraryWall Street & Technology
Byte & SwitchNo JitterInternet EvolutionLight Reading's Cable Digital NewsContentinopleUnStrungBank Systems & TechnologyAdvanced TradingInsurance & Technology
Face-to-Face Events
InteropWeb 2.0 ExpoWeb 2.0 SummitVoiceConBlack HatCSISoftwareEntrprise 2.0 ConferenceGTEC
Mobile Business Expo
InformationWeek 500 ConferenceBuy Side Trading XchangeBuy Side Trading SummitBank Executive SummitInsurance Executive SummitTelcoTVEthernet ExpoOptical Expo
Magazines  
InformationWeekWall Street & TechnologyInsurance & TechnologyBank Systems & TechnologyAdvanced TradingMSDNTechNetSmart EnterpriseThe Architecture JournalDatabase Magazine
 
Research & Analyst Services  
Heavy ReadingInformationWeek ReportsInformationWeek Analytics