CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





February 1, 2002

The Origin of Data

A plan for data lineage tracing is too often an afterthought in data warehouse design, which makes a difficult problem worse

By Ganesh Variar

A data warehouse integrates data from disparate systems to present "a single version of the truth" to its users. However, analysts often want a data warehouse to help them explore where a piece of data came from and the process by which it arrived in the warehouse. In industry terminology, they want to be able to trace the "data lineage."

EXECUTIVE SUMMARY

Ganesh Variar

The ability to trace the lineage of data in the data warehouse increases users' confidence in the data warehouse and assists data analysts trying to understand anomalous data. Proven techniques exist for tracking data lineage, and they are best planned along with the initial warehouse design. Academic research may soon deepen the type of information that can be recovered.

It is important to have a strategy to record data lineage information in a warehousing environment. If you've tried implementing a formal approach to trace data lineage, the complexities involved are apparent to you. I address some of those issues here and present various techniques used for recording and tracing data lineage. I also explore how input from the research community could possibly improve the process in the future.

DATA UNDER SCRUTINY

The ability to trace data lineage is invaluable to the business users of the data warehouse as well as to the IT personnel who are responsible for supporting the warehouse. However, most data warehouse blueprints don't include a clear strategy for recording and tracing data lineage. Sometimes, a provision for capturing and using data lineage is incorporated later as an afterthought. This tacking on typically happens after the warehouse has been in production for a while, and in response to repeated requests from data analysts.

The most common use of data lineage is to verify the origins of suspect or anomalous data. When analysts see warehouse data that they can't fully comprehend, they often look at the source data from the operational systems and try to understand how the data was derived in the warehouse. Analysts can use data lineage to authenticate the source data as well as the transformation rules used in the warehouse. If users can reconcile the data in the warehouse with the operational data in the source systems, it enhances their confidence in the data warehouse.

Data lineage information also facilitates data mining and data discovery. As the business evolves, the operational systems' structure and processing logic change. The programmer analysts, who maintain the data warehouse, are often faced with the problem of determining the effect of these changes on the warehouse. Data lineage information supports impact analysis in such situations. You can use it to pinpoint the structures and the programs in the data warehouse that should be changed in response to source system changes.

A data warehouse typically extracts data from various sources, some of which might be more reliable than others. Data lineage can help you determine the relative reliabilities of various source systems. If multiple systems source the same data, you can use the reliability estimate to choose the most reliable system as the source for that data.

A concept closely associated with reliability is data cleansing. Data cleansing is an integral part of many warehousing systems. Data lineage helps trace the origins of the cleansed data, and you can even use it to set up a feedback loop to correct the data back in the operational system.

CURRENT TECHNIQUES

The solution to the data lineage problem consists of two steps. First, you must design and implement a good strategy for recording data lineage. Second, you have to use the information you record to actually trace data lineage.

As for the first step, various methods have been proposed to capture data lineage information. Most commercial models use the metadata of the extract, transform, and load (ETL) process to trace lineage. One of the complexities involved in this approach is the time-variant nature of the ETL process. As new business rules are introduced and the existing ones are changed, the transformation logic and the ETL programs are modified. Hence, it is important to maintain versions of the ETL programs and link the warehouse data to the correct version of the program that loaded it.

You can record lineage at various levels of granularity. Data lineage information can be maintained at table, column, or row level. The analyst uses the ETL metadata to trace the path traversed by the data while it was being transported from the source system to the data warehouse. The source data fields and the transformation logic involved in computing the given piece of warehouse data are determined in the process.

Ralph Kimball proposes creating an audit dimension to enable data quality and lineage reporting in a dimensional data warehouse (see "Indicators of Quality," April 10, 2000). Under this plan, you use a data warehouse-generated audit key to link the audit dimension to each record in the fact table. The audit dimension contains an ETL master version number. This number represents the version of the ETL program that loaded the corresponding fact record. The administrator creates a new master version number whenever changing any aspect of the ETL programs. This compact and elegant solution links each fact record to the set of transformations that were executed to create the record, and thus enables tracing lineage at the lowest level of granularity.

Microsoft's data movement tool, Data Transformation Service (DTS), uses the Microsoft repository to store ETL metadata and enables recording of data lineage. When a program executes, DTS stores a description of that execution in the repository. When a warehouse uses DTS to load a record, it optionally tags the row by a lineage identifier. An analyst can use this identifier to link the row in the warehouse to the repository's description of the program execution that populated or updated it.

Until recently, Object Management Group's (OMG) Common Warehouse Metamodel and Meta Data Coalition's (MDC) Open Information Model were the two most influential standards for defining data warehouse metadata. These organizations developed these standards to facilitate better exchange of metadata between different software vendors' products. The MDC is a consortium of about 50 vendors including Microsoft, BMC Software Inc., Informatica Corp., Computer Associates International Inc., NCR Corp., and SAS Institute Inc. OMG consists of Oracle, IBM, NCR, Unisys Corp., and Hyperion Solutions Corp., among other vendors.

The metadata models both these groups proposed contain provisions for tracing data lineage. In September 2000, MDC ceased operations and merged with OMG to work on a combined set of metadata specifications. Data lineage will continue to figure as an important element in the integrated standard.

Most of the top software vendors involved in standardizing warehouse metadata offer their own repositories, either as a stand-alone product or as a part of one of their analytic products. Many of these repositories support data lineage tracing, and we can expect others to follow suit in the near future.

ETL tool vendors are closer to the action, as their tools are used to transfer the data from source systems to the warehouse. Hence, products from leading ETL vendors such as Ascential Software's MetaStage and Informatica's metadata repository support lineage tracing at table and column levels.

If your data warehouse model isn't fully dimensional, or if your ETL tool doesn't directly support tracing data lineage, let me propose a simple solution: The first step is to add a process date column to all the tables in your data warehouse. This column will contain the date and time when the corresponding record was loaded into the warehouse. You must also include a set of effective start and end dates in the repository tables, which store your ETL metadata. Whenever you make changes to any of your ETL programs, you must update the effective end dates for the existing metadata records and insert new records that will reflect the changes to the program. Thus, you will have one record in the metadata repository corresponding with every change to the ETL programs. The effective start and end dates associated with each metadata record define the period during which the change is valid. You can now link the data from any of the tables in your warehouse to the metadata for the set of ETL programs that loaded the record: Filter on the condition that the process date for the data row should be between the effective start and end dates for the metadata records. The metadata records that meet this criterion describe the set of ETL programs that were used to load the data.

LIMITATIONS AND POSSIBLE BREAKTHROUGHS

The final goal of tracing data lineage is to isolate a set of source records and the transformations that were used to derive the warehouse data. Current techniques, as I described earlier, identify only the source fields along with the transformations involved. It is then up to the user to identify the actual set of source records used to derive the specified warehouse data item.







IE Weekly Newsletter
Subscribe to the newsletter
    Email Address