The Origin of DataA plan for data lineage tracing is too often an afterthought in data warehouse design, which makes a difficult problem worse
By Ganesh Variar Continued from Page 1 You can find the right set by inspecting the transformations involved and selectively "inverting" them to arrive at the contributing set of source records. (See the sidebar, "Data Sleuthing.") When a series of transformations is required to derive a field in the data warehouse, the task becomes even more complex. With current technology, you are faced with the unenviable task of deciphering the inverse query for every transformation involved. This is one of the major bottlenecks that make lineage tracing so complicated in the present environment. It is more than likely that the business user will require some assistance from the programmer analyst to trace data lineage completely. A breakthrough can be achieved in data lineage tracing if we are able to drill through from the data warehouse to the source data, even when multiple transformations of arbitrary complexities are involved. Currently, this approach still occupies the realm of academic research. The database group at Stanford University is exploring problems related to data warehousing within its WareHouse Information Prototype at Stanford project (more commonly called WHIPS). The problem of tracing data lineage is part of what the project explores. Jennifer Widom and Yingwei Cui have published several papers related to the subject. (See Resources for references.) They have suggested various approaches to identify the source records based on the properties of the transformations involved and have implemented a prototype lineage tracing system based on their algorithms. If software vendors are able to adopt the techniques the academic community suggests, systems that can automatically trace warehouse data back to its source may become available in a few years' time. Despite such possible advances, there are several practical stumbling blocks to physically isolating the contributing source records. For instance, the source data records might have been updated since the data warehouse extract. Things could be even worse in the case of historical data that is no longer significant in an operational context. The old data might have been periodically purged from the operational system. A possible solution to this problem is to store part of the raw operational data in the warehouse itself. This is an extreme measure and should be attempted only for those portions of the data for which lineage tracing is of the utmost importance. WORK NOW, BENEFIT LATERThe capability to trace data lineage is one of the critical success factors for your data warehouse. Tracing data lineage is a challenging problem that requires both strategic and tactical initiatives. A plan for recording data lineage should be well defined in the design phase itself. Implementation of the approach requires careful planning and attention to detail. Furthermore, when the need arises, you must augment your business analyst's efforts with appropriate technical support to help the analyst understand the transformations and trace the warehouse data back to its source. If you take this path, you should be able to sit back and reap the benefits of choosing to record data lineage information at the outset. SIDEBAR DATA SLEUTHING A DATA LINEAGE TRACING EXAMPLEThis simplified example traces data lineage for a customer dimension table within a data mart. The table is populated using two of the most common types of transformations found in a data warehousing environment key lookups and banding. Consider the source table Customer_Age, which contains customer ages.
A typical data mart doesn't use production keys. Data mart-generated surrogate keys are used in their place. The mappings between the keys are stored in a lookup table, Customer_Key_Map, within the data mart.
The columns in the customer dimension table that are relevant to this example are shown in the Customer table. The customers' ages are banded into predefined ranges, which makes it easier to analyze customer behavior based on age bands rather than on individual ages.
The transformations involved in populating these columns in the customer dimension are: 1. For each record in the Customer_Age source table, look up the value of the Customer_ID column in the Customer_Key_Map table to retrieve the corresponding Customer_Key value. 2. The value of the Age column in the Customer_Age table is compared with the predefined age ranges to determine which age band the customer's age falls into. Suppose you wish to trace data lineage for the record in the Customer dimension table with the Customer_Key value of 101234. You need to understand the transformations involved in populating the table and determine its inverse to arrive at the contributing set of records from the source table. The steps involved are: 1. Look up the Customer_Key value (101234) in the Customer_Key_Map table to retrieve the corresponding Customer_ID (686152). This action is sometimes referred to as an inverse lookup. 2. From the Customer_Age table, retrieve the record with the Customer_ID value of 686152. This is the source record that contributed to the customer dimension record the lineage of which you are tracing. The data lineage for the customer dimension record with the Customer_Key value of 101234 is fully defined by the source record from the Customer_Age table with the Customer_ID value of 686152 and the transformations involved in converting it. Ganesh Variar [ganesh_variar@yahoo.com] is a project manager at Saama Technologies, a Silicon Valley-based consulting firm. He has eight years' experience in managing and designing business intelligence solutions. RESOURCESAscential Software Corp.: www.ascentialsoftware.com Data Warehousing at Stanford University: www-db.stanford.edu/warehousing/publications.html Informatica Corp.: www.informatica.com Microsoft: www.microsoft.com Object Management Group: www.omg.org Ralph Kimball: www.ralphkimball.com Related Article at IntelligentEnterprise.com: "Indicators of Quality," April 10, 2000
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
| ||||||||||||||||||||||||||||||||












