CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





December 10, 2003

Measure Twice, Cut Once

As with construction, you must plan before building the ETL.

by Warren Thornthwaite, Edited by Margy Ross

Almost every task area in the data warehouse life cycle begins with a planning step. Unfortunately, it's human nature to skip the planning and jump right to the task at hand. With ETL, this goes something like "Let's just get some data loaded, then we'll figure out what we have to do to it."

In this column, we describe a simple planning tool to help the ETL team focus on the big picture, while documenting the major events in moving data from the source systems into the target dimensional model. The technique, in conjunction with the source-to-target mapping, provides a preview of fundamental ETL metadata.

Objective: High-Level ETL Plan

Our goal is to create a conceptual diagram that succinctly captures the ETL process. Figure 1 shows a simplified high-level ETL plan for a fictitious utility company. It illustrates source-to-target data flows at the table level, along with the major transformations for the entire target dimensional model on a single page. In the real world, ETL plans often require several pages, depending on the complexity of the target model. Regardless, a graphical representation similar to Figure 1 is the desired result.

Create the plan. The standard template for an ETL plan shows source tables at the top of the page, target tables at the bottom, and data flows, transformations, and processing notes in between. Let's work through an example by delving into the customer dimension table.

Plan inputs. There are several critical inputs to the ETL plan: the target dimensional model, source system data model, source-to-target mapping, and data content and quality information from the relevant sources. Figure 2 shows the target customer dimension, while Figure 3 shows sample data rows from the source tables. Access to the source system itself is particularly helpful. Directly querying the source tables to see the exact contents almost always reveals interesting issues.

Data flows. The customer dimension data flow is fairly simple. We need to join the CUST_MAST table with the CUST_TP_MST table to decode the customer type description called DSC in CUST_TP_MST. We just read the source systems, we don't name them. We draw a line straight from CUST_MAST to the customer dimension and another line joining the CUST_TP_MST as shown in Figure 4. This join conveys standard denormalization, which is a common task when processing dimensions.

The plus sign on the CUST_TP_MST join line is notational shorthand. Shocking as it may seem, sometimes the source system can't be trusted to enforce referential integrity. We may have CUST_MAST rows with CUST_TYPE codes that do not have entries in the CUST_TP_MST table. In this case, we need outer join logic to find the rows in the CUST_MAST that do not have corresponding entries in CUST_TP_MST, enter a default value for them, and notify the appropriate person to fix this data problem.

Once the data flow lines are drawn, we make notes along the line describing the transformations needed to get the data ready for loading into the target.

Standard dimension transformation notes. The initial tasks noted along the data flow line apply to every dimension: capturing audit and data quality information, adding the surrogate key, and relaying the slowly changing dimension rules.








IE Weekly Newsletter
Subscribe to the newsletter
    Email Address