Measure Twice, Cut OnceAs with construction, you must plan before building the ETL.by Warren Thornthwaite, Edited by Margy Ross
Almost every task area in the data warehouse life cycle begins with a planning step. Unfortunately, it's human nature to skip the planning and jump right to the task at hand. With ETL, this goes something like "Let's just get some data loaded, then we'll figure out what we have to do to it." In this column, we describe a simple planning tool to help the ETL team focus on the big picture, while documenting the major events in moving data from the source systems into the target dimensional model. The technique, in conjunction with the source-to-target mapping, provides a preview of fundamental ETL metadata. Objective: High-Level ETL PlanOur goal is to create a conceptual diagram that succinctly captures the ETL process. Figure 1 shows a simplified high-level ETL plan for a fictitious utility company. It illustrates source-to-target data flows at the table level, along with the major transformations for the entire target dimensional model on a single page. In the real world, ETL plans often require several pages, depending on the complexity of the target model. Regardless, a graphical representation similar to Figure 1 is the desired result. Create the plan. The standard template for an ETL plan shows source tables at the top of the page, target tables at the bottom, and data flows, transformations, and processing notes in between. Let's work through an example by delving into the customer dimension table. Plan inputs. There are several critical inputs to the ETL plan: the target dimensional model, source system data model, source-to-target mapping, and data content and quality information from the relevant sources. Figure 2 shows the target customer dimension, while Figure 3 shows sample data rows from the source tables. Access to the source system itself is particularly helpful. Directly querying the source tables to see the exact contents almost always reveals interesting issues. Data flows. The customer dimension data flow is fairly simple. We need to join the The plus sign on the Once the data flow lines are drawn, we make notes along the line describing the transformations needed to get the data ready for loading into the target. Standard dimension transformation notes. The initial tasks noted along the data flow line apply to every dimension: capturing audit and data quality information, adding the surrogate key, and relaying the slowly changing dimension rules.
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|











