Measure Twice, Cut Once
As with construction, you must plan before building the ETL.
by Warren Thornthwaite, Edited by Margy Ross
Continued from Page 1
Capturing audit data can be as simple as saving the job name, start time, stop time, row counts, and system variables. Most ETL tools do this as a standard part of their logging process. We may want to include supplemental data quality measures.
The next notes inform the ETL team about surrogate keys and slowly changing dimensions (SCDs). The customer dimension will deal with attribute changes using a combination of Type 2 (additional rows) and Type 1 (attribute overwrite). We communicate this in the ETL plan with a simple notation like SCD (2,1).
More specific transformation notes. We continue working through each column of the target dimension. The next one is simple; Cust_ID is the transaction system key, so we just copy it over.
The Customer_Name looks like a straightforward copy of the NAME field in the source. However, the sample data in Figure 3 highlights potential transformations. We could split NAME into its individual elements: First_Name, Middle_Name, Last_Name. Elementizing is particularly important if we need to integrate data from multiple sources. Separate elements are also important if the dimension supports customer contact, like direct mail or call center contact. Apparently we don't have either of these needs since NAME was not split out in the target model. Another possible NAME transformation would be to convert upper case to title case. Since dimension values appear as row and column headers, cleaning up their appearance makes for more legible reports. Many of the data quality tools have special functions to help with case conversion.
The next column, Contact_Name, is similar to Customer_Name, although it has some null values. In general, we prefer values in every dimension attribute so users understand the meaning. In this case, we might substitute a default value like N/A during the extract process. Or we might copy the Customer_Name field because Contact_Name consistently contains the name of a person. Whatever choice we make is ultimately a business decision and needs to support the business requirements. In this case, we'll copy over the Customer_Name field when Contact_Name is missing.
The next column in the target table is Phone_Number, and it has a few formatting problems. In this case, we will standardize the format so users can query it in a single, consistent fashion. We'll use the common area code, phone number pattern: (###) ###-####. The next column is Area_Code. There must be a business reason for separating out the area code. The good news is it will be easy to do after standardizing the phone number.
The rest of the address fields through State can be copied over fairly directly. Potential problems, such as mixed case and null values, may also need to be resolved with these fields, as we already discussed.
The ZIP field in the Figure 3 source is a mix of five-digit and nine-digit zip codes. In this case, the target table gives us guidance on what we need to do. There are two ZIP code fields in the target table: Customer_Full_Zip_Code and Customer_5_Digit_Zip_Code. The Customer_Full_Zip_Code field might be a direct copy of ZIP, or it might be a fully resolved nine digit ZIP, depending on the business requirements. The Customer_5_Digit_Zip_Code field is just what we'd expect. Interestingly, significant demographic data is available at the five-digit ZIP code level. If you create a separate five-digit field, it's easy to join it with a five-digit demographic table to better understand your customer base profiles.
Finally, we do the Customer_Type_Desc lookup by joining to the CUST_TP_MST table as described earlier. Again, we may want to do a mixed case transformation on the description itself.
Finish Planning Before Cutting
We follow the same process for the other target tables gathering inputs, drawing data flows, and noting transformations. The goal is a compact, visually understandable model to plan and document the overall ETL process. It captures enough information so that a competent ETL professional could complete the detailed design and begin construction. A picture is worth a thousand words; however, we need to spend much more time doing our homework relative to the contents of Figure 4 rather than beautifying the graphical representation.
Warren Thornthwaite [warren@ralphkimball.com]
is a Kimball University instructor. He cowrote The Data Warehouse Lifecycle Toolkit (Wiley, 1998).