CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise




Inside Job

Does your data warehousing team need an efficient ETL strategy? "Look within" may be the best advice

By Michael Jennings



In many data warehouse efforts, determining the best methodology and implementation strategy for extraction, transformation, and load (ETL) processing is a struggle. Budget constraints, deadlines, lack of internal warehouse architecture experience, or a combination of the three can consume precious time and money in these situations. In other cases, data warehouse project managers spend a great deal of time and effort analyzing, investigating, and evaluating various ETL tools. In doing so, these groups often overlook their own proven internal capabilities.

In this article, I’ll offer a technical methodology and implementation strategy for performing custom ETL processing. These techniques — which provide a fast, easy-to-learn, cost-effective method for loading information into a warehouse — can be useful for decision-support projects that are just getting started or for quickly implementing a prototype. Furthermore, you can easily augment existing procedures and pro- cesses, even those already involving commercial, off-the-shelf ETL products.

Understanding the Fundamentals

For the purposes of this discussion, we’ll assume that flat-file processing outside the RDBMS is the most efficient way to transform data. It offers the advantage of avoiding lock-in with a particular RDBMS vendor during a project’s life span, and sorting and transforming flat files is typically faster than doing so within the RDBMS.

We will further assume that your warehouse operating system is using one of the more popular flavors of Unix and a known RDBMS (preferably one with a warehouse focus). You’ll need to create three versions of the data warehouse database (development, test, and production) and load source extract files, temporary work files, DBMS load-ready files, scripts, and programs into a project directory subdivided into development, test, and production areas. You’ll also load those data files, scripts, and programs shared among decision-support projects into a common directory.

Next, you’ll need to make sure you have six basic infrastructure components. First, you need a network environment that can connect your source data systems to your warehouse platform. Most corporate network architecture can move large amounts of data among platforms using sheer bandwidth or various compression utilities. With the ubiquity of TCP/IP, utilities such as FTP are available for many platforms and operating systems.

The second component is an RDBMS for the warehouse. Most large RDBMS vendors offer their products on a variety of platforms and operating systems. Some vendors have also addressed the needs of decision-support environments by adding optimization packages for data warehouses, such as high-speed parallel processing. (Examples include Informix Advanced Decision Support Option, Sybase IQ, and Informix Red Brick Decision Server.) I’ve seen one RDBMS specialized for decision support insert 3.5 million records per minute and update 750,000 records per minute. That equates to 1 billion records inserted into a warehouse table in less than five hours. (Of course, we were using a 24-node box with several gigabytes of RAM, so compare your results appropriately.) An additional plus is the ability to remotely access databases on other servers; this feature will help facilitate the data extraction process from source systems directly to the data warehouse.

Third, you’ll need a sort-merge utility to integrate data from the various source systems. Several operating systems (such as Unix) include sort utilities for limited volumes. Otherwise, several popular commercial sort utilities are available. If you use an in-house sort package that isn’t available on the warehouse operating system, look into extending or migrating your current license. Some of these sort packages also include a merge utility for combining records from different source files with similar keys. If your sort utility does not have this capability, many script and programming examples of how to perform this function are available on the Internet. I’ve seen several IT organizations address this need by developing their own in-house utility for this type of activity.

The fourth component in your custom ETL environment is a calculation method. You can develop such a method in SQL during loading of the target warehouse table or, depending on volume, through a script (using awk or Perl, for example).

Fifth, you’ll need a way to schedule and run ETL batch cycles based on events or timelines. You can do so using a commercially purchased program or a utility that is available through the operating system or RDBMS (a cron, daemon, stored procedure, or trigger).

Last, you’ll need a source control utility to manage updates and version control of programs and scripts. It could be a commercially purchased program or a utility that is bundled with the operating system (for example, SCCS in Unix).

Method to the Madness

After you’ve identified the warehouse project’s business requirements, analyzed the source systems, and developed the physical data model (that’s another article), you’re ready to begin designing your ETL processes. You should design your ETL flows systematically; for example, the ETL environment described here will use a staged approach in which each loading process is directed toward a particular target table or cluster of tables depending on business needs.

Staged transformation optimizes data acquisition and transformation from source systems by providing a single processing method for initial and successive loads to the data warehouse. The following five stages offer a modular, adjustable transformation process for the target table that you can easily adapt to changes in the source systems or warehouse model without affecting your entire ETL workflow. Furthermore, using technical metadata tags, sometimes called operational metadata, in the warehouse schema design as well as the ETL transformation processes can improve loading and maintenance capabilities.

Each of these stages includes a sequential execution of steps using the appropriate infrastructure components I described earlier. These steps and target table loads can occur in parallel with some additional planning of the overall ETL batch cycle process. First, you assemble each step through a script or higher-level language program. (For this discussion, we’ll assume some form of shell script.) Then, you collect the stage transformation scripts in a higher-level target table load script. Finally, you gather the target table load scripts in the required execution order to form the processing needs of the batch cycle (daily, weekly, monthly, and so on).

Keep in mind, however, that depending on your business requirements and the complexity of business rules needed in the transformation, you wouldn’t necessarily use all five of these stages for every target table.

Stage 1: Source verification. In this stage, your system will access, extract, and build a temporal view of the data in the source system. This source extract is included in backups of the entire batch cycle for reload purposes and for audit-and-reconciliation purposes during testing. During this stage, you can also capture technical and business metadata and verify it against the metadata repository (if available), and verify business rules unique to the source system.

Stage 2: Source alteration. Depending on your business requirements, in this stage you can perform a variety of transformations unique to the source. The options include integrating data from multiple source systems based on priority ranking or availability, integrating data from secondary sources, splitting source-system files into multiple work files for multiple target table loads (clusters), and applying logic and conversions unique to the source systems. You can also apply technical metadata tags such as source-system identifiers, active production keys, and confidence-level indicators during this stage.

Stage 3: Common interchange. In the third stage, common interchange, you apply business rules or transformation logic that are common across multiple target tables — such as referential integrity (populating fact table surrogate keys from dimension tables, for example) and enterprise definitions and business rules from the metadata repository (conversion of country codes into ISO standard formats and so on).

Stage 4: Target load determination. In this stage, the system puts data into its final format to produce load-ready files for the target table, identifies and segregates rows to be inserted or updated (if applicable), applies remaining technical metadata tagging, and processes data into an RDBMS. It also compares data records processed from the source system for the current batch cycle against records loaded in previous cycles to determine if inserts or updates are required. For updated records, technical metadata tags such as load date, update date, current flag, load cycle, hash values, or cyclic redundancy checksums are also updated to indicate the change in the state of the data records. The load-ready files built in this stage are included in backups of the entire batch cycle for reload and audit purposes during testing.

Depending on the DBMS platform for the warehouse and the number of records involved, at this stage you can substitute the high-speed parallel load option in place of the standard database load utility. You can often achieve faster load performance by simply dropping and re- inserting all data rows.

Stage 5: Aggregation. In the fifth and final stage, aggregation, the system uses the load-ready files from the previous stage to build aggregation tables that improve query performance against the warehouse. This stage is typically applied only against fact table target-load processes. Make sure that aggregated records end up with the correct surrogate keys from the dimension tables for the rollup levels required in reports.

ETL Process Example

Now, let’s apply these infrastructure components and transformation method to a common ETL process. For our example, let’s assume that our source system is a human resources (HR) enterprise resource planning system (ERP). The target is an organization dimension table that happens to use type 2 slowly changing dimensions (SCDs; as described by Ralph Kimball) to capture changes to relevant columns over time.

Here’s the challenge: SCDs rely on production or natural keys in the dimension table and an unchanging source system. Consequently, we’ll need to assign new surrogate keys to the dimension table when changes to these relevant columns are detected during a batch load cycle. To do so, we’ll need to compare the previously loaded production keys of the dimension table and the new load cycle data from the source system.

Because we’ve used metadata tag col- umns in the design of the dimension table and ETL process, we can easily flag newly assigned surrogate keys as “current” and records previously loaded for a matching production key as “historical.” Data warehouses that use a star schema data model can use this current flag indicator to enable type 2 SCD processing for dimension table loads.

In the first transformation stage, source verification, we use the unload utility of the DBMS to extract data from the Organization and Region tables in the HR ERP source system into flat files. (See Figure 1, top left.) We then initially sort and reformat data to the extracted records from the Organizational table, which will be the primary data source for this target table load. The DBMS unload and sort utilities then extract data from the source system into flat files and sort or reformat the primary source records.

FIGURE 1 Source verification, alteration, and common interchange stages.


In the second stage of transformation, source alteration, we append data from secondary sources — in this case the HR ERP Region table — to the primary organizational extract file. (See Figure 1, middle right.) We need the appended column, Region Name, for loading into the data warehouse target dimension table, Organization. We use a merge utility in this stage to match the two files based on node ID, and then append the region.

As you’ll recall, in transformation Stage 3, common interchange, the system compares data records from the source system against data mappings stored in the metadata repository. (See Figure 1, bottom left.) In this stage, we find that the region name values stored in the HR ERP system do not conform to the established enterprise definitions. (The metadata repository contains a data mapping of the region names from the HR ERP system cross-referenced to the data warehouse enterprise definitions.) Thus, we need to use the merge infrastructure utility to update the organization record region names to reflect the enterprise versions.

In the fourth and final stage of this transformation example, target load determination, we compare the current load of organization records against those previously loaded in earlier batch cycles. (See Figure 2, top left.) In this particular case, processing is slightly more complex due to the SCDs in the organization dimension table. The established method for processing of type 2 SCD records is to sort both sources of information (in this case HR ERP organizational table records vs. warehouse organizational table records) based on the production or natural key (node ID) of the dimension table. We can use the current flag indicator to constrain which Organization table records we will need to compare to the current load-cycle data processed to the warehouse. Only those records containing a “Y” in the current indicator column are required for comparison to the HR ERP load cycle data because they contain the most updated information for the specific node ID (production key).

FIGURE 2 Start of target load determination stage.


We then make a consecutive comparison between the two data files based on node ID using the merge utility. Current load-cycle records with node ID key values not found in the Organization dimension table are loaded as new rows. (See Figure 2, top left.) These rows receive new surrogate keys for the Organization dimension and we set their current indicator flag to “Y.” The current flag setting is due to the fact that no other occurrence of the node ID combination exists in the dimension table.

Using our merge utility, we further interrogate matching node IDs found in both the load-cycle data and the organization dimension table based on columns we’ve deemed relevant for tracking changes made by business users — in this case, Region Name and Manager ID. (See Figure 2, bottom left.) We sort the two sources by the node ID and two relevant columns.

Next, we flag new rows for insertion: current load-cycle records that have relevant columns that do not match their corresponding organization dimension table rows, new region names, or manager IDs. (See Figure 3, top.) We insert these rows with new organization surrogate keys and set the current flag indicator to “Y” because the Region Name or Manager ID updated since the last load cycle.

FIGURE 3 SCD type 2 processing in stage 5.


Next, we assign a current indicator of “N” to previously loaded organization dimension rows that have a node ID matching the current HR ERP load-cycle file but have differences in relevant columns. (See Figure 3, bottom.) These rows receive this setting because of changes to the Region Name or Manager ID of the dimension in the current load cycle.

This same process of constraining on the current flag indicator and performing comparison on production keys between the dimension table and load cycle repeats during each update process to the data warehouse.

A Better Way

Because developing ETL processes is among the most complex and time consuming tasks in data warehousing, the project team must fully understand the infrastructure components and methodology involved. Fortunately, custom ETL processing offers these teams an intuitive, rapid, adaptable solution for implementing their back-end warehouse environment. By using existing infrastructure components and adopting a standard ETL methodology, companies can reduce development and operation costs for current and future decision-support projects.



Rate This Article

Comments:

Optional e-mail address:



Michael F. Jennings (mfjennin@hewitt.com) is a data warehouse architect, author, and conference speaker on business intelligence and data warehousing issues at Hewitt Associates. He is a contributing author to the book Building and Managing the Meta Data Repository: A Full Life-Cycle Guide, forthcoming in 2000 from John Wiley & Sons. Mike is also a regular columnist for the Real World Decision Support newsletter (www.ewsolutions.com).