SUBSCRIBE

June 5, 2000, Volume 3 - Number 9


First Analysis

This analysis-led approach to building a data warehouse can help you avoid data quality problems before they occur

By Colin White



New IT data integration and migration projects designed to give business users consistent, clean business information often fail. The reason for failure is a simple one: The IT organization can’t handle the complex data quality issues involved in a cost-effective or timely manner. These problems frequently arise because the organization launches the project without clearly understanding the source data that it has to extract and then load into the new system, or even whether this data can support the new application. For example, when implementing a data warehouse that provides performance metrics for customer and product revenues, you must identify the source systems that contain raw revenue data as well as determine the business rules that will help maintain that data.

To help solve data quality problems and reduce project failures, your development organization must thoroughly understand the source data before embarking on new data integration and migration projects, as well as implement procedures and tools to monitor data quality during data extraction and conversion. I’ll discuss here how you can maintain data quality during data integration and migration projects, focusing specifically on ensuring data quality during the building of a data warehouse. I’ll also explain how you can use data profiling, data reengineering, and data warehouse extract-transform-load (ETL) tools cooperatively to form an analysis-led approach to data warehouse design and development. (See Figure 1)

FIGURE 1 Analysis-led data warehouse development.


Breakdown

First, let’s look at the tasks and issues involved at each stage of the development process.

Identify user requirements and design data warehouse information stores. The first task in the design and development process is to build a business model of the user requirements for the business area the data warehouse addresses. You obtain these requirements via interviews with business users and then identify the business information and analytic processing the data warehouse needs to support. As you add each business area model to a data warehouse, you should integrate it into a common business model of your organization’s information requirements. The organization will adopt this common business model to share information and enforce common business names across different business areas of the data warehousing system.
THREE GENERATIONS
HOW DATA MIGRATION IS EVOLVING TOWARD AN ANALYSIS-LED APPROACH

Whether involving the introduction of a new ERP package, new CRM system, or data warehouse, data-related issues consume an estimated 30 to 40 percent of deployment efforts and costs. Until the 1990s, organizations usually addressed data migration through manual analysis and coding of ETL programs. Such data migration initiatives within IT often caused projects to overspend on budget, run late, and in the most extreme cases, fail altogether. You could consider this process the first-generation approach to data migration.

In the mid ’90s, the next generation of solutions arrived as vendors began to offer automatic creation of program code for ETL. At the same time, ETL vendors promoted an ETL-driven approach for data migration. Essentially, this approach started with the rather uninspiring thought that although the analysis would always be wrong, building data migration code rapidly could quickly help you discover your errors. This second-generation ETL-driven approach to data migration created a new set of issues, including continual program changes necessitated by poor legacy data analysis.

For example, as one data migration manager at Shell UK told me, his organization had hoped code-generation tools would realize huge cost and time savings during the build phase of its migration to SAP R/3. In reality, the anticipated savings were eaten up by continual program changes necessitated by poor legacy data analysis. That experience alerted Shell UK to the critical need for comprehensive data analysis from the start.

More recently, a new breed of data migration tool has appeared, a class of tools that for the first time automates the data migration process. These so-called data profiling tools offer automated data analysis, which when combined with appropriate methodologies, lets organizations radically reduce the risks associated with data migration. These tools — currently available from vendors such as Informix/Ardent and Evoke Software — support a third-generation analysis-led approach to data migration.

Anecdotally, users of these tools are pleased. According to Tony Eccleston, program manager of Glaxo-Wellcome’s global data warehouse, his company believes that data quality is a common reason for decision-support system failure. As Eccleston puts it, “During our initial projects, the effort to understand and integrate data was very resource intensive. Today, the ability to thoroughly analyze data — and quickly establish awareness of data quality with business stakeholders — makes data a business issue, and not an IT one.”

Richard John (richard.john@askalchemy.com) is CEO of Alchemy Migration Services, a U.K.-based consulting company that provides data migration services.

The next task is to design the data warehouse information stores using the business area model as input. When the data warehouse information store design is complete, you need to design a staging-area model for the business area under development. A staging area is the key to providing data quality and integration in a data warehousing environment; funneling extracted source data through there ensures the system’s information stores contain a consistent version of the cleansed source data. In this process, extracted source-system data is integrated here, checked for quality, cleansed, and then used to populate the various information stores of the underlying data warehouses and data marts that make up the data warehousing system.

As with the common business area model, the staging area’s design evolves as the data warehouse incorporates more business areas. This design consists of a normalized data model that documents the detailed data for the business areas the data warehouse supports. Each of these business areas in turn consists of business components such as product and customer.

The last task at this stage of the data warehouse design process is to develop the data mappings for mapping data from the staging area to the data warehouse.

Identify and profile candidate source data. At this stage of the development process, you must determine whether any data in transactional applications can serve as a raw data source for the data warehouse. If the requisite data does not exist, you must obtain it by other means; otherwise, you’ll need to modify project requirements (and therefore user expectations) accordingly.

During the source-data identification process, many data warehousing projects rely solely on definitions such as old project documentation or application source code to obtain accurate descriptions of the source data. For many systems, however, this documentation is often out of date and inaccurate. Even if the source data is well documented, no guarantee exists that the source files’ structure and content match their descriptions or satisfy your organization’s current business rules.

You must, therefore, verify the quality of potential data sources in order to maintain the project timeline and budget. You’ll need to identify and correct data structure and content errors as early as possible in the data conversion process — otherwise, data quality problems will have a negative impact on downstream processes such as the ETL jobs that extract source data for loading into the staging area and data warehouse information stores. Undetected data quality problems can also occur into the data warehouse, which will ultimately degrade its value as a decision-support platform. For example, if the revenue performance information in the data warehouse I mentioned earlier were incorrect or based on erroneous assumptions, the accuracy of downstream campaign-management applications would be at risk.

The first step in ensuring data quality in a data warehouse is to obtain a complete description of the source data by profiling the structure and integrity of potential data sources. Then you should compare the results with existing source data documentation (metadata) if it exists, or create it if it does not. Profiling involves analyzing the source data, taking an inventory of available data elements, and checking the format of those data elements. Furthermore, business users must be involved to ensure that you extract the source data’s business meaning effectively and efficiently.

Profiling can also determine the domain, integrity, and other business rules in the source data, and verify that those rules match the business rules for the proposed data warehouse. This business rule analysis determines data relationships both within a file and among files.

It is important during data profiling to determine the degree to which the source data violates business rules. The quality of the data here will affect the ability to integrate and merge files in downstream processes. If you find problems, you can use data reengineering tools to fix them.

The output from the data profiling task consists of up-to-date, accurate metadata about the source data files. This metadata lets the development team identify the source files that will best satisfy user information requirements and develop an accurate data model of that source data. It also helps determine if the source data is in a suitable condition for extraction and provides a sound data quality frame of reference for the data reengineering processes that may subsequently correct data content errors as information flows from source systems into the data warehouse. You should also use the output from data profiling to help you determine if you need to adjust the staging-area and data warehouse designs based on the source data’s format and content.

Develop source data to staging area mappings. When you’ve established an accurate picture of the source data, you can now document the mappings that will extract data from the source files and load it into the staging area. Some data profiling tools automate the creation of these mappings using the output from the data profiling process and a user- supplied staging-area data model. Furthermore, several ETL tool and analytic application vendors provide extract and transformation templates that contain predefined mappings for extracting data from ERP systems such as SAP R/3 and PeopleSoft and loading it into a vendor-defined staging area or data mart.

Extract the required source data and load it into the staging area. The next step is to extract the required data from the source files and load it into the staging area. The objective at this point is to extract data from individual source files and reformat it, as thoroughly as possible, into a form that matches the business area business components in the data warehouse. At this stage, you should make few attempts to cleanse and transform data significantly or to merge data from multiple files.

Many ETL tools are available for extracting source data and loading it into the staging area, usually by means of a proprietary definition language. The mappings you documented earlier can assist in defining the data mappings for source-data extraction and reformatting it for loading into the staging area.

Clean and transform the extracted data and merge it in the staging area. If required, you can cleanse the data in the staging area based on data profiling results. Data cleansing involves identifying and correcting business rule errors and fixing data inconsistencies. The process is complex, and the tasks associated with cleansing data are usually application specific.

Various data reengineering tools are available for this task. You can invoke several of them directly from ETL products, and if significant data cleanup is not required, the cleanup can occur during source data extraction. When the data is clean, ETL tools can then transform and merge the staging area data into a format that matches the staging-area data model. Data transformation involves operations such as decoding coded fields, adding a timestamp field, and handling any remaining business rule validation and data cleanup tasks not addressed by data reengineering tools. For less complex projects, the transformation and merging processes are sometimes performed when extracting the source data.

Extract and transform staging area data and load it into the data warehouse information stores. The final task is to extract the staging-area data, transform it, and load it into underlying data warehouses and data marts. Data transformation at this stage primarily involves building summarized and aggregated information, rather than data cleanup. Again, ETL tools can perform the data extraction, transformation, and loading.

The Role of ETL Tools

The purpose of ETL tools is to eliminate or reduce the need for in-house coding of extract and transformation programs. These tools operate according to rules defined by the developer; they either generate batch extract and transformation programs (in C or COBOL, for example) based on those rules, or interpret the rules at runtime while extracting source data.

The main benefit of ETL tools that generate code is that they usually support more complex data transformations than interpretive ETL products, and are therefore particularly well suited to projects involving large amounts of legacy data. If necessary, you can customize the generated code or enhance it using in-house coded exit routines. One problem with code-generation tools, however, is that they often create large numbers of batch programs that are difficult to manage, especially if you need to update the extract and transformation rules frequently and regenerate the program code. Examples of ETL tool code generators include the Informix/Ardent Warehouse Executive and Evolutionary Technologies International’s ETI-Extract tool suite.

The current direction of ETL vendors is to offer an alternative to the code- generation approach by supplying products that incorporate their own language for defining ETL tasks. Instead of generating code, these tools interpret these definitions during ETL routines. Many also supply customizable templates for extracting data from transaction processing packages such as ERP systems. The key benefit of these tools is that they are faster, cheaper, and easier to install and use than code-generation products. Consequently, they are ideal for smaller data mart projects that require rapid development times. Examples of interpretive ETL tools include Informix/Ardent DataStage, IBM Visual Warehouse (now called DB2 Warehouse Manager), Informatica Corp.’s PowerCenter, and Oracle Data Mart Suites.

Many organizations are finding that regardless of their choice, ETL products can fail to address many common quality problems in source data. Fixing data errors in ETL tools is time consuming and resource intensive, particularly when poor-quality source data leads to many transformation errors. Therefore, you should use data profiling and reengineering tools to analyze and identify problem areas and clean the data prior to ETL tool processing. You could characterize this third-generation approach to data migration and integration as analysis-led, rather than ETL-driven, data warehouse development. (See sidebar, “Three Generations.”)

Data Profiling Tools

Data profiling products support an analysis-led approach to building a data warehouse, helping you thoroughly understand source data and check for data quality problems. If problems do exist, you can use a data reengineering tool to preprocess and clean the data prior to ETL processing.

Data profiling involves analyzing source data to identify the data element and data integrity business rules. When you’ve identified these rules in the data, you can use them to update existing source data documentation and data models, or to create them if they do not exist. The ability to validate existing source data documentation is important because documentation for legacy systems is frequently out of date or incomplete.

In addition to data warehousing, you can also use data profiling tools in data integration and migration projects and for moving data from legacy systems into packaged applications such as ERP and customer relationship management (CRM) systems. Some innovative companies are also using data profiling to implement e-business applications and integrate them with back-office systems.

Examples of products that focus on data profiling include the Informix/Ardent Quality Manager and Evoke Software Inc.’s Migration Architect. Although you could classify both of these products as data profiling tools, they are quite different. Evoke’s Migration Architect, for example, places stronger emphasis on data discovery and analysis to uncover business rules in source data than Quality Manager, which focuses more on data validation.

If data profiling identifies quality problems in the source data, then you can use data reengineering tools to thoroughly investigate the problems and then correct them. Many of these products focus on the extraction of data from free-form text fields and the cleansing and consolidation of name and address data. This latter capability is particularly important for CRM applications. Products that support data reengineering include the Trillium Software System (recently acquired by Merant) and Vality Technology Inc.’s Integrity.

Quality Rules

As I’ve described here, you can use data profiling, data reengineering, and ETL tools together to support an analysis-led, rather then ETL-driven, approach to data warehouse development. This approach will let you detect and fix data quality problems earlier in the development cycle, and reduce the risk of data quality problems in the data warehouse. However, although this approach is a significant step forward, it is not a panacea or a substitute for addressing data quality problems in the source systems. As Larry English, a leading data quality consultant, points out, “The process of information quality improvement is one of continuous process improvement of any and all processes, to eliminate the causes of defective data.”


Rate This Article

Comments:

Optional e-mail address:



Colin White (cwhite@databaseassociates.com) is an analyst and consulting specializing in data warehousing and business intelligence.


RESOURCES

ETI: www.eti.com
Evoke Software: www.evokesoft.com
Informatica: www.informatica.com
Informix: www.informix.com
Merant: www.merant.com
Oracle: www.oracle.com
Vality: www.vality.com

 


Copyright © 2000 CMP Media Inc. ALL RIGHTS RESERVED
No Reproduction without permission