CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





May 9, 2002

The Golden Rules

What if you could build data validation into every information-centric application?

By David Loshin

There is growing confusion in the business intelligence and data warehousing community about the difference between data "cleansing" and data "quality." Although many data cleansing products can help in applying edits to name and address data, or in transforming data during an extract-transform-load process, there is usually no persistence in this cleansing: Each time a data warehouse is populated or updated, the same corrections are applied to the same data. In reality, improved data quality is the result of a business improvement process designed to identify and eliminate the root causes of bad data. The repeated application of the same corrections to the same (or similar) data is essentially putting a Band-Aid over a gaping wound — the real solution lies not in treating the symptom, but in identifying and addressing the root cause of a problem.

A critical component of improving data quality is being able to distinguish between "good" (valid) data and "bad" (invalid) data. But because data values appear in many contexts, formats, and frameworks, this simple concept is clouded by conflicting definitions of "validity" because the validity of a data value must be defined within the context in which that data value appears. For example, there may be many ways in which customers refer to a corporation, but there is only one "legal name" under which the corporation is officially registered. In most contexts, any of the corporate aliases may be sufficient, but occasionally only the legal name is valid.

If you really want to improve data quality, you must be able to measure the levels of data quality, and to do that, you must know what defines a valid value. In this article, I'll describe how to use a business rules approach to build a data validation engine that measures and reports levels of data quality based on user expectations. This approach derives its value specifically from the separation of business logic from logic implementation. The focus is on abstracting the data validation process from the rules that govern that process: You can use a rules system for transforming expectations into data quality rules that qualify data values within their context. When input into a rules engine, these rules help measure conformance with these expectations. Next, you can use this rules application to "gate" valid data from a data source to its destination, while shunting invalid data items into a tracking and reconciliation workflow. (See Figure 1.)

Executive Summary

David Loshin

Despite growing recognition of data fitness as a prerequisite to successful application development, most of us tend to treat information conformance as an afterthought. But using a business rule approach, you can build data validation into any information-centric application.

Business Rule Approach

A business rules system is designed to capture the knowledge of all the assertions, constraints, guidelines, policies, regulations, and so on that drive your business. For example, the process of booking an airline ticket involves a number of stages, ranging from the accumulation of information, to charging the customer's credit card, to issuing a reservation and confirmation number. But behind the scenes, a number of business rules are applied during this process, such as:

  • Rules checking the customer's flight history
  • Rules for credit validation
  • Rules for checking availability within a specified service class
  • Rules for validating seating preferences
  • Rules for validating meal preferences.

In this system, knowledge about a business process is abstracted and separated from its explicit implementation. Rules expressed in a predefined formalism can be integrated with a rules engine to create an application that implements the business process. The system has four components: A rule definition framework, incorporating rule definition into a user interface for creating, editing, and managing a rule base; a persistent rule base, enabling the storing, browsing, editing of defined rule sets; a rules engine that can read in a set of rules, connect to the requisite data sources and targets, and manage the state of the rules environment; and an execution framework, which could be a message stream, database system, or workflow environment.

Implementing data validation using the business rules approach has some interesting advantages:

  • Speed of implementation — Because most development time involves analyzing business rules, you will have the opportunity to properly understand and document them. The fact that experts will be implementing complex business logic will reduce the time needed to complete the application.
  • Reuse — The same data quality rules will probably be applied in multiple databases. If you use a centralized rule repository, rule processing can be replicated and distributed across multiple servers across the enterprise.
  • Ease of modification — Because the rules aren't embedded in source code in unmanageable (and indeterminate) locations, changes to the rule base, as long as they don't cause inconsistencies within the rule base, can be integrated quickly.
  • Componentized development — A data validation system can be developed by selecting individual data tables and incrementally developing and deploying a rule set for each data set.
  • Persistently encapsulated business knowledge — An expanding rule base establishes an archive of data quality expectations. By recording your organization's data quality rules, your data validation scheme will survive the departure of subject matter experts.
  • Ease of reengineering — Managing business rules as content facilitates business process reengineering. Having all policies situated in one location enables analysts to understand what the application under investigation was meant to do.

Data Quality Rules Environment

In building a data validation system based on business rules, I like to use a framework for rules that relates directly to the way data sets are used. Ultimately, our set of rules will become part of the data set's metadata. Metadata is more than just the structural details of a data environment — the names of tables, data attributes, and data types. Rather, any assertions or rules about data add to a more comprehensive metadata repository that describes more about the "hows" and "whys" of the data as well as the "whats."

In a rule environment, you differentiate between different kinds of data. For example, reference data refers to mostly static data sets that are recognized as a standard, used for attribution, and possibly used by many clients, such as city names, or ZIP codes. Data sets that implement continuous operations are referred to as transaction data. You can also characterize the data in business intelligence systems as strategic data.

To this end, let's assume that you can divide our rules into these groups: definitions of reference data, including data domains, mappings between domains, and null values; value constraints, which assign names to predefined validity ranges; and consistency rules, which describe consistency constraints among values within a single record, or across records in multiple tables.

Reference Data

Many explicit and implicit rules are embedded in the use of reference data. To better describe the use of reference data, it is necessary to use a formal definition of domains and mappings. When you identify known collections of data and assign meaning to them, you begin to get a better understanding of what information is being used in the organization, who is using it, and how that information is being used.

Nulls, domains, and mappings are types of reference data. Although the values that make up the domains and the mappings may have actually been derived from transactional data, when you categorize the data sets as domains and the relations as mappings, and especially when you ascribe some meaning to the collections, you can move those sets into the reference data arena.

Nulls. Relational DBMS systems usually provide a system null value, but there are really different kinds of nulls out there. A null value can mean one of the following:

  • There is no value for this attribute.
  • There is a value; you just don't know it.
  • There is a value, but it isn't one that you recognize.
  • You don't know if there is a value.

If there are different real representations for each of these nulls, you can define rules to document them. For example, I recently saw a data set where some telephone number fields contained the string "No number provided." Clearly, the use of a defined null value qualified the type of null and distinguished it from all other null types.







IE Weekly Newsletter
Subscribe to the newsletter
    Email Address