The Golden RulesWhat if you could build data validation into every information-centric application?
By David Loshin There is growing confusion in the business intelligence and data warehousing community about the difference between data "cleansing" and data "quality." Although many data cleansing products can help in applying edits to name and address data, or in transforming data during an extract-transform-load process, there is usually no persistence in this cleansing: Each time a data warehouse is populated or updated, the same corrections are applied to the same data. In reality, improved data quality is the result of a business improvement process designed to identify and eliminate the root causes of bad data. The repeated application of the same corrections to the same (or similar) data is essentially putting a Band-Aid over a gaping wound the real solution lies not in treating the symptom, but in identifying and addressing the root cause of a problem. A critical component of improving data quality is being able to distinguish between "good" (valid) data and "bad" (invalid) data. But because data values appear in many contexts, formats, and frameworks, this simple concept is clouded by conflicting definitions of "validity" because the validity of a data value must be defined within the context in which that data value appears. For example, there may be many ways in which customers refer to a corporation, but there is only one "legal name" under which the corporation is officially registered. In most contexts, any of the corporate aliases may be sufficient, but occasionally only the legal name is valid. If you really want to improve data quality, you must be able to measure the levels of data quality, and to do that, you must know what defines a valid value. In this article, I'll describe how to use a business rules approach to build a data validation engine that measures and reports levels of data quality based on user expectations. This approach derives its value specifically from the separation of business logic from logic implementation. The focus is on abstracting the data validation process from the rules that govern that process: You can use a rules system for transforming expectations into data quality rules that qualify data values within their context. When input into a rules engine, these rules help measure conformance with these expectations. Next, you can use this rules application to "gate" valid data from a data source to its destination, while shunting invalid data items into a tracking and reconciliation workflow. (See Figure 1.)
Business Rule ApproachA business rules system is designed to capture the knowledge of all the assertions, constraints, guidelines, policies, regulations, and so on that drive your business. For example, the process of booking an airline ticket involves a number of stages, ranging from the accumulation of information, to charging the customer's credit card, to issuing a reservation and confirmation number. But behind the scenes, a number of business rules are applied during this process, such as:
In this system, knowledge about a business process is abstracted and separated from its explicit implementation. Rules expressed in a predefined formalism can be integrated with a rules engine to create an application that implements the business process. The system has four components: A rule definition framework, incorporating rule definition into a user interface for creating, editing, and managing a rule base; a persistent rule base, enabling the storing, browsing, editing of defined rule sets; a rules engine that can read in a set of rules, connect to the requisite data sources and targets, and manage the state of the rules environment; and an execution framework, which could be a message stream, database system, or workflow environment. Implementing data validation using the business rules approach has some interesting advantages:
Data Quality Rules EnvironmentIn building a data validation system based on business rules, I like to use a framework for rules that relates directly to the way data sets are used. Ultimately, our set of rules will become part of the data set's metadata. Metadata is more than just the structural details of a data environment the names of tables, data attributes, and data types. Rather, any assertions or rules about data add to a more comprehensive metadata repository that describes more about the "hows" and "whys" of the data as well as the "whats." In a rule environment, you differentiate between different kinds of data. For example, reference data refers to mostly static data sets that are recognized as a standard, used for attribution, and possibly used by many clients, such as city names, or ZIP codes. Data sets that implement continuous operations are referred to as transaction data. You can also characterize the data in business intelligence systems as strategic data. To this end, let's assume that you can divide our rules into these groups: definitions of reference data, including data domains, mappings between domains, and null values; value constraints, which assign names to predefined validity ranges; and consistency rules, which describe consistency constraints among values within a single record, or across records in multiple tables. Reference DataMany explicit and implicit rules are embedded in the use of reference data. To better describe the use of reference data, it is necessary to use a formal definition of domains and mappings. When you identify known collections of data and assign meaning to them, you begin to get a better understanding of what information is being used in the organization, who is using it, and how that information is being used. Nulls, domains, and mappings are types of reference data. Although the values that make up the domains and the mappings may have actually been derived from transactional data, when you categorize the data sets as domains and the relations as mappings, and especially when you ascribe some meaning to the collections, you can move those sets into the reference data arena. Nulls. Relational DBMS systems usually provide a system null value, but there are really different kinds of nulls out there. A null value can mean one of the following:
If there are different real representations for each of these nulls, you can define rules to document them. For example, I recently saw a data set where some telephone number fields contained the string "No number provided." Clearly, the use of a defined null value qualified the type of null and distinguished it from all other null types.
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|











