Intelligent Enterprise

Better Insight for Business Decisions

May 9, 2002

The Golden Rules

What if you could build data validation into every information-centric application?

By David Loshin

Continued from Page 1

Domains. By assigning an attribute a data type, you indicate that it draws its values from a specific set of allowed values. Further, you expect that any value is taken from a value set that has some structural (syntactic) rules as well as explicit semantic rules governing validity. Either way, these expectations restrict the values that an attribute takes. Whether these rules are syntactic or semantic, you can define an explicit set of restrictions on a set of values within a type and call that a domain. Some examples of domains include U.S. states, country currency codes, credit card numbers (they have a predetermined length, and semantic rules based on a check-digit calculation govern their validity), and colors.
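
As a minimal sketch of the idea, the following Python shows one enumerated domain and one domain defined by rules. It assumes that card numbers are 13 to 19 digits long and that the validity calculation is the Luhn check-digit test commonly used for payment cards; the article names neither, and the function and variable names are illustrative.

```python
import re

# Enumerated domain: membership is set membership (abbreviated list for illustration).
US_STATE_CODES = {"NY", "CA", "TX", "NE"}

def luhn_valid(number: str) -> bool:
    """Semantic rule: Luhn check-digit test (an assumption; not named in the article)."""
    total = 0
    # Walk the digits right to left, doubling every second one.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def in_card_number_domain(value: str) -> bool:
    """Syntactic rule (digits only, assumed 13-19 long) plus the semantic check."""
    return re.fullmatch(r"\d{13,19}", value) is not None and luhn_valid(value)
```

The syntactic test runs first so that the check-digit arithmetic only ever sees strings of digits.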

Mappings. You can also look at relationships among pairs of values that are taken from different domains. A mapping is a relation between domain A and domain B, defined as a set of ordered pairs (a, b) such that a is a member of domain A and b is a member of domain B. The mapping usually carries an intuitive meaning. A familiar example is the relationship between ZIP code and city: every ZIP code belongs to a named area covered by a small post office or postal zone.
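
A mapping of this kind could be sketched in Python as a set of pairs. The ZIP codes and cities below are a tiny illustrative sample, and the helper names are hypothetical.

```python
# Two small domains (illustrative values only).
ZIP_CODES = {"10001", "10002", "68102"}
CITIES = {"New York", "Omaha"}

# A mapping is a set of (a, b) pairs with a drawn from domain A and b from domain B.
ZIP_TO_CITY = {
    ("10001", "New York"),
    ("10002", "New York"),
    ("68102", "Omaha"),
}

def is_well_formed(mapping, domain_a, domain_b) -> bool:
    """Every pair must take its first value from A and its second from B."""
    return all(a in domain_a and b in domain_b for a, b in mapping)

def in_mapping(mapping, a, b) -> bool:
    """True when the pair (a, b) belongs to the mapping."""
    return (a, b) in mapping
```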

Value Constraints

There are different data quality rules regarding nulls. One is whether or not an attribute allows nulls at all. Another kind of rule relates to previously defined null representations: if nulls are allowed, the rule specifies that a null value for the attribute must use one of a set of defined null representations.
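
The two null rules might be sketched as follows. The sets of allowed and suspect null strings are hypothetical; in practice they would come from profiling the data and from agreement with the data's owners.

```python
# Hypothetical: the one null representation that has been formally defined.
ALLOWED_NULLS = {"N/A"}
# Hypothetical: strings that usually mean "no value", whether defined or not.
SUSPECT_NULLS = {"", "NONE", "NULL", "-", "UNKNOWN", "N/A"}

def looks_null(value) -> bool:
    """Treat None and any suspect string (case-insensitive) as a null."""
    return value is None or str(value).strip().upper() in SUSPECT_NULLS

def check_null_rules(value, nulls_allowed: bool) -> list:
    """Return the list of null-rule violations for one attribute value."""
    if not looks_null(value):
        return []
    if not nulls_allowed:
        return ["null not allowed"]
    if value not in ALLOWED_NULLS:
        return ["undefined null representation"]
    return []
```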

A value restriction describes some business knowledge about a range of values, such as "test score is greater than 200 and less than 800." A value restriction rule constrains values to be within the defined range.

Consistency

Domain membership asserts that an attribute's value is always taken from a previously defined data domain. For example, an online catalog vendor may specify a domain of fabric colors, then assert that all sweaters that can be ordered online must be of one of the named colors.

A mapping membership rule asserts that the relation between two attributes or fields is restricted based on a named mapping. An example enforces the mapping from U.S. State name to its corresponding two-letter postal abbreviation.
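
Both membership rules can be sketched as checks against a record, here assumed to be a Python dict. The color domain is invented, the state table is deliberately partial, and storing the mapping as a dict assumes each state name has exactly one abbreviation.

```python
# Hypothetical domain of fabric colors for the catalog example.
FABRIC_COLORS = {"navy", "charcoal", "forest green"}

# Partial mapping from state name to two-letter postal abbreviation.
STATE_TO_ABBREV = {"Nebraska": "NE", "New York": "NY", "Texas": "TX"}

def domain_membership(record, attribute, domain) -> bool:
    """The attribute's value must be drawn from the named domain."""
    return record.get(attribute) in domain

def mapping_membership(record, attr_a, attr_b, mapping) -> bool:
    """The pair (record[attr_a], record[attr_b]) must appear in the mapping."""
    a = record.get(attr_a)
    return a in mapping and mapping[a] == record.get(attr_b)
```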

A completeness rule specifies that when a condition is true, a record is incomplete unless every attribute on a provided list has a non-null value. An example in the financial world would specify that if the security being traded is a stock option, the trade is incomplete unless a strike price and expiration date are provided.
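
A completeness rule might be sketched like this, with records as dicts and None standing for null. The field names are hypothetical.

```python
def completeness_rule(condition, required):
    """When condition(record) holds, every attribute in `required` must be non-null."""
    def rule(record) -> bool:
        if not condition(record):
            return True  # the rule does not apply to this record
        return all(record.get(name) is not None for name in required)
    return rule

# Stock option trades need a strike price and an expiration date.
option_trade_complete = completeness_rule(
    lambda r: r.get("security_type") == "stock option",
    ["strike_price", "expiration_date"],
)
```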

An exemption rule says that if a condition is true, then those attributes in a named list shouldn't have values. For example, if the customer's age is less than 16, then the driver's license field should be null.
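
The exemption rule is the mirror image of a completeness rule and can be sketched the same way. Again, the field names are hypothetical and None stands for null.

```python
def exemption_rule(condition, exempt):
    """When condition(record) holds, every attribute in `exempt` must be null."""
    def rule(record) -> bool:
        if not condition(record):
            return True  # the rule does not apply to this record
        return all(record.get(name) is None for name in exempt)
    return rule

# Customers younger than 16 should have no driver's license value.
minor_has_no_license = exemption_rule(
    lambda r: r["age"] < 16,
    ["drivers_license"],
)
```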

Consistency refers to maintaining a relationship between two (or more) attributes based on the content of the attributes. A consistency rule indicates that if a particular condition holds true, then a following consequent must also be true. An example in a credit analysis application might say that the amount allowed for a monthly mortgage payment must be no more than 35 percent of the monthly gross income.
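
A consistency rule has the shape "if condition, then consequent," which might be sketched as below. The loan_type condition is an assumption added to complete the example, the field names are hypothetical, and the 35 percent test uses integer arithmetic to avoid floating-point rounding at the boundary.

```python
def consistency_rule(condition, consequent):
    """If condition(record) holds, then consequent(record) must also hold."""
    def rule(record) -> bool:
        return (not condition(record)) or consequent(record)
    return rule

# Monthly mortgage payment must be no more than 35 percent of monthly gross income.
mortgage_affordable = consistency_rule(
    lambda r: r.get("loan_type") == "mortgage",  # assumed condition
    lambda r: r["monthly_payment"] * 100 <= 35 * r["monthly_gross_income"],
)
```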

The Improvement Process

You iteratively improve data quality by identifying sources of poor data quality, asserting a set of rules about your expectations for the data, and implementing a measurement application using those rules. In operation, a set of rules is instantiated at each point in the information flow where data quality conformance is to be measured. Each data instance is tested against all associated rules, and if no nonconformities are detected, the data instance is deemed to be valid; otherwise, it is said to be invalid. Data instances that fail the rules give you clues as to the sources of the nonconformance, which can then be isolated and remedied.
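
The test described here, every data instance checked against every associated rule, could be sketched as follows. The rule names, field names, and the RULES table are hypothetical; a real measurement application would load its rules from a managed rule base.

```python
# A hypothetical rule set for one measurement point: name -> predicate on a record.
RULES = {
    "score_in_range": lambda r: 200 < r["test_score"] < 800,
    "state_in_domain": lambda r: r["state"] in {"NE", "NY", "TX"},
}

def validate(record, rules) -> list:
    """Return the names of the rules the record violates; an empty list means valid."""
    return [name for name, rule in rules.items() if not rule(record)]

def partition(records, rules):
    """Split records into valid ones and (record, failed rule names) pairs."""
    valid, invalid = [], []
    for record in records:
        failures = validate(record, rules)
        if failures:
            invalid.append((record, failures))
        else:
            valid.append(record)
    return valid, invalid
```

Keeping the names of the failed rules alongside each invalid record preserves the clues needed to trace a nonconformance back to its source.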

Given a set of rules that define fitness for use, and a mechanism for determining conformance of data instances to those rules, you have a measurement framework for data quality. Each data instance tested against a rule set can be scored across multiple dimensions. As you define more rules, you build a rule base that defines the basic expectations of "data fitness," against which each data instance (record, message, and so on) is measured, thereby providing an ongoing process for improved data quality.
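
One way to turn rule conformance into a measurement is to report, for each rule, the fraction of data instances that pass it. The sketch below treats each rule as one scoring dimension, which is an assumption; the article does not prescribe a scoring formula, and the function name is hypothetical.

```python
from collections import Counter

def conformance_report(records, rules) -> dict:
    """Map each rule name to the fraction of records that pass that rule."""
    passed = Counter()
    for record in records:
        for name, rule in rules.items():
            if rule(record):
                passed[name] += 1
    total = len(records)
    # With no records there is nothing to fail, so report full conformance.
    return {name: (passed[name] / total if total else 1.0) for name in rules}
```

Running the same report after each remediation cycle gives a baseline and a trend for every rule in the rule base.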

In the End

There is growing industry recognition that data fitness is a prerequisite to the successful implementation of any information-centric application. Whether you're talking about operational or analytic applications, the results are trustworthy to the extent that you can trust the validity of the information being input to these applications. By using a rule-based approach to data fitness, your company can baseline, measure, and detect nonconformities and address their root cause instead of just "treating the symptoms."


David Loshin [loshin@knowledgeintegrity.com] is president and CTO of Knowledge Integrity Inc.


EXAMPLES OF DATA QUALITY RULES

Data quality rules lurk in all kinds of documentation. When you start reading record layouts more carefully, you see that constraints are sometimes expressed in a comment. You can even tell that some rules are implicit in the field names themselves!

A simple Internet search for "record layout" turns up documents whose field definitions refer to one or more data quality rules:

  • In the Nebraska Employer File (www.nenewhire.com/employerpacket/File-Layout.pdf), Table Employer Record, the "Phone Number of Contact Person" field contains this definition: Must be in a format with area code first then number. This field must have numbers only.
  • In the HCFA Renal Provider File (www.hcfa.gov/stats/esrp0101.htm), the "Provider Master" field is a six-character element with the following restrictions: Identification number of provider. First 2 digits = State Code (see Attachment A). Next 4 digits beginning with:
    0 = Short Stay Hospital
    20 = Long Term Hospital
    25-28 = Free-Standing Renal
    33 = Children's Hospital
    35 = Hospital ESRD Satellite
  • In the FIPS55 Specific Record Layout (geonames.usgs.gov/layout.html), the "Postal Name Match" field has this constraint: "A 'G' = matches the Place Name to the USGS name. A 'P' = matches the Place Name to the U.S. Post Office."
  • In the InfoUSA record layout for its government data product (www.infousagov.com/business4.htm), there is a data field named "SIC Code." If you're familiar with industry classification, you would know that this acronym refers to the Standard Industrial Classification code, and that the set of these codes forms a defined data domain and mapping. More information about industry classification codes can be found at www.census.gov/epcd/www/naics.html.

Although these examples provide easy marks for data quality rules, some internally implemented databases aren't so accommodating. But careful reading of documentation and targeted questioning can turn anyone into a data quality rule detective!
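
Two of the layout definitions above could be turned into executable rules along these lines. This is a sketch under stated assumptions: the phone rule assumes a 10-digit number, the provider rule assumes all six characters are digits, the state code is not checked because Attachment A is not reproduced here, and only the five facility prefixes quoted above are recognized.

```python
import re

def phone_rule(value: str) -> bool:
    """Numbers only, area code first (assumed to mean exactly 10 digits)."""
    return re.fullmatch(r"\d{10}", value) is not None

# Facility prefixes quoted in the layout; the real list may be longer.
FACILITY_PREFIXES = [
    ("0", "Short Stay Hospital"),
    ("20", "Long Term Hospital"),
    ("25", "Free-Standing Renal"),
    ("26", "Free-Standing Renal"),
    ("27", "Free-Standing Renal"),
    ("28", "Free-Standing Renal"),
    ("33", "Children's Hospital"),
    ("35", "Hospital ESRD Satellite"),
]

def provider_facility_type(provider_id: str):
    """Return the facility type implied by digits 3-6, or None if unrecognized."""
    if re.fullmatch(r"\d{6}", provider_id) is None:
        return None
    suffix = provider_id[2:]  # the first two digits are the state code
    for prefix, facility in FACILITY_PREFIXES:
        if suffix.startswith(prefix):
            return facility
    return None
```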


RESOURCES

Loshin, David. Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, 2001








