The Golden Rules
What if you could build data validation into every information-centric application?
By David Loshin
Continued from Page 1

Domains. By assigning an attribute a data type, you indicate that it draws its values from a specific set of allowed values. Further, you expect that any value is taken from a value set that has some structural (syntactic) rules as well as explicit semantic rules governing validity. Either way, these expectations restrict the values that an attribute can take. Whether these rules are syntactic or semantic, you can define an explicit set of restrictions on a set of values within a type and call that a domain. Some examples of domains include U.S. states, country currency codes, credit card numbers (they have a predetermined length, and there are semantic rules governing validity based on a high-level parity calculation), and colors.

Mappings. You can also look at relationships among pairs of values that are taken from different domains. A mapping is a relation between domain A and domain B, defined as a set of pairs of values {a, b} such that a is a member of domain A and b is a member of domain B. There is an intuitive meaning to this mapping relationship. A familiar example of a mapping is the relationship between ZIP code and city: every ZIP code belongs to a named area covered by a small post office or postal zone.

Value Constraints
There are different data quality rules regarding nulls. One is whether or not an attribute allows nulls at all. Another kind of rule relates to previously defined null representations: if nulls are allowed, the rule specifies that a null value must use one of a set of defined null representations. A value restriction describes some business knowledge about a range of values, such as "test score is greater than 200 and less than 800." A value restriction rule constrains values to be within the defined range.

Consistency
Domain membership asserts that an attribute's value is always taken from a previously defined data domain.
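The definitions so far (domains, mappings, null representations, value restrictions, and domain membership) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every name and sample value is hypothetical, the state list is a small sample, the card length of 16 is an assumed default, and the Luhn checksum stands in for the validity calculation on credit card numbers, which the text does not name.

```python
# Illustrative sketch only: every name and sample value below is an assumption.

# A domain defined by enumeration (a small sample, not the full list of states).
US_STATE_CODES = {"CA", "MD", "NY", "TX"}

# A mapping: a set of (a, b) pairs with a from one domain and b from another.
STATE_NAME_TO_CODE = {
    ("California", "CA"),
    ("Maryland", "MD"),
    ("New York", "NY"),
    ("Texas", "TX"),
}

# The defined null representations (assumed for this example).
NULL_REPRESENTATIONS = {None, "", "N/A", "UNKNOWN"}


def in_domain(value, domain):
    """Domain membership: the value is drawn from the named domain."""
    return value in domain


def in_mapping(a, b, mapping):
    """Mapping membership: the pair (a, b) belongs to the named mapping."""
    return (a, b) in mapping


def is_defined_null(value):
    """True when the value is one of the defined null representations."""
    return value in NULL_REPRESENTATIONS


def in_credit_card_domain(number, length=16):
    """A domain defined by rules: a fixed length (syntactic) plus a checksum (semantic).

    The Luhn checksum is used as the concrete check; length=16 is an assumed default.
    """
    if len(number) != length or not (number.isascii() and number.isdigit()):
        return False
    total = 0
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def score_in_range(score):
    """Value restriction: test score is greater than 200 and less than 800."""
    return 200 < score < 800
```

Enumerated domains and mappings are plain sets here, so membership is a single lookup; a rule-defined domain such as credit card numbers needs a function instead, because its members cannot reasonably be listed.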
As an example of domain membership, an online catalog vendor may specify a domain of fabric colors, then assert that all sweaters that can be ordered online must be one of the named colors.

A mapping membership rule asserts that the relation between two attributes or fields is restricted based on a named mapping. An example enforces the mapping from U.S. state name to its corresponding two-letter postal abbreviation.

A completeness rule specifies that when a condition is true, a record is incomplete unless every attribute on a provided list is non-null. An example in the financial world would specify that if the security being traded is a stock option, the trade is incomplete unless a strike price and expiration date are provided.

An exemption rule says that if a condition is true, then the attributes in a named list shouldn't have values. For example, if the customer's age is less than 16, then the driver's license field should be null.

Consistency refers to maintaining a relationship between two (or more) attributes based on the content of the attributes. A consistency rule indicates that if a particular condition holds true, then a following consequent must also be true. An example in a credit analysis application might say that the amount allowed for a monthly mortgage payment must be no more than 35 percent of the monthly gross income.

The Improvement Process
You iteratively improve data quality by identifying sources of poor data quality, asserting a set of rules about your expectations for the data, and implementing a measurement application using those rules. In operation, a set of rules is instantiated at each point in the information flow where data quality conformance is to be measured. Each data instance is tested against all associated rules, and if no nonconformities are detected, the data instance is deemed to be valid; otherwise, it is said to be invalid.
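The measurement step described above, together with the completeness, exemption, and consistency rules, might be sketched as follows. This is a minimal illustration, not a definitive implementation: the field names, the rule names, and the choice of None and the empty string as null representations are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Rule = Callable[[Record], bool]  # a rule returns True when the record conforms


def is_null(value: Any) -> bool:
    # Assumption: None and the empty string are the defined null representations.
    return value is None or value == ""


def completeness_rule(condition: Rule, required: List[str]) -> Rule:
    """When the condition holds, every attribute in `required` must be non-null."""
    def rule(record: Record) -> bool:
        if not condition(record):
            return True
        return all(not is_null(record.get(name)) for name in required)
    return rule


def exemption_rule(condition: Rule, exempt: List[str]) -> Rule:
    """When the condition holds, every attribute in `exempt` must be null."""
    def rule(record: Record) -> bool:
        if not condition(record):
            return True
        return all(is_null(record.get(name)) for name in exempt)
    return rule


def consistency_rule(condition: Rule, consequent: Rule) -> Rule:
    """When the condition holds, the consequent must also hold."""
    def rule(record: Record) -> bool:
        return (not condition(record)) or consequent(record)
    return rule


# Hypothetical rule sets, one per kind of data instance.
TRADE_RULES: Dict[str, Rule] = {
    "option_complete": completeness_rule(
        lambda r: r.get("security_type") == "stock option",
        ["strike_price", "expiration_date"],
    ),
}

CUSTOMER_RULES: Dict[str, Rule] = {
    "minor_no_license": exemption_rule(
        lambda r: not is_null(r.get("age")) and r["age"] < 16,
        ["drivers_license"],
    ),
    "mortgage_cap": consistency_rule(
        lambda r: not is_null(r.get("monthly_mortgage"))
        and not is_null(r.get("monthly_gross_income")),
        lambda r: r["monthly_mortgage"] <= 0.35 * r["monthly_gross_income"],
    ),
}


def validate(record: Record, rules: Dict[str, Rule]) -> List[str]:
    """Test a data instance against all associated rules.

    Returns the names of the violated rules; an empty list means the instance is valid.
    """
    return [name for name, rule in rules.items() if not rule(record)]
```

A call such as `validate(record, CUSTOMER_RULES)` returns the names of the rules that failed rather than a bare true or false, so each invalid instance carries a hint about which expectation it violated.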
Data instances that fail the rules give you clues as to the source of the nonconformance, which can then be isolated and remedied. Given a set of rules that define fitness for use, and a mechanism for determining conformance of data instances to those rules, you have a measurement framework for data quality. Each data instance tested against a rule set can be scored across multiple dimensions. As you define more rules, you build a rule base that defines the basic expectations of "data fitness," against which each data instance (record, message, and so on) is measured, thereby providing an ongoing process for improved data quality.

In the End
There is growing industry recognition that data fitness is a prerequisite to the successful implementation of any information-centric application. Whether you're talking about operational or analytic applications, the results are trustworthy only to the extent that you can trust the validity of the information being input to these applications. By using a rule-based approach to data fitness, your company can baseline, measure, and detect nonconformities and address their root cause instead of just "treating the symptoms."

David Loshin [loshin@knowledgeintegrity.com] is president and CTO of Knowledge Integrity Inc.

EXAMPLES OF DATA QUALITY RULES
Data quality rules lurk in all kinds of documentation. When you start reading record layouts more carefully, you see that constraints are sometimes expressed in a comment. You can even tell that some rules are implicit in the field names themselves! A simple Internet search for "record layout" offers documents that have a definition that makes reference to one or more data quality rules:
Although these examples provide easy marks for data quality rules, some internally implemented databases aren't so accommodating. But careful reading of documentation and targeted questioning can turn anyone into a data quality rule detective!

RESOURCES
Loshin, David. Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, 2001.









