Intelligent Enterprise

Better Insight for Business Decisions



December 21, 1999, Volume 2 - Number 18



Living Legacy

Discovering and documenting embedded business rules can help you automate data warehouse certification as well as improve the quality of business intelligence enterprisewide

By David Loshin



All business processes and data sets have rules; unfortunately, these rules are rarely documented in any form. In fact, they are frequently expressed as “lore” passed from one “generation” of managers and engineers to another by word of mouth. Worse yet, these rules are likely to have been forgotten over time, having been implemented in software that happily chugs away until the business environment changes, making the rules obsolete in the best case and incorrect in the worst.

When business rules remain undocumented, the chances are high that their meanings will be lost during a staffing change. And when business rules are lost, the opportunity to take advantage of the information resources involved is squandered as well.

Fortunately, three needs share a common solution: documenting and capitalizing on the business rules embedded in your legacy systems, enhancing data quality in the data warehouse, and improving business intelligence in your organization. As your business moves its old databases into modern intelligent information platforms such as data warehouses and marts, applying data quality rules to a data set will expose areas where the information does not conform to your expectations, which is where embedded business rules tend to hide. Discovering those rules can improve the quality of enterprise data through information consolidation and provide the basis for a rule-based data certification process for data warehouse migration.

The methodology I’ll describe here is based on the classification of value sets, called domains, and the classification of relationships between sets, called domain mappings. After making these classifications, you can then characterize data quality rules and business rules as assertions about a data set based on attribute domain membership and relationships.

Benefits of the Data Quality Process

Each business rule represents aspects of a business process, so collecting them is an important information quality goal. The opportunity to take advantage of this resource exists when you understand your system’s metadata, especially if a formal method is available for collecting, documenting, and validating business rules. Because your expectation is that data conforms to the rules, if you collect the rules and represent them coherently, you can use them as input to an information validation process.

This process entails more than just address standardization or de-duplication, however. A data quality process driven by senior management includes assessing data quality, focusing on areas for improvement, executing methods for improvement, and then measuring improvement. You can incorporate a data quality rule-discovery phase between the first and second steps.

Some typical data quality issues that imply the presence of embedded business rules include:

Attribute Overloading. Stewards of production applications hesitate to modify an underlying data model because such changes put a lot of stress on application behavior. When a new attribute is required, programmers may look for an infrequently used attribute and overload it with values for the new virtual attribute. For example, in a sales database that has an industry code field filled for business customers, using that field for a numeric code to represent a classification of the business value of residential (nonbusiness) customers would be a case of overloading.

Attribute overloading is typically manifested in program code with conditional statements whose tests ensure the proper treatment of the overloaded attribute. This code implies an embedded business rule specified on the values that an attribute may take, and under which circumstances those values may be used.

Semantic Consistency. Semantic consistency refers to the consistency of definitions, meanings, and names of objects among attributes within a data model, as well as similarly named attributes in different data sets. If two attributes represent different meanings, they should be assigned distinct names. Conversely, if two attributes draw their values from the same domain, their names should reflect that similarity. Recognizing that multiple attributes in different tables draw their values from the same data domain reveals an important business relationship across an enterprisewide information repository by showing how distributed data systems rely on the same reference data.

Default Null Values. Database systems without explicit null representations tend to contain artificial default values. A timely example is the Y2K problem associated with September 9, 1999. When entering dates, if no value was provided for the attribute and the system would not accept an empty field, a data entry technician might have entered 9/9/99 to force the system to accept a record. Because many business activities are predicated on the presence or absence of values, the types and representations of null values represent another dimension of business rules.

Record Completeness. A similar data quality problem occurs when records are incomplete — that is, the attributes for a particular record are missing information. But just because an attribute’s value is missing doesn’t mean that the record is incomplete; you can use business rules that specify record completeness to filter out nonconforming records.
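To make the attribute-overloading case above concrete, here is a minimal sketch of the kind of conditional that tends to hide such a rule. The field names (customer_type, industry_code) and the function are hypothetical illustrations, not taken from any real system.

```python
def interpret_industry_code(record: dict) -> tuple[str, str]:
    """Return (meaning, value) for the overloaded industry_code field.

    Hypothetical example: the test on customer_type is the embedded
    business rule that decides how the field must be read.
    """
    code = record["industry_code"]
    if record["customer_type"] == "business":
        # Original use: the field holds an industry classification.
        return ("industry", code)
    # Overloaded use: for residential customers the same field holds
    # a numeric code classifying the customer's business value.
    return ("residential_value_class", code)
```

Finding a branch like this in legacy code is a signal that the rule ("industry_code means something different for residential customers") should be documented explicitly.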

Domains, Mappings, and Data Quality Rules



FIGURE 1 The result of the discovery process: data quality and business rules.


Data modelers assign types to defined attributes, but these types are very general. For example, an attribute called STATE may have been defined to be CHAR(2). While there are 676 distinct strings of two uppercase letters (26 × 26), the values populating that attribute are limited to the 62 two-letter United States state and possession abbreviations. This implicit business rule can be explicitly stated as “All values in the attribute STATE must also belong to the set of recognized USPS state and possession abbreviations.”

Domains. A set of values described as a restriction or subset of the values allowed within the base type is called a domain. There are two different kinds of domains: enumerated sets, which are likely to draw focus because of their relationship to preexisting business set definitions; and rule-oriented sets, which are defined through constructive rules such as “All three-character strings where the first character is A, and the third character is a digit.”
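The two kinds of domains can be sketched as follows. This is an illustration under assumptions: the state set is a deliberately partial sample rather than the full USPS list, and the function and constant names are mine.

```python
import re

# Enumerated domain: an illustrative, partial subset of two-letter
# abbreviations (NOT the complete USPS list).
STATE_DOMAIN = frozenset({"CA", "MD", "NY", "TX", "PR"})

# Rule-oriented domain from the text: three-character strings whose first
# character is "A" and whose third character is a digit.
A_DIGIT_PATTERN = re.compile(r"A.\d", re.ASCII)


def in_enumerated_domain(value, domain):
    """Membership in an enumerated domain is a simple set lookup."""
    return value in domain


def in_rule_domain(value, pattern):
    """Membership in a rule-oriented domain means the whole value fits the rule."""
    return pattern.fullmatch(value) is not None
```

For example, in_enumerated_domain("MD", STATE_DOMAIN) holds, while in_rule_domain("BX7", A_DIGIT_PATTERN) does not because the first character is not A.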

Mappings. A mapping asserts a relationship between the values in two chosen domains. It is represented as a pair of domains (X and Y) and a set of value pairs (x, y), where all x values belong to domain X and all y values belong to domain Y. Mappings may be one-to-one, one-to-many, many-to-one, or many-to-many.

Data Quality Rules. Data quality rules describe the relationship between a set of attributes and the domains and mappings to which they belong, along with the null value representations and the completeness constraints. These rules are a subset of the range of data quality rules that you can use to address the problems I described earlier:

Domain membership rules assert that the values populating an attribute are taken from a predefined domain.

Mapping membership rules assert that the relationship between two sets of attributes conforms to a predefined mapping.

Null value rules specify the representation of null values and whether they are allowed.

Completeness rules specify when a record is complete.
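Null value and completeness rules can be expressed as simple predicates. The sketch below assumes a hypothetical record layout (dictionaries keyed by attribute name) and a hypothetical set of null representations that includes the artificial 9/9/99 default mentioned earlier.

```python
# Hypothetical null representations for a date attribute.
DATE_NULLS = {"", "9/9/99"}


def is_null(value, null_representations):
    """A value is null if it is absent or is a known null stand-in."""
    return value is None or value in null_representations


def is_complete(record, required, null_reps_by_attr):
    """A record is complete when every required attribute is non-null.

    Attributes without a declared null representation fall back to
    treating the empty string as null (an assumption of this sketch).
    """
    return all(
        not is_null(record.get(attr), null_reps_by_attr.get(attr, {""}))
        for attr in required
    )
```

Which attributes are required is itself the business rule: the same record can be complete for one process and incomplete for another.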

Business Rule Discovery through Data Quality

Domain discovery is the recognition that a set of values is classified as a set, and that attributes draw their values from that set. Mapping discovery is the recognition that two distinct sets of attributes X and Y each belong to a specific domain, and that the relation between those two sets is embodied in a mapping between those domains.

When you’ve discovered these domain and mapping relationships, you should proceed to a manual validation phase to verify the discovered domains, mappings, and rules. You can then use the discovered rules as input to a data quality validation rules engine, which can certify the data quality of a data warehouse, input validation gateway, or even a data-dependent GUI generator system.

You can then perform subsequent analysis to understand the business meaning of the discovered domains and mappings. This stage applies a semantic meaning to a domain, mapping, or rule, and can then serve as the basis for the validation of automated business processes. Some benefits of this process include:

•Determining that an attribute belongs to a domain lets you document the domain and create a means for validation. For example, in a financial product database, products of different types are treated differently; trades involving equity products have different settlement requirements than debt products or options. But the industry is constantly evolving, with new types of products being invented all the time. Creating a special domain for financial product type eases the introduction of new product types and guarantees that no trades involving unsupported product types may occur.

•Determining that attributes from different tables use the same domain provides a means for consolidating shared reference data from across the enterprise, thereby lowering the risk of inconsistency in enterprise data. It also establishes the existence of an inadvertent relationship between those attributes or one that simply had been forgotten, but may prove to be valuable. A good example is a company whose billing department uses one database for account information, attributed with customer name and address, while the sales department has its own database attributed with customer name and address. Determining that the set of customers is a shared resource would help the company migrate to a more customer-focused business.

•Discovering that two (or more) sets of attributes are related via a domain mapping establishes a business rule going forward with respect to those sets of attributes. For example, rules of this sort, when applied in sales and marketing systems, can uncover opportunities for up-sells and cross-sells based on known customer behavior. Another example is the consolidation of approval criteria inside credit-approval application systems.

•Managing domains and mappings under explicit stewardship provides a framework for the maintenance of that enterprise resource.

•If the data stewards choose, the domain may be stored once and referenced via a numeric encoding, thereby saving space when referring to domain values. Using an enterprisewide, accepted set of numeric encodings for known domain values embedded in request-reply transactions reduces the amount of data transmitted inside each message. Thus, client/server applications that rely heavily on network communication can be more efficient.

•Extracting the rules into a formal definition transforms an execution-oriented object into one that can be treated as content, increasing the value of the enterprise information resource. Giving data customers the opportunity to review their business rules in a form that does not require knowledge of complex programming languages can simplify the user-requirements analysis and validation process.

•You can use normalization metadata, such as identifying relations among attributes within the same table when a pairwise correlation exists (for example, there is a one-to-one relation between street name and telephone number), to transform an unnormalized table into normal form. Organizations that are upgrading from nonrelational DBMSs to RDBMSs will use mappings heavily for both the data and application migration processes.

•Relational metadata is embedded in domain information. For example, if all customer account numbers referred to by ACCOUNT in one table are the same accounts referred to by ACCT_NUM in another table, you infer that the two attributes refer to the same set of objects.

Techniques for Discovery

The brute force method for identifying domains is to look at all possible value sets. For each column of each table, select the distinct values into a set; each such set is a domain candidate. You can then apply heuristics to narrow the candidates. For example, if the number of distinct values is similar to the number of records from which the values were extracted, that set is less likely to be a domain than if the number of distinct values is much smaller than the record count. Similarly, if the distribution of the values over the column is relatively equal, that set is more likely to be a domain.
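The two heuristics above can be sketched as a small column profile. The 0.1 cutoff and the min/max evenness measure are assumptions chosen for illustration; the article does not prescribe specific values or formulas.

```python
from collections import Counter


def profile_column(values):
    """Collect the distinct values of a column plus two heuristic scores."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    if not counts:
        return {"distinct": set(), "distinct_ratio": 0.0, "evenness": 0.0}
    return {
        "distinct": set(counts),
        # Near 1.0: almost every record has its own value (unlikely domain).
        "distinct_ratio": len(counts) / len(non_null),
        # 1.0 means perfectly even usage; near 0 means heavily skewed.
        "evenness": min(counts.values()) / max(counts.values()),
    }


def is_domain_candidate(profile, max_distinct_ratio=0.1):
    """Assumed rule of thumb: few distinct values relative to record count."""
    return bool(profile["distinct"]) and profile["distinct_ratio"] <= max_distinct_ratio
```

A column of 100 records holding only two state codes would qualify as a candidate, while a column of 100 unique identifiers would not.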

Upon domain identification, you assign a name and a semantic meaning to that domain, insert it into the reference data tables, and document the domain membership rule for each attribute whose values are taken from that domain, thereby building the domain “inventory.” This process paves the way for domain membership analysis.

•Domain Membership Analysis Through Value Matching. Another method for domain identification is to analyze how well the set of values used to populate an attribute matches the values of a pre-existing domain. If every value used in the attribute matches a value belonging to the domain, the attribute belongs to that domain. If not, look for the domain whose values most closely match the attribute. (See Figure 2.)



FIGURE 2 Domain identification.
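Value matching can be sketched as a set-overlap score against each known domain. The scoring choice (fraction of the attribute's distinct values found in the domain) and the function name are assumptions of this sketch.

```python
def best_matching_domain(attribute_values, known_domains):
    """Return (domain_name, fraction_of_distinct_values_matched).

    known_domains maps a domain name to a set of its values.
    A score of 1.0 means every attribute value belongs to the domain.
    """
    distinct = set(attribute_values)
    if not distinct or not known_domains:
        return None, 0.0
    scores = {
        name: len(distinct & domain) / len(distinct)
        for name, domain in known_domains.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A score below 1.0 does not settle the question by itself: the mismatched values may be errors in the attribute or gaps in the reference domain, which is why a manual validation phase follows discovery.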


•Domain Discovery Through Pattern Analysis. For string-based attributes, you can derive a rule-oriented domain by analyzing attribute value patterns. For example, if you determine that each value has 10 characters, where the first character is always A, and the remainder are digits, that syntax rule can be posited as a validation rule, which is added to a meta-database of domain patterns. Examples include telephone numbers, ZIP codes, and social security numbers.

One method is to classify each character appearing in a data value as a letter, digit, punctuation mark, or white space. For each string that appears as a value, the character pattern is analyzed and recorded. When all strings are analyzed, the patterns that appeared are collated and counted. As before, those patterns are checked against the known sets of patterns, and a set of candidate domains is presented to the user.

If no candidates exist, there may be embedded information in the patterns themselves that you should investigate. At this point, you can classify all white space-separated strings as alphabetic, alphanumeric, numeric, or as one of a set of categorized words such as names, business words, or address words. After this analysis, the patterns are collated and counted, and the matching step occurs again.
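The character-level pass described above can be sketched as follows. The single-letter class codes (L, D, P, S) are my own notation, and treating every character that is not a letter, digit, or white space as punctuation is a simplifying assumption.

```python
from collections import Counter


def char_class(ch):
    """Map one character to a class code."""
    if ch.isalpha():
        return "L"  # letter
    if ch.isdigit():
        return "D"  # digit
    if ch.isspace():
        return "S"  # white space
    return "P"      # everything else is treated as punctuation


def pattern_of(value):
    """Return the character-class pattern of a string value."""
    return "".join(char_class(ch) for ch in value)


def pattern_counts(values):
    """Collate and count the patterns that appear in a column."""
    return Counter(pattern_of(v) for v in values)
```

A column dominated by one pattern, such as DDDPDDDD for seven-digit telephone numbers with a separator, suggests a rule-oriented domain; the rare patterns are the values worth inspecting.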

•Superdomains, Subdomains, and Composed Domains. When no domain matches with 100 percent accuracy, more interesting business rules may exist. For example, if more values are in the attribute than in any one domain (a superdomain), the attribute probably takes its values from a composed domain. This situation could indicate an overloaded attribute, in which case you should examine that attribute for more business rules. It may also mean that the domain as recorded is incomplete, that you need to adjust the reference data set, or that you need to correct inaccuracies in the attribute.

If the attribute appears to belong to more than one domain, it may signal the existence of a subdomain lurking among the known domains. A subdomain is a set of values that is a subset of another domain. This subdomain may represent another business rule, or you might infer that two similar domains may represent the same set of values, in which case the domains might be merged. In all cases, heuristic algorithms can help consolidate the available information to make better inferences.
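The set relationships discussed here reduce to a few comparisons. This sketch only labels the relationship; the labels are my own, and deciding what a given label means for the business still takes the heuristics and manual review the article describes.

```python
def compare_to_domain(attribute_values, domain):
    """Label how an attribute's distinct values relate to a known domain."""
    distinct = set(attribute_values)
    domain = set(domain)
    if distinct == domain:
        return "exact"
    if distinct < domain:
        # All values belong to the domain; a recurring strict subset
        # may hint at a subdomain.
        return "subset"
    if distinct > domain:
        # More values than the domain holds: possible composed domain,
        # overloaded attribute, or incomplete reference data.
        return "superset"
    if distinct & domain:
        return "overlap"
    return "disjoint"
```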

•Mapping Identification. A mapping membership rule applies for two attributes if each attribute has an associated domain membership rule, and each pair of values in each record exists in the defined domain mapping. The simplest method for identifying a mapping is to collect all unique value pairs as they appear in the data set. This set can then be analyzed to determine if the mapping is one-to-one, one-to-many, many-to-one, and so on. Mapping memberships represent consistency relationships between the values within each record and can be used to validate data records on entry to the data warehouse.

A one-to-one mapping represents an interesting value determination rule within the data set that probably represents an embedded business rule. It implies that for each unique occurrence of a value of the first attribute, the second attribute must match what the mapping describes. Therefore, given the value of the first attribute, the second attribute’s value is predetermined, so this rule can be used for both validation and automated completion. These kinds of mapping memberships — also called functional dependencies — may exist between composed sets of attributes as well.
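Collecting the unique value pairs and classifying the mapping can be sketched like this. The function name and the string labels are assumptions of the sketch.

```python
from collections import defaultdict


def classify_mapping(pairs):
    """Classify the unique (x, y) pairs found in a data set."""
    ys_for_x = defaultdict(set)
    xs_for_y = defaultdict(set)
    for x, y in set(pairs):  # duplicates carry no extra information
        ys_for_x[x].add(y)
        xs_for_y[y].add(x)
    # Does each x value pair with exactly one y value, and vice versa?
    x_determines_y = all(len(ys) == 1 for ys in ys_for_x.values())
    y_determines_x = all(len(xs) == 1 for xs in xs_for_y.values())
    if x_determines_y and y_determines_x:
        return "one-to-one"
    if x_determines_y:
        return "many-to-one"
    if y_determines_x:
        return "one-to-many"
    return "many-to-many"
```

Whenever the first attribute determines the second (the one-to-one and many-to-one results), the collected pairs can double as a lookup table for validating or auto-completing the second attribute.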

Rule-Based Data Warehouse Certification

As I explained earlier, you can use discovered data quality and business rules for data warehouse certification. Given a set of data quality rules, you can assign a score to the quality of the data imported into a data warehouse for certifying warehouse data quality. (See Figure 3.) Then you input the data quality rules into a rules engine. Associated with each rule is a validity threshold percentage based on the users’ expectations of quality. As you feed records into the engine, any relevant rules are tested. If no rules fail, the record is successfully gated through to the warehouse. If any rule fails, the record is output to a reconciliation system. The count of failures and successes is maintained for each rule.

After you’ve imported the data, each rule’s validity percentage is computed and a data quality certification report is generated. If all validity percentages exceed the associated thresholds, the warehouse is certified to conform to the users’ data quality requirements. Otherwise, you must analyze the rejected records to find the root cause of the failures. This analysis and correction is part of a business workflow that relies on the same set of data quality and business rules used for validation. After reconciliation, the data is re-sent through the rules engine, and the validity report is generated again. This process continues until certification is achieved.
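One pass of the certification loop can be sketched as below. This is a simplified illustration, not a real rules engine: every rule is tested against every record (the article tests only relevant rules), a rule passes certification when its validity percentage meets or exceeds its threshold, and the rule names in the usage note are hypothetical.

```python
def certify(records, rules):
    """Run one certification pass.

    rules maps a rule name to (predicate, threshold_percent).
    Returns (accepted, rejected, report, certified).
    """
    passes = {name: 0 for name in rules}
    accepted, rejected = [], []
    for record in records:
        failed = False
        for name, (predicate, _threshold) in rules.items():
            if predicate(record):
                passes[name] += 1
            else:
                failed = True
        # Records failing any rule go to reconciliation, not the warehouse.
        (rejected if failed else accepted).append(record)
    total = len(records)
    # Validity percentage per rule.
    report = {
        name: 100.0 * passes[name] / total if total else 0.0
        for name in rules
    }
    certified = total > 0 and all(
        report[name] >= rules[name][1] for name in rules
    )
    return accepted, rejected, report, certified
```

For instance, with a rule "state_in_domain" at a 75 percent threshold and four records of which one carries an unknown state code, the pass accepts three records, sends one to reconciliation, and still certifies the load at exactly 75 percent validity.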



FIGURE 3 Warehouse data quality certification.


Rules Have It

By analyzing data sets for domain and mapping membership, you can identify and extract embedded data quality and business rules. In this process, you improve the quality of the enterprise data resource through information consolidation while discovering and documenting embedded business rules. While we looked at only one example of the use of these rules (data warehouse certification) here, many opportunities exist for discovering more complex quality and business rules.



David Loshin (loshin@knowledge-integrity.com) is president and CTO of Knowledge Integrity Inc. and has 12 years of experience in high-performance computing, data quality, and data mining. He is the author of High Performance Computing Demystified (AP Professional, 1994), Efficient Memory Programming (McGraw-Hill, 1998), and the forthcoming Enterprise Knowledge Management: The Data Quality Approach (Morgan Kaufmann, 2000).




