CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

October 30, 2002

The Enigma of Similarity, Part 2

Using meaningful names in your data warehouses makes your data more valuable

by Barry Grushkin

What's in a name? Actually, quite a lot. Names are highly meaningful to people, but the names in our data warehouses are generally just arbitrary markers. Analytic projects that use names in context-specific ways, mimicking how the human brain computes relationships, can deliver a major increase in return on investment. Don't underestimate the value of using names in meaningful, quantitative ways in projects ranging from CRM and ERP to product targeting, investment analysis, and drug development.

Continuing my discussion on the nature of "similar" ("The Enigma of Similarity," Sept. 3, 2002), I'll demonstrate several useful and appropriate ways to quantify the similarities and differences of names. These relationships, in turn, can be used in conjunction with your existing quantitative variables to calculate the similarity of individuals, products, investments, chemicals, or the many types of records that might be in your data warehouse.

One application of measures of similarity (and their inverse, distance) is organizing records (customers, products, and so on) into appropriately homogeneous groupings. Within such groupings, the relationships you're trying to understand (predictability of earnings, purchasing level, cost, effectiveness, stability of income, allegiance, and so forth) come out far cleaner, statistically more significant, and more appropriately defined for each group. Studying apples and oranges separately helps you understand the properties of each better. Putting everything together can create confusion.

For example, if you first segment customers by socioeconomic measures, you usually find different but highly explainable buying patterns in each separate segment.

Analytically defined groupings often offer valuable ways to look at your business processes that you wouldn't have discovered with your predefined business segmentations, such as regions, times, departments, class of product, and so forth.

Measures of similarity between names can also be used to generate new and potentially useful hierarchical descriptions of business activity. Hierarchical clustering applications, for instance, can generate relevant OLAP dimensions. For example, I have offered companies whole new product organizations driven by which products are bought by the same kinds of people. It's not uncommon for clients to find that the discovered structures make so much sense that they reorganize their teams, with one team focused on each of the analytically recommended breakdowns.
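To make the idea of a clustering-derived hierarchy concrete, here is a minimal sketch in Python. It is not the method used in any client engagement described here; it assumes simple average-linkage agglomerative clustering, and the product names, customer segments, and percentages are all invented for illustration.

```python
from itertools import combinations

# HYPOTHETICAL data: percentage of each product's buyers that fall into
# three made-up customer segments. These numbers are illustrative only.
PRODUCTS = {
    "tents":        [60, 30, 10],
    "hiking boots": [50, 40, 10],
    "board games":  [10, 30, 60],
    "puzzles":      [5, 20, 75],
}


def profile_distance(a, b):
    """Average absolute difference between two equal-length buyer profiles."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distance_table(profiles):
    """Distance for every unordered pair of names, keyed by frozenset."""
    return {
        frozenset((p, q)): profile_distance(profiles[p], profiles[q])
        for p, q in combinations(profiles, 2)
    }


def average_linkage(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(items, dist):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average-linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], dist),
        )
        d = average_linkage(clusters[i], clusters[j], dist)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    history = agglomerate(list(PRODUCTS), distance_table(PRODUCTS))
    for left, right, d in history:
        print(f"merge {left} + {right} at distance {d:.2f}")
```

Each merge in the history is one level of a candidate hierarchy: products that merge early are bought by similar kinds of people, and the later, higher-distance merges suggest the coarser levels of a possible dimension.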

Working With Names

Data warehouses are full of nominal variables, or names; so are our brains. But humans and data warehouses handle names very differently. In the amazing database in our heads, a name gains meaning and value from the structure of how it is used. Other systematic relationships function as metadata.

For example, the real meaning of wrench includes how you conceptualize what you can do with it and how you use it in conjunction with other objects.

On the other hand, names in data warehouses are often just arbitrary tags that generally don't get used quantitatively or, if used at all, are implemented only as independent, nonquantifiable markers of difference. Sales in New York are different from sales in New Mexico, and that is all the names tell you. Comparison measures based on the names themselves, such as words or letters in common, are generally meaningless. With ZIP Codes, which are really nominal variables masquerading as numbers, a small difference in the digits won't tell you about the differences between the neighborhoods they represent.

Taking a cue from what the brain does, you can attribute quantifiable meaning to a name by giving it usage and structure evaluations appropriate to your applications.

How, then, can you create similarity measures between ZIP Codes that tie in the needs and structure of an application?

If socioeconomic factors are central to your application, you might make use of numbers such as those seen in Figure 1. Figure 1 graphs the average family income vs. the age of head of household for three different ZIP Codes in Boston. It gives a measure of the comparative socioeconomic profile of the three neighborhoods.

The deep blue line shows the profile for my ZIP Code, 02130. The ZIP Code 02131 is on the other side of the tracks. Its profile, shown in pink, is quite different from mine. A one-digit difference in the code can mean a large difference in the neighborhood. However, the yellow line for 02140, 10 miles away, shows a neighborhood that seems much more similar to mine. If you use, for example, the average absolute distance between these lines, you have a very useful measure of the socioeconomic distance between these neighborhoods.
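The average absolute distance between two profile lines is easy to compute once each ZIP Code is represented as a list of income values, one per age band. The Python sketch below shows one way to do it. The age bands and income figures are invented for illustration and are not the values plotted in Figure 1; the function and variable names are likewise hypothetical.

```python
from itertools import combinations

# HYPOTHETICAL numbers: average family income (in $1,000s) by age band of
# head of household. These are NOT the values behind Figure 1.
AGE_BANDS = ["<25", "25-34", "35-44", "45-54", "55-64", "65+"]
PROFILES = {
    "02130": [28, 45, 58, 62, 55, 34],
    "02131": [22, 33, 41, 44, 40, 27],
    "02140": [30, 47, 56, 60, 52, 36],
}


def profile_distance(a, b):
    """Average absolute difference between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same age bands")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pairwise_distances(profiles):
    """Socioeconomic distance for every pair of ZIP Codes."""
    return {
        (z1, z2): profile_distance(profiles[z1], profiles[z2])
        for z1, z2 in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    for (z1, z2), d in pairwise_distances(PROFILES).items():
        print(f"{z1} vs {z2}: {d:.2f}")
```

With these made-up profiles, 02130 comes out far closer to 02140 than to 02131, even though the ZIP Codes 02130 and 02131 differ by a single digit. A distance of this kind can then sit alongside your other quantitative variables when you compare records.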







