The Enigma of Similarity, Part 2

Using meaningful names in your data warehouses makes your data more valuable

Continued from Page 1

The numbers in such graphs offer ways to compare application-relevant features or structural differences between neighborhoods. The distance measures act as application-specific detectors of difference. Such feature detectors add far more information for differentiating individuals, for example, than the raw, original variables do. Individuals may look much the same on other measures (age, income, and family size), but big differences in the neighborhoods they choose to live in can mark real differences in their consumption patterns and even in where they shop.

Two people may have the same income and therefore look the same on that measure, but if they come from very different ZIP Codes, they need to be recognized as different. A family earning $100,000 annually may be the poorest on the block in an exclusive neighborhood but the richest in another. Their buying patterns will be very different. If, however, you look only at individuals' incomes relative to the average in their neighborhoods, you could have two people who each earn twice their neighborhood's average but have very different incomes. Thus, it's only by combining measures (for example, income and neighborhood profile) that you can get a good handle on whom you should consider the same or different. Other methods lose even less information because they keep the distributional information, but that point is best left for another discussion.

What I've done in numerous applications is integrate three or four indicators into a single measure for each name by summing the squares of each. For example, the types of variables that you can get from the U.S.
Census at differing levels of geographic aggregation include age, income, race, household structure, household tenure, year the structure was built, occupants per room, value of units, housing costs as a percentage of income, school enrollment, educational attainment, marital status, fertility, language spoken at home, ancestry, employment status, and occupation.

Another value of quantifying names in this way is that you can still get important information that helps differentiate individuals, even though the overall data is available only in larger aggregated summaries. If in the end you create distance measures between records based on both the original quantitative variables and the quantified names at differing levels, you have created something with remarkable parallels to one of the ways the human brain so successfully organizes similarity and difference.

The power of this human capability can be seen in the way people see caricatures as similar to the original person. Here, higher-level organizations of general forms and relations match up, while specifics are minimized. The similarity method I've discussed is motivated by the power of this multileveled, structure-of-use method that we can attribute to human similarity judgments, such as the ability to recognize family resemblance. Differing sets of parts and combinations of parts, overlapping in different ways at differing levels, allow us to see sets of people as coming from the same family even when they have no single feature in common.

Next time, I'll delve deeper into the applications that come from understanding the complexity of the notion of "similar," motivated by the complex solutions found in the information processing power of our brains.

Barry Grushkin [BLG23@Cornell.edu] is chairman and CTO of The Machine Intelligence Development Co., a group specializing in sophisticated data mining and business process optimization.
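
As a rough illustration of the approach described in this article, the sketch below quantifies a name (here, a ZIP Code) by summing the squares of three indicators, then measures the distance between two records using both an original variable (income) and the quantified name. The article gives no formulas beyond "summing the squares," so everything else is an assumption made for illustration: the indicators are taken to be standardized z-scores, the ZIP Codes and values are invented, the distance is plain Euclidean, and the names `ZIP_INDICATORS`, `name_score`, and `record_distance` are hypothetical.

```python
import math

# Hypothetical neighborhood indicators per ZIP Code, assumed to be already
# standardized (z-scores across all ZIP Codes). Values are invented.
ZIP_INDICATORS = {
    "00001": {"median_income": 2.0, "median_home_value": 1.0, "college_share": 2.0},
    "00002": {"median_income": -1.0, "median_home_value": 0.0, "college_share": -2.0},
}


def name_score(indicators):
    """Collapse several indicators into one number by summing their squares."""
    return sum(value * value for value in indicators.values())


def record_distance(a, b, zip_indicators=ZIP_INDICATORS):
    """Euclidean distance over each record's income z-score and ZIP Code score.

    The choice of Euclidean distance is an assumption of this sketch.
    """
    score_a = name_score(zip_indicators[a["zip"]])
    score_b = name_score(zip_indicators[b["zip"]])
    return math.hypot(a["income_z"] - b["income_z"], score_a - score_b)


# Two people with identical incomes who live in very different neighborhoods.
same_income_a = {"income_z": 0.5, "zip": "00001"}
same_income_b = {"income_z": 0.5, "zip": "00002"}

print(name_score(ZIP_INDICATORS["00001"]))             # 9.0
print(record_distance(same_income_a, same_income_b))   # 4.0
```

On income alone the two records are indistinguishable; adding the quantified ZIP Code separates them. Note that squaring discards the sign of each indicator, so this particular sketch captures only how far a neighborhood sits from average, not in which direction.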