Intelligent Enterprise

Better Insight for Business Decisions

June 13, 2001



Mining a Demographic Mother Lode

Mine census data to enhance your organization's marketing programs and customer relationships

By Seth Grimes

continued from Page 1

Note also that respondents don't always answer the census truthfully and accurately. For instance, some individuals reportedly did not list their ancestry completely in order to avoid diluting the count of members of their primary racial or ethnic group. Weighting race membership could have helped, but unfortunately the survey did not record whether people who reported themselves as African American and White are, for example, 7/8 the first and 1/8 the second or 1/8 the first and 7/8 the second. Adding to the problem, public officials decried a supposed privacy invasion and condoned nonresponse, even though they had said nothing when given the chance to review the survey forms before they were finalized. The bottom line is that you shouldn't expect 100 percent accuracy.

Using Census Summary Data

Census data comes in a pretty basic form. If you want to do extensive analyses, you'll need to either purchase data sets repackaged with value-added analytic and display tools or load the data sets yourself into your favorite analytic package. You'll face a few hindrances, however. First, the USCB distributes metadata only in a document and not in a form that computer programs can use directly. Second, the number of computed values for each geographic area is more than a desktop tool like a spreadsheet can handle, so the data sets are segmented into files containing 252 or fewer fields, not counting four fields that identify each record. Nonetheless, with a modest time investment you should be able to find great value in the published summary data.
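If you load the segmented files yourself, the first job is stitching the segments back together on the identifying fields. Here is a minimal sketch of that join, not the USCB's own procedure. It assumes comma-delimited segments that carry a header row and repeat the same four identifying fields with the same values in every segment. The key field names and the `P001` and `P002` variables are hypothetical; take the real layouts from the data product documentation.

```python
import csv
import io

# Hypothetical names for the four identifying fields; the real names and
# positions come from the USCB data product documentation.
KEY_FIELDS = ("file_id", "state", "iteration", "record_no")


def merge_segments(segments):
    """Join rows from several segment files on their identifying fields.

    `segments` is an iterable of open text files (or any iterables of CSV
    lines) whose header rows include KEY_FIELDS.
    """
    merged = {}
    for segment in segments:
        for row in csv.DictReader(segment):
            key = tuple(row[name] for name in KEY_FIELDS)
            # Start a record with the key fields the first time a key is seen.
            record = merged.setdefault(key, dict(zip(KEY_FIELDS, key)))
            for name, value in row.items():
                if name not in KEY_FIELDS:
                    record[name] = value
    return list(merged.values())


# Two tiny made-up segments standing in for real files.
seg1 = io.StringIO(
    "file_id,state,iteration,record_no,P001\n"
    "X,MD,000,0000001,100\n"
    "X,MD,000,0000002,250\n"
)
seg2 = io.StringIO(
    "file_id,state,iteration,record_no,P002\n"
    "X,MD,000,0000001,40\n"
    "X,MD,000,0000002,90\n"
)
rows = merge_segments([seg1, seg2])
```

Each merged row holds the key fields plus every variable found for that key, so a record can grow well past the per-file field limit once it's out of the spreadsheet and in memory or a database.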

The data volume will pose a challenge to users who want to go beyond browsing the American FactFinder Web site. The redistricting data files for the whole of the United States, uncompressed, take up about 60GB of disk space. Subsequent data files will be up to about 40 times that size, although gzip compression will save a lot of space.
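At up to about 40 times 60GB, the later files would run to roughly 2,400GB uncompressed, so it may pay to leave them compressed and read them as a stream. The sketch below shows one way to do that with Python's standard `gzip` and `csv` modules. The file name, the comma-delimited layout, and the field position are assumptions made for illustration only.

```python
import csv
import gzip
import os
import tempfile


def sum_field(path, field_index):
    """Stream a gzip-compressed, comma-delimited file and total one numeric field.

    Rows are decompressed and parsed one at a time, so the uncompressed
    file never has to exist on disk or fit in memory.
    """
    total = 0
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle):
            total += int(row[field_index])
    return total


# Build a tiny compressed file so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv.gz")
with gzip.open(sample_path, "wt", newline="") as handle:
    csv.writer(handle).writerows([["a", 10], ["b", 20], ["c", 30]])

total = sum_field(sample_path, 1)
```

The trade-off is CPU time for disk space: every pass over the data decompresses it again, so this suits one-time loads and occasional scans better than repeated interactive queries.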



If you plan to do advanced analyses, start by reading the data product documentation. It's available online and describes the survey methodology, hierarchies of geographic areas, data file layouts, and statistical-table contents.

The data lends itself to OLAP-style slice-and-dice analysis, but you'll need to design a set of cubes that pulls data from multiple tables grouped by universe (subject population), such as persons, households, and heads of households. You'll need to handle hierarchies of geographic areas that contain multiple branches and up to eight levels of depth, as well as tables that contain both basic values and subtotals and totals. And a given data set may be ragged because results are suppressed to protect confidentiality. A better alternative could be loading the data into an RDBMS, where ragged rows, complex dimensional hierarchies, segmented data sets, and disparate universes won't be a problem. You could also use specialized tools like those the USCB used to create the data sets.
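To make the RDBMS option concrete, here is a small sketch using SQLite through Python's standard `sqlite3` module. The schema, area codes, and values are all invented for illustration; none of it is the USCB's layout. It stores the geographic hierarchy as parent links, records a suppressed result as NULL, and rolls values up a branch with a recursive query. Only lowest-level values are loaded, on the assumption that you would keep the published subtotals out of the fact table to avoid double counting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE area (
        area_id   TEXT PRIMARY KEY,
        parent_id TEXT REFERENCES area(area_id),
        level     TEXT NOT NULL
    );
    CREATE TABLE fact (
        area_id  TEXT NOT NULL REFERENCES area(area_id),
        universe TEXT NOT NULL,
        variable TEXT NOT NULL,
        value    INTEGER            -- NULL marks a suppressed result
    );
""")
# A made-up three-level hierarchy: one state, two counties, three tracts.
conn.executemany("INSERT INTO area VALUES (?, ?, ?)", [
    ("S1", None, "state"),
    ("C1", "S1", "county"),
    ("C2", "S1", "county"),
    ("T1", "C1", "tract"),
    ("T2", "C1", "tract"),
    ("T3", "C2", "tract"),
])
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", [
    ("T1", "persons", "total", 120),
    ("T2", "persons", "total", None),   # suppressed
    ("T3", "persons", "total", 300),
])

# Walk from a root area down through all of its descendants, then total
# the known values and count the suppressed ones.
ROLLUP = """
    WITH RECURSIVE subtree(area_id) AS (
        SELECT ?
        UNION ALL
        SELECT a.area_id FROM area a JOIN subtree s ON a.parent_id = s.area_id
    )
    SELECT SUM(f.value), SUM(f.value IS NULL)
    FROM fact f JOIN subtree USING (area_id)
    WHERE f.universe = ? AND f.variable = ?
"""


def rollup(root, universe, variable):
    """Return (sum of known values, number of suppressed values) under root."""
    return conn.execute(ROLLUP, (root, universe, variable)).fetchone()


state_total, state_suppressed = rollup("S1", "persons", "total")
county_total, county_suppressed = rollup("C1", "persons", "total")
```

Because the hierarchy is just rows with parent links, branches of different depths need no special handling, and the suppressed count tells you when a total understates the real figure.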

The data would prove an ideal complement to existing analytic systems if you can establish how to match contents (by mapping variables) and how to allocate the census data sets' summary values to individual records. Even the USCB faces comparability issues given survey changes between 1990 and 2000; in particular, the Census 2000 survey lets you choose up to six races rather than just one. In other cases, you may find that your age or income categories, for instance, don't match the census's. Further aggregating data to make the categories match means losing information and isn't numerically feasible in any case for derived values like medians. To overcome data allocation difficulties, which similarly involve the risk of losing information, you may need sophisticated techniques such as record linkages that assign distributions rather than single values from the summary data to individual records.
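Two of these points are easy to see with toy numbers. The sketch below first shows that the median of two group medians need not equal the median of the combined data, which is why medians can't simply be re-aggregated. It then attaches an area's whole income distribution to a customer record rather than a single value. This is a plain lookup by area code, much simpler than true record linkage, and the area codes, brackets, counts, and field names are all made up for illustration.

```python
from statistics import median

# Medians do not re-aggregate: compare the median of the group medians
# with the median of the pooled values.
group_a = [10, 20, 30]
group_b = [40, 50, 60, 70, 1000]
median_of_medians = median([median(group_a), median(group_b)])
true_median = median(group_a + group_b)

# Made-up counts of households per income bracket for two areas.
AREA_COUNTS = {
    "A": {"under_25k": 200, "25k_50k": 500, "50k_plus": 300},
    "B": {"under_25k": 50, "25k_50k": 150, "50k_plus": 800},
}


def distribution(counts):
    """Convert bracket counts into shares that sum to 1."""
    total = sum(counts.values())
    return {bracket: n / total for bracket, n in counts.items()}


def enrich(record, area_counts=AREA_COUNTS):
    """Return a copy of the record with its area's bracket shares attached."""
    enriched = dict(record)
    enriched["income_distribution"] = distribution(area_counts[record["area"]])
    return enriched


customer = enrich({"customer_id": 17, "area": "A"})
```

Keeping the full distribution on the record preserves the uncertainty: you know only how likely each bracket is for someone in that area, not which bracket a given customer falls in.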

This column is only a brief introduction to Census 2000 data. I can only hint at the breadth of data available and sketch how you can use the data and the issues you'll face. If you work with data about U.S. residents - on a local or a national scale - you'll find that enriching your data with census results is well worth the effort.



Seth Grimes, [grimes@altaplana.com], is a principal of Alta Plana Corp., a Washington, D.C.-based consultancy specializing in large-scale analytic computing systems.




Resources


American FactFinder
IBM Global Services
SAS Institute
Space-Time Research
U.S. Census Bureau (download data)
Online sidebar on Census 2000 Analysis System














