Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Digital Library
Subscribe
Home




January 30, 2001



Address Space

Turn your data warehouse into a geographic information system


Ralph Kimball

A chasm divides the data warehouse community from the geographic information systems (GIS) community, a chasm I described in my previous column. (See Data Webhouse from the Jan. 1, 2000 issue at IntelligentEnterprise.com ) Although both of these groups talk about data warehouses and about making decisions from their data, they rarely venture onto each other's turf and the apparent overlap of vocabulary is, in fact, misleading.

Data warehouse people are mostly concerned with vast text and number databases, derived from the marketing and operational data sources of the organization, and GIS people are mostly concerned with getting their vast, two-dimensional descriptions of locations and paths on the surface of the earth into an analytic framework.

My interest in all of this is not to try to bring the two communities together. I have a much more narrow agenda, and that is to encourage text-and-number data warehouse project managers to pick off some of the low-hanging fruit provided by our GIS colleagues. With comparatively little effort, we data warehousers can attach some very interesting GIS tools to our existing data, thereby extending the analysis paradigms of the data we already have.

Using GIS tools, we can clean up and effectively use the millions of addresses we already store for all our customers. We can invoke new graphical presentation tools that allow us to see two-dimensional patterns that simply can't be detected in spreadsheets and conventional reports. And we can attach some new verbs to our existing SQL and OLAP databases without modifying the underlying data that let us ask spatially enabled questions, such as "Find all the customers that are within or near a group of counties."

In my previous column, I described the experience of going to ESRI's boot camp training for its Visual Basic MapObjects product. MapObjects implements the final presentation layer for the end user. MapObjects can draw maps and understands data sets. But beneath MapObjects, there is a lot of necessary machinery for attaching ordinary dirty data. The two most interesting pieces of machinery are the address standardizing piece and the geographic query extensions piece. Let's take a look at what it takes to actually use these capabilities and also at the other serious players in this game besides ESRI.

Automatic Address Standardizing

MapObjects provides two objects that work together to let an application interpret ordinary street addresses. The standardizer object applies user-specified patterns to a single street address or a table of street addresses, converting the raw addresses into its best guess as to the fully parsed form. A simple pattern for street addresses within a city might look like the following:

HN 12; House Number
PD 2; Pre-direction
PT 6; Pre-type
SN 30; Street Name
ST 6; Suffix type
SD 2; Suffix direction
ZN 20; Zone

The two character abbreviations at the beginning of each line are the parse tokens of possible address contents. The number is the maximum byte count for each parse token, and the English text after the semicolon is the meaning of the token. This format represents the maximum generality recognizable in an address if the analyst decides to use this pattern.

A street address fits this pattern if some or all of the tokens can be assigned to the address, in the order given previously. Studying this pattern makes it clear that the more the street addresses fit this pattern, the more accurate the address standardizing will be.

After you have standardized the addresses in MapObjects, it passes them to the geocoder object, which attempts to match each parsed address to a standard street network database. In MapObjects, a suitable street network database would be an ESRI shapefile attributed with road names. An ESRI shapefile is a presentation-ready database, in this case consisting of line segments, with an open-ended number of textual and numeric attributes associated with each segment.

Shapefiles are one of the main formats for representing a map layer and can be created in many ways. In the case of a standard street network covering the whole United States, you would expect to purchase the required shapefiles from a geographic data supplier. Look up "TIGER line files" on www.google.com for a lot of leads on data suppliers that add value to the TIGER Census Department data on U.S. streets. (The Census Department itself has a long list of vendors that add value to this data and supply an impressive array of address matching and geocoding products. You can find this list at www.census.gov/cgi-bin/geo/vendors?A. ESRI is not the only vendor in this space!)

When you run the geocoder object, it returns a set of candidate real locations on its street network that map each of your candidate addresses with varying degrees of certainty. Remember that your candidate addresses may be incomplete, have variant spellings, or be corrupt. Just because you have parsed the address doesn't mean you have a good address. Normally you would accept the geocoder's match if a single address on the street network matches your candidate, and the confidence score the geocoder returned is sufficiently high. If no location on the street network has an acceptable confidence score or if multiple locations match your candidate, then you need to think about improving the quality of your original data.

After passing a whole table of parsed and tokenized addresses to the geocoder, if all goes well, you get back a set of location objects that you can immediately plot on the visual map or write back into a text-and-number geographic data set as "the answer."

I described in some detail the ESRI MapObjects procedure for handling real customer addresses, in order to give you a feel for what it takes to translate typical dirty customer addresses into a useful, standardized form for presentation and analysis. The architecture I described is typical of most of the products that perform address matching.

Geographic Query on Standard Databases

It is possible to add spatial analysis semantics to your existing relational databases running on Oracle, DB2, SQL Server, Informix, or Sybase.

ESRI is capable of placing a geographic interpretation layer between the user and any of these databases with its Spatial Database Engine (SDE) product. The charm of SDE is that you don't need to buy a separate database for geographic processing.

SDE can be configured in a variety of modes. The SDE server can directly manage geo-specific tables such as street maps and shapefiles, and at the same time provide virtual links to your own underlying, unmodified tables residing on one of the conventional relational databases. Or, SDE can physically alter and populate your underlying tables with geo-specific attributes such as points, lines, and polygons.

SDE exposes an enhanced SQL API to clients, including MapObjects. So within MapObjects, you can construct queries effectively against your normal warehouse data that can determine the relationship between geographic features. Such relationships might be crossings or intersections, shared boundaries or points, or the containment of one feature within another. References to SDE-managed data attributes can be mixed freely within a single SQL expression that is otherwise referring to your normal data. SDE's SQL extensions let you ask complex distance questions, such as the distance between a point location and some other extended boundary-defined area. Another difficult distance question is how far a complex, extended area is from a line feature such as a road or a pipeline.


Rate This Article

Comments:

Optional e-mail address:

SDE can impose constraints on geographic features such as lines to guarantee at all times that they form a properly constructed network of a certain topology. SDE also manages what ESRI calls dynamic spatial overlay, which splits the spatial analysis operations away from the data retrieval operations while pipelining these operations so that SDE can overlay two data sets without first fully extracting both of them.

SDE defines an architecture as well as a set of spatial query operations. Other vendors, such as IBM and Oracle, have implemented the full set of ESRI SDE query operations while proposing their own software architectures, or in some cases, mixing their own products with ESRI's. If you are tempted to spatially enable your existing data warehouse, it would certainly be worthwhile to investigate IBM's DB2 Spatial Extender, as well as Oracle's Spatial Cartridge and the Informix Spatial DataBlade.

The Right Fit?

If you are a GIS professional sitting on top of oceans of geospatial data, then the approach I've described is probably not for you. You need to start with a mainline GIS solution. But if you are a text-and-numbers data warehouse manager already storing millions of addresses and other attributes of physical locations, then consider the techniques shown in my last two columns to pick the low-hanging fruit that our GIS colleagues have generously provided for us, without modifying your existing data architecture and without changing your other data warehouse applications.

Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including the newly released The Data Webhouse Toolkit (Wiley, 2000). Ralph teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach Ralph through his Web site at www.ralphkimball.com.






IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







Techweb
Informationweek Business Technology Network
InformationweekInformationweek 500Informationweek 500 ConferenceInformationweek AnalyticsInformationweek Events
Informationweek MagazineGlobal CIOIWK Government ITbMightyByte and SwitchDark Reading
Digital LibraryIntelligent EnterpriseInternet EvolutionNetwork ComputingPlug Into The CloudDr. DobbsContentinople
space
TechWeb Events Network
InteropVoiceConWeb 2.0 ExpoWeb 2.0 SummitEnterprise 2.0Mobile Business ExpoNoJitter
Black HatGTECEnergy CampCloud ConnectGov 2.0 ExpoGov 2.0 Summit
space
Light Reading Communications Network
Light ReadingLight Reading AsiaUnstrungCable Digital NewsInternet EvolutionPyramid Research
Heavy ReadingLight Reading LiveLight Reading InsiderEthrnet ExpoTelco TVTower Technology Summit
space
Financial Technology Network
Advanced TradingBank Systems and TechnologyInsurance and TechnologyWall Street and TechnologyAccelerating WallstreetBST SummitBuyside Trading SummitIT Summit
space
Microsoft Technology Network
MSDNTechNetTotal IT ProTotal Dev ProNET Total Dev Pro CommunitySQL Total Dev Pro Community
space