Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Whitepapers
Subscribe
Home




February 1, 2002

Finders, Keepers

Point-spatial searching is an emerging requirement for RDBMS-based applications — so don't get left behind

By Marty Himmelstein

Continued from Page 1

A point-spatial search returns a set of attributes for each returned point. The spatial attributes will include the lat/lon coordinates and the address. This data can be passed to post-processing applications, such as map rendering and routing (driving directions). As with geocoding, these are separate applications that require their own specialized data and algorithms. The relationship of maps and routing to point-spatial searching is analogous to graphing packages that visually represent the results of SQL queries.

POINT-SPATIAL SEARCHING IN AN RDBMS

After data is geocoded, it is ready to be loaded into the database. Well, almost: The problem with using latitude and longitude coordinates directly for point-spatial searching is that it is horribly inefficient. The inefficiency is of little consequence for small and simple data sets but becomes an unbreachable hurdle as the size and complexity of the data increases.

To perform a spatial search it is necessary to put boundaries on the area to be searched. Such a boundary is usually rectangular (even if subsequent steps cull the results to a circular area) and is called a "bounding box." To express a bounding box with lat/lon coordinates requires four values: the lower left lat/lon and the upper right lat/lon (or vice versa).

The access plan for the SQL involved requires two range searches for the spatial predicates alone, one for the latitude range and one for the longitude range. Even with a composite index on latitude and longitude, that can force the database engine to examine a tremendous number of index entries. Many entries will pass the first range test (whether it be lat or lon) but fail the second test. Furthermore, after reducing the set of rows to those within the bounding box, it is still necessary to perform additional predicate tests. (You don't search for all businesses in downtown New York City, but only for certain businesses, such as restaurants, or restaurants with a specific name, or in a certain price range.) For most applications, nonspatial predicates limit the number of relevant rows to a small percentage of the rows that satisfy the spatial predicate. Successful point-spatial searching requires the efficient processing of a combination of spatial and nonspatial attributes.

Several techniques have been developed to provide efficient access to 2D data. The z-ordering linear quadtree works well with standard RDBMSs by decomposing a 2D space into rectangular cells and assigning each cell a unique one-dimensional identifier. These identifiers can be efficiently indexed with either B-trees or hash indexes, the two most common RDBMS index structures.

The basic idea of the linear quadtree is to partition the 2D space of interest into rectangular quadrants. In our case, the 2D space of interest is a political map of the earth, or some portion of it, such as North America. It is convenient to assume that we start with a single undivided rectangle that covers the entire 2D space and contains all of the points to be indexed. Because a single rectangle that contains all of points to be indexed isn't helpful, the rectangle is decomposed into four equal quadrants, informally labeled NW, NE, SW, and SE. If, after a new quadrant is created, it still contains more than a certain maximum number of points, the quadrant is further decomposed into four more subquadrants. This recursive decomposition continues until a newly created quadrant contains no more than the maximum number of points.

For most applications, points are not uniformly distributed in the 2D space. Therefore, the density-based decomposition strategy leads to some quadrants that are larger than others. In a data warehouse that tracks sales of suntan lotion by location, for example, a quadrant in Florida might represent an area well less than a mile per side, whereas it will only take one or two quadrants to represent Alaska, with the Northwest territories thrown in. (See Figure 1.)

Instead of density-based decomposition, quadrants could be set to some uniform predetermined size. The disadvantage of uniform quadrant sizes is that in some areas quadrants will have very few points, making the examination of most quadrants wasted work. Or, some quadrants will have too many points, requiring extra work to examine individual points. A main goal of all SAMs is to reduce the number of objects for which costly geometric operations have to be performed. A SAM that doesn't sufficiently reduce the number of points to be examined is, therefore, not doing its job.

To assign a unique label to each quadrant, we replace the informal compass descriptors of NW, NE, SW, SE, with the characters 0, 1, 2, 3 (or something equivalent, it doesn't matter). For example, instead of an unwieldy "SE quadrant of the NE quadrant of the NW quadrant," we assign the identifier 132. (See Figure 2.) This, in a nutshell, is how a single-dimensional value is assigned to a 2D area. The term z-order refers to the order in which the quadrants are assigned unique identifiers. (See Figures 2 and 3.) Consider that the shorter the identifier, the larger the area it represents; each position in the identifier represents the next deeper level of the quadtree. (Linear quadtree refers to the fact that only the bottommost quadrant identifiers need to be stored in the database. Even though quadrants are recursively generated, recursion is not required at access time.)



Rate This Article

Comments:

Optional e-mail address:

The lexicographic ordering of the character string identifiers is valuable for certain index operations and SQL text functions. For example, a wildcarded LIKE predicate for quadrants that begin with '132' will find all quadrants with the '132' prefix. Such operations enable index range scans, which are more efficient than table scans. More important, however, is that quadrant identifiers can be used in equality predicates, which for virtually all RDBMS systems is the most efficient way to access data. Composite indexes, in which the first part is a quadrant identifier and the second part is some nonspatial attribute, allow for extremely fast resolution of complex queries with both spatial and nonspatial components.

KEEP IN MIND

This short article can't cover all the basic aspects of point-spatial searching. For example, in the process of executing a spatial query, it is necessary to build a list of quadrants to be examined, and perhaps to order that list in increasing distance from the search center. You must account for the earth's curvature to ensure that distance calculations are accurate. Nevertheless, these operations aren't that time-consuming as long as the SAM does an effective job of filtering the objects that need more intense scrutiny.


Marty Himmelstein [marty@longhill.com] is the founder of Long Hill Consulting Inc., a database consultancy. In private industry,he developed and implemented large-scale point-spatial systems and is the coauthor of a patent in the field.


RESOURCES

GDT: www.geographic.com

IBM's DB2 Spatial Extender: www-4.ibm.com/software/data/datajoiner/spatial.html

NavTech: www.navtech.com

Oracle's Spatial Data Option: www.oracle.com/ip/deploy/database/oracle9i/index.html?cm_spatial.html

PostgreSQL: www.postgresql.org

Vicinity Corp.: www.vicinity.com







IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







InformationWeek Business Technology Network
InformationWeekInformationWeek 500InformationWeek 500 ConferenceInformationWeek AnalyticsInformationWeek CIO
InformationWeek EventsInformationWeek ReportsInformationWeek MagazinebMightyByte and SwitchDark Reading
Digital LibraryIntelligent EnterpriseInternet EvolutionNetwork ComputingNo Jitter
space
Techweb Events Network
InteropVoiceConWeb 2.0 ExpoWeb 2.0 SummitEnterprise 2.0 ConferenceMobile Business ExpoSoftware ConferenceCSI - Computer Security Institute
Black HatGTECEnergy CampMashup CampStartup Camp
space
Light Reading Communications Network
Light ReadingLight Reading EuropeUnstrungLight Reading's Cable Digital NewsConstantinopleInternet Evolution
Heavy ReadingLight Reading Live!Light Reading InsiderEthernet ExpoOptical ExpoTeleco TVTower Technology Summit
space
Financial Technology Network
Advanced TradingBank Systems & TechnologyInsurance & TechnologyWall Street & TechnologyAccelerating Wall StreetBank Systems & Technology Executive SummitBuyside Trading SummitInsurance & Technology Executive Summit
space
Microsoft Technology Network
MSDN MagazineTechNetThe Architecture Journal
space