Intelligent Enterprise

Better Insight for Business Decisions

September 18, 2001



Integrating Structured Data and Text

Treating text as a relational application is a viable alternative for many data warehouses

By David Grossman and Ophir Frieder
Edited by Erik Thomsen

Continued from Page 1

REAL-WORLD APPLICATION

The SIRE Approach

A Scalable Information Retrieval Engine (SIRE) was developed using the concepts of integrating structured data and text. The National Institutes of Health (NIH) chose the SIRE approach as the basis for searching medical citations for its National Center for Complementary and Alternative Medicine because, in addition to typical search engine functions, it provides all the basic DBMS features, such as concurrency control, recovery, access control, and portability. More important, a document that has already been indexed for search can now be modified in the medical citations index. Such updates are difficult or impossible for a typical inverted index, but easy with a DBMS.
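
To illustrate why such updates are easy in a DBMS, here is a minimal sketch in which the index entries for a document are ordinary rows that are replaced inside one transaction. The table names (`citations`, `postings`), the helper functions, and the sample text are all hypothetical and are not taken from SIRE. SQLite is used only because it ships with Python.

```python
import re
import sqlite3


def terms_of(text):
    """Return the set of lowercase alphanumeric terms in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def open_index():
    """Create an in-memory database with a document table and an index table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE citations (doc_id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE postings (
            term   TEXT,
            doc_id INTEGER,
            PRIMARY KEY (term, doc_id)
        );
    """)
    return conn


def upsert_citation(conn, doc_id, body):
    """Insert or modify a document and refresh its index rows atomically."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM postings WHERE doc_id = ?", (doc_id,))
        conn.execute(
            "INSERT OR REPLACE INTO citations VALUES (?, ?)", (doc_id, body)
        )
        conn.executemany(
            "INSERT INTO postings VALUES (?, ?)",
            [(term, doc_id) for term in terms_of(body)],
        )


def docs_with(conn, term):
    """Return the ids of documents whose current text contains the term."""
    cursor = conn.execute(
        "SELECT doc_id FROM postings WHERE term = ? ORDER BY doc_id",
        (term.lower(),),
    )
    return [row[0] for row in cursor]
```

Calling `upsert_citation` a second time with the same `doc_id` replaces both the stored text and its index rows, so later searches see only the revised document. The DBMS supplies the atomicity that a hand-built index file would have to implement itself.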

With SIRE, the system is easily extended to access other structured data in databases at NIH. Back at the lab, we are working to add XML functionality to SIRE and are building a prototype, based on XML-QL (a popular XML query language), that should be ready at the end of 2001.

Text as a Relational App

Treating text as a relational application has numerous benefits. For starters, you don't need to acquire, install, or integrate a text package into the data warehouse to support access to a few text columns. For example, almost every warehouse has a "comments" column or two that lets users enter whatever unstructured data they feel is relevant to the transactional record. But searching these text columns with a LIKE predicate isn't really a good idea, because it is typically implemented as a sequential scan. Building your own inverted index in a native OS file system may seem like an efficient alternative, but then you get to write a few thousand lines of code to do all the file manipulation, concurrency control, and access control that already come with a relational database management system.
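
As a rough sketch of the alternative, the inverted index can be stored as an ordinary relation and searched with standard SQL instead of LIKE. The example assumes a hypothetical `orders` table with a `comments` column and a hypothetical `comment_terms` index table. These names, the tokenizer, and the sample rows are illustrative only and do not come from the SIRE implementation.

```python
import re
import sqlite3


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_comment(conn, order_id, comments):
    """Store one (term, order_id, term frequency) row per distinct term."""
    counts = {}
    for term in tokenize(comments):
        counts[term] = counts.get(term, 0) + 1
    conn.executemany(
        "INSERT INTO comment_terms VALUES (?, ?, ?)",
        [(term, order_id, n) for term, n in counts.items()],
    )


def build_demo_db():
    """Create a tiny warehouse table plus its relational inverted index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, comments TEXT);
        CREATE TABLE comment_terms (
            term     TEXT,
            order_id INTEGER,
            tf       INTEGER,
            PRIMARY KEY (term, order_id)  -- also serves as the lookup index
        );
    """)
    rows = [
        (1, "Customer asked for late delivery"),
        (2, "Delivery damaged, customer refunded"),
        (3, "No issues reported"),
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    for order_id, comments in rows:
        index_comment(conn, order_id, comments)
    conn.commit()
    return conn


def search_all_terms(conn, query):
    """Return ids of orders whose comments contain every query term."""
    terms = sorted(set(tokenize(query)))
    if not terms:
        return []
    placeholders = ",".join("?" for _ in terms)
    sql = f"""
        SELECT order_id
        FROM comment_terms
        WHERE term IN ({placeholders})
        GROUP BY order_id
        HAVING COUNT(DISTINCT term) = ?
        ORDER BY order_id
    """
    return [row[0] for row in conn.execute(sql, terms + [len(terms)])]
```

With the demo data, `search_all_terms(build_demo_db(), "customer delivery")` returns the ids of the two orders whose comments contain both words. The lookup goes through the primary key on `(term, order_id)` rather than scanning every comment, and joining the result back to `orders` is a plain SQL join.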

Treating text as a relational application also opens the door to parallel processing, something that has eluded the commercial text world because of the inherently sequential nature of the inverted index. The obvious downside is the extra overhead of a relational application, but didn't we go through this argument in the '70s, when people griped that the relational approach was too slow and that the best thing to do was to stick with ISAM files?

More Next Month

In the next column, we'll show how you can implement more complex text functionality (such as relevance ranking) and give some more details, performance statistics, and tuning hints on this approach. The bottom line is that treating text as a relational application is a viable alternative for many data warehouses, and it has been deployed in a number of real-world applications. We suspect that as the need for integration of structured data and text increases, more applications will consider solutions similar to the one discussed here.



David Grossman [grossman@iit.edu] is an assistant professor of computer science and Ophir Frieder [frieder@iit.edu] is the IITRI professor of computer science at the Information Retrieval Laboratory, Illinois Institute of Technology.


RESOURCES

Frieder, O., A. Chowdhury, D. Grossman, and M. C. McCabe. "On the Integration of Structured Data and Text: A Review of the SIRE Architecture." DELOS Workshop on Information Seeking, Searching, and Querying in Digital Libraries, Zurich, Switzerland, December 2000.

Grossman, D., D. Holmes, and O. Frieder. "A Parallel DBMS Approach to IR in TREC-3." Overview of the Third Text Retrieval Conference (TREC-3), NIST Special Publication 500-225, April 1995.

Grossman, D. and O. Frieder. Information Retrieval: Algorithms and Heuristics. Kluwer Academic Publishers, 1998.

Grossman, D., D. Holmes, O. Frieder, and D. Roberts. "Integrating Structured Data and Text: A Relational Approach." Journal of the American Society for Information Science, February 1997.






