Guide to the TechWeb Network

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Whitepapers
Subscribe
Home




August 31, 2001



Brace for Incoming

Importing external data deserves its own full design and development life cycle

By Warren Thornthwaite
Edited by Ralph Kimball

Pressure to include external data in our data warehouses bears on us now from a combination of forces that have arisen in the last few years. First, we want to leverage the wealth of information we have about our customers. Second, much more data has become available: We work more closely with our business partners and third parties, who provide us with data about our products, customers, and markets that we are unable to generate in-house. The third driver is the Internet. The ready availability of data transfer tools for the Internet has lowered the barriers to exchanging data with business partners.

But don't think using external data is just another data extract and loading task. In reality, managing external data involves special awareness and procedures that may not be needed within the confines of a single organization.

TYPICAL DATA INTEGRATION PROCESS

So, what does integrating data from a business partner typically involve? This new data source has to go through the full design and development life cycle, except you probably do not have access to the source system or its developers. Plus, you have to find (or set up) a secure, reliable landing site for the data that both you and your business partner can access. Even worse for those of us who feel data warehouse dependability is critical, you now have to manage a process in which someone else - someone outside your organization - owns half of the process.

Like all recommendations in this column, this one starts with planning and design. First, you need to negotiate the nature of the extract. Will it be a snapshot at a point in time or a set of transactions? Will each file you receive be a full refresh or an incremental load? What is the time increment if it is incremental?

Next, you need to negotiate the file's format and contents. In this case, since you are the recipient of data from a source system you can't access directly, you need to ask lots of questions: What are the names and meanings of all the file's elements? What is the file's natural key? What are the relationships among the different fields? Can a field be null? When is it null? What does that mean?

Dust off your data warehouse naming conventions to make sure the field names make sense. For example, what does the field called "Date" mean? Is it the Subscription Start Date, or the Last Billed Date, or the Extract Batch Date? If it is the Subscription Start Date, is it the Original Subscription Start Date, or the Most Recent Subscription Start Date? If the file will be used as a dimension table, you may need to manage it as a slowly changing dimension.

Make sure the data includes lookup tables that provide descriptions of any codes in the main integration table. Refreshed copies of these lookup tables should be included with each load to ensure referential integrity with the main data set. Finally, it's also useful to add a unique batch identifier to each record in case you need to pull it back out of the database. You may want to manage this on the receiving end.

In addition to the contents, you will need to decide on file naming conventions. These should include a descriptive name, the extract date, and the source.

Your process needs to handle a test load. Distinguishing a test load from a production load could be done at the file level, either in the metadata or the file name itself.

Your process may also need to handle versioning. Data sources often change how they define an element or the list of elements included in the extract. Your data integration partner should include a version number in the metadata so your process can handle the incoming data set appropriately. You should also make sure you've agreed on a process to resynchronize the two data sets. If you miss a load or your partner sends you bad data, you need a way to get back on track.

Once you agree on the nature and contents of the data set, you need to capture this information in a schema definition and associated documentation. XML is emerging as the ideal tool for this. A single XML file can contain multiple tables, along with the schema definition itself. Your data partner can send you a fully self-contained data set that has the schema, the metadata, any lookup tables, and the data set.

ARCHITECTURE

Once you have specified the contents, you have to negotiate the architecture. This negotiation involves deciding who will provide the host site. You may find out that your company policy forbids inbound data access, for security reasons. You may have to set up a special machine to host the process. This machine will need to be accessible to your data integration partner and to you. You also need to decide on the transport layer. FTP is the most common choice, but alternatives such as email, HTTP, and secure copy are becoming more popular. Figure 1 shows a typical architecture to support an FTP-based integration.






IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







InformationWeek Business Technology Network
InformationWeekInformationWeek 500InformationWeek 500 ConferenceInformationWeek AnalyticsInformationWeek CIO
InformationWeek EventsInformationWeek ReportsInformationWeek MagazinebMightyByte and SwitchDark Reading
Digital LibraryIntelligent EnterpriseInternet EvolutionNetwork ComputingNo Jitter
space
Techweb Events Network
InteropVoiceConWeb 2.0 ExpoWeb 2.0 SummitEnterprise 2.0 ConferenceMobile Business ExpoSoftware ConferenceCSI - Computer Security Institute
Black HatGTECEnergy CampMashup CampStartup Camp
space
Light Reading Communications Network
Light ReadingLight Reading EuropeUnstrungLight Reading's Cable Digital NewsConstantinopleInternet Evolution
Heavy ReadingLight Reading Live!Light Reading InsiderEthernet ExpoOptical ExpoTeleco TVTower Technology Summit
space
Financial Technology Network
Advanced TradingBank Systems & TechnologyInsurance & TechnologyWall Street & TechnologyAccelerating Wall StreetBank Systems & Technology Executive SummitBuyside Trading SummitInsurance & Technology Executive Summit
space
Microsoft Technology Network
MSDN MagazineTechNetThe Architecture Journal
space