CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





May 28, 2002

Agility Training

Bringing real-time data into your strategic business apps is becoming a necessity. But where do you begin?

By Justin Langseth

Today, most data warehouses are refreshed, at best, on a daily basis. But more and more companies are realizing that day-old data, while acceptable for many applications, can be a real handicap for the most time-sensitive decision-support needs. Almost any data warehouse can benefit from real-time data, especially those driving CRM, supply chain, financial, and homeland security applications.

However, most teams implementing real-time warehouses quickly encounter a set of new, unfamiliar challenges. In this article, I'll describe the top five issues that your team would likely face in the real-time race and provide some hints for tackling them.

Challenge 1: Real-Time ETL

The first problem is getting the data loaded into the data warehouse in real time. Most extract-transform-load (ETL) processes are performed monthly, weekly, or nightly. Going from nightly to more frequent loading is more complex than turning up the "schedule knob" a few clicks.

Executive Summary

Justin Langseth

Nightly loads are no longer frequent enough for many just-in-time manufacturing, supply chain, and CRM applications. Although achieving the Holy Grail of a continuously loaded warehouse is challenging with today's technologies, now is the time to get started.


Strategy.com: Real-Time Pioneer

The challenges addressed here were first encountered by my technology team at Strategy.com, which was formerly a division of BI tool-maker MicroStrategy Inc.

The Strategy.com system was essentially one big real-time data warehouse, designed to provide proactive alerting and reporting services to consumers. We loaded data from financial markets to drive stock change alerts and from weather data providers for tornado and other foul-weather alerting. We also provided services based on real-time traffic, sports, and newsfeeds.

All the data for these services was loaded into a large, six-node Teradata database system throughout the day. We used a combination of near-real-time and true real-time data loading technologies, all custom built in C++ and Visual Basic. We stumbled onto most of the challenges discussed in this article, and slowly but surely figured out ways to get around them.

At the peak of our service, we were providing real-time alerts to more than 350,000 users through email, wireless, and voice technologies, and we had capacity for more than one million users. While the business model of our dot-com didn't work out as planned (whose did?), the technology was rock solid. While some custom development may still be required to mitigate some of the challenges of real-time warehousing, advances in real-time ETL and other related technologies are now making things much easier.

Most ETL tools are designed for batch loading in three steps: extract the data, transform it, and load it. This approach works well when you have a few hours overnight when the data warehouse is offline. But what about doing it throughout the work day?

Two good approaches are possible. First, you can use traditional ETL tools and approaches more frequently than daily or weekly — say, a few times a day, or maybe hourly. If you can extract the source data without interfering with production operations, it can then be transformed as usual. Loading into the data warehouse is a bit trickier, as the system can't be taken offline for the load. But the new data can be loaded into a temporary set of tables, identical in structure to the historical tables. When the load is complete, the fresh tables are swapped with the previous intra-day tables, and the warehouse is instantly up to date.

You can use this near-real-time approach in conjunction with most existing warehouses to provide for fresh data hourly or several times per day. Although this strategy isn't truly real time, at least you can implement it using traditional ETL tools and techniques.

But if you require true real-time data in the data warehouse, the data needs to be trickle-fed continuously from the source system. This type of loading is beyond the abilities of batch-oriented, traditional ETL tools. Fortunately, a new class of products is on the market — such as Acta Technology Inc.'s ActaWorks and DataMirror Corp.'s Data Transformation Server — designed specifically for this type of real-time ETL. These tools open a virtual "data pipeline" between the source system and the warehouse, and ETL data as it changes in real time.

A second option for true real-time ETL is to custom-build a process to move and transform the data. One approach here is to use database triggers on the source system to fire off a process whenever data is added or changed. Open standards such as Java Messaging Service can transport the data, which can then be transformed and inserted directly into the data warehouse fact tables. Another possibility is to use database replication combined with SQL scripts to perform a similar function.

Both of these alternatives may work for some applications, but using a special-purpose, real-time ETL tool will help ensure scalability as well as support special operations such as the proper recording of intra-day record deletes in the data warehouse.

Although a real-time ETL tool or process may perform the mechanics of real-time data loading, the presence of constantly changing data in the data warehouse throws a monkey wrench into some of the other traditional warehousing methodologies. Data modeling, database tuning, report execution, and proactive information delivery also need to change for the data warehouse to become truly real time.

Challenge 2: Data Modeling for Real Time

The introduction of real-time data into a data warehouse raises some interesting data modeling issues. For example, if your warehouse has some or all of its data aggregated at various levels based on a time dimension, you should consider that the aggregated information may be out of sync with the real-time data.

For example, if an aggregate exists at the month level, including a month-to-date value for the current month, this preaggregated value will become outdated as soon as the first intraday load is completed. Therefore, either the current day's data can't be aggregated at higher levels, or these aggregates need to be updated for each incoming set of intraday data.

Also, some metrics such as month-to-date and week-to-date may "behave" strangely with a partial day of data that continuously changes. Although these challenges can be overcome at the application level, the main issue regarding modeling revolves around where the real-time data is stored and how best to link it to the rest of the data model.

In his Feb. 1, 2002 column "Real-Time Partitions," Ralph Kimball introduced the concept of storing intraday data in a real-time partition. This type of partition represents a separate set of tables that contain the partial day's data that is as up to date as the ETL process allows. These tables mirror the data structure of the regular warehouse fact tables.

You can join these tables into the rest of the data warehouse for query purposes by using either tool-level or database-level table partitioning features. They can also be wrapped with the rest of the warehouse by using a union view to join the historical and intra-day data into a single table as far as the query tool is concerned.

Another option is to model the real-time data using separate attributes and keep it logically separated in the query tool. This is an easier approach, but it comes at the expense of slightly more complexity for end users.

Challenge 3: Query Tools vs. Real-Time Data

When real-time data is loaded into an appropriate data model, the next trick is to get query and online analytic processing (OLAP) tools to peacefully coexist with constantly changing data. This is no small feat as most query and OLAP tools are built on the core assumption that data is static between loads.

The first problem typically encountered relates to report caching. Most tools use some level of data and report caching so that they can quickly respond to requests that are identical or similar to previous requests. This approach provides a huge performance boost for most data warehouse projects.

The problem, however, is that query tools and OLAP servers usually expire or refresh these caches when the data warehouse is reloaded. They usually don't have the capability to selectively expire caches based on a partial warehouse reload, as would occur with a near-real-time, intraday load schedule. Also, most tools lack the ability to intelligently cache only reports and data derived from the historical tables, but not from the real-time tables. Until tools are enhanced with this type of logic, the alternatives are to expire all caches upon each intraday load, or, if the tool allows, to disable caching for the reports that involve real-time data.

Report inconsistency is another common problem. Inconsistency stems from the fact that most OLAP tools perform complex analytics using multipass SQL: They'll translate a single report request into between three and 20 distinct SQL statements, which are run in series, one after the other. The results of the early queries are stored in temporary tables that are then used to produce the final result set.







IE Weekly Newsletter
Subscribe to the newsletter
    Email Address