Agility TrainingBringing real-time data into your strategic business apps is becoming a necessity. But where do you begin?
By Justin Langseth Continued from Page 1 But what if the data changes between these passes of SQL? For example, imagine a relatively simple report showing today's sales activity for a retailer summarized at the product-category level, with a percent-to-total metric. This report would translate into several passes of SQL, one of which would compute the grand total for all product categories, and then another that would compute the subtotals at the product-category level. If a new sale is recorded after the first SQL statement is running, but before the second SQL statement begins, the new sale will be reflected in the result for the second pass but not for the first. That will cause the report to show the grand-total metric calculated without the new transaction, but with the category sales subtotals including the new sale. This result is a report where the dollar sales figures don't add up to the total, and where the percent-to-total values are incorrect and actually add up to more than 100 percent. There's nothing like numbers on a single report not adding up to cause users to second guess the quality of a data warehousing implementation!
Unfortunately, the only way to get around this problem, besides resorting to simply near-real-time data loads or teaching users to tolerate the inconsistent results, is relatively complex and involves a class of tools not yet on the market. The ideal solution would be a system that creates a snapshot of the data at the instant before the query begins, and that allows all the SQL passes to operate off that snapshot. Holding these snapshots in system memory instead of on disk can also dramatically improve system performance and scalability. Challenge 4: ScalabilityQueries run against data warehouses tend to be long and complex and can take several minutes or more to execute. The scalability of most data warehouses is proportionate to the number of these complex queries being run simultaneously by various end users. This limit is typically described as the number of "concurrent users" that a data warehouse can handle. These concurrent-user numbers generally assume that only OLAP queries are being run against a
data warehouse. However, when a warehouse is continuously loaded with new data while OLAP queries
are run, CPU, disk, memory utilization, and load on the warehouse can dramatically increase. All the
concurrent There are several ways to mitigate this scalability impact. The simplest is to use near-real-time loading approaches instead of continuous feeds. If real-time data is trickle-fed into a holding table, and then every several minutes swapped with the production intra-day partition, near-real-time results can be achieved without loading and simultaneously querying the same tables. Other relatively simple options include adding more database processors or memory, or limiting real-time access to a subset of the user base. For applications that need to be truly real-time, the intraday data can be stored in an external in-memory database (IMDB), such as those available from TimesTen Performance Software or Polyhedra. Queries that only involve real-time data can then be routed to the IMDB. Queries that require both real-time and historical data can first assemble a snapshot of the required real-time data, image it quickly into a temporary table in the warehouse, and then run the query. You can also do that in reverse, where any needed historical data is imaged on demand into the IMDB - if the amount of historical data is small enough. These approaches are possible today using database gateway technologies, but this type of heterogeneous database access is likely to become a core feature of many query and OLAP tools in the near future. Challenge 5: Real-Time AlertingMost data warehouse alerting applications are used to distribute email versions of reports after the nightly data warehouse load. The availability of real-time data makes alerting applications much more appealing, as users can be alerted to real-time conditions as they occur. Currently, all data warehouse alerting technology works on a schedule or an event basis, as opposed to on a continuous, real-time basis. Consequently, to perform true real-time alerting using these tools, a custom-built additional process must continuously monitor the incoming data and trigger the events when appropriate. To approximate real-time alerting using off-the-shelf tools, the schedule can be set to trigger every five minutes or so. For a warehouse that is loaded on a near-real-time basis, all you need to do is to set the alerting schedule to trigger right after the data is refreshed. For a warehouse that is updated on a truly real-time basis, using an n-minute cycle schedule will introduce a certain amount of latency, as an alert can't be triggered until the next cycle window comes around after the threshold condition is met. If the cycle time is low enough, and the alert is to be sent via email anyway, a five-minute latency may be acceptable for many applications. It's tempting to set the cycle threshold as low as possible, but this can introduce certain complications. Frequent alert cycles will impose more load on the data warehouse, which can affect performance. Also, if a cycle begins before the previous cycle has completed, users may receive duplicate alerts. Regardless of whether alerts are generated using near-real-time batch cycles or by a real-time triggering system, it's critically important that your users' alert thresholds are properly managed. For example, imagine users who ask to be alerted by email when the inventory level of any item in their stores drops to 5 percent of the normal level during the course of a day. When the inventory drops to this threshold, an alert should be generated. However, if the alerting application checks for threshold conditions every five minutes, and the inventory threshold is still valid, should the same alert be generated again? In most cases, users only need to receive such alerts once. The problem occurs when static threshold definitions, which are fine for systems that are loaded on a nightly or weekly basis, are applied to systems that are updated more frequently. It's unlikely that the user wants the same alert every five minutes. It might be better to alert users once levels reach 5 percent, and then maybe again at 2 percent, and then one final time when stock runs out completely. Or a brokerage customer may want to know when a certain stock exceeds $20, and then be alerted again at every two-dollar increment thereafter. This type of threshold management is necessary for real-time alert systems to be accepted by users. Unfortunately, it isn't natively supported by current warehouse alerting tools. Until this support is added, the best approach is to use the tools' postservice plug-in abilities to run custom SQL or procedures to directly update the users' thresholds based on the current data conditions. Time for Real TimeMany challenges I've discussed here will become more manageable over time, as database, ETL, OLAP, reporting, and alerting tool vendors begin to add features to their systems to better support real-time data streams. In the meantime, it's important to ensure that real-time warehousing systems are well planned and designed and thoroughly tested under realistic data and user-load conditions before they're deployed. The benefits of real-time data warehousing are becoming clearer every day. With the right tools, designs, advice, approaches, and, in some cases, tricks, real-time data warehousing is possible using today's technologies and will only become easier in the future. In any case, the time to begin planning and prototyping is now. Justin Langseth [langseth@claraview.com] is a founder and CTO of Claraview LLC. Claraview provides system architecture, project management, and implementation consulting services to companies and government agencies implementing real-time data warehouses and alerting systems.
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
| ||||||||||||||||||||||||||||||||









