
Intelligent Enterprise

Better Insight for Business Decisions

May 24, 2001



Mass Movement

Scalable data integration is crucial as businesses face increasing data volumes and transactions

By Philip Russom

Continued from Page 1
Big Data Hits the Scene


To describe the extremes of data integration that are now common, a new term has recently entered IT parlance. "Big data" - at least, the way most integration specialists define it - involves moving several gigabytes of data at a time and integrating it with multiterabyte databases.

Out there on the bleeding edge, a number of forward-looking companies are already coping with the biggest of big data:

  • A consumer packaged goods manufacturer maintains a data warehouse of granular data on 50 million customers, from which it generates million-record direct-mail lists.
  • An application service provider serves a half million targeted ads every day from a database of banners that doubles in size every three months.
  • A credit reporting company receives and integrates more than a terabyte of new data each month, and the volume of incoming monthly data will soon reach two terabytes.

Of course, the problem with big data is that it keeps growing. So you must implement infrastructure that scales up to meet today's requirements, as well as tomorrow's.

The common element across the real-world cases listed here is that they all leverage the parallel capabilities of hardware, operating systems, databases, and integration tools. Massively parallel systems are the only way to scale up to big data - and keep scaling as it gets even bigger.

Data Integration Challenges

As if rapidly growing data sets, transaction volumes, application numbers, and user counts weren't troublesome enough, several barriers to data integration performance make the situation worse. (I charted these trends as the horizontal axis in Figure 1.)

Shrinking batch windows. The tradition of executing data integration as batch processes in the dark of night - when user and system activity is lowest - may soon be a limited option. For example, IT departments typically conduct data integration in support of e-commerce at night for functions like dot-com order fulfillment and product catalog data movement. But nightly "batch windows" are shrinking as business in general becomes more global and as e-business is increasingly conducted over the geography- and time-zone-free Internet. Corporations fuel the data integration crisis by asking IT professionals to integrate unprecedented volumes of data while giving them ever-decreasing amounts of time in which to accomplish this Herculean feat.
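To make the squeeze concrete, the short Python sketch below computes the sustained throughput a job needs in order to finish inside its batch window. The function name and the figures in the usage note are hypothetical illustrations, not measurements from any of the companies mentioned in this article.

```python
def required_throughput_mb_per_s(volume_gb, window_hours):
    """Sustained MB/s needed to move volume_gb within window_hours.

    Assumes 1 GB = 1,024 MB and a job that runs for the whole window.
    """
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return (volume_gb * 1024) / (window_hours * 3600)


if __name__ == "__main__":
    # Hypothetical numbers: the same 100 GB load in a shrinking window.
    for hours in (8, 4, 2):
        rate = required_throughput_mb_per_s(100, hours)
        print(f"{hours}-hour window: {rate:.1f} MB/s")
```

Halving the window doubles the required rate, and any growth in volume multiplies it again, which is why a process that fit comfortably last year can suddenly overrun.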

Complexity of distributed environments. The emergence of client/server in the 1980s greatly increased the complexity of computing architectures. Following the infusion of Web-based architectures in the 1990s, computing architectures have pushed forward into a new extreme of distributed computing. In terms of data integration, many corporations face a daunting list of source and target components that must be integrated. As the number of components grows, so does the number of data transformations, staging areas, and data caches.

Preserving network bandwidth. In fully digitized businesses, network bandwidth is already in peril. Numerous applications and processes depend on the network and, as data integration efforts scale up, the regular movement of massive data sets may degrade network performance considerably. Employees, customers, and partners in e-business environments have little patience for sluggish applications - to the point that a loss of performance may lead to a loss of revenue. Although new network technologies - which double bandwidth capacity - appear every 18 months or so, companies cannot afford the time and expense of constant upgrades. These restrictions force companies to choose data integration processes that place fewer demands on bandwidth.

Rising performance expectations. As recently as the mid-1990s, data integration processes could run daily or weekly, and no one worried much if the process failed occasionally. But those days are long gone because business people in today's fast-paced markets need to analyze enterprise performance today instead of waiting a day or a week to review the results. Furthermore, once you provide self-service systems to employees, customers, and partners, they expect the most up-to-date information possible. Therefore, data integration processes must run frequently and without error.

Scalable Technology Requirements

As these trends in volume and performance move forward, they push data integration toward a crisis of scalability. To achieve scalable data integration under extreme conditions, IT personnel need to adopt different types of scalable technologies:

Parallelism, the top priority. The most important technology for scaling data integration is parallelism. Put simply, multiple computing processes that run in parallel complete sooner than the same processes run one after another. Here are some "parallelized" features to look for in scalable data integration tools:

  • Bottleneck-free parallel processing. While most data integration tools support some form of parallel connectivity (largely for loading data into a database or extracting it from a database), the parallel flows of data typically converge into a single thread inside the tool for the transformation process, creating a bottleneck that prevents scaling. Therefore, a data integration tool should "parallelize" all its internal functions, even those for transformation and aggregation.
  • Multithreading and SMP support. Enabling parallelism requires multithreaded server-based software that can allocate jobs and processes among separate CPUs on hardware supporting symmetric multiprocessing (SMP).
  • Automatic job distribution. A data integration tool should automatically recognize, create, and manage multiple threads. Most tools today, however, require time-consuming hand coding to distribute jobs across CPUs. Ideally, the tool should also distribute processes across computers rather than merely across CPUs within a single computer, as is usually the case.
  • Parallel processing platforms. A few data integration tools can act as platforms for "parallelizing" a variety of data movement applications. This function is useful for large IT organizations that use a combination of nonparallelized third-party applications, custom applications, home-grown C or C++ programs, and legacy Cobol programs.
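The bottleneck-free idea in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular integration tool works: the record layout (customer, amount_cents) and every function name are hypothetical. Each partition is transformed and aggregated inside its own worker, so the parallel flows never converge into a single thread until the final, inexpensive merge of partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts round-robin slices."""
    return [records[i::n_parts] for i in range(n_parts)]


def transform(record):
    """Hypothetical cleanup step: normalize the customer name."""
    return {
        "customer": record["customer"].strip().upper(),
        "amount_cents": record["amount_cents"],
    }


def transform_and_aggregate(part):
    """Runs entirely inside one worker: transform, then sum per customer."""
    totals = {}
    for record in part:
        row = transform(record)
        key = row["customer"]
        totals[key] = totals.get(key, 0) + row["amount_cents"]
    return totals


def merge_totals(partials):
    """Combine the small per-partition results into one result."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged


def run_pipeline(records, workers=4):
    parts = partition(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(transform_and_aggregate, parts))
    return merge_totals(partials)


if __name__ == "__main__":
    sample = [
        {"customer": " acme ", "amount_cents": 1250},
        {"customer": "Acme", "amount_cents": 750},
        {"customer": "zenith", "amount_cents": 100},
    ]
    print(run_pipeline(sample, workers=2))
```

A thread pool keeps the sketch portable and self-contained. In standard Python, threads do not speed up CPU-bound transformations, so a process pool or separate machines would be the natural substitute when the work is compute-heavy; the partition, process, and merge structure stays the same.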

Scalable extract and load. When you're extracting data from source databases and loading it into target databases, your scalability depends on the query optimizers and bulk loaders of the source and target databases involved. Unfortunately, the capabilities of query optimizers and bulk loaders vary greatly from one brand of DBMS to another and sometimes among releases of the same brand. Most of them, however, support multiple performance modes. To avoid scalability-limiting bottlenecks, the programs you write and the tools you use should support the highest performance mode, even if you must alter your application design and data structures to achieve it.

Extract and load can also be accomplished with SQL statements transported via ODBC. When query optimizers and bulk loaders are not available, ODBC is a fine stop-gap measure, but the poor performance of most ODBC drivers makes them unacceptable to organizations that demand scalable data integration.
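The difference between load modes can be illustrated with a small Python sketch. It uses the standard library's sqlite3 module purely as a stand-in database; it is not a DBMS bulk loader or an ODBC driver, and the table and function names are hypothetical. The first function sends and commits one row at a time, and the second submits rows in batches with one commit per batch, which is closer in spirit to a higher performance mode.

```python
import sqlite3

INSERT_SQL = "INSERT INTO sales (customer, amount_cents) VALUES (?, ?)"


def create_table(conn):
    conn.execute("CREATE TABLE sales (customer TEXT, amount_cents INTEGER)")
    conn.commit()


def load_row_by_row(conn, rows):
    """One statement and one commit per row. Returns the commit count."""
    commits = 0
    for row in rows:
        conn.execute(INSERT_SQL, row)
        conn.commit()
        commits += 1
    return commits


def load_batched(conn, rows, batch_size=1000):
    """Many rows per call, one commit per batch. Returns the commit count."""
    commits = 0
    for start in range(0, len(rows), batch_size):
        conn.executemany(INSERT_SQL, rows[start:start + batch_size])
        conn.commit()
        commits += 1
    return commits


if __name__ == "__main__":
    rows = [(f"customer-{i}", i) for i in range(2500)]
    conn = sqlite3.connect(":memory:")
    create_table(conn)
    print("batched commits:", load_batched(conn, rows))
    print("rows loaded:", conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])
```

Both functions load identical data; they differ only in how many round trips and commits the database must absorb. How much faster the batched path runs depends on the database and driver, so measure it on your own platform rather than assuming a particular gain.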






