http://www.intelligententerprise.com/010507/feat2_1.jhtml


Mass Movement


Scalable data integration is crucial as businesses face increasing data volumes and transactions

by Philip Russom
Executive Summary

Ongoing trends are driving up data volumes and making data integration increasingly difficult to perform. If your company moves and integrates a gigabyte or more of data daily, these escalating trends will soon lead to a data integration crisis that could adversely affect your important business processes.

Therefore, to achieve scalable data integration under extreme conditions, your IT organization should adopt techniques such as parallelism, as well as consider simply adding more hardware.

How do you define scalability? The definition depends on what you're trying to achieve with your IT systems. For many, the top priority for scalability is high-speed processing that enables great numbers of transactions per second. For others, the primary need is a system that can scale up to large user counts or voluminous data storage. However, in the world of data integration, scalability means massive data throughput. You may be facing a data integration crisis sooner than you think.

Today, two groups of trends are driving the business demand for data integration and the challenges to its implementation. (See Figure 1.) On the one hand, data volumes escalate daily because of increasing numbers of online transactions, applications needing integration, and users who demand access to integrated data. On the other hand, data integration becomes harder to implement because of shrinking batch windows, increasingly complex distributed computing environments, the need to preserve network bandwidth for applications, and rising expectations for performance.

As these trends move forward, they will drive many corporations toward an impending crisis of data integration scalability. In this article, I examine these trends as well as the technologies and IT practices that can help you prepare for a possible data integration crisis.

Trends Driving Data Integration Volume

The greatest challenge to data integration scalability is the explosion of data volume that almost all companies are currently experiencing. But similar challenges come from escalating numbers of transactions, applications, and users. (These trends appear on the vertical axis in Figure 1.)

Burgeoning data volumes. Many corporations already move gigabytes of data daily. This movement is part of a data-oriented strategy for integrating internal applications and transacting with business partners, as well as for loading decision-making data into data warehouses, corporate portals, and knowledge management systems. As companies operate more and more like e-businesses over the next few years, the daily volume of data movement will increase to hundreds of gigabytes, pushing the scalability of data integration technologies to their limits.

More transactions online. Direct-to-consumer online transactions are increasing steadily, although not as quickly as numerous failed B2C e-tailers had hoped. The most dramatic increase, however, concerns B2B trade. Market research firm Gartner Group forecasts that B2B online transactions will amount to $6 trillion in 2004. Dollar value aside, the number of transactions will increase correspondingly. Because B2B online transactions typically rely on data integration technology, scalability will be an important issue.

Increasing numbers of applications. The steady "digitization" of corporate processes - a long-running trend accelerated by e-business - brings new applications online regularly. These applications must integrate with existing systems so they can contribute to "information synergy" across the enterprise; otherwise, they become isolated silos whose data no other system can leverage. For instance, data integration tools and best practices typically link newer front-end applications (perhaps for e-commerce or other customer-facing tasks) with older back-office systems (for shipping, billing, inventory, and so forth). The IT organization in any company - large or small - must cope with an ever-expanding list of applications that require some level of integration.

Growing user communities. E-business is about performing business processes online, which improves efficiency and shares information broadly for better decision making. As the practice of e-business rises, the number of users logged into applications increases as well. And a large portion of these users need to access integrated data in corporate portals, data warehouses, and caches of cross-channel customer or partner data. IT in the Fortune 1000 today is already supporting thousands of users. As user communities grow, so will their demands for pre-integrated data.

Big Data Hits the Scene


To describe the extremes of data integration that are now common, a new term has recently entered IT parlance. "Big data" - at least, the way most integration specialists define it - involves moving several gigabytes of data at a time and integrating it with multiterabyte databases.

Out there on the bleeding edge, a number of forward-looking companies are already coping with the biggest of big data:

  • A consumer packaged goods manufacturer maintains a data warehouse of granular data on 50 million customers, from which it generates million-record direct-mail lists.
  • An application service provider serves a half million targeted ads every day from a database of banners that doubles in size every three months.
  • A credit reporting company receives and integrates more than a terabyte of new data each month, and the volume of incoming monthly data will soon reach two terabytes.

Of course, the problem with big data is that it keeps growing. So you must implement infrastructure that scales up to meet today's requirements, as well as tomorrow's.

The common element across the real-world cases listed here is that they all leverage the parallel capabilities of hardware, operating systems, databases, and integration tools. Massively parallel systems are the only way to scale up to big data - and keep scaling as it gets even bigger.

Data Integration Challenges

As if rapidly rising data sets, transaction volumes, application numbers, and user counts aren't troublesome enough, the situation is further exacerbated by barriers to the performance of data integration strategies. (I charted these trends as the horizontal axis in Figure 1.)

Shrinking batch windows. The tradition of executing data integration as batch processes in the dark of night - when user and system activity are lowest - may soon be a limited option. For example, IT departments typically conduct data integration in support of e-commerce at night for functions like dot-com order fulfillment and product catalog data movement. But nightly "batch windows" are shrinking as business in general becomes more global and as e-business is increasingly conducted over the geography- and time-zone-free Internet. Corporations fuel the data integration crisis by asking IT professionals to integrate unprecedented volumes of data while giving them ever-decreasing amounts of time in which to accomplish this Herculean feat.
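
To make the squeeze concrete, the sustained throughput a batch job needs is simply the data volume divided by the window length. The short Python sketch below works through that arithmetic. The function name and the sample figures (100 gigabytes, windows of eight and two hours) are illustrative assumptions, not measurements from any company discussed here.

```python
def required_throughput_mb_per_s(gigabytes, window_hours):
    """Sustained rate (MB/s) needed to move `gigabytes` within `window_hours`."""
    megabytes = gigabytes * 1024
    seconds = window_hours * 3600
    return megabytes / seconds


if __name__ == "__main__":
    # Hypothetical nightly load of 100 GB, first with a generous window,
    # then with a window that has shrunk to a quarter of its former size.
    for hours in (8, 2):
        rate = required_throughput_mb_per_s(100, hours)
        print(f"{hours}-hour window: {rate:.1f} MB/s sustained")
```

Because the relationship is a plain ratio, cutting the window in half doubles the required rate even if the data volume stays flat. When volume grows at the same time, the two effects multiply.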

Complexity of distributed environments. The emergence of client/server in the 1980s greatly increased the complexity of computing architectures. Following the infusion of Web-based architectures in the 1990s, computing architectures have pushed forward into a new extreme of distributed computing. In terms of data integration, many corporations face a daunting list of source and target components that must be integrated. As the number of components grows, so does the number of data transformations, staging areas, and data caches.

Preserving network bandwidth. In fully digitized businesses, network bandwidth is already in peril. Numerous applications and processes depend on the network and, as data integration efforts scale up, the regular movement of massive data sets may degrade network performance considerably. Employees, customers, and partners in e-business environments have little patience for sluggish applications - to the point that a loss of performance may lead to a loss of revenue. Although new network technologies - which double bandwidth capacity - appear every 18 months or so, companies cannot afford the time and expense of constant upgrades. These restrictions force companies to choose data integration processes that place lighter demands on bandwidth.

Rising performance expectations. As recently as the mid-1990s, data integration processes could run daily or weekly, and no one worried much if the process failed occasionally. But those days are long gone because business people in today's fast-paced markets need to analyze enterprise performance today instead of waiting a day or a week to review the results. Furthermore, once you provide self-service systems to employees, customers, and partners, they expect the most up-to-date information possible. Therefore, data integration processes must run frequently and without error.

Scalable Technology Requirements

As these trends in volume and performance move forward, they push data integration toward a crisis of scalability. To achieve scalable data integration under extreme conditions, IT personnel need to adopt different types of scalable technologies:

Parallelism, the top priority. The most important technology for scaling data integration is parallelism. Put simply, a set of computing processes that run in parallel completes sooner than the same processes run one after another. Here are some "parallelized" features to look for in scalable data integration tools:

  • Bottleneck-free parallel processing. While most data integration tools support some form of parallel connectivity (largely for loading data into a database or extracting it from a database), the parallel flows of data typically converge into a single thread inside the tool for the transformation process, creating a bottleneck that prevents scaling. Therefore, a data integration tool should "parallelize" all its internal functions, even those for transformation and aggregation.
  • Multithreading and SMP support. Enabling parallelism requires multithreaded server-based software that can allocate jobs and processes among separate CPUs on hardware supporting symmetric multiprocessing (SMP).
  • Automatic job distribution. A data integration tool should automatically recognize, create, and manage multiple threads. Most tools today, however, require time-consuming hand coding to distribute jobs across CPUs. Ideally, the tool should also distribute processes across multiple computers; most tools that distribute work at all do so only across the CPUs within a single computer.
  • Parallel processing platforms. A few data integration tools can act as platforms for "parallelizing" a variety of data movement applications. This function is useful for large IT organizations that use a combination of nonparallelized third-party applications, custom applications, home-grown C or C++ programs, and legacy Cobol programs.
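
The idea behind bottleneck-free processing can be sketched in a few lines of Python: split the data into partitions, transform every partition in its own worker process, and merge the results only at the end. This is a minimal illustration under stated assumptions, not how any particular data integration tool works internally. The record layout, the `transform` rule, and all function names are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor


def transform(record):
    """Hypothetical per-record transformation: tidy a name, convert cents to dollars."""
    name, cents = record
    return (name.strip().upper(), cents / 100)


def transform_partition(partition):
    """Transform one partition of records; each partition runs in its own process."""
    return [transform(record) for record in partition]


def partition_records(records, n):
    """Split records into at most n contiguous partitions of roughly equal size."""
    size = max(1, -(-len(records) // n))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def parallel_transform(records, workers=4):
    """Partition the data, transform the partitions in parallel, then merge in order."""
    partitions = partition_records(records, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(transform_partition, partitions)
        return [row for part in results for row in part]


if __name__ == "__main__":
    data = [(" alice ", 1250), ("bob", 990), (" carol", 100000), ("dave ", 5)]
    print(parallel_transform(data, workers=2))
```

The transformation step never funnels through a single thread, which is the property the first bullet asks for. The sketch spreads work across the CPUs of one machine only; distributing it across several computers, as the third bullet recommends, would need a separate coordination layer.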

Scalable extract and load. When you're extracting data from source databases and loading it into target databases, your scalability depends on the query optimizers and bulk loaders of the source and target databases involved. Unfortunately, the capabilities of query optimizers and bulk loaders vary greatly from one brand of DBMS to another and sometimes among releases of the same brand. Most of them, however, support multiple performance modes. To avoid scalability-limiting bottlenecks, the programs you write and the tools you use should support the highest performance mode, even if you must alter your application design and data structures to achieve it.

Extract and load can also be accomplished with SQL statements transported via ODBC. When query optimizers and bulk loaders are not available, ODBC is a fine stop-gap measure, but the poor performance of most ODBC drivers makes them unacceptable to organizations that demand scalable data integration.
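
The gap between a row-at-a-time load and a batched load can be illustrated with Python's built-in sqlite3 module. SQLite is used here only as a convenient stand-in; it is not one of the DBMS products this article has in mind, and real bulk loaders are vendor-specific utilities with their own performance modes. The `orders` table and both function names are assumptions made for the example.

```python
import sqlite3

INSERT_SQL = "INSERT INTO orders (id, amount) VALUES (?, ?)"


def make_db():
    """Create an in-memory database with a hypothetical target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    return conn


def load_row_by_row(conn, rows):
    """Slow path: one INSERT statement and one commit per row."""
    for row in rows:
        conn.execute(INSERT_SQL, row)
        conn.commit()


def load_in_batches(conn, rows, batch_size=10_000):
    """Faster path: many rows per call, one transaction per batch."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn:  # commits if the batch succeeds, rolls back on error
            conn.executemany(INSERT_SQL, batch)


if __name__ == "__main__":
    conn = make_db()
    load_in_batches(conn, [(i, i * 1.5) for i in range(50_000)])
    print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0], "rows loaded")
```

The principle carries over to any database: the fewer round trips and commits per row, the higher the throughput. A native bulk loader takes the same principle further by bypassing much of the SQL layer.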

Network and I/O considerations. To help preserve network bandwidth, some data integration tools support compression algorithms to streamline data sets being moved over a network. Of course, almost all data integration tools provide graphical environments for defining sophisticated scheduling of data integration processes to make the most of shrinking batch windows.
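
As a rough illustration of the compression approach, the Python sketch below compresses a data set before it crosses the network and restores it on the receiving side, using the standard zlib module. The function names and the sample rows are assumptions; actual tools may use other algorithms, and the savings depend heavily on how repetitive the data is.

```python
import zlib


def compress_for_transfer(payload: bytes, level: int = 6) -> bytes:
    """Compress a data set before sending it over the network."""
    return zlib.compress(payload, level)


def decompress_after_transfer(blob: bytes) -> bytes:
    """Restore the original bytes on the receiving side."""
    return zlib.decompress(blob)


if __name__ == "__main__":
    # Hypothetical extract: delimited rows with many repeated values.
    rows = "\n".join(f"{i},CUSTOMER,{i % 10},ACTIVE" for i in range(10_000))
    raw = rows.encode("utf-8")
    packed = compress_for_transfer(raw)
    assert decompress_after_transfer(packed) == raw
    print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

Compression trades CPU time for bandwidth, so it pays off most when the network, not the processor, is the scarce resource.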

One strategy followed by many IT departments is to colocate the data integration server and the target database server on the same computer (usually a large Unix box). This shared location greatly speeds up the bulk load process because memory and the local file system are far faster than communications over a network. Furthermore, colocation reduces the amount of data moved over a network, thus preserving bandwidth for other applications.

High availability. A data integration process designed for a large, heterogeneous environment can be very complex because it depends on the precise execution - in a rigidly determined order - of a long series of events that access a variety of systems. Considering the complexity and numerous dependencies of this process, it's no surprise that it occasionally fails to complete successfully - all the more reason why data integration tools need foolproof recovery mechanisms. For instance, so-called "check-pointing" can identify the precise point of failure, so a job can restart from there. With batch windows shrinking, there's no time to restart the job from the beginning.
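
A simple form of check-pointing can be sketched as follows: after each step of a job completes, record how far the job has progressed, and on restart skip everything already recorded. This is a minimal illustration under assumed names (`run_job`, a JSON checkpoint file, steps passed as plain functions); commercial tools implement recovery in their own ways.

```python
import json
import os


def run_job(steps, checkpoint_path):
    """Run steps in order, recording progress so a failed job can resume.

    `steps` is a list of zero-argument functions. After a failure, calling
    run_job again with the same checkpoint file skips the completed steps.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["completed_steps"]

    for index, step in enumerate(steps):
        if index < done:
            continue  # this step finished in an earlier run
        step()  # if this raises, the checkpoint still points at this step
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"completed_steps": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)  # atomic swap of the checkpoint

    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)  # job finished; the next run starts fresh
```

The sketch assumes each step is safe to run again if it fails partway through. Steps that leave partial results behind would need their own cleanup before the restart.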

Operating system issues. A large percentage of data integration solutions (whether home-grown or vendor-created) run on Microsoft Windows NT/2000 because of the affordability of Intel-based computers. In fact, many corporations move gigabytes of data daily with Windows-based data integration servers running on "parallelized" SMP computers. Yet many IT shops feel that Unix and mainframe operating systems are more capable of supporting the farthest extremes of data integration, which are in the range of tens of gigabytes per day.

Mainframes are often the source of gigantic data sets, yet they typically generate large text files with little or no preprocessing performed on the mainframe. This lack of processing is a shame because the CPU of a mainframe is capable of dozens of MIPS (millions of instructions per second), a computing resource so vast that it is seldom fully utilized. IT organizations that must integrate mainframe data should look for tools that transform and aggregate data natively on the mainframe so it can contribute to scalable data integration.

Survival Tips

Based on the trends and technologies discussed here, what can IT personnel do to survive the impending crisis in data integration?

Parallelize everything. A truly scalable solution for data integration consists of many components - computer hardware, operating systems, DBMS, and data integration servers. The data integration process is only as fast as its weakest component, and parallelization is the enabling technology for speed. Select components based on the strengths of their parallel technologies. If you have nonparallel legacy programs, consider data integration platforms that can parallelize these. Unless every component fully leverages parallel technologies, you may end up with bottlenecks, which prevent the high scalability that many companies will need as they face a data integration crisis.
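
The weakest-component rule can be stated in a few lines of Python: when data streams through a chain of stages, the end-to-end rate equals the rate of the slowest stage. The stage names and rates below are hypothetical figures chosen for illustration, not benchmarks.

```python
def pipeline_throughput(stage_rates):
    """Return (bottleneck stage, end-to-end rate) for a chain of stages.

    `stage_rates` maps each stage name to its sustained rate in MB/s.
    The slowest stage sets the pace for the whole flow.
    """
    bottleneck = min(stage_rates, key=stage_rates.get)
    return bottleneck, stage_rates[bottleneck]


if __name__ == "__main__":
    # Hypothetical rates: parallel extract and load, single-threaded transform.
    rates = {"extract": 40.0, "transform": 8.0, "load": 25.0}
    stage, rate = pipeline_throughput(rates)
    print(f"bottleneck: {stage} at {rate} MB/s")
```

In this example, adding CPUs to the extract or load stages would change nothing; only parallelizing the transform stage raises the overall rate.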

Rely on hardware. Hardware is a tried-and-true weapon in any battle with scalability. When selecting components for a data integration solution, be sure they automatically scale up as new hardware is added in the form of CPUs, memory, or disk arrays. This way, IT can keep pace with burgeoning data volumes and throughput requirements by simply adding more hardware.

Furthermore, the cost of computer hardware is decreasing, while data volumes are increasing. (See Figure 2.) This means that the expanding hardware component of a scalable data integration solution can be relatively cost effective.

But calculate your budget carefully. You'll need to decide which hardware commitment gives your organization the best cost-to-performance ratio over time - a few monolithic servers that you can expand individually or a farm to which you add small servers as needed. Either way, the hardware components should support parallel technologies, which are key to the scalability strategy of any data integration solution.



Philip Russom, Ph.D. [www.philiprussom.com] is an independent industry analyst and consultant based in Waltham, Mass. He was formerly the director of business intelligence at the Hurwitz Group.



