Why analytical scalability is so critical to e-business success
by Richard Winter
In the April 10 issue ("More Than You Hoped For"), I discussed clickstream warehousing. I focused on the huge volumes of data involved, and I explained why they will continue to grow and why most enterprises will choose to retain most or all of the clickstream for continuing analysis. These factors alone add up to an immense scalability challenge.
In a confirmation of that initial analysis, Topher Heigham, director of database technology at NetGenesis, offered the following scenario: Today, a large (but not unusually large) Web site will get 100 million hits per day. Web servers log about 200 bytes of data for each hit. That is, we must analyze 20GB of raw data per day. Leading sites today are keeping this data or keeping summarizations of it for three years.
Depending on how long sites keep the raw data and what level of summary they retain in the longer term, this storage will result in a clickstream warehouse of about 3 to 20 terabytes (TB) at a steady state. Of course, the state is not a steady one: virtually all sites are seeing growing traffic volumes enabled by wireless devices, mobile devices, Web appliances, broadband connections, and rapidly growing worldwide user populations. So, most sites with 100 million hits per day need to plan for approximately 6 to 40TB of stored data.
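The arithmetic behind these estimates can be checked with a short sketch. The input figures (100 million hits per day, about 200 bytes per hit, three years of retention) come from the scenario above. The function names, the decimal units (1GB = 10^9 bytes), the 365-day year, and the constant traffic level are assumptions made purely for illustration.

```python
# Back-of-envelope sizing for a clickstream warehouse.
# Assumptions: decimal units (1 GB = 10**9 bytes), 365-day years,
# constant traffic, and no compression, indexes, or summarization.

BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12


def daily_raw_bytes(hits_per_day: int, bytes_per_hit: int) -> int:
    """Raw log volume produced in one day."""
    return hits_per_day * bytes_per_hit


def retained_raw_bytes(hits_per_day: int, bytes_per_hit: int, years: int) -> int:
    """Raw log volume if every hit is kept for the whole retention period."""
    return daily_raw_bytes(hits_per_day, bytes_per_hit) * 365 * years


if __name__ == "__main__":
    daily = daily_raw_bytes(100_000_000, 200)
    total = retained_raw_bytes(100_000_000, 200, 3)
    print(f"per day: {daily / BYTES_PER_GB:.0f} GB")           # per day: 20 GB
    print(f"three years, raw: {total / BYTES_PER_TB:.1f} TB")  # three years, raw: 21.9 TB
```

Under these assumptions, keeping every raw hit for three years comes to roughly 22TB, which is in line with the upper end of the range quoted above. Summarizing early moves the total toward the lower end.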
According to Dr. David Reiner, NetGenesis vice president of product strategy and development, keeping all the raw data long term today would be unusual. In fact, many sites reduce from raw hits to page views as soon as possible, then aggregate further after 60 to 90 days. This strategy brings the estimated size down further.
But, however you slice it, it is clear that e-business sites will measure their clickstream warehouse volumes in at least terabytes. A particularly large-scale e-business operator or a portal site will have much higher clickstream volumes, as Microsoft exemplified in January when it logged data at about 20 times the rate I've shown in the preceding example. As one specialist with whom I spoke in January reported, working with a site logging as many as a billion clicks per peak hour suggests a three-year clickstream that could approach a petabyte (PB), or 1,000TB.
So, the clickstream volume, even if you assume that you retain only a portion after parsing, filtering, and initial analysis, is formidable and will increase stored data volumes well beyond previous levels in many data warehouses. Not many data warehouses support query, analysis, data mining, or reporting on even 6TB of data today, to use the low-end estimate of size for the sake of discussion. And not many IT executives would express confidence that their current platforms and architectures are up to that job, even as they go about creating e-business sites that will generate 100 million clicks or more a day.
I believe that many businesses face a widening gap over the next several years: a disparity between their needs and their capabilities for effective analysis of Web site activity, as Figure 1 shows. Closing the gap requires a complete infrastructure for e-business intelligence, and this infrastructure requires advances in several areas. Even so, I believe that the creation of a sufficiently scalable infrastructure is in fact the most critical and problematic aspect of the gap.
FIGURE 1 Widening gap between database capabilities and requirements for Web analytics.
The nature and magnitude of the scalability problem go beyond even what the immense data volumes suggest. What really cranks up the scalability requirement is the depth of analysis required to extract business value from clickstream data. And because developers must accomplish this analysis in a timely and cost-effective manner, I believe the only practical solution is to push back many major components of the scalability problem into the database.
According to Dr. Reiner, the key to the problem is getting beyond clicks and page views toward something that is meaningful to the business. Today's site operators are typically living with reports about site traffic, most popular pages, and other measures that tell little or nothing of interest to business executives. The business decision makers want reporting against measures of customer profitability, product profitability, customer satisfaction, and return on investment. Those who get periodic reports about page views and site visits usually have no idea what to make of them.
In addition, e-marketers who must make decisions about how to organize and present information on the Web site want answers to questions such as: What pattern of behavior by a Web site visitor is followed by purchase of a high-margin fashion product? And, which changes to the Web site facilitate the high-payoff behaviors?
As it turns out, such innocent questions are devilishly difficult to answer from clickstream data. Here are some of the issues involved that have particular implications for scalability, given the huge data volumes we must process:
Multisource data integration. What we think of as a single large Web site consists of many servers, each producing its own stream of data about its inputs, outputs, and actions. Web site service is usually functionally partitioned into user registration, personalization, ad service, commerce, media, catalog, chat room, email, and other services that may be specialized by application, so there is logically a server for each of these purposes. These servers run software from different companies and therefore do not log activity using the same data formats, structures, and concepts. In addition, servers of a particular type are often replicated as usage grows and disperses geographically (for example, you may need to locate a complete server complex in each of several countries, or replicate a given server within a country or region to handle increased usage). So putting the separate logs back together each day is well beyond all the King's horses and all the King's men and is in itself a major undertaking even with today's technology. The logs and data streams are numerous and large, and they pose integration challenges specific to the Web environment. For example, the clocks of all the servers may not always be fully synchronized. And a single user in a single session will typically interact with many separate servers. It is essential to tie those separate interactions together in the right order to get a clear picture of the behavior of any one visitor. These are challenging integration problems, compounded by the immense size of the logs and the pressure in the world of e-business to do everything quickly.
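To make the integration step concrete, here is a minimal sketch of stitching per-server logs into one ordered trail per visitor session. Everything in it is hypothetical: the `Event` record layout, the idea that a shared session identifier is available on every server, and the per-server clock offsets, which are assumed to have been estimated elsewhere. Real logs differ by vendor and would first need parsing into a common format.

```python
# Sketch: stitch per-server logs into one time-ordered trail per session.
# The record layout and the clock offsets are hypothetical.
import heapq
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Event(NamedTuple):
    timestamp: float   # seconds, as recorded by the logging server
    session_id: str    # assumed to be shared across servers (e.g. a cookie)
    server: str
    action: str


def corrected(events: Iterable[Event], offset: float) -> Iterator[Event]:
    """Shift one server's timestamps by its estimated clock offset."""
    for e in events:
        yield e._replace(timestamp=e.timestamp + offset)


def stitch(logs: Dict[str, List[Event]],
           offsets: Dict[str, float]) -> Dict[str, List[Event]]:
    """Merge per-server logs (each already time-ordered) and group by session."""
    streams = [corrected(events, offsets.get(server, 0.0))
               for server, events in logs.items()]
    sessions: Dict[str, List[Event]] = defaultdict(list)
    # heapq.merge lazily merges sorted inputs, so no full sort is needed.
    for event in heapq.merge(*streams, key=lambda e: e.timestamp):
        sessions[event.session_id].append(event)
    return dict(sessions)


if __name__ == "__main__":
    logs = {
        "catalog": [Event(100.0, "s1", "catalog", "view item"),
                    Event(130.0, "s1", "catalog", "view confirmation")],
        # Assume the commerce server's clock runs 20 seconds slow.
        "commerce": [Event(95.0, "s1", "commerce", "add to cart")],
    }
    trail = stitch(logs, {"commerce": 20.0})["s1"]
    print([e.action for e in trail])
    # ['view item', 'add to cart', 'view confirmation']
```

Without the offset correction, the "add to cart" event would appear before the item was ever viewed, which is the kind of ordering error that unsynchronized clocks introduce.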
Users are often anonymous. Until users make the decision to register and begin visiting regularly under registered identities, they are anonymous. At the same time, the behavior patterns that result in converting anonymous users to voluntarily registered users are critically important to the successful Web site. Thus, it is important whenever possible to analyze data associated with anonymous as well as registered users and to make the most of that information. Dealing with the complexity and uncertainty of data about the large number of anonymous visitors to a major Web site adds an additional dimension to the analysis.
Data must be mapped and aggregated from the level at which it is initially logged to a level at which it can be meaningfully interpreted. For example, servers log clicks and keystrokes, but even basic analysis starts with page views. Whereas the former has to do with each action with the mouse or keyboard, and in some cases, data about how the page is assembled, the latter has to do with what fully assembled pages the user visited in the site and what they are about. But, the mapping and aggregation goes beyond page views. For example, if the Web site is offering a promotion on a particular Caribbean vacation, that vacation will actually be associated with a group of pages, which NetGenesis calls a Superpage. So, one key type of analysis is really about Superpages, which represent yet another level of aggregation and mapping. The same is true in every dimension of the analysis. For example, to understand customer behavior, you may need to examine not only all the visitors who look at vacations to Antigua but all visits related to Caribbean vacations or warm weather beach vacations. Thus, you may need to aggregate data about specific products to get data about product groups or product categories. This aggregation along dimensions such as product, store, channel, customer, and time is typical of data warehousing in other commercial contexts. With Web analytics, however, the data volumes are unusually large, therefore defeating approaches that work in other contexts.
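The layered mapping described here can be sketched in a few lines. The page paths, Superpage names, and category names below are invented examples, and the two lookup tables stand in for whatever mapping a real site maintains; this is an illustration of the roll-up idea, not a description of how NetGenesis implements Superpages.

```python
# Sketch: roll page views up to Superpages and then to product categories.
# The page names and both mappings are hypothetical examples.
from collections import Counter
from typing import Dict

PAGE_TO_SUPERPAGE = {
    "/antigua/overview": "Antigua vacation",
    "/antigua/hotels": "Antigua vacation",
    "/aruba/overview": "Aruba vacation",
    "/aspen/overview": "Aspen ski trip",
}
SUPERPAGE_TO_CATEGORY = {
    "Antigua vacation": "Caribbean vacations",
    "Aruba vacation": "Caribbean vacations",
    "Aspen ski trip": "Ski vacations",
}


def roll_up(counts: Counter, mapping: Dict[str, str]) -> Counter:
    """Aggregate counts from one level of the hierarchy to its parent level."""
    parents: Counter = Counter()
    for item, n in counts.items():
        parents[mapping.get(item, "(unmapped)")] += n
    return parents


if __name__ == "__main__":
    page_views = Counter([
        "/antigua/overview", "/antigua/hotels", "/aruba/overview",
        "/aspen/overview", "/antigua/overview",
    ])
    superpages = roll_up(page_views, PAGE_TO_SUPERPAGE)
    categories = roll_up(superpages, SUPERPAGE_TO_CATEGORY)
    print(dict(categories))
    # {'Caribbean vacations': 4, 'Ski vacations': 1}
```

The same function serves every level because each step is just a many-to-one mapping followed by a sum. At clickstream volumes, the difficulty lies in doing this over billions of rows rather than in the logic itself.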
Changing hierarchies. One of the strengths of Web-based commerce is the medium's fluidity. It is much easier to change or personalize a Web site than it is to change the layout of 2,000 stores. It is quicker, cheaper, and surer to change the way an e-commerce application presents a discount offer than it is to teach 5,000 agents in call centers a different way of presenting a discount. But, the medium's very fluidity makes analysis more difficult. If it will help sales, increase customer satisfaction, or accomplish any one of 10 other business objectives, e-marketers will quickly and readily reorganize their product groupings and categories. The products appearing on a given page will change, as will the images presented with the product description. New products will be introduced rapidly. Prices can be changed with a frequency unheard of in stores. Competitors will leave and enter markets, changing tactics rapidly. Every aspect of the e-business environment is changing so rapidly and frequently that it is far more difficult to relate cause and effect.
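One common way to keep analysis meaningful when groupings change is to record each assignment with the date it took effect and to resolve every event against the hierarchy in force at that time. The column does not prescribe this technique; the sketch below is offered only as an illustration, and the product name, categories, and dates are hypothetical.

```python
# Sketch: resolve a product's category as of the date of each event,
# so history is analyzed against the hierarchy in force at the time.
# Product names, categories, and dates are hypothetical.
from bisect import bisect_right
from datetime import date
from typing import Dict, List, Optional, Tuple

# For each product: (effective_from, category), sorted by effective_from.
CATEGORY_HISTORY: Dict[str, List[Tuple[date, str]]] = {
    "beach-towel": [
        (date(2000, 1, 1), "Home goods"),
        (date(2000, 6, 1), "Summer specials"),
    ],
}


def category_as_of(product: str, when: date) -> Optional[str]:
    """Return the category that applied to `product` on `when`, if any."""
    history = CATEGORY_HISTORY.get(product, [])
    starts = [start for start, _ in history]
    # Index of the first assignment that starts after `when`.
    i = bisect_right(starts, when)
    return history[i - 1][1] if i > 0 else None


if __name__ == "__main__":
    print(category_as_of("beach-towel", date(2000, 3, 15)))  # Home goods
    print(category_as_of("beach-towel", date(2000, 7, 4)))   # Summer specials
```

Keeping the dated history costs extra storage and an extra lookup per event, but it lets the same clickstream be analyzed under either the old or the new grouping.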
The need to look back. If users could predict what they needed to know, you would be able to map, index, and aggregate data in terms of the specific known needs and have to retain a lot less data. One of the cruel ironies of the situation is that you have a critical need to look back and re-analyze data, often in terms of attributes or factors you didn't know were critical at the time of data capture. One reason for this is that most sites make money only on repeat customers. In a 1999 study, Forrester Research reported that the average e-commerce site spent $250 to acquire a customer and made $25 profit on the customer's first transaction. The press has reported that, on average, Amazon needs to retain a customer for 2.5 years before it makes money. The problem is that you don't know which customers will be profitable repeat customers until they make enough purchases to establish the pattern. At that point, you want to look back at their history with the site and identify other customers following a similar pattern. You can't be sure what data will be significant in this analysis until after the customer has become a repeat customer. So, e-business uniquely favors retaining comprehensive histories and pushes the likely data warehouse sizes away from the low end of the range I've defined in this column's introduction and well into the range of 10TB for many large e-commerce sites.
These are a few examples of what makes scalability a particular challenge in the analysis of large, rapidly growing clickstreams to produce business insights. In general, e-business intelligence makes more intensive demands on the database platform for scalability and performance than I have seen in other data warehouse applications. Some of these requirements are to be able to do the following efficiently on exceptionally large volumes of data:
Load, aggregate, and index data rapidly
Implement complex and rapidly changing relationships to deal with complex and uncertain user identities and numerous, frequently changing dimensions and hierarchies
Deal with a large number of attributes and focus on them efficiently in complex and unpredictable ways
Support high-volume, deep analytic processing
To perform these functions well, a database engine needs strong optimization, highly parallel operation, highly scalable index structures, excellent implementation of analytical functions, and efficient access techniques. In particular, as Topher Heigham points out, the combination of handling large volumes of incoming data and deep analysis produces conflicting indexing and tuning requirements.
Of course, the database product is hardly the full story here. The product you use to analyze the data both pre- and post-database load must itself be scalable and exploit the database effectively. That means you need both rapid intake and integration of new data, and rapid drill down, aggregation, and analysis of stored data, all major challenges to the Web analysis products.
I urge e-business executives to take notice of this point: Database scalability now looms large as the next major challenge to e-business intelligence. To succeed over the next two to three years, they will need to realize that they can't sit still and expect vendors to close the gap I've shown in Figure 1. They need an architecture, infrastructure, and strategy to achieve the database scalability so critical to e-business goals. That's a tall order, but if they define their requirements, actively develop their options in advance, and measure them early and often, they can get there.