http://www.intelligententerprise.com/010130/feat2.jhtml

The Model Customer

With Web site complexity increasing, direct marketing principles can suggest a more robust approach to clickstream analysis

By Thomas F. Richebacher

What do you really know about your Web site customers? If you think you're getting enough data from your server logs, think again. Typical server logs don't give you enough information about your visitors to support important strategic decisions. And without a complete customer view, you'll be left guessing about user behavior.

In today's fast-paced economy, your company must be able to develop and extract value-added data from your Web site and integrate it with offline information to provide managers with data that moves them away from scorekeeping and toward strategic analysis. Unfortunately, while the need for better information is increasingly vital, Web site complexity is growing exponentially. Web sites are deployed on geographically distributed servers and traffic on popular sites keeps going up. Static "brochure" pages are long gone. Instead, dynamically developed Web pages are based on query strings or profile criteria. While some of this complexity is technology driven, competitive pressures and higher customer expectations are the main market drivers.

To succeed, your company must be able to gauge its ability to meet its own objectives, as well as its users'. Your measurements have to be results-oriented, designed to support decision making, and, because Web sites are dynamic, flexible enough to support continuous improvement efforts.

Web Site Data Collection Today

Current Web site data collection methods and analysis rely primarily on Web server logs that collect data on user actions such as page requests, clicking on pictures, sending email, or filling out forms. Log analyzer tools summarize this data as counts: most requested pages, exit-and-entry pages, popular paths traveled, visitor counts, referrer counts, and so on.

Traditionally, this information was used to ensure smooth site operation, monitor Web traffic, and determine potential bandwidth problems to meet customer demand and response time objectives. Because of its realtime delivery, marketers are now also using this information to evaluate response to marketing campaigns, adjust ad placement, and redirect traffic. However, three problems exist with using Web server log data as the basis for strategic analysis:

Data volumes. Microsoft.com alone records 60 million page views, 300 million hits, and 4.1 million users per day. The group of sites that it operates (including msn.com, microsoft.com, expedia.com, and hotmail.com) generates approximately 200GB of data per day, or 73TB per year. If Microsoft wished to develop unified user clickstream profiles across all sites, it would first have to combine, sort, and process all server logs -- a time-consuming task, and perhaps an impossible one within reasonable time constraints.

Granted, most firms will not have to support Microsoft's data volume. But even if a site generates only one one-hundredth of Microsoft's traffic, 2GB per day still need to be processed and stored. That is no small task -- especially when you should keep data for at least four years for analytical purposes. This volume is expected to increase. A 1999 Forrester study, Online Retail Data Strategies (May 1999), states that 20 out of 54 online retailers expect data volumes to increase tenfold within two years. When wireless application use achieves critical mass in 2004, the click volumes will really increase.
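
The arithmetic behind these figures can be checked in a few lines. This is only a back-of-the-envelope sketch: it assumes 1TB = 1,000GB and a 365-day year, and the function and variable names are illustrative rather than taken from any tool.

```python
# Back-of-the-envelope storage estimate using the figures quoted above.
# Assumptions (not from the article): 1 TB = 1,000 GB and a 365-day year.
GB_PER_TB = 1_000
DAYS_PER_YEAR = 365


def yearly_volume_tb(gb_per_day: float) -> float:
    """Convert a daily log volume in GB to a yearly volume in TB."""
    return gb_per_day * DAYS_PER_YEAR / GB_PER_TB


microsoft_gb_per_day = 200
small_site_gb_per_day = microsoft_gb_per_day / 100  # one one-hundredth of the traffic
retention_years = 4

print(yearly_volume_tb(microsoft_gb_per_day))  # 73.0 TB per year
print(small_site_gb_per_day)                   # 2.0 GB per day
# Raw logs the smaller site accumulates over the four-year retention period, in TB.
print(yearly_volume_tb(small_site_gb_per_day) * retention_years)
```

Even the smaller site ends up holding close to 3TB of raw logs over four years, before any growth in traffic is taken into account.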

Incomplete data. During the Internet "Stone Age," approximately 1993, Web sites were usually based on a Unix file system and static HTML, which made it easy to track user behavior. You simply followed the page request entries from the Web server log.

Things have changed: Now every site with a search engine accepts user input and matches it against the text in a file system or database to create personalized pages. Here's the problem: In all such cases the Web server log records only the name of the program that processes the input, but the nature of the information returned to the user bypasses the log altogether. In effect, what users input or see is hidden in the communication layer between the browser, server, and back-end application -- making it impossible to intelligibly interpret user actions.

Lack of integration with non-Web data. Web site data does not exist in a vacuum. You need to develop information within its full context to pinpoint strengths and weaknesses.

When your company uses multichannel distribution or communication methods, you must examine your Web site's effectiveness in relation to all direct (mail, phone, email, and so on) and indirect (distributors, retailers, VARs) channels. You can't assume that a Web site is profitable because revenues exceed costs; rather, revenues migrate across channels according to customers' preferences and business policies that encourage the use of more efficient channels. With an increase of customer contact points, the focus of customer analysis moves from detecting lifts to understanding shifts.

What Can You Do?

Obviously, you need to take into consideration the increased complexities resulting from more elaborate Web site designs, increased traffic, and channel proliferation during the last five years when you collect and analyze data from Web sites. And you can no longer rely on log file data alone.

There are situations in which most log-file data needs to be stored because of an outside agency's requirements. For instance, ABC audit requirements for magazine publishers specify that log variables (HTML tags of pages requested, protocol, browser software, and so on) should be retained for four years for users who complete critical transactions, such as purchases or address changes.

But in general, log files are becoming too large to process, too incomplete, and in their raw form, impossible to combine with offline data. Therefore, we need an alternative method to develop data.

Take a Conceptual Perspective

Traditional direct marketing principles can assist you here. They can help you understand the significance of data definition. Data definition is the first step toward analysis that allows tracking behavior from individuals, modeling response and profitability, developing customer segments, and estimating lifetime value.

The main problem with applying direct marketing principles to Web sites is that they evolved within a very controlled environment. The direct marketing firm decides on the offer, the copy, and the list, and therefore the data model that describes how variables are defined, collected, and relate to each other is unambiguous. Unfortunately, this is not true on the Internet.

Web site users select themselves through search engines, links, banner ads, or other media. They enter and leave sites from anywhere and decide when and what they want to see. The users are in control, not the Web site publisher, and this means that literally infinite variations of data elements can exist. While variation has its analytic advantages, too much variation becomes meaningless. Also, storing every possible variable creates the nemesis of too much data and too much processing time.

Does this mean that direct marketing principles are useless on the Internet? No, it means that these principles need to be proactively extended and applied during the site's conceptualization and design stages.

The three most basic steps to accomplish this goal are:

Defining objectives and critical activities. Knowing your objectives is central to understanding what data you should collect and how to define it. But decisions about which activities to measure are even more important because objectives overlap more easily than activities. Objectives overlap because every firm needs to accomplish similar things and know:

  • Who its customers are
  • How to track online surfing and shopping habits
  • How to manage content
  • How to change advertising and promotion strategies; and
  • How to develop personalization.

But while every firm needs this knowledge, each will acquire it in a different way based on its own philosophy and values. Ultimately you'll create differentiation only after you decide which activities are unique and critical toward achieving your firm's objectives and how their impact will be measured. (See Sidebar, "Who Is Your Customer?" for some basic examples.)

WHO IS YOUR CUSTOMER?
DEFINING USER CRITERIA

  • If Jeff bought an item during the last year, we would call him a customer.
    But what if he bought something eight years ago?

  • Rick made a purchase three months ago, returned the item, and received a refund. Does that make him a customer?

  • What about Tiffany, who purchased something on the same day as Rick, never returned the merchandise, and asked for a refund?

  • And let's not forget about Doug, who made his first installment payment
    on a treadmill and was never heard from again.

  • Then there are Tom and Al, both new customers, who just received their first mail-order shipment, are late with their payments, and have already ordered another item from the company's Web site.

    Although each of the definitions shown in the sidebar has its merits, it is also clear that the outcome of any future analysis that is based on them will differ. The simplest solution would be to decide on one definition. But what if you change your mind six months later? The crux of the matter is that knowing individual customer identities is insufficient; knowledge about their behavior is what counts -- we want to define their activities.

    In the sidebar examples, each customer carried out or omitted an activity whose nature can be captured. You can store this information in an order-status field and design different customer handling plans based on order status. In a multichannel environment, you would perform the additional step of combining the order status information from different channels and assign an overall customer status score that would drive customer contact strategies.
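
One way to picture the order-status field and the combined customer status score is the minimal sketch below. The status codes, the weights, and the function name are all hypothetical; they illustrate the idea of scoring activities rather than any particular firm's scheme.

```python
from enum import Enum


class OrderStatus(Enum):
    """Hypothetical order-status codes covering the sidebar cases."""
    PAID = "paid"
    ORDERED = "ordered"
    RETURNED_REFUNDED = "returned_refunded"
    PAYMENT_LATE = "payment_late"
    REFUND_WITHOUT_RETURN = "refund_without_return"
    PAYMENT_LAPSED = "payment_lapsed"


# Illustrative weights only; a real scheme would reflect the firm's own policies.
STATUS_WEIGHT = {
    OrderStatus.PAID: 2,
    OrderStatus.ORDERED: 1,
    OrderStatus.RETURNED_REFUNDED: 0,
    OrderStatus.PAYMENT_LATE: -1,
    OrderStatus.REFUND_WITHOUT_RETURN: -2,
    OrderStatus.PAYMENT_LAPSED: -3,
}


def customer_status_score(orders_by_channel):
    """Combine order statuses from every channel into one overall score."""
    return sum(
        STATUS_WEIGHT[status]
        for statuses in orders_by_channel.values()
        for status in statuses
    )


# Tom: late on a mail-order payment, with a new order placed on the Web site.
tom = {"mail": [OrderStatus.PAYMENT_LATE], "web": [OrderStatus.ORDERED]}
# Doug: one installment paid by mail order, then silence.
doug = {"mail": [OrderStatus.PAYMENT_LAPSED]}

print(customer_status_score(tom))   # 0
print(customer_status_score(doug))  # -3
```

Because the score is derived from stored activities rather than from a fixed definition of "customer," the weights can be changed six months later without recollecting any data.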

    Summarizing data. After you have identified activities and defined corresponding data elements, the next step is deciding on the level of summarization. This is the key to data reduction because here you decide on data groupings. These groupings are based on objectives or activities. For example:

  • User motivation. If a Web site is primarily goal-directed, such as a search site, summarization at the visit level will be important. But if the goal of the site is to measure enduring involvement, cumulative measurement is more appropriate.

  • User segment. A cataloger might summarize data on four levels depending on customer status. (See sidebar, "Summarizing by User Segment.")

  • Content. Web site publishers that rely on revenue from advertising and links are aggregators. They collect users with a common interest and try to move them toward vendors that want to sell their products to them. To be successful these sites have to summarize visitors by content, length of exposure to specific ads, links clicked, and so on.

    Regardless of the summarization method, you must maintain consistency among measurements. If you track revenues at the individual customer level, you should store costs there as well.
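
The difference between visit-level and cumulative summarization can be sketched as follows. The record layout (user, visit, page category, seconds on page) and the sample values are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical click records: (user_id, visit_id, page_category, seconds_on_page).
clicks = [
    ("u1", "v1", "search", 12),
    ("u1", "v1", "product", 40),
    ("u1", "v2", "product", 25),
    ("u2", "v3", "search", 8),
]


def summarize(records, level):
    """Roll click records up to the 'visit' level or the cumulative 'user' level."""
    summary = defaultdict(lambda: {"pages": 0, "seconds": 0})
    for user_id, visit_id, _category, seconds in records:
        key = (user_id, visit_id) if level == "visit" else user_id
        summary[key]["pages"] += 1
        summary[key]["seconds"] += seconds
    return dict(summary)


by_visit = summarize(clicks, "visit")
by_user = summarize(clicks, "user")

print(by_visit[("u1", "v1")])  # {'pages': 2, 'seconds': 52}
print(by_user["u1"])           # {'pages': 3, 'seconds': 77}
```

Four raw records shrink to three visit records or two user records; on a real site the reduction is what makes storage and later analysis manageable.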

    Converting data. You want to extract the essence of Web site activities, not the clutter. In the context of collecting Web site data, this goal implies gathering only what is critical in an easy-to-analyze format that is compatible with other data sets.

    For instance, instead of storing actual text values based on form selections, you could collect indicator variables that represent significant form content categories. In its simplest structure, the form selection CAR, SUV, USED, on an automotive Web site might be assigned a 1, while CAR, SUV, NEW, a 2. Another way to achieve similar but more detailed results would be to assign each individual form selection its own indicator. The first selection could then be 111 and the second 112. Each digit represents an individual selection.

    Why would you want to do that? Text values are hard to analyze; converting them into numerical indicator values makes the data easier to work with. Ideally, indicators would be in the form of binaries; they are the easiest to integrate into statistical procedures. Binaries are simple on/off switches that indicate whether a certain condition was met: A value of 1 means that the condition holds and 0 that it does not. Indicators are often used to customize content or flag customers that have started a critical action such as loading a shopping cart. The dilemma of defining data at this binary level is a proliferation of variables and consequently greater storage needs.
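
The two encodings described above, one digit per form selection and one binary per possible value, might look like this in practice. The code tables are hypothetical, and the TRUCK and SEDAN options are invented here only so that each field has more than one value.

```python
# Hypothetical code tables, one per form field, matching the 111 / 112 example.
FIELD_CODES = [
    {"CAR": 1, "TRUCK": 2},   # first field (TRUCK is a made-up extra option)
    {"SUV": 1, "SEDAN": 2},   # second field (SEDAN is a made-up extra option)
    {"USED": 1, "NEW": 2},    # third field
]


def composite_indicator(selections):
    """Encode form selections as one digit per field, e.g. 111."""
    digits = [str(codes[value]) for codes, value in zip(FIELD_CODES, selections)]
    return int("".join(digits))


def binary_indicators(selections):
    """Expand the same selections into 0/1 flags, one per possible value."""
    return {
        name: int(name == value)
        for codes, value in zip(FIELD_CODES, selections)
        for name in codes
    }


print(composite_indicator(["CAR", "SUV", "USED"]))  # 111
print(composite_indicator(["CAR", "SUV", "NEW"]))   # 112
print(binary_indicators(["CAR", "SUV", "NEW"]))
# {'CAR': 1, 'TRUCK': 0, 'SUV': 1, 'SEDAN': 0, 'USED': 0, 'NEW': 1}
```

The last line shows the proliferation problem in miniature: three form fields become six binary variables.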

    You can use similar conventions with navigational data. Instead of collecting entire URL addresses, information could be maintained by page type or content category. This approach reduces processing time and storage demands, and it makes significant pathways easier to determine.
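
A minimal sketch of storing navigation by page category instead of by full URL follows. The prefix-to-category table and the example.com addresses are assumptions; a real site would derive the categories from its own content plan.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from URL path prefixes to page categories.
PAGE_CATEGORIES = [
    ("/cgi-bin/", "application"),
    ("/Content/WhitePaper/", "white_paper"),
    ("/products/", "product"),
]


def page_category(url):
    """Reduce a full URL to the category of the page it points at."""
    path = urlsplit(url).path
    for prefix, category in PAGE_CATEGORIES:
        if path.startswith(prefix):
            return category
    return "other"


visit = [
    "http://www.example.com/products/widget.html?ref=home",
    "http://www.example.com/Content/WhitePaper/widget.html",
    "http://www.example.com/about.html",
]

print([page_category(url) for url in visit])
# ['product', 'white_paper', 'other']
```

A pathway expressed as a short sequence of categories is far easier to count and compare across visitors than a sequence of raw addresses.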

    Clearly, the three tasks I've outlined here are far beyond what you can accomplish with a browser and Web server alone. You don't just want to format text; you want to detect interaction -- which means you need tools that tap the communication process.

    Web Site Communication

    Basic interaction on the Internet involves four steps (see Figure 1):
    1. Users create text strings either by entering data directly in a text box, or by selecting options from HTML forms.
    2. The text string is transmitted via the GET or POST methods of the HTTP protocol to the Web server.
    3. The Web server accepts the incoming text and processes it. In its simplest form, a page request, the server finds the requested file and returns it to the browser.
    4. The browser displays the results.

    However, communication is not always that simple. For example, Web sites that accept orders over the Internet have to accomplish a variety of jobs such as approve credit, check inventory, and collect order information. To accomplish these tasks, the server has to communicate with back-end resources; it needs a communication method and language.

    Communication Method and Language

    Web servers communicate with back-end applications through the Common Gateway Interface (CGI), the Web's equivalent of the phone, fax, email, or mail in human interaction. CGI is not a language; it is a communication method. And, just as in human interaction, the communication method is language independent. Phones don't care what language they transmit, and CGIs don't care what programming language is used to transmit instructions.

    Essentially, a CGI request runs a program that parses incoming HTTP GET and POST requests and processes the content. If a back-end system such as a database is accessed, the program submits instructions to it and waits for a response. It then develops its own response and sends information back to the browser. (See Figure 2.)
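
The sequence just described (parse the request, call the back end, build a response) can be sketched as below. REQUEST_METHOD, QUERY_STRING, and CONTENT_LENGTH are the standard CGI environment variables; everything else, including the lookup_inventory stand-in and its stock figure, is hypothetical, and the request is simulated rather than served by a real Web server.

```python
import io
from urllib.parse import parse_qs


def read_form(environ, body_stream):
    """Parse GET or POST form data the way a CGI program receives it."""
    if environ.get("REQUEST_METHOD", "GET") == "POST":
        # POST data arrives on standard input; this sketch assumes ASCII text.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = body_stream.read(length)
    else:
        raw = environ.get("QUERY_STRING", "")
    return {name: values[0] for name, values in parse_qs(raw).items()}


def lookup_inventory(model):
    """Stand-in for the back-end call; a real program would query a database."""
    return {"widget": 7}.get(model, 0)


def handle(environ, body_stream):
    """Parse the request, consult the back end, and build the response."""
    form = read_form(environ, body_stream)
    model = form.get("model", "unknown")
    in_stock = lookup_inventory(model)
    return f"Content-Type: text/plain\r\n\r\n{model}: {in_stock} in stock\n"


# Simulated GET request; a Web server would supply these values itself.
request = {"REQUEST_METHOD": "GET", "QUERY_STRING": "model=widget&condition=new"}
print(handle(request, io.StringIO()))
```

Note that nothing in this exchange reaches the server log except the name of the program that was called.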

    In Web server logs, CGI requests replace the entries for predefined HTML pages. The log entry for a specific page such as GET Content/WhitePaper/widget.html is replaced by an entry for a CGI, GET /cgi-bin/widget.cgi.

    The first log entry, a page request, clearly shows what the user asked for and received. You can go back to the HTML page and see what's in it. The second log entry, the CGI request, reveals neither. All you know is that the user had a dialog with your Web site, but you have no idea what the user requested, the application it was requested from, or what the user actually saw.

    SUMMARIZING BY USER SEGMENT
    FOUR WAYS TO CATALOG BY CUSTOMER STATUS

  • For actual customers -- users who purchased something -- multiple records representing customer activities or visits are kept at the transaction level.

  • Individuals that have bought nothing, but identified themselves by name and can be traced through cookies, are also maintained at the transaction level.

  • Data from individuals that can be traced through cookies, but who have neither purchased anything nor identified themselves, is kept in single records summarizing number of visits, visit duration, and page categories viewed.

  • All other users are treated as one group and summarized at predefined time intervals to understand overall behavioral and navigational patterns.
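
The four levels in this sidebar amount to a simple decision rule, sketched here under the assumption that three yes/no facts are known about each user. The function name and the level labels are illustrative.

```python
def storage_level(has_purchased, identified_by_name, has_cookie):
    """Pick a summarization level from a user's status, following the four cases."""
    if has_purchased:
        return "transaction"
    if identified_by_name and has_cookie:
        return "transaction"
    if has_cookie:
        return "single_summary_record"
    return "group_summary_by_interval"


print(storage_level(True, True, True))     # transaction
print(storage_level(False, True, True))    # transaction
print(storage_level(False, False, True))   # single_summary_record
print(storage_level(False, False, False))  # group_summary_by_interval
```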

    Making Visible the Invisible

    Obviously, you cannot manage Web sites without knowing what users do or see. The only way you can expose what is hidden in CGI is by tapping directly into the interaction stream. Luckily, this is fairly easy: It entails nothing more than including (in the same program that calls on the back-end applications) instructions on what data to grab and where to store it.
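
A minimal sketch of such a tap appears below. It uses an in-memory SQLite table as a stand-in for the clickstream store; the table layout, the indicator codes, and the search_inventory placeholder are all assumptions made for illustration.

```python
import sqlite3
from urllib.parse import parse_qs

# In-memory store standing in for the clickstream database (hypothetical schema).
clickstream = sqlite3.connect(":memory:")
clickstream.execute(
    "CREATE TABLE interaction (visitor_id TEXT, page_type TEXT, condition_code INTEGER)"
)

CONDITION_CODES = {"USED": 1, "NEW": 2}  # illustrative indicator values


def search_inventory(form):
    """Placeholder for the back-end query the program already performs."""
    return ["result page for " + form.get("model", "?")]


def handle_search(visitor_id, query_string):
    """Serve the search and, in the same program, record what was asked for."""
    form = {name: values[0] for name, values in parse_qs(query_string).items()}
    results = search_inventory(form)
    # The tap: store only the converted indicator, which the server log never sees.
    code = CONDITION_CODES.get(form.get("condition", "").upper(), 0)
    clickstream.execute(
        "INSERT INTO interaction VALUES (?, ?, ?)", (visitor_id, "search", code)
    )
    return results


handle_search("cookie-123", "model=suv&condition=new")
print(clickstream.execute("SELECT * FROM interaction").fetchall())
# [('cookie-123', 'search', 2)]
```

Only the page type and a single indicator are stored, not the raw query string, which keeps the record compact.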

    The beauty of this process is that you are collecting data at the point of interaction. You can go back to Steps 1, 2, and 3 of your conceptual design and apply the criteria developed there to help you create the instructions that extract only the data you want. This method has two disadvantages: by not keeping the raw data, you cannot verify or restructure the data at a later point, and the additional processing might slow down the Web site. The more data that you collect and manipulate, the greater the potential negative impact on Web site performance. Thus, decisions about Web site architecture and data development must go hand in hand.

    Figure 3 illustrates how a very basic database configuration for sites that accept orders and collect user interaction might look:

    • Database A accepts and returns inventory-related information.
    • Database B identifies customers and, based on historical information, returns customized content.
    • Database C acts primarily as a container. It accepts clickstream and content summarization from nonidentifiable customers.

    From a technical and analytic perspective, having separate databases perform different types of work makes sense. Each database can be tuned and optimized for its specific purpose. But you have to be careful when you design such a separation of duties that the data distribution over many databases still makes sense. It is important to maintain a coherent segmentation.

    The Next Step

    Bad information creates bad decisions while no information creates speculation. It's easy to confuse having volumes of data with volumes of information. Unfortunately, establishing a data collection framework and the ability to boast about high quality data does not make you smarter. But what it does do, which was formerly missing, is create confidence in the data's completeness, accuracy, and relevance. This approach removes speculation (no more guessing) from the equation, minimizing the number of poor decisions based on bad data.

    When data integrity is established, the next step is investing in analysis. Usually the first two things companies want to know are how they can convert prospects into customers and how to create customer loyalty. Both objectives require analysis whose outcome enables a firm to develop gradual customer qualifications resulting in specific actions. In a traditional direct marketing environment, these actions are primarily related to contact strategies; in a Web site context, they also include personalized content presentation. This is where you start reaping the benefit of Web site data definition efforts. Developing personalized Web content or personalized messages requires the integration of financial, marketing, operational, and statistical information. It means knowing what customers have seen and done, not just during the current visit but in a historical context, online and offline. Personalization is based on segmentation, pattern recognition, and the detection of behavioral shifts. Without a complete customer view, you are left guessing instead of personalizing -- a shaky approach at best.

    See Clearly

    It is easy to confuse having volumes of data with quality of information. But just because data sources, movement, and storage have increased doesn't mean we are any smarter. Acquiring knowledge requires a framework, which in turn provides the context for all data collection and consequent analytic efforts. Without such a framework, you may have mountains of data, but you won't understand their value.



    Thomas Richebacher (thomas.richebacher@eds.com) is an analyst for the customer relationship management service line at Electronic Data Systems (EDS), a global services company. He specializes in the development of customer intelligence through data analysis.
