CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

January 20, 2000 Volume 3 - Number 2


These Key Dimensions Make Customer “Gesture” Behavior Comprehensible

The Special Dimensions Of the Clickstream


Ralph Kimball                

The most exciting new data source in the data webhouse is the clickstream: the river of clicks on our Web sites. The clickstream contains a record for every page request from every visitor to our site. In many ways, we can imagine that the clickstream is a record of every gesture each visitor makes, and we are beginning to realize that these gestures add up to descriptions of behavior we have never been able to see before.

We expect the clickstream to identify successful and unsuccessful sessions on our sites, to distinguish happy visitors and good prospects, and to show which parts of our Web sites are effective at attracting and retaining visitors.

We can wrangle and bring the clickstream data source into a data mart for analysis just like every other data source in our environment. Of course, any time we bring up a data mart, we are very careful to hook that data mart to the conformed dimensions and facts of the overall enterprise. If we do that with the clickstream, then the clickstream will participate gracefully in the overall distributed webhouse.

But the clickstream data in its raw form gives us only some of the dimensions we need for our most powerful analyses. If we aren’t careful, we will be disappointed in our ability to do the kinds of analyses described above. A raw entry in the page event log from the Web server gives us only:

• Date/time of the page request

• IP address and possibly cookie ID of the visitor (if they accept cookies)

• Page object being requested (the whole page or an object on the page)

• Type of request (almost always “Get” or “Post”; a submitted form normally arrives as a “Post”)

• Context from where the page request was made (the so-called referrer)

• Browser version making the request (usually Netscape or Internet Explorer).
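As a rough sketch, the fields listed above could be pulled out of one log line as follows. The layout assumed here is the Apache-style "combined" log format, which is an assumption and not something this column specifies. The sample line, the `parse_page_event` name, and the field names are all hypothetical. Cookie IDs are not part of that layout, so they are left out.

```python
import re
from datetime import datetime

# Assumed layout: Apache-style "combined" log format.
# host ident user [timestamp] "METHOD page protocol" status bytes "referrer" "browser"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+)[^"]*" \d{3} \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"'
)

def parse_page_event(line):
    """Return the raw page-event fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Keep the server's UTC offset so the time can be normalized later.
    event["timestamp"] = datetime.strptime(event["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return event

# Made-up sample line, for illustration only.
sample = (
    '192.0.2.10 - - [19/Oct/1999:09:27:48 +0000] "GET /class.htm HTTP/1.0" '
    '200 5120 "http://www.example.com/index.html" "Mozilla/4.0"'
)
event = parse_page_event(sample)
```

Returning `None` for lines that do not match lets the ETL job count and inspect rejects without stopping the load.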

This data doesn’t tell us very much. We are a long way from inferring behavior just by staring at this bare-bones description of the individual page event. We would like to clean up this low-level data and present it with perhaps the following dimensions:

• Date of the page request

• Time of the page request

• Visitor

• Page object

• Request

• Session type

• Session ID (a degenerate dimension tying all the records of a given session together)

• Referrer

• Product/service.

Let’s look closely at a few of these dimensions that have unique requirements related to the clickstream. The background motivation for the design described here is beyond this article’s scope and can be found in other articles in this series or in my recent book, The Data Webhouse Toolkit (Wiley, 2000).

Date/Time Dimension(s)

The date and time of the page request both need to be expressed relative to a single standard time zone, such as GMT, that does not vary with daylight saving time. The calendar date needs to be its own dimension, and thus will have only a few thousand records at most. The time-of-day dimension has 86,400 records, one for each second of a given day. It lets us attach labels to spans of the day and constrain on arbitrary times of day. You may optionally repeat these two dimensions as separate “roles” to record the Web site’s and/or the visitor’s wall clock time. Note that you must take special care if you are merging page events from the Web logs of separate physical servers, because you then must align the clocks on these servers to within a second or less!
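A minimal sketch of the two dimensions might look like this. It assumes the timestamp already carries its server's UTC offset. The function names, the `day_part` labels and their boundaries, and the sample server offset are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone

def day_part(hour):
    """Hypothetical label for a span of the day; adjust the boundaries as needed."""
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 22:
        return "Evening"
    return "Night"

def build_time_of_day_dimension():
    """One row per second of the day: 24 * 60 * 60 = 86,400 rows."""
    rows = []
    for time_key in range(24 * 60 * 60):
        hour, remainder = divmod(time_key, 3600)
        minute, second = divmod(remainder, 60)
        rows.append({
            "time_key": time_key,
            "hour": hour,
            "minute": minute,
            "second": second,
            "day_part": day_part(hour),
        })
    return rows

def date_and_time_keys(timestamp):
    """Convert a timezone-aware timestamp to a UTC calendar date and a second-of-day key."""
    utc = timestamp.astimezone(timezone.utc)
    return utc.date(), utc.hour * 3600 + utc.minute * 60 + utc.second

# Assumed example: a server logging at seven hours behind UTC.
server_zone = timezone(timedelta(hours=-7))
request = datetime(1999, 10, 19, 9, 27, 29, tzinfo=server_zone)
date_key, time_key = date_and_time_keys(request)
time_rows = build_time_of_day_dimension()
```

Because the time key is simply the number of seconds since midnight, the fact record can join to the time-of-day dimension without a lookup.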

Visitor Dimension

The visitor dimension is challenging because you probably have three kinds of visitors. The first is a huge pool of completely anonymous visitors identified only by their IP addresses. The IP address is only of moderate value because it only identifies an outbound port on the visitor’s Internet service provider. These ports may be dynamically reassigned, so we cannot track such visitors from session to session, or sometimes even within a session. A second and more useful type of visitor is one who has agreed to store a cookie we have provided. This cookie then becomes a reliable identifier for a visitor machine, because we ask to see the cookie on every page request. (We can only look at our own cookie.) With a cookie, we can be pretty sure that a given machine is responsible for a session, and we can determine when the machine visits us again, assuming the user hasn’t deleted the cookie file. Finally, the third and most valuable level of visitor is the human-identified visitor who not only has accepted our cookie but sometime in the past has revealed their name and other information to us. Realistically, we may not be certain that the same human being is sitting at the remote PC, but at least we know that person’s “representative” is there.

This visitor dimension is, of course, huge. We may want to collect visitors of the first, IP-only type into pools defined by the visitor’s domain and subdomain, just to cut down on the mindless proliferation of these visitor records. We then encourage such visitors to accept cookies so we can sort out individual behavior. Maybe some of our pages aren’t accessible without a cookie. For the third kind of visitor, we have to merge the cookie ID with our visitor name and demographic data during the ETL process.
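The three kinds of visitor could be told apart in the ETL flow with something like the sketch below. It assumes the IP address has already been resolved to a host name where possible. The `visitor_key` function, the `registered` lookup table, the returned labels, and the rule of pooling on the last three labels of the host name are all hypothetical.

```python
import ipaddress

def is_ip_address(host):
    """True when the host is still a bare IP address (reverse lookup failed)."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def visitor_key(host, cookie_id=None, registered=None):
    """Return (visitor_kind, identifier) for one page event.

    registered is an assumed lookup from cookie ID to a known visitor ID,
    built from whatever registration data the site has collected.
    """
    registered = registered or {}
    if cookie_id is not None:
        if cookie_id in registered:
            return ("identified", registered[cookie_id])
        return ("cookie", cookie_id)
    if is_ip_address(host):
        return ("anonymous", "unresolved")
    # Pool anonymous visitors by subdomain and domain, e.g. suborg.company.com.
    labels = host.lower().split(".")
    return ("anonymous", ".".join(labels[-3:]))

known = {"c42": "visitor-1001"}  # made-up registration data
print(visitor_key("pc12.suborg.company.com"))
print(visitor_key("pc12.suborg.company.com", "c7", known))
print(visitor_key("pc12.suborg.company.com", "c42", known))
```

Pooling on a fixed number of labels is crude, since country-code domains use an extra label, so a real rule would need a proper list of domain suffixes.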

Page Object Dimension

The page object dimension is one of two dimensions that the Webhouse team must really work on if the clickstream source is going to be useful. The team must describe the page by more than its location in the Web server’s file system. In some cases, the path name to the file is moderately descriptive of the page’s content and purpose, but it is a classic mistake to try to use a file system both for uniquely locating files and for describing their content. Instead, any given page must be associated with a set of textual attributes that describe and classify the page, regardless of where it is stored in the Web server’s file system or how it is generated. The attributes should be drawn from structured lists whose rules the data warehouse team creates, so that the attributes can most usefully drive analyses. For instance, the attributes of a given page could be Type="Introductory Product Information" and Product="Datawhack 9000." Some group needs to take responsibility for assigning these attributes. If the Web page designers understand the importance of clickstream analysis, then they can assign attributes to all the pages. If the Web site team won’t pay attention to the needs of the Webhouse analysis, then the Webhouse team must assign the attributes. This may be a challenge, of course, in really huge Web sites with tens of thousands of pages.

Ideally, the raw clickstream log hands back these page attributes, but the Webhouse team may have to merge these attributes into the clickstream later in the ETL cycle.
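When the attributes have to be merged in during ETL, the merge can be as simple as a lookup keyed on the page path. The sketch below is illustrative only. The paths, the `PAGE_ATTRIBUTES` table, and the `describe_page` function are hypothetical, and the attribute values reuse the Datawhack 9000 example from above.

```python
# Hypothetical attribute table, maintained by whichever team owns page classification.
PAGE_ATTRIBUTES = {
    "/products/datawhack-9000/intro.html": {
        "type": "Introductory Product Information",
        "product": "Datawhack 9000",
    },
    "/index.html": {
        "type": "Home Page",
        "product": "None",
    },
}

# Pages nobody has classified yet get an explicit placeholder row, not a missing value.
UNCLASSIFIED = {"type": "Unclassified", "product": "Unclassified"}

def describe_page(path):
    """Return the descriptive attributes for a page path."""
    return PAGE_ATTRIBUTES.get(path, UNCLASSIFIED)

def attach_page_attributes(events):
    """Add page attributes to each raw event, where an event is a dict with a 'page' key."""
    return [{**event, **describe_page(event["page"])} for event in events]

raw_events = [
    {"page": "/products/datawhack-9000/intro.html"},
    {"page": "/new-page-nobody-tagged.html"},
]
described = attach_page_attributes(raw_events)
```

Counting the events that land on the placeholder row gives the team a running list of pages that still need attributes.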

The object part of the page object description will become much more interesting as extensible markup language (XML)-enabled pages become more widely used. Again, we hope that the Web server logs reveal the page objects’ XML tags.

Session Type

The session type is the other important clickstream dimension that the Webhouse team must really work on. The session type is a high-level diagnosis of the complete session. Plausible types include “Product Ordering,” “Quick Hit and Gone,” and even more interesting diagnoses such as “Unhappy Visitor” or “Recent, Frequent, Intense Return Shopper.” Perhaps we have both local and global session descriptors for parts of complex sessions.

How on earth do we assign these session labels? In this case, we can’t expect the Web server to provide this context. The Webhouse team must figure out the diagnosis in the ETL process. But maybe it’s not as hard as it seems. Here’s a little snippet from an actual Web visitor’s session on my own site, www.ralphkimball.com. I plead guilty to using file names to describe content, but cut me a little slack. I’m the entire IT department at my company, and I have an applications backlog. I have modified the IP address for confidentiality:

suborg.company.com session of 10/19/99:
09:27:29 /index.html
         referrer = AltaVista, search = "Data Warehouse Classes"
09:27:30 /rka.gif
09:27:48 /class.htm
09:27:55 /dwd-class.htm
09:28:37 /register.htm
09:28:50 /dwd-schedule.htm
11:15:12 /index.html
11:15:20 /rka.gif
11:20:55 /startrak.htm

Someone at “Company.com” (a fictitious name) found my site from a search for “Data Warehouse Classes” on AltaVista. So my homepage, index.html, was effective in drawing this qualified visitor to my site. In the first second, the visitor requested my company logo, rka.gif. Nineteen seconds after the initial page hit, the visitor had found a link to my class description page. That’s pretty good. It means the homepage was presented quickly and the navigation choices were clear. After spending only seven seconds on the main class description page, the visitor requested a detailed description of my regular class, Data Warehousing in Depth (dwd-class.htm). The visitor studied this page for 42 seconds and then went to my How-To-Register page, which at that time contained an 800 telephone number. I like this session. After 13 seconds on the How-To-Register page, the visitor went to the class schedule page.

We can’t be sure how long the visitor spent on the class schedule page because the next entries are more than an hour and 45 minutes later. We can only be sure the visitor did something else in the intervening time. In fact, unless we have a cookie identifier with this session, we cannot be completely sure the same person made the last three page requests, although I would guess that this is the case here. If so, then the return session is very significant. It represents a return visitor, or someone who found my Web site useful and maybe even bookmarked it.
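The arithmetic in this walkthrough can be reproduced with a short sketch. The page requests are the ones in the log snippet above. The 30-minute inactivity cutoff, the list of embedded-object suffixes, and the function names are assumptions made for illustration and are not part of the original log or of any standard.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)     # assumed inactivity cutoff
EMBEDDED_SUFFIXES = (".gif", ".jpg")    # objects on a page, not pages themselves

log = [
    ("09:27:29", "/index.html"),
    ("09:27:30", "/rka.gif"),
    ("09:27:48", "/class.htm"),
    ("09:27:55", "/dwd-class.htm"),
    ("09:28:37", "/register.htm"),
    ("09:28:50", "/dwd-schedule.htm"),
    ("11:15:12", "/index.html"),
    ("11:15:20", "/rka.gif"),
    ("11:20:55", "/startrak.htm"),
]

def split_sessions(events, gap=SESSION_GAP):
    """Start a new session whenever the pause since the previous request exceeds gap."""
    sessions = []
    previous = None
    for stamp, path in events:
        when = datetime.strptime(stamp, "%H:%M:%S")
        if previous is None or when - previous > gap:
            sessions.append([])
        sessions[-1].append((when, path))
        previous = when
    return sessions

def dwell_times(session):
    """Seconds spent on each page; None for the last page, where the time is unknowable."""
    pages = [(when, path) for when, path in session
             if not path.endswith(EMBEDDED_SUFFIXES)]
    result = []
    for (when, path), (next_when, _) in zip(pages, pages[1:]):
        result.append((path, int((next_when - when).total_seconds())))
    if pages:
        result.append((pages[-1][1], None))
    return result

sessions = split_sessions(log)
for session in sessions:
    print(dwell_times(session))
```

Leaving the final page's dwell time as `None` keeps the unknown explicit, which matters because the last page of a session is often the most interesting one.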

Building a session diagnosis tool for the ETL process is clearly an interesting challenge. It is a blend of data extraction, pattern recognition, and link analysis. When you look at your own session logs, presented like the one here, you will come up with many ideas for diagnosing sessions. This requirement will turn into a complex and evolving one. Rather than committing at the start to a single major data mining approach, it would be better, in my opinion, to write a few simple heuristic rules in your ETL data flow for diagnosing sessions and then accumulate experience over time with the different kinds of sessions. Then you’ll be in a better position to choose a sophisticated tool to help you diagnose sessions.
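A first pass at such heuristic rules might be no more than the sketch below. The labels “Product Ordering” and “Quick Hit and Gone” are the session types named earlier. The thresholds, the choice of /register.htm as an ordering signal, the “Unclassified” fallback, and the `diagnose_session` function are all assumptions to be tuned against real sessions.

```python
ORDER_PAGES = {"/register.htm"}   # assumption: pages that signal intent to buy
QUICK_HIT_MAX_PAGES = 1           # assumed threshold
QUICK_HIT_MAX_SECONDS = 10        # assumed threshold

def diagnose_session(paths, total_seconds):
    """Apply the rules in priority order; the first rule that matches wins.

    paths is the ordered list of pages requested in the session, and
    total_seconds is the measurable duration of the session.
    """
    if any(path in ORDER_PAGES for path in paths):
        return "Product Ordering"
    if len(paths) <= QUICK_HIT_MAX_PAGES and total_seconds <= QUICK_HIT_MAX_SECONDS:
        return "Quick Hit and Gone"
    # Sessions no rule explains are the ones worth reading by hand.
    return "Unclassified"

first_visit = ["/index.html", "/class.htm", "/dwd-class.htm",
               "/register.htm", "/dwd-schedule.htm"]
print(diagnose_session(first_visit, 81))
print(diagnose_session(["/index.html"], 3))
print(diagnose_session(["/index.html", "/startrak.htm"], 343))
```

Keeping the rules in one ordered function makes it easy to add a rule each time a review of the unclassified sessions turns up a new pattern.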

Provide Page Object And Session Dimensions

The point of this column is to make sure that you put effort into providing page object and session dimensions for your clickstream. Yes, both of these dimensions are a lot of work, but if you leave them out, you can’t tell what pages the user visited, or whether they had a productive session. These dimensions are the keys to analyzing Web behavior, and we will return to these design issues later in response to new developments in the industry. Hang on to your seat. We’re on Internet time now…



Ralph Kimball, Ph.D., co-inventor of the Star Workstation at Xerox and founder of Red Brick Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at www.ralphkimball.com.




