Intelligent Enterprise

Better Insight for Business Decisions

June 22, 1999, Volume 2 - Number 9

Things to think about on your summer vacation

STIRRING THINGS UP


Ralph Kimball


About once a year I feel the need to write a provocative column. This is one of those columns. I don’t mind if you say, “Kimball is full of horse feathers”; maybe there are a few horse feathers here. But I hope you will say, “Now that’s a different perspective. Maybe there’s a grain of truth in that.” So here are some topics that are on my mind:

The Web challenges the current view of the data warehouse with multiple new requirements. Data warehousing has matured noticeably during the 1990s. We now have a lot of successes, a fair number of failures, and a lot of hard-won, accumulated experience. But the bigger story these days is the Web revolution. The Web revolution represents a fundamental free-fall in communication costs, and every time communication costs drop precipitously, society changes. This phenomenon is happening as we speak.

The data warehouse must play an integral role in the Web revolution as the analysis platform for all the wonderful behavior data coming in from the clickstream, as well as for the many Web sites that rely on the data warehouse to customize and drive the end user’s Web experience in real time. The data warehouse is taking center stage in the Web revolution, and that requires restating and adjusting our data warehouse thinking. This “data Webhouse” must:

• Be designed from the start as a fully distributed system, with many independently developed nodes contributing to the overall whole. In other words, there is no center to the data Webhouse.

• Not be a client/server system, but a Web-enabled system. This means a top-to-bottom redesign. A Web-enabled system delivers its results and exposes its interfaces through remote browsers on the Web. These browser interfaces must use a strong form of security based on two factors, a password and a physical token, to establish connections through virtual private networks to a network directory server. The directory server authenticates and passes the user to an authorization server, which in turn gives the user permission to use data warehouse resources through one or more Web or application servers. The application servers, in turn, connect to various multimedia database engines using private, strong security. The end users never connect to the multimedia database engines directly. The Web-enabled system is thus a six-tier model consisting of the browser and five servers: directory, authorization, Web, application, and database.

• Deal equally well with textual, numeric, graphic, photographic, audio, and video data streams because the Web already supports this mix of media.

• Support atomic-level behavior data to at least the terabyte level in many data marts, especially those containing clickstream data. Many behavioral analyses must, by definition, crawl through the lowest level of data because the analysis constraints preclude summarizing in advance.

• Respond to an end-user request in roughly 10 seconds, regardless of the complexity of the request. I know, I know. This is impossible. But the Web is setting an expectation, and we can’t ignore the tide coming in.

• Include the user interface’s effectiveness as a primary design criterion. The only thing that matters in the data Webhouse is the effective publication of information on the Web. Delays, confusing dialogs, and the lack of the desired choices are all direct failures.
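
As a rough illustration of the six-tier path in the second requirement above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the function names, the in-memory tables standing in for the servers, and the sample user and data. It shows only the order in which a request passes through the tiers, not how any real product implements them.

```python
# Hypothetical in-memory stand-ins for the five server tiers.
USERS = {"alice": {"password": "s3cret", "token": "482913"}}
PERMISSIONS = {"alice": {"sales_mart"}}
DATABASE = {"sales_mart": [("1999-06", 1250.0)]}


def directory_server(user, password, token):
    """Tier 2: two-factor authentication (password plus physical token code)."""
    record = USERS.get(user)
    if record is None or record["password"] != password or record["token"] != token:
        raise PermissionError("authentication failed")
    return user


def authorization_server(user, resource):
    """Tier 3: decide whether an authenticated user may use a resource."""
    if resource not in PERMISSIONS.get(user, set()):
        raise PermissionError("not authorized for " + resource)
    return resource


def database_engine(resource):
    """Tier 6: called only by the application server, never by the browser."""
    return DATABASE[resource]


def application_server(resource):
    """Tier 5: the only tier that talks to the database engine."""
    return database_engine(resource)


def web_server(resource):
    """Tier 4: formats application results for delivery to the browser."""
    rows = application_server(resource)
    return [f"{period}: {amount:.2f}" for period, amount in rows]


def browser_request(user, password, token, resource):
    """Tier 1: the browser's request passes through every tier in order."""
    authenticated = directory_server(user, password, token)
    allowed = authorization_server(authenticated, resource)
    return web_server(allowed)
```

A call such as `browser_request("alice", "s3cret", "482913", "sales_mart")` succeeds only after both the directory and authorization checks pass; a wrong token code raises `PermissionError` before any data is touched.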

The data Webhouse necessitates a well-distributed architecture. Increasingly, Web-enabled data warehouses will be sets of lightweight, flexible data marts implemented on a widely heterogeneous mix of incompatible technologies. We need to take seriously the scientific issues of hooking these data marts together (it can be done), rather than arguing that these independent data marts shouldn’t exist. Welcome to the Web.

Centralized, monolithic designs will become more difficult to pursue in these widely distributed data warehouse environments in the same way that centralized designs are difficult in widely distributed computing environments and networks. The problem is that it is too expensive and time-consuming to plan a fully centralized database, and these idealistically motivated designs are difficult to keep in synch with dynamically changing real-world environments. Because our data Webhouse designs encompass not only our internal operations but also our business partners in the supply chain and even our customers, we simply can’t mandate a fully centralized approach. There is no center.

However, in moving more aggressively to a distributed design approach, we can certainly avoid the old stovepipe argument that distributed systems represent out-of-control separate efforts that can’t be connected. The solution to the stovepipe argument is a flexible framework of common definitions (conformed dimensions and conformed facts) that let us stitch the individual data marts together. Interestingly, the definition, implementation, and replication of the conformed dimensions and facts out to the working data marts need to be centralized only logically, not physically. In my writings, I have extensively described this distributed framework, as first implemented in the early 1980s by A.C. Nielsen, the syndicated data supplier. This column is not the forum for my thundering on about conformed dimensions and facts, but there is a large body of practice using these ideas, which can be found in my own books and other writings.
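
To make the stitching idea concrete, here is a minimal Python sketch of two separately built data marts that share one conformed product dimension. The table names, keys, and numbers are all invented for the example, and the merge step is just one plausible way to combine the answer sets, not a description of any particular tool.

```python
from collections import defaultdict

# Conformed product dimension: one logical definition, replicated to every mart.
# All keys, names, and numbers below are made up for illustration.
PRODUCT_DIM = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
    3: {"name": "Manual", "category": "Books"},
}

# Two independently built data marts whose fact rows carry the same product keys.
SALES_FACTS = [
    {"product_key": 1, "sales_dollars": 100.0},
    {"product_key": 2, "sales_dollars": 250.0},
    {"product_key": 3, "sales_dollars": 40.0},
]
CLICK_FACTS = [
    {"product_key": 1, "page_views": 900},
    {"product_key": 3, "page_views": 300},
    {"product_key": 3, "page_views": 200},
]


def summarize(facts, measure, attribute):
    """Query one mart on its own: total a measure by a dimension attribute."""
    totals = defaultdict(float)
    for row in facts:
        label = PRODUCT_DIM[row["product_key"]][attribute]
        totals[label] += row[measure]
    return dict(totals)


def drill_across(attribute):
    """Merge the two separate answer sets on the shared row label."""
    sales = summarize(SALES_FACTS, "sales_dollars", attribute)
    views = summarize(CLICK_FACTS, "page_views", attribute)
    labels = sorted(set(sales) | set(views))
    return {
        label: {
            "sales_dollars": sales.get(label, 0.0),
            "page_views": views.get(label, 0.0),
        }
        for label in labels
    }
```

Because both marts label their rows from the same dimension, `drill_across("category")` can line up sales dollars and page views by category even though neither mart knows about the other.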

The biggest ERP vendors need to follow data warehousing rather than take it over. Some of the biggest ERP vendors have declared that they are the data warehouse, probably because they sense the importance of making decisions using enterprise information and want to control that part of the market. However, their data warehousing instincts so far have been counterproductive. The scope of the data warehouse will always be larger than any ERP system. The data warehouse (and now the data Webhouse) is a publishing platform for information arriving from many sources and many directions. The data Webhouse has to be a comfortable home for data from all these different places. The data Webhouse must be distributed, flexible, fast, and end user-oriented. In my opinion, the biggest ERP vendors have not taken these requirements seriously.

If you are a data warehouse project manager and you have been told to use your ERP system as the centralized data warehouse because of the significant investment you have made in that ERP system, you have my sympathies. I think you should extract data from the ERP system, just like any other data source, and present that data as one or more data marts that participate effectively with other conformed data marts representing non-ERP sources.

The user interfaces of most data warehousing tools have been designed by vendors’ software developers who haven’t implemented enough data warehouses. My favorite professional activity is sitting next to end users and watching them use a computer. For me, this goes way back to my roots. My graduate thesis in the late ’60s was a large LISP program that tutored calculus students on a time-shared PDP-10 with a graphics terminal. This program learned superior problem-solving strategies by watching the students. At Xerox PARC, I spent 10 years helping design the user interface for the Star Workstation. Along the way, we set up a laboratory for watching end users struggle with our computers. Since my Xerox days, I have frequently encountered end users who are overwhelmed and nonplussed by unnecessarily complex user interfaces. But all too often there has been no effective, tight feedback loop to get the vendors to change their products. Field-support people working for the software vendor collect user suggestions that are transmitted occasionally back through sales channels to product marketing. Product marketing negotiates once or twice a year with development to include features in the next software release. The feature suggestions and usability enhancements have to compete with new tool development, which headquarters executives usually drive. Individual feature suggestions are often rejected because they are too small on their own to seem worthwhile. Or, worse, they are interpreted as requests by “unsophisticated” users who should have read the manual.

In my opinion, this is all going to change. The Web is an unforgiving crucible that measures user effectiveness directly. The clickstream supplies the evidence in a way that we can’t avoid. We see every gesture the user makes, and some of the gestures aren’t pretty. Users arrived at the page and left in 10 seconds. If they didn’t click on the page, they didn’t see what they needed. If they left the page before it finished painting, it was too slow. If they don’t come back to the page, they can’t use it. The Web is finally making usability important.
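
The clickstream tests in the paragraph above can be written down almost literally. The following Python sketch assumes a hypothetical `PageVisit` record with four fields; real clickstream logs would need their own parsing and session logic to produce such a record, and the signal wording is only illustrative.

```python
from dataclasses import dataclass


@dataclass
class PageVisit:
    """Hypothetical summary of one user's visit to one page."""
    seconds_on_page: float
    seconds_to_paint: float
    clicks: int
    returned_later: bool


def usability_signals(visit):
    """Return the warning signs that this visit's clickstream reveals."""
    signals = []
    if visit.seconds_on_page < visit.seconds_to_paint:
        signals.append("too slow: left before the page finished painting")
    if visit.clicks == 0:
        signals.append("no click: did not see what they needed")
    if not visit.returned_later:
        signals.append("no return: could not use the page")
    return signals
```

For example, `usability_signals(PageVisit(2.0, 5.0, 0, False))` reports all three problems, while a visit with clicks, enough time on the page, and a later return reports none.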

Entity-relationship (E-R) models are neither unique nor complete. In my opinion, the data warehouse community still has not sorted out the right places to use various forms of modeling. In my next column, I am going to comment on a range of modeling and architecture arguments circulating in the data warehouse marketplace. But, in the spirit of stirring things up this month, I want to focus on one specific issue that fascinates me.

E-R models that drive redundancy out of data sets are a wonderful benefit for transaction processing. E-R models that serve as targets for cleaned data are useful because they are a goal. But this goal becomes attainable only after the cleaning steps are finished. In my recent thinking about E-R models, I have come to suspect that they are neither unique nor complete. Given a set of data entities — describing, for instance, all the employees and organizations in a complex enterprise — there may be no unique E-R model that describes all the relationships among the data. There are many simultaneously overlapping and alternative, many-to-many and many-to-one relationships, and these can be represented in more than one way — or at least I suspect that this is true. If so, then a given E-R model is only one chosen interpretation of the data. It is not a unique description.

More seriously, a given E-R model makes no claim to be complete; it makes no claim to wring out all the relationships among the data and show them on the diagram. In some sense, an E-R model is only the set of things the modeler happens to think about or discover, or is willing to document.

Finally, an E-R model almost always is an ideal model, not a real model. Has anyone ever done a full E-R model on dirty data? In other words, has anyone modeled the dirt as an objective of the design? I think a full model of dirty data done this way would break the diagram into tiny useless pieces. But one person’s dirt may be another person’s priceless truth.

There is too little critical thinking in the data warehousing field. Authors and speakers aren’t judged as critically as they are in other disciplines, and they aren’t held accountable for what they say. Don’t let it be said that I shrink from tough topics. My wife, Julie, was a professional speech and language specialist in a former career, and she now studies very carefully what people (including me) say. In a recent email to a colleague, she made the following remarks:

“Do people really think critically about what they read and hear, or do they blindly accept it? Ultimately I believe that critical thinking by any customer base improves the quality of any marketplace. There is a kind of collective leadership and power associated with [this customer base when] they exercise their critical thinking skills. It would be nice to encourage the IS community to be more demanding of those they listen to.”

Personally, I believe all the assertions in this column. If you want to make a brief, to-the-point remark in response to any of these topics, write a letter to this magazine and my editors may publish it. If you want to kick your ideas around at greater length in a moderated forum, I recommend the data warehousing list at www.datawarehousing.com (select “List Server”). There are more than 1,500 data warehousers on the list, and the thoughtful contributions and the lack of flaming and advertising have impressed me.



Ralph Kimball, Ph.D., co-inventor of the Xerox Star workstation and founder of Red Brick Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at rkimball.com.




