Intelligent Enterprise

Better Insight for Business Decisions


October 1998, Volume 1 - Number 1

Brave New Requirements for Data Warehousing


Ralph Kimball


Ten years ago, when we began focusing on data warehousing, we were primarily concerned with defining a data warehouse as something different from an operational system. This view was clear whenever we talked about the requirements for a data warehouse. Ten years ago, we needed to remind everyone that a data warehouse was a centralized, static copy of operational data. We viewed the warehouse as a cross between a historical archive and an archaeological dig. Thou shalt proceed cautiously to assemble a complete data warehouse before releasing anything to the public. Thou shalt not write into the data warehouse.

In the intervening 10 years we’ve learned a lot. We’ve built many successful data warehouses, but we’ve also had some disappointments and failures. Technology has improved enormously. We have spectacularly more powerful computers. We have both OLAP and relational database engines that are devoted to getting the data out rather than putting the data in. We have developed powerful data warehouse modeling skills, especially in the area of dimensional modeling. We now have a “bus” architecture for our data marts that lets us connect the data marts together similar to the way that we connect the components of our personal computers to the “bus” in the computer. As IS consumers, we’ve survived a whole generation of back-end and front-end tools, and we now have more informed opinions about tools and their vendors. We have passed out of the “best of breed” experimentation phase and have come to realize that each of us needs to focus on a small number of end-to-end vendors to keep our data warehouse environments under control.

As businesspeople, we are no longer content to just see the “top line” view of our enterprises, where we barely drill down from the annual report. Instead, we demand to see detailed customer behavior down to the individual ticket line item and the individual button click at the ATM. Fortunately, in most cases, our data warehouse systems are now big enough to store all the ticket line items and button clicks. It’s an interesting chicken-and-egg debate as to whether this demand for atomic detail begat monster data warehouses or the other way around.

At the same time that we’ve been demanding more detailed data, we have also insisted on a broader, more meaningful view. We no longer manage business on volume alone; profitability is now key to our business management. For the data warehouse provider, profitability is much more difficult than volume because profitability almost always requires a fully integrated view of the business, where the costs incurred in all the phases of the business are correctly allocated back to products, customers, geographies, and time periods.

For all these reasons, data warehouse managers and implementers have had a collective allergic reaction to the difficulty of building a monolithic, centralized data warehouse. It just seems too hard and too much work. In many ways, the sheer responsibility and intellectual challenge of taking on the whole enterprise data warehouse has been too much. This allergic reaction has a name: data mart. Somehow we have to cut the data warehouse implementation task down to human proportions.

Looking at all these developments, it becomes clear that the data warehouse game has changed completely from where we started 10 years ago. We really can’t keep modifying the old set of requirements anymore. We need to stop, wipe the slate clean, and articulate a new set of requirements for the modern data warehouse. It’s a pretty daunting set, but before proposing a solution to all of them, let’s try to understand their architectural effects in a little more detail.

Decentralized, incremental development. We’re forced to accede to the reality that departments and divisions are going to create their own mini data warehouses to answer urgent business questions. If we admit that we can’t stop these developments, we need to provide a discipline and a framework for them so the rest of the enterprise can leverage their work. Technically, this means that whatever this common framework is, it must allow an individual data mart team to proceed with its implementation without knowing in detail what the other data mart teams are doing. An individual team must be able to select technology independent of other teams.

Anticipation of continuous change as business needs and available data sources evolve. Our design approach must recognize change as a constant. We want a design approach that has no built-in preferences for the business questions we happen to ask this month versus the business questions we will ask next month. We certainly don’t want to adjust our data schemas if we think of new questions to ask. We do not want to adjust our schemas if we add new descriptors to our basic entities such as customer, product, or location. We want to be able to add new numerical measurements to our data environment and add new dimensions—all without having to modify our database schemas. The requirement that we hold our database schemas constant is extremely important. If we can hold to this goal, then existing applications will continue to work, even after we add the new data to the environment.

Rapid deployment. The requirement to build the parts of the data warehouse rapidly probably mandates the first requirement for decentralized and incremental development. It is hard to imagine a rapid deployment of a centralized, monolithic data warehouse where the whole enterprise data warehouse has to be in place before the data warehouse can be used. Beyond this, rapid deployment also means that the techniques for building the parts of the data warehouse are well understood, predictable, and simple. It would help if all the parts of the data warehouse looked the same and had the same structure. Then we would know how to load these parts, index these parts for performance, select tools to access these parts, and query these parts.

Seamless drill down to the lowest possible atomic data. We know we need atomic data in most of our data marts. We know that our users want to see customer behavior, which is often at the individual user transaction level. We also know that our users want to make precise cuts through the data even if they ask for aggregate behavior. How many people use the ATM between 5 and 6 p.m. at ATM locations near their work but not near their homes? As we descend from aggregated to more atomic data, we want our access methods and query tools to function seamlessly. In a proper dimensional framework, drilling down is nothing more than adding a row header to a request. Everything else stays constant. Above all, drilling down must not mean that you leave behind one user interface and change your mindset and training in order to get more detail.
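To make the "adding a row header" idea concrete, here is a minimal Python sketch using the standard sqlite3 module. The table atm_fact, its columns, and the sample rows are hypothetical, and the table is flattened for brevity; a real dimensional design would keep region and city in a dimension table joined to the facts. The only difference between the summary request and the drilled-down request is one extra row header.

```python
import sqlite3

# Hypothetical row headers that a request is allowed to group by.
ALLOWED_HEADERS = {"region", "city"}


def totals_by(conn, row_headers):
    """Sum withdrawal amounts grouped by the requested row headers."""
    unknown = set(row_headers) - ALLOWED_HEADERS
    if unknown:
        raise ValueError(f"unknown row headers: {sorted(unknown)}")
    cols = ", ".join(row_headers)
    sql = (
        f"SELECT {cols}, SUM(amount) FROM atm_fact "
        f"GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atm_fact (region TEXT, city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO atm_fact VALUES (?, ?, ?)",
    [
        ("West", "Seattle", 100),
        ("West", "Portland", 40),
        ("West", "Seattle", 60),
        ("East", "Boston", 80),
    ],
)

# Same function, same fact, same interface: drilling down only adds "city".
summary = totals_by(conn, ["region"])
detail = totals_by(conn, ["region", "city"])
print(summary)
print(detail)
```

Nothing else about the request changes between the two calls, which is the property the requirement asks the access methods and query tools to preserve.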

The parts (data marts) adding up to the whole (data warehouse). The requirement that the data warehouse is composed of nothing more and nothing less than the sum of the data marts is largely a consequence of the previous requirements. The separate subject areas are going to be implemented in a distributed fashion. Each data mart is going to contain its underlying atomic data. We don’t want to duplicate the numerical measurement data in multiple places around the enterprise; this data is overwhelmingly the largest part of any data mart. The surrounding text-like descriptors (dimensions) are often a tiny fraction of the overall data storage, so they can afford to be replicated in multiple places around the enterprise. This bimodal view of the world is very important in our brave new architecture.

When the data marts add up to the whole data warehouse, they must function together so we can drill across data marts to assemble integrated views of the enterprise. Drilling across, similar to drilling down, has a very specific technical interpretation. In order to get numeric facts from different data marts to line up across the row of a report, the row header that “controls” the line of the report has to be defined in the context of each data mart. More specifically, the row header has to mean the same thing in each data mart. It doesn’t help to have a western zone in one data mart that means something different from the western zone in another data mart. Or maybe there is no western zone in the second data mart. These issues have to be addressed before the data marts are built.
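The row-header condition for drilling across can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each data mart has already answered its own query and returned a mapping from row header value to one numeric fact, and the mart names and figures are made up. When the headers mean the same thing in every mart, the facts line up; when one mart says "West" where another says "Western", the report shows gaps.

```python
def drill_across(*answer_sets):
    """Line up facts from several marts on a shared row header.

    Each answer set maps a row header value to one numeric fact.
    A missing fact is left as None, which exposes mismatched headers.
    """
    merged = {}
    for position, answers in enumerate(answer_sets):
        for header, fact in answers.items():
            row = merged.setdefault(header, [None] * len(answer_sets))
            row[position] = fact
    return merged


# Hypothetical answer sets from two marts that share the same zone definitions.
sales_by_zone = {"Western": 500, "Eastern": 300}
shipments_by_zone = {"Western": 45, "Eastern": 31}
conformed = drill_across(sales_by_zone, shipments_by_zone)

# A third mart that labels its zone differently breaks the report row.
returns_by_zone = {"West": 12, "Eastern": 7}
broken = drill_across(sales_by_zone, returns_by_zone)

print(conformed)
print(broken)
```

The merge itself is trivial; the hard part is agreeing on the row header definitions up front, which is why those issues have to be settled before the marts are built.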

The parts (data marts) implemented on diverse, incompatible technologies. Because our data warehouse may be built from distributed data mart efforts, it is obvious that the various groups will show up at the finish line with different technologies. The hardware will be different and the database engines will be different. Some may be OLAP and some may be ROLAP. The OLAP and ROLAP systems will differ in small details of access methods. In spite of this, we demand there be an overarching framework that allows robust drilling across the separate data marts by an end-user application. Even more aggressively, we would like any end-user application to be able to perform this drill across whether or not the application was designed with drill across in mind. Architecturally, this means there needs to be some kind of consolidating layer between an application and the actual database engines in each data mart. We’re beginning to sniff out the outlines of our architectural solution.
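One way to picture such a consolidating layer is as a thin adapter interface that every data mart implements, whatever engine sits underneath. The Python sketch below is an assumption-laden illustration, not a description of any product: the class names (RelationalMart, CubeMart, ConsolidatingLayer) and the totals_by method are invented here, and the two marts are stand-ins for a ROLAP-style and an OLAP-style system.

```python
from typing import Protocol


class DataMart(Protocol):
    """Hypothetical contract that every mart adapter must satisfy."""

    def totals_by(self, row_header: str) -> dict: ...


class RelationalMart:
    """Stand-in for a ROLAP-style mart that aggregates detail rows on request."""

    def __init__(self, rows):
        self.rows = rows

    def totals_by(self, row_header):
        totals = {}
        for row in self.rows:
            key = row[row_header]
            totals[key] = totals.get(key, 0) + row["value"]
        return totals


class CubeMart:
    """Stand-in for an OLAP-style mart that stores precomputed aggregates."""

    def __init__(self, cube):
        self.cube = cube  # {row_header: {member: total}}

    def totals_by(self, row_header):
        return dict(self.cube[row_header])


class ConsolidatingLayer:
    """Sits between the application and the marts; hides engine differences."""

    def __init__(self, marts):
        self.marts = marts  # {mart name: adapter}

    def report(self, row_header):
        answers = {name: m.totals_by(row_header) for name, m in self.marts.items()}
        headers = sorted(set().union(*answers.values()))
        return [(h, *(answers[name].get(h) for name in self.marts)) for h in headers]


layer = ConsolidatingLayer({
    "sales": RelationalMart([
        {"zone": "Western", "value": 200},
        {"zone": "Western", "value": 300},
        {"zone": "Eastern", "value": 300},
    ]),
    "inventory": CubeMart({"zone": {"Western": 45, "Eastern": 31}}),
})
report = layer.report("zone")
print(report)
```

The application only ever calls the layer, so it can drill across marts without having been written with any particular engine, or with drill across itself, in mind.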

24x7 availability. We can no longer afford to have our data warehouses offline for extended periods of time while we perform back-room cleaning, loading, and indexing chores. Somehow, we need more of a hot-switch approach that lets us take our time with these back-room activities, while at the same time continuing to support access to yesterday’s data. Then we need to switch over to today’s data. The downtime should be measured in seconds. The hot switch also needs to be done so that the drill-across operations described earlier remain consistent. In other words, if we make a change in the definition of something that might be a row header in a report, we must be careful to replicate this change in a synchronous fashion to all the affected data marts.
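A hot switch of this kind can be sketched with a staging table that is loaded at leisure and then renamed into place inside a single transaction. The Python example below uses the standard sqlite3 module against one in-memory database; the table names and figures are hypothetical, and a real deployment would also have to coordinate the switch across every affected data mart, which this sketch does not attempt.

```python
import sqlite3


def total(conn, table):
    """Return the sum of the amount column for the given (trusted) table name."""
    return conn.execute(f"SELECT SUM(amount) FROM {table}").fetchall()[0][0]


def hot_switch(conn, live, staging):
    """Swap the staging table in for the live one as a single transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute(f"ALTER TABLE {live} RENAME TO {live}_old")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {live}")
        conn.execute(f"DROP TABLE {live}_old")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None leaves transaction control to the explicit BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Yesterday's data stays queryable while the back room does its work.
conn.execute("CREATE TABLE sales (zone TEXT, amount INTEGER)")
conn.execute("INSERT INTO sales VALUES ('Western', 500)")

# Back room: clean, load, and index today's data in a separate table.
conn.execute("CREATE TABLE sales_staging (zone TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales_staging VALUES (?, ?)",
    [("Western", 500), ("Western", 75)],
)

before = total(conn, "sales")
hot_switch(conn, "sales", "sales_staging")
after = total(conn, "sales")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(before, after, tables)
```

Queries keep using the name sales throughout; only the brief rename step stands between yesterday's data and today's.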

Publishing data warehouse results everywhere, preferably over the Internet. Our users are mobile, both on a long- and a short-term basis. They will move from building to building, city to city, and country to country demanding the same user interfaces and the same level of service. The same user may log in from an internal network at headquarters, a remote location in the field, or from home—all on the same day. In the last few years, the Internet has grown to provide a ubiquitous transport medium for our communications and data. It’s likely to be much cheaper to use the Internet in all these situations than it would be to provide dedicated or dial-up telephone lines.

The other factor driving us to Internet solutions is the reality that every query and report-writing tool is developing a browser-user interface. Vendors are being forced to do this because everyone wants Web-based deployment.

Securing the data warehouse results everywhere, especially over the Internet. The obvious downside of the ubiquitous Internet solutions is the enormous concern over security. Data warehouse results must be handled securely, or we warehouse managers lose our jobs and our companies get sued. At the very least, the data warehouse must reliably authenticate the identity of the requesting end user and must handle the interactive sessions in an entirely private, highly secure fashion. This security architecture must be built into the design of the distributed data warehouse.

Near instantaneous response to all requests. There’s no such thing as an “acceptable” response time being measured in hours or even minutes. The only truly acceptable response time for a request is instantaneous. Data warehousing as an industry is in the middle of a steep learning curve on which query response times are dropping very rapidly as we learn how to use indexes, aggregations, and new query technology. This rapid set of changes is reminiscent of the rapid changes we experienced in transaction processing performance in the 1980s, but it also means that data warehouse managers must constantly reevaluate the solutions if the response times are not quite instantaneous.

Ease of use, especially for computer nonenthusiasts. The final requirement in our list is actually the most important requirement. End users simply won’t use something that is difficult to use. Or perhaps a tiny subset of technical enthusiasts will use something complicated. Or perhaps we will end up with a priesthood of application developers who are the only true users. In all these scenarios, we’ve missed the potential of expanding data warehousing to the majority of possible end users. Ease of use is more than motherhood. We need end-user interfaces that are recognizable, memorable, high-performance, and based on templates that can be invoked or modified with a single button click.

The 11 brave new requirements I have described here are both exciting and scary at the same time. They are exciting because if we can achieve them all in a data warehouse implementation, we are likely to have a cost-effective solution that really works and will stand up to the test of time as our environments evolve. They are scary because they are unconventional and because there aren’t obvious models in many of our minds for addressing these requirements, especially all at once in a single project. In my next column, I will describe the data warehouse “bus” concept in more detail and show how this concept provides the foundation for addressing all these requirements.



Ralph Kimball, Ph.D., co-inventor of the Star Workstation at Xerox and founder of Red Brick Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at www.ralphkimball.com.




