Guide to the TechWeb Network

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Whitepapers
Subscribe
Home


November 1998, Volume 1 Number 2

In my last column, I described a set of brave new requirements data warehouse implementers and owners face.

Coping with the Brave New Requirements


Ralph Kimball                


In my last column, I described a set of brave new requirements data warehouse implementers and owners face. None of these requirements is unfamiliar or unexpected, but they have been gathering force and coherence for several years, and it was appropriate to step back to take a snapshot of the data warehouse state of the art. Glancing back at the list of all the requirements together reveals how much our perspectives have changed over the last 10 years.

Data Marts and Dimensional Modeling

The data warehousing industry is beginning to respond implicitly to all 11 of the requirements. This response does not imply, however, that the industry is going in 11 independent directions. Almost all the requirements are affected by two broad data warehousing themes: data marts and dimensional modeling. If you look at these two themes together, all the requirements I outlined can be addressed simultaneously.

To see how to do this, let's extract two key principles to guide our data warehouse designs:

1. Separate your architectures. Draw a very clear line between the back room and the front room. The back room is where you extract, transform, and load the data, and the front room is where you make the data available for presentation.

2. Build your presentation-oriented data marts around dimensional models, not ER models.

Figure 1 shows a typical horizontal cut through an overall data warehouse environment. At the extreme left you see the traditional transaction-oriented legacy systems. The data warehouse responsibility starts when you extract data from the legacy systems and enter it into the data staging area.


 

Figure 1. The enterprise data warehouse, showing the data staging area and the data presentation area.


 

The data staging area is the complete back-room operation for the data warehouse in which you clean, prune, combine, sort, look up, add keys, remove duplicates, assemble households, archive, and export. The data arriving in the data staging area is frequently dirty, malformed, and in a flat-file format. If you're lucky, the data arrives in pristine third normal form, but that's rare. When you're done cleaning and restructuring the data, you can leave it in flat file form or store it in third normal form. The data staging area is dominated by flat files, simple sorting, and sequential processing. Third normal form relational representations are wonderful, but they're mostly the final result of a lot of hard work done in nonrelational formats.

The key architectural requirement for the data staging area is that it is off limits to the end users. The data staging area is like the back room of a car repair shop. Customers are not allowed in the back room; it isn't safe and the shop doesn't have insurance to cover an injury occurring to a customer in the back room. The mechanics are busy fixing the cars and do not want to be diverted to serving the customers directly. In the same way, the data staging area of the warehouse must be off limits to all forms of end user inquiry. We must not distract ourselves by having to provide availability of data, indexes, aggregations, time series, synchronous integration across subject areas, and especially user-level security.

The presentation area is the complete front-room operation for the data warehouse. As its name implies, the presentation area is on stage and available at all times for end-user inquiry. All forms of end-user inquiry are serviced by the presentation area including ad hoc querying, drilling down, reporting, and data mining. We don't talk about drilling down into an atomic data store located in the back room if that atomic data store is just another name for the data staging area.

The presentation area is broken into subject areas, which are called data marts. Each data mart is organized entirely around effective presenting, which, in my opinion, means dimensional models. Presenting encompasses all inquiry and analysis activities including ad hoc querying, report generation, high-end analysis tools, and data mining. All the dimensional models in all the data marts look somewhat similar, and this suite of dimensional models must share the key dimensions of the enterprise. We call these the conformed dimensions.

Plugging the Data Marts into the Data Warehouse Bus Architecture

Figure 2 shows how to attach several independent data marts together with conformed dimensions -- a consistently defined set of dimensions that all data marts that wish to refer to these common entities must use. Conformed dimensions typically include such things as calendar (time), customer, product, location, and organization. We also have to consider conformed facts, which involve any measure that exists in more than one data mart. Perhaps revenue, for example, is defined in several data marts. To conform several instances of revenue, we must insist that the technical definitions of each instance be the same so that separate revenues can be compared and added. If two versions of revenue cannot be conformed, they must be labeled differently so users won't compare or add versions.


 

Figure 2. The dta warehouse bus architecture, showing a series of independent data marts connecting to the conformed dimensions of the enterprise.


 

Imagine that the data warehouse bus is like the bus in your computer. The bus in your computer is a standard interface specification that lets you plug in a CD-ROM, a disk drive, or any number of special purpose cards. Because of the bus standard, these peripherals work together smoothly even though they're manufactured at different times by different vendors.

In the same way, the data warehouse bus is a standard that allows separate data marts to be implemented by different groups in your enterprise at different times. By adhering to the standard (conformed dimensions), the separate data marts can be plugged together. They won't smoke, and they can even share data usefully in a drill-across report because the row headers on the report will mean the same thing across each of the data marts.

The data warehouse bus architecture combines the two key concepts (data marts and dimensional models) I mentioned earlier. You separate the overall architecture of the data warehouse cleanly so the staging area and the presentation area perform distinctly different functions. The bus architecture, which defines the data presentation area, relies on the predictable similarity of the dimensional models to let the enterprise hook all the data marts together. This approach is a framework for addressing all the brave new requirements in the data warehouse industry.

Given the overall architecture and this new vocabulary, let me comment briefly on each of the brave new requirements.

Decentralized, incremental development. The discipline of defining conformed dimensions and conformed facts before embarking on separate data mart projects is the secret to decentralized and incremental development.

Anticipation of continuous change as business needs and available data sources evolve. The similarity of all dimensional schemas lets us anticipate the effects of unexpected changes. When users start asking questions that are geographically focused rather than focused on product category, nothing much changes in the dimensional data mart. Location and product are simply two dimensions connected symmetrically and equivalently to most of the same fact tables. Queries that constrain and group by location have the same form as queries that constrain and group by product. Query tools and query strategies don't have to be reprogrammed when the user community begins asking new kinds of questions. New data elements can be added to dimensional schemas in such a way that old applications continue to function without modification. You can add new data elements in this graceful way as new dimensional attributes, new additive facts, and entirely new dimensions.

Rapid deployment. Once the conformed dimensions and conformed facts have been established, separate data mart teams can proceed independently of each other. In many cases, it makes sense to build the first tables in each data mart as dimensional images of single underlying sources. The data warehouse bus architecture provides a formula combining these single source fact tables into higher-level combinations involving multiple sources. The charm of building single-source fact tables is that this is the fastest possible path to a partial deployment of data warehouse data.

Seamless drill down to the lowest possible atomic data. Drilling down is nothing more than adding a row header to an existing report. In a conformed dimension architecture, these row headers are known to be available in the dimensions and will have a consistent meaning as we descend from more aggregated fact tables to less aggregated ones. Remember that the most atomic data is the most naturally dimensional data, because most single-valued attributes exist for each fact table record at this level.

The parts (data marts) adding up to the whole (data warehouse). This is a direct consequence of defining the data marts as complete logical subsets of the data warehouse. We have added substance to this definition by showing the structure of each data mart (dimensional) and how to connect them together (the bus architecture).

The parts (data marts) implemented on diverse, incompatible technologies. We can relax and let the data marts be incompatible at the lowest hardware and software levels because we don't insist on making the hardware and software talk to each other directly. By performing separate queries to each data mart (using so-called multipass SQL and its equivalent for OLAP databases), we simply combine the answer sets in a higher-level applications layer. This approach also has the significant benefit that the separate queries avoid a host of complex logical problems associated with trying to join fact tables with different cardinalities.

2437 availability. By thinking clearly about the separation of the back room from the front room, we see that the requirement for 2437 availability refers to the data presentation area. The first step to achieving 2437 availability is to implement the data staging area on a separate machine or on a separate process from the data presentation area. The final output of the data staging area is a set of load files for the data presentation area. However, loading and indexing these files into the final presentation database may be a lengthy process that takes the presentation database offline. To avoid going offline for lengthy periods, you can use a file renaming strategy. Each morning's database load goes into a "temp" file that starts as a complete copy of the normal presentation database. When the temp file has been loaded and indexed, the system goes offline for a few seconds while the current production database table is renamed, and the temp file is named as the production database table. This scenario is important to the data warehouse daily cycle, and it gets more intricate in the presence of partitioned tables and aggregates.

Publishing data warehouse results everywhere, preferably over the Internet. Although individual IS data architects and some industry consultants are not completely convinced that simple queries and dimensional approaches are required, the tools vendors have moved heavily toward these approaches because these approaches work. Additionally, the tools vendors are nearly all providing Web-enabled user interfaces. In some senses, most data warehouse owners will find themselves presenting their data warehouses over intranets or even the Internet, whether they plan for it or not.

Securing the data warehouse results everywhere, especially over the Internet. The downside of using the ubiquitous Internet transport medium is the exposure to security problems. The data warehouse owner is especially vulnerable because of the sensitivity of much of the underlying data, and because, ironically, of the success of the warehouse in publishing the data effectively to all the end users.

Near instantaneous response to all requests. Improving response times for end-user queries includes disciplining ourselves to use simple, predictable database structures, augmenting the use of database indexes with database aggregations, and using multipass SQL instead of monolithic complex SQL. All these approaches use the dimensional model approach heavily, and there is a growing body of experience and technology in these areas based on dimensional assumptions. It is interesting that (as far as I know) there seems to be no serious scholarly argument for meeting the objectives of very fast response times with entity-relation data models and traditional cost-based optimizers. No one has proposed a general framework for aggregate navigation in the entity-relation world, either.

Ease of use, especially for computer nonenthusiasts. This final requirement completes the circle. The original design motivation for fact tables and dimension tables in the dimensional approach was pioneered by General Mills and Dartmouth College in the 1960s and later commercialized by A.C. Nielsen in the 1970s. Their concern, as I understand it, was first and foremost to represent the data in an understandable way for the benefit of end users. As early as 1980, Nielsen figured out that the way to tie separate data sources together was by using "conformed dimensions."

The 11 brave new requirements for the data warehouse yield a mixture of bad news and good news. The bad news is that the requirements are aggressive and demanding, and they don't fit easily into traditional design techniques and management approaches. The good news is that with a data mart and dimensional design approach, we can make serious progress against all the requirements.



Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at www.ralphkimball.com.





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







techweb
Online Communities TechWebInformationWeekLight ReadingIntelligent EnterprisebMightyNetwork ComputingDark ReadingDigital LibraryWall Street & Technology
Byte & SwitchNo JitterInternet EvolutionLight Reading's Cable Digital NewsContentinopleUnStrungBank Systems & TechnologyAdvanced TradingInsurance & Technology
Face-to-Face Events
InteropWeb 2.0 ExpoWeb 2.0 SummitVoiceConBlack HatCSISoftwareEntrprise 2.0 ConferenceGTEC
Mobile Business Expo
InformationWeek 500 ConferenceBuy Side Trading XchangeBuy Side Trading SummitBank Executive SummitInsurance Executive SummitTelcoTVEthernet ExpoOptical Expo
Magazines  
InformationWeekWall Street & TechnologyInsurance & TechnologyBank Systems & TechnologyAdvanced TradingMSDNTechNetSmart EnterpriseThe Architecture JournalDatabase Magazine
 
Research & Analyst Services  
Heavy ReadingInformationWeek ReportsInformationWeek Analytics