Throughout my career as a data warehouse designer and software developer, many of my most satisfying moments have involved programming intricate data warehouse solutions. In spite of my efforts over the years to act like a manager or as an educator, programming keeps rearing its head and consuming my attention. In the 1960s, before data warehousing, I programmed FORTRAN analysis programs at the San Diego Gas and Electric Co.; in the 70s, I worked on BCPL (a variant of C) at Xerox PARC when I worked on the Star Workstation; in the 80s I worked on Metaphors graphical pipes, a facility known as capsule programming, which I invented at that company. And in the 90s Ive programmed a witchs brew of Visual Basic and SQL in various forms. My Star Tracker Visual Basic tool, distributed along with my original book and through DBMS magazine as a demonstration query tool for data warehouses, has been delivered to more than 160,000 locations. I have probably hand coded 100,000 lines of SQL in various other data warehouse environments. And yet I dont think of myself as a true software developer, and certainly not as a good software developer. Nearly all my programming has been on snippets, modules, or components, ranging from single SQL statements embedded in query tools, to SQL code generators embedded in complex graphical user interface development environments. I have never had the luxury of working as a C coder in a software development shop with a single language development environment. Such a dream environment would be replete with all the bells and whistles including a powerful compiler, automated software librarian, regression testing suite, and release management system. Somehow the data warehouse world just isnt that clean . My instinct tells me that we all have mostly similar experiences. We live with a funny paradox. On one hand, our programming projects are fractured little efforts that never seem as serious as the work our C-programming software vendor colleagues do. But on the other hand, we are certainly building some of the most complicated distributed computer systems on earth. Just to remind ourselves how challenging some of our designs can be, lets list the characteristics of a typical large data warehouse system from a software developers perspective. A data warehouse (or webhouse, if you prefer) is: Profoundly distributed, physically and logically. The warehouse extends from the extract systems, through the transformation and loading systems, into the presentation-oriented data warehouse DBMSs, and finally out to the users desktops through the front-end tools. Even a relatively simple data warehouse environment will have half a dozen separate scripting languages, programming languages, configurable applets, and parameterized user interfaces. Constantly changing. As software developers, we rarely have the luxury of working to a stable design specification from the beginning to the end of a project. In a way, we shouldnt complain about this too visibly. In most cases, if we are besieged by changing business requirements and data sources, at least we know we are close to the end users heartbeat. Required to deliver instantaneous results. (Or at least present the illusion of instantaneous results.) The Web in particular has forced this expectation of performance on us. Our response as software developers is to constantly open the code and change our systems to resolve performance bottlenecks. A faithful mirror of business rules. Along with our transaction processing systems, the data warehouse is expected to be the authoritative, quality-assured repository of all the current and past business rules that allow us to publish data that our end users can interpret meaningfully. In some ways, finding a way to store, read, and write all these business rules is our highest calling as data warehouse professionals. Viewed against these requirements, maybe we shouldnt be apologetic about our software development responsibilities. If anything, we have a harder job than our C-coding product vendor colleagues!
Software Development Tools Meet the WarehouseIn my previous two columns (There Are No Guarantees and Enforcing the Rules), I have talked about the various flavors of data warehouse business rules and about the emergence of the Unified Modeling Language (UML) as a foundation for expressing these rules. Given the software developers perspective I have sketched in this column, it seems clear that we need all the assistance we can get from methodologies and tools our C-coding product vendor colleagues use, assuming that we can apply them to our more complicated task. But in spite of this logic, I am cautious about recommending methodologies and CASE tools from traditional environments for data warehouse software development. Methodologies scare me the most. So often I see a methodology becoming the goal of a project, rather than just being a tool. Recently someone asked me: What is the difference between a methodologist and a terrorist? I answered, At least you can negotiate with a terrorist. Having qualified my recommendations, there are at least six areas in which data warehouse software developers can and should make more use of existing software development tools. These areas are not a single fabric that we can weave together, but rather distinct islands that we need to manage quite independently. Even within each of the areas, local technology choices probably mean even more separate tools. The six areas include: Business rules capture. Powerful CASE tools capable of expressing the UML are quite well-suited to capturing the full range of business rules, including the complex types of business rules I discussed in the previous two columns. However, I believe the most realistic goals for data warehouse business rule capture are communication and documentation. The process of elucidating and understanding business rules can be immensely helpful in the foundation-creating stages of a project, if we combine them with a pragmatic sense of when to stop modeling. We can convert simple business rules into database schemas or executable programs, but we should not expect significant automatic conversion of the most complex business rules into code. These complex business rules may always remain in a documentation-only form, and probably can only be enforced by thoughtful review by a human designer. My favorite CASE tool for expressing the UML and documenting high-level business rules is Microsofts VisioModeler, included with Visio Enterprise 5. VisioModeler uses an object-role modeling approach to go far beyond conventional entity/relationship (E/R) CASE tools, so that relationships between entities as well as their attributes and methods can be captured. To learn more about this product and the others mentioned in this article, go to Google (www.google.com) and type in the name of the product. You will find the vendors product description pages as well as independent critical reviews. Database design. The data warehouse community has used CASE tools fairly successfully for designing relational database schemas. Although the E/R diagrams created in these tools definitely drive the creation and maintenance of DBMS tables, this capability does not really address the business rules of the previous bullet. CASE tools help us sort out column declarations and table key relationships, but not much else. I think of the database design task as mostly separate from the business rule capture task of the previous bullet. My favorite CASE tools for database design include the venerable ERwin from Computer Associates (formerly from Platinum Technology and before that, LogicWorks), and WarehouseArchitect from Sybase. ERwin and WarehouseArchitect both have extensive interfaces not only for building relational databases but also for exporting metadata to OLAP and to the better-known front-end tools from companies such as MicroStrategy Inc., Cognos, Brio Technology Inc., and Business Objects. Traditional software system design. In all the cases in which your IT organization takes on primary application software development in C, Visual Basic, or Java, you are a mainline candidate for the many powerful modeling tools that can drive your whole project from initial planning to final checkout. You can find a very good list of these tools on Google by entering UML Tools in the search specification and then linking to the Category URL at the top of the page. The tool referenced most often in Googles list is Together (see Resources), which provides a full UML diagramming suite, allows corporate and project-specific source code standards to be imposed on a project, adapts to third-party libraries, and has extensive reverse-engineering capabilities for translating existing C, Visual Basic, and Java code into and out of UML specifications. Extract-transform-load (ETL) system design. ETL systems are perfect candidates for the component and object approach because most ETL systems are just repackagings of predictable modules that perform parsing, sorting, joining, cleaning, and reformatting. However, the leading ETL vendors, including Informatica, Sagent, Ardent (now owned by Informix), and ETI have long since abandoned the straight-line code approach in favor of proprietary graphical objects. In all these products, the ETL software components are like little chess pieces that you drag and drop into the ETL applications. The attributes and methods of these objects are controlled through property windows that you open after selecting each graphical object. This approach is identical to the Metaphor Capsule facility I designed at Metaphor in 1983. We used Capsules for all end-user applications development in the more than 400 data warehouses we installed between 1984 and 1988. But I digress. The bottom line on the ETL tools is that the object-oriented approach the ETL vendors take is very successful, but each is proprietary, and generally not well-connected to the mainline UML software development tools. Metadata repository design. Almost every tool Ive mentioned in this column relies on a metadata repository to read and write the business rules and design specifications. Microsoft, IBM, Oracle, and others have all either defined a global metadata repository format or promised the marketplace that they will do so in the future. The world doesnt need multiple repository definitions. We need a single standard. Most analysts believe that Microsofts repository effort is the one most likely to succeed. In 1997, Microsoft turned over the content of its proprietary repository design to the Metadata Coalition, and worked with a group of vendors to define the Open Information Model (OIM), intended to support software development and data warehousing. Microsoft and many other vendors are actively programming their tools to read and write from the Microsoft Repository. Warehouse component communications design. The extremely distributed and heterogeneous nature of our warehouses makes it much more difficult to reliably communicate data and metadata between remote machines in a structured way. Nearly all the players in the applications development, UML, repository, and data warehouse worlds are pinning their hopes on extended markup language (XML) as a universal standard for exchanging data and metadata. While this exchange will probably happen, it will take a while. XML is powerful because it is simple and extensible. XML tags, reminiscent of HTML tags, can be declared and used in any XML document. The documents themselves are simple text. But in order for a group of organizations to use XML as the basis for communication, the organizations must first define a dictionary of XML tags. Intensive tag-definition efforts are taking place in many vertical industries but it may take a year or more before any of these industries are capable of widespread data and metadata sharing. Although Microsoft has again been the most aggressive and active player in building support for XML deeply into its products, XML does have widespread support from the other major data warehouse players, including Oracle and IBM.
Learning MoreThese topics are hard to fit into just one column. There is an immense amount of useful reading on all of these subjects available on the Web. Spend some time with Google searching for all the root words Ive mentioned: business rules, UML, and XML. Combine these root words with vendor names to get an idea of who supports which technologies. And remember to think like a software developer, because thats what you are.
Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including the newly released The Data Webhouse Toolkit (Wiley, 2000). Ralph teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach Ralph through his Web site at www.ralphkimball.com.
| Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|











