http://www.intelligententerprise.com/010216/feat3_1.jhtml

Look Before You Leap

Prototyping your data warehouse can save your team time and money

By Vijay Saradhi & Martin Simoneau

Missing data, gaps in data, lost time, and mounting costs are just a few of the pitfalls that you may encounter while building a data warehouse or data mart. Despite the challenges (well-documented in popular books such as Ralph Kimball's The Data Warehouse Life Cycle Toolkit), a prototype can be a great tool in the development of any data warehouse if you properly execute it, carefully define objectives, and manage user expectations.

As we'll explain, a data warehousing prototype could focus on anything from a feasibility study to a micro model of the project's data extract-transform-load (ETL) process.

Why Prototype?

Implementing simple prototyping techniques is a sensible way to mitigate the adverse effects of unforeseen problems. The development cycle of a typical data warehouse or data mart includes various stages such as data requirement analysis, modeling, extraction, transformation, loading, report development, and testing. Because of the significant time lapse between the requirement analysis phase and final testing of the data warehouse, extensive gaps in data often go undetected until the beta-testing phase of the data warehouse. Discovering missing data at a later stage of the project often comes with a significant price tag in terms of money and lost time.

Prototypes illustrate the power and capabilities of data warehousing to the user community. They are especially helpful to companies that are attempting to build their first data warehouse. For example, a medium-sized financial institution could have as many as 5,000 or more data elements that are used throughout their enterprise. Unless your company is a startup, it is very likely that your source data is buried in legacy systems, with little or no documentation. Added demand for integration of electronic-business data (e-data) and traditional-business data (t-data) only makes these data warehousing efforts excessively complex, intimidating, and impractical to execute. With these conditions and challenging deadlines, even the most experienced data warehousing professionals are likely to miss addressing some of the critical functional and data issues in their logical and physical designs.

What kind of prototyping can be done in the warehousing and webhousing worlds?

Seven Classic Prototyping Scenarios

Make enterprisewide data architecture decisions. In 2001, the vision of most corporate CIOs appears to be threefold; namely: have an enterprise data warehouse, achieve enterprisewide application integration, and customize customer applications to achieve one-to-one personalization. One way to achieve this dream is to take a top-down approach; build everything centrally and feed the individual applications. The problem with this approach? It is a never-ending project.

The other common choice is to take a bottom-up approach: Build individual applications in different departments and try to integrate them in the end. Unfortunately, integrating these indigenously built applications often requires significant reengineering of one or all of the applications involved. Instead of waiting for each application development to finish and hoping that you will eventually be able to integrate all the applications, it is better to proactively assess the effort involved in the eventual integration. Clearly in this case, prototypes are an extremely helpful tool in assessing the complexities and streamlining the design of your project.

Evaluate the software. As an IT professional, you need to purchase several software tools from an ever-increasing pool of software vendors. Your software needs can span a wide variety of choices such as query tools, distribution tools, process automation and scheduling tools, ETL tools, administration tools, application servers, RDBMS servers, security administration tools, and so on.

Every vendor claims that its particular piece of software works well in a wide variety of hardware and software environments. But the truth of the matter is, each piece of software typically performs well in a particular set of conditions and hasa mediocre performance in other situations or could even pose unexpected problems in some situations. For example, a query tool designed for relational online analytic processing (ROLAP) type databases might not perform so well when the database is designed as a star schema. A multidimensional OLAP (MOLAP) tool might cause some serious network traffic problems when used in corporations connected via limited and shared bandwidth wide-area networks. Prototyping can proactively identify problems and limitations associated with any software.

Evaluate the hardware and its layout architecture. In general, hardware architecture layouts are harder to prototype and far less reliable unless extreme care is taken in simulating realistic testing conditions. However, there are a few situations where prototyping is easy enough to do.

CRITICAL REMINDERS

A data architect's typical problems



Unpredictable Queries

For an effective partitioning strategy, you need know the queries or at least the most frequently used queries. But typically, most of the queries that need to be supported by the warehouse are ad hoc queries, which are unpredictable. How do you solve this chicken and egg problem?

Choice of Partitioning Keys

You may have several candidate keys for partitioning your data. If you choose the wrong one, you might not be able to use parallel query options supplied by your database vendor or your disk access patterns might have hot spots.

Optimal Data Distribution
Suppose you are running a query that accesses data spanning multiple hardware boxes. In those cases, the slowest performing hardware box dictates the query performance. What is the best way to distribute data?

To avoid these problems it is very important to identify the critical queries and optimum hardware architecture well in advance. Prototyping could be your lifesaver.

If you chose Sybase IQ (vertical partitioning) as the RDBMS or designed your database on an SMP machine, you have a little more flexibility in terms of your partitioning design. However, both IQ and SMP are not magic bullets in solving very large data warehouse partitioning problems. Each of them comes with their own set of problems, which are beyond our discussion.

For example, you may be designing the hardware layout for your enterprise data warehouse and your data center is located in one country and corporate users are based in different locations across the world. In this case, where would you place your database and application servers to achieve maximum availability to most of the users with the best possible performance? Perhaps your DBA suggested using replication to distribute data to major nodes and distribute data from the nodes to clients (four-tier architecture). You can test the feasibility of the design by prototyping.

Evaluate the data model and test for data sufficiency. Your data modeler designed the logical model with the ability to support all the user's requirements. After finishing the modeling, she had to revise the model several times for a number of reasons such as unavailability of data, unclean data, and changes in business requirements and project scope. It is dangerous not to check your revised model for data sufficiency (making sure your revised model can still support user needs). Once the number of data elements exceeds a few hundred elements, it is very difficult and far less reliable to visually check the model for data sufficiency. With a prototype handy, you could accomplish this task fairly easily. (More on this approach later.)

Design the ETL process. Here's another typical scenario: the source data is a mainframe MVS file system and DB2, your target database runs on Unix, and you are not using an ETL tool. Your choices are to run your transformation on the mainframe, load the data into a mainframe database such as DB2 and run the transformation process in DB2, FTP your files to your Unix environment and run it there, or load the data to a staging database and run your transformations there. You could also mix and match any of these choices to gain optimum results. A prototype is a very useful tool in determining which alternative ETL designs works best.

Size the database. Some of the most common problems that plague data warehouses are underestimating or overestimating the database size. This often happens because your data model changed after you estimated the size and you didn't reestimate the size. If you are using a tool such as ERwin to do your estimates, you could also err on your estimates if you don't know how the tool estimates the database size.

With the ever-increasing size of enterprise warehouses and webhouses today offset by limitations to any server architecture and RDBMS, accuracy in database sizing becomes more critical. With prototyping you could greatly improve the accuracy of your estimates by including characteristics of real data and measuring (instead of calculating) the data object sizes.

Determine the partitioning strategy. The goal of partitioning is to achieve optimum performance out of your queries by distributing your I/O and taking advantage of all the database and hardware features available to you. You need to keep two issues in mind before deciding on a partitioning strategy: partitioning should serve as many queries as possible, and repartitioning and reorganizing large databases is costly and time-consuming. See the sidebar, "Critical Reminders," for some typical problems that every data architect faces.

Building a Prototype

Here's the scenario: You've finished your logical modeling, and now you want to test it for "data sufficiency"; in other words, you would like to know if you could generate all the reports that you promised to your users. Here the steps involved.

Step 1: Identify the focus of the prototype. As engineers would say, to find out how long a cow falling off a cliff would take to hit the ground, make an assumption ("imagine that the cow is spherical"). By identifying the focus of the problem (time to reach the ground), engineers are able to make assumptions, which reduces the complexity of the problem.

Here your primary focus is data sufficiency. Performance, tools, hardware, and software are not the focus of your prototype.

Step 2: Create database objects using your model and connect to the database using your tool. To test your model, you need a database. Ideally, you should test your model in a development environment that mimics your production data warehouse. However, it is possible that your team is still trying to decide on the target RDBMS or you don't have your development environment ready. In that case, get some space in any database that you could get your hands on. If nothing else is available, you could even use a Microsoft Access database. (If you are using Access, you could use username: admin so you don't need to enter a password.) Then forward-engineer your model and configure your OLAP tool to connect to the database. (But remember that unlike with mainstream databases, if you forward-engineer your model into Access, the referential integrity/ER relationships might not be created. If the relationships are missing, you might have to create them manually.)

In addition, consider the configuration effort required to connect to the database using the OLAP tool. It could be as simple as pointing to the database through an ODBC connection, or as complicated as creating a metadata database and going through a complicated configuration process that could also include installing middleware such as Net8, DB2connect, or DirectConnect (depending on your database).

Step 3: Load a good size sample of real data. Obviously, it is best to work with real data and with a large enough data sample. The required data amount could range from as few as 1,000 records to a month's worth of data, provided you keep in mind the size of your sample set while interpreting your results.

However, depending on the time period in which you are prototyping, it is very possible that you might not have access to large amounts of real data. However, it is well worth your while to put the effort into loading ample representative data into your test database. If you are using Access, it is easy enough to input data using an Excel spreadsheet or create a quick form in a few minutes using Access forms. If you have multiple sources of data, make sure you include all the sources during prototyping.

Step 4: Generate the required reports and run ad hoc queries. Generate all the user-defined reports. Make sure you not only run the SQL queries necessary to generate the reports, but also use your chosen ad hoc query tool to generate the reports. In the case of webhouses, you might even have to formulate questions based on your data (depending on how your data mining tool works).

To determine the probable ad hoc queries, make the prototype available to the users and ask them to run queries against the prototype. Log all the queries that are run against the database. If your prototyping platform is a mainstream database such as Oracle, DB2, Teradata, Informix or Sybase, it is very easy to log all the queries for a particular session or at the database level.

For example, with Oracle, you simply need to run utilities such as sql*trace or utilities that come with "tuning packs." They would easily give you the most frequently run SQL along with their query plans and performance statistics. Also, some of the query/analysis tools or business intelligence tools - as they are referred to now - come with administration tools to monitor your data warehouse. You can use these tools to obtain performance statistics or other information to help you gain some insight.

Step 5: Digesting the results from your prototype. Being able to run all the reports and ad hoc queries using your favorite tool means that your model has sufficient data elements to support all the user required reports and "most probable" ad hoc queries. Even though performance statistics are meaningless at this point, collect them anyway because when they are evaluated in the proper context, they still yield important insights into your warehouse architecture and data flow within your application.

Errors in Interpretation

There are several things that you should not do while interpreting the prototyping results. The following are a few of the important ones:

Don't draw unreasonable conclusions. Don't draw conclusions from aspects of your project that are not part of the focus of your prototype. Consider our earlier example of prototyping using an Access database. If you draw any conclusions about query performance, data quality, or timeline to finish, the ETL programming at best gives you a very rough estimate and at worst gives unreasonable expectations to your users and upper management. If your prototype doesn't include all the sources of data, there is potential for conflict when you try to integrate the disparate data sources later. Sizing estimates based on prototype only minimize the margin of error. They do not eliminate the error(s) completely.

Manage user expectations. It is impressive to see the results from your ad hoc queries appear on your screen in a split-second. But before you show these results to your users, make sure you prep them first or you could be making a tremendous mistake - because whatever you show them becomes the de facto standard. After examining your prototype, if your users are expecting the same query performance from your fact table, which will have 500 million rows, then they are likely to be disappointed despite your best efforts.

Last Word

Prototyping should give you very good insight into the nature of your warehouse. If something goes wrong despite all your efforts, at least nobody can blame you for not doing your homework. We wish you all happy prototyping.



Vijay Saradhi (vsaradhi@aatnet.net) is the chief technology officer for Advanced American Technologies (www.aatnet.net), a consultancy specializing in complex systems design, integration, data warehousing and webhousing.

Martin Simoneau (mcsimoneau@aol. com) is a director of information technology in the financial services industry. He is responsible for the design development and implementation of enterprise decision support systems, specializing in data warehousing solutions.

Return to Article