|
Few queries and reports perform any calculation fancier than summing or counting, with the occasional ratio thrown in for excitement. Query and reporting tool vendors have done an excellent job of making the simple things easy. They do a nice job of providing functionality that is missing from SQL, such as subtotals and market contributions. Some can perform more complex calculations, but doing so requires extracting data from the database onto a client desktop or middle-tier server.
In a lot of cases, this multitiered design is not a problem. Many companies have good internal network bandwidth, and power analysts seem to prefer having local data sets that they can play with as they wish. Yet to make greatest use of the power analysts' efforts, the output from their work must be fed back into the operations of the enterprise systems. Such closed-loop systems are becoming increasingly common in CRM, webhouses, system operations, and many other applications.
We are led down this path by the predictable choice of SQL as the query and analysis language for the data warehouse. But SQL is not rich and flexible enough to do analysis on its own, so our best analysts use it to pull data out of the warehouse and into their preferred tools. This creates the multitiered architecture.
The OLAP extensions to SQL-99 reduce this problem, but SQL is fundamentally not an analytic language. We need an analytic engine and language that are tightly integrated with the data store, yet robustly designed for scalability and programmability. In this way, we can avoid convoluted programming and the awkward multitiered architecture.
SQL Scrutinized
Can SQL do the job for production reporting applications? Is SQL an analytic language? No, no, and again no. It requires convoluted syntax to perform trivial "analytics," such as market share, moving average, rank, percentile, delta, or standard deviation. SQL, as its name (Structured Query Language) says, is a query language.
SQL-92 is fairly flexible, and can perform more computations than most people realize. As I'll describe later, a SQL expert could write a query that computes market share and moving averages. However, it seems highly unlikely that this is something you'd really choose to do, and you'd have to write the query by hand, as no query tool is going to provide much help. Let's choose a simple example.
As everyone who has looked at an investment Web site must know, moving averages are a common tool in financial analysis. A moving average, very simply, is the average of a measure over a rolling window. For example, to calculate a three-day moving average of a stock price you'd average the price for today, yesterday, and the day before. Standard deviation, a measure of volatility, is also a common measure in investment analysis. As a starting point for our discussion, assume you want to compute the moving average and standard deviation of stock prices over appropriate time intervals.
A fundamental characteristic of a relational database is that the order of rows in a table has no meaning. SQL can order rows on output, but there is no syntax for ordering rows so as to operate on them in a query. To perform a moving average calculation in SQL-92, you'd need to join the fact table to itself three times. A 200-day moving average would require a 200 way self join, which is ridiculous.
Analysts have been forced to abandon SQL, using it only to extract data from the warehouse and feed it into their personal systems.
SQL-99 OLAP Extensions
What about the much anticipated OLAP extensions to SQL-99? They are a genuine improvement to the query language. The primary feature of the OLAP extensions is the WINDOW clause. The WINDOW clause is designed to solve exactly this sort of problem: It's a way to specify in the query that you want to perform an action over the set of rows {today, yesterday, day before}.
Listing 1 shows the basic query drawn from the schema shown in Figure 1 using SQL-99 syntax.
The PARTITION BY clause is similar to a GROUP BY clause. It tells the "window" to start over when it sees a new stock. Without this clause, you'd be intermingling data from different companies. The ORDER BY clause ensures that you include the correct rows in the window. Remember that the relational database is inherently agnostic about the ordering of rows.
The ROWS 2 PRECEDING clause specifies that you want to perform some action on the current row and the two previous rows. A 200-day window, then, would simply replace the ROWS phrase with ROWS 199 PRECEDING. Finally, it's the AVG and STD functions in the SELECT list that indicate what action you want to perform on sets of rows you've grouped together in the window.
The SQL-99 syntax supports multiple windows in a query. Extending this query to compute moving average over one range, and standard deviation over a second range, would be as straightforward as defining a second WINDOW clause.
Just because your database engine implements the SQL-99 OLAP extensions doesn't mean it does so well. All it means to conform to the ANSI syntax standards is that your database engine accepts the syntax and returns correct results. Underneath the covers, the engine might be joining the fact table to itself.
More Advanced Queries
What if you wanted to look at investment returns rather than prices? The desired output is a series of each stock's daily return and standard deviation, computed relative to the beginning of the year. Consider the query in Listing 2, derived from Figure 1, which employs a subquery to get the starting price used to calculate the investment return.
There are several additional common computations that the new SQL-99 OLAP extensions support. The following items are excruciatingly difficult to generate without the WINDOW clause:
- Cumulative totals. It is straightforward to cumulate an additive measure, using the ROWS UNBOUNDED PRECEDING clause.
- Deltas. With some creativity, you could define a query that compares a measure for today with the same measure for yesterday, last week, or last year.
- Market share. You could express this query in a single statement with multiple windows. This is a very common requirement in reporting.
- Rank. The RANK windowing function returns an integer that indicates a rank within a partition. Ranks are typically used as alternative sort orders or filters in a report. In our investment example, you may rank stocks on investment return, volatility, and on trading volume.
It's important to note that neither example illustrated in this column is what I would call a complex question. These are complex queries that answer very simple questions.
|
|
|
|
|
| |||||||||||||||||||||||||||||||























