CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise



June 26, 2000, Volume 3 - Number 10


How vendors are implementing the proposed OLAP extensions to SQL-99

The Extra Mile


Richard Winter                

When it comes to business intelligence (BI), not much is more fundamental than analysis. Is there a correlation between frequency of Web site visits and customer profitability? Do our best-selling home office products contribute more to profitability than our best-selling consumer electronics products? Are the customers who spend the most on high-fashion products getting older?

Such questions would seem to be the very essence of database management, at least as applied to BI, and yet they have been devilishly difficult to express in SQL. And, if difficult to express, they have been even harder to optimize.

As I reported recently in “SQL-99’s New OLAP Functions” (January 20, 2000), help has begun to arrive in the form of SQL language extensions designed specifically to help with analytic applications. In that column, I reviewed the key data concepts and language constructs involved. In this column, I will discuss function, application, and the response of vendors whose products generate SQL. In general, as I have delved into the subject of analytic extensions to SQL, I have come to feel even more strongly that this advance is extremely important with respect to database scalability.

Table 1 lists the new SQL functions, using their names in the proposed standard. The principal statistical functions are correlation, covariance, cumulative distribution, ranking, percentile, linear regression, standard deviation, and variance. In addition, the same language amendment includes new and powerful numeric functions: ceiling, floor, natural logarithm, square root, exponent, and power.

Before I go on to what’s new, I want to set the record straight on a few points. First, the OLAP functions are still in the process of becoming a standard, as Fred Zemke of Oracle explained to me. Fred should know: He has done extensive work on the SQL standards committee, including all the actual drafting of the proposals and specifications.

Second, before IBM and Oracle got together and defined the approach that is now on the table at ISO, some similar SQL capabilities were on the market in the form of vendor-specific extensions to SQL. Red Brick Systems Inc. introduced several of the key capabilities as early as 1993, and they are in widespread use across the Red Brick customer base. Oracle was offering key elements of the same capability as part of its Time Series Option. NCR Corp. came out with a powerfully implemented set of analytic extensions to SQL, functionally close to what is in the standard, in a highly scalable implementation in Teradata V2R3, which became generally available in January 1999.

Third, these functions are not about OLAP as most people in our industry think of it today, as Amir Netz, architect and development manager at Microsoft, points out. That is, they are not about such concepts as multidimensional data analysis, hypercubes, and defined hierarchies. Rather, the concepts are about statistical analysis of data using such concepts as rank, moving average, percentiles, correlation, and linear regression. As such, perhaps they might be better dubbed “analytic functions”—a term that Oracle already uses in many of its internal documents.

Field Experience

Donna Hott, a customer advocate in product development at Informix, provided interesting examples of how customers are actually using the Red Brick analytic functions in production applications.

In the frequent shopper program for a retail grocery chain, the store used a query to identify the products purchased most frequently in the past week with coupons by the 20 percent of shoppers who spent the most money in the store. In this example, SQL’s PERCENTILE and RANK functions were embedded in the query, resulting in huge numbers of customer and transaction rows in the analysis, but only a few product rows actually delivered to the user. The grocery chain used the resulting analysis to determine whether recent promotions were successful in bringing in the shoppers who spent the most money.

According to Hamid Pirahesh, manager of the Database Technology Institute at IBM, DB2 customers have been enthusiastic about the rownumber and ranking functions. The ability to get the top n or bottom n rows that result from ranking is extremely valuable in practice.

I found this interesting, because DB2 previously had a “fetch n rows” function, which, when combined in a request with ORDER BY, provided a simple ranking capability. According to Pirahesh, however, customers want to use ranking in more complex requests. For example, it is often important to do the ranking within subqueries, and the query did not support the “fetch n rows” function in this case. Queries that rank the data by multiple criteria are also important. For example, you might want to know the top 10 products in each country, the top 10 products in each continent or region of the world, and the top 10 products in the world. This type of query has given rise to new concepts such as “rank within peers” and “rank within parent”—which DB2 7.1, due out in mid-2000, will support.

Relationship to Outboard Tools and Applications

Ash Jhaveri, program manager at MicroStrategy Inc., provides an interesting example of the scalability benefit associated with these functions. A common need in customer relationship management (CRM) is grouping customers into “bands” (for example, ntiles) and then applying the appropriate action to each band. For example, a supplier may need to group their clients into fifths (quantiles) according to purchase volume over the past year. The supplier would make one offer to the top 20 percent, a different offer to the next 20 percent, and so on.

Now, suppose you wanted to know what fraction of the second 20 percent of clients have $500 or more in available credit. In the past, the program would have extracted customer rows of interest (if not all of the customer rows) from the database and moved across the network to an outboard analytic engine. Today, using the new PERCENTILE function in SQL, we can identify the customers of interest and then perform calculations on them in the same statement. We can now perform the calculation in parallel, optimized within the SQL engine—moving only the result across the network to the client.

This ability reduces the time and work by a factor that is proportional to the customer table’s size—a factor that could range up into the tens of millions. Similar reductions on large transaction tables may result in data reduction factors of a billion or more.

MicroStrategy 7, due out shortly, supports SQL’s new functions and will generate code to exploit its presence in DB2, Oracle, and Teradata. MicroStrategy 7 will continue to support Red Brick, whereby the administrator can define “metrics” (roughly speaking, macros) that employ the Red Brick version of the new SQL functions, in which end users can access metrics by name.

According to Bud Endross, product manager for Express Server, Oracle Express will also exploit the new analytic functions in SQL. In fact, the Express Server development group participated in the specification of the function with exactly such use in mind. Endross remarked, “It is our general goal to have Oracle be an analytic platform. Whether customers execute certain analytical functions in SQL or within the OLAP server is really a matter of their choice. We’re simply pleased that they do it in Oracle.”

TABLE 1 OLAP functions proposed for addition to SQL-99.
CEILING
CORR
COVAR_POP
COVAR_SAMP
CUME_DIST
DENSE_RANK
EXP
FLOOR
LN
MOVING_AVG
MOVING_SUM
PERCENTILE_CONT
PERCENT_RANK
POWER
RANGE
RANK
REGR_AVG
REGR_AVGX
REGR_AVGY
REGR_COUNT
REGR_INTERCEPT
REGR_R2
 REGR_SLOPE
REGR_SXX
REGR_SXY
REGR_SYY
ROW_NUMBER
SQRT
STDDEV_POP
STDDEV_SAMP
VAR_POP
VAR_SAMP


NCR has introduced a suite of applications and tools that support its new “active warehousing” concept, including Relationship Optimizer—a CRM product. According to Stephen Brobst, chief technical officer for data warehousing at NCR, Relationship Optimizer and the related offerings comprise a new approach to CRM where “the capabilities for market segmentation and analysis of customer behaviors are integrated with predictive modelling” to better figure out who to target for promotions. It involves such techniques as quantiling, ranking, and analysis of frequency distributions—the very stuff of the new SQL and OLAP functions.

Your Mileage May Vary

A word to the wise: The database vendors are all moving in the direction of supporting more BI functions and performance. But, make no mistake, vendors will vary tremendously in their approaches to implementing all of these analytic capabilities.

To give one example, when a database engine calculates the results of a statistical function—say, grouping data into deciles—it makes an enormous difference whether we perform the operation in parallel start to finish, whether the resulting data must all be brought back to a single node for output at the end (because that one node becomes a bottleneck when you have a large volume of data), or whether we calculate the functions in a single pass of the data. And different products will implement these tactics to different degrees.

According to Todd Walter, chief technical officer for the Teradata DBMS, Teradata is parallel all the way. That is, the program does not bring the data back to a single node for output at the end, either in the deciles example or in a wide variety of other situations.

How well we integrate the new function with existing capability is also crucial. According to George Lumpkin, a product manger at Oracle, integration was a particular focus in the Oracle8i R2 implementation. For example, the program supports query rewrite for materialized views for queries containing analytic functions and for materialized views defined via analytic functions. It also supports CUBE and ROLLUP in conjunction with analytic functions.

Capabilities Are Real

I have used the new analytic functions in Teradata V2R3 (released early 1999), Oracle8i R2 (released early 2000), and DB2 6.2 (released mid-1999). Oracle8i R2 implements most or all of the standards. DB2 6.2 supports a subset of the standard; it is followed by a fuller implementation in DB2 7.1, due for release in mid-2000. Teradata V2R3 predates the standard; Teradata V2R4, due out in mid-2000, will comply with the standard, as well as supporting the original syntax NCR developed before the standard was available. I have not used the analytic functions in Red Brick, but they have been available for years and there is ample evidence of their production use at many Red Brick sites.

Summing Up

Given the rising importance of BI, CRM, supply chain management, and other analytic applications, the proposed Amendment 1 to SQL-99 is one of the most important database language developments in recent years. And, given the increasing volume of data, especially in e-business, the significance of the analytic functions—and the quality of their implementation—is greatly magnified.

Now, for the first time, a database engine can calculate a correlation, linear regression, or moving average function without ever moving the data outside the database. When a billion transactions are involved, that’s a big deal. Users of larger databases—especially those with a terabyte of data or more—will need to pay careful attention to how the functions they implement perform in their applications. Database products will vary significantly in this regard, and outboard products will vary in the extent to which they exploit the functions.

In general, I think users will be pleased with the extent of the statistical function delivered in the proposed Amendment 1 to SQL-99. And whether they end up calling them “OLAP functions” or “analytic functions,” I think a lot more valuable analysis is going to happen. Because we store and collect our growing volumes of data to gain insight from it, that increased analytic capability is going to be a good thing all around.

Richard Winter (Richard.Winter@wintercorp.com, fax: 617-338-4499) is a specialist in large database technology and implementation, and president of Waltham, Mass.-based Winter Corp. (www.wintercorp.com).

RESOURCES

 

Microsoft: www.microsoft.com
MicroStrategy: www.microstrategy.com
National Committee for Information Technology Standards (NCITS): www.ncits.org
NCR: www.ncr.com
Oracle: www.oracle.com
Red Brick Systems Inc./Informix: www.informix.com

 





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address