Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Digital Library
Subscribe
Home



Ah, Swede Memories


Celko comes up with a third median and some other ideas while teaching in Sweden

I spent two weeks in Sweden as an invited lecturer at the University of Karlskrona/Ronneby. This school is in the "rust belt" of Sweden, which, like steel towns in the United States, is finding diversification necessary for survival. The city's approach has been to collaborate with the commercial sector to build a university next to a software research center. The school was still under construction during my visit. To give you an idea of how fast it is moving, it will graduate its first doctoral student before gaining accreditation! This is legal — really; it has to do with transfer credits and some other technicalities. But I am sure the poor student will be accused of forging credentials in the future.

I love Sweden. The people all speak English and about a third of the students there had been either exchange students or au pairs in the United States. It's like visiting relatives you don't see often enough.

I had forgotten how much I like teaching. And I've dreamt for years — probably inspired by the beer commercials featuring a bunch of guys partying with the Swedish Bikini Team — of beautiful, adoring Swedish girls kidnapping me. Of course, my fantasy did not involve staying up until 11p.m. every night for a week while jetlagged, helping students with their term projects and homework. My wife says I need to update my fantasies in light of my advanced age. What a killjoy! She does not appreciate how many years of my life I have invested in those fantasies.

The second week was spent helping with the DB-Days Conference. Jimmy Nilsson, the conference chairman, humbly did an excellent job in the face of many hardships. I got two good ideas out of the conference. One is pure theory, so I'll discuss it last, and the other is practical.

Celko's Third Median

One of the other speakers asked me how to compute the median in SQL. I whipped out a diskette and displayed the assorted solutions to this problem featured in the first edition of my book SQL for Smarties (Morgan Kaufmann, 1995), which all used Chris Date's famous Parts table. I had been happy with my two medians, but suddenly I could not let the problem go.

I started by looking at a picture of a numbered line of sorted values and seeing where the median would fall. The weight column partitions the table into three sections: the values less than, equal to, or greater than weight. Now the question is how to define a median in terms of the partitions.

Clearly, the definition of a median means that if the lesser and greater sections are equal, then weight is the median. If there are more greater values than half the size of the table, then that weight cannot be a median. Likewise, if there are more lesser values than half the size of the table, that weight cannot be a median. If lesser + equal = greater, weight is a left-hand median. Similarly, if greater + equal = lesser, weight is a right-hand median. However, if this weight is the median, both lesser and greater have to tally to less than half the table's size. The foregoing statements translate into the following SQL:

SELECT AVG(DISTINCT weight)
FROM (SELECT F1.pno, F1.weight,
SUM(CASE WHEN F2.weight < F1.weight
THEN 1 ELSE 0 END),
SUM(CASE WHEN F2.weight = F1.weight
THEN 1 ELSE 0 END),
SUM(CASE WHEN F2.weight > F1.weight
THEN 1 ELSE 0 END)
FROM Parts AS F1, Parts AS F2
GROUP BY F1.pno, F1.weight)
AS Partitions (pno, weight, lesser, equal, greater)
WHERE lesser = greater
OR (lesser <= (SELECT COUNT(*) FROM Parts)/2.0
AND greater <= (SELECT COUNT(*) FROM Parts)/2.0);

The reason for not expanding the VIEW in the FROM clause into a tabular subquery expression is that the table can be used for other partitions of the table, such as quintiles.

It is also worth noting that you can use either AVG(DISTINCT i) or AVG(i) in the SELECT clause. The AVG(DISTINCT i) will return the usual median when there are two values. This result happens when you have an even number of rows and a partition in the middle, such as (1, 2, 2, 3, 3, 3) which has (2, 3) in the middle and returns 2.5 for the median. The AVG(i) will return the weighted median instead. This result happens when you also factor in the number of appearances made by the two values in the middle of a table that has an even number of rows. The table with (1, 2, 2, 3, 3, 3) would return (2, 2, 3, 3, 3) in the middle, which yields 2.6 for the weighted median. The weighted median is a more accurate description of the data.

I sent this first attempt to Richard Romley, who invented the method of first working with groups when designing a query. Richard was proud that I had finally learned something from him. But, of course, he had to cook my solution. Let me take you through the steps so you can see the reasoning.

Look at the WHERE clause. It could use some algebra and, because it deals with only aggregate functions and scalar subqueries, you could move it into a HAVING clause. Moving things from the WHERE clause into the HAVING clause in a grouped query is important for performance, but it is not always possible.

First, let's do some algebra on the expression in the WHERE clause.

lesser <= (SELECT COUNT(*) FROM Parts)/2.0

Because there is already a lesser, equal, and greater value for every row in the derived table Partitions, and because the sum of lesser, equal, and greater must always be exactly equal to the total number of rows in the Parts table, we can replace the scalar subquery with this expression:

lesser <= (lesser + equal + greater)/2.0

But this is the same as:

2.0 * lesser <= lesser + equal + greater

which becomes:

lesser + lesser - lesser <= equal + greater

which becomes:

lesser <= equal + greater

So the query becomes:

SELECT AVG(DISTINCT weight)
FROM (SELECT F1.pno, F1.weight,
SUM(CASE when F2.weight < F1.weight
THEN 1 ELSE 0 END),
SUM(CASE when F2.weight = F1.weight
THEN 1 ELSE 0 END),
SUM(CASE when F2.weight > F1.weight
THEN 1 ELSE 0 END)
FROM Parts AS F1, Parts AS F2
GROUP BY F1.pno, F1.weight)
AS Partitions (pno, weight, lesser, equal, greater)
WHERE lesser = greater
OR (lesser <= equal + greater
AND greater <= equal + lesser);

Still looking at the WHERE clause, we can rewrite it with DeMorgan's law:

... WHERE lesser = greater
OR (equal >= lesser - greater
AND equal >= greater - lesser)

But this is the same as:

 ... WHERE lesser = greater
OR equal >= ABS(lesser - greater)

But if the first condition is true (lesser = greater), the second must necessarily also be true (that is, equal >= 0), so the first clause is redundant and can be eliminated completely:

WHERE equal >= ABS(lesser - greater)

So much for algebra. Instead of a WHERE clause operating on the columns of the derived table, why not perform the same test as a HAVING clause on the inner query that derives the table Partitions? This strategy eliminates all but one column from the derived table, it will make the query run faster, and it simplifies the query to this:

SELECT AVG(DISTINCT weight)
FROM (SELECT F1.weight
FROM Parts AS F1, Parts AS F2
GROUP BY F1.pno, F1.weight
HAVING SUM(CASE WHEN F2.weight = F1.weight
THEN 1 ELSE 0 END)
>= ABS(SUM(CASE WHEN F2.weight < F1.weight THEN 1
WHEN F2.weight > F1.weight THEN -1
ELSE 0 END)))
AS Partitions;

Neat, huh?

Infinite Databases

The second new idea I got at the conference was pure theory: tables with a countably infinite number of rows. If you are not into mathematics, this might get weird.

A few years ago some of us argued on CompuServe about whether summation and addition are the same. I argued that they are different because addition is a binary operation and summation can be done over a countable infinite number of rows. The countable infinity is also called Aleph Null and it is the cardinality of the integers. This stuff is covered in freshman calculus books under summation of a series.

The basic idea is that a series, as a whole, can converge to a value, but if you stop at any particular number of terms, you will not get that same answer. Infinite sets are not like finite sets.

Think of the hypothetical Turing machine equipped with infinite tape storage and the ability to read, write, and update storage on the tape according to a program. If you put a program and an input tape into that Turing machine, it will either halt or hang in a loop. The halting problem asks if there is some way to write a Universal Turing machine that will take a program and any valid input, crunch around a bit, and finally tell you if the program will halt in a finite number of steps or not. The assumption is that the Universal Turing will always halt with a "yes" or "no" answer for all programs; it never hangs in a loop itself.

But such a Universal Turing machine could not exist. If it were attempted, it would have a program of its own. We could take that universal program and whenever it came to a halt, we could put in a loop. Now put the program through its mirror image; if the Universal program halts, it does not halt and if it does not halt, then it does. This is a paradox that forces us to conclude that some things might be true, but we cannot compute them. (You can also look up Gödel's theorem for another version of this same idea.)

In an infinite database, we could construct a table such as this:

CREATE INFINITE TABLE Turing_Table
(machine TURING NOT NULL,
input INPUT NOT NULL,
step INTEGER NOT NULL);

where the database holds a row for all possible Turing machines, the machine attribute uniquely names or numbers each machine in some way, the input column holds an input string, and step is the number of steps the machine can

execute without halting for that input. Each row is computable; you would simply run the program for (n) number of steps and see what happens.

But if you execute:

 SELECT machine, input
FROM Turing_Table
GROUP BY machine, input
HAVING COUNT(*) < ALEPH_NULL;

you have the impossible Universal Turing machine in a table. Infinite databases are not like finite databases.

Puzzle:

Steve Hudak came up with a version of this problem on the Access Forum on CompuServe. He has a table that records daily stock information and he wants to get a report out of it. The table looks like this:

CREATE TABLE StockDetail
(symbol CHAR(5) NOT NULL,
high_price DECIMAL (12,4) NOT NULL,
low_price DECIMAL (12,4) NOT NULL,
close_price DECIMAL (12,4) NOT NULL,
close_date DATE NOT NULL,
CHECK (low_price <= high_price),
CHECK (close_price BETWEEN low_price AND high_price));

Let's give it some dummy data:

INSERT INTO StockDetail
VALUES ('abc', 23.50, 20.50, 20.50, '1998-07-27');
INSERT INTO StockDetail
VALUES ('abc', 23.50, 20.50, 22.50, '1998-07-29');
INSERT INTO StockDetail
VALUES ('abc', 25.00, 20.50, 24.75, '1998-07-26');
INSERT INTO StockDetail
VALUES ('abc', 30.00, 20.50, 22.50, '1998-07-28');
INSERT INTO StockDetails
VALUES ('pqr', 45.00, 30.00, 42.50, '1998-07-28');
INSERT INTO StockDetail
VALUES ('pqr', 43.50, 40.50, 40.75, '1998-07-27');
INSERT INTO StockDetail
VALUES ('pqr', 43.50, 40.50, 42.50, '1998-07-29');
INSERT INTO StockDetail
VALUES ('pqr', 45.00, 40.50, 44.75, '1998-07-26');
INSERT INTO StockDetail
VALUES ('xyz', 13.50, 10.50, 12.50, '1998-07-29');
INSERT INTO StockDetail
VALUES ('xyz', 13.50, 10.50, 13.50, '1998-07-27');
INSERT INTO StockDetail
VALUES ('xyz', 15.00, 10.50, 14.75, '1998-07-26');
INSERT INTO StockDetail
VALUES ('xyz', 20.00, 10.50, 12.50, '1998-07-28');

He would like to produce a query with:

1.  The stock ticker symbol
2.  The highest price at which each stock sold during the dates this table was kept
3.  The lowest price at which each stock sold during the dates this table was kept
4.  The last closing price on the last date for each stock.

The problem is that the answer to item 4 is based on one row of the table, while the other three are based on each group of stocks.

Answer:

There are several ways to do this. One way is to use a scalar subquery expression to get the last closing price:

SELECT S1.symbol,
MAX(S1.high_price) AS highest_price,
MIN(S1.low_price) AS lowest_price,
(SELECT S2.close_price
FROM StockDetail AS S2
WHERE S1.symbol = S2.symbol
AND S2.close_date = MAX(S1.close_date))
AS last_close_price,
MAX(S1.close_date) AS last_date
FROM StockDetail AS S1
GROUP BY S1.symbol;

Another approach is to use a CASE expression to convert anything but the last date into a NULL and then discard the NULLs with a MAX() function:

SELECT S1.symbol,
MAX(S1.high_price) AS highest_price,
MIN(S1.low_price) AS lowest_price,
MAX(CASE WHEN close_date
= (SELECT MAX(S2.close_date)
FROM StockDetail AS S2
WHERE S1.symbol = S2.symbol)
THEN close_price
ELSE NULL END) AS last_close_price,
MAX(S1.close_date) AS last_date
FROM StockDetail AS S1
GROUP BY S1.symbol;

I am not sure that this has any advantage over the first version, since both involve a subquery expression. It is worth mentioning that the following query will not work:

SELECT S1.symbol,
MAX(S1.high_price) AS highest_price,
MIN(S1.low_price) AS lowest_price,
MAX(CASE WHEN close_date = last_date
THEN close_price
ELSE NULL END) AS last_close_price,
MAX(S1.close_date) AS last_date
FROM StockDetail AS S1
GROUP BY S1.symbol;

If you look at the query, you will see that although last_date is defined in the containing query, it still violates the rule that you cannot nest aggregate functions inside aggregate functions.

A third approach is to build two separate queries and join them together:

SELECT X1.symbol,
X1.highest_price,
X1.lowest_price,
S1.close_price AS last_close_price
FROM (SELECT symbol,
MAX(high_price),
MIN(low_price),
MAX(close_date)
FROM StockDetails 
GROUP BY symbol)
AS X1(symbol, highest_price, lowest_price, last_close_date)
INNER JOIN
StockDetails AS S1
ON X1.symbol = S1.symbol
AND X1.last_close_date = S1.close_date;

A much faster solution uses conjugacy. Conjugacy is the mathematical trick of using a function to simplify a calculation, and then the inverse of the function to restore the original data in the final result. Logarithms and exponential functions are one example, as are trig and inverse trig functions. In this case, however, we will be using concatenation and SUBSTRING extraction with the CAST() operator.

SELECT symbol,
MAX(high_price) AS highest_price,
MIN(low_price) AS lowest_price,
MAX(close_date) AS last_close_date,
CAST(
SUBSTRING(
MAX(CAST(close_date AS CHAR(10))
|| CAST close_price AS CHAR(12))
FROM 11) AS DECIMAL (12,4)) AS last_close_price
FROM StockDetails
GROUP BY symbol;

In English, you convert the closing date and price into one string, find the MAX() of those strings, then extract the price and convert it back to a DECIMAL (12,4) representation.



Joe Celko is an Atlanta-based independent consultant. He is the author of three books on SQL -- SQL For Smarties (Morgan-Kaufmann, 1995), SQL Puzzles and Answers (Morgan-Kaufmann, 1995), and Instant SQL Programming (Wrox Press, 1997) -- and wrote the SQL for Smarties column for DBMS magazine. You can contact him via email at www.celko.com or 71062.1056@compuserve.com.





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







InformationWeek Business Technology Network
InformationWeekInformationWeek 500InformationWeek 500 ConferenceInformationWeek AnalyticsInformationWeek CIO
InformationWeek EventsInformationWeek ReportsInformationWeek MagazinebMightyByte and SwitchDark Reading
Digital LibraryIntelligent EnterpriseInternet EvolutionNetwork ComputingNo JitterPlug Into The Cloud
space
Techweb Events Network
InteropVoiceConWeb 2.0 ExpoWeb 2.0 SummitEnterprise 2.0 ConferenceMobile Business ExpoSoftware ConferenceCSI - Computer Security Institute
Black HatGTECEnergy CampMashup CampStartup Camp
space
Light Reading Communications Network
Light ReadingLight Reading EuropeUnstrungLight Reading's Cable Digital NewsConstantinopleInternet EvolutionPyramid Research
Heavy ReadingLight Reading Live!Light Reading InsiderEthernet ExpoOptical ExpoTeleco TVTower Technology Summit
space
Financial Technology Network
Advanced TradingBank Systems & TechnologyInsurance & TechnologyWall Street & TechnologyAccelerating Wall StreetBank Systems & Technology Executive SummitBuyside Trading SummitInsurance & Technology Executive Summit
space
Microsoft Technology Network
MSDN MagazineTechNetThe Architecture Journal
space