Puzzle Time
Solutions to puzzles that stem from real-world problems
April 10, 2000
Recently, I have been getting emails complaining that I am not doing enough programming puzzles in this electronic column. In an attempt to balance out the editorial content, this piece will be almost pure code. I chose some puzzles that illustrate some important SQL programming points. First, you should think in terms of processing entire sets, not single rows. Second, you can do a lot of work with function calls. The third point is that you can do incremental program improvements in SQL; SQL is not the "Wright Brothers school of programming. (That is, put it all together and push it off a cliff to see if it flies.)
Try to at least "scratch paper" an answer before you read the suggested answers. Pick up your number two pencils and begin.
Puzzle One
Quinten Martens sent this problem into a Compuserve forum in 1997. The problem involves tracking calls to a service center so that a company can predict its staffing needs. The important table looks like this:
CREATE TABLE AllCalls
(call_id INTEGER NOT NULL PRIMARY KEY,
rtn_id INTEGER NOT NULL REFERENCES rtn_list (rtn_id),
orignum CHAR(10) NOT NULL,
call_date TIMESTAMP NOT NULL,
duration INTEGER NOT NULL DEFAULT 0,
iscomplete INTEGER NOT NULL CHECK(iscomplete IN (0, 1)));
The only columns of interest are rtn_id and calldate. The column rtn (short for routing transit number) is a ten-digit number that we use to identify at which call center the call ended up and what kind of a call it was (for example, technical or billing).
We want a new table containing the columns rtn_id and calldate, so that we can run a crosstab that tells us how many calls we got on all RTNs at any given time of day in half hour increments.
The problem is that the AllCalls table contains a row for every single call, even if the caller gets a busy signal. This is where the iscomplete flag comes in. If the entry is 1, then we know the caller spoke to someone. If we just did my crosstabulation on that data, the results would be totally wrong because callers might have to call multiple times before they get through. That's why we tried to figure out the incidents.
Here's how we want the incident count to work:
- Every caller's first call is an incident, regardless of whether they actually got through.
- Consecutive incomplete calls and the first iscomplete call after the first call are regarded as the same incident.
- After a complete call, a new incident begins.
- If the call is an incident, add the rtn_id and the calldate into the new table.
The logic behind this is that a caller may give up after a few incomplete calls, but we still want to count it as one incident. Obviously it is impossible to get the incidents 100-percent correct because callers might be calling from multiple locations, but this is as close as we can get the count.
The first answers that were proposed for this problem involved using temporary tables to collect and filter data. But you solve the problem much better if you think in terms of sets, rather than procedural processing.
Johannes M. Becher came up with this answer:
"If I understand your table layout correctly, then orignum identifies the caller, that is, consecutive calls from the same caller would have the same orignum. This looks like the only column that can identify the caller, so I am assuming that's what it does.
"I am also assuming that the rtn_id and the calldate in a multiple call incident can be taken from any of the calls. (After all, if none of the calls of an incident get through, which one would you use to take your data from?)
"Now, as to counting incidents, if I rephrase your rules one to four, I can also count incidents as the number of iscomplete calls for a given orignum plus 1 if the last call for that orignum is incomplete.
"For example, these chains of events would be scored as followed:
"Incomplete / complete / incomplete = 2 incidents number of iscomplete calls = 1, last one is incomplete, so count = 2
"Incomplete / complete / incomplete / complete = also 2 incidents number of iscomplete calls = 2, last one is complete, so count is still 2
"Incomplete number of iscomplete calls = 0, last one is incomplete, so count = 1"
This is a bit easier to implement in SQL as it avoids concepts like after that (more on this below):
SELECT MAX(rtn_id), MAX(calldate),
(SELECT COUNT(*)
FROM AllCalls AS A2
WHERE A2.orignum = A1.orignum
AND A2.iscomplete = 1)
+ (CASE WHEN
(SELECT A3.iscomplete
FROM AllCalls AS A3
WHERE A3.orignum = A1.orignum
AND (SELECT COUNT(*)
FROM AllCalls AS A4
WHERE A4.orignum = A3.orignum
AND A4.calldate > A3.calldate) = 0) = 1
THEN 0
ELSE 1 END)
FROM AllCalls AS A1
GROUP BY orignum;
It looks strange, doesn't it? In plain English, the third element of the SELECT clause (the subquery starting on the third line) says "COUNT all calls for the caller in question which are iscomplete. Then [next level down] check if the call for the same caller with no other calls for the same caller with a later TIMESTAMP [that's the last call] is complete; if so, add 0, if not, add 1."
I am not sure about the performance, as I don't have a lot of data to test this. (I made some tests with a dozen cases or so, however, so I know this works.)
Why this strange approach? I tried to avoid problem definitions where the words "next" and "after that" appear, as they imply an ordering of the rows during processing a sure sign of procedural thinking. In my definition, I use the word "last," but that's okay as it can be covered by MAX() or equivalent SQL code, and it does not imply ordering the rows for processing.
You may decide for performance reasons that a stored procedure is better; I just wanted to show you the "set" way to solve this problemI have become a bit "set" in my ways.
Puzzle Two
This one comes from Richard Romley. Imagine that you have to program a change-making machine we will assume the machine uses only bills and not coins.
You have a table named Money that contains the denominations and values of the bills that are available in the machine:
CREATE TABLE Money
(denomination CHAR(25) NOT NULL,
amt INTEGER NOT NULL);
INSERT INTO Money VALUES ('one dollar bills', 1);
INSERT INTO Money VALUES ('five dollar bills', 5);
INSERT INTO Money VALUES ('ten dollar bills', 10);
INSERT INTO Money VALUES ('twenty dollar bills', 20);
INSERT INTO Money VALUES ('fifty dollar bills', 50);
INSERT INTO Money VALUES ('hundred dollar bills', 100);
How would we give change? Let's do a few examples to get the feel of things. To dispense $18.00, you would use a query that would produce the result:
qty denomination
=======================
1 ten dollar bills
1 five dollar bills
3 one dollar bills
To return $46.00, count out
qty denomination
=========================
2 twenty dollar bills
1 five dollar bills
1 one dollar bills
But now assume that we have run out of five dollar bills and fifty dollar bills and have to re-do the change again:
SELECT * FROM Money;
denomination amt
-----------------------------------------------
one dollar bills 1
ten dollar bills 10
twenty dollar bills 20
hundred dollar bills 100
Now if we rerun the same queries we should get:
Return $18
qty denomination
-----------------------------------------------
1 ten dollar bills
8 one dollar bills
Return $46
qty denomination
-----------------------------------------------
2 twenty dollar bills
6 one dollar bills
Answer:
You can do it in a single SELECT statement:
SELECT (CASE M7.amt
WHEN M1.amt THEN :cash
WHEN M2.amt THEN MOD(:cash, M1.amt)
WHEN M3.amt THEN MOD(MOD(:cash, M1.amt), M2.amt)
WHEN M4.amt THEN MOD(MOD(MOD(:cash, M1.amt), M2.amt), M3.amt)
WHEN M5.amt THEN MOD(MOD(MOD
(MOD(:cash, M1.amt), M2.amt),M3.amt), M4.amt)
WHEN M6.amt THEN MOD(MOD(MOD(MOD(MOD(:cash, M1.amt),
M2.amt), M3.amt), M4.amt), M5.amt)
END / M7.amt) AS qty, M7.denomination
FROM (SELECT MAX(M1.amt), MAX(M2.amt), MAX(M3.amt),
MAX(M4.amt), MAX(M5.amt), MAX(M6.amt)
FROM Money AS M1
LEFT JOIN Money AS M2 ON M1.amt > M2.amt
LEFT JOIN Money AS M3 ON M2.amt > M3.amt
LEFT JOIN Money AS M4 ON M3.am) > M4.amt
LEFT JOIN Money AS M5 ON M4.amt > M5.amt
LEFT JOIN Money AS M6 ON M5.amt > M6.amt
WHERE :cash >= M1.amt)
AS X(M1.amt, M2.amt, M3.amt, M4.amt, M5.amt, M6.amt)
INNER JOIN
Money AS M7
ON M7.amt IN (M1.amt,
CASE WHEN MOD(:amt, M1.amt) >= M2.amt
THEN M2.amt END,
CASE WHEN MOD(:amt, M1.amt), M2.amt >= M3.amt
THEN M3.amt END,
CASE WHEN MOD(:amt, M1.amt), M2.amt),
M3.amt >= M4.amt
THEN M4.amt END,
CASE WHEN MOD(:amt, M1.amt), M2.amt), M3.amt),
M4.amt >= M5.amt
THEN M5.amt END,
CASE WHEN MOD(:amt, M1.amt), M2.amt),
M3.amt), M4.amt),
M5.amt >= M6.amt
THEN M6.amt END)
ORDER BY M7.amt DESC;
You get pure function calls without even a correlated sub-query. While this is a bit extreme, you should consider this basic approach because using function calls and CASE expressions can often save you from having to use multiple statements, cursors, and other slower coding techniques.
Puzzle Three
There is a classic programming puzzle that goes under the name of "the Maximum Rectangle Problem," but it has several versions. In the April 1999 issue of Dr. Dobbs Journal ("Designing Algorithms Incrementally"), Udi Manber gave a detailed solution to this problem. The point of the article was to show how simple improvements to a basic algorithm could improve performance. We can do that here, too.
Manbers version of the problem started with a binary array of size M by N. The task is to find the largest rectangle (subarray) made up of ones inside the array.
Let's start with this table to hold the array Grid[i,j]:
CREATE TABLE Grid
(cell INTEGER NOT NULL
CONSTRAINT Binary
CHECK (cell IN (0,1)),
i INTEGER NOT NULL,
j INTEGER NOT NULL,
PRIMARY KEY (i,j));
Now we can define a rectangle as being bound by an upper left cell and a lower right cell as the corners of a rectangle. This query will give you all the possible rectangles, and it leaves it up to you to pick the maximum.
SELECT UL.i, UL.j,
LR.i, LR.j,
((LR.i - LU.i + 1)
* (LR.j - UL.j - 1)) AS area
FROM Grid AS UL, Grid AS LR
WHERE UL.i <= LR.i
AND UL.j <= LR.j;
How can you extend this query to make it work right? Hint: It helps to draw a picture or to read the original article.
Answer:
The first problem is that this query just finds rectangles, not rectangles with only ones inside it. There are several ways to add this condition, but the easiest is to realize that the area of the rectangle is the same as the number of ones inside it:
SELECT UL.i, UL.j,
LR.i, LR.j,
((LR.i - LU.i + 1)
* (LR.j - UL.j - 1)) AS area
FROM Grid AS UL, Grid AS LR
WHERE UL.i <= LR.i
AND UL.j <= LR.j
AND (SELECT SUM(cell) -- total ones is equal
FROM Grid
WHERE i BETWEEN UL.i AND LR.i
AND j BETWEEN UL.j AND LR.j)
= ((LR.i - LU.i + 1) -- the area of subarray
* (LR.j - UL.j - 1));
But why do all this math when you can simply write this:
SELECT UL.i, UL.j,
LR.i, LR.j,
((LR.i - LU.i + 1)
* (LR.j - UL.j - 1)) AS area
FROM Grid AS UL, Grid AS LR
WHERE UL.i <= LR.i
AND UL.j <= LR.j
AND NOT EXISTS (SELECT * -- no zeros inside
FROM Grid
WHERE i BETWEEN UL.i AND LR.i
AND j BETWEEN UL.j AND LR.j
AND Grid.cell = 0);
This answer suggests another improvement, namely that we can see if any of the borders on the four sides around the rectangle we have found so far contain a zero. If one or more edges had no zeros in it, then we know that we can extend our rectangle in that direction and get a larger rectangle.
SELECT UL.i, UL.j,
LR.i, LR.j,
((LR.i - LU.i + 1)
* (LR.j - UL.j - 1)) AS area
FROM Grid AS UL, Grid AS LR
WHERE UL.i <= LR.i
AND UL.j <= LR.j
AND NOT EXISTS (SELECT * no zeros inside
FROM Grid AS G1
WHERE Grid.i BETWEEN UL.i AND LR.i
AND Grid.j BETWEEN UL.j AND LR.j
AND G1.cell = 0)
and zeros on the side
AND NOT EXISTS (SELECT *
FROM Grid AS G2
WHERE G2.i = UL.i - 1
AND G2.j BETWEEN UL.j AND LR.j
AND G2.cell = 0)
AND NOT EXISTS (SELECT *
FROM Grid AS G3
WHERE G3.i = LR.i + 1
AND G3.j BETWEEN UL.j AND LR.j
AND G3.cell = 0)
AND NOT EXISTS (SELECT *
FROM Grid AS G4
WHERE G4.j = UL.j - 1
AND G4.i BETWEEN UL.i AND LR.i
AND G4.cell = 0)
AND NOT EXISTS (SELECT *
FROM Grid AS G5
WHERE G5.j = LR.j + 1
AND G5.i BETWEEN UL.i AND LR.i
AND G5.cell = 0);
This query will return all the rectangles.
The fancy version of the rectangle problem is given an array with positive and negative numbers. Your task is to find the rectangle with the largest sum of values inside the subarray. I will offer a free copy of SQL for Smarties, second edition, for the first answer I get in my email.
Joe Celko is an Atlanta-based independent consultant. He is the author of three
books on SQL -- SQL For Smarties (Morgan-Kaufmann, 1995),
SQL Puzzles and Answers (Morgan-Kaufmann, 1995), and
Instant SQL Programming (Wrox Press, 1997) -- and wrote the SQL for Smarties column for DBMS
magazine. You can contact him via email at www.celko.com or 71062.1056@compuserve.com.
Copyright © 2004 CMP Media Inc. ALL RIGHTS RESERVED
No Reproduction without permission