CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





Table of Contents



New Years's Revelations


Why Kdb is revamping the way we think about databases

by Joe Celko

For me, New Year's Eve is an amateur night for drunks. We more experienced and serious drinkers should give the amateurs the road that night, and then enjoy their misery when they crawl back to work with a hangover.

Unfortunately, I missed that particular middle-aged joy this year because I was in Coconut Grove, Fla. taping a webcast. Supposedly, public speaking is the second greatest fear that most people have after the fear of death. I have been speaking in front of large audiences for decades, and it never bothered me. But looking a video camera in the face is really scary. In my eyes a camera lens is like the barrel of a gun, a very large gun.

The webcast was for Kx Systems Inc. (www.Kx.com), which sells a database product called Kdb. They have an interesting approach to databases worth looking at because it proves there are several very different ways to build a SQL database. The webcast was recorded at the home of Arthur Whitney, the co-founder and chief technical officer at Kx. Prior to founding Kx, Whitney was a managing director of Union Bank of Switzerland (UBS) in New York, where he led an internal team that developed global trading and risk management systems using the K platform.

Punch Cards in the Dark

Back in the dark ages, when I was young, we stored data on punch cards. For those of you under 30 years of age, a punch card is a piece of cardboard with holes in it. The machinery that handled punch cards were called "unit record equipment" because each punch card represented a record.

Then came magnetic tape. What we did with this new technology was promptly imitate the old technology. The tape records were the same size as a punch card (80 columns of character) and arranged the data so it read one record at a time.

After magnetic tape, we got magnetic disk technology. Once again, we imitated the old technology with the new technology. We laid out the records as physically contagious fields of data, just like a punch card.

Finally, the file systems were replaced by databases starting in the 1970s. A database separates the data from the application program and handles it apart from the programs that use it. However, the representation of data was still based on a unit record.

When SQL and the relational products came along, we got smarter and started to realize that a database is more than a collection of files, a table is not a file, a row is not a record, and a column is not a field.

Well, not all of us. If you read some current newsgroup postings, you will see many users are still confused today, which is somewhat forgivable. What is unforgivable is when this confusion also exists in products. A lot of the most popular SQL systems still have an underlying unit record model of data storage. They fetch whole rows of data from separate tables in blocks and they are disk-centric.

Putting the "Fun" in Functional Programming

Where is this history lesson going? Let's look to Kx, their K language, Kdb database, and another history lesson. The K language is a functional programming language in the tradition of APL or FP. Most of today's programmers learn object oriented languages like Java and C++, and the slightly older programmers learned procedural languages like Pascal and C. Neither have seen a functional programming language.

The big idea of functional programming is that you have vectors of data and functional operators that work on the vectors to produce a new vector. You program by nesting function calls. It's surprising how small a set of operators is required.

This is starting to sound a bit like SQL operations that are done on whole tables rather than on one data element at a time. But vectors are not tables; they have an ordering and can be nested within each other to build complex structures.

What Kx did in their database was take a memory-centric instead of a disk-centric view of data. Main storage is both big and cheap these days, so there is much less need to constantly refer to disk storage. Instead, think of the disk drive as a back up device.

Next, do not view a table as a file made up of rows that are just punch card records in new clothes. Think of a table as a collection of vectors, where each vector holds a column. Given a table with 100 columns in it and a query that uses only three columns, you pull the three vectors into main storage and leave the other 97 columns out on disk. You read only what you need to use. Reading a vector is quite fast if the data on the hard drive is organized properly -- the read-write head goes to a physical track and just sits there.

Kdb can do 40,000+ TPC/B transactions per second against a 100 megabyte database on a Pentium 2. In an application at Lehman Brothers, doing time series on stock data keyed by instrument id and transaction date, Kdb was able to find all the information pertaining to a given set of instruments at a given date. This query returned 7500 rows of 15 columns in less than a second from a database of 11 million rows of several hundred columns and a total size of 10 Gigabytes.

Setting SQL Straight

I am about to admit something -- SQL is not good at things that assume an ordering, such as time series. If you have read any of my books, you know that a time series usually involves self-joins, with two or three copies of the table representing past, present, and future events relative to each point in time. Moving averages and other statistics are hard in SQL because tables have no ordering.

Another admission -- SQL is not good at handling different levels of aggregation in the same query. A set is a collection of things of the same kind, and things at different levels of aggregation are not the same kind of things. This is why the GROUP BY clause in SQL destroys the original working table in the FROM clause to produce a grouped table in which each row is a summary of a group.

Having an ordering established on the tables at the start of the session lets Kdb add some extension to SQL that make time series much easier. And, obviously, ordering does not prevent you from writing straight Standard SQL code in the system. Whitney remarked that while people could do complex data models in Kdb, they always seem to come back to a normalized, relational database in practice. There are too many benefits in it to get fancy.

Are there any drawbacks to Kdb? They need to improve the syntax so that Kdb's dialect looks more like Standard SQL. They have Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) tools, but they need more developer tools to make the product popular.

Puzzle:

This puzzle is from Kent Johnson and came to me back on May 10, 1999. It is a good example of how complicated a query for cumulative statistics can be.

I would like to create a stored procedure that should show the most important customers as defined by that subset of customers who together represent 50 percent of the total amount of business. Let's make it easy and assume that the total amount is $100.00 and the data looks like this:

CREATE TABLE Customers
(cust_id CHAR(10) NOT NULL PRIMARY KEY,
amount DECIMAL(12,4) NOT NULL);
INSERT INTO Customers VALUES ('A', 25.00);
INSERT INTO Customers VALUES ('B', 15.00);
INSERT INTO Customers VALUES ('C', 10.00);
INSERT INTO Customers VALUES ('D', 5.00);
INSERT INTO Customers VALUES ('E', 1.00);
INSERT INTO Customers VALUES ('F', 1.00);

Doing the analysis by hand, we can see that the cumulative percentage looks like this:

cust_id amount cum_percent
===========================
'A' 25.00	25
'B' 15.00	40
'C' 10.00	50
'D' 5.00	55
'E' 1.00	56
'F' 1.00	57
...

The result should be customers 'A,' 'B,' and 'C,' who represent 50 percent of the total amount.

Answer:

The first solution proposed to this problem was:

SELECT C1.cust_id, C1.amount, SUM(C2.amount) AS cumulative
FROM Customers AS C1
INNER JOIN
Customers AS C2
ON C1.amount <= C2.amount
WHERE C1.amount
>= (SELECT MAX(x.amount)
FROM (SELECT C3.amount
FROM Customers AS C3
INNER JOIN
Customers AS C4
ON C3.amount <= C4.amount
GROUP BY C3.cust_id, C3.amount
HAVING SUM(C4.amount) >= 50.00) AS x(amount))
GROUP BY C1.cust_id, C1.amount;

Can you do better? Hint: notice the $50.00, not 50 percent, is "hard wired" into the query.




Joe Celko is an Atlanta-based independent consultant. He is the author of Instant SQL Programming (Wrox Press, 1997). You can contact him at www.celko.com or 71062.1056@compuserve.com.





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address