Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Digital Library
Subscribe
Home




December 05, 2000




Ten Things I Hate About You


SQL needs to break old habits and find its own voice.

by Joe Celko

My diversion at work is Web surfing over the SQL-related newsgroups to spread my knowledge and learn from others. Well, that sounds better than saying I use the company Internet connection to find new porno sites and flame people, doesn't it?

The result is that I read a lot of code written by a lot of different people. After more than 25 years of the relational model, people are still writing SQL as if it were a 1950s punch card file system. It takes time for a new tool to find its own voice. Motion pictures were originally shot from a fixed position because that is how people viewed stage plays. The engine in a car is in front because that is where the horse went.

But usually, high technology finds its own voice a lot faster than what you see in SQL programs. Here are some examples, based on things that have actually been posted in newsgroups by working programmers.

1) Punch card formatting. In the days of punch cards, programmers wrote their code on special forms that were given to keypunch girls (yes, I know that "girls" is now politically incorrect). They received the deck of cards back, submitted it as a batch job, and received their printout with all the errors later.

We coded the cards in such a way as to let us rearrange them without going to a keypunch machine and retyping them -- or needing to rewrite the coding forms from scratch to fix errors. For example, each line shown here would be on one card:

SELECT
A
, B
, C
FROM
FOOBAR
;

I can add a new column, table, or clause or pull out a line without changing the rest of the code. Too bad a human being cannot read it. Try reading this:

Paris
in the
the spring.

What’s wrong? The faster you read, the harder it is to see that "the" appears twice in the text. This is the year 2000 and we have had video terminals and text editors for decades, people!

2) No simple rules for capitalization. This is often the result of generated code that should be consistent. You will see keywords in uppercase, lowercase, and mixed case. User-created identifiers are also in uppercase, lowercase, and mixed case.

The worst situation is to use only uppercase. There are very good reasons for mixed case in newspapers; get any book on typography for the details. The next worse is to use only lowercase. Your eye needs a way to pick out structure words from user defined words in a program text. The two visual cues are the layout on the page and the capitalization.

Before you ask, color does not work for everyone. In a common psychology test, there is a deck of flash cards with the names of colors printed in different colored inks (for example the word "RED" in green ink, and so forth). You show the subject the cards at a constant rate and ask them to either tell you the color or read the word. This is a quick, crude measure of how fast you can switch processing from one side of the brain to the other. Your score will not change much over a lifetime unless you get a brain injury. These two processes are so fundamentally different that I cannot say "read me the color" or "tell me the word" in English.

The rules I use are:

a) All reserved words in uppercase. One clause per line aligned on a gutter as a vertical visual cue.

b) All scalars in lowercase. Follow the Metadata Standards Committee naming conventions.

c) All schema objects are capitalized. Use plural or collective nouns for table names since they are sets and not scalar values.

3) Lack of keys in tables. If you don't have a key, then you don't have a table. It is that fundamental and simple. If you have more than one key, then you are in trouble. SQL Server and Sybase programmers are the worst offenders. They use the non-relational, non-standard, non-portable, proprietary IDENTITY column and assume it is as good as a key. For example, this was posted on a newsgroup:

CREATE TABLE US_States
(state_id IDENTITY NOT NULL,
state_abbreviation CHAR(2) NOT NULL,
state_name VARCHAR(15) NOT NULL);

The idiot who did this assumed that the IDENTITY column was a key. Those other two columns better be unique or this thing is not going to work. The correct declaration should have been something like this:

CREATE TABLE US_States
(state_code CHAR(2) NOT NULL PRIMARY KEY,
state_name VARCHAR(15) NOT NULL UNIQUE);

Another point is that the two-letter state codes we use are not abbreviations; they are codes. That is getting back to the Metadata Standards Committee naming conventions. It is amazing how much bad code you can put in so few lines.

4) Insanely wide character columns. Why do you want to have everything in your table be VARCHAR(255)? This is the result of really bad design, where programmers pick the most general datatype they have.

5) Prefixes on variables and schema objects. In the early days, the compilers were close to the hardware and needed all the help they could get. In Fortran, all variable names that started with I through N were of type INTEGER and the rest were floating point numbers. In BASIC, variables that started with a $ were strings.

This is the year 2000; the ‘60s are over! Things like "str_firstname" or "tblPayroll" are redundant. The syntax of SQL takes care of the allowable datatypes for you and the compiler.

What you are doing is trying to expose the physical storage choices in your logical data model all over again. And you are making your code hard to read. Try to read "Paris in the Spring," "nounParis prepIn artThe nounSpring" and see if the prefixes make it easier to understand; now imagine that was a 20 word sentence with subclauses.

6) Different names for the same data element. I have seen one person use the name "id" in the Requests table, then proceed to call the same data element "request_id" in the other tables. Why did it change in one table? What is fundamentally different about it, so that you must rename it?

The next stupidest thing is using a prefix that repeats the table name in front of the data elements. For example, calling the street address of a customer "c_street" and the street address of an employee "e_street." Are they serviced by a different postal system? Or does a different edit process verify them? Those prefixes make it hard to find where the address information is in the entire database if we need to add constraints to them, edit them, or do anything else.

This is another example of not understanding what a physical model and a logical model are. An attribute is the same no matter what entity has it -- color is color, weight is weight, and so forth. You can make a logical change to data without making a physical change to it. For example, I can do things with verified data that I do not want to do with raw data, even though they have the same values in all their attributes. There is a big logical difference in the two.

7) Too many NULL-able columns. This is usually the sign of a denormalized table. Instead of thinking out the design to at least Third Normal form, the programmer just threw around a lot of NULLs.

The NULL is the only missing value marker in SQL, but it does not tell you why the value is missing. Is it not applicable? Is the attribute itself missing so no value will ever be possible? Is it an error? Throwing NULLs around can result in multiple meanings appearing in the same column, destroying your logical model.

8) Needlessly proprietary code. Again, the Sybase and SQL Server people are the guiltiest. The poor Oracle people are forced into proprietary code because their product is so bad, which is another issue. For example, SQL Server accepts INSERT instead of INSERT INTO, but INSERT INTO is the Standard SQL-92 syntax. So why use the proprietary version and destroy portability and readability for the poor slob that has to maintain your code later?

9) Posting requests for SQL programming help without any DDL or DML. It is easy enough to generate the code automatically from the database. It is easy to cut and paste it into an email. But people in newsgroups expect the other people to correctly guess the table names, constraints, datatypes, primary and foreign key relationships, and everything else from a English narrative description of the problem. Would you work from specs like that? Would you do it for free?

10) Posting bizarre personal pseudo-code instead of real DDL and DML statements. This gives the impression that you have something more useful than an English language narrative. However, the truth is that someone then has to figure out what you meant and still has to write their own DDL and DML statements to answer your question.

Richard Romley, my arch SQL puzzle nemesis, sent this one in March 1999. He noticed my comment on the rarity of using a FULL OUTER JOIN. He sent a simplified version of any actual problem to which he found a solution that used both a FULL OUTER JOIN and CROSS JOIN in the same query.

CREATE TABLE Students
(student_id CHAR(10) NOT NULL PRIMARY KEY);
INSERT INTO Students VALUES('Arnold');
INSERT INTO Students VALUES('Barbara');
INSERT INTO Students VALUES('Charles');
INSERT INTO Students VALUES('Deborah');
CREATE TABLE Subjects
(subject_id CHAR(10) NOT NULL PRIMARY KEY,
credits_needed INTEGER NOT NULL);
INSERT INTO Subjects VALUES('Biology', 6);
INSERT INTO Subjects VALUES('English', 12);
INSERT INTO Subjects VALUES('Math', 15);
CREATE TABLE Achievements
(student_id CHAR(10) NOT NULL,
subject_id CHAR(10) NOT NULL,
course_id CHAR(10) NOT NULL,
credits_earned INTEGER NOT NULL,
PRIMARY KEY (student_id, subject_id, course_id));
INSERT INTO Achievements VALUES ('Arnold', 'English', 'E101', 3);
INSERT INTO Achievements VALUES ('Arnold', 'English', 'E102', 4);
INSERT INTO Achievements VALUES ('Arnold', 'Biology', 'B101', 3);
INSERT INTO Achievements VALUES ('Arnold', 'Physics', 'P101', 3);
INSERT INTO Achievements VALUES ('Arnold', 'Chem', 'C101', 3);
INSERT INTO Achievements VALUES ('Charles', 'Math', 'M101', 3);
INSERT INTO Achievements VALUES ('Charles', 'Math', 'M102', 3);
INSERT INTO Achievements VALUES ('Deborah', 'History', 'H101', 3);

The Students table lists all students of an imaginary school. The Subjects table lists all subjects which are required to be taken by all students and the total credits required in that subject. The Achievements table lists all credits actually earned by students to account for students taking non-required courses.

The problem is to generate a row for each student and each subject, either required or achieved. Some rows will represent completed courses in subjects that are required. Others will have completed courses in subjects that are not required. Some will show requirements with no completed courses.

The desired result set is:


Can you come up with Romley’s answer -- that is, one with both a CROSS JOIN and a FULL OUTER JOIN? Next, can you find an answer that does not use either of these constructs?

Answer:

Note the different combinations of conditions described above all represented here. Richard's solution for this problem is:

SELECT COALESCE(T1.student_id, A1.student_id) AS student,
COALESCE(U1.subject_id, A1.subject_id) AS subject,
U1.credits_needed,
A1.credits_earned
FROM (Students AS T1 CROSS JOIN Subjects AS U1)
FULL OUTER JOIN
(SELECT student_id, subject_id, SUM(credits_earned)
FROM Achievements
GROUP BY student_id, subject_id)
AS A1(student_id, subject_id, credits_earned)
ON T1.student_id = A1.student_id
AND U1.subject_id = A1.subject_id
ORDER BY student, subject;

The CROSS JOIN pairs each student and required subjects. The FULL OUTER JOIN matches courses in completed subjects with the results of the CROSS JOIN, either one of which might exist without the other.

However, my objection was that the schema was flawed. We really want some declarative referential integrity on the tables, like this:

CREATE TABLE Achievements
(student_id CHAR(10) NOT NULL
REFERENCES Students(student_id),
subject_id CHAR(10) NOT NULL
REFERENCES Subjects(subject_id),
course_id CHAR(10) NOT NULL,
credits_earned INTEGER NOT NULL,
PRIMARY KEY (student_id, subject_id, course_id));

Once you make that change, you must add rows for the subjects that Romley used but did not define:

INSERT INTO Subjects VALUES ('Chem', NULL);
INSERT INTO Subjects VALUES ('History', NULL);
INSERT INTO Subjects VALUES ('Physics', NULL);

Now, you can depend on certain relationships to be enforced and write the query this way:

SELECT T1.student_id AS student,
U1.subject_id AS subject,
U1.credits_needed, A1.credits_earned
FROM (Students AS T1
CROSS JOIN
Subjects AS U1)
LEFT OUTER JOIN
(SELECT student_id, subject_id, SUM(credits_earned)
FROM Achievements
GROUP BY student_id, subject_id)
AS A1 (student_id, subject_id, credits_earned)
ON T1.student_id = A1.student_id
AND U1.subject_id = A1.subject_id
WHERE U1.credits_needed IS NOT NULL
OR A1.credits_earned IS NOT NULL
ORDER BY student, subject;

Another way to do this in SQL-92 is with scalar subquery expressions:

SELECT S1.student_id, C1.subject_id,
(SELECT credits_needed
FROM Subjects AS C2
WHERE C2.subject_id = C1.subject_id)
AS credits_needed,
(SELECT SUM(credits_earned)
FROM Achievements AS A1
WHERE A1.student_id = S1.student_id
AND A1.subject_id = C1.subject_id
GROUP BY student_id, subject_id)
AS credits_earned
FROM Students AS S1,
Subjects AS C1
WHERE credits_needed IS NOT NULL
OR credits_earned IS NOT NULL;

I have no idea what the timings are like on this query, but it has no outer joins at all. Both of these queries are based on the idea of creating all possible combinations, then pruning out the ones which we do not want — in this case, no credits in a non-required course.




Joe Celko is an Atlanta-based independent consultant. He is the author of Instant SQL Programming (Wrox Press, 1997). You can contact him at www.celko.com or 71062.1056@compuserve.com.






IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







Techweb
Informationweek Business Technology Network
InformationweekInformationweek 500Informationweek 500 ConferenceInformationweek AnalyticsInformationweek Events
Informationweek MagazineGlobal CIOIWK Government ITbMightyByte and SwitchDark Reading
Digital LibraryIntelligent EnterpriseInternet EvolutionNetwork ComputingPlug Into The CloudDr. DobbsContentinople
space
TechWeb Events Network
InteropVoiceConWeb 2.0 ExpoWeb 2.0 SummitEnterprise 2.0Mobile Business ExpoNoJitter
Black HatGTECEnergy CampCloud ConnectGov 2.0 ExpoGov 2.0 Summit
space
Light Reading Communications Network
Light ReadingLight Reading AsiaUnstrungCable Digital NewsInternet EvolutionPyramid Research
Heavy ReadingLight Reading LiveLight Reading InsiderEthrnet ExpoTelco TVTower Technology Summit
space
Financial Technology Network
Advanced TradingBank Systems and TechnologyInsurance and TechnologyWall Street and TechnologyAccelerating WallstreetBST SummitBuyside Trading SummitIT Summit
space
Microsoft Technology Network
MSDNTechNetTotal IT ProTotal Dev ProNET Total Dev Pro CommunitySQL Total Dev Pro Community
space