CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





May 31, 2003

The Logic of Failure

Programmers who rationalize their decision-making logic can create suicidal situations for databases

by Joe Celko

Knowledge and action don't seem to go together. Roughly 78 percent of respondents to a February DataFlux Corp. survey said their organizations need additional education about the importance of data quality and methods to maintain and improve it. What organizations need to know is that some bad data can't be stopped, just as some traffic accidents aren't preventable. But like most traffic accidents, much bad data can be prevented.

In Dietrich Dorner's The Logic of Failure (HarperCollins Publishers, 1997), from which I stole my headline, Dorner traces the reasoning that people used in historical situations that led to an "event cascade," such as the Chernobyl disaster.

We create similar lines of reasoning with databases. Look at how many columns are declared with NVARCHAR(< < upper limit > >) because it's the most general data type and can be cast to almost anything else. That's the logic of this design choice. But if you allow people to input a 255-character part identification number in Chinese, they'll do it. Maybe on purpose, maybe by accident, but it will happen.

Slips in Common Sense

When designing a table to hold events, one of the most common designs is to use only one timestamp, thus

CREATE TABLE Events (event CHAR (15) NOT NULL, start_date DATE NOT NULL DEFAULT CURRENT_TIMESTAMP, measurement INTEGER NOT NULL, .);

Appendix

Read more about this in the appendix.

The logic is that you can figure out the duration of the event by looking up the next start date, saving space and eliminating a NULL-able "end_date" column to mark the event's completion. A NULL in the end_date column would be used in the expression COALESCE (end_date, CURRENT_TIMESTAMP) to give you accurate current information.

But say the events are a job history and we leave out a job or two; you'd never know that data is missing. While you might not be able to provide the missing data, it's important to at least detect that it's missing.

Null Doesn't Mean Zero

I'm an advocate of NULLs, but I admit that they're dangerous. New SQL programmers will make every column NULL-able when possible without any regard for the meaning. The logic is that data might be missing and you're supposed to use NULLs for that. And besides, it's less typing.

What if the NULL has two or more meanings? There's a big difference between a hospital condition listed as "undiagnosed" and "diagnosed, but we don't know what it is!"

What if the NULL is being used instead of a natural default value? The most common example is allowing a NULL in a numeric column where you should have had a zero. All the aggregate functions are now wrong.

The Intent

Another invitation to bad data is not putting in obvious check constraints. Or even worse, some developers write the database at the same time as the application code. Then you'll hear, "We put all our integrity constraints in the application," spoken as if it's a good thing. The logic is that the developers know how to write procedural code, but not logical constraints, so they treat the database like a file.



Rate This Article

Comments:

Optional e-mail address:

It's not clear how programmers can be sure that all future programs will do all these integrity constraints the same way. Oh, and they don't have time to verify that the present applications are doing them either.

Bad schema design leads to baroque queries, baroque queries often lead to bad results, and there is your event cascade, paved with good intentions.


Joe Celko [celko@northfacelearning.com] is vice president of RDBMS at North Face Learning in Salt Lake City and author of five books on SQL.








IE Weekly Newsletter
Subscribe to the newsletter
    Email Address