The Logic of Failure
Programmers who rationalize their decision-making logic can create suicidal situations for
databases
by Joe Celko
Appendix
Sometimes a schema design is so bad that the data can never be reliable.
The following DDL is actual code which was posted by "Jason" on the SQL Server newsgroup in January. This
isn't something I made up to use as an example.
The problem is the usual "Orders and Order details" tables common in almost every business. Jason assures
us that "I removed some of the fields [sic] within the Order_T table to simplify things," so
we can assume the actual tables are even worse. He used a SQL Server utility program to dump the DDL
script, which doesn't use a uniform set of rules for formatting the code, so I have taken the liberty of
cleaning it up a bit:
CREATE TABLE dbo.Order_T
(OrderID INTEGER IDENTITY(3000000,1) NOT NULL,
ShipToID INTEGER NULL,
ShipToName VARCHAR(30) NULL,
ShipFirstName VARCHAR(14) NULL,
ShipLastName VARCHAR(15) NULL,
BillingFirstName VARCHAR(14) NULL,
BillingLastName VARCHAR(15) NULL,
PaymentType VARCHAR(20) NULL,
OrderTotal SMALLMONEY NULL,
OrderDate SMALLDATETIME NULL,
OrderStatus VARCHAR(50) NULL,
OrderTaxTotal SMALLMONEY NULL,
OrderShippingTotal SMALLMONEY NULL,
OrderSubTotal SMALLMONEY NULL,
UserID INTEGER NULL,
ShippingMethodID INTEGER NULL,
ShippingMethod VARCHAR(50) NULL,
CouponCode VARCHAR(50) NULL,
GiftCertificateCode VARCHAR(50) NULL,
OrderType VARCHAR(20) NULL);
CREATE TABLE dbo.Order_Detail_T
(OrderDetailID INTEGER IDENTITY(100000,1) NOT NULL,
OrderID INTEGER NOT NULL,
ProductDetailID INTEGER NULL,
ProductName VARCHAR(50) NULL,
ProductDescription VARCHAR(1000) NULL,
ExtraDescription VARCHAR(4000) NULL,
ProductParentSKU VARCHAR(50) NULL,
ProductFullSKU VARCHAR(50) NULL,
ProductStyle VARCHAR(20) NULL,
ProductColOR VARCHAR(20) NULL,
ProductSize VARCHAR(20) NULL,
ProductType VARCHAR(20) NULL,
ProductPrice SMALLMONEY NULL,
ProductTax SMALLMONEY NULL,
ProductShipping SMALLMONEY NULL,
ProductQuantity SMALLINTEGER NULL,
GiftBox TINYINTEGER NULL,
GiftCertificateID INTEGER NULL,
OrderDetailStatusID INTEGER NULL,
OrderReceipt VARCHAR(10) NULL,
OrderInitials VARCHAR(3) NULL,
QuantityAllocated SMALLINTEGER NULL,
Notes VARCHAR(255) NULL);
- Simply looking at the text, you can see that all of the columns are
NULL-able. No key
was declared, but it can only be the OrderID. If you're not familiar with SQL Server, the
IDENTITY "column" being used as a primary key is a nonrelational, proprietary autonumbering
property. Every time you hit a send key to execute an INSERT INTO statement, you will get a
new Order with incomplete data in it.
- The data types
SMALLDATETIME, SMALLMONEY, and TINYINT are
proprietary data types. While SMALLINT is part of Standard SQL, it's not available in all
products. Getting outside data into the table is now harder than it should have been.
- The columns (
ShipToID, ShipToName, ShipFirstName,
ShipLastName) have no business in this table. They belong in a Customers table
and the only the shipto_id should be in the Order, referencing the
Customers. With the name (and presumably address data which was dropped for the posting) in
this table, the chances of getting the customer's name consistent from Order to Order is almost
nonexistent.
The usual fix for this problem is to buy an address-scrubbing package. The operational time and
costs for these products is not inconsiderable.
Also, what's the logical difference between the ShipToName and the combination
(ShipFirstName, ShipLastName)? It looks like one of these is
redundant. But there's also a physical difference; the length of ShipToName is
VARCHAR(30), while the combined length of (ShipFirstName,
ShipLastName) is 29 Characters.
The United States Postal Service has standards for address labels that use 35 Characters per
line to fit the data onto a standard width 3.5 inch printing area. That means that if this
company buys a mailing list from an outside provider that follows these rules (i.e. any company
in the mailing list business), data truncation will result.
- The same objections made in point 3 apply to the (
BillingFirstName,
BillingLastName) and other dropped columns in this subset. But there's more. Most of the
time, the ship_to and the bill_to parties are the same. The exceptions are
usually companies with many branches and one central accounts payable department.
But either situation can be handled by a DEFAULT or TRIGGER action.
Having to enter data by hand every time simply invites data corruption.
- Notice that the money columns (
OrderTotal, OrderTaxTotal,
OrderShippingTotal, OrderSubTotal) are derived from the Order
Details. This schema design only adds the extra overhead of keeping the Orders and
the Order Details balanced correctly. Well, actually, that last sentence was based on the
assumption that the database has triggers which do this work. This is not always true.
- What is the logical difference between
ShippingMethodID and
ShippingMethod? If they are redundant, then keep the INTEGER ShippingMethodID
and rename it to something that conforms to ISO-11179 Metadata standard.
- Why is a
CouponCode and GiftCertificateCode are required part of every
Order? Most Order systems treat refunds of various types as an inventory item that has a negative value.
In the Order Details table, we see more problems.
- We again see the use of the
IDENTITY column, but here it's being used for the line
items of the Order. The IDENTITY Counter doesn't reset for every new
Order, so the OrderDetailID will keep increasing. The Counter can
also create gaps in the sequence.
- Isn't it an amazing coincidence that so many codes LIKE
ProductStyle,
ProductColor, ProductSize and ProductType all are
VARCHAR(20)? If you allow a user to put a 20 Character color name into a column, he will.
- Every column beginning with "
Product-" needs to be in an Inventory table and should be
referenced by a sku (Stock Keeping Unit) code, UPC, EAN or GTIN code.
- Why are there both
ProductDescription and ExtraDescription columns in the
table? If I have a non-NULL ExtraDescription, do I have to have a non-NULL
ProductDescription? How is this enforced? (The detective in me wants to think that perhaps one is
a short description used for screen display and the other is a long description used for a catalog. But
why are they both so long and NULL-able?)
- Isn't it amazing that everything they sell has a gift box and a gift certificate as an attribute?
What happens when I get a gift certificate for, say, $100 and I buy five things for $20 each? Which one
gets the fifty Character
CertificateCode code? If the GiftCertificateCode is
put in more than one detail item, my $100 can be counted more than once.
- The
OrderID column in the Order Details doesn't reference the Orders
table, so you can get Orphan details.
- What's the relationship between
ProductQuantity and QuantityAllocated? I
understand the concept of "quantity Ordered" and "quantity shipped" with the constraint
CHECK(qty_ordered >= qty_shipped). Confusing names can only lead to confusion on the part of
the next programmer who has to write code against this table.
- It's a bit unusual to see tax, shipping Charges, a status code and receipt acknowledgement on an
item-by-item, but it's possible. Usually, these Charges apply to the Order as a whole.
Making a guess at the specs and assuming that certain other tables exist in the schema, this could be
rewritten as:
CREATE TABLE Orders
(order_id INTEGER NOT NULL
CHECK(<>),
shipto_id INTEGER NOT NULL
REFERENCES Customers(cust_id)
ON UPDATE CASCADE,
billto_id INTEGER NOT NULL
REFERENCES Customers(cust_id)
ON UPDATE CASCADE,
payment_type INTEGER NOT NULL, -- numeric codes
order_date DATETIME NOT NULL
DEFAULT CURRENT_TIMESTAMP,
order_status INTEGER NOT NULL
CHECK(order_status IN (.)), -- numeric codes
shipping_method CHAR(7) NOT NULL
CHECK(shipping_method IN ('UPS', 'FedEx',.)),
order_type CHAR(5) NOT NULL, -- abbrev code
CHECK(order_type IN (.)),
order_notes VARCHAR(255) NOT NULL DEFAULT '');
CREATE TABLE order_details
(order_id INTEGER NOT NULL
REFERENCES Orders(order_id),
order_detail_id INTEGER NOT NULL,
sku INTEGER NOT NULL
REFERENCES Inventory(sku)
ON UPDATE CASCADE,
order_detail_status INTEGER NOT NULL
CHECK(order_detail_status IN (.)),
qty_ordered INTEGER NOT NULL
CHECK(qty_ordered > 0),
qty_shipped INTEGER NOT NULL
CHECK(qty_shipped >= 0),
CHECK(quantity_shipped <= quantity_ordered),
PRIMARY KEY(order_detail_id, order_id));
Obviously this is just a skeleton, but I hope that you can see how some bad data was prevented from
entering the database. The reason for the original posting was a failed attempt at a query to return
totals of various columns for a given day. That query looked like this:
SELECT SUM(OD.ProductQuantity) AS Units,
SUM(O.OrderSubTotal) AS OrderSubTotal,
SUM(O.OrderShippingTotal) AS OrderShippingTotal,
SUM(O.OrderTaxTotal) AS OrderTaxTotal,
SUM(O.OrderTotal) AS OrderTotal
FROM Order_T AS O
JOIN
Order_Detail_T AS OD
ON O.OrderID = OD.OrderID
WHERE OrderDate >= '1/18/2003 00:00:00'
AND OrderDate <= '1/18/2003 23:59:59');
The set of columns from the Orders were being cross joined to the Order details, giving the wrong totals.
The poor design leads to internal error creation in the database. The ambiguous date format simply adds
to the chance for errors in an already impossible situation.
Joe Celko