Guide to the TechWeb Network

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Whitepapers
Subscribe
Home




May 31, 2003

The Logic of Failure

Programmers who rationalize their decision-making logic can create suicidal situations for databases

by Joe Celko

Appendix

Sometimes a schema design is so bad that the data can never be reliable.

The following DDL is actual code which was posted by "Jason" on the SQL Server newsgroup in January. This isn't something I made up to use as an example.

The problem is the usual "Orders and Order details" tables common in almost every business. Jason assures us that "I removed some of the fields [sic] within the Order_T table to simplify things," so we can assume the actual tables are even worse. He used a SQL Server utility program to dump the DDL script, which doesn't use a uniform set of rules for formatting the code, so I have taken the liberty of cleaning it up a bit:

CREATE TABLE dbo.Order_T
(OrderID INTEGER IDENTITY(3000000,1) NOT NULL,
   ShipToID INTEGER NULL,
   ShipToName VARCHAR(30) NULL,
   ShipFirstName VARCHAR(14) NULL,
   ShipLastName VARCHAR(15) NULL,
   BillingFirstName VARCHAR(14) NULL,
   BillingLastName VARCHAR(15) NULL,
   PaymentType VARCHAR(20) NULL,
   OrderTotal SMALLMONEY NULL,
   OrderDate SMALLDATETIME NULL,
   OrderStatus VARCHAR(50) NULL,
   OrderTaxTotal SMALLMONEY NULL,
   OrderShippingTotal SMALLMONEY NULL,
   OrderSubTotal SMALLMONEY NULL,
   UserID INTEGER NULL,
   ShippingMethodID INTEGER NULL,
   ShippingMethod VARCHAR(50) NULL,
   CouponCode VARCHAR(50) NULL,
   GiftCertificateCode VARCHAR(50) NULL,
   OrderType VARCHAR(20) NULL);

CREATE TABLE dbo.Order_Detail_T
(OrderDetailID INTEGER IDENTITY(100000,1) NOT NULL,
   OrderID INTEGER NOT NULL,
   ProductDetailID INTEGER NULL,
   ProductName VARCHAR(50) NULL,
   ProductDescription VARCHAR(1000) NULL,
   ExtraDescription VARCHAR(4000) NULL,
   ProductParentSKU VARCHAR(50) NULL,
   ProductFullSKU VARCHAR(50) NULL,
   ProductStyle VARCHAR(20) NULL,
   ProductColOR VARCHAR(20) NULL,
   ProductSize VARCHAR(20) NULL,
   ProductType VARCHAR(20) NULL,
   ProductPrice SMALLMONEY NULL,
   ProductTax SMALLMONEY NULL,
   ProductShipping SMALLMONEY NULL,
   ProductQuantity SMALLINTEGER NULL,
   GiftBox TINYINTEGER NULL,
   GiftCertificateID INTEGER NULL,
   OrderDetailStatusID INTEGER NULL,
   OrderReceipt VARCHAR(10) NULL,
   OrderInitials VARCHAR(3) NULL,
   QuantityAllocated SMALLINTEGER NULL,
   Notes VARCHAR(255) NULL);

  1. Simply looking at the text, you can see that all of the columns are NULL-able. No key was declared, but it can only be the OrderID. If you're not familiar with SQL Server, the IDENTITY "column" being used as a primary key is a nonrelational, proprietary autonumbering property. Every time you hit a send key to execute an INSERT INTO statement, you will get a new Order with incomplete data in it.
  2. The data types SMALLDATETIME, SMALLMONEY, and TINYINT are proprietary data types. While SMALLINT is part of Standard SQL, it's not available in all products. Getting outside data into the table is now harder than it should have been.
  3. The columns (ShipToID, ShipToName, ShipFirstName, ShipLastName) have no business in this table. They belong in a Customers table and the only the shipto_id should be in the Order, referencing the Customers. With the name (and presumably address data which was dropped for the posting) in this table, the chances of getting the customer's name consistent from Order to Order is almost nonexistent.

    The usual fix for this problem is to buy an address-scrubbing package. The operational time and costs for these products is not inconsiderable.

    Also, what's the logical difference between the ShipToName and the combination (ShipFirstName, ShipLastName)? It looks like one of these is redundant. But there's also a physical difference; the length of ShipToName is VARCHAR(30), while the combined length of (ShipFirstName, ShipLastName) is 29 Characters.

    The United States Postal Service has standards for address labels that use 35 Characters per line to fit the data onto a standard width 3.5 inch printing area. That means that if this company buys a mailing list from an outside provider that follows these rules (i.e. any company in the mailing list business), data truncation will result.
  4. The same objections made in point 3 apply to the (BillingFirstName, BillingLastName) and other dropped columns in this subset. But there's more. Most of the time, the ship_to and the bill_to parties are the same. The exceptions are usually companies with many branches and one central accounts payable department.

    But either situation can be handled by a DEFAULT or TRIGGER action. Having to enter data by hand every time simply invites data corruption.
  5. Notice that the money columns (OrderTotal, OrderTaxTotal, OrderShippingTotal, OrderSubTotal) are derived from the Order Details. This schema design only adds the extra overhead of keeping the Orders and the Order Details balanced correctly. Well, actually, that last sentence was based on the assumption that the database has triggers which do this work. This is not always true.
  6. What is the logical difference between ShippingMethodID and ShippingMethod? If they are redundant, then keep the INTEGER ShippingMethodID and rename it to something that conforms to ISO-11179 Metadata standard.
  7. Why is a CouponCode and GiftCertificateCode are required part of every Order? Most Order systems treat refunds of various types as an inventory item that has a negative value.

In the Order Details table, we see more problems.

  1. We again see the use of the IDENTITY column, but here it's being used for the line items of the Order. The IDENTITY Counter doesn't reset for every new Order, so the OrderDetailID will keep increasing. The Counter can also create gaps in the sequence.
  2. Isn't it an amazing coincidence that so many codes LIKE ProductStyle, ProductColor, ProductSize and ProductType all are VARCHAR(20)? If you allow a user to put a 20 Character color name into a column, he will.
  3. Every column beginning with "Product-" needs to be in an Inventory table and should be referenced by a sku (Stock Keeping Unit) code, UPC, EAN or GTIN code.
  4. Why are there both ProductDescription and ExtraDescription columns in the table? If I have a non-NULL ExtraDescription, do I have to have a non-NULL ProductDescription? How is this enforced? (The detective in me wants to think that perhaps one is a short description used for screen display and the other is a long description used for a catalog. But why are they both so long and NULL-able?)
  5. Isn't it amazing that everything they sell has a gift box and a gift certificate as an attribute? What happens when I get a gift certificate for, say, $100 and I buy five things for $20 each? Which one gets the fifty Character CertificateCode code? If the GiftCertificateCode is put in more than one detail item, my $100 can be counted more than once.
  6. The OrderID column in the Order Details doesn't reference the Orders table, so you can get Orphan details.
  7. What's the relationship between ProductQuantity and QuantityAllocated? I understand the concept of "quantity Ordered" and "quantity shipped" with the constraint CHECK(qty_ordered >= qty_shipped). Confusing names can only lead to confusion on the part of the next programmer who has to write code against this table.
  8. It's a bit unusual to see tax, shipping Charges, a status code and receipt acknowledgement on an item-by-item, but it's possible. Usually, these Charges apply to the Order as a whole.

Making a guess at the specs and assuming that certain other tables exist in the schema, this could be rewritten as:

CREATE TABLE Orders
(order_id INTEGER NOT NULL
   CHECK(<>),
   shipto_id INTEGER NOT NULL
   REFERENCES Customers(cust_id)
   ON UPDATE CASCADE,
   billto_id INTEGER NOT NULL
   REFERENCES Customers(cust_id)
   ON UPDATE CASCADE,
   payment_type INTEGER NOT NULL, -- numeric codes
   order_date DATETIME NOT NULL
   DEFAULT CURRENT_TIMESTAMP,
   order_status INTEGER NOT NULL
   CHECK(order_status IN (.)), -- numeric codes
   shipping_method CHAR(7) NOT NULL
   CHECK(shipping_method IN ('UPS', 'FedEx',.)),
   order_type CHAR(5) NOT NULL, -- abbrev code
   CHECK(order_type IN (.)),
   order_notes VARCHAR(255) NOT NULL DEFAULT '');

CREATE TABLE order_details
(order_id INTEGER NOT NULL
   REFERENCES Orders(order_id),
   order_detail_id INTEGER NOT NULL,
   sku INTEGER NOT NULL
   REFERENCES Inventory(sku)
   ON UPDATE CASCADE,
   order_detail_status INTEGER NOT NULL
   CHECK(order_detail_status IN (.)),
   qty_ordered INTEGER NOT NULL
   CHECK(qty_ordered > 0),
   qty_shipped INTEGER NOT NULL
   CHECK(qty_shipped >= 0),
   CHECK(quantity_shipped <= quantity_ordered),
   PRIMARY KEY(order_detail_id, order_id));

Obviously this is just a skeleton, but I hope that you can see how some bad data was prevented from entering the database. The reason for the original posting was a failed attempt at a query to return totals of various columns for a given day. That query looked like this:

SELECT SUM(OD.ProductQuantity) AS Units,
   SUM(O.OrderSubTotal) AS OrderSubTotal,
   SUM(O.OrderShippingTotal) AS OrderShippingTotal,
   SUM(O.OrderTaxTotal) AS OrderTaxTotal,
   SUM(O.OrderTotal) AS OrderTotal
FROM Order_T AS O
   JOIN
   Order_Detail_T AS OD
   ON O.OrderID = OD.OrderID
WHERE OrderDate >= '1/18/2003 00:00:00'
   AND OrderDate <= '1/18/2003 23:59:59');

The set of columns from the Orders were being cross joined to the Order details, giving the wrong totals. The poor design leads to internal error creation in the database. The ambiguous date format simply adds to the chance for errors in an already impossible situation.

— Joe Celko









IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







InformationWeek Business Technology Network
InformationWeekInformationWeek 500InformationWeek 500 ConferenceInformationWeek AnalyticsInformationWeek CIO
InformationWeek EventsInformationWeek ReportsInformationWeek MagazinebMightyByte and SwitchDark Reading
Digital LibraryIntelligent EnterpriseInternet EvolutionNetwork ComputingNo Jitter
space
Techweb Events Network
InteropVoiceConWeb 2.0 ExpoWeb 2.0 SummitEnterprise 2.0 ConferenceMobile Business ExpoSoftware ConferenceCSI - Computer Security Institute
Black HatGTECEnergy CampMashup CampStartup Camp
space
Light Reading Communications Network
Light ReadingLight Reading EuropeUnstrungLight Reading's Cable Digital NewsConstantinopleInternet Evolution
Heavy ReadingLight Reading Live!Light Reading InsiderEthernet ExpoOptical ExpoTeleco TVTower Technology Summit
space
Financial Technology Network
Advanced TradingBank Systems & TechnologyInsurance & TechnologyWall Street & TechnologyAccelerating Wall StreetBank Systems & Technology Executive SummitBuyside Trading SummitInsurance & Technology Executive Summit
space
Microsoft Technology Network
MSDN MagazineTechNetThe Architecture Journal
space