CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

UBM
Intelligent Enterprise - Better Insight for Business Decisions
Part of the TechWeb Network
Intelligent Enterprise
search Intelligent Enterprise





Table of Contents



Playing the Name Game


Understanding the distinctions between a unique entity, a collection of indistinguishable units, and a divisible mass will help you design the best database for your needs.

by Joe Celko

Columns in a table are either key or non-key attributes. The key attributes are also called identifiers or determiners. These alternate names are descriptive of the functions these attributes serve in a database and there are subtle differences.

In SQL, being a key is an option in the table declaration; it stays with the table itself. A key can be more than one column and usually has some implications in the physical implementation of the database. But it is more of a syntactic than a semantic distinction. Other attributes may be unique and not null, but the database can only use the ones you declare as keys.

Being a determiner refers to a relationship with the other attributes in this table, and perhaps attributes in other tables that constraints and FOREIGN KEY...REFERENCES clauses may or may not be enforce. It is more general in scope than a key.

Being an identifier is the most abstract semantic concept and is the source of the other two. An identifier is how you know that something is an individual thing, apart from all other things in the universe. We should start by defining what characteristics you want in an identifier.

  1. An identifier has to be unique. Otherwise, it is not an identifier.
  2. A personal example of this has happened to me for years. I was named after my father and we were constantly getting each other's mail, phone calls, and so forth because people would assume that such an unusual name could only belong to one person. I finally legally changed my first name from "Joseph" to "Joe" to avoid the confusion.

  3. An identifier should be created at the same time as the entity, or even before the entity exists, but not afterwards. Without an identifier, you simply cannot put an entity into the database.
  4. You might hold the "pre-entity" outside of the system for a short time, however. Imagine a situation where a salesman takes an order at the customer’s site, but has to call his company to get a confirmation number that the customer can then use to reference the transaction. Until that confirmation number is assigned, there is no transaction.

    As an example of an identifier coming into existence before the entity, think about a book that has not been published yet. It can be assigned an International Standard Book Number (ISBN), a title, an author, a price, and everything else while the publisher is waiting to get the manuscript.

    But a future-release book does not behave like a real book. You cannot put it in a box and ship it. You cannot get a review of the book either -- at least not an honest review. It is not the same kind of thing as a published book.

  5. An identifier should be verifiable within itself. That means that when I see a particular kind of identifier, I ought to know if it is syntactically correct. For example, I know that ISBN 0-486-60028-9 has the correct number of digits and that the check digit is correct for a proper ISBN. Later on I can find out that it identifies the Dover Books edition of An Investigation of the Laws of Thought by George Boole.
  6. An identifier should have repeatable verification against the reality that you are trying to capture in your data model. I want to be able to grab any copy of An Investigation of the Laws of Thought off a book shelf, read the physical ISBN off the back of the book, and have it match my inventory.

Exactly what verification means can be a bit fuzzy. At one extreme, prison inmates have their fingerprints taken at control points when they are transported and courts allow DNA evidence for convictions. At the other end of the spectrum, retail stores will accept your check on the assumption that you look like your driver's license photograph.

This is fairly easy to understand in the case of entities, which like people are obviously unique. Although most systems are not able to store fingerprints for identification (yet), using names is not the preferred method. Names are not as stable and unique as we would like, so we map them onto a set of numbers called "employee id numbers" and assume that we can match them up.

But how do you handle things that you cannot identify as individuals? Nobody puts a part number on the grains of rice in a bag. Instead, you identify a collection of the entities -- that is, you put a serial number on the bag.

Collections come in two basic types: those that are individual, but indistinguishable units and those that are a divisible mass. The English language makes a distinction that other languages do not. We can refer to "many" and "few" in the case of indistinguishable units ("there are many bottles of beer in the refrigerator") and "much" and "less" is the case of divisible mass ("there is too much beer in my glass").

The indistinguishable units can be counted (99 bottles of beer) or grouped in collective units (a six pack of beer). The divisible mass is measured with some scale (a bottle holds 350 milliliters).

The distinction between a unique entity, a collection of indistinguishable units, and a divisible mass can shift. To pick an example from the literature, imagine that you are breeding laboratory rats. The ratlets are treated as a litter and assigned a litter number when they are born. As they grow up, each little rat becomes an individual and assigned a unique identification number, encoded in colored markers on his tail. In a similar manner, rice is shipped in bulk, then put into bags with serial numbers at a factory. Tank cars of chemicals are poured into bulk holding tanks, and then back into containers. Each regrouping creates a new entity in the data model.

A family of mechanical puzzles illustrates the conceptual problems with these regrouping transformations. The puzzles have three rectangular pieces that you can assemble in two ways. Both ways produce a picture, but one of the ways has fewer objects in it than the other picture. Where did the missing object go? The answer is that the question is wrong. Each regrouping produces a new set of objects that did not exist before. (Check out www.funkypages.com/hahaha.php?page=/leprechaun/index.php)

The important point for a data modeler is that these three classes are fundamentally different and cannot be mixed in the same set. All the elements of a set have to be of the same type, at the same level of aggregation, and aggregated in the same way.

I cannot have a single column that shows both a count of bags of rice and the number of tons of rice. I have to convert the bags into tons or the tons into bags.

The most aggregated level is a class, which is an aggregation without regard to the groupings or entities in it. For example, I might have an inventory for a grocery store that looks like this:

Suppliers

Supplier	product 
======================= 
'Acme Grains'	'rice' 
'Acme Grains'	'wheat' 
'P&A Foods'	'apples' 
'P&A Foods'	'rice'

It would be wrong to add another row that reads ('Acme Grains', 'rice'), even if the table did not have a declaration of PRIMARY KEY (supplier, product). The value 'rice' is a class; I don't know if the rice is being sold by the bag or by the ton.

Now consider a table that represents an inventory record:

Inventory

Supplier	product	serial_nbr
======================================= 
'Acme Grains'	'rice'	123456 
'Acme Grains'	'wheat'	876291 
'P&A Foods'	'apples'	900821 
'P&A Foods'	'rice'	578765

Now we do care about the individual entities and want to track them by a serial number on their packaging.

It is amazing how the same reality can have so many data models, but that is why a good database designer gets those big bucks.

Puzzle

This one came in as a data warehouse problem. You have a history table of customer purchases that looks like this:

CREATE TABLE History (customer_id CHAR(10) NOT NULL,
order_date DATE NOT NULL,
order_amount DECIMAL(8,2) NOT NULL);

You can fill this table with some dummy data:

INSERT INTO History VALUES ('Celko', '1999-06-28', 450.00); 
INSERT INTO History VALUES ('Curly', '1999-06-25', 400.00); 
INSERT INTO History VALUES ('Curly', '1999-06-26', 300.00); 
INSERT INTO History VALUES ('Curly', '1999-06-27', 400.00); 
INSERT INTO History VALUES ('Curly', '1999-06-28', 450.00); 
INSERT INTO History VALUES ('Larry', '1999-06-25', 400.00); 
INSERT INTO History VALUES ('Larry', '1999-06-26', 400.00); 
INSERT INTO History VALUES ('Larry', '1999-06-27', 450.00); 
INSERT INTO History VALUES ('Larry', '1999-06-28', 400.00); 
INSERT INTO History VALUES ('Moe', '1999-06-25', 400.00); 
INSERT INTO History VALUES ('Moe', '1999-06-26', 400.00); 
INSERT INTO History VALUES ('Moe', '1999-06-27', 400.00); 
INSERT INTO History VALUES ('Moe', '1999-06-28', 400.00); 

The problem is to determine which customers decreased their purchase amounts on their most recent order placed with us. We are trying to get an idea of when people became saturated with what we are selling.

Answer:

I assumed that it was okay to be holding steady and came up with this query.

SELECT H1.customer_id, ' dropped purchase amount on ', MAX(H1.order_date)
FROM History AS H1
WHERE H1.order_amount
< (SELECT H2.order_amount
FROM History AS H2
WHERE H1.customer_id = H2.customer_id
AND H2.order_date
= (SELECT MAX(order_date)
FROM History AS H3
WHERE H1.customer_id = H3.customer_id
AND H1.order_date > H3.order_date))
GROUP BY customer_id;

The nested subquery says that the order amount has dropped and then uses another subquery within itself to ask if the date is the most recent date on file for that customer.




Joe Celko is an Atlanta-based independent consultant. He is the author of Instant SQL Programming (Wrox Press, 1997). You can contact him at www.celko.com or 71062.1056@compuserve.com.





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address