|
OPTIMAL NORMAL FORMS
I noted previously that Codd's first paper on further normalization1 defined 2NF and 3NF in terms of collections of relations. For example: "A collection C of relations is in third normal form if... every relation R in C [is in third normal form as we now understand that term]." Now, this focus on collections is understandable, given that we often do seem to need a shorthand way of saying that (for instance) every relation in the database is in 3NF. The concept of optimal normal form addresses this issue, in part. (Although it also does more, as we'll see.) Codds paper "Further Normalization of the Data Base Relational Model"2 defines two "optimal normal forms," optimal 2NF and optimal 3NF; both apply to collections of relations rather than to individual relations per se.
* Optimal 2NF
A collection of relations is in optimal 2NF if (a) every relation in the collection is in 2NF and (b) the collection contains the fewest possible relations consistent with condition (a). In the case of the suppliers and parts database, for example, the usual collection of relations:
S { S#, SNAME, STATUS, CITY }
P { P#, PNAME, COLOR, WEIGHT, CITY }
SP { S#, P#, QTY }
is in optimal 2NF. However, if we were to replace (say) relation S by its projections on { S#, SNAME, CITY } and { S#, STATUS }, the resulting collection would no longer be optimal. Loosely speaking, therefore, "optimal 2NF" just means: Don't break relations down too far. (Note: There's no claim that optimal 2NF is unique in any sense; a given collection of relations might have several distinct optimal 2NF equivalents.)
* Optimal 3NF
Let C2 be an optimal 2NF collection of relations, and let C3 be a collection of 3NF relations derived from those in C2 by taking suitable projections. C3 is in optimal 3NF relative to C2 if (a) no relation in C3 has a pair of attributes A and C such that C is strictly transitively dependent on A in some relation in C2 and (b) C3 contains the fewest possible relations consistent with condition (a). (Perhaps I should remind you what it means for C to be strictly transitively dependent on A. Given a relation R, C is strictly transitively dependent on A in R if there exists some B in R such that A - B and B - C and C -/ B and B -/ A.)
For example, suppose the suppliers relation S { S#, STATUS, CITY } satisfies the functional dependency CITY - STATUS. (I ignore attribute SNAME for simplicity.) This relation is in 2NF but not 3NF. Now consider the collection of relations SS { S#, STATUS } SC { S#, CITY }. This collection contains 3NF relations only; in fact, it contains the fewest possible number of such relations. But it's not in optimal 3NF: It violates condition (a) of the definition, because relation SS contains two attributes, STATUS and S#, such that STATUS is strictly transitively dependent on S# (via CITY) in the original 2NF relation S.
What Codd is getting at here with condition (a) is what later came to be known as dependency preservation.3 The dependency CITY - STATUS has been lost in the decomposition into projections SS and SC. To put it another way, the trouble with the two projections SS and SC is that they're not independent.4 In contrast, consider the following alternative decomposition:
SC { S#, CITY }
CS { CITY, STATUS }
Again, both relations are in 3NF. This time, however, (a) the dependency CITY - STATUS has been preserved; (b) the two projections SC and CS are independent; and (c) the collection of relations is in optimal 3NF.(Note: Again, there's no claim that optimal 3NF is unique in any sense; a given database might have several distinct optimal 3NF equivalents.)
PRIME ATTRIBUTES REVISITED
Now let's get back to the question of prime vs. nonprime attributes. Just to remind you, an attribute is said to be prime if it participates in at least one candidate key of the relation in which it appears. Let me also remind you of Codd's original definition of 3NF: Relation R is in 3NF if it's in 2NF and every nonprime attribute is nontransitively dependent on each candidate key of R. Now let's take a look at an example.3
Suppose every supplier has both a unique supplier name and a unique supplier number, and consider the following relation, SSP (an extended version of the usual shipments relation):
SSP { S#, SNAME, P#, QTY }
CANDIDATE KEY { S#, P# }
CANDIDATE KEY { SNAME, P# }
Now, relation SSP obviously suffers from certain redundancies (and is accordingly subject to certain update anomalies). To be specific, the fact that a given supplier has a certain supplier number and a certain supplier name appears many times in this relation, in general. Yet the relation is in 3NF according to Codd's definition! The problem is that Codd's definition requires only nonprime attributes to be fully dependent on every candidate key -- and the redundancies in SSP are caused by the fact that certain prime attributes aren't fully dependent on certain candidate keys. (In terms of the definition shown here, we have S# - SNAME and SNAME - S#; hence S# isn't fully dependent on the second candidate key and SNAME isn't fully dependent on the first.) So Codd's definition of 3NF still permitted certain redundancies (and hence certain update anomalies) to occur. In 1974, therefore, he presented an "improved definition of 3NF" (developed with Raymond Boyce) that took care of this problem.5 However, that new definition was strictly stronger than the old one, in the sense that every relation that satisfies the new definition also satisfies the old, but some relations satisfy the old one and not the new. Thus, it doesn't really make sense to say that the new definition defines "3NF" per se; instead, we now say it defines Boyce/Codd normal form (BCNF). (I discussed BCNF in the December 15, 1998 issue of Intelligent Enterprise.)
Note: Actually, a definition of "third" normal form that was equivalent to the BCNF definition was first given by Ian Heath in 1971;6 "Heath normal form" might thus have been a more fitting name. Incidentally, this slight confusion over 3NF and BCNF was cleared up very elegantly in a paper by Carlo Zaniolo.7 First, Zaniolo gives this definition of 3NF:
* Let R be a relation, let X be any set of attributes of R, and let A be any single attribute of R. Then R is in 3NF if and only if, for every functional dependency X - A in R, at least one of the following is true:
1. X contains A
2. X contains a candidate key of R
3. A is contained in a candidate key of R.
Explanation: Possibility 1 takes care of trivial dependencies (trivial dependencies are always satisfied and can never be eliminated). Possibility 2 takes care of dependencies implied by superkeys (as we saw last time, if X is a superkey, then X - A is true for all attributes A; these dependencies too are always satisfied and can never be eliminated). Possibility 3 corresponds to that part of Codd's definition that allows prime attributes not to be "fully" (irreducibly) dependent on each candidate key.
Next, Zaniolo's definition of BCNF is obtained from his 3NF definition by simply dropping possibility 3. Among other things, this definition has the virtue that it shows immediately that BCNF is strictly stronger than 3NF. In fact, of course, when people talk informally of "3NF," it's usually BCNF that they really have in mind.
THE SJT EXAMPLE
There's another example I want to mention briefly (this one is taken from reference 3 also). We're given relation SJT { S, J, T }, where S, J, and T stand for student, subject, and teacher, respectively; the meaning of row (s,j,t) is that student s is taught subject j by teacher t. We're also told that the following functional dependencies hold:
{ S, J } - T(For each subject, each student of that subject is taught by only one teacher.)
T - J
(Each teacher teaches only one subject.)
Like relation SSP from the previous section, relation SJT is in 3NF but not BCNF, and so we should probably break it down. (It certainly involves redundancy in its present form, and it's certainly subject to update anomalies of various kinds). As I pointed out earlier,6 however, breaking it down necessarily violates the objective (mentioned briefly in the section on optimal normal forms) of dependency preservation! It follows that the two objectives of avoiding update anomalies and preserving dependencies are sometimes in conflict; that is, it isn't always possible to achieve both objectives simultaneously.
The reason I mention this example is that to his credit Codd was already aware of this problem in 1971! He expresses it differently, though, saying that "[It] is not always possible to remove all transitive dependencies without losing information." In terms of the SJT example, he would say that J is transitively dependent on the candidate key { S, J }. Of course, J is also trivially dependent on that candidate key, so it's a little difficult to see what he's really getting at here, but what he means is that breaking SJT down into its two BCNF projections on { S, T } and { T, J } which is what the further normalization discipline would suggest loses the functional dependency { S, J } - T.
CONCLUDING REMARKS
If we accept the rule that no tuple can have an undefined value for any primary key component, then, a database in which all relations are in 3NF is capable of representing certain information that an otherwise equivalent database in which all relations are only in 2NF isn't.2 An analogous remark holds if we replace 2NF by 1NF and 3NF by 2NF, of course. These facts can be seen as additional advantages of 3NF over 2NF and 2NF over 1NF.
Furthermore, Codd adds: "It is also conjectured that physical records in optimal 3NF will prove to be highly economical in [storage] space" (my italics). "In some cases, a further saving in space can be obtained by factoring."
Codds "Further Normalization" paper concludes with a brief discussion of growth and restructuring in the database, including in particular a look at the question of "attribute migration": "It is this author's thesis that, by casting [all relations in] the data base in third normal form at the earliest possible time... an installation will reduce the incidence of attribute migration to a minimum, and consequently have less trouble keeping its application programs in a viable state."
REFERENCES
1. Codd, E. F. "The Second and Third Normal Forms for the Relational Model." IBM technical memo (October 6th, 1970).
2. Codd. E. F. "Further Normalization of the Data Base Relational Model" (presented at Courant Computer Science Symposia Series 6, "Data Base Systems," New York City, N.Y., May 24th-25th, 1971). IBM Research Report RJ909 (August 31st, 1971). Republished in Randall J. Rustin (ed.), Data Base Systems: Courant Computer Science Symposia Series 6. Englewood Cliffs, N.J.: Prentice-Hall (1972).
3. Date, C. J. An Introduction to Database Systems (6th edition). Reading, Mass.: Addison-Wesley (1995). Note: The third edition of this book (mentioned in the body of this article) has a copyright date of 1981.
4. Rissanen, Jorma. "Independent Components of Relations." ACM Transactions on Database Systems 2(4), December 1977.
5. Codd, E. F. "Recent Investigations into Relational Data Base Systems." IBM Research Report RJ1385 (April 23rd, 1974). Republished in Proc. 1974 Congress (Stockholm, Sweden, 1974). New York, N.Y.: North-Holland (1974).
6. Heath, J. "Unacceptable File Operations in a Relational Database." Proc. 1971 ACM SIGFIDET Workshop on Data Description, Access, and Control, San Diego, Calif. (November 11th-12th, 1971).
7. Zaniolo, Carlo. "A New Normal Form for the Design of Relational Database Schemata." ACM Transactions on Database Systems 7(3), September 1982.
C. J. Date is an independent author, lecturer, researcher, and consultant, specializing in relational database systems. His most recent books are Foundation for Object/Relational Databases: The Third Manifesto, coauthored with Hugh Darwen, and Relational Database Writings 1994-1997, both published by Addison-Wesley in 1998. Correspondence may be sent to him in care of Intelligent Enterprise, 411 Borel Ave. #100, San Mateo, CA 94402.
|
|
|
| |||||||||||||||||||||||||||||||























