The final word on Codd's first two relational papersThe Birth of the Relational Model, Part 3 of 3
The 1970 Paper
Compared to the 1969 paper, there are two major new sections, 1.2 and 1.4. In addition, the 1969 section, "Derivability, Redundancy, and Consistency" has been split into two (2.2 and 2.3), and a summary has been added. Note the shift in focus, as evidenced by the new title (and abstract) and the new section 1.2; the 1969 paper stressed the notions of data redundancy and related matters, but the new paper concentrates on the relational model per se, especially on the usefulness of that model in providing data independence (meaning, primarily, physical data independence). It also discusses the other uses for, and advantages of, relations as given in the 1969 paper. The most significant change vis-a-vis the 1969 paper is the suggestion that all relations should be normalized. Data Dependencies in Present SystemsAs just noted, the 1970 paper concerns itself with the question of data independence much more than its 1969 predecessor. In the abstract, Codd says: "Users of large data banks must be protected from having to know how the data is organized in the machine. Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed;" in other words, we want physical data independence. He continues: "[Such activities and programs should also remain unaffected] even when some aspects of the external representation are changed;" in other words, we want logical data independence, too. Codd considers various examples of the lack of data independence in database systems at the time; he also discusses ordering, indexing, and access path dependence. He subsequently presents a simple example illustrating such dependencies in systems that provide "tree-structured or slightly more general network models of the data" (in other words, hierarchic systems such as IMS and network systems such as IDS). Note: IDS was the forerunner of the better known system IDMS; Codd was writing before IDMS appeared on the scene. It's worth reflecting on how far we've come! From the user's point of view, the solution to the ordering dependence problem is (in SQL terms) the ORDER BY clause; users should not be limited to predefined orderings, physical or otherwise, but should be able to request any ordering they like, dynamically. And if users request an ordering not directly represented in the stored version of the data, then the system should be able to sort or index the data dynamically. Similarly, the solution to the indexing dependence problem is to exclude references to indexes from applications. Instead, applications must request access to data through any preferred method, and a system component -- what we would now call the optimizer -- takes responsibility for deciding whether to use indexes in response to such access requests. Separate CREATE and DROP INDEX statements are available for use at any time independent of the existence of applications and their logical data requirements. Likewise, the solution to the access path dependence problem is to exclude all access paths from the user's view of the data (as is done in relational systems but not in systems such as IMS and IDS). Modern readers might be puzzled by Codd's distinction between indexing and access path dependence; after all, an index is just a special case of an access path. What Codd had in mind, however, was the following: The data the user saw in a hierarchic or network system included certain constructs (sometimes called links) that - no matter how much their defenders might argue -- were really just log access paths. These paths invariably also served as physical access paths. If such a "logical" access path were removed, for example, certain applications would cease to work. Normal FormAs noted, the most significant change in the 1970 paper is the idea that you should always normalize; they should be defined only on "domains whose elements are atomic (nondecomposable) values." "Normalized" thus means "no repeating groups," or what now call first- normal form, 1NF. (The higher normal forms -- 2NF, 3NF, and so on -- were defined later.) Codd insists on normalization because a normalized relation "can be represented in storage by a two-dimensional column-homogeneous array ... [whereas] some more complicated data structure is necessary for a relation [that is unnormalized]." As his first argument in favor of normalization, it's curious that Codd focuses on simple representation in storage. Perhaps he actually means simple representation as far as the user is concerned. His use of the term "array" is also a little strange, given that you access array elements by means of positional addressing, although relation elements (n-tuples) are most certainly not. Codd claims the simplicity of the array representation is an advantage "for communication of bulk data between systems which use widely different representations of the data. The communication form would be a suitably compressed version of the array representation and (a) would be devoid of pointers, (b) would avoid all dependence on hash addressing schemes, (c) would contain no indexes or ordering fields" (slightly paraphrased). The second sentence seems to be Codd's first explicit mention of the fact that the relational model expressly excludes pointers -- a fact that subsequently became the subject of much debate, as I'm sure you know. "The adoption of a relational model of data ... permits the development of a universal data sublanguage based on an applied predicate calculus. A first-order predicate calculus suffices if the collection of relations is in normal form." This is an important point! -- and it marks a major departure from the 1969 paper, which discussed second-order predicate calculus, not first. For the benefit of readers who might not be familiar with these concepts, let me just say that (in relational terms) "first order" means we can quantify over the rows of a relation, while "second order" can quantify over relations per se. First-order logic lets us formulate queries such as "Does supplier S1 exist in the suppliers relation?" Second-order logic lets us formulate queries such as "Does supplier S1 exist in any relation?" I'd like to elaborate briefly on this question of normalization. I do agree with Codd that it's desirable to stay in the realm of first-order logic, if possible. At the same time, however, I reject the idea of "atomic values," at least in the sense that there might be such a thing as absolute atomicity. In The Third Manifesto,3 we let domains contain values of arbitrary complexity. (They can even be relations.) Nevertheless, we stay within the confines of first-order logic. Further discussion of this issue would take us too far afield here; if you want to know more, you can find the details in another paper by Darwen.4 In the 1970 paper, Codd gives a simple example showing what's involved in normalizing an unnormalized relation. As already noted, "normalizing" here means getting to first normal form, and the example is essentially straightforward. However, the paper also includes the following tantalizing remarks: "Further operations of a normalizing kind are possible. These are not discussed in this paper." Another hint of interesting developments to come! Incidentally, Codd also remarks that "he knows of no application that would involve a primary key with a component defined on a domain whose elements are not atomic values" (considerably paraphrased). In fact, such applications do exist; one is described in Darwen's "Relation-Value Attributes.4 However, just because applications exist doesn't mean you can't normalize existing relations. (The issue, again, is what "atomicity" means.) Again, further discussion would take us too far afield here. Logic, Performance, and Other IssuesI'll bring this review of Codd's 1970 paper to a close with a few miscellaneous points that don't fit neatly into any of the preceding sections. Regarding using predicate logic and the quantifiers (EXISTS and FORALL) of that logic in a data access language, Codd says: "Because each relation in a practical data bank is a finite set at every instant of time, the existential and universal quantifiers can be expressed in terms of a function that counts the number of elements in any finite set." This remark seems to admit that we're really dealing just with propositional logic, rather than full predicate logic. The paper also includes the following brief remarks on performance: "[The] data system must provide a means of translating user requests ... into ... efficient ... actions on the current stored representation. For a high-level data language, this presents a challenging design problem. Nevertheless, it is a problem which must be solved. As more users obtain concurrent access to a large data bank, responsibility for providing efficient response and throughput shifts from the ... user to the data system." Prophetic words! Codd is pointing out the necessity for an optimizer component in the DBMS, and tacitly suggesting that there might be some interesting research issues in this area. Codd also discusses the connection trap. "A lack of understanding of [the semantics of the relational operators] has led several systems designers into what may be called the connection trap. [For example, suppose we have a nonrelational system in which] each supplier description is linked by pointers to the descriptions of each part supplied by that supplier, and each part description is similarly linked to the descriptions of each project which uses that part. A conclusion is now drawn which is, in general, erroneous: if all possible paths are followed from a given supplier via the [corresponding] parts ... to the projects using those parts, one will obtain a valid set of all projects supplied by that supplier. Such a conclusion is correct only in the very special case that the target relation between projects and suppliers is, in fact, the natural composition of the other two relations." We don't have to follow pointers to fall into the connection trap; unfortunately, we can make the same logical error in a purely relational system, too. Indeed, some writers have criticized relational systems on exactly this point. (See James Martin's "Semantic Disintegrity in Relational Operations.") 5 I hope it's obvious that such criticisms are invalid because they betray a sad lack of understanding of the relational model. Applying the 20/20 hindsight principle once again, I note that despite its title, the 1970 paper doesn't offer a succinct definition of the term relational model, nor of the term data model. This is strange because the paper introduced this latter concept, too. On the contrary, the 1970 paper implies that the relational model consists only of the structural aspects; in other words, you exclude the manipulative and integrity aspects. (The part of the paper called "Redundancy and Consistency" includes relational operators, but they're not included in the part called "Relational Model and Normal Form.") Furthermore, the paper talks about "a relational model ... for a data bank," unfortunately suggesting that the term "relational model" refers to an abstract view of the data in a specific database, instead of to an abstract view of data in general. Both of the foregoing misconceptions are still regrettably all too common in database literature.The idea that "the relational model is just structure" -- sometimes expressed in the form "the relational model is just flat files" -- represents a major misconception. Finally, a historical footnote: I recently came across a paper dating from 1966 (!) with the title "A Relational Model for Information Retrieval and the Processing of Linguistic Data.6 " That paper concerns natural language processing problems, and the "relational model" of the title is a formalism for representing certain English sentences. For example, the sentence "John loves Mary" might be represented by the expression "j L m." (The relations of the paper are all binary.) The ideas have certain points in common with Codd's relational model. (Naturally, because they're based on the same logical foundation.) However, I think it's fair to say that we have to give Codd credit for solely inventing the relational model of data as we've come to understand it. C. J. Date is an independent author, lecturer, researcher, and consultant, specializing in relational database systems. His most recent books are Foundation for Object/Relational Databases: The Third Manifesto (1998), coauthored with Hugh Darwen, and Relational Database Writings 1994-1997, both published by Addison-Wesley in 1998. Correspondence may be sent to him in care of Intelligent Enterprise, 411 Borel Ave. #100, San Mateo, CA 94402. REFERENCES
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
| |||||||||||||||||||||||||||||||























