CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions






October 30, 2003

The Alphabet Achievement

Will Unicode be as easy as saying your ABCs?

by Joe Celko

The next time you're searching for a word in the dictionary, think how much harder it would be if the entries weren't in alphabetical order. It's worth remembering that we didn't always file data that way.

A Long Time Coming

Although the Chinese had printing and movable type long before Johannes Gutenberg introduced his press in Europe around 1450, they didn't have alphabetical order, because their script has no alphabet to order by. Their type cases were arranged by topics (characters related to "animals," characters related to "heaven," and so forth), using whatever system the printer came up with.

Jorge Luis Borges claims in his essay, "The Analytical Language of John Wilkins," that a Chinese encyclopedia entitled Celestial Emporium of Benevolent Knowledge divides animals into the following categories: "(a) those that belong to the Emperor, (b) embalmed ones, (c) those that are trained, (d) suckling pigs, (e) mermaids, (f) fabulous ones, (g) stray dogs, (h) those that are included in this classification, (i) those that tremble as if they were mad, (j) innumerable ones, (k) those drawn with a very fine camel's hair brush, (l) others, (m) those that have just broken a flower vase, (n) those that resemble flies from a distance."

Now, Borges was famous for his literary fabrications, but you get an idea of how difficult it would be to find information quickly if everyone arbitrarily made up their own (potentially illogical) categories.

Western librarians were alphabetizing some things in the third century B.C., but the concept of alphabetizing wasn't important until the printing press became popular. Suddenly, there was more data to file, and people started to think in terms of the pieces of type that made up a word instead of the handwritten word itself.

Agreeing on the Rules

A Table Alphabetical by Robert Cawdrey (1604) set out rules for alphabetization. Considering how long we've had an alphabet, these rules were a long time coming.

But even alphabetization rules can be a subject of debate. Different languages can alphabetize in different ways. The Spanish Academy (the equivalent of the French Academy) finally decided that 'Ll' and 'Ch' each consist of two letters. Before this decision, 'Ll' was treated as a single letter after 'L', and 'Ch' was treated as a single letter after 'C'. Even within the same language there can be different collation systems. German has at least three collations: German dictionary order, Austrian dictionary order, and IBM collation order.
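To make the Spanish example concrete, here is a minimal Python sketch of the two orderings. The function name traditional_spanish_key and the word list are hypothetical, chosen only for illustration, and the sketch ignores accents and 'ñ', which a real collation would also have to handle.

```python
# A simplified sketch, not a full collation algorithm.
# traditional_spanish_key is a hypothetical helper name.

def traditional_spanish_key(word):
    """Build a sort key that treats 'ch' and 'll' as single letters
    placed right after 'c' and 'l' (the older Spanish rule)."""
    key = []
    w = word.lower()
    i = 0
    while i < len(w):
        pair = w[i:i + 2]
        if pair == "ch":
            key.append(("c", 1))  # ranks after every plain 'c', before 'd'
            i += 2
        elif pair == "ll":
            key.append(("l", 1))  # ranks after every plain 'l', before 'm'
            i += 2
        else:
            key.append((w[i], 0))
            i += 1
    return key


words = ["cuna", "chico", "lobo", "llama", "dama", "luz"]

# Two-letter rule: plain letter-by-letter comparison.
modern = sorted(words)

# Older rule: 'ch' and 'll' are letters in their own right.
traditional = sorted(words, key=traditional_spanish_key)

print(modern)       # ['chico', 'cuna', 'dama', 'llama', 'lobo', 'luz']
print(traditional)  # ['cuna', 'chico', 'dama', 'lobo', 'luz', 'llama']
```

The same six words land in a different order under each rule, which is why a filing system has to state which collation it uses.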

The Next Evolution

But just as the printing press changed the way we thought about organizing words, computers are changing the way we think about the alphabet. The Unicode standard is the next big jump for the alphabet, and I'm not convinced that its importance is fully appreciated. Computers work with binary numbers, not symbols. This means that you have to map a number to a symbol to use it in a computer. Unicode is the first step in exchanging data across language groups via computers because it provides a standard one-to-one mapping between numbers and symbols.
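The number-to-symbol mapping is easy to see from a programming language that exposes it. The following is a small Python sketch using the built-in ord and chr functions; the helper name code_point_label is hypothetical, and the "U+" notation is the usual way of writing the number assigned to a character.

```python
# A minimal sketch of the one-to-one mapping between numbers and symbols.
# code_point_label is a hypothetical helper name.

def code_point_label(symbol):
    """Return the number assigned to a single character, written as U+XXXX."""
    return f"U+{ord(symbol):04X}"


# Latin capital A, Latin small e with acute, and a Chinese character,
# written as escapes so the example does not depend on file encoding.
samples = ["A", "\u00e9", "\u4e2d"]

for symbol in samples:
    number = ord(symbol)           # symbol -> number
    assert chr(number) == symbol   # number -> symbol, the same one back
    print(code_point_label(symbol), number)
```

Because the mapping runs both ways without ambiguity, two programs that agree on it can exchange text regardless of the language it is written in.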




The original goal was to use a single 16-bit code to map slightly more than 65,000 characters, sufficient for most of the characters used in the world's major written languages. Now, the current Unicode standard (and ISO/IEC 10646) supports three encoding forms (8, 16, and 32 bits, known as UTF-8, UTF-16, and UTF-32) that use a common repertoire of characters but allow for encoding as many as a million more characters. This pretty well covers all the known character encoding sets, all historic scripts of the world, and the common notational systems, such as math.
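The arithmetic behind these figures, and the way one character takes a different number of bytes in each encoding form, can be checked with a short Python sketch. The function name encoded_sizes is hypothetical; the codec names are the ones in Python's standard library, with the big-endian variants chosen so that no byte-order mark is added to the counts.

```python
# A rough sketch of the three encoding forms, under the assumptions above.
# encoded_sizes is a hypothetical helper name.

SIXTEEN_BIT_CODES = 2 ** 16      # 65,536: the original single 16-bit design
CODE_SPACE = 0x10FFFF + 1        # 1,114,112: the full range of code points

def encoded_sizes(ch):
    """Return the number of bytes one character needs in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }


print(SIXTEEN_BIT_CODES, CODE_SPACE)

# A Latin letter, a Chinese character, and a musical symbol that lies
# beyond the first 65,536 code points.
for ch in ["A", "\u4e2d", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

All three forms describe the same repertoire of characters; they differ only in how many bytes they spend on each one.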

Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, and others, and is the official way to implement ISO/IEC 10646. The Unicode Standard, Version 4.0 (Addison-Wesley, 2003) has just been released, and you can preorder it at www.unicode.org. You might want to get a copy; it could be the biggest thing since the alphabet.


Joe Celko [celko@northface.edu] is vice president of RDBMS at North Face Learning in Salt Lake City and author of five books on SQL.






