Keeping our warehouses for 50 years
Digital Preservation
One of the oaths we take as data warehouse managers is that we will preserve history. In many ways we have become the archivists of corporate information. We don't usually promise to keep all old history online, but we often claim that we will store it somewhere for safekeeping. Of course, storing it for safekeeping means that we will be able to get the old history back out again when someone is interested in looking at it.
Most of us data warehouse managers have been so busy bringing up data warehouses, avoiding stovepipe data marts, adapting to new database technologies, and adapting to the explosive demands of the Web, that we have relegated our archiving duties to backing up data on tapes and then forgetting about the tapes. Or maybe we are still appending data onto our original fact tables and we haven't really faced what to do with old data yet.
But across the computer industry there is a growing awareness that people are not yet preserving digital data, and that it is a serious and difficult problem.
Does a Warehouse Even Need to Keep Old Data?
Most data warehouse and Webhouse managers are driven by the urgent needs of departments such as marketing, which have very tactical concerns. Few marketing departments care about data that is more than three years old because our products and markets are changing so quickly. It is tempting to think only of these marketing clients and to discard data that no longer meets their needs.
But with a little reflection, we realize we are sitting on a lot of other data in our warehouses that we absolutely must preserve. This data includes:
Detailed sales records, for legal, financial, and tax purposes
Trended survey data in which long-term tracking has strategic value
All records required for government regulatory or compliance tracking
Medical records that in some cases must be preserved for 100 years!
Clinical trials and experimental results that may support patent claims
Documentation of toxic waste disposal, fuel deliveries, and safety inspections
All other data that may have historical value to someone, sometime.
Faced with this list, we have to admit that we need a plan for retrieving these kinds of data five, 10, or maybe even 50 years in the future. It begins to dawn on us that maybe this will be a challenge. How long do mag tapes last, anyway? Are CD-ROMs or DVDs the answer? Will we be able to read the formats in the future? I have some eight-inch floppies from just a few years ago that are absolutely unrecoverable and worthless. All of a sudden, this is sounding like a difficult project.
Media, Formats, Software, and Hardware
As we begin to think about really long-term preservation of digital data, our world begins to fall apart. Let's start with the storage media. There is considerable disagreement about the practical longevity of physical media such as mag tapes and CD-ROM disks, with serious estimates ranging from as few as five years to many decades. But of course, our media may not be of archival quality, and they may not be stored or handled in an optimum way. We must counterbalance the optimism of vendors and certain experts with the pragmatic admission that most of the tapes and physical media we have today that are more than 10 years old are of doubtful integrity.
Any debates about the physical viability of the media, however, pale when we compare them to the debates about formats, software, and hardware. All data objects are encoded on physical media in the format-of-the-day. Everything from the density of the bits on the media, to the arrangement of directories, and finally to the higher-level application-specific encoding of the data, is a stack of cards waiting to fall. Taking my eight-inch floppies as examples, what would it take to read the embedded data? Well, it would take a hardware configuration sporting a working eight-inch floppy drive, the software drivers for an eight-inch drive, and the application that originally wrote the data to the file.
Obsolete Formats and Archaic Formats
In the lexicon of digital preservationists, an obsolete format is a format that is no longer actively supported, but for which working hardware and software still exist that can read and display the content of the data in its original form. An archaic format is one that has passed on to the nether-realm. My eight-inch floppies are, as far as I am concerned, archaic formats. I will never recover their data. The Minoan writing system known as Linear A is also an archaic format whose meaning has apparently been lost forever. My floppies may be only slightly easier to decipher than Linear A.
Hard Copy, Standards, and Museums
A number of proposals have been made to work around the format difficulties of recovering old data. One simple proposal is to reduce everything to hard copy. In other words, print all your data onto paper. Surely this action will sidestep all the issues of data formats, software, and hardware. While for tiny amounts of data this approach has a certain appeal, and is better than losing the data, it has a number of fatal flaws. In today's world, copying to paper doesn't scale. A gigabyte printed out as ASCII characters would take 250,000 printed pages at 4,000 characters per page. A terabyte would require 250,000,000 pages! Remember that we can't cheat and put the paper on a CD-ROM or a mag tape because that would just reintroduce the digital format problem. And finally, we would be seriously compromising the data structures, the user interfaces, and the behavior of the systems originally meant to present and interpret the data. In many cases, a paper backup would destroy the data's usability.
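The page counts above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a decimal gigabyte (10^9 bytes) and one printed ASCII character per byte; the function name `pages_needed` is made up for illustration.

```python
CHARS_PER_PAGE = 4_000  # characters per printed page, the figure used in the text


def pages_needed(num_bytes: int, chars_per_page: int = CHARS_PER_PAGE) -> int:
    """Pages required to print num_bytes at one ASCII character per byte.

    Uses integer ceiling division so a partly filled last page still counts.
    """
    return -(-num_bytes // chars_per_page)


GIGABYTE = 10**9   # assumption: decimal units, not 2**30
TERABYTE = 10**12  # assumption: decimal units, not 2**40

print(pages_needed(GIGABYTE))  # 250000
print(pages_needed(TERABYTE))  # 250000000
```

With binary units (2^30 and 2^40 bytes) the counts come out slightly higher, which only strengthens the point that paper does not scale.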
A second proposal is to establish standards for the representation and storage of data that would guarantee that everything can be represented in permanently readable formats. In the data warehouse world, the only data that remotely approaches such a standard is relational data stored in an ANSI-standard format. But almost all implementations of relational databases use significant extensions of the data types, SQL syntax, and surrounding metadata to provide needed functionality. By the time we have dumped a database with all its applications and metadata onto a mag tape, even if it has come from Oracle or DB2, we can't be very confident that we will be able to read and use such data in 30 or 50 years. Other data outside of the narrow ANSI-standard RDBMS definition is hopelessly fragmented. There is no visible market segment, for instance, that coalesces all possible OLAP data storage mechanisms into a single physical standard that guarantees lossless transfer to and from the standard format.
A final, somewhat nostalgic proposal is to support museums, where ancient versions of hardware, operating systems, and applications software would be lovingly preserved so that people could read old data. This proposal at least gets to the heart of the issue in recognizing that the old software must really be present in order to interpret the old data. But the museum idea doesn't scale and doesn't hold up to close scrutiny. How are we going to keep a Digital Data Whack 9000 working for 50 years? What happens when the last one dies? And if the person walking in with the old data has moved the data to a modern medium like a DVD-ROM, how would a working Digital Data Whack 9000 interface to the DVD? Is someone going to write modern drivers for ancient pathetic machines? Maybe it has an eight-bit bus.
Refreshing, Migrating, Emulating, and Encapsulating
A number of experts have suggested that an IT organization should periodically refresh the data storage by moving the data physically from old media onto new media. A more aggressive version of refreshing is migrating, where the data is not only physically transferred but is reformatted so that contemporary applications can read it. Refreshing and migrating do indeed solve some of the short-term preservation crises because if you successfully refresh and migrate, you are free from the problems of old media and old formats. But taking a longer view, these approaches have at least two very serious problems. First, migrating is a labor-intensive, custom task that has little leverage from job to job and may involve the loss of original functionality. A second and more serious problem is that migrating cannot handle major paradigm shifts. We all expect to migrate from version 8 of an RDBMS to version 9, but what happens if heteroschedastic database systems (HDSs) take over the world? The fact that nobody, including me, knows what an HDS is, illustrates my point. After all, we didn't migrate very many databases when the paradigm shifted from network to relational databases, did we?
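To make the refreshing step concrete, here is a minimal sketch of copying one archive file from old media to new media and confirming that the bits arrived intact. It assumes both media are mounted as ordinary file system paths; the function names `sha256_of` and `refresh` are hypothetical, and the choice of SHA-256 is an assumption, not something the experts cited here prescribe.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy one archive file to new media and verify the copy bit for bit.

    Returns the verified digest so it can be recorded for the next refresh.
    """
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"refresh of {source} failed verification")
    return after
```

A call such as `refresh(Path("/old_media/facts.dat"), Path("/new_media/facts.dat"))` (both paths hypothetical) preserves the bits only. It does nothing about whether any future application can interpret them, which is exactly the gap that migrating tries, at greater cost, to close.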
Well, we have managed to paint a pretty bleak picture. Given all these facts, what hope do the experts have for long-term digital preservation? If you are interested in this topic and a serious architecture for preserving your digital data warehouse archives for the next 50 years, you should read Jeff Rothenberg's 1998 treatise Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation (see Resources), a report to the Council on Library and Information Resources (CLIR). It is very well written and I recommend it very highly.
As a hint of where Rothenberg goes with this topic, he recommends the development of emulation systems that, although they run on modern hardware and software, nevertheless faithfully emulate old hardware. He chooses the hardware level for emulation because hardware emulation is a proven technique for recreating old systems, even ones as gnarly as electronic games. He also describes the need to encapsulate the old data sets along with the metadata we need in order to interpret the data set, as well as the overall specifications for the emulation itself. By keeping all these together in one encapsulated package, the data will travel along into the future with everything that we need to play it back out again in 50 years. All we need to do is interpret the emulation specifications on our contemporary hardware.
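The encapsulation idea can be illustrated with a small sketch that bundles a data set, its interpretive metadata, and an emulator specification into a single package. This is not Rothenberg's design: the tar container, the member names, and the functions `encapsulate` and `list_contents` are all assumptions made for illustration, and the container format is itself subject to the same obsolescence concerns discussed above.

```python
import io
import json
import tarfile


def encapsulate(data: bytes, metadata: dict, emulator_spec: str) -> bytes:
    """Bundle a data set, its metadata, and an emulator specification.

    Returns the bytes of an uncompressed tar archive holding all three parts.
    The member names below are hypothetical, not part of any standard.
    """
    members = {
        "data/dataset.bin": data,
        "metadata/description.json": json.dumps(
            metadata, indent=2, sort_keys=True
        ).encode("utf-8"),
        "emulator/specification.txt": emulator_spec.encode("utf-8"),
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_contents(package: bytes) -> list[str]:
    """Return the member names of an encapsulated package, in stored order."""
    with tarfile.open(fileobj=io.BytesIO(package), mode="r") as archive:
        return archive.getnames()
```

The design point is only that the three parts never travel separately: whoever opens the package in the future finds the data, the description needed to interpret it, and the specification of the machine it expects, all in one place.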
The library world is deeply committed to solving the digital preservation problem. Look up "embrittled documents" on the Google search engine. We need to study their techniques and adapt them to our warehouse and Webhouse needs. In a forthcoming article, I'll report on technologies and companies that we may find helpful.
Ralph Kimball, Ph.D., co-inventor of the Star Workstation at Xerox and founder of Red Brick Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at www.ralphkimball.com.
RESOURCES
Jeff Rothenberg's CLIR report: www.clir.org/pubs/reports/rothenberg