Solving The Mystery

The technology behind the Human Genome Project may do more than unravel the secret of life. B2B exchanges will also benefit from these tools.

By Hank Simon

Why should an IT executive care about the announcement by Dr. J. Craig Venter, president of Celera Genomics, and Dr. Francis Collins, head of the Human Genome Project, that the initial DNA sequencing of the human genome is complete? The completed DNA sequences can lead to the discovery of gene functions and possible medical breakthroughs, but how can they affect companies in other industries?

The mapping of the human genome is all about large computers, massive databases, overwhelming data volumes, missing information, and methods to get a handle on everything. Sounds like a day in the life of the traditional CIO in most businesses.

The Human Genome Project provides an interesting perspective on data, albeit from a completely different industry. For example, Celera uses the largest civilian supercomputer system created to date to unravel the DNA sequences and builds the genetic databases that describe the human genome. The machines are so big that the electricity for running them costs $1 million per year.

In the pharmaceutical industry, complex biomedical data requires petabytes of storage (a petabyte is approximately 1,000TB). Organizing and analyzing that volume of data requires far greater methods and resources than today's infrastructure provides. For example, Glaxo Wellcome estimates a need to expand its computer infrastructure by a factor of 100 to handle its ballooning internal data.

Other industries, like B2B exchanges, will soon run into similar volumes. For example, Covisint, the automobile B2B exchange, will deal with 30,000 vendors and suppliers moving $240B worth of products and services per year. Organizing all that data and leveraging information for greater efficiencies will be a significant challenge.
The techniques that the biotechnology industry is exploring may suggest some best practices for the rest of us and help us to organize and locate "the knowledge we have lost in information," to quote T.S. Eliot's "The Rock."

We've Only Just Begun

The recently announced acquisition of the completed sequence is only the beginning. Venter told the House Subcommittee on Energy and Environment that obtaining the complete human DNA sequence does not mark the end of the project. From a conventional IT perspective, the DNA sequences represent only the initial inventory of business data. The next step after data collection is the creation of tools with which to store and analyze it. The DNA sequence data provides a starting point from which the real work can begin.

Celera sells the sequence data, letting buyers have early access before the data goes public. According to Venter, Celera then releases the information to the public domain. But if Celera sells the data and then makes it available without charge, how can it make a profit?

Celera's profits will come from providing database search and discovery services. As valuable as the data is, the metadata is even more valuable because it organizes the data in a meaningful way with specific parameters. By filtering and summarizing the data, knowledge workers accumulate knowledge and insight they can use to make decisions and take action.

Celera is building the IT expertise, metadata, and tools necessary to extract valuable biological knowledge from the DNA sequence data. These tools will aid researchers in discovering new genetic functions and developing a variety of databases. Enhanced access to Celera's databases will enable paying customers to use intelligent search engines that let them more readily discover trends in the vast database.

Other companies have also recognized the value of functional genomics data. For example, the world-famous SAS Institute spun off a new business-to-business (B2B) company called IBiomatics.
(For more information about B2B, see "If You Build It, They Will Come," April 28, 2000.) SAS conceived IBiomatics to get the jump on the functional genomics competition and help biotech companies process huge amounts of genetic information to develop new therapies. IBiomatics provides common biomedical data frameworks and secure B2B Internet portals for biomedical research. It will also provide Internet applications, data processing, and data warehousing functions that will make functional genomics data more readily accessible to the biotech and biomedical industries.

According to Lee Evans, president of IBiomatics, the biomedical products industry, poised on the edge of a genetic revolution, is experiencing a data crisis. Companies collect data globally, often from far-flung research centers that have few industry or government data standardization measures in place. "Biomedical data is inherently complex because it describes interrelated biological and medical information," Evans says. While business data is usually more organized, the relationships within biological data are still not well organized or clearly understood. "Surprisingly, there is little industry data standardization to enable efficient systems for biomedical research. What is needed is a biomedical [data] warehousing framework that can intelligently deploy standard data models. Development of data standards is not enough, of course. The only way to achieve the efficiencies required is through development of secure [B2B] Internet portals based on intelligent data warehouses."

The biotechnology industry can use intelligent data warehouse techniques and extensible markup language (XML) technology to manage hundreds of thousands of genes and billions of base pairs, the building blocks of DNA.
Knowledge workers have already applied intelligent tools such as neural networks, expert systems, and variable pattern matching to data warehouse technology to arrive at knowledge discovery for biotechnology and genetic engineering applications.

Knowledge Discovery

Knowledge discovery, also known as data discovery, uses data warehouse technology as a starting point for finding new relationships and nonobvious trends in the data. Knowledge discovery, as a kind of data mining, can use artificial intelligence, machine learning, and neural network techniques to discover new relationships by sifting through an ocean of data to distill fresh answers to unstated questions.

In many industries, large amounts of quickly changing data may make it difficult to get a handle on important business trends. For example, by starting with general trends, the retail industry has been able to use knowledge discovery to uncover details, such as how purchase patterns relate to demographics, time of day, and amount spent. Other industries have not caught up in their level of knowledge discovery sophistication and are missing opportunities to leverage current customers.

One of the differences between business-based data warehouses and genetics-based data warehouses, however, is that business data tends to be better defined and more complete than genetics data. In fact, the scarcity of completed genetic information is the reason that knowledge discovery in biotechnology is a growing field.

Molecular biologists recognize the opportunity to extract valuable knowledge from the existing genetics databases. They are now trying to create complete genomic databases through projects like the Human Genome Project and other less publicized projects. The first goal of these projects - to build a complete database of the DNA sequence of a human, animal, or plant - has been completed at a coarse level of resolution.
However, these efforts are complicated by the fact that there is no single DNA molecule for a given species. Your DNA differs from mine by a small percentage, which is why you are different from me. With hundreds of millions of base pairs in a DNA molecule, however, a small percentage adds up to a large absolute number. Therefore, your DNA may differ from mine by a million base pairs. So whose DNA do they store in the Human Genome Project database? For the first attempt, genetic scientists build a genetic map of one individual. Later, they will fill in the gaps and differences among a diversity of individuals.

Because the information is incomplete, the opportunities lie in discovering new knowledge about how genes work - functional genomics - not in the database itself. The database stores only the data of what the genes are, not the knowledge of how they work. The human genome database is similar to a directory to the homes of the movie stars that includes the addresses and phone numbers, but not the names. The names, like gene functions, are what make the directory and the database useful.
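
Two points in the passage above can be made concrete with a minimal Python sketch: how a small percentage becomes a large absolute number, and how a sequence database resembles a directory with addresses but no names. Every name and value in it is a hypothetical illustration. The sequence length, the difference rate, the locus labels, and the annotation text are assumptions chosen to match the article's rough orders of magnitude, not measured figures or real database entries.

```python
def differing_base_pairs(sequence_length: int, diffs_per_million: int) -> int:
    """Absolute number of differing base pairs for a given relative difference."""
    return sequence_length * diffs_per_million // 1_000_000


# Assumed, illustrative values: 500 million base pairs that differ
# at 2,000 positions per million (0.2 percent).
print(differing_base_pairs(500_000_000, 2_000))  # 1000000

# The "directory without names": sequence data records what is at each
# location (hypothetical loci and sequences), but not what it does.
sequences = {
    "locus_001": "ATGGCC",
    "locus_002": "TTGACA",
    "locus_003": "GGCTAA",
}

# Functional annotation is the separate layer of knowledge that
# functional genomics tries to fill in (hypothetical entry).
annotations = {
    "locus_001": "hypothetical enzyme",
}


def unannotated(seqs: dict, notes: dict) -> list:
    """Loci whose sequence is stored but whose function is still unknown."""
    return sorted(locus for locus in seqs if locus not in notes)


print(unannotated(sequences, annotations))  # ['locus_002', 'locus_003']
```

The second half of the sketch keeps sequences and annotations in separate structures on purpose, mirroring the article's distinction between the data of what the genes are and the knowledge of how they work.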