At a White House press conference Monday, the Human Genome Project public consortium and Celera Genomics, a private firm, jointly reported that they had assembled working drafts of the human genome sequence. The two groups' presence on the same podium marked an apparent truce in what has been a desperate push to be first to announce a decoded human genome.
While a breakthrough for science, the genome detective work is also something of a breakthrough for modern computing. Distributed computing, database technology, advanced search software, and other technologies were used to reach the goal of uncovering the basic plan for human life.
The work to create a genetic blueprint for a human being revealed a total of 3.12 billion base pairs in the human genome. An assembled genome is one in which the location and order of the letters of genetic code along the chromosomes are known. Researchers rely on computers to find the matches between DNA sequences that make it possible to unravel the code.
Some observers suggest that the work is giving rise to a new discipline, known as bioinformatics, born of the marriage of computer science and biology.
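To give a rough sense of what "uncovering matches in DNA sequences" can mean in practice, here is a minimal sketch in Python. It is an illustration only, not the method used by Celera or the public consortium: the function names `longest_overlap` and `merge_reads`, the sample fragments, and the `min_len` threshold are all assumptions made for this example. It finds the longest stretch where the end of one fragment matches the start of another, then joins the two.

```python
def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is also
    a prefix of `right`, or 0 if no overlap of at least `min_len` exists."""
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_len: int = 3):
    """Join two fragments on their overlap; return None if they do not overlap."""
    size = longest_overlap(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


if __name__ == "__main__":
    # Hypothetical fragments, chosen only to illustrate the idea.
    print(merge_reads("ACGTTGCA", "TGCAGGTC"))
```

Real assembly software must also cope with sequencing errors, repeated regions, and many millions of fragments, which is where the large computing facilities described in this article come in.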
For its part, Celera has hooked up DNA sequencers with a supercomputing facility featuring 800 interconnected Compaq Alpha-based computer systems, each of which is capable of performing more than 250 billion sequence comparisons per hour. Celera has an alliance with Oracle for database development.
"The whole project has been about information acquisition and storage," said Bruce Birren, assistant director of the Whitehead Sequencing Center in Cambridge, Mass., a key participant in the Human Genome Sequencing Consortium.
"We've read out the four-letter code that represents the book of life," Birren said, referring to the four-letter code that corresponds to DNA's four basic chemical components. "We've always studied one gene at a time, but our perspective is changed because we now see the entire landscape. That takes computational ability."
There is substantial analytical work yet to do in the field, as researchers look to establish possible links between specific genes and specific traits. That next stage of work may be counted on to drive further computing advances, even as computing advances drive genome mapping forward.
"Now we're moving into a phase where interpreting [genetic] information is going to require new analytical tools," Birren said. Researchers are already using a mix of different advanced software technologies -- including neural networks, fuzzy logic, and data smoothing -- to uncover patterns in the genetic data.
It will also be necessary to carefully match analytical and data management software tools, said Michael Roberson, a program manager at the SAS Institute in Cary, N.C.
"One of the areas where SAS software has been used for a long time has been in the area of clinical trials," he said.
On one level, Roberson explained, genetic data manipulation and management are similar to traditional data mining and data warehousing tasks. But there are differences.
"In human genome work, data warehousing is made more complicated by the fact that the data is very irregular and very large," he said. "When you're looking at this data in relation to clinical trial data, it's much harder to take pieces of information from a lot of sources and combine them as you would, for example, with a traditional credit card information database. It's tricky data to work with, because the [techniques associated with the] collection of the data tend to be different for each subject."
Roberson said his group was looking at a new technology known as data smoothing, which uses pattern recognition techniques to pick out true genetic markers from noisy data sets. In May the SAS Institute spun off iBiomatics LLC as a wholly owned subsidiary to specifically meet the computing needs of researchers in the emerging life science industry.
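The article does not say which smoothing method Roberson's group was evaluating. As a stand-in, the sketch below uses a simple moving average, one of the most basic smoothing techniques, to show the general idea of damping isolated noise so that a sustained signal stands out. The function name `moving_average`, the window size, and the sample numbers are assumptions made for illustration and do not come from SAS.

```python
def moving_average(values, window=3):
    """Smooth a sequence by averaging each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical readings: one isolated spike (9) and one sustained bump (5, 6, 5).
    noisy = [0, 9, 0, 0, 5, 6, 5, 0, 0]
    smoothed = moving_average(noisy, window=3)
    # The isolated spike is flattened; the sustained bump remains the highest point.
    print([round(v, 2) for v in smoothed])
```

A moving average trades detail for stability: a wider window suppresses more noise but also blurs narrow features, so the window size would have to be tuned to the data at hand.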