Bio gold rush could pay off for enterprise IT
Like most of us, IT managers at major retailing or banking companies probably find the current revolution in life sciences research compelling because of its promise to disarm hereditary diseases or cancers. But they may not realize that they also have a professional self-interest in computationally driven work on genetics, proteins, and pharmaceuticals.
Vendors who have traditionally served enterprise IT customers are now turning, dollar signs flashing in their eyes, to the potentially lucrative new bioinformatics market. As they compete to meet the specific customer requirements in this niche, the side effect could be a technological payoff for their traditional customers across a wide range of industries.
At a time when other IT market segments have cooled, bioinformatics is hot, according to Mike Swenson, senior research analyst at IDC. "The major vendors clearly see that this has the opportunity to be a major growth area," he said. IDC predicts that the bio-science IT market will grow from US$10.4 billion in 2000 to about $38 billion in 2006, an annual growth rate of 24 percent.
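As a rough check on the IDC forecast, the quoted growth rate can be recomputed from the two dollar figures. The sketch below assumes compound growth over the six years from 2000 to 2006; the function name is purely illustrative and is not taken from IDC.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Return the constant yearly rate that grows start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# IDC figures quoted in the article, in billions of US dollars.
rate = compound_annual_growth_rate(10.4, 38.0, 2006 - 2000)
print(f"Implied annual growth: {rate:.1%}")
```

The result comes out at roughly 24 percent, in line with the rate IDC cites.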
A defining characteristic of bioinformatics applications is their use of extremely large amounts of data, which is pushing high-end server, storage and database technology development at companies such as IBM Corp., Sun Microsystems Inc., Hewlett-Packard Co. (HP), Compaq Computer Corp. and Oracle Corp.
"There are several very important areas where our life sciences customers have had a profound impact on our technologies and products," said Ty Rabe, director of high-performance technical computing solutions at Compaq.
For example, life sciences customers have been taking relatively new technologies such as SANs (storage area networks) and scaling them up to extremely large sizes to find out if they are "bullet-proof," Rabe said. Celera Genomics Group has 120 terabytes of data stored on SANs, all set up using standard products, he said.
"In the process of creating these, we've discovered problems managing and moving around very large amounts of data," Rabe added. Compaq expects that other industries will get to that level of volume but over a longer time frame, and by the time they get there, such large-scale SANs will be truly bullet-proof thanks to the experience gained in life sciences.
For Sun's part, the explosion of data in both quantity and complexity -- "One person's DNA is 300 terabytes," said Siamak Zadeh, group manager of Sun's life sciences division -- is driving the company to develop a storage architecture roadmap from terabytes to exabytes. "We can handle the volume, but what about the I/O infrastructure to handle this, and the performance of I/O?" Zadeh asked.
Meanwhile, bioinformatics isn't stopping at the large amounts of data created by genetic research. Proteomics, the study of proteins, will call for even more storage, as it creates an order of magnitude more data than gene sequencing, according to Rabe.
At IBM, life sciences are also driving many requirements for the company's data management and storage technologies, according to Sharon Nunes, director of solution development for life sciences. For example, a project with Merck & Co. Inc. resulted in a database technology called DiscoveryLink, DB2-based middleware that allows users to pose one natural-language query against a variety of databases and data sources, and get back one result. In talking about this to customers in retail, banking and government, Nunes reported, "every one of them, their eyes would light up, and they would say, 'Wow, this would be tremendous.'" According to Nunes, IBM's software group has seen what the life sciences group has done with DiscoveryLink and is working to generalize the technology by developing a number of more generic data wrappers. A data wrapper is code that encapsulates a data package so that it can be shared among different platforms.
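The wrapper idea can be illustrated with a small sketch. This is not DiscoveryLink or any IBM API: the class names, the `search` method, and the sample rows are all invented for illustration, and the query is a plain exact-match lookup rather than natural language. The point is only that each wrapper hides its source's format behind one common interface, so a single query can be sent to all of them and the answers merged into one result.

```python
from typing import Dict, Iterator, List


class TableWrapper:
    """Hypothetical wrapper around rows already held as dictionaries."""

    def __init__(self, name: str, rows: List[Dict[str, str]]) -> None:
        self.name = name
        self.rows = rows

    def search(self, field: str, value: str) -> Iterator[Dict[str, str]]:
        for row in self.rows:
            if row.get(field) == value:
                yield {**row, "source": self.name}


class FlatFileWrapper:
    """Hypothetical wrapper around tab-separated text with a header line."""

    def __init__(self, name: str, text: str) -> None:
        self.name = name
        lines = text.strip().splitlines()
        self.header = lines[0].split("\t")
        self.records = [line.split("\t") for line in lines[1:]]

    def search(self, field: str, value: str) -> Iterator[Dict[str, str]]:
        if field not in self.header:
            return  # this source has no such column, so it yields nothing
        index = self.header.index(field)
        for record in self.records:
            if record[index] == value:
                yield {**dict(zip(self.header, record)), "source": self.name}


def federated_search(wrappers, field: str, value: str) -> List[Dict[str, str]]:
    """Send one query to every wrapped source and merge the answers."""
    results: List[Dict[str, str]] = []
    for wrapper in wrappers:
        results.extend(wrapper.search(field, value))
    return results


# Invented sample data, for illustration only.
assays = TableWrapper("assays", [{"gene": "GENE_A", "result": "active"},
                                 {"gene": "GENE_B", "result": "inactive"}])
notes = FlatFileWrapper("notes", "gene\tcompound\nGENE_A\tcompound-7\n")

hits = federated_search([assays, notes], "gene", "GENE_A")
for hit in hits:
    print(hit)
```

Adding a new kind of data source then means writing one more wrapper class, without touching the code that issues the query.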
Database giant Oracle has jumped on the life-sciences bandwagon, with Chief Executive Officer and Chairman Larry Ellison heralding the company's commitment to the field. Jon Simmons, vice president of Oracle Life Sciences, confirms that this market will certainly have an effect on Oracle's foundation database technology that will be seen in Release 10, although he declined to elaborate in detail.
"The challenge is the huge amount of data," Simmons said. "How do you manage it, get intelligence from it? We've got quite a few initiatives under way relating to storage, access, mining. How do you handle these on the scale that's being demanded?" He called Oracle's Real Application Clusters (RAC) a "perfect fit" for such problems, and said that life sciences customers are consequently driving his company to continue developing RAC and scalability.
An example of Oracle's life sciences work that may have an impact on other kinds of customers, Simmons said, is its investigation of search algorithms from BLAST (the Basic Local Alignment Search Tool, used in genetics and proteomics) that might ultimately benefit logistics and shipping companies.
Information security is another area where life sciences customers are seen as having the most demanding requirements, even more so than traditionally security-conscious users such as financial institutions. Nunes characterizes the pharmaceutical industry as "really paranoid" about sending queries over the Internet to government databases, thanks to patent laws that could be read as equating such a query with publication and thus starting the clock ticking on when a patent application must be filed.
HP's David Valenta, global market development manager for life sciences, concurs that the perception among life sciences customers, especially pharmaceutical companies, is that "security is not good enough." Many of them still use courier services to transport floppy disks rather than trust data transfer over the Internet. While protection of intellectual property is one concern, Valenta says that "What really keeps them up at night is that they hold a lot of genetic information about people." Indeed, in the U.S. at least, controversies over exposing someone's credit card information pale next to the issues raised by exposing someone's genetic data and potential predisposition to diseases and other medical conditions.
On the systems side, the Blue Gene project, launched by IBM in late 1999 to build a supercomputer aimed at computationally intensive operations such as modeling the folding of human proteins, produced technology developments in load-balancing, self-healing, and fault-tolerance that made their appearance in the eServer p690, Nunes said. That Unix machine shipped last December to customers such as retailer Gap Inc., which is using it for global supply chain management, according to IBM.
HP's Valenta sees Linux getting a shot in the arm from the life sciences market: Because Linux-based systems are perceived to be relatively inexpensive, they have been well-received in academia, where many life sciences applications are developed. The consequence is that now the big pharmaceutical companies want commercialized, turnkey Linux systems to run these applications -- and companies such as HP will likely do their best to deliver.
The development of software and hardware to manage very large-scale clusters or compute farms is becoming very important for Compaq, according to Rabe. Many of the company's life sciences customers are using dozens to hundreds of four-processor Alpha systems connected together. "The difficulties are: How do you manage the environment? And how do you get useful work from compute farms?" Rabe said.
Users in the life sciences community are thus among the most vocal in discussions of emerging standards in grid computing, Rabe said. Grid computing aims to create a computational resource analogous to the electricity grid, so that systems can be tapped, shared, and aggregated regardless of geographical location. Issues with which users are grappling include keeping track of computing resources that are available, and applying policies for the availability of those resources. Other areas of concern for grid computing are security -- how users are identified as being authentic when accessing the grid -- and standard procedures for accessing data.
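The issues listed above (tracking available resources, applying availability policies, and authenticating users) can be pictured with a toy registry. This is a sketch under stated assumptions, not an implementation of any grid standard or toolkit: the `Resource` and `GridRegistry` names, the group-based policy, and the sample sites are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class Resource:
    """A compute resource advertised to the grid (hypothetical fields)."""
    name: str
    site: str
    free_cpus: int
    allowed_groups: FrozenSet[str]  # policy: which user groups may use it


class GridRegistry:
    def __init__(self, known_users: Dict[str, str]) -> None:
        # Maps user name -> group; stands in for real authentication.
        self._known_users = dict(known_users)
        self._resources: Dict[str, Resource] = {}

    def register(self, resource: Resource) -> None:
        """Record (or update) a resource so the grid can keep track of it."""
        self._resources[resource.name] = resource

    def find(self, user: str, cpus_needed: int) -> List[str]:
        """Return names of resources this user may use that have enough CPUs."""
        group = self._known_users.get(user)
        if group is None:
            raise PermissionError(f"unknown user: {user}")
        return sorted(
            r.name for r in self._resources.values()
            if group in r.allowed_groups and r.free_cpus >= cpus_needed
        )


# Invented example: two sites in different locations, one shared registry.
registry = GridRegistry({"alice": "genomics", "bob": "finance"})
registry.register(Resource("farm-east", "Boston", 64, frozenset({"genomics"})))
registry.register(Resource("farm-west", "Palo Alto", 16,
                           frozenset({"genomics", "finance"})))

print(registry.find("alice", 32))
print(registry.find("bob", 8))
```

Where the resources physically sit does not affect the lookup, which is the sense in which a grid is meant to resemble the electricity grid.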
Life sciences will also increasingly influence Compaq's foundation technology in the area of computing architectures. Rabe characterized the last three or four years of life sciences computation as requiring integer rather than floating-point compute power, as the work has involved doing comparisons of character strings. Such a task lends itself to being broken down and done on clusters, he said. However, "In the next few years, they're going to be moving from character string work to looking at chemical structures of compounds such as proteins, and in the longer term, structures of larger systems such as tissues and organs." Demands for better floating-point performance will drive more capable interconnects between processors, and processors that are themselves more floating-point capable, Rabe said, adding that "we are actively working with government and life sciences customers to try to define new computing architectures."
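To see why character-string comparison breaks down so readily across a cluster, consider the minimal sketch below. It is an illustration only, not any vendor's or researcher's actual method: the scoring is a simple count of matching characters in the best ungapped alignment, the sequences are made up, and a thread pool stands in for the separate machines of a compute farm. Each chunk of the database is scored independently using nothing but integer comparisons, and the partial results are merged at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def best_ungapped_match(query, subject):
    """Slide query along subject; return the highest count of matching characters."""
    best = 0
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        matches = sum(1 for a, b in zip(query, window) if a == b)
        best = max(best, matches)
    return best


def score_chunk(task):
    """Score one chunk of the database; needs no data from any other chunk."""
    query, chunk = task
    return [(name, best_ungapped_match(query, seq)) for name, seq in chunk]


def search(query, database, n_workers=2):
    """Split the database into chunks, score them independently, merge the hits."""
    items = list(database.items())
    chunks = [items[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(score_chunk, [(query, c) for c in chunks]))
    hits = [hit for part in partials for hit in part]
    # Best score first; ties broken by name for a stable order.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Made-up sequences, for illustration only.
database = {"seq1": "ACGTACGT", "seq2": "TTTTACGA", "seq3": "GGGG"}
ranked = search("ACGA", database)
print(ranked)
```

Because no chunk depends on another, adding machines shortens the search almost in proportion. The structural and tissue-level modeling Rabe anticipates does not decompose this cleanly, which is why he expects it to demand better interconnects.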
The demands of working with these new data structures will have an impact on data management tools, as "data will be structured in different ways, and people will be looking for new ways to mine data on a scale and complexity never done before," Rabe said. The development in data mining technology pushed by life sciences users will be applicable broadly, Rabe believes, in areas such as fraud detection. "Life sciences is doing earlier what many other industries are going to be doing down the road," he said.