Bad data warning over public gene databases

May 6, 2002, 08:39 AM, Australian Biotechnology News

Some of the most-used global databases of DNA and amino acid sequences are riddled with errors and there is no quick fix in sight.

Leading the list is the GenBank public database operated by the U.S. National Center for Biotechnology Information.

Dr Ian Collet, a bioinformatics lecturer at Queensland University of Technology, says he has been forced to foster attitudes of "healthy scepticism about the validity of data lodged in GenBank" among his students at the same time he teaches them how to access the database.

"A lot of the global (genomics) databases have a lot of incorrect data in them," he says.

"I use GenBank's entry for insulin as an example of how many mistakes you can find in an entry. The positions of the genes are in the wrong spot, the intron and exon (the non-coding and coding segments of a gene) boundaries are wrongly marked and three amino acids are left out."

The problem arises because the publicly-funded GenBank allows researchers to lodge their sequencing data on a do-it-yourself basis. It is not edited or checked on submission so incomplete or incorrect information is accepted and then propagated when other researchers retrieve it.

There are widespread errors in GenBank and some other global databases, agrees Mike Poidinger, head of the Australian National Genomics Information Service (ANGIS), an online provider of software tools and services to Australian biomedical researchers which include access to the large databases.

Poidinger recently received a call from an irate researcher complaining about a sequence received through ANGIS which did not tally with the sequence published in the original research paper. The researcher was correct about the discrepancy but a check by ANGIS revealed it had originated in data retrieved from GenBank.
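
A check of this kind, comparing a retrieved sequence against the one printed in a paper, can be sketched in a few lines of Python. This is an illustration only: the function name `find_mismatches` and the two short sequences are hypothetical, and the sketch assumes both sequences are already aligned and start at the same base, which real records often do not.

```python
from typing import List, Tuple


def find_mismatches(retrieved: str, published: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, retrieved base, published base) for each difference.

    Assumption: the two sequences are already aligned; only the shared
    length is compared, so insertions and deletions are not detected.
    """
    retrieved = retrieved.upper()
    published = published.upper()
    mismatches = []
    for position, (r, p) in enumerate(zip(retrieved, published), start=1):
        if r != p:
            mismatches.append((position, r, p))
    return mismatches


# Hypothetical example sequences, not taken from any real database entry.
database_copy = "ATGGCC"
paper_copy = "ATGACC"
differences = find_mismatches(database_copy, paper_copy)
for position, r, p in differences:
    print(f"position {position}: database has {r}, paper has {p}")
```

A position-by-position comparison like this only flags substitutions. Missing or extra bases shift everything after them, so a real check would need a proper alignment step first.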

Poidinger, who is also CEO of the Australian Genomic Information Centre, says many researchers know enough about the problem to be wary of GenBank but the issue is not yet widely recognized.

He is now considering flagging the issue on the ANGIS website to remind researchers of the need for caution in handling information from GenBank and other nucleotide databases.

GenBank's system only examines submissions for syntax errors and accepts them if they pass that relatively rudimentary check.

Other databases, for example SwissProt, which is focused on protein information, are more rigorous in their manual checks and only enter data after being satisfied it is correct.

It's not fair to blame lax researchers entirely for faulty data. The algorithms driving today's automated, high-throughput sequencing systems are not infallible. Even a one per cent error rate will produce 10 mistakes in every 1000 bases that a machine calls, and it is difficult for researchers to manually check the flood of machine-generated data.
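
The arithmetic behind that figure is simple enough to write down. The sketch below is a minimal illustration, assuming that errors occur independently at a fixed per-base rate; the function names are hypothetical and do not come from any sequencing software.

```python
def expected_miscalls(error_rate: float, bases_called: int) -> float:
    """Expected number of wrong base calls at a fixed per-base error rate."""
    return error_rate * bases_called


def chance_of_any_error(error_rate: float, bases_called: int) -> float:
    """Probability that at least one base is wrong, assuming independent errors."""
    return 1.0 - (1.0 - error_rate) ** bases_called


# One per cent error rate over 1000 bases, as in the article.
print(expected_miscalls(0.01, 1000))
# Under the same assumption, an error-free run of 1000 bases is very unlikely.
print(chance_of_any_error(0.01, 1000))
```

The second function shows why manual checking cannot keep up: at that rate almost every 1000-base stretch contains at least one wrong call.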

Cleaning up corrupted databases as large as GenBank will not be an easy task, predicts Poidinger. "GenBank is doubling in size every seven to nine months. We are talking millions of base pairs. You would need a team the size of a small country to check submissions by eye," he says.
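
To put the quoted doubling time in perspective, the growth it implies can be worked out directly. This is a rough sketch assuming steady exponential growth at the doubling period Poidinger cites; `growth_factor` is a hypothetical helper name.

```python
def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """How many times larger a database becomes after months_elapsed,
    assuming it doubles every doubling_months."""
    return 2.0 ** (months_elapsed / doubling_months)


# Growth over one year at each end of the quoted seven-to-nine-month range.
for doubling in (7, 9):
    factor = growth_factor(12, doubling)
    print(f"doubling every {doubling} months: x{factor:.2f} per year")
```

On those assumptions the database grows roughly 2.5 to 3.3 times larger each year, so any backlog of unchecked entries grows at the same pace.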

The database blunders carry the seeds of larger concerns about the faith that newer generations of students are placing in computer data, says QUT's Collet.

"PhD students are making a lot of mistakes because of a blind belief in what the computer tells them. They have to learn to think beyond the computer printout."

» posted by abennett
