Google: DRAM error rates vastly higher than previously thought

By Lucas Mearian, Computerworld

A study released this week by Google Inc. and the University of Toronto showed that data error rates on DRAM modules are vastly higher than previously thought and may be more responsible for system shutdowns and service interruptions than assumed.

The study, which used tens of thousands of Google's servers, showed that about 8.2% of all dual in-line memory modules (DIMMs) are affected by correctable errors and that an average DIMM experiences about 3,700 correctable errors per year.

"Our first observation is that memory errors are not rare events. About a third of all machines in the fleet experience at least one memory error per year, and the average number of correctable errors per year is over 22,000," the report states.

"These numbers vary across platforms, with some platforms seeing nearly 50% of their machines affected by correctable errors, while in others only 12%-27% are affected."

The median number of errors per year on a Google server that had at least one error ranged from 25 to 611.

A memory error is marked by bits being read differently from how they were originally written. Memory errors can be caused by electrical or magnetic interference or by hardware corruption.
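As a rough illustration of that definition, the sketch below compares a value as written with the value later read back and counts the bits that differ. The function name and the sample byte values are made up for the example and do not come from the study.

```python
def count_flipped_bits(written, read):
    """Return how many bit positions differ between the written and read values."""
    # XOR leaves a 1 in every position where the two values disagree.
    return bin(written ^ read).count("1")


written = 0b10110010   # hypothetical byte stored in memory
read = 0b10100010      # same byte read back with one bit changed

print(count_flipped_bits(written, read))     # 1: a single-bit memory error
print(count_flipped_bits(written, written))  # 0: data read back intact
```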

Memory errors are classified as either soft or hard. Soft errors randomly corrupt bits but leave no physical damage and can be corrected. Hard errors stem from a physical defect in bits (cells) within the DRAM, so the same data errors keep recurring. Soft errors are often caused by radiation or alpha particles, which occur naturally in organic materials, including the epoxy that DRAM chips are packaged in. Hard errors are most often caused by chip contamination at the manufacturing facility, but they often don't show up in testing and surface only after the memory chip warms up after hours of use, according to Jim Handy, an analyst with Objective Analysis in Los Gatos, Calif.

The Google/University of Toronto study included memory from multiple vendors as well as multiple types of DRAM (dynamic random access memory), such as DDR1, DDR2 and FB-DIMM.

The study covered the majority of servers in Google's data centers and was conducted over two-and-a-half years, from January 2006 to June 2008.

While the study focused on servers and stated that error rates are not climbing with the latest, denser generations of DRAM, the results show that PCs will eventually need error correction code (ECC) technology as memory chips become more and more dense, Handy said.

ECC on special chips is used to detect and correct errors introduced during data storage or transmission.
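To show how a code can both detect and correct a flipped bit, here is a minimal sketch using a Hamming(7,4) code, a textbook scheme chosen for brevity. It is not the code described in the Google report, and real ECC memory protects much wider words. The function names are illustrative only.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (an int 0..15) as a 7-bit codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    code = [0] * 8                              # index 0 unused; positions 1..7
    code[3], code[5], code[6], code[7] = d      # data bits
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions 4,5,6,7
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome). A nonzero syndrome is the corrected position."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # XOR of the positions holding a 1
    if syndrome:
        code[syndrome] ^= 1                     # flip the faulty bit back
    d = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(d))
    return nibble, syndrome


stored = hamming74_encode(0b1011)
stored[4] ^= 1                                  # simulate a soft error at position 5
value, fixed_at = hamming74_decode(stored)
print(value == 0b1011, fixed_at)                # True 5
```

A code like this repairs any single flipped bit in a codeword, but two flipped bits would be miscorrected, which is why memory systems add further check bits.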

Today, DRAM uses 50 nanometer lithography technology but is migrating to 40 nanometer technology. The smaller the bits, the more susceptible they are to soft errors due to normal levels of radiation, Handy said.

For example, while a server with error correction technology can continue to function after a soft error, a PC would need to be rebooted. A hard error would also be corrected each time a processor attempted to read from the affected bit on a server card, but the DRAM in a PC, which has no error correction, would need to be replaced because it would crash any system or application using the memory, Handy said.

"The study shows hard errors are more common than soft. That means modules are running and running and running in servers and every time a hard error bit is encountered, it's corrected so the memory module never gets replaced," Handy said. "If that happened to a PC user, the machine would stop working."

If an error is uncorrectable, as in the case of multiple bits exceeding the limit of what the ECC can correct, a server will shut down.
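The difference between a correctable and an uncorrectable error can be sketched with a toy single-error-correct, double-error-detect (SECDED) code: a Hamming(7,4) code plus one overall parity bit. This is an assumption-laden illustration on 4 data bits, not the scheme used in Google's servers, and the function names are hypothetical.

```python
def secded_encode(nibble):
    """Encode 4 data bits as 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2                 # makes the total number of 1s even
    return code


def secded_decode(bits):
    """Return (status, nibble); nibble is None when the error cannot be corrected."""
    code = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2 == 1             # True after an odd number of flips
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        code[syndrome] ^= 1                     # syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None            # even number of flips, nonzero syndrome
    d = [code[3], code[5], code[6], code[7]]
    return status, sum(bit << i for i, bit in enumerate(d))


word = secded_encode(0b0110)

one_flip = list(word)
one_flip[6] ^= 1
print(secded_decode(one_flip))                  # ('corrected', 6)

two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1
print(secded_decode(two_flips))                 # ('uncorrectable', None)
```

In this sketch the second case is where a real system would have to stop rather than hand back data it knows to be wrong.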

"In many production environments, including ours, a single uncorrectable error is considered serious enough to replace the dual in-line memory module that caused it," the Google report read.

Handy said such problems often result in system downtime and service outages.

The study states that memory errors are expensive in terms of the system failures they cause and the repair costs associated with them. They can also open the door to security problems.

"In production sites running large-scale systems, memory component replacements rank near the top of component replacements and memory errors are one of the most common hardware problems to lead to machine crashes," the report stated. "Moreover, recent work shows that memory errors can cause security vulnerabilities."
