SC13: Elevation plays a role in memory error rates

A study by AMD and the Department of Energy showed that a supercomputer at a higher altitude had more memory problems

By IDG News Service | Hardware

With memory, as with real estate, location matters. A group of researchers from Advanced Micro Devices (AMD) and the Department of Energy's Los Alamos National Laboratory has found that the altitude at which SRAM (static random access memory) resides can influence how many random errors the memory produces.

In a field study of two high-performance computers, the researchers found that L2 and L3 caches had more transient errors on the supercomputer located at a higher altitude than on the one closer to sea level. They attributed the disparity largely to lower air pressure and a higher rate of cosmic-ray-induced neutron strikes.

Strangely, higher elevation even led to more errors within a rack of servers, the researchers found. Their tests showed that memory modules at the top of a server rack had 20 percent more transient errors than those closer to the bottom of the rack. However, it's not clear what causes this smaller-scale effect.

Vilas Sridharan, an AMD technical staff member, presented the findings Thursday at the SC13 supercomputing conference, being held this week in the mile-high city of Denver.

Using the error logs of two large high-performance computers, the study examined the characteristics of transient memory errors, in which a memory module may store a 1 as a 0, or vice versa.
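To make the idea of a flipped bit concrete, here is a minimal Python sketch. The stored value, the bit position, and the helper names `flip_bit` and `parity` are all hypothetical and chosen only for illustration. The single parity bit is the simplest possible check and is not a description of how the systems in the study detect or log errors.

```python
def flip_bit(word: int, position: int) -> int:
    """Return `word` with the bit at `position` inverted (0 -> 1 or 1 -> 0)."""
    return word ^ (1 << position)


def parity(word: int) -> int:
    """Return 1 if `word` contains an odd number of 1 bits, otherwise 0."""
    return bin(word).count("1") % 2


# Hypothetical 8-bit value written to memory, with its parity recorded.
stored = 0b10110010
stored_parity = parity(stored)

# Simulate a transient fault: bit 3 changes from 0 to 1.
corrupted = flip_bit(stored, 3)

# A single flipped bit always changes the parity, so the mismatch is detectable.
error_detected = parity(corrupted) != stored_parity
print(f"{stored:08b} -> {corrupted:08b}, detected: {error_detected}")
```

A lone parity bit can only reveal that an odd number of bits changed. It cannot say which bit flipped, and it misses an even number of flips.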

Transient errors are different from permanent or even intermittent errors, which are usually caused by hardware failure, Sridharan said. Transient errors appear more randomly and are not usually the fault of machinery. They are relatively rare, but depending on where they occur, they can cause a cascade of additional system errors.

The group studied the monthly transient fault rates of SRAM--the L2 and L3 caches within processors--in two large Cray supercomputers, each running thousands of AMD processors.

One supercomputer was the Jaguar system at Oak Ridge National Laboratory in Oak Ridge, Tennessee, which is approximately 817 feet (249 meters) above sea level, according to an online altitude finder.

The other system under study was the Cielo supercomputer at the Los Alamos National Laboratory in Los Alamos, New Mexico, which is about 7,058 feet (2,151 meters) above sea level.

The group found that, when all other possible confounding issues were factored out, Cielo's SRAM had a "significantly higher rate of SRAM faults" than Jaguar's SRAM, Sridharan said.
