Last week Jim Salter over at Ars Technica wrote a thorough and interesting article about the phenomenon known as bitrot and how next-generation file systems are gearing up to combat it. He also stresses that, chances are, your RAID configuration won’t protect you from this type of data corruption.
Bitrot is a term describing a data storage anomaly which results in the “flipping” of a bit or multiple bits (1->0 or 0->1) on a storage device. Jim describes the occurrence well:
Sound too theoretical to make you care about filesystems? Let's talk about "bitrot," the silent corruption of data on disk or tape. One at a time, year by year, a random bit here or there gets flipped. If you have a malfunctioning drive or controller—or a loose/faulty cable—a lot of bits might get flipped. Bitrot is a real thing, and it affects you more than you probably realize. The JPEG that ended in blocky weirdness halfway down? Bitrot. The MP3 that startled you with a violent CHIRP!, and you wondered if it had always done that? No, it probably hadn't—blame bitrot. The video with a bright green block in one corner followed by several seconds of weird rainbowy blocky stuff before it cleared up again? Bitrot.
Reading that probably brings the issue closer to home for most of us. It’s pretty common to run into one of these seemingly random problems with a file at some point, but most of us never knew the cause. What’s worse, backups can’t protect you in most cases: you’ll be unaware of the corruption, so the bad file gets backed up in its corrupted state, often overwriting the good copy.
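To make the idea concrete, here is a minimal Python sketch of a single flipped bit and of why it goes unnoticed unless a checksum was recorded while the file was still good. The sample data and the helper names `flip_bit` and `digest` are made up for illustration; this is not how any particular backup tool works.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    byte_index, offset = divmod(bit_index, 8)
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << offset
    return bytes(corrupted)


def digest(data: bytes) -> str:
    """SHA-256 hex digest, standing in for a stored checksum."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical file contents; any bytes would do.
original = b"contents of a favorite photo" * 100
recorded = digest(original)  # saved while the file is known to be good

# Simulate bitrot: one bit somewhere in the file changes.
rotted = flip_bit(original, 12345)

# The size is unchanged, so a backup job sees nothing unusual.
same_length = len(rotted) == len(original)

# Only a comparison against the recorded checksum reveals the damage.
detected = digest(rotted) != recorded
```

A backup job that compared each file against a recorded checksum like this before overwriting an older copy could at least refuse to replace a good file with a rotted one.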
It’s a common misconception that configuring your drives in a RAID 1, 5, or 10 setup, with mirroring or parity, will protect your data. But again, as Jim describes:
That only works if a drive completely and cleanly fails. If the drive instead starts spewing corrupted data, the array may or may not notice the corruption (most arrays don't check parity by default on every read). Even if it does notice... all the array knows is that something in the stripe is bad; it has no way of knowing which drive returned bad data—and therefore which one to rebuild from parity (or whether the parity block itself was corrupt).
If you’re in small to mid-sized IT and responsible for data storage, this should scare you. It’s true that there are more sophisticated RAID controllers that can protect against this issue, but they are generally out of the financial reach of smaller organizations.
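The limitation Jim describes can be shown with a toy model in Python. This is only a sketch under simplifying assumptions: three tiny in-memory "drives", XOR parity in the style of RAID 5, and a SHA-256 checksum per block standing in for the checksums that next-generation file systems keep. The helper names `xor_blocks` and `checksum` are invented for the example, and nothing here reflects the actual on-disk layout of any RAID controller, ZFS, or btrfs.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length blocks together byte by byte (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(block):
    """Per-block checksum, recorded when the block was written."""
    return hashlib.sha256(block).digest()


# One stripe spread over three hypothetical data drives, plus parity.
data = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]
parity = xor_blocks(data)
checksums = [checksum(block) for block in data]

# Drive 1 silently returns its block with a single flipped bit.
returned = list(data)
returned[1] = bytes([data[1][0] ^ 0x01]) + data[1][1:]

# Parity alone: the stripe no longer adds up, but nothing says which
# block (or the parity block itself) is the bad one.
parity_mismatch = xor_blocks(returned) != parity

# Per-block checksums: the corrupted block identifies itself.
bad = [i for i, block in enumerate(returned) if checksum(block) != checksums[i]]

# Knowing which block is bad, it can be rebuilt from the others plus parity.
repaired = list(returned)
for i in bad:
    others = [block for j, block in enumerate(returned) if j != i]
    repaired[i] = xor_blocks(others + [parity])
```

In this model the parity check only reports that something in the stripe is wrong, while the checksums point at drive 1 and allow the block to be rebuilt correctly.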
Jim provides an experiment showing the real-world results of a bit being flipped in one of his favorite photos. You can flip through a gallery of images showing the original, a copy with a single bit flipped under RAID 5, and a copy with a single bit flipped under a next-generation file system (btrfs).
He goes on to describe in detail the advantages and features of two next-generation file systems, ZFS and btrfs. It’s an extremely interesting read and should get you thinking about how your next storage system could take advantage of a new, more resilient file system. Speaking of resilient file systems, notably absent from the article is Microsoft’s ReFS (Resilient File System), which is available now with Windows Server 2012.
You should take a look at the full article if you’re relying on RAID to protect your data today. The improvements in some of the new file systems are worth a look for any upcoming data storage projects.