September 28, 2010, 10:19 AM - Experts in the inner workings of the Internet's Domain Name System, which matches domain names with their corresponding IP addresses, say the 27-year-old communications protocol does not appear to be the cause of Facebook's high-profile outage last week.
Facebook's service was unavailable to its 500 million active users for 2.5 hours on Thursday, the company's worst failure in more than four years. Initial news reports blamed the outage on DNS because end users received a "DNS error" message when they couldn't reach the site.
"There's probably a lesson here that the problem at various times looked like DNS, but ultimately proved not to be," said Cricket Liu, vice president of architecture at Infoblox, which sells DNS appliances. "In my experience, users are quick to point fingers at DNS (perhaps because Web browsers like to implicate DNS when they can't get somewhere) but DNS often isn't at fault."
Facebook gave little detail about the cause of the outage except to say that it was the result of a misconfiguration in one of its databases, which prompted a flood of traffic from an automated system trying to fix the error.
"We made a change to a persistent copy of a configuration value that was interpreted as invalid," explained Robert Johnson in Facebook's blog post about the incident. "This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries per second."
The feedback loop created so much traffic that Facebook was forced to turn off the database cluster, which meant turning off the Web site.
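The feedback loop Johnson describes can be illustrated with a toy model. The sketch below is not Facebook's code; the function name, the client count and the capacity figure are all assumptions chosen for illustration. It assumes each client holding an invalid value sends one repair query per time step, and that the database cluster can answer only a fixed number of queries per step.

```python
def simulate_repair_load(clients, capacity, steps, persistent_value_valid):
    """Return the number of repair queries hitting the cluster at each step.

    Hypothetical model: every client starts out seeing an invalid value and
    queries the database cluster to fix it.
    """
    needing_fix = clients
    load_per_step = []
    for _ in range(steps):
        # Each client that still sees an invalid value sends one query.
        load_per_step.append(needing_fix)
        # The cluster can answer at most `capacity` queries per step.
        answered = min(needing_fix, capacity)
        if persistent_value_valid:
            # Answered clients now hold a good value and stop querying.
            needing_fix -= answered
        # If the persistent copy is itself invalid, the answer does not
        # help, so every client keeps retrying and the load never drains.
    return load_per_step


# With a valid persistent copy the burst drains away:
print(simulate_repair_load(1000, 300, 5, persistent_value_valid=True))
# With an invalid persistent copy the load stays pinned at its peak:
print(simulate_repair_load(1000, 300, 5, persistent_value_valid=False))
```

In this simplified model, the only ways to break the loop are to fix the persistent value or to stop the clients from retrying, which is consistent with Facebook's account of fixing the root cause and turning off the automated correction system.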
"Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site," Johnson said. He added that "for now we've turned off the system that attempts to correct configuration values."
Experts said the Facebook outage was not the result of DNS because they were able to partially log in to the site during the incident, which would have been impossible if DNS were to blame.
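The experts' reasoning rests on a simple distinction: if a name does not resolve, no connection to the site can be attempted at all, whereas a site that resolves but fails is broken somewhere behind DNS. A minimal sketch of that check follows, using Python's standard `socket` module. The function name `diagnose` and its return labels are hypothetical, and the check only covers name resolution and a TCP connection, not application-level errors such as the ones Facebook users saw.

```python
import socket


def diagnose(host, port=80, resolve=socket.getaddrinfo,
             connect=socket.create_connection, timeout=5.0):
    """Classify a failure as DNS-related or not (illustrative only).

    `resolve` and `connect` are injectable so the logic can be exercised
    without touching the network.
    """
    try:
        # Step 1: can the name be turned into an address at all?
        resolve(host, port)
    except socket.gaierror:
        return "dns-failure"
    try:
        # Step 2: the name resolved, so any failure from here on is not DNS.
        conn = connect((host, port), timeout)
    except OSError:
        return "resolved-but-unreachable"
    conn.close()
    return "reachable"
```

Calling `diagnose("www.example.com")` performs a real lookup and connection attempt. A result other than `"dns-failure"` indicates that name resolution worked and that the fault, if any, lies elsewhere.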
"It looked like it was a configuration issue on their end with state information - what was cache versus what was authoritative," said Richard Hyatt, co-founder and CTO of BlueCat Networks, which sells DNS appliances and a cloud-based managed DNS service. "I think it was a configuration on their end that might have been connected to DNS...but they were very vague about it."
Infoblox attributed the Facebook outage to a problem with change management in a blog post on Friday.
"When dealing with network change and configurations, organizations must be more proactive in testing, validating and monitoring the ongoing status of both the critical network infrastructure (routers, switches, firewalls, etc.) as well as the Web services (applications, databases and servers)," Infoblox said, referring to change management as "the biggest problem for IT teams worldwide."
"It's not at all clear that Facebook had a DNS problem. There's no indication of that from the official information they have published," said Jim Galvin, director of strategic relationships and technical standards at Afilias, which operates more than a dozen top-level domains including .info and .org. "It's pretty clear they had a distributed database problem...From our point of view, it's an issue of how to make sure you have good data in the database to start with."
Galvin said the problem Facebook experienced was akin to a distributed denial-of-service attack, in which a Web site is overwhelmed by traffic that an attacker directs at it from many machines. In Facebook's case, the excess traffic was created by its own automated system for verifying configuration values.