Don’t blame DNS for Facebook outage, experts say
Facebook gave little detail about the cause of its outage
Experts in the inner workings of the Internet's Domain Name System, which matches IP addresses with corresponding domain names, say the 27-year-old communications protocol does not appear to be the cause of Facebook's high-profile outage last week.
Facebook's service was unavailable to its 500 million active users for 2.5 hours on Thursday -- the company's worst failure in more than four years. Initial news reports blamed the outage on DNS because end users received a "DNS error" message when they couldn't reach the site.
"There's probably a lesson here that the problem at various times looked like DNS, but ultimately proved not to be," said Cricket Liu, vice president of architecture at Infoblox, which sells DNS appliances. "In my experience, users are quick to point fingers at DNS (perhaps because Web browsers like to implicate DNS when they can't get somewhere) but DNS often isn't at fault."
Facebook gave little detail about the cause of the outage except to say that it was the result of a misconfiguration in one of its databases, which prompted a flood of traffic from an automated system trying to fix the error.
"We made a change to a persistent copy of a configuration value that was interpreted as invalid," explained Robert Johnson in Facebook's blog post about the incident. "This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries per second."
The feedback loop created so much traffic that Facebook was forced to turn off the database cluster, which meant turning off the Web site.
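The loop Johnson describes can be illustrated with a toy model. The sketch below is not Facebook's code: the validity rule, the client count and the function names are assumptions made up for illustration. It only shows why a bad persistent value keeps every client querying, while a good one generates no load.

```python
# Toy model of the feedback loop; all names and numbers are illustrative.

def is_valid(value):
    # Hypothetical validity rule: only positive integers are acceptable.
    return isinstance(value, int) and value > 0


def simulate(num_clients, persistent_value, ticks):
    """Count database queries per tick when clients re-fetch invalid values."""
    caches = [persistent_value] * num_clients
    queries_per_tick = []
    for _ in range(ticks):
        queries = 0
        for i in range(num_clients):
            if not is_valid(caches[i]):
                # The "fix" is a database query, but the database hands back
                # the same persistent copy, so the value never becomes valid.
                queries += 1
                caches[i] = persistent_value
        queries_per_tick.append(queries)
    return queries_per_tick


healthy = simulate(num_clients=1000, persistent_value=42, ticks=3)
broken = simulate(num_clients=1000, persistent_value=-1, ticks=3)
print(healthy)  # [0, 0, 0]
print(broken)   # [1000, 1000, 1000]
```

In this model the load never falls on its own, which is consistent with Facebook's account that it had to shut the cluster down to break the cycle.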
"Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site," Johnson said. He added that "for now we've turned off the system that attempts to correct configuration values."
Experts said the Facebook outage was not the result of DNS because they were able to partially log in to the site during the incident, which would be impossible if DNS were to blame.
"It looked like it was a configuration issue on their end with state information - what was cache versus what was authoritative," said Richard Hyatt, co-founder and CTO of BlueCat Networks, which sells DNS appliances and a cloud-based managed DNS service. "I think it was a configuration on their end that might have been connected to DNS...but they were very vague about it."
Infoblox attributed the Facebook outage to a problem with change management in a blog post on Friday.
"When dealing with network change and configurations, organizations must be more proactive in testing, validating and monitoring the ongoing status of both the critical network infrastructure (routers, switches, firewalls, etc.) as well as the Web services (applications, databases and servers)," Infoblox said, referring to change management as "the biggest problem for IT teams worldwide."
"It's not at all clear that Facebook had a DNS problem. There's no indication of that from the official information they have published," said Jim Galvin, director of strategic relationships and technical standards at Afilias, which operates more than a dozen top-level domains including .info and .org. "It's pretty clear they had a distributed database problem...From our point of view, it's an issue of how to make sure you have good data in the database to start with."
Galvin said the problem Facebook experienced was akin to a distributed denial-of-service attack, where a Web site is overwhelmed by traffic from a hacker. In Facebook's case, the excess traffic was created by its own automated system for verifying configuration values.
"With Facebook, the interim system knew it had bad data and wanted to get the right data...so it will keep asking until it gets the right answer," Galvin said. "The analogy is a DDOS attack. You have more and more resolvers suddenly figuring out that they have bad data in the cache, and they're constantly requesting the right data. The servers that have bad data in them are seeing more requests, and everything slows down."
Galvin said confusion surrounding the Facebook outage stems from the fact that DNS has similar properties. If there were bad data in an authoritative DNS database, the DNS resolvers would continue to ask for the correct data and flood the system with traffic. Also, the bad data would continue to reside in resolver caches for a certain number of hours after the error was fixed because of the time-to-live (TTL) feature of the DNS. Many Web sites have a TTL of one day, which means bad data will live in DNS caches for 24 hours.
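The effect of a one-day TTL can be sketched with a minimal cache. The class name, the host name and the address below are placeholders chosen for illustration, and time is passed in explicitly so the example is deterministic. A real resolver is considerably more involved.

```python
class TTLCache:
    """Minimal resolver-style cache: an entry is served until its TTL expires."""

    def __init__(self):
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl_seconds, now):
        self._entries[name] = (value, now + ttl_seconds)

    def get(self, name, now):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            # Entry has expired; drop it so the next lookup goes upstream.
            del self._entries[name]
            return None
        return value


ONE_DAY = 24 * 60 * 60
cache = TTLCache()
# A resolver caches a bad answer at t=0 with a one-day TTL.
cache.put("www.example.com", "192.0.2.66", ttl_seconds=ONE_DAY, now=0)
# The operator fixes the authoritative record one hour later,
# but the cache keeps answering with the stale value.
after_fix = cache.get("www.example.com", now=3600)
after_ttl = cache.get("www.example.com", now=ONE_DAY)
print(after_fix)  # 192.0.2.66
print(after_ttl)  # None
```

Only once the TTL has run out does the cache stop serving the bad answer, regardless of how quickly the authoritative data was corrected.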
"This is what DNS does by default," Galvin said. "If it gets bad data in the cache, that is where the TTL comes to play. You may or may not be able to do something about that depending on how long your TTL is."
While the Facebook outage does not appear related to DNS, similar misconfigurations of DNS data have prompted massive outages, most recently for Germany and Sweden.
A year ago, all Web sites with Sweden's .se extension were unavailable for an hour or more because an incorrect script used to update the .se domain was missing a dot.
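The report does not say where the dot was missing. One well-known way a single dot changes meaning in DNS zone data is the trailing dot: a name that ends in a dot is fully qualified, while a name without one is treated as relative and has the zone origin appended. The function name and the example host below are assumptions for illustration only.

```python
def qualify(name, origin):
    """Expand a zone-file name the way a DNS server would.

    A name ending in "." is already fully qualified; any other name is
    relative and has the zone origin appended.
    """
    if name.endswith("."):
        return name
    return name + "." + origin


origin = "se."
with_dot = qualify("ns1.example.se.", origin)
without_dot = qualify("ns1.example.se", origin)
print(with_dot)     # ns1.example.se.
print(without_dot)  # ns1.example.se.se.
```

The second name points at a host that does not exist, which is the kind of error an automated update script can replicate across an entire zone.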
"These types of outages happen frequently," Hyatt said. "They happen through poorly managed systems. The one that happened in Germany and the one that happened in Sweden - those were mistakes or errors in automated scripts that should never happen...They could have been avoided."
Hyatt said DNS appliances, including BlueCat's, feature configuration-checking software that can alert administrators when the DNS data change they are making is invalid.
"We have data checking rules that look at the configuration you're trying to deploy and won't push it out...if the system doesn't exist or the system isn't configured right," Hyatt said. "Our system has a lot of smarts. It will give you an alert and tell you what's wrong."
BlueCat's appliances have featured DNS configuration checking since they were introduced in 2001.
"We're looking for anomalies, logical errors that don't make sense," Hyatt said. "We definitely would have caught the Germany and Sweden errors because those were logic errors."
Similarly, Afilias checks zone file changes for the top-level domains that it operates before the changes get published to prevent errors like those experienced by the operators of .de and .se.
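A pre-publication check of this kind could, for example, measure how much of a zone has changed and hold back any update that exceeds a threshold. The sketch below is not Afilias' implementation: the function names, the record format and the 10 percent threshold are all assumptions for illustration.

```python
def change_ratio(old_records, new_records):
    """Fraction of records added or removed, relative to the old zone size."""
    old_set, new_set = set(old_records), set(new_records)
    changed = len(old_set ^ new_set)  # symmetric difference
    return changed / max(len(old_set), 1)


def should_alert(old_records, new_records, threshold=0.10):
    # threshold is a hypothetical value chosen for illustration
    return change_ratio(old_records, new_records) > threshold


old_zone = [f"host{i}.example. A 192.0.2.{i}" for i in range(100)]
# Routine edit: one record removed, one added.
small_edit = old_zone[:99] + ["host100.example. A 192.0.2.100"]
# Dramatic change: 90 of 100 records vanish.
truncated = old_zone[:10]

routine_alert = should_alert(old_zone, small_edit)
dramatic_alert = should_alert(old_zone, truncated)
print(routine_alert)   # False
print(dramatic_alert)  # True
```

A check like this says nothing about whether individual records are right. It only flags changes that are unusually large, so that someone investigates before the zone is published.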
"We notice when zone files are changed. It pops an alert so it gets investigated," Galvin said. "We check the percentage of change...It would have helped prevent the Germany and Sweden problems, where there were very dramatic zone file changes."
But Galvin added that there's not much a service provider like Afilias can do if a customer has bad data in its DNS database, much like the scenario Facebook experienced.
"You're wholly responsible for your own data; all we guarantee is that your data is available," Galvin said. "You cannot recover faster [from your bad data] than your TTL allows recovery to occur."
Hyatt added that even the best error-checking systems can't prevent sys admins from making every type of mistake that would cause an outage. "If they are doing something risky and overriding best practices, we can't prevent that," he said.