"It looked like it was a configuration issue on their end with state information - what was cache versus what was authoritative," said Richard Hyatt, co-founder and CTO of BlueCat Networks, which sells DNS appliances and a cloud-based managed DNS service. "I think it was a configuration on their end that might have been connected to DNS...but they were very vague about it."
In a blog post on Friday, Infoblox attributed the Facebook outage to a change management problem.
"When dealing with network change and configurations, organizations must be more proactive in testing, validating and monitoring the ongoing status of both the critical network infrastructure (routers, switches, firewalls, etc.) as well as the Web services (applications, databases and servers)," Infoblox said, referring to change management as "the biggest problem for IT teams worldwide."
"It's not at all clear that Facebook had a DNS problem. There's no indication of that from the official information they have published," said Jim Galvin, director of strategic relationships and technical standards at Afilias, which operates more than a dozen top-level domains including .info and .org. "It's pretty clear they had a distributed database problem...From our point of view, it's an issue of how to make sure you have good data in the database to start with."
Galvin said the problem Facebook experienced was akin to a distributed denial-of-service attack, in which an attacker overwhelms a Web site with traffic. In Facebook's case, the excess traffic was created by its own automated system for verifying configuration values.
"With Facebook, the interim system knew it had bad data and wanted to get the right data...so it will keep asking until it gets the right answer," Galvin said. "The analogy is a DDOS attack. You have more and more resolvers suddenly figuring out that they have bad data in the cache, and they're constantly requesting the right data. The servers that have bad data in them are seeing more requests, and everything slows down."
Galvin said confusion surrounding the Facebook outage stems from the fact that DNS has similar properties. If there were bad data in an authoritative DNS database, DNS resolvers would continue to ask for the correct data and flood the system with traffic. The bad data would also remain in resolvers' caches after the error was fixed, until the record's time-to-live (TTL) expired. Many Web sites have a TTL of one day, which means bad data can live in DNS caches for up to 24 hours.
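The TTL behavior Galvin describes can be illustrated with a toy cache. The sketch below is a simplified model, not how any real resolver or Facebook's system is implemented: the `TtlCache` class, the `zone` dictionary standing in for an authoritative server, and the addresses (taken from the reserved documentation range) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    value: str
    expires_at: float  # time in seconds after which the entry is stale


class TtlCache:
    """Toy resolver cache: serves a stored answer until its TTL runs out."""

    def __init__(self, authoritative: dict, ttl_seconds: float):
        self.authoritative = authoritative  # stands in for the authoritative server
        self.ttl_seconds = ttl_seconds
        self.entries = {}

    def resolve(self, name: str, now: float) -> str:
        entry = self.entries.get(name)
        if entry is not None and now < entry.expires_at:
            # Cache hit: returned even if the authoritative data has changed.
            return entry.value
        # Cache miss or expired entry: ask the authoritative source again.
        value = self.authoritative[name]
        self.entries[name] = CacheEntry(value, now + self.ttl_seconds)
        return value


DAY = 24 * 60 * 60  # one-day TTL, as in the article's example

zone = {"www.example.com": "192.0.2.66"}  # bad record published by mistake
cache = TtlCache(zone, ttl_seconds=DAY)

first = cache.resolve("www.example.com", now=0)      # bad record gets cached
zone["www.example.com"] = "192.0.2.10"               # error fixed at the source
one_hour_later = cache.resolve("www.example.com", now=3600)
after_expiry = cache.resolve("www.example.com", now=DAY)
```

In this model the cache keeps returning the bad address an hour after the fix and only picks up the corrected record once the one-day TTL has elapsed.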