Don’t blame DNS for Facebook outage, experts say

Facebook gave little detail about the cause of its outage

By Network World staff, Network World | Networking, DNS, Facebook

Experts in the inner workings of the Internet's Domain Name System, which matches IP addresses with corresponding domain names, say the 27-year-old communications protocol does not appear to be the cause of Facebook's high-profile outage last week.

Facebook's service was unavailable to its 500 million active users for 2.5 hours on Thursday -- the company's worst failure in more than four years. Initial news reports blamed the outage on DNS because end users received a "DNS error" message when they couldn't reach the site.

"There's probably a lesson here that the problem at various times looked like DNS, but ultimately proved not to be," said Cricket Liu, vice president of architecture at Infoblox, which sells DNS appliances. "In my experience, users are quick to point fingers at DNS (perhaps because Web browsers like to implicate DNS when they can't get somewhere) but DNS often isn't at fault."

Facebook gave little detail about the cause of the outage except to say that it was the result of a misconfiguration in one of its databases, which prompted a flood of traffic from an automated system trying to fix the error.

"We made a change to a persistent copy of a configuration value that was interpreted as invalid," explained Robert Johnson in Facebook's blog post about the incident. "This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries per second."

The feedback loop created so much traffic that Facebook was forced to turn off the database cluster, which meant turning off the Web site.
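The loop Johnson describes can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Facebook's actual system: the function name `simulate`, the notion of a "tick," and the client and capacity numbers are all hypothetical. It shows why load drains away when the persistent copy holds a valid value, but never subsides when the persistent copy itself is invalid.

```python
def simulate(num_clients, db_capacity, persistent_value_is_valid, ticks=5):
    """Return the number of database queries attempted on each tick.

    Hypothetical model: every client starts out seeing an invalid
    configuration value and queries the database cluster to fix it.
    """
    clients_needing_fix = num_clients
    queries_per_tick = []
    for _ in range(ticks):
        attempted = clients_needing_fix
        queries_per_tick.append(attempted)
        # The cluster can answer at most db_capacity queries per tick.
        answered = min(attempted, db_capacity)
        if persistent_value_is_valid:
            # Answered clients receive a good value and stop querying.
            clients_needing_fix = attempted - answered
        # Otherwise even an answered query returns the invalid value,
        # so every client tries again on the next tick.
    return queries_per_tick


if __name__ == "__main__":
    print(simulate(1000, 400, persistent_value_is_valid=True, ticks=4))
    print(simulate(1000, 400, persistent_value_is_valid=False, ticks=4))
```

With a valid persistent value, the backlog shrinks each tick as clients are served. With an invalid one, the query rate stays pinned at the full client count, which is consistent with the article's account that the cluster had to be taken offline before the loop would stop.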

"Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site," Johnson said. He added that "for now we've turned off the system that attempts to correct configuration values."

Experts said the Facebook outage was not the result of DNS because they were able to partially log in to the site during the incident, which would have been impossible if DNS had been to blame.
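The experts' reasoning rests on separating two steps: resolving a name to an address, and then reaching the service at that address. A minimal sketch of that check follows, using Python's standard `socket` module. The function name `diagnose` and its return labels are hypothetical, and a successful TCP connection only shows that name resolution and basic reachability worked, not that the application behind it is healthy.

```python
import socket


def diagnose(host, port=80, resolve=socket.getaddrinfo,
             connect=socket.create_connection, timeout=3.0):
    """Classify a failure as DNS-related or not (illustrative only).

    The resolve and connect parameters default to the real socket
    functions; they are injectable so the logic can be exercised
    without network access.
    """
    try:
        # Step 1: name resolution. A failure here points at DNS.
        resolve(host, port)
    except socket.gaierror:
        return "dns-failure"
    try:
        # Step 2: the name resolved, so any failure now lies elsewhere.
        conn = connect((host, port), timeout=timeout)
    except OSError:
        return "resolved-but-unreachable"
    conn.close()
    return "resolved-and-reachable"
```

If a user can get as far as a partial login, the first step has already succeeded, which is why the experts quoted here ruled out DNS as the root cause.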

"It looked like it was a configuration issue on their end with state information - what was cache versus what was authoritative," said Richard Hyatt, co-founder and CTO of BlueCat Networks, which sells DNS appliances and a cloud-based managed DNS service. "I think it was a configuration on their end that might have been connected to DNS...but they were very vague about it."

Infoblox attributed the Facebook outage to a problem with change management in a blog post on Friday.

"When dealing with network change and configurations, organizations must be more proactive in testing, validating and monitoring the ongoing status of both the critical network infrastructure (routers, switches, firewalls, etc.) as well as the Web services (applications, databases and servers)," Infoblox said, referring to change management as "the biggest problem for IT teams worldwide."

"It's not at all clear that Facebook had a DNS problem. There's no indication of that from the official information they have published," said Jim Galvin, director of strategic relationships and technical standards at Afilias, which operates more than a dozen top-level domains including .info and .org. "It's pretty clear they had a distributed database problem...From our point of view, it's an issue of how to make sure you have good data in the database to start with."

Galvin said the problem Facebook experienced was akin to a distributed denial-of-service attack, in which a Web site is overwhelmed by traffic generated by an attacker. In Facebook's case, the excess traffic was created by its own automated system for verifying configuration values.


Originally published on Network World.
