How one company monitors its Web service (Hint: A single tool doesn't cut it)

What's it take to keep a sizeable Web service running? If you ask content discovery platform developer Outbrain, the answer is lots of monitoring tools.

I set out to talk to Outbrain about how it uses Boundary, an IT monitoring tool that aims to connect application performance to infrastructure. I ended up getting an inside peek at the many tools a service like Outbrain uses to keep its operations running smoothly.

Outbrain's customers include CNN, NBC News, Reuters, Allstate and MarketWatch. Those customers embed Outbrain's widget in their Web sites to display recommended links to content for site visitors or to pull in content from other providers. Outbrain says its widgets are installed on 100,000 Web sites where they serve up 150 billion content recommendations a month.

Tools Outbrain uses to monitor and troubleshoot its service include Boundary, Graphite, New Relic, PagerDuty and Keynote. But it also uses a host of tools it created internally, said Shai Peretz, senior vice president of operations and IT at Outbrain.

There is "ongoing discussion" internally about the many tools the company uses, he said. "We try to get to the point where we don't look at any screens," he said, meaning it doesn't want admins to have to sit and watch dashboards, looking for trouble spots. "We're trying to move as much as we can into push mode so that if there's an issue the system detects it and reports to us."
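To make the "push mode" idea concrete, here is a minimal sketch of a check that compares current readings against limits and calls a notifier only when something is wrong, so nobody has to watch a dashboard. It is not Outbrain's code: the metric names, the limits and the `check_and_notify` helper are all hypothetical, and a real setup would hand the message to a paging service rather than append it to a list.

```python
from typing import Callable, Dict, List


def check_and_notify(
    metrics: Dict[str, float],
    limits: Dict[str, float],
    notify: Callable[[str], None],
) -> List[str]:
    """Call notify once per metric that exceeds its limit; return the messages sent."""
    sent = []
    for name, limit in limits.items():
        value = metrics.get(name)
        # Skip metrics that were not reported; alert only on a real breach.
        if value is not None and value > limit:
            message = f"{name}={value} exceeds limit {limit}"
            notify(message)
            sent.append(message)
    return sent


# Hypothetical readings and limits, for illustration only.
alerts: List[str] = []
sent = check_and_notify(
    {"error_rate": 0.07, "latency_ms": 120.0},
    {"error_rate": 0.05, "latency_ms": 500.0},
    alerts.append,  # stand-in for a real paging or messaging hook
)
```

Here only the error rate crosses its limit, so the notifier is called once and the latency reading stays silent.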

Outbrain added Boundary to the mix about six months ago and has found its second-by-second monitoring very helpful. In one recent instance, Outbrain noticed a network interface issue cropping up on many servers. "We saw this strange behavior but couldn't figure out what was wrong," Peretz said.

Using Boundary, Outbrain discovered that some network cards were shutting down from time to time for very short periods. Outbrain's other tools weren't showing the cards shutting down because they only display minute-by-minute activity, he said.
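A small sketch shows why a minute-by-minute view can hide a dropout like this. The numbers are made up for illustration (60 one-second throughput samples with a two-second drop to zero), and `find_outages` is a hypothetical helper, not part of Boundary or any other tool named here.

```python
from statistics import mean


def find_outages(samples, threshold=0.0):
    """Return (start, end) index pairs for runs of samples at or below threshold."""
    outages = []
    start = None
    for i, value in enumerate(samples):
        if value <= threshold:
            if start is None:
                start = i  # a new run of low samples begins
        elif start is not None:
            outages.append((start, i - 1))  # the run ended on the previous sample
            start = None
    if start is not None:
        outages.append((start, len(samples) - 1))  # run lasted to the end
    return outages


# Hypothetical data: one minute of per-second throughput in Mbit/s.
per_second = [100.0] * 60
per_second[30] = 0.0  # the card drops out for two seconds
per_second[31] = 0.0

minute_average = mean(per_second)       # what a one-minute rollup would report
outages = find_outages(per_second)      # what per-second data reveals
```

Averaged over the minute, the link still reports nearly 97 Mbit/s, which looks like ordinary fluctuation. Only the per-second samples expose the two seconds at zero.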

"With Boundary we could identify the behavior and we were able to resolve it," he said. It turned out a driver in a certain Linux kernel version was causing the problem.

Boundary also recently allowed Outbrain to figure out that its Cassandra databases were generating loads of traffic between data centers as they replicated. The company figured out the databases were the source of the traffic because Boundary allows users to analyze traffic inside a VPN, Peretz said. Ultimately, Outbrain was able to tweak its Cassandra configuration so it doesn't create so much traffic between data centers, he said.

Outbrain admins have one thing going for them. With staff in both Israel and the U.S., admins don't have to be on call responding to alerts through the night, Peretz said. "That's more tolerable," he said.

Read more of Nancy Gohring's "To the Cloud" blog and follow the latest IT news at ITworld. Follow Nancy on Twitter at @ngohring and on Google+. For the latest IT news, analysis and how-tos, follow ITworld on Twitter and Facebook.
