In this heat map, a black box indicates that an individual host is down. A green box means the host is performing adequately, and a red box means that some aspect of its operation, such as the number of timeouts, is beyond acceptable levels. Beyond providing a visual overview of operations, Claspin lets users drill down to specific metrics by hovering the mouse pointer over a specific host.
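The color scheme can be illustrated with a minimal Python sketch. This is not Claspin's code: the `HostStatus` type, the `cell_color` function, and both timeout thresholds are hypothetical, and the intermediate yellow state is inferred only from Lynch's mention of "yellows."

```python
from dataclasses import dataclass

# Hypothetical thresholds; Claspin's real statistics and limits are not public.
WARN_TIMEOUTS = 10
CRIT_TIMEOUTS = 50


@dataclass
class HostStatus:
    name: str
    is_up: bool
    timeouts: int  # one example metric; a real tool would track several


def cell_color(host: HostStatus) -> str:
    """Map one host's state to the color of its heat-map cell."""
    if not host.is_up:
        return "black"   # host is down
    if host.timeouts >= CRIT_TIMEOUTS:
        return "red"     # metric is beyond acceptable levels
    if host.timeouts >= WARN_TIMEOUTS:
        return "yellow"  # assumed warning state between green and red
    return "green"       # host is performing adequately


def render_row(hosts: list[HostStatus]) -> list[str]:
    """Return the cell colors for a row of hosts, in order."""
    return [cell_color(h) for h in hosts]
```

Checking the down state first matters in a sketch like this, because a host that is down reports no meaningful metrics and should not be colored by stale timeout counts.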
Thus far, the tool has been a success within Facebook, Lynch noted. Additional engineering groups have started using the heat maps to watch over their own servers. More servers also seem to be operational now that engineers can consult Claspin.
"When I first deployed Claspin, the view above had a lot more red in it. By making it easier for more people to spot server issues quickly, Claspin has allowed us to catch more 'yellows' and prevent more 'reds,'" Lynch wrote. "I suppose there's no better validation of one's choice of statistics and thresholds than to have things start out red and then turn green as the service improves."
Facebook is considering releasing Claspin as open source, as it has done with numerous other internally developed tools. Releasing it as a stand-alone application might prove difficult, however, given how deeply connected it is with other Facebook-specific infrastructure, according to the company.