The Google engineers offered a number of techniques for mitigating slow performance from individual nodes, such as breaking jobs into smaller components and better managing routine maintenance tasks.
But they admitted that only so much can be done to reduce individual component latency. So the heart of the article focuses on techniques that minimize the effects of such variability. In much the same way that reliable, fault-tolerant systems can be built from somewhat less reliable components, a consistently performing cloud system can be built from somewhat less consistently performing end nodes.
One technique they describe is "hedged requests," in which duplicate requests are sent to multiple servers and the first response to arrive is the one that is used. Another technique is to set up micro-partitions, or multiple partitions on each machine, which allows finer-grained load balancing. A third technique is "latency-induced probation," in which slow servers are quickly identified and temporarily assigned no additional work.
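The hedged-request idea can be sketched in a few lines of Python. This is a minimal illustration under simulated conditions, not Google's implementation: `query_replica` and the hard-coded delays are hypothetical stand-ins for real RPCs to replica servers, and in practice the duplicate request is often deferred briefly (e.g., until the first request has been outstanding longer than typical) to cap the extra load.

```python
import concurrent.futures
import time

# Hypothetical stand-in for an RPC to one replica of the data.
# The delay simulates that replica's (variable) response time.
def query_replica(replica_id, delay):
    time.sleep(delay)
    return f"response from replica {replica_id}"

# Hedged request: issue the same request to several replicas and
# use whichever response comes back first, ignoring the rest.
def hedged_request(delays):
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(query_replica, i, d)
                   for i, d in enumerate(delays)]
        # as_completed yields futures in completion order, so the
        # first item is the fastest replica's answer.
        first = next(concurrent.futures.as_completed(futures))
        result = first.result()
        # Best effort: cancel the slower duplicates (futures already
        # running cannot be interrupted, but queued ones are dropped).
        for f in futures:
            f.cancel()
        return result

# Even though replica 0 is slow (a simulated straggler), the caller
# gets the fast replica's answer.
print(hedged_request([0.5, 0.01, 0.3]))
```

The trade-off is extra load from the duplicated work, which is why the technique pairs naturally with the load-balancing and probation measures described above.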
These techniques should "allow designers to continue to optimize for the common case while providing resilience against uncommon cases," the Google engineers wrote.
Stoica noted that many of the techniques that Google described would be applicable to smaller IT operations as well, "though the effect would not be as pronounced as in a large deployment such as Google's."