Salesforce.com's Heroku frustrates rapidly growing Rap Genius
Rap Genius says changes in Heroku's queuing practices are for the worse
One of the largest customers of Salesforce.com's Heroku has charged that the performance of the platform as a service (PaaS) has degraded since Heroku re-engineered the way in which it queues work jobs.
As a result, the customer -- Internet service Rap Genius -- would have to pay considerably more in service fees to achieve the same level of responsiveness it enjoyed before the change.
In a blog post, Rap Genius engineer James Somers attributes the issue to a new work-assignment protocol that Heroku adopted in 2010 but barely documented.
Heroku declined to comment on Somers' report, other than to say, in a statement from a spokesman, that "We are working hard to get to the bottom of this situation and give our customers a clear and transparent understanding of our next steps."
Heroku also promised to post updates on the issue on its blog site.
The Rap Genius service, which runs on a Heroku-hosted Ruby on Rails platform, now has nearly 15 million monthly visitors. Commissioning multiple virtual servers, Rap Genius racks up a monthly bill from Heroku of about US$20,000, though the company is happy to pay because "we don't want to manage infrastructure, we want to build features," Somers wrote.
Rap Genius discovered the issue through a discrepancy in reported response times. According to Heroku's logs, serving a static Web page to a user would take about 40 milliseconds, whereas Rap Genius' own reports estimated a much lengthier response time of 6,330 milliseconds.
Heroku's services are actually hosted by Amazon Web Services, though Heroku itself provides the routing layer and customer interface. For deployments that run on multiple duplicate virtual servers, user requests go to Heroku's routing mesh, which uses load-balancing technologies to distribute them among the servers.
When Rap Genius first started using Heroku, Somers wrote, Heroku used what it called intelligent routing, in which a router would choose which server to send an incoming request to based on the amount of existing work that each server was already doing. Those servers with no work or the least amount of work would get the new requests, so the job wouldn't be slowed by overloaded servers.
"In mid-2010, Heroku redesigned its routing mesh so that new requests would be routed, not to the first available [server], but randomly, regardless of whether a request was in progress at the destination," Somers wrote. He also said that Heroku notes this change in only some of its documentation.
Somers charged that the new approach meant that, overall, the service took longer to respond to users, and that the servers Rap Genius procured were not being used as efficiently as they could have been. He likened the change to a grocery store that randomly assigned customers to check-out lines, rather than allowing each customer to choose the shortest line.
"How much longer would it take to get out of the store? How much more time would the checkout clerks spend idling?" Somers wrote.
The upshot is that customers such as Rap Genius now have to procure more virtual servers, or Dynos, as Heroku calls them, to execute the same amount of work within the same response time. Somers estimated that a workload that could be handled by 80 servers with the old intelligent routing would require as many as 4,000 servers with the new random routing.
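The effect Somers describes can be illustrated with a small simulation. The sketch below is not Heroku's router, and every number in it is an assumption chosen for illustration: each server handles one request at a time, most requests are fast and a few are slow, and the "least_busy" policy stands in for intelligent routing as the article describes it. The function names `simulate` and `make_workload` are hypothetical.

```python
import random


def simulate(num_servers, arrivals, durations, policy, rng):
    """Return the mean time a request waits before a server starts on it."""
    # free_at[s] is the time at which server s finishes its current backlog.
    free_at = [0.0] * num_servers
    total_wait = 0.0
    for arrive, duration in zip(arrivals, durations):
        if policy == "random":
            # Random routing: pick any server, busy or not.
            server = rng.randrange(num_servers)
        else:
            # "least_busy": pick the server that will be free soonest.
            server = min(range(num_servers), key=lambda s: free_at[s])
        start = max(arrive, free_at[server])
        total_wait += start - arrive
        free_at[server] = start + duration
    return total_wait / len(arrivals)


def make_workload(num_requests, mean_gap, rng):
    """Build arrival times and service durations (all values are assumed)."""
    arrivals, durations, now = [], [], 0.0
    for _ in range(num_requests):
        now += rng.expovariate(1.0 / mean_gap)
        arrivals.append(now)
        # Assumption: 5% of requests are slow (3 s), the rest fast (50 ms).
        durations.append(3.0 if rng.random() < 0.05 else 0.05)
    return arrivals, durations


if __name__ == "__main__":
    arrivals, durations = make_workload(20000, 0.04, random.Random(1))
    for policy in ("random", "least_busy"):
        wait = simulate(10, arrivals, durations, policy, random.Random(2))
        print(f"{policy}: mean wait {wait:.3f} s")
```

Under these assumed numbers, a fast request sent at random to a server already working on a slow one has to wait behind it, even while other servers sit idle. That is the grocery-store effect in Somers' analogy. The size of the gap depends heavily on the workload mix, so the sketch should not be read as reproducing Rap Genius' own figures.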
While Heroku did not comment on the specifics of any changes it has made to its routing practices, industry observers took note.
"If this is true it is pretty bad," wrote John Mount, a consultant at data science consulting firm Win-Vector LLC, in a blog post about the issue. "Randomized routing is very bad with near certainty."