July 06, 2009, 7:58 AM — Authorize.net has posted an explanation of what happened over the weekend to bring its services down. Users of the service can access the explanation from the 'Announcements' menu in the Authorize.net dashboard.
In this document, they call the situation a "perfect storm" of events. The fire at Fisher Plaza broke out late at night (11:10 pm PT on July 2nd) at the start of a long holiday weekend, when many Authorize.net IT engineers were away on holiday, and it took time to get them all back to work on the problem. The Seattle Fire Department wouldn't allow operation of the backup generators due to their proximity to the fire, nor would it allow customers into the damaged building to access hardware. These factors were outside of Authorize.net's control.
Of more concern is the question of a backup data center. Authorize.net states that it was approaching the capacity of its existing backup data center and was in the midst of transitioning to a new one: a true "hot" site (in other words, one kept in real-time synchronization), so that the Authorize.Net platform could be switched from one data center to the other "on the fly." When the fire took out the primary data center, they attempted to fail over to the new, still-in-testing backup data center and encountered "a number of unanticipated errors." They offer no explanation as to why they tried to fail over to the new backup data center rather than the old (presumably well-tested) one.
The document finishes with a section entitled 'Lessons':
Even as our engineering and operations teams continue to ensure normal operations, the postmortem process is already under way. We are examining all aspects of this outage and implementing steps to mitigate future risks. Over the next weeks, we will be completing the work to ensure that we have two fully functional, synchronized hot sites. Failing over from one to the other will occur in a matter of seconds. Steps are also being taken to ensure that we have the ability to implement emergency communication by distributing our voice, e-mail and Web capabilities across multiple sites.
Over the next days and weeks the postmortem will continue. Processes will be refined and further protections put into place.
While Monday-morning quarterbacking is always easy, it seems that some mistakes were made in the handling of the backup data center. It's unclear whether the old backup center was no longer live, or whether the engineers simply judged the new one 'ready enough' to fail over to. At the same time, having been in that kind of position, I know the engineers were under tremendous pressure and were doing their best to come up with solutions that would get services back online as soon as possible.