Authorize.net categorizes downtime events as 'a perfect storm'

By Peter Smith

Authorize.net has posted an explanation of what happened over the weekend to bring its services down. Users of the service can read it from the 'Announcements' menu in the Authorize.net dashboard.

In the document, the company calls the situation a "perfect storm" of events. The fire at Fisher Plaza broke out late at night (11:10 pm PT on July 2nd), at the start of a long holiday weekend when many Authorize.net IT engineers were away, and it took time to get them all back to work on the problem. The Seattle Fire Department wouldn't allow the backup generators to run because of their proximity to the fire, nor would it allow customers into the damaged building to access their hardware. These factors were outside of Authorize.net's control.

Of more concern is the question of a backup data center. Authorize.net states that it was approaching the capacity of its current backup data center and was in the midst of transitioning to a new one: a true "hot" site (that is, one kept in real-time synchronization with the primary), so that the Authorize.Net platform could be switched from one data center to the other "on the fly." When the fire took out the primary data center, the company attempted to fail over to the new, still-in-testing backup data center and encountered "a number of unanticipated errors." It offers no explanation of why it tried to fail over to the new backup data center rather than the old (presumably well-tested) one.

The document finishes with a section entitled 'Lessons':

Even as our engineering and operations teams continue to ensure normal operations, the postmortem process is already under way. We are examining all aspects of this outage and implementing steps to mitigate future risks. Over the next weeks, we will be completing the work to ensure that we have two fully functional, synchronized hot sites. Failing over from one to the other will occur in a matter of seconds. Steps are also being taken to ensure that we have the ability to implement emergency communication by distributing our voice, e-mail and Web capabilities across multiple sites.

Over the next days and weeks the postmortem will continue. Processes will be refined and further protections put into place.

While Monday morning quarterbacking is always easy, it seems that some mistakes were made in the handling of the backup data center. It's unclear whether the old backup center was no longer live, or whether the engineers simply decided that the new one was 'ready enough' to fail over to. At the same time, having been in that kind of position, I know that the engineers were under tremendous pressure and were doing their best to come up with solutions that would get services back online as soon as possible. The more egregious issue is that Authorize.net had no other way to keep in touch with customers. When the fire broke out, all of the Authorize.net websites went down. Eventually the company opened a Twitter account, and for some time that was its only means of getting information out to customers who were losing revenue because of the downtime.

Peter Smith writes about personal technology for ITworld.
