NOBODY MISSES THE NETWORK -- UNTIL IT'S DOWN. When a corporate network crashes, its damaging ramifications often sweep across national headlines, wreaking havoc on the company's bottom line and throwing a fiery spotlight on its CIO.
Indeed, too often the blame for what is mechanical error is dumped on the CIO, the chief architect, who has to spin an explanation for why things went horribly wrong. Just ask Maynard Webb, former CIO at Gateway and now president of eBay Technologies. He's in charge of the network that drives the Silicon Valley-based online auction house.
Last year, before Webb's arrival, eBay experienced a 22-hour outage that deflated the company's market cap by $2.25 billion in a single day and pinched quarterly sales. "There are huge challenges facing any company growing as fast as us," says Webb, who was brought on board to ensure such debacles won't happen again. "We got a little behind the curve in our capacity and operational excellence."
eBay's infamous black eye is today's rallying cry for the importance of network uptime in an accelerated electronic world. Companies are spending exorbitant amounts of capital and manpower to strengthen their networks. Still, the potential for a network outage looms on the horizon like a deadly offshore hurricane. No network is safe from disaster; however, CIOs can learn a few tips from companies that have endured head-on collisions with network downtime, and thus improve their chances for survival.
Previously, CIOs relied on systems management tools from Tivoli Systems, Computer Associates International, Hewlett-Packard and others to safeguard their networks, but the stakes have changed with e-commerce. Suddenly, networks are wider and subject to more weak links. Network downtime is now immediately apparent to customers and business partners. The cost of network downtime has risen exponentially. And it's difficult to measure downtime's effects on sales, market branding, customer loyalty and competition -- all moving targets in the new internet economy.
San Jose, Calif.-based market researcher Infonetics predicts U.S. corporations will spend $11.2 billion on network and systems management products in 2003, spurred by trends toward virtual private networks, network security and e-commerce.
The stakes are high for everyone, not just for eBay and the rest of the dotcoms. Infonetics surveyed companies with average annual revenues of $3 billion and found that they lost an average of $4 million annually because of local area network downtime. For wide area networks, companies reported an average annual loss of $3.3 million to downtime. (These figures primarily represent lost employee productivity and don't include losses due to customers being unable to access business services.)
Another key finding of the survey: Companies are allocating more time toward planning and designing their networks. On average, respondents expect to increase their networking staffs to a total of 48 people (up from 39) by mid-2001. "We're seeing people becoming proactive with their networks," says Mike McConnell, analyst at Infonetics. "That's great because it'll save them in the long run."
Dangers lurk inside every cubicle and every networking device. Attaining 99.999 percent availability -- five nines is network uptime's equivalent of the Holy Grail -- is a tedious and toilsome lifetime pursuit. Failure points include faulty software in routers and switches, traffic surges that crash servers, human errors, configuration problems, power failures, major carrier outages and even the applications that run on the networks.
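The five-nines target translates into hard numbers: each added nine cuts the allowable downtime tenfold. The arithmetic is standard; the code below is just a quick back-of-the-envelope sketch of it:

```python
# Allowed annual downtime for common availability targets ("nines").
# At 99.999 percent uptime, a network may be down only about five
# minutes per year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year at a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for nines, availability in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
    print(f"{nines} nines ({availability:.3%}): "
          f"{downtime_minutes_per_year(availability):8.2f} min/year")
```

Five nines works out to roughly 5.3 minutes of downtime a year; even four nines allows less than an hour, which is why the pursuit is described as a lifetime one.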
It was the application more than the network that bedeviled eBay. But the distinction doesn't matter to CNN or the auction community, Webb found. After more than 20 years as an IT executive at various high-tech companies including Bay Networks, Quantum, Gateway and IBM, Webb accepted the chief technology post at eBay during a precarious time. eBay was in the midst of networking throes that threatened the company's future. "We are confident Maynard possesses the vision and hands-on experience to help ensure eBay's site stability moving forward and to help scale eBay's system," said CEO Meg Whitman, at the time of Webb's hiring.
Webb, too, felt the immense pressure in his new role. He recalls visiting eBay's headquarters shortly after the company's network went down. "I pulled up to eBay and saw CNN and CNBC outside, and it gave me a sense of the kind of attention any blip might receive," says Webb. "I hadn't been announced yet and no one knew who I was, so I walked through the front door like a salesperson -- nobody pays attention to those guys."
In Webb's first few months, he implemented a warm backup solution -- one that allows network recovery within four hours -- by increasing redundancy on NT servers, routers, switches and RAID drives. He then built a hot backup solution -- basically, a running duplicate of major systems. This reduced his recovery window to within an hour. Webb assembled an eight-person IT group dedicated to evaluating and implementing next-generation networking architectures.
The IT group's biggest challenge was adding resiliency to eBay's crown jewel: a massive database that stores 4.2 million auction items appearing concurrently on the website. However, this impressive number hides the fact that a lone database corruption can knock out the entire network. The server driving the database was also fast approaching architectural limits. "Our database is a pretty big single point of failure," Webb says. "We just can't throw more hardware at it anymore."
After evaluating possible fixes, such as migrating the database to a more scalable hardware platform or tweaking the web application to relieve some of the pressure, Webb decided to split the database. He created separate databases to handle separate auction categories such as antiques and sports memorabilia. By deploying multiple databases, Webb bought back much-needed headroom. If eBay encountered a database corruption, chances are only one database would go down -- and hopefully, only the people participating in auctions in that database would notice. "I like my life to be very boring," Webb says.
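The article doesn't detail eBay's implementation, but the idea of splitting by category amounts to routing each listing to the database that owns its category. Everything below -- the category names, the dictionaries standing in for databases, the helper functions -- is hypothetical illustration, not eBay's code:

```python
# A minimal sketch of category-based database partitioning, in the
# spirit of the split described above. Each dict stands in for a
# separate physical database.

CATEGORY_SHARDS = {
    "antiques": {},
    "sports-memorabilia": {},
    "collectibles": {},
}

def shard_for(category: str) -> dict:
    """Route an auction item to the database for its category."""
    try:
        return CATEGORY_SHARDS[category]
    except KeyError:
        raise ValueError(f"no database configured for category {category!r}")

def list_item(category: str, item_id: int, title: str) -> None:
    """Store a listing in the shard that owns its category."""
    shard_for(category)[item_id] = title

list_item("antiques", 101, "Victorian writing desk")
list_item("sports-memorabilia", 202, "Signed baseball")

# Simulate a corruption taking out one database: only that category
# is affected; listings in the other shards survive.
CATEGORY_SHARDS["antiques"].clear()
```

The payoff is exactly the containment Webb describes: a failure in one shard leaves the auctions in every other shard untouched.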
The Bane of Bandwidth
Of course, it takes more than mere database partitioning to prevent network downtime. SBC Communications, a telecommunications service provider, has made networking a way of life. More than 200,000 SBC employees send 66 terabytes of data over a global network every month, including 16 million e-mail messages a week. The traffic flies over an ATM backbone linking 40 major network centers around the world. All tallied, SBC's networking hardware consists of 4,400 routers, 6,300 hubs and 1,100 switches. "Our internal business network is truly the lifeline of our corporation," says Ed Glotzback, CIO at SBC. "It connects our applications, our data centers and our employees."
Such a massive network is bound to run into problems. Glotzback admits his company has experienced a few outages. The culprit is often traced back to a dual failure in a redundant system, an initial network design flaw or quirky third-party software. Consequently, SBC's IS staff works closely with vendor development groups and conducts rigorous tests on individual products and integrated systems. These efforts, albeit costly and time-consuming, have helped reduce potential network threats, Glotzback says.
Companies also overlook basic physical requirements when designing a network, according to Glotzback. Properly grounding data cabinets and other equipment can save hours of downtime and lost productivity, he says, as can utilizing multiple commercial power feeds.
Network design is not a perfect science, either. While SBC uses only the internet protocol for wide area networking -- limiting Novell's IPX and Apple Computer's AppleTalk to local segments -- other companies prefer running multiple wide area networking protocols and platforms as safety nets.
Sandy Goldstein, CIO at Airgas Inc., a Radnor, Pa.-based distributor of specialty gases, learned about the advantages of many protocols the hard way. Airgas relied on MCI Worldcom's frame relay network to connect 700 branch offices throughout North America. In August 1999, the frame relay network went down.
"We had 10 days of zero access to our network," recalls Goldstein. "My employees were demoralized, my customers angry." Airgas's contingency plan centered on slow remote dial-up and other arcane technology, making it almost impossible for employees to perform their duties.
Moreover, Airgas's field employees, dependent on handheld Sprint PCS telephones to conduct business transactions and check on orders, couldn't get a dial tone. "If you're doing business with MCI Worldcom, keep in mind other carriers are affected," says Goldstein. "You're never really sure who's sharing whose wire." Even internal redundant systems can be rendered useless if they utilize the same egress or share the same power supply.
Airgas survived the outage, and MCI Worldcom issued an official apology. Not surprisingly, Airgas's board of directors demanded Goldstein develop a more flexible backup strategy.
Balancing risk and exposure, Goldstein decided to keep MCI Worldcom as his sole long-distance service provider. A second carrier isn't a panacea: it increases the management challenge, and if one carrier went down, half of his workforce would still be exposed. For local connectivity, however, Goldstein took the opposite tack, ordering every branch office to contract with at least two ISPs as a form of redundancy.
Goldstein also upgraded routers and deployed modems supporting multiple technologies such as internet, serial tunneling, ISDN and DSL lines. "We no longer rely solely on frame relay," he explains. "Should the frame go down again, we'll switch over to another networking platform." The end result of Airgas's wake-up call was "half a million dollars in hardware investments, an additional $80,000 in new dedicated communications, some more extra expenses and an insurance policy," Goldstein says.
All the hardware and software technology investments in the world won't safeguard a network against human errors. This sentiment is echoed by John Carrow, CIO at Unisys Corp., who believes most outages are caused by employee blunders.
Case in point: Unisys's financial staffers were busily closing the August books last year -- which also was the end of the company's third quarter -- when the system crashed. A Unisys employee conducting a maintenance check at a center in Plymouth, Mich., unknowingly disabled the local area network of the finance department in Blue Bell, Pa. The network was shut down for several hours.
It was a simple mistake, concedes Carrow, adding that because of what the employee learned, "that person will never make that mistake again." Unisys has spent considerable effort developing training programs, discipline programs and configuration control methods -- a sort of cross-training for employees in disparate roles. "Companies can't afford to have telecom-only or data-center-only people anymore," he says. "You need people thinking end-to-end systems all the time, especially with the interdependency you have today."
Network outsourcer Intria-HP, based in Toronto, also emphasizes the importance of the human element and its impact on high-availability networking. The company, a joint venture between Intria and Hewlett-Packard Co., has a network that supports 14,000 branches, 2,000 local area networks and 120,000 point-of-sale terminals. The network has three central sites that triangulate information for backup and recovery purposes. Remote devices ping the network and perform extensive business functions over the internet in order to measure latency from the end-user point of view.
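Measuring latency "from the end-user point of view" generally means timing a full synthetic transaction rather than issuing a bare network ping. A minimal sketch of such a probe follows; the `sample_transaction` stand-in and its timing are hypothetical, not Intria-HP's actual monitoring code:

```python
# A minimal synthetic-transaction probe: time an end-to-end business
# operation as a user would experience it, not just network round-trip.
import time

def probe_latency_ms(transaction) -> float:
    """Run one transaction and return its wall-clock latency in ms."""
    start = time.perf_counter()
    transaction()
    return (time.perf_counter() - start) * 1000.0

def sample_transaction():
    # Stand-in for a real business function (log in, place an order,
    # query a balance). Here we just sleep 10 ms.
    time.sleep(0.01)

latencies = sorted(probe_latency_ms(sample_transaction) for _ in range(5))
print(f"median latency: {latencies[len(latencies) // 2]:.1f} ms")
```

Probes like this catch degradations that device-level monitoring misses, because they exercise the same path -- application, network and all -- that a customer does.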
The most important factor in keeping the network up-and-running is the people, says Mike Somerville, vice president of technology planning and technical services at Intria-HP. The company encourages its employees to find and fix network problems no matter how trivial they might seem.
For instance, Intria-HP's monitoring software revealed that an ATM backbone was experiencing a slight service degradation. The backup solution, which was another ATM backbone, didn't kick in. Similar to looking for a needle in a haystack, Intria-HP employees searched until they found the error: A circuit was pulsating every 3 milliseconds, which wasn't long enough to indicate a problem and alert the secondary system to take over.
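The failure mode is worth spelling out: failover logic typically waits for a fault to persist past a hold-down threshold before declaring a circuit dead, so a circuit that flaps every few milliseconds degrades service without ever tripping the backup. A sketch of that behavior under an assumed, purely illustrative threshold:

```python
# Why the secondary ATM backbone never kicked in: failover triggers
# only when a single fault outlasts a hold-down threshold. A circuit
# pulsing every 3 ms produces constant trouble but no qualifying fault.
# The 50 ms threshold below is an assumption for illustration.

FAILOVER_THRESHOLD_MS = 50.0  # hypothetical hold-down time

def should_fail_over(fault_durations_ms: list) -> bool:
    """Fail over only if some single fault outlasts the threshold."""
    return any(d >= FAILOVER_THRESHOLD_MS for d in fault_durations_ms)

flapping = [3.0] * 1000          # a thousand 3 ms pulses
print(should_fail_over(flapping))   # → False: backup never engages

hard_down = [120.0]              # one sustained outage
print(should_fail_over(hard_down))  # → True: failover triggers
```

A stricter monitor would also track *cumulative* fault time or flap frequency, which is effectively what Intria-HP's staff did by hand before pushing the vendor to fix the automation.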
After manually starting the backup, Intria-HP staffers called the vendor of the automation software. They asked the ISV to fix its product. At first, the software developer claimed that its product operated within normal parameters. Intria-HP countered that it wouldn't tolerate any downtime in any product that it uses. In the end, the software developer conceded. "Our customers don't want to see heroics, they want things that work," says Somerville. "Sometimes you need weight to get carriers and vendors to deal with you, perhaps in a different way from what they'd prefer."