Saturday, July 4, 2009

How one CTO avoided a Web site disaster after data center fire

Most Seattle geeks probably didn't think they'd be spending a portion of their 4th of July holiday dealing with broken Web sites, backup generators and damaged servers. But the small fire at the Fisher Plaza data center in downtown Seattle late last night knocked a number of sites offline for most of Friday, raising questions on TechFlash about how companies handle disaster planning and server co-location.

We first learned of the problem around 1 a.m. when Seattle-based Redfin posted a message on Twitter noting that its real estate site was offline because of problems at the data center. But by 4 a.m. Redfin's site was back online, purring along while other sites struggled.

We asked Redfin CTO Michael Young how they avoided the catastrophic failure that other sites are experiencing today. Turns out, the company learned some important lessons after a similar electrical fire hit the same data center last June.

Here's what Young told TechFlash today.

"We were pretty embarrassed last June when Adhost had a similar electrical fire and took our site down for 8 hours (well into our core business hours) with brown-outs a day or two after that had us scrambling. 'Fool me once, shame on you; fool me twice, shame on me' resonated in our brains.

"So by October 2008, we basically instituted a disaster avoidance plan where we had redundant-everything for our mission-critical databases, servers and networks in separate buildings.

"When the problem happened last night, our beepers went off, we saw what looked like a major outage in one building, and were able to switch to the redundant systems.

"Everything was up and running by 4am PST / 7am EST, well before our core business hours. We’re a startup, but we try to maintain high standards in our datacenter operations without spending too much money. The failover didn’t happen at the push-of-a-button, but the disaster planning paid off for us."

Young's explanation is interesting given that many sites -- including high-profile consumer-oriented sites such as AllRecipes, Bing Travel and Big Fish Games -- have been offline most of the day.

I have a feeling there will be some high-level meetings with CTOs, IT administrators and co-location operators on Monday discussing some of the ways to make sure this doesn't happen again.

I asked Young -- who was up at 5 a.m. dealing with the situation -- why other, larger companies didn't appear to have a similar plan in place.

"It's hard to get every single point of failure," said Young. "And most people need to be burned once, like us."

[Flickr photo via Jamison_Judd]

