CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions


April 22, 2003

The Sleeping Bag Solution

An outage plan can help your organization manage unplanned application downtimes in a controlled and efficient manner

by Robert Northrop

I clearly remember the day I began consulting for eBay: It was March 2000, during the height of the Internet boom, and I was at the mecca of e-commerce companies. While receiving my initial tour of eBay's facilities, I was struck by the sleeping bags under every cubicle desk. Not wanting to seem intimidated, I remarked, "I didn't know that housing in Silicon Valley was that hard to come by."

"Oh, those," my guide chuckled. "We really don't use the sleeping bags much anymore. They're only needed if we have unplanned site outages."

The message was simple, but powerful: "If there's an outage, we don't leave." The events of that day piqued my interest in how organizations can better plan for unexpected outages. I have observed organizations that spend hours or days pointing fingers while critical systems are unavailable, and I believe that a more sensible approach is to deal with outages in a controlled, planned, and well-executed manner.

Every organization needs to plan for system outages. Disaster recovery plans have become commonplace over the past two years; however, true disasters, such as the loss of an entire data center, are thankfully rare. Unplanned application outages, on the other hand, are far more frequent. Each one has less impact, is less critical, and receives less attention than a disaster, but if these outages aren't handled properly, their cumulative effect can exceed that of a catastrophe.

Why Plan For the Unplanned

Application uptime is critical for businesses to operate successfully. The effects of outages extend from quantifiable elements, such as lost sales, increased overtime, and loss of productivity, to long-term factors, such as loss of customer loyalty and diminished employee morale.

And while loss of customer loyalty justifiably receives a lot of attention, you can't afford to overlook the effect unplanned outages can have on employees. Without a plan, employees who may not be adept at high-pressure problem diagnosis often work in isolation, squandering time on misdirected diagnoses, communications, and solution attempts.

The interruptions to their work push projects off schedule, often resulting in a shortened testing cycle that nearly ensures additional outages in the future. Employees are constantly firefighting and can't make headway on their current deadlines. Developing an approach for dealing with unplanned application outages is beneficial not only for satisfying customers but also for improving employee productivity.

Constructing an Outage Plan

Despite your best efforts to prevent unplanned outages, you can't avoid them. Every organization should formulate a plan for resolving unexpected outages with minimal disruption. The components of the plan should include preparation, roles, rules, and processes.

Preparation. "Be prepared" is an appropriate motto for both Boy Scouts and organizations with mission-critical systems. Although each organization may prioritize preparation measures differently, common elements should include:

  • A physical center where outages will be handled. This facility should include Internet access, telecommunications, meeting facilities, and diagnostic equipment.
  • An outage response team (see "Outage response team roles" in the next section). Redundant communication mechanisms should be in place for contacting this team.
  • Contact information. You need to be able to reach not only the outage team, but also third-party vendors and service providers.
  • Tools to monitor application log files, execution data, and statistics. When configured to watch for suspect events, these tools can help diagnose outages and prevent future ones.
  • Organization change-control policies. You'll also need a list of recent systems changes with associated "rollback" procedures.
  • A listing of applications. You'll also need to include each application's purpose, service-level targets, and importance.
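
As a rough illustration of the log-monitoring item above, the following Python sketch counts "suspect events" in a batch of log lines and reports which kinds crossed a threshold. It is a minimal sketch rather than a recommended tool: the pattern names, the regular expressions, the threshold, and the sample log lines are all hypothetical and would need to be tuned to an organization's own applications.

```python
import re
from collections import Counter

# Hypothetical suspect-event patterns; these are assumptions for
# illustration, not patterns taken from any particular product.
SUSPECT_PATTERNS = {
    "error": re.compile(r"\b(ERROR|FATAL)\b"),
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "out_of_memory": re.compile(r"out of memory", re.IGNORECASE),
}


def scan_log(lines):
    """Count how many lines match each suspect pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def alerts(counts, threshold=2):
    """Return the sorted names of patterns seen at least `threshold` times."""
    return sorted(name for name, n in counts.items() if n >= threshold)


# Made-up sample log lines, used only to exercise the functions.
sample = [
    "2003-04-22 10:01:07 INFO order service started",
    "2003-04-22 10:02:13 ERROR database connection timed out",
    "2003-04-22 10:02:14 ERROR database connection timed out",
    "2003-04-22 10:02:20 WARN retrying request",
]

print(alerts(scan_log(sample)))
```

In practice the alert list would feed whatever mechanism the organization uses to reach the outage response team; counting against a threshold, rather than alerting on every match, is one simple way to avoid paging the team for isolated events.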






