The Sleeping Bag Solution
An outage plan can help your organization manage unplanned application downtimes in a controlled and efficient manner
by Robert Northrop
Outage response team roles. Identifying and assigning members to an outage response team is a critical
component of any plan for handling unplanned downtimes. The sole duty of this team is to restore the
affected systems; the team should comprise the following personnel:
- Data center operations. The operations members handle day-to-day monitoring of applications and systems. If an outage occurs, they lead all resolution efforts by coordinating processes, escalation, and change tracking.
- Application development. These team members diagnose problems and potentially implement fixes for both in-house and externally developed applications.
- Technical operations. These members have expertise in networking, hardware, security, backup and recovery, and third-party software applications and are responsible for diagnosing failures in these areas.
- Vendor coordination. The coordinator is responsible for ensuring vendor accountability and acts as a single point of contact for any vendors brought in for assistance.
- Internal communications and support. The responsibilities of this role range from providing executive updates to managing team logistics.
- External communications and support. This role is found in organizations that need to coordinate external communications for audiences ranging from internal business users to a press corps camped outside the company entrance.
Depending on how frequently outages occur, a rotating on-call schedule may be appropriate. Some outages may
require additional support personnel. When bringing in additional resources, the team should take care to
minimize interruptions to unaffected business operations.
Rules and guidelines. Once an effective outage response team has been assembled, its members must agree on
rules and guidelines as part of their charter. I recommend the following:
- The accountability rule. The entire team is responsible for diagnosing and repairing the outage regardless of the suspected cause. This rule creates a spirit of collaboration, rather than one of accusation.
- The prisoner rule. The team is imprisoned by the outage and can only leave when the problems are resolved. This attitude is critical for recovering applications in a timely manner.
- The proximity rule. Team synergy and effective communication are promoted if the team works from the same location. If this is impossible, communication media ranging from video conferencing to Internet chat should be used.
- The fix-it-first rule. Many organizations go astray with outage analysis paralysis. Although learning as much as possible to prevent similar future outages is important, this knowledge shouldn't be acquired at the cost of additional application downtime.
A process framework. In general, once an unplanned application outage occurs, a customized version of this process
should be followed:
- Identify the outage and affected resources. Many organizations immediately fill out a form or checklist when an outage occurs.
- Notify key personnel. The list of personnel will vary depending on the severity and affected functional area.
- Assemble the outage response team. The team should convene as quickly as possible for briefing on the outage.
- Diagnose the problem and identify potential solutions. Recent system changes should be examined as the most likely culprits.
- Escalate the problem to support teams if necessary.
- Implement and test a solution. If the solution is unsuccessful, roll it back and return to step four (diagnosing the problem).
- Monitor the application to ensure that the recovery was successful.
- Determine and document outage causes and their solutions for use in solving future problems.
- Identify and implement application monitoring to proactively recognize the symptoms that catalyzed the outage.
- Modify outage plan if necessary.
- Periodically analyze data such as outage frequency, recovery time, and costs to improve prevention and handling.
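The diagnose, implement, test, and roll-back portion of this framework can be sketched in code. The following is a minimal illustration under stated assumptions, not a definitive implementation: the `Fix` and `OutageRecord` types, the `run_fix_loop` function, and the sample fixes are all hypothetical names invented for this example. Each candidate fix is applied and tested; a failed fix is rolled back before the next candidate is tried, and the log doubles as raw material for the documentation step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Fix:
    name: str
    apply: Callable[[], None]      # implement the candidate solution
    verify: Callable[[], bool]     # test whether the outage is resolved
    rollback: Callable[[], None]   # undo the change if the test fails

@dataclass
class OutageRecord:
    description: str
    log: List[str] = field(default_factory=list)
    resolved_by: Optional[str] = None

def run_fix_loop(record: OutageRecord, candidates: List[Fix]) -> OutageRecord:
    """Try each candidate fix in order, rolling back any that fail testing."""
    for fix in candidates:
        record.log.append(f"diagnose: trying '{fix.name}'")
        fix.apply()
        if fix.verify():
            record.log.append(f"implemented: '{fix.name}' passed testing")
            record.resolved_by = fix.name
            break
        fix.rollback()
        record.log.append(f"rolled back: '{fix.name}' failed testing")
    else:
        # No candidate worked: hand the problem to support teams.
        record.log.append("escalate: no candidate fix succeeded")
    return record

# Hypothetical system state: a recent configuration change caused the outage.
state = {"config": "bad", "restarted": False}

candidates = [
    Fix(
        name="restart service",
        apply=lambda: state.update(restarted=True),
        verify=lambda: state["config"] == "good",
        rollback=lambda: state.update(restarted=False),
    ),
    Fix(
        name="revert recent config change",
        apply=lambda: state.update(config="good"),
        verify=lambda: state["config"] == "good",
        rollback=lambda: state.update(config="bad"),
    ),
]

record = run_fix_loop(OutageRecord("order entry unavailable"), candidates)
for entry in record.log:
    print(entry)
```

In this example the restart fails testing and is rolled back, after which reverting the recent change succeeds, mirroring the advice to examine recent system changes as likely culprits.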
Technology innovation often goes hand in hand with destabilization. Embrace the fact that unplanned outages
are inevitable; proper planning cannot prevent them, but it can keep application downtime to a minimum. If firefighting has become your
organization's daily grind, you should consider assembling your top team and using a model like this one to devise your
own outage handling strategy. All you need is an outage response team, some preparation measures and
processes, and some comfortable sleeping bags!
Robert Northrop [robert.northrop@tallan.com]
is a director of design and development with Tallan, a professional services company specializing in
developing custom technology solutions for its clients. In the past three years, Northrop has consulted
for eBay, Compaq, Ingram Micro, Stage Stores, Kinko's, and others.