
Intelligent Enterprise

Better Insight for Business Decisions

April 22, 2003

The Sleeping Bag Solution

An outage plan can help your organization manage unplanned application downtimes in a controlled and efficient manner

by Robert Northrop

Continued from Page 1

Outage response team roles. Identifying and assigning members to an outage response team is a critical component of any plan for handling unplanned downtimes. The sole duty of this team is to restore the affected systems; the team should comprise the following personnel:

  • Data center operations. The operations members handle day-to-day monitoring of applications and systems. If an outage occurs, they lead all resolution efforts by coordinating processes, escalation, and change tracking.
  • Application development. These team members diagnose problems and potentially implement fixes for both in-house and externally developed applications.
  • Technical operations. These members have expertise in networking, hardware, security, backup and recovery, and third-party software applications and are responsible for diagnosing failures in these areas.
  • Vendor coordination. The coordinator is responsible for ensuring vendor accountability and acts as a single point of contact for any vendors brought in for assistance.
  • Internal communications and support. The responsibilities of this role range from providing executive updates to managing team logistics.
  • External communications and support. This role is found in organizations that need to coordinate external communications for audiences ranging from internal business users to a press corps camped outside the company entrance.

Depending on how often outages occur, a rotating on-call schedule may be appropriate. Some outages may require additional support personnel. When bringing in additional resources, the team should take care to minimize interruptions to unaffected business operations.
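As an illustration of a rotating on-call schedule, here is a minimal Python sketch that assigns team members to consecutive weeks in round-robin order. The member names, the start date, and the one-week shift length are assumptions made for the example; the article does not prescribe any of them.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(members, start, weeks):
    """Return (week_start, member) pairs, assigning members round-robin."""
    if not members:
        raise ValueError("at least one team member is required")
    # cycle() repeats the member list; islice() stops after `weeks` entries.
    assignees = islice(cycle(members), weeks)
    return [(start + timedelta(weeks=i), member)
            for i, member in enumerate(assignees)]


# Hypothetical team members and start date, for illustration only.
rotation = build_rotation(["ops_lead", "app_dev", "tech_ops"],
                          date(2003, 4, 21), 5)
for week_start, member in rotation:
    print(week_start.isoformat(), member)
```

A real schedule would also need to account for vacations, shift swaps, and a secondary contact for escalation, none of which this sketch handles.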

Rules and guidelines. Assembling an effective outage response team is not enough; the team members must also agree on rules and guidelines as part of their charter. I recommend the following:

  • The accountability rule. The entire team is responsible for diagnosing and repairing the outage regardless of the suspected cause. This rule creates a spirit of collaboration, rather than one of accusation.
  • The prisoner rule. The team is imprisoned by the outage and can only leave when the problems are resolved. This attitude is critical for recovering applications in a timely manner.
  • The proximity rule. Team synergy and effective communication are promoted if the team works from the same location. If this is impossible, the team should use communication media ranging from video conferencing to Internet chat.
  • The fix-it-first rule. Many organizations go astray with outage analysis paralysis. Although learning as much as possible to prevent similar future outages is important, this knowledge shouldn't be acquired at the cost of additional application downtime.

A process framework. In general, once an unplanned application outage occurs, a customized version of this process should be followed:

  1. Identify the outage and affected resources. Many organizations immediately fill out a form or checklist when an outage occurs.
  2. Notify key personnel. The list of personnel will vary depending on the severity and affected functional area.
  3. Assemble the outage response team. The team should convene as quickly as possible for briefing on the outage.
  4. Diagnose the problem and identify potential solutions. Recent system changes should be examined as the most likely culprits.
  5. Escalate the problem to support teams if necessary.
  6. Implement and test a solution. If the solution is unsuccessful, it's rolled back and the process returns to step four.
  7. Monitor the application to ensure that the recovery was successful.
  8. Determine and document outage causes and their solutions for use in solving future problems.
  9. Identify and implement application monitoring to proactively recognize the symptoms that preceded the outage.
  10. Modify outage plan if necessary.
  11. Periodically analyze data such as outage frequency, recovery time, and costs to improve prevention and handling.
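The diagnose, implement, test, and roll-back loop in steps four through seven can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a prescribed implementation: the `CandidateFix` type, the `resolve_outage` function, the health check, and the two example fixes are all hypothetical names introduced here. Candidate fixes are assumed to be ordered with the most likely culprits first, and every step is appended to a change log to support change tracking.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CandidateFix:
    """A hypothetical fix with a way to apply it and a way to undo it."""
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def resolve_outage(candidates: List[CandidateFix],
                   is_healthy: Callable[[], bool]
                   ) -> Tuple[Optional[str], List[str]]:
    """Try fixes in order; roll back each one that does not restore health.

    Returns the name of the successful fix (or None) plus a change log.
    """
    log: List[str] = []
    for fix in candidates:
        fix.apply()
        log.append(f"applied {fix.name}")
        if is_healthy():
            log.append(f"verified {fix.name}")
            return fix.name, log
        # Unsuccessful: undo the change before trying the next candidate.
        fix.rollback()
        log.append(f"rolled back {fix.name}")
    return None, log


# Toy system state standing in for a real application (illustration only).
state = {"config": "bad", "cache": "stale"}
fixes = [
    CandidateFix("flush cache",
                 lambda: state.update(cache="fresh"),
                 lambda: state.update(cache="stale")),
    CandidateFix("restore config",
                 lambda: state.update(config="good"),
                 lambda: state.update(config="bad")),
]
resolved_by, change_log = resolve_outage(
    fixes, lambda: state["config"] == "good")
print(resolved_by)
for entry in change_log:
    print(entry)
```

If the function returns None, every candidate has been tried and rolled back, which corresponds to escalating the problem to support teams in step five.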




Technology innovation often goes hand in hand with destabilization. Accept that unplanned outages are inevitable; proper planning cannot prevent them, but it can keep application downtime to a minimum. If firefighting has become your organization's daily grind, you should consider assembling your top team and using a model to devise your own outage handling strategy. All you need is an outage response team, some preparation measures and processes, and some comfortable sleeping bags!


Robert Northrop [robert.northrop@tallan.com] is a director of design and development with Tallan, a professional services company specializing in developing custom technology solutions for its clients. In the past three years, Northrop has consulted for eBay, Compaq, Ingram Micro, Stage Stores, Kinko's, and others.








