Mind over Matter
At U S West, business systems availability is a mindset, not just a hardware spec
By Bill Juliano & Sandra McCrady
The information explosion in the business world is making the layer of computing complexity between users and the end target (the data transaction) grow dramatically. This increasing complexity challenges the ingenuity of the most sophisticated IT organizations: if something can go wrong, it will. If nothing can go wrong, something will. And your IT group's job is to prevent users and customers from feeling the impact when it does go wrong.
Unfortunately, systems environments are difficult to manage and operate because standard definitions and enforcement often vary significantly across organizations and because of the many managed applications involved. In many cases, this precarious framework stays in one piece only through a combination of luck, heroic effort, and sheer numbers of support staff.
Furthermore, the pressures of normal business operations in the computing environment are insignificant compared to those of e-business. E-business applications are driving availability requirements up and putting pressure on IT to provide true 24x7 availability in the near future. This situation requires careful attention to planned as well as unplanned downtime. These pressures, however, work to the advantage of the organization that develops and enforces standards. Evidence is mounting that standardization provides better system availability and lower cost. In addition, faster deployment is possible by considering only a limited number of well-understood architectures that are configured to meet all internal requirements.
IT is no longer a supporting function in business; in fact, in e-business, it is the business. Consequently, IT outages can have severe impacts. The infamous 22-hour outage experienced by eBay in June 1999 cost that company between $3 million and $5 million in revenue, not to mention the damage to customer confidence and competitive position.
As part of an ongoing effort to achieve standardization and improve availability of U S West's IT products, our IT organization has developed four supported service levels (Fault Resilient, Highly Available, Highly Reliable, and Conventional) with specific configurations for midrange computing environments based on required availability. (See Figure 1.) These configurations cover system design, engineering, and operation, and support the availability percentages we've assigned to each of the four standard service levels. Our goal is to provide standards that will minimize unplanned downtime within our computing environment.
Our follow-up efforts focus on architectural frameworks that will enhance the Fault Resilient service level and provide some of the features of the next three higher service levels (Fault Tolerant, Disaster Resilient, and Disaster Tolerant; see Table 1 for descriptions of all service levels) to minimize planned downtime. These three service levels move closer to true 24x7 availability by reducing planned downtime as well as unplanned downtime, even in the event of a regional disaster. Emerging processes and technologies such as wide-area failover and rolling upgrades are part of providing these higher levels of availability.
FIGURE 1 U S West's four supported service levels.
Our four standard configurations (Fault Resilient, Highly Available, Highly Reliable, and Conventional) provide several benefits. For example, they:
Provide a limited number of well-defined and well-understood configurations.
Help achieve Capability Maturity Model (CMM) Level 3 and above, which require the documentation, standardization, and integration of processes into a uniform software framework for the organization. Ensuring that standard configurations are documented and understood supports that goal.
Match application configurations to business needs, focusing resources where most required.
Reduce support costs by reducing variability in the environment.
Simplify the procurement process.
Help position midrange applications for stacking to take advantage of technology changes that further reduce costs of ownership.
Improve client satisfaction by setting and meeting availability expectations.
Each service level has an associated availability percentage ranging from 98.0 percent (Conventional) to 99.8 percent (Fault Resilient). This approach lets the development team select a service level early in the development cycle based on business needs, then design applications and use resources accordingly. The availability percentage we assign to each level is consistently achievable; in any given period, actual availability may be higher, but the percentage we assign to each level represents the availability that we can sustain over the long term given the reliability of all components.
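To make these percentages concrete, the following minimal Python sketch converts each supported level into the downtime it permits per year. It assumes a service whose planned uptime is the full year (8,760 hours); that assumption, along with the function and constant names, is ours for illustration and not part of the U S West standard.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)

# Availability percentages for the four supported service levels.
SERVICE_LEVELS = {
    "Fault Resilient": 99.8,
    "Highly Available": 99.5,
    "Highly Reliable": 99.0,
    "Conventional": 98.0,
}


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted within the planned-uptime period."""
    return period_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    for level, pct in SERVICE_LEVELS.items():
        hours = allowed_downtime_hours(pct)
        print(f"{level}: {pct} percent allows about {hours:.1f} hours per year")
```

Under that assumption, the gap between the top and bottom supported levels is roughly 17.5 hours versus 175 hours of downtime per year.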
Providing higher availability requires larger expenditures both initially and in ongoing support costs, so matching business value to architecture is crucial. We design, engineer, and operate applications with high business value (those that incur lost revenues, increased costs, or lost customer base during downtime) at a high service level. These applications include those involving automated monitoring and trouble reporting of telephony equipment, dispatching for installation and repair, online ordering, and public Web sites. For example, the life-cycle cost of a Fault Resilient application is approximately 2.3 to 2.6 times that of a Conventional application. On the surface, this approach appears to conflict with the "cheaper" part of our "faster, better, cheaper" mission. But considering the amount of resources (and the amount of luck) we would need to consistently develop and operate applications at the Fault Resilient level using nonstandard configurations, this approach provides substantial savings. Appropriate configurations based on business need focus resources where they are cost justified, resulting in more effective and efficient operations.
The Standards
Information from several sources in the industry (including Gartner Group, a Unix standardization opportunity analysis Hewlett-Packard did for U S West, and internal research such as our service criticality model) gave us a starting point for defining supportable architectures that provide software development options for appropriately configuring applications. These architectures also provide a basis for moving U S West projects to higher levels of availability as business opportunities dictate. In each case, the desired level of availability is based on the user perception and the cost-benefit ratio of providing the user's desired level of availability. (See Table 1.)
These availability percentages are assigned to the service levels based on defined configurations. The top three levels will not be available until we define and test configurations for them.
| Service level | Availability % | Description |
| --- | --- | --- |
| Disaster Tolerant | N/A | Business functions must be continuously available, with any system failure transparent to the user. No work interruptions, no performance degradation, and no lost transactions are permitted. Supports continuous operations and transparent remote recovery in the case of disasters such as flood, fire, or earthquake. |
| Disaster Resilient | N/A | Business functions require uninterrupted computing services, either during essential time periods, or during most hours and days of the week throughout the year. Users stay online, although the current transaction may need restarting and performance may degrade. Supports continuous operation and remote recovery in the case of disasters such as flood, fire, or earthquake. |
| Fault Tolerant | N/A | Business functions demand continuous computing, and failures are transparent to the user. No work interruptions, no performance degradation, and no lost transactions are permitted. Supports continuous operation except following a regional disaster. |
| Fault Resilient | 99.8 | Business functions require uninterrupted computing services, either during essential time periods, or during most hours and days of the week throughout the year. Users stay online, even though the current transaction may need restarting. Maintains data integrity but performance may degrade. |
| Highly Available | 99.5 | Business functions can survive minimally interrupted computing services. The user will be interrupted, but can log on again. Transactions may be lost, but data maintains its integrity. Performance may degrade. |
| Highly Reliable | 99.0 | Business functions may be interrupted for short periods of time, as long as data maintains its integrity. The user's work stops, and an uncontrolled shutdown occurs. Transactions may be lost, but no corruption occurs. |
| Conventional | 98.0 | Business functions may be interrupted; data integrity can recover to a previous point. The user's work stops, uncontrolled shutdown occurs, and transactions may be lost or corrupted. |
TABLE 1 Service level definitions.
The availability percentages apply to planned uptime and take the computing platform, local network, and application into consideration. These figures represent the availability for a given service (application) on a given server. Keep in mind that business services often include chains of interrelated applications that offer functionality in a series of transactions; the availability of the end user's experience is the statistical product of each link's availability. Thus, each and every link is extremely important from an end-user perspective.
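The effect of chaining can be illustrated with a short Python sketch. It multiplies the availability of each link, which assumes the links fail independently; the function name and the example chains are hypothetical and chosen only to show the arithmetic.

```python
from math import prod


def chain_availability(link_pcts):
    """Return end-to-end availability (percent) for links used in series.

    Assumes every link must be up for the service to work and that
    link failures are independent of one another.
    """
    return 100 * prod(pct / 100 for pct in link_pcts)


if __name__ == "__main__":
    # Hypothetical chain: three applications, each at the Fault Resilient level.
    print(f"{chain_availability([99.8, 99.8, 99.8]):.2f} percent")
    # Hypothetical chain: five applications, each at the Highly Available level.
    print(f"{chain_availability([99.5] * 5):.2f} percent")
```

Even when every link meets the 99.8 percent level, a three-link chain delivers only about 99.4 percent to the end user, which is below the Highly Available figure for a single application.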
U S West's midrange service level standard covers five major areas: application design, hardware and OS, environmental support, network, and support processes.
Application design. If an application is worth doing, it's worth doing right. In most computing environments, more than half of reported downtime is attributable to application coding and touch-labor that occur after deployment. Our standard for application design minimizes these risks through several protocols:
Application code must be fault-tolerant and provide a reasonable behavior when interfaces fail.
Code must provide hands-off recovery in the event of a fault when a resource is undergoing a state change, is missing or does not respond, is inaccessible, is corrupted or returns untrustworthy results, returns an error condition, or is degraded in performance or capacity.
The application must provide hands-off recovery on restart from an unknown state due to a hardware or OS failure.
The application and middleware must be self-contained; all associated data, executables, log files, temporary space, and so on must be restricted to a single mount point isolated from root and all other applications.
The application must capture response time and support trend analysis and alarms, because excessive response time is often reported as downtime.
All development tool sets must support development of fault-tolerant applications.
When developed, practices and procedures must include strict test procedures for initial fault resilience and as part of change control. Test scripts must identify every state change in the application and test for failure before, during, and after the state change.
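As an illustration of what hands-off recovery can look like in code, here is a minimal Python sketch of a wrapper that retries a failing resource and then degrades to a fallback instead of waiting for an operator. It is not taken from the U S West standard; `ResourceUnavailable`, `call_with_recovery`, and all of the parameters are hypothetical names introduced for this example.

```python
import time


class ResourceUnavailable(Exception):
    """Hypothetical error: a resource is missing, unresponsive, or returns an error."""


def call_with_recovery(operation, fallback, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); retry with exponential backoff, then use fallback().

    No human intervention is needed at any step: the caller always gets
    either the real result or the degraded fallback result.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ResourceUnavailable:
            # Wait before retrying, but not after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback()
```

A test script in the spirit of the last protocol would inject `ResourceUnavailable` before, during, and after each state change and confirm that the application still returns a reasonable result.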
Hardware and operating system (OS). If we were to remove the application and the administrative touch-labor associated with it from the picture, the next most common issue would be touch-labor associated with managing a diverse computing environment. The standard for hardware and OS seeks to minimize the risk by limiting the number of possible configurations:
Selecting a limited (small) number of vendors with good availability history and configurations that can support no single point of failure, and purchasing appropriate maintenance agreements.
Developing hardware and OS configurations that support the desired service levels. In the case of the higher levels given here, this directive means you must enforce no internal single point of failure and OS parameters that handle 95 percent or more of all applications.
Developing peripheral configurations that support the desired service levels. In the case of the higher levels given here, that means creating no internal single point of failure.
Reviewing each configuration to ensure it can operate successfully in the environment. In the case of U S West, these configurations include switched Ethernet and storage area network architecture.
Selecting standard tool sets and middleware from a pre-approved list of alternatives that rates each option by the service level it can support.
Environmental support. This standard ensures that power, cooling, and grounding are redundant within the environment and that the operating environment itself does not provide a single point of failure for higher-availability applications. (For example, we use diverse PDUs and dual power feeds for the server.) The standard also establishes physical security that limits the number of individuals who have change-access to the servers and environment (and holds them accountable). Finally, it establishes strict change controls for the environment linked to schedules for planned downtime.
Network. We limited network efforts for this initiative to heartbeat and server connections to our intranet. The U S West WAN/MAN is already a Self-Healing Alternate Route Protection/Self-Healing Network Service (SHARP/SHNS) configuration and provides high availability via redundancy. The intranet's backbone is designed with no single points of failure; should one part experience a problem, the system automatically routes data through an alternate path. The weak link (single point of failure) is in the local LAN connection to the WAN/MAN; therefore, we detail specific configuration information for connections to the router using different network interface connections and switches. The configuration specifies which slots and ports in each server configuration are approved for heartbeats and public network interfaces, as well as how to achieve Ethernet switch redundancy. This approach ensures consistency and makes maintenance less prone to error. In summary:
The WAN should have no single point of failure; today's technology can ensure virtually 100 percent availability.
Local network connections (including public and heartbeat networks) should be redundant to ensure higher availability levels.
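The value of redundant connections can be estimated with the standard formula for components in parallel. The Python sketch below is a rough illustration only: it assumes the redundant paths fail independently and that failover between them is instantaneous and always works, which real configurations only approximate. The function name and the 99.0 percent input are hypothetical.

```python
def redundant_availability(component_pct, copies=2):
    """Return availability (percent) of `copies` redundant components.

    The service is down only when every copy is down at the same time.
    Assumes independent failures and perfect failover (both assumptions).
    """
    all_down = (1 - component_pct / 100) ** copies
    return 100 * (1 - all_down)


if __name__ == "__main__":
    # Hypothetical: a single LAN path at 99.0 percent versus two such paths.
    print(f"single path: {redundant_availability(99.0, copies=1):.2f} percent")
    print(f"dual paths:  {redundant_availability(99.0, copies=2):.2f} percent")
```

Under these assumptions, duplicating a 99.0 percent path yields about 99.99 percent, which is why removing the single point of failure in the local connection matters so much.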
Support processes. When the design and engineering protocols we've described here support the service level desired, this standard minimizes errors due to touch-labor during operation by:
Providing strong event management processes for operations support, including reactive alarms, notification, and escalation
Offering strong, proactive event management for operations support, including proactive alarming, capacity planning, and response time trending
Enforcing strict change management
Providing a production-ready soak environment where all final testing is performed in a production environment
Supporting automated backup and recovery processes that provide the service level desired
Making maintenance windows a criterion for meeting the higher service levels.
One other key area of consideration is the overall development paradigm. IT professionals are often led to believe that availability is a hardware problem alone. In reality, adding hardware will not help availability if the application is not designed and developed to take advantage of redundancy. Even a server cluster can suffer an outage caused by a simple CPU failure if the application cannot gracefully fail over to the redundant resource. Therefore, we specify detailed design and development criteria to ensure applications can fail over in the event of a component failure. At the higher service levels, testing in the production-ready environment can take between one and three weeks, during which a combined team of development staff, operations staff, and users exhaustively tests all failure conditions.
To make product selection and configuration easier, U S West has an approved products list (APL) that lists all hardware, OS, middleware (tools), third-party applications, network hardware, and so on. Every approved product is rated for a service level based on its ability to support the availability desired. Product ratings are based on quality metrics, vendor support, compatibility with other products, and ability to function during and following a failure. To achieve higher levels of availability, an application must rely only on products that are rated at or above the desired level.
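The rating rule lends itself to a simple automated check. The following Python sketch shows one way it might be expressed; the `ServiceLevel` ordering mirrors the four supported levels, but the function name, the data layout, and the product names in the usage example are hypothetical and do not describe the actual APL.

```python
from enum import IntEnum


class ServiceLevel(IntEnum):
    """The four supported service levels, ordered from lowest to highest."""
    CONVENTIONAL = 1
    HIGHLY_RELIABLE = 2
    HIGHLY_AVAILABLE = 3
    FAULT_RESILIENT = 4


def unsupported_products(desired, product_ratings):
    """Return the names of products rated below the desired service level.

    product_ratings maps a product name to the ServiceLevel it is rated for.
    An empty result means every dependency is rated at or above `desired`.
    """
    return sorted(name for name, rating in product_ratings.items() if rating < desired)


if __name__ == "__main__":
    # Hypothetical ratings for an application's dependencies.
    ratings = {
        "example-database": ServiceLevel.FAULT_RESILIENT,
        "example-message-queue": ServiceLevel.HIGHLY_AVAILABLE,
        "example-report-tool": ServiceLevel.CONVENTIONAL,
    }
    print(unsupported_products(ServiceLevel.HIGHLY_AVAILABLE, ratings))
```

A check like this could run early in the development cycle, when the service level is selected, so that an underrated dependency is caught before the design depends on it.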
Next Steps
As we discussed earlier, e-business applications are driving availability needs higher than those delivered by traditional fault-resilient architectures. The industry is providing more and more technology that will support e-business in a no-fail environment; however, technology is only a small piece of the puzzle. Hardware that does not fail, and an OS that supports no-fail operations, are only part of the picture. Availability is not just a hardware solution; it is a mindset and a commitment to providing a given application at a given service level.
Bill Juliano (wjulian@uswest.com) is a lead project manager at U S West Communications. He has 19 years of IT experience as a programmer, manager, and lead project manager on platforms ranging from desktops to mainframes.
Sandra McCrady (smccrad@uswest.com) is senior engineer for high-availability architecture at U S West Communications and has 27 years of experience in IT. She sets standards and policies relating to best practices in achieving performance, scalability, and availability to meet service level targets.