Fault Tolerant Computing
Brian S. Mitchell
School Of Engineering
Department of Electrical Engineering
Introduction and Background
In order for computers to reach a stage of acceptable dependability in the performance of modern applications, they must demonstrate the ability to produce correct results or actions in the presence of faults or other anomalous or unexpected conditions. Such fault-tolerant systems were first developed for robotics, process control, aerospace and other mission-critical applications where the consequences of device failure could be catastrophic. Today, computers are playing a larger part in everyday life, and the tolerance for system failures in more common applications is becoming unacceptable. This coupled with the increasing complexity of modern computer-automated services demonstrates the need for ultra-reliable, highly available, fault-tolerant hardware and software.
This paper will elaborate on previous work done in the field of fault tolerant computing in order to present pertinent aspects of modern hardware and software fault-tolerant computer architectures.
Improving the reliability of computer systems started as a necessity in the 1950s due to the poor reliability of relay and vacuum tube machines. With the advent of the transistor, component reliability was greatly improved and focus was shifted to other research areas such as computer performance and functionality. Today, the increasing public dependence on automated services combined with the lower cost and increased performance of semiconductor technology has contributed to a renewed interest in fault-tolerant computing. The remainder of this section will discuss the strategies, classifications and techniques used to implement modern fault-tolerant computing systems.
Elements of Fault-Tolerant Computer Systems
A designer must consider many factors when creating a fault-tolerant computer system. Performance, complexity, cost and size are a few of the many parameters that introduce constraints on the final system design[RENN84]. These constraints must be weighed against the cost and consequences of system failure in order to produce an appropriate fault-tolerance strategy for the target system. Typically this strategy would include one or more of the following elements:
Masking. A fault-tolerant system should be able to effectively mask, or dynamically correct, invalid system operation or errors. Errors that are not able to be masked must be handled by some other fault-tolerance mechanism (perhaps software) or the system will eventually fail.
Detection. A computer system must be able to distinguish a correct operation or value from an error. This is not always trivial due to the complexity of modern hardware designs; therefore, the addition of intelligent mechanisms, that are capable of detecting suspect operations, must be incorporated into the system design.
Containment. In order to minimize the complexity of recovering a system, errors must be contained to the module in which they originated. This is typically implemented by first partitioning the system into recoverable modules, and then requiring each module to validate their own outputs. Detection of an incorrect output should invoke some additional fault-tolerance mechanism in order to restore correct system operation.
Diagnosis. The cost associated with the maintenance of modern computer systems is rapidly increasing. Therefore in order to expedite the repair process, fault-tolerant systems should be able to automatically identify the modules that were responsible for generating errors.
Repair and Reconfiguration. A system is restored to normal operation (repaired) by replacing a faulty module with a spare or by reconfiguring the system to bypass the failed module by redistributing the work load to other functioning modules. The module replacement technique requires redundant hardware but will restore the system to full operational capacity. Redistribution of the systems work load is appropriate in circumstances where degraded system performance is acceptable until a replacement module can be installed.
System Recovery. System recovery involves the restoration of the system to a state that is acceptable for continued operation. If an error can not be dynamically corrected, or masked, then some other technique must be utilized in order to properly recover the system. The most typical system recovery technique involves restoring or "rolling-back" the systems operation to a previously saved state that was known to be correct.
Classification of Fault-Tolerant Computing Environments
The design of fault-tolerance into a computer system is highly dependent on the type of functionality that target system is going provide. The following are the five most popular application classes of fault-tolerant hardware systems [RENN84, SEIW86]. They are presented in order of increasing reliability requirements.
General Purpose Commercial Systems. The increased dependence of the public on automated computing services has contributed to the proliferation of fault-tolerant mechanisms being incorporated into general purpose computer systems. These systems typically provide automatic recovery within a couple of seconds of the detection of an error. In the case of severe errors, degraded operation should be possible until appropriate maintenance can be performed. Modular replacement components, and robust diagnostic routines that expedite troubleshooting and repairs are also characteristics of this computer system class. Many of the DIGITAL and IBM computers are designed for this operating class.
Computer Networks. Computer networks provide a naturally redundant environment of computer resources that are connected using proven communication protocols and reliable, high speed media. The three major advantages of utilizing computer networks for fault-tolerance are that the cost of processors (nodes) is remarkably inexpensive, fault-tolerance is inherently built in, and they tend to be very scaleable and expandable. Just about every computer designed today is incorporating the ability to connect and share resources over standard computer networks. The IEEE 802.5 token ring network specification incorporates several fault-tolerant strategies into the protocol definition and is currently very popular in the commercial computing environment [MART84].
High Availability. Highly available systems are typically designed to perform specific applications where the consequences of a small error is tolerated in favor of continued, uninterrupted, system operation. For example, signal processing systems which process billions of measurements per second will not be affected by an occasional error in an individual measurement, but will be greatly or gravely impacted by several seconds of down time. Typical techniques used to implement fault-tolerance in high availability systems include coded memory, bus parity and time-out counters. The Bell Telephone electronic switching system (ESS), the Tandem "Non-Stop" and the Intel 432 are examples of this architecture class.
Long Life. Long life systems must be designed to operate in harsh environments where human intervention is not feasible (or even possible) over the systems operating life. These systems, which are typically designed for unmanned spacecraft, are often highly redundant, with recovery mechanisms that can either be automatically or remotely invoked. The goal of designing a long life system is to provide enough spare components so that the required computational power will always be available over the systems lifetime. The STAR and Voyager are examples of long life spacecraft systems.
Critical Application and Computation. The most important class of fault-tolerant computing systems are those in which faulty operation could result in the loss of expensive equipment or human life. Techniques which employ massive redundancy, protected memory and storage, high quality components and concurrent recovery mechanisms are always used in these applications. Critical application and computation systems must not only detect, contain and mask faults, but they must recover and restore normal system operation in a minimal time frame. Aerospace, and other mission-critical control applications typically warrant the investment in implementing this level of fault-tolerance. The SIFT and FTMP avionic computers are examples of critical application and computation systems. Their design goal was to meet the military specification of a failure probability less then 10-9 over a 10 hour mission.
Implementing Hardware Fault-Tolerance
Fault tolerance in computer systems is primarily implemented by providing a system with additional (redundant) resources that are used to overcome the effects of an error. Redundancy is usually provided in the form of additional hardware, software, information and/or computations that are beyond the minimum necessary to perform the computational task[NELS90]. The two major subdivisions of redundancy are coding and replication.
Coding is a technique where extra bits are added to the basic information structure in order to provide error detection and correction capabilities. These mechanisms have become very popular because of their efficiency and ease of implementation. Because this has been a popular research area, many coding techniques have been both developed and applied to a variety of applications. Each technique has certain advantages over other methods; therefore, the selection of the best coding methodology is somewhat application specific. Some common coding algorithms are simple parity (odd/even), M out of N, Berger, Checksums, Hamming and CRC codes.
Replication methods utilize extra hardware to detect, perform or overcome the undesired effects of a failed module. Replication techniques are classified as static, dynamic or hybrid[HARI92]. Static replication is typically implemented by providing multiple instances of identical processing modules (PMs) and a voter to compare the outputs of the PMs. Although the added hardware is expensive and complex to implement, static replication techniques have the capability of detecting and tolerating a fault without the need for reconfiguration. Thus, fault-tolerance is implemented in real-time without incurring any performance penalty to recover and reconfigure the system to normal operation. Dynamic replication (also referred to as stand-by redundancy) is usually implemented by providing a system with spare processing modules and an error detector. These spares may be active (powered) or dormant (unpowered), but once an error or faulty module is detected, a spare is switched on-line in order to restore correct system operation. Hybrid replication methods utilize both static and dynamic redundancy techniques to provide an optimal level of fault coverage without introducing unwanted design complexities. Figure 1-1 illustrates commonly used static and dynamic redundancy configurations.
Figure 1-1: Fault-Tolerance Replication Techniques
On February 10th, 1987, the United States Food and Drug Administration (FDA) along with the Canadian Health Protection Branch (CHPB) notified AECL Medical Corporation that the Therac-25 machine, an electron accelerator used for cancer treatment, is officially considered defective under United States and Canadian laws[LEVE93]. The company was instructed to notify their customers that the device was faulty and should only to be used in circumstances where no treatment alternatives exist. Unfortunately, six people were already dead and numerous other patients received massive radiation overdoses. The cause of all of these accidents - faulty software.
Unlike hardware, software is not subject to wear out conditions that occur from aging, stress or environmental factors. Thus, the only cause of software faults is the improper design and/or implementation of programs which result in the inability of the software to meet its designed requirements or specifications under unforeseen circumstances or changes to the computing environment. Although just like in fault-tolerant hardware, reliable software systems must be able to either eliminate or effectively tolerate and recover from faults. Because modern software systems are so complex, it is nearly impossible to eliminate all of the embedded faults; therefore, development techniques which protect the entire system from incorrect operation should be utilized in cases where program failure would have serious consequences.
The following four sections describe fault-tolerance strategies that are commonly utilized to improve software reliability[HECH86].
The primary purpose of fault-avoidance and detection techniques is to identify and repair incorrect program operation prior to releasing a system. Fault-avoidance mechanisms are the most common form of software protection used in the industry today because of the relatively low cost of implementation. Unfortunately, fault-avoidance techniques only prevent or catch known faults, they do not ensure correct system operation, or provide recovery mechanisms to handle unexpected conditions which can lead to system failure. The four most popular fault-avoidance categories are utilization of automated design tools, use of demonstrated development techniques, formal proof of correction techniques and testing. The following sections analyze each of these techniques.
Automated Design Aids. It has been show that 60 to 65% of all software faults are introduced during the design phase of system development. The cost of correcting these faults is about 10 times greater then if they were introduced during the coding phase of development. Because of this, formal system specification languages and metalanguages have been developed to describe a software system design at the component and algorithmic levels. Some specification languages even have interpreters and simulators, allowing the system design to be "debugged" prior to implementation at the coding phase[LAMB88].
Use of Demonstrated Development Techniques. Structured programming, use of high-level languages (C/C++) and top-down design techniques instill discipline and structure on the programmer and designer which inherently reduce the probability of faults being introduced into the system. Additionally, newer programming techniques modeled after the object oriented paradigm are becoming popular because this methodology is based on combining code and data into reusable objects. Over time, code that is reused will mature and will have fewer defects then newly developed code. This will result in fewer faults propagating into the released system.
Use of Formal Proof of Correctness Techniques. Many attempts have been made to formally prove the correctness of programs. This approach, which is based on validating a program by mathematical induction, is complex and is currently not applicable to large programs. Furthermore, in some instances, programs that were proven to be "correct" were shown to contain faults. These faults were not manifested by the "program-prover", but were introduced during the translation of the programs source code and algorithms into the formal specification that is required for the mathematical proof.
Testing. Although testing techniques do not prove that software is fault free, it does show that the software performed as specified for a specific set of test sequences, in a controlled environment. Testing, both formal and informal, is a valuable tool for exposing many embedded system faults; but even the most thoroughly tested software will escape the testing cycle with subtle errors[LAMB88]. These errors will probably manifest themselves while processing some unusual data or during an unexpected machine state and will lead the system to failure unless some additional fault-tolerant mechanisms are incorporated into the software.
Even though fault-avoidance techniques are not guaranteed to eliminate incorrect system behavior, they are proven tools for improving system reliability and overall quality. Figure 2-1 [HEIC86] shows the effects of utilizing fault-avoidance techniques during the project development cycle. Notice the large variations in the program failure rate during the development cycle and how the failures start to converge to the stabilize to the trend during the testing phase (which started in the eighth month).
The robustness of a software system is a measure of its ability to correctly operate in the presence of invalid inputs. These conditions are often attributed to unforeseen sequences of inputs or events which occur as a program encounters invalid data that was not detected by the use of any fault-avoidance techniques. For example, robust programs should be able to handle, without any performance degradation, such things as invalid or out of range inputs, invalid indexes, unknown or spurious signals, inconsistent check sums and data type conflicts.
Robustness techniques tend to be tightly coupled to validating input data. The goal of a robust program is to prevent incorrect inputs from corrupting or propagating to additional parts of a system by providing automated recovery mechanisms that correct, intercept or replace invalid data. This is often done by providing a user with an error message and requesting a new input, or if an operator is not available, other options such as reuse of a previous value or a use of a default value might be appropriate.
Fault containment mechanisms prevent faulty outputs or operations in one section of a program from adversely effecting other system modules. The identification of a fault by containment methods does not guarantee that the system can be recovered, or even continue to operate, only that the anomalous condition will be localized.
Common applications that employ fault containment methodologies are operating and database systems. For example, operating systems such as IBM OS/2 will terminate a task if it corrupts its address space or raises a processor exception (divide by zero, stack overflow, etc.). Although the operating system makes no attempt to recover the "crashed" task, the integrity of the other running tasks are preserved. Database systems also utilize fault-containment techniques to maintain data consistency. If the Sybase SQL Server database product detects a system error while modifying a database file, it will automatically perform a transaction rollback. Rolling back an "inflight" transaction will ensure that the database file remains consistent at the cost of loosing the update.
Fault-tolerant software provides extensive fault detection and recovery mechanisms which ensure the continued, correct and uninterrupted operation of a program in the event that an error or unplanned events occurs during the execution of critical function. In addition to fault-avoidance, robustness and fault-containment techniques, fault-tolerant software includes multiple or redundant implementations of its critical functional processes. One of the most challenging aspects of implementing fault-tolerant software is the selection of a methodology to manage redundant processes. The three most common software redundancy techniques are[NELS87]:
Temporal Redundancy. Temporal redundancy provides mechanisms to seemlessely execute another version of a software module following the detection of an error or suspicious event.
Spatial Redundancy. Spatial redundancy techniques facilitate the concurrent execution of different versions of software modules and manage the comparison or voting process used to detect errors or incorrect operation.
Operational Redundancy. Operational redundancy provides software systems with additional modules that are not necessary for the normal system operation, but are used solely for the detection and diagnosis of faulty operation.
Each redundancy technique has various tradeoffs in terms of improving fault coverage at the expense of increased complexity, execution time and required resources. Temporal redundancy has a large amount of fixed overhead associated with the detection of an error, and the penalty of "switching" to an alternative program version to re-execute a failed module. Spatial redundancy eliminates most of the degraded performance problems that are associated with incorporation of redundancy into a system by concurrently executing multiple versions of a software module. Spatial redundancy techniques are usually not implemented where system development and operational cost is a primary factor. This technique requires both complex software, and hardware capable of executing and synchronizing multiple versions of concurrently running programs. Operational redundancy techniques are easy to implement (relative to the others) and are usually loosely coupled or even independent of the primary software system. Unfortunately, the fault-coverage of systems that utilize operational redundancy is limited to providing "checks" or "tests" at convenient or appropriate places in the system.
The two most popular techniques used today for the implementation of fault-tolerant software systems are Recovery Blocks and N-Version programming. The following two sections will describe these methods:
N-Version Programming Techniques. N-Version programming is a spatial modular redundancy technique in which two or more versions of a software system are concurrently run with the results of each version compared and voted on to determine the final output[AVIZ85]. A special case of N-Version programming is where N is two. Clearly, no majority outcome can be derived in the case where the individual program results do not agree. Even though it would not be sensible to pick one of the results, disagreement between the independent programs is a strong indicator that a problem might exist. Most fault-tolerant software systems utilize N ³ 3, and use a majority vote to resolve inconsistent results.
The components necessary to implement N-Version programming fault-tolerance are multiple, independent, versions of the software, redundant hardware to perform the concurrent program execution, and a facility to perform comparison, voting and recovery services. The reliability improvement of the N-Version programming technique is very difficult to mathematically model. This technique makes the assumption that all programs contain faults, but it relies on the fact that the number of hidden faults will be small and that they will be in different locations in each of the versions. To maintain independence, each version is typically built using a different compiler and run on a different type of processor because it is invalid to assume that these components are fault-free. Examples of systems that utilize N-Versioning techniques for fault-tolerance are the NASA Space Shuttle and the Draper Laboratory FTMP computers.
Recovery Blocks. Recovery blocks utilize temporal redundancy for fault-tolerance by partitioning a system into a number of self-contained recoverable modules. Each recovery block is responsible for validating correct operation within its boundaries. Recovery blocks consist of a primary processing routine, an acceptance test and at least one alternative processing routine[RAND75]. Figure 2-2 illustrates the major components of a recovery block.
Figure 2-2: A Typical Recovery Block
A recovery block works by providing an acceptance test to validate the output of the primary processing routine. Failure of the test will result in the automatic invocation of an alternative processing routine. This alternative routine might not be as efficient or as complete as the primary routine but should be robust enough to satisfactory meet the conditions of the recovery block. A recovery block is ultimately responsible for the correctness of its own output; therefore, if the acceptance test exhausts all of the alternative processing routines, a mechanism should be provided to signal that a failure condition has been detected and contained. Figure 2-2 shows the architecture of a recovery block with multiple alternative processing routines.
Modern technology constantly reminds us of the of consequences of failed computer systems. Well known mishaps such as the AT&T long distance switching failure, the 1990 Airbus 320 crash which killed several hundred people; and even most recently, the destruction of two United States helicopters by friendly F-15 aircraft remind us of the catastrophic results which are incurred when computer or safety systems fail. Although major technical advances have been made over the last 40 years, the challenge to engineers still remains to design and implement affordable fault-tolerant computing environments.
[AVIZ85] Avizienis, A., "The N-Version Approach to Fault-Tolerant Software", IEEE Transactions on Software Engineering, Vol. SE-11, No. 12, pp. 1491-1501, December 1985.
[HARI92] Hariri, S., A. Choudhary, and B. Sarikaya, "Architectural Support for Designing Fault-Tolerant Open Distributed Systems, Computer, Vol. 25, No. 6, June 1992.
[HECH86] Hecht, H., and M. Hecht, "Fault-Tolerant Software", Chapter 10 in Fault- Tolerant Computing - Theory and Techniques, vol 2, Prentice-Hall, Englewood Cliffs, NJ., 1986.
[LAMB88] Lamb, D., Software Engineering - Planning for Change, Prentice-Hall, Englewood Cliffs, NJ., 1988.
[LEVE93] N. G. Levenson and C.S. Turner, "An investigation of the Therac-25 Accidents", Computer, Vol. 26, No. 7, July 1993.
[MART84] Martin, J., Local Area Networks: Architectures and Implementations, Prentice- Hall, Englewood Cliffs, NJ., 1984.
[NELS87] Nelson, V. P., B. Carroll, Fault-Tolerant Computing, IEEE Computer Society Press, Washington, DC., 1987.
[NELS90] Nelson, V. P., "Fault-Tolerant Computing: Fundamental Concepts", Computer, Vol. 23, No. 7, July 1990.
[RAND75] Randell, B. "System Structure for Software Fault-Tolerance", IEEE Transactions on Software Engineering, Vol. SE-1, No. 2, pp. 220-232, June 1975.
[RENN84] Rennels, D. A., "Fault-Tolerant Computing-Concepts and Examples", IEEE Transactions on Computers, Vol. C-33, pp. 1116-1129, December 1984.
[SEIW86] Siewiorek, D., "Architecture of Fault-Tolerant Computers", Chapter 6 in Fault- Tolerant Computing - Theory and Techniques, vol 2, Prentice-Hall, Englewood Cliffs, NJ., 1986.
Glossary of Terms and Definitions
The field of fault tolerant computing extensively relies on terminology to describe, classify, model and measure faults. The following collection of terms and definitions, which was compiled by Victor Nelson of Auburn University, is very complete and is included for reader reference:
Failure- A failure is any departure of a system or module from its specified correct operation. Thus, a failure is a malfunction.
Fault- A condition existing in a hardware or software module that may lead to the failure of the module or system. Hardware faults are caused by physical factors resulting from wearout, external disturbances, design mistakes or manufacturing defects. Software faults often result from design or implementation mistakes.
Error- An incorrect response from a hardware or software module. An error is the manifestation of a fault. In other words, the occurrence of an error indicates that a fault is present in the module, that the module has been given an incorrect input, or that the module has been misused. An error will lead to the failure of a system unless tolerance of the underlying fault has been provided. Conversely, a fault may exist without the occurrence of an error under certain conditions.
Type- The fault can be in software or hardware.
Cause- A fault can be the result of improper design, failure of a hardware component, or external disturbances.
Model- A fault is typically represented by some model, ranging from simple "stuck-at" models of digital logic lines in which lines become fixed at logical 1 or 0 values, to more complex models in which faults take on indeterminate values or result in timing errors or other non-logical behavior.
Duration- A fault is said to be permanent if its cause will not disappear without repair and if its effect is always present. An intermittent fault also will not disappear without repair but its effect may not always be present. A transient fault will exist for some period of time and then disappears without the need for any repair action.
Level- The level at which a hardware fault occurs may be that of a component, a module, a subsystem or a system, while software faults can exist in programs or microprograms.
Extent- The extent of a fault refers to the scope of its effect on the system and ranges from localized in which case only a single component at the level of interest is affected, to global, in which case damage has propagated to other system components.
Latency- The property of a fault which allows it to go undetected by virtue of not causing an error.
Fault-Avoidance- The use of high-quality components and conservative design as a means to prevent the occurrence of faults.
Fault-Tolerance- The use of protective redundancy to permit continued, correct operation of a system after the occurrence of specified faults.
Protective Redundancy- The use of extra hardware, software, information, or time to mask faults or to reconfigure a faulty system.
Graceful Degradation- The ability of a system to enter an operational yet degraded state after the occurrence of a sequence of specified multiple faults.
Fail-Safe- The ability of a system to fail but not produce a catastrophic result.
Reliability- The probability that a system will not fail at time t given that it was operating correctly at time 0.
Availability- The probability that a system will be operating correctly at time t.
Maintainability- The probability that a system will be restored to operation within time t given that it was in a failed state at time 0.
Mean Time to Failure (MTTF)- The expected value of system failure time.
Mean Time to Repair (MTTR)- The expected value of system repair time.
Mean Time Between Failure(MTBF)- MTBF = MTTF + MTTR.
Return to my home page