Establishing ultra-reliability by fault injection experiments

A validation procedure combines field data, arguments-from-design, and fault injection experiments to demonstrate high reliability. This paper covers integrating these elements and deriving results that reduce the fault injection effort. A typical argument about the impossibility of demonstrating ultra reliability by experiment is presented to motivate the emphasis on reducing the experimental effort. An example is used to explain the integration of field data, fault injection, arguments-from-design, and performance monitoring. This validation procedure has stringent requirements, but it is shown that these requirements are common to other procedures. There is an extended analysis on number, type, and time of fault occurrence. These results are applied to a system with an extremely high reliability requirement. They reduce the experimental effort to a moderate level.


INTRODUCTION
A long-standing problem in the field of ultra-reliable digital control systems is the design of a fault injection experiment for system validation. Such an experiment combines arguments-from-design, field-data-on-fault-occurrence, and results-from-fault-injections. If the system successfully completes the experiment, then the system has a certain reliability at a certain confidence level.
The next several sections describe a type of validation experiment based on modifying a natural-life experiment. The procedure is made more efficient by using probability arguments and design arguments. A natural-life experiment requires an enormous number of trials to establish ultra reliability at a high confidence level, but if the system uses quality components then the probability of a component failure during a single trial is small. If there is a design argument that the system operates acceptably when there are no component failures, then there is no need to run the trials with no component failures. Furthermore, a fault injection experiment need last only as long as it takes for the system to remove the faulty component. As explained below, these features may make it possible to establish high reliability with a reasonable effort.
The difficult parts of this procedure are integrating the various elements and deriving the probability results to reduce the number of trials. Section two presents a typical argument that this type of experiment cannot be done. Section three discusses integrating the elements for a concrete example. It concentrates on the relationship between the experiments and the logical arguments. Sections four and five cover integrating field data and performance into the general procedure. Since this approach has stringent requirements, section six compares the requirements of this approach with the requirements of the Markov model approach. It is shown that the Markov model approach has the same requirements, although the Markov model approach usually does not state them explicitly. Section seven derives the probability results. These are used to determine the number, type, and time of fault injections. They are also used to reduce the level of experimental effort. Section eight discusses establishing the reliability of a flight control computer for a commercial aircraft. Since the previous example covered integrating the various elements of field data, fault injection, and arguments-from-design, section eight focuses on reducing the experimental effort.

TYPICAL INFEASIBILITY ARGUMENT
To motivate the emphasis on reducing the level of effort, consider the following argument.
A proposed requirement for an electronic flight control system is that the chance of system failure be less than one in a billion for a ten-hour flight. To establish this at the 100(1-1e-9)% confidence level requires 21,000,000,000 ten-hour flights without failure. A fleet of 1000 planes flying continuous ten-hour missions would require 24,000 years. Not even a six-order-of-magnitude gain in efficiency could make this experiment feasible.
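The arithmetic behind this argument can be checked with a short computation (a sketch; the trial-count formula is the binomial confidence relation derived in the probability section below):

```python
import math

# Requirement: P{failure during a ten-hour flight} < 1e-9,
# established at the 100*(1 - 1e-9)% confidence level.
p = 1e-9      # failure probability per flight
gamma = 1e-9  # allowed chance that the experiment has misled us

# Failure-free flights needed so that (1 - p)^n <= gamma
n = math.log(gamma) / math.log(1.0 - p)
print(f"flights required: {n:.4e}")          # about 2.07e10

# A fleet of 1000 planes flying continuous ten-hour missions
years = n * 10.0 / 1000 / (24 * 365)
print(f"fleet-years required: {years:.0f}")  # about 24,000
```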
The counter is that the argument above treats the system as an unknown black box. In reality, we know a lot about the system (because we built it). First, we know the structure, which permits using arguments-from-design and tests. One option is to concoct the arguments-from-design and then construct the system. Second, we know the failure rate of the components, which permits conducting quantitative fault injection experiments. The goal below is to use this structural and stochastic knowledge to bring establishing ultra-reliability within reach.

AN ILLUSTRATIVE EXAMPLE
This section determines the fault injection experiments needed and the arguments from design needed to establish the reliability of a simple system. These two activities must fit together to have an authentic procedure. After deciding what experiments and arguments are needed, it is a separate matter to design the experiments and concoct the arguments.
Consider a reconfigurable fourplex where each component has a failure rate of 1.24e-4 per hour. The flight time is two hours and there is maintenance after each flight that replaces the failed components. Suppose the requirement is that the probability of failure be less than 1e-6, and that this be established at the 100(1-1e-6)% confidence level.
The first step is to imagine conducting a natural life experiment. In a natural life experiment, each operating period is a trial. At the end of each trial (operating period), maintenance is performed, and the system is subjected to another trial. If the system successfully completes a large number of trials without failure, then the system can be said to have the required reliability at the required confidence level. If the requirement is 1e-6 reliability at the 1e-6 confidence level, then the number of trials required (without failure) is 13,816,000. (A more extensive analysis is needed if failures appear during the trials. See section seven.) The first reduction in effort comes from a probability argument that most of the 13,816,000 trials do not have a fault occurrence. It can be seen from the discussion above that three arguments from design are needed.
(1) If there are no component failures, the system will perform acceptably.
(2) If the system "successfully recovers" from one failure, then the system will perform acceptably.
(3) If the system "successfully recovers" from two failures, then the system will perform acceptably.
An explanation of the phrase "successfully recovers" requires a discussion of arguments from design.
An argument from design proceeds by demonstrating that if a system begins in a certain state and receives certain inputs, then its outputs will be acceptable. From this point of view a "successful recovery" detects the fault, removes the faulty component, and places the system in a certain state (or set of states). The argument from design then considers the state after recovery as a start state. The argument proceeds by showing that if the system starts in this state and receives certain inputs, the outputs will be acceptable.
The validation procedure for this system is being arranged so that the arguments from design can ignore the details of the reconfiguration process. As far as the argument from design is concerned, the start state can be the result of a reconfiguration or it can be the natural start state at the beginning of the operating period.
For this example, the procedure does not require any argument that a fault is detected or that reconfiguration proceeds correctly. This part of the procedure is handled by lab experiment. This procedure does not establish (or require) a 100% diagnostic level, but a system without an extremely high diagnostic level will not pass this test. This level of diagnostics will have to be part of the original design.
Similarly for correct reconfiguration. The reconfiguration algorithm also carries the burden of placing the system in the set of states identified by the arguments from design as acceptable start states.
With these arguments from design in place, the only span of operation not covered is the time from fault occurrence to system recovery. The performance of the system during this span is studied by fault injection experiments. Techniques from probability are used to determine the time of fault occurrence, the component that is faulty, and the external environment at the time of fault occurrence. The system is run for a short time before fault injection in order to produce the required external environment and to set up the related internal states. The fault is injected, and the system observed to check that the fault is detected and removed, that performance requirements are met, and that recovery places the system in an acceptable state for continued operation.
If system recovery is fast, little time is spent on the experiments. Suppose a single trial takes one minute. If the procedure is highly automated, then there are about 50 trials per hour or about 1000 trials per day. This rate gives a time of two weeks. If it takes five minutes to conduct a trial, it will take ten weeks.
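The fraction of trials that actually contain a fault, and hence need a fault injection, can be estimated with a Poisson model of fault arrivals (a sketch using the example's rates; it reproduces the order of magnitude of the roughly 13,800 fault-injection trials):

```python
import math

lam = 1.24e-4       # per-component failure rate (per hour)
T = 2.0             # operating period (hours)
n_comp = 4          # reconfigurable fourplex
trials = 13_816_000 # trials for 1e-6 reliability at the 1e-6 confidence level

# Probability that a given trial contains at least one fault occurrence,
# modeling fault arrivals as a Poisson process with the total system rate
p_fault = 1.0 - math.exp(-n_comp * lam * T)
print(f"P(fault during a trial) = {p_fault:.3e}")

# Expected number of trials that actually require a fault injection
print(f"expected fault-injection trials = {trials * p_fault:.0f}")
```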

Suppose the arguments from design and the 13,823 trials have been successfully completed. Then the entire procedure has established that the probability of system failure or occurrence of an undetected fault during a single operating period is less than 1e-6. This has been established at the 1e-6 confidence level.
This validation procedure has established something slightly different from a natural life test. A natural life test addresses the probability of failure during a single operating period with no explicit concern about undetected faults carried over from previous operating periods. The simulated-life procedure needs the additional conclusion about fault detection because it relies on some arguments from design. These arguments from design assume that the system begins an operating period fault free. For this reason, the simulated-life test establishes that if the system begins an operating period with no faults, then there is less than a 1e-6 chance that the system will fail during this period or begin the next period with a latent fault. In other words, the arguments-from-design and the fault injection experiment have shown the system satisfies a more stringent set of conditions than the original requirements. This is a price we pay for replacing natural life testing with structural and stochastic arguments.
A general principle is that a system will have to be over-designed in order to pass a validation test. There are two reasons for this. One reason is that a validation test uses arguments from design to improve on the efficiency of natural life testing. The system must conform to the additional features required by the arguments from design. Another reason the system needs to be over-designed is experimental error and the confidence level requirement. The system will have to be more reliable than required for a reasonable test to establish reliability at a high confidence level. An example of this can be seen from the experiment described above. If a system has a probability of failure of 1e-6, then 14 failures can be expected to occur in 14 million trials. Establishing a failure probability of 1e-6 at a 1e-6 confidence level with only 14 million trials requires that no failures occur.
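The point about experimental error can be made quantitative (a sketch): a system whose failure probability exactly equals the requirement passes the zero-failure experiment only with the small probability that no failures happen to occur.

```python
import math

p = 1e-6          # actual per-trial failure probability: exactly the requirement
n = 13_816_000    # trials for 1e-6 reliability at the 1e-6 confidence level

# A marginally compliant system passes the zero-failure experiment only
# if no failures occur in all n trials
p_pass = (1.0 - p) ** n
print(f"P(pass) = {p_pass:.2e}")            # about 1e-6

# Expected failures such a system produces during the experiment
print(f"expected failures = {n * p:.1f}")   # about 13.8
```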

FIELD DATA AND FAULT INJECTIONS
One of the difficult areas in the fault tolerant approach to reliability is gathering field data on the occurrence and behavior of faults. This section considers several possibilities and how they relate to fault injection and system design.
An idealized scenario is that the components are bench tested under the proper environmental conditions until errors are observed on the output pins. The failure rate is estimated and used in the probability arguments (as illustrated in section five). The system is designed to detect failures that appear at the output pins of the devices. In the experiments, faults are injected at the output pins. The actual error pattern that appears at the output pins of a faulty device may be hard to obtain. Hence, part of system design may be a worst case analysis for the detection of any error pattern.
It has been suggested that faults are characterized by open or short conditions at the input pins. If this characterization of faults is accepted, then the system must be designed to detect this type of fault, and the experiments must inject this type of fault.
This validation procedure incorporates performance almost automatically. There are two cases. The first case considers performance when the system begins in an approved start state and the system does not contain any faults. These start states are the states used by the arguments from design discussed in section two. Performance, for this first case, can be established by a combination of arguments and tests.
The second case considers performance during fault recovery. In this case, performance is established by observation. The period of observation is from the time of fault injection to the time of complete and correct system recovery.

COMPARISON WITH THE MARKOV MODEL APPROACH
Four common objections to the approach above are: (1) the need to establish the system works correctly if fault free; (2) the need for nearly 100% diagnostics; (3) the need for between-flight maintenance to bring the system to a fault-free state; and (4) the need to know the nature of faults in order to inject them. The reply is that these requirements are present in other approaches, but often not stated as explicitly as above. Consider the usual "demonstration" of reliability by a Markov model for a reconfigurable fourplex.
The mnemonics for figure 1 are S for a fault free state, R for a recovery mode state, A for an intermediate state, C for a failure state because of nearly coincident faults, and E for failure by exhaustion of parts. The failure rate is λ and the reconfiguration rate is ρ.
It is apparent that a system will be designed and validated for a given class of faults. This class of faults will have to be stated (and agreed upon) in the initial stages of design. A transition for failure of the fault-free system could be included in the model, but it would not be effective since its density function is not known. Hence, the model assumes that the system operates correctly if its components are fault free.
Part of the mathematical basis for semi-Markov models implies that the recovery function ρ as depicted is a complete density function, which means the model assumes 100% diagnostics. Compare this with the experimental procedure, which requires extremely good diagnostics to be successful but does not assume anything about diagnostics.
The model above, which describes the behavior of the system during a single flight, begins in state S, the state where all the components are good. Hence, the model assumes that between-flight maintenance brings the system to a fault-free state.
In order to get the parameters for the recovery function ρ, fault injection experiments must be conducted, but these fault injection experiments require knowledge about the nature of faults.
One approach for modeling suggests that the engineer can provide (perhaps conservative) parameters for the recovery function because of his knowledge of system design. If we accept engineering judgement as valid then there is no reason to perform the experiment, but we may be skeptical about our ability to predict one-in-a-million or one-in-a-billion events. This skepticism, and our desire for additional assurance, is the motivation for conducting experiments.
In summation, the requirements for conducting the experiment, although stringent, appear to be common to other approaches.

PROBABILITY DERIVATIONS
The experiment consists of a number of trials where each trial represents (and simulates) one operating period. The system either completes a trial or fails during a trial. If the system successfully completes almost all of the trials, and if the trials realistically correspond to an operating period, then the system can be said to have the specified reliability at the required confidence level. The system successfully completes a trial if it detects and recovers from the faults while maintaining its performance. This section discusses five items: (1) the number of trials, (2) the number of faults per trial, (3) superfluous fault injections, (4) the time of fault occurrences and the independence of occurrence time and fault type, and (5) conditioning to restrict the number of faults injected during a trial.

The Number of Trials for Binomial Sampling
We wish to establish that the probability of failure for a system during an operating period is small. Furthermore, we wish to establish this small probability of failure at a confidence level of 100(1-γ)%. That is, there is less than a 100γ% chance that the experiment has misled us.
Since any one trial for the system has an outcome of either success or failure, the probability distribution is binomial. Let the probability of failure during a single trial be p. We wish to show that p is small at a high confidence level. Let NF be the number of failures observed during n trials. The probability that there are no failures during an experiment consisting of n trials is

P{ NF = 0 } = (1 - p)^n

The larger the value of p, the smaller the probability that no failures will be observed. Hence, this expression will yield a one-sided confidence interval. That is, if no failures appear during a large number of trials, then the likelihood that the probability of failure p is greater than or equal to a certain quantity is very small. For a given probability p and a given confidence level 1-γ, the number of trials required is given by the equation

(1 - p)^n = γ, that is, n = ln(γ) / ln(1 - p)

For extremely high reliability requirements, it seems appropriate to request a high confidence level. We'll require that the experiment have a confidence level equal to the reliability requirement. Some examples: for p = 1e-6 and γ = 1e-6, n = 1.3816e7; for p = 1e-9 and γ = 1e-9, n = 2.0723e10.
The binomial expression can be extended to handle failures during the trials. For example, the probability of zero or one failure during n trials is

(1 - p)^n + n p (1 - p)^(n-1)

For zero or one failure, establishing reliability when p = 1e-6 and γ = 1e-6 requires that n = 16.7e6.
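Both trial counts can be reproduced numerically (a sketch of the zero-failure and zero-or-one-failure cases; the search step in the second function is a convenience, not part of the derivation):

```python
import math

def trials_zero_failures(p, gamma):
    """Smallest n with (1 - p)^n <= gamma."""
    return math.ceil(math.log(gamma) / math.log(1.0 - p))

def trials_at_most_one_failure(p, gamma, step=1000):
    """Smallest n (to within `step`) with (1-p)^n + n*p*(1-p)^(n-1) <= gamma."""
    n = trials_zero_failures(p, gamma)
    while (1.0 - p) ** n + n * p * (1.0 - p) ** (n - 1) > gamma:
        n += step
    return n

print(trials_zero_failures(1e-6, 1e-6))        # about 1.3816e7
print(trials_zero_failures(1e-9, 1e-9))        # about 2.0723e10
print(trials_at_most_one_failure(1e-6, 1e-6))  # about 16.7e6
```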

Number of Faults During a Single Trial
This part of the design for the experiment determines how many faults to inject during a single trial.
There are several methods of determining this number, but the theory of Poisson renewal processes is the one used below because of its simplicity. Suppose there are N classes of faults where the total fault rates within the classes are a_1, ..., a_N, and let a = a_1 + ... + a_N be the total fault rate of the system. The numbers of faults in the classes that occur during an operating period of length T are independent Poisson random variables, so the probability that m_i faults of type i occur, for i = 1, ..., N, is

[(a_1 T)^{m_1} e^{-a_1 T} / m_1!] ... [(a_N T)^{m_N} e^{-a_N T} / m_N!]    (6)
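The Poisson model gives the distribution of the number of faults per trial directly (a sketch; the single fault class and its rate are taken from the fourplex example in the earlier sections):

```python
import math

def poisson_pmf(k, mean):
    """P{k fault occurrences of a class during the operating period}."""
    return mean ** k * math.exp(-mean) / math.factorial(k)

# Fourplex example: one fault class whose total rate is the sum of the
# four component rates
a = 4 * 1.24e-4   # faults per hour (system total)
T = 2.0           # operating period (hours)

for k in range(3):
    print(f"P({k} faults per trial) = {poisson_pmf(k, a * T):.3e}")
```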

Superfluous fault injections
The fault injection procedure will have to be adapted to the assumption of a constant rate used by the Poisson process. Since the system removes failed components, the failure rate of the system does not remain constant. One method of handling this is to treat the removed components as virtual components. This means that the component is theoretically subject to later fault injections, but in practice these faults will not be injected if the system has already removed the component. If the system has not yet removed the faulty component, then the second fault can be injected into the same component. This double injection checks that the occurrence of a second fault does not interfere with the detection and removal of a faulty component.
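The virtual-component bookkeeping can be sketched as follows (a hypothetical helper, not the paper's procedure; fault times are assumed to have been sampled from the Poisson model, and the recovery time is an assumed constant):

```python
def plan_injections(faults, recovery_time):
    """Apply the virtual-component rule to sampled faults.

    faults: list of (time, component) pairs drawn from the Poisson model.
    recovery_time: how long the system takes to remove a faulty component.
    A fault sampled for an already-removed component is virtual and skipped;
    a second fault arriving before removal completes is injected, to check
    that it does not interfere with detection and removal of the first.
    """
    removal_done = {}   # component -> time its removal completes
    plan = []
    for t, comp in sorted(faults):
        if comp in removal_done:
            if t >= removal_done[comp]:
                plan.append((t, comp, "skip: virtual component"))
            else:
                plan.append((t, comp, "inject: double injection"))
        else:
            plan.append((t, comp, "inject"))
            removal_done[comp] = t + recovery_time
    return plan

# Two faults in A close together, a later virtual fault in A, one fault in C
for step in plan_injections(
        [(0.40, "A"), (0.41, "A"), (1.20, "A"), (1.50, "C")],
        recovery_time=0.05):
    print(step)
```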

Time and order of fault injection
In addition to the convenient formula above for the number of faults, the Poisson renewal process also has a nice distribution for the conditional arrival times. To begin, consider ordered uniform sampling. First choose n samples, say x_1, ..., x_n, from the uniform distribution on the interval [0, T]. Second, order the sample such that x_(1) < ... < x_(n). The probability that x_(1) < s_1, ..., x_(n) < s_n, for 0 < s_1 < ... < s_n < T, is

P{ x_(1) < s_1, ..., x_(n) < s_n } = n! ∫_0^{s_1} ∫_{t_1}^{s_2} ... ∫_{t_{n-1}}^{s_n} (1/T)^n dt_n ... dt_1

We will show that the (properly conditioned) arrival times for any combination of faults during the operating time T are given by the ordered uniform distribution.

Suppose m_i faults of type i have occurred in some specified order with m_1 + ... + m_N = K. Let a_i be the rate of occurrence for faults of type i. Let b_j be the rate of the j-th fault that occurs. Hence,

b_1 b_2 ... b_K = a_1^{m_1} ... a_N^{m_N}

The probability that the K designated faults occur in the designated order, that the j-th fault occurs before time s_j, and that no further faults occur before time T is

∫_0^{s_1} ∫_{t_1}^{s_2} ... ∫_{t_{K-1}}^{s_K} ∫_T^∞ b_1 ... b_K a e^{-a t_{K+1}} dt_{K+1} dt_K ... dt_1    (9)

where, as before, a is the total fault rate of the system. The expression (9) is equal to

b_1 ... b_K e^{-a T} ∫_0^{s_1} ∫_{t_1}^{s_2} ... ∫_{t_{K-1}}^{s_K} dt_K ... dt_1    (10)

The probability that m_i faults of type i have occurred is given by (6). The probability that the faults occur in any order with the j-th fault before time s_j is (10) multiplied by the factor K!/(m_1! ... m_N!), which gives the number of combinatorial arrangements that are indistinguishable; the integral expression gives the probability that a single arrangement occurs in the specified order before the designated times. The product simplifies algebraically to

[(a_1 T)^{m_1} e^{-a_1 T} / m_1!] ... [(a_N T)^{m_N} e^{-a_N T} / m_N!] × K! ∫_0^{s_1} ... ∫_{t_{K-1}}^{s_K} (1/T)^K dt_K ... dt_1

The first factor is the probability that m_i faults of type i occur. The second factor is the probability that the K events occur before times s_1 through s_K in the ordered uniform distribution. Hence, the times of occurrence are independent of the types of faults that occur. As an illustrative computation, the example in the next section will need the probability that three faults are present in the system at one time given that two permanents and one transient fault have occurred during the operating period. Normalizing the operating period to 1 time unit, the probability follows by integrating the ordered uniform distribution over the region where three faults are simultaneously present, giving equations (13) and (14) used in the next section.
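The independence result can be checked empirically (a Monte Carlo sketch with illustrative rates): simulate two classes of Poisson arrivals, condition on the total number of faults observed, and compare the arrival times with the ordered uniform prediction.

```python
import random

random.seed(1)
T = 1.0
a_perm, a_tran = 0.5, 2.0   # illustrative class rates (faults per period)

def arrival_times(rate):
    """Arrival times of a Poisson process with the given rate on [0, T]."""
    t, out = 0.0, []
    while True:
        t += random.expovariate(rate)
        if t > T:
            return out
        out.append(t)

# Condition on exactly two faults (of either type) during the period
# and record the first arrival time
first = []
for _ in range(200_000):
    times = sorted(arrival_times(a_perm) + arrival_times(a_tran))
    if len(times) == 2:
        first.append(times[0])

# Ordered uniform prediction: the first of two points is min(U1, U2),
# whose mean is T/3, regardless of which fault types occurred
mean_first = sum(first) / len(first)
print(f"mean first arrival = {mean_first:.3f}   (ordered uniform: {T/3:.3f})")
```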

Conditioning to Improve Efficiency
This subsection discusses using knowledge about system structure and performance to improve the efficiency of the experiment. As described above, for each trial the number of injected faults is chosen by random sampling. Likewise for the time and place of fault injection.
The fact that the number of trials with k fault occurrences is a random variable introduces two problems. The first, more serious, problem is that it is possible to randomly choose a large number of faults for a single trial. For the fourplex example, any trial with three or more faults has the potential for causing the system to fail. A few such incidents can be handled by sequential testing, but it is more efficient to eliminate them. The second problem is that the random selection can produce a much larger number of trials than expected, increasing the time and expense of the experiment. It can also produce a much smaller number, but this can cause concern about the entire experiment being marginal because of an unusual run of random numbers. Both these problems are handled by partitioning and conditioning.
Conditioning can also be used to exclude nearly coincident fault occurrences. The conservative approach used here is to treat the excluded occurrences as system failures, just as too many occurrences are treated as system failure.
The major concern is to preserve the confidence level of the experiment while conditioning. We need the standard result on combining confidence intervals (a classic theorem [Wilkes]). The procedure divides fault occurrences during a trial (a simulated operating period) into three classes. The first class, labeled BF for benign faults, are those that will not cause system failure. Examples: (1) the occurrence of zero faults during an operating period, (2) the occurrence of a few faults in a non-reconfigurable, Byzantine resilient system, and (3) the occurrence of transient faults sufficiently far apart. The second class, labeled EF for experimental faults, are those to be studied by experiment. The third class, labeled FF for fatal faults, are those we declare to cause system failure upon occurrence.
Examples: (1) an extremely large number of fault occurrences during an operating period, and (2) nuisance faults that are unlikely to occur, may or may not cause system failure, but are difficult to include in the experiment. Of course, the total probability of all faults in class FF must be very small.
There are three comments about this partition. First, all the experimental error is concentrated in the second class. The first class has zero chance of failure at the 100% confidence level. The third class will be assumed to have 100% chance of failure at the 100% confidence level. Second, more refined partitions have been examined, but (for the examples to date) there has been no gain in efficiency. The problem is that the number of trials for the larger number of faults is too small to get good confidence intervals. Third, the description of the partition is a little optimistic about the division between the second and third classes. It assumes that the system is well designed and that the occurrence of more faults than the system can handle is extremely unlikely. The approach taken in this initial attempt to construct a validation experiment is that the experiment will be conducted only if the preliminary probability analysis shows the system is well designed--that the occurrence of more faults than the system can handle is extremely unlikely. Otherwise, the system will be redesigned before the experiment is undertaken.
To begin the derivation, let P{Failure} be the probability of system failure that we are trying to establish at the 100(1-γ)% confidence level.
Suppose P{BF}, P{EF} and P{FF} are the probabilities of the benign, experimental, and fatal fault occurrences as described above. Partitioning gives the conditional probability expansion

P{Failure} = P{Failure | BF} P{BF} + P{Failure | EF} P{EF} + P{Failure | FF} P{FF}    (18)

Since P{Failure | BF} = 0 and P{Failure | FF} = 1,

P{Failure} = P{Failure | EF} P{EF} + P{FF}    (19)

The 0 and the P{FF} on the right side have 100% confidence. The P{Failure} on the left is the reliability requirement. Hence, it is necessary to establish that P{Failure | EF} P{EF} + P{FF} is less than P{Failure} at the 100(1-γ)% level.
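The effect of the partition on the experimental effort can be sketched numerically. The partition probabilities below are hypothetical round numbers in the spirit of the second example, not the paper's exact table values:

```python
import math

p_failure = 1e-9   # overall requirement per operating period
gamma = 1e-9       # confidence parameter

# Illustrative (hypothetical) partition probabilities
P_EF = 1.2e-7      # probability of an experimental fault occurrence
P_FF = 2.5e-10     # probability of occurrences declared fatal outright

# Need P{Failure | EF} * P{EF} + P{FF} < P{Failure}, so the trials must
# establish a conditional failure probability below this bound:
p_cond = (p_failure - P_FF) / P_EF
print(f"conditional failure bound = {p_cond:.3e}")

# Failure-free trials needed to establish p_cond at confidence 1 - gamma
n = math.ceil(math.log(gamma) / math.log(1.0 - p_cond))
print(f"trials required = {n}")   # a few thousand, not billions
```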

SECOND EXAMPLE
This section proposes an architecture and conducts the probability analysis for a system whose requirement is less than one chance in a billion of failure during a ten-hour flight. The analysis includes transient faults. The architecture is a non-reconfigurable sevenplex where each module consists of a computer-on-a-chip plus six transmission lines to the other modules. The system is assumed to have Byzantine resilience, which means the system can tolerate and correctly identify up to two faulty components. The system is also assumed to have transient recovery that removes any effects of a transient fault within 15 seconds. The permanent and transient rates are nominal rates for illustrative purposes. A typical figure for a chip is 1e-6 per hour. The transmission link consists of several devices in series, and is given a rate an order of magnitude greater. A rule of thumb is that transients are an order of magnitude more frequent than permanents. The permanent fault rate for the system is 0.004274 (per 10 hours); the transient rate is 0.04274 (per 10 hours); and the total rate is 0.046974 (per 10 hours).
The requirement is to establish that there is less than one chance in a billion of failure during a ten-hour flight, and to establish this at the 100(1-1e-9)% confidence level. That is, there is less than one chance in a billion that the experiment has misled us. The probability analysis concentrates on the number of fault injections required since reducing this number is necessary for feasibility.
Computational procedure and example

We use two features of the system (together with the derivations in section seven) to reduce the number of trials, and hence the number of fault injections. First, the Byzantine algorithm implies we need consider only those trials that have three or more faults present in the system at one time. Second, the transient fault recovery (of at most 15 seconds) implies most transient fault occurrences will be isolated.
As an example of the latter, we'll compute the probability of three coincident faults given that three transients have occurred. Normalizing the operating period to 1 gives the transient recovery time δ = (15 seconds)/(10 hours) = 1/2400. The expression is

3δ² - 2δ³ = 5.21e-7    (21)

Hence, the probability of three coincident faults given three transients is small. The computation for three coincident faults given 2 permanents and 1 transient is given in section seven by equation (13). The other computations are similar, and will not be displayed.
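The coincidence probability follows from the range distribution of three ordered uniform occurrence times (a sketch; a Monte Carlo check of the range formula at a coarser window is included):

```python
import random

delta = 1.0 / 2400   # (15 s recovery)/(10 h flight), period normalized to 1

# P{three uniform occurrence times all lie within delta of each other}
# = 3*delta^2 - 2*delta^3  (range distribution of three uniform points)
exact = 3 * delta**2 - 2 * delta**3
print(f"P(three coincident | three transients) = {exact:.3e}")  # about 5.21e-7

# Monte Carlo check of the range formula at a coarser window
random.seed(2)
d, trials, hits = 0.05, 200_000, 0
for _ in range(trials):
    u = [random.random() for _ in range(3)]
    if max(u) - min(u) <= d:
        hits += 1
print(f"MC: {hits/trials:.4f}   theory: {3*d*d - 2*d**3:.4f}")
```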
The trial reduction procedure for this system decomposes the fault occurrences into the three classes described in the previous section: (1) the benign occurrences, BF, where the occurrences cannot cause system failure because of the Byzantine algorithm, (2) the experimental occurrences, EF, where it must be established by experiment that the occurrences do not cause system failure, and (3) the fatal occurrences, FF, where it is assumed the occurrences will cause system failure. This decomposition will be carried out in stages.

Large differences among the probabilities require retaining their numerical values to many decimal places. This accuracy is preserved in the tables. The narrative uses approximations.

First step: overall computation

Table 2 displays the probabilities for the occurrence of (i permanents, j transients) during a trial (a ten-hour flight). All the entries labeled P{fault occurrences} are rare events that are placed in the fatal fault, FF, category. All entries with fewer than three faults are placed in the benign fault, BF, category. The rest receive closer examination in the subsections below.
The computation for the probability that three faults are present at one time for (2p,1t) is given by (14).
The numerical value is ≈0.33. Hence, P{BF}=2.5e-7, P{EF}=1.2e-7, and P{FF}=0.

The case (3p,1t) is handled differently. The probability of four coincident faults is ≈0.252. Hence, P{BF}=0, P{EF}=4.0e-10, and P{FF}=1.3e-10. That is, if the transient occurs after the third permanent (or within 15 seconds before the third permanent), then the trial belongs to the FF category because there will be four faults present in the system. Otherwise, the trial will be performed.

Fourth step: analysis for five faults

We examine the cases (0p,5t), (1p,4t), and (2p,3t) and say (as before) that faults are isolated if their occurrence times are further apart than δ. The probability that all five faults are isolated given that five faults have occurred is ≈0.9979. Applying this to the cases (0p,5t) and (1p,4t), we will say that there is a benign fault occurrence if all faults are isolated, since this implies two or fewer faults present at any time. Otherwise it is considered a fatal fault occurrence. For the case (0p,5t) this gives P{BF}=1.1e-9, P{EF}=0, and P{FF}=2.2e-12. For (1p,4t) it gives P{BF}=5.6e-10 and P{EF}=0. Applying this to the case (2p,3t), we will consider any occurrence where the faults are not isolated as fatal. We will still conduct trials, but only accept isolated fault occurrences. Hence there will never be more than three faults present in the system. For (2p,3t) this gives P{BF}=0, P{EF}=1.1e-10, and P{FF}=2.2e-13.

Summary of the analysis

Table 3 summarizes the three subsections above. Table 4 gives the fault category probabilities. Using formula (20) gives the total number of trials as n = 3439. The expected number of trials for each fault condition is given in table 5. Only a moderate effort is required because the system was designed for validation.
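The occurrence probabilities behind Table 2 can be regenerated from the stated system rates (a sketch; small differences from the paper's table values come from rounding in the quoted rates):

```python
import math

lam_p = 0.004274   # permanent faults per ten-hour flight (system total)
lam_t = 0.04274    # transient faults per ten-hour flight (system total)

def pmf(k, mean):
    """Poisson probability of k occurrences with the given mean."""
    return mean ** k * math.exp(-mean) / math.factorial(k)

# P{ i permanents and j transients during one flight }
for i in range(3):
    row = [pmf(i, lam_p) * pmf(j, lam_t) for j in range(4)]
    print(f"{i} perm:", "  ".join(f"{v:.3e}" for v in row))
```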

SUMMARY
This paper outlines a validation procedure that consists of integrating various elements: field data, system diagnostics, proofs and tests, and fault injection experiments. There has been considerable effort in all these areas, but no previous attempt to rigorously establish ultra reliability, perhaps because such an effort has appeared impossible. The exposition above demonstrates it may be within reach. It depends on a careful integration of the various elements. The main topics above are a discussion of integrating the elements and a derivation of results to reduce the fault injection effort.
The major point is dividing validation into two parts: structural and stochastic. The structural part consists of proofs and tests that the system operates correctly if all the components are fault-free. It also includes proofs and tests that the system can, up to a point, identify and tolerate misbehaving components. It appears, however, to be beyond the reach of structural arguments to predict successful reconfiguration or demonstrate the system can handle a large number of faults. These items are more suitable for demonstration by experiment. It is shown how to demonstrate this at an extremely high confidence level.