An integrated approach to system design, reliability, and diagnosis

Two tools for engineering analyses of highly reliable systems, one for quantitative reliability evaluation and the other for fault diagnosis, have been developed based on an object-oriented representation of fault trees. The fault trees serve as a central knowledge base for the integrated tool set, ensuring that consistent design information is used in both procedures. The tools have a graphical interface for data entry and the display of results, and enable the engineer to modify system models easily and understand the effects of the changes quickly. The availability of the models in an accessible form improves the design process by eliminating redundant model development in various stages of the lifecycle. The object-oriented models are particularly useful since they are easily modified to characterize various aspects of system behavior, promoting the development of additional analysis tools that will access the same knowledge base. The proposed approach is illustrated with reference to a representative subset of the Space Station Freedom data management system, which consists of three subsystems connected by a token ring network.


SUMMARY
The requirement for ultradependability of computer systems in future avionics and space applications necessitates a top-down, integrated systems engineering approach for design, implementation, testing, and operation. The functional analyses of hardware and software systems must be combined by models that are flexible enough to represent their interactions and behavior. The information contained in these models must be accessible throughout all phases of the system life cycle in order to maintain consistency and accuracy in design and operational decisions. One approach being taken by researchers at Ames Research Center is the creation of an object-oriented environment that integrates information about system components required in the reliability evaluation with behavioral information useful for diagnostic algorithms.
Procedures have been developed at Ames that perform reliability evaluations during design and failure diagnoses during system operation.
These procedures utilize information from a central source, structured as object-oriented fault trees. Fault trees were selected because they are a flexible model widely used in aerospace applications and because they give a concise, structured representation of system behavior. The utility of this integrated environment for aerospace applications in light of our experiences during its development and use is described.
The techniques for reliability evaluation and failure diagnosis are discussed, and current extensions of the environment and areas requiring further development are summarized.

INTRODUCTION
The presentation of information about the design and operation of complex systems is a central issue in the development of information systems that support all phases of a system's life cycle.
Engineers use various models of the system's configuration and behavior as they move through requirements specification, design, manufacturing, assembly, and integration, and on to operation of the system. Some models support the analysis of subsystems, whereas others facilitate the assess-
The details of this approach are presented in reference 2 and will not be repeated here, but the following example will illustrate key features of the evaluation process.

Example 1
The fault tree shown in figure 1 is a moderate-size tree with several repeated basic events and a repeated subtree. Events 28 and 19 appear more than once in the tree, and the entire subtree with OR-GATE 20 as the top node appears below both AND-GATE 2 and AND-GATE 3. The triangular symbol shown in the diagram is a transfer gate, in fault tree terminology, and is used to indicate a continuation in the tree structure. The repeated subtree is not statistically independent of the rest of the tree structure because event 19 appears outside the subtree structure as well.
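The effect of a repeated event on statistical independence can be seen in a small numeric sketch. The tree below is a hypothetical three-event tree, not the tree of figure 1, and the probabilities are invented for illustration: TOP = AND(OR(a, c), OR(b, c)), with event c repeated under both OR gates. Brute-force enumeration is used here only as a reference answer; it is not the evaluation method of reference 2.

```python
from itertools import product

# Hypothetical basic-event failure probabilities (illustrative values only).
p = {"a": 0.1, "b": 0.2, "c": 0.05}

def top_occurs(state):
    # TOP = AND(OR(a, c), OR(b, c)); event c is repeated under both OR gates.
    return (state["a"] or state["c"]) and (state["b"] or state["c"])

def exact_probability():
    # Sum the probability of every basic-event state in which TOP occurs.
    names = sorted(p)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= p[name] if state[name] else 1.0 - p[name]
        if top_occurs(state):
            total += weight
    return total

def naive_probability():
    # Multiplies the two OR-gate probabilities as if the gates were independent,
    # which is not valid when they share event c.
    or1 = 1.0 - (1.0 - p["a"]) * (1.0 - p["c"])
    or2 = 1.0 - (1.0 - p["b"]) * (1.0 - p["c"])
    return or1 * or2

exact = exact_probability()
naive = naive_probability()
```

With these assumed values the naive product underestimates the top-event probability, because the shared event makes the two OR gates positively correlated.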

FAULT DIAGNOSIS
FTDS is based on the failure-cause identification process of the diagnostic system described by Narayanan and Viswanadham (ref. 9). Their system has been enhanced in the present implementation by replacing the knowledge base of if-then rules with an object-oriented fault tree representation. This allows the system to perform its task much faster and facilitates dynamic updating of the knowledge base in a changing diagnostic environment. Accessing the information contained in the objects is more efficient than performing a lookup operation on an indexed rule base. Additionally, the object-oriented fault trees can be easily updated to represent the current system status.

Rules As Objects
Narayanan and Viswanadham suggested that the rule base for the diagnostic system be constructed directly from a fault tree representation of the system to be diagnosed.
In their system each fault tree gate is converted to an if-then rule. An AND gate becomes a rule with a conjunction of the child events of the gate as an antecedent and the output event of the gate as a consequent.
An OR gate has a disjunction rather than a conjunction in the antecedent. The rules are stored as text in a data base and when the system needs a rule it must perform a rule base lookup by using a failure event as the lookup key. This involves a significant amount of processing overhead as the system performs data base access and pattern matching. FTDS reduces this overhead considerably by representing rules as objects.
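The gate-to-rule conversion described above can be sketched as follows. The textual rule format and the function name `gate_to_rule` are assumptions made for illustration; the text does not give the rule syntax used by Narayanan and Viswanadham. The children listed for the DMS gate other than cluster C are likewise assumed.

```python
def gate_to_rule(gate_type, output_event, child_events):
    """Return an if-then rule string for one fault tree gate.

    An AND gate yields a conjunction of its child events in the antecedent;
    an OR gate yields a disjunction. The gate's output event is the consequent.
    """
    if gate_type not in ("AND", "OR"):
        raise ValueError(f"unsupported gate type: {gate_type}")
    joiner = f" {gate_type} "
    antecedent = joiner.join(child_events)
    return f"IF {antecedent} THEN {output_event}"

# Example gates; the event names follow the DMS example, the structure is assumed.
rule_and = gate_to_rule("AND", "NIU Cluster C", ["NIU-C1", "NIU-C2"])
rule_or = gate_to_rule("OR", "DMS", ["Cluster A", "Cluster B", "Cluster C"])
```

Storing such strings and matching them against a failure event at run time is the overhead that the object representation is intended to avoid.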
FTDS objects currently contain the same information as the if-then rules used by Narayanan and Viswanadham, but can easily be expanded for additional capability.
The event name stored in the object is the consequent of the rule. The rule antecedent is found by following the children pointers in the object to the antecedent events. New slots added to the fault tree objects to hold additional parameters that are needed by the diagnostic routine are described in the next section. In the current implementation of FTDS the objects are stored in a hash table and referenced by event name. This scheme allows very quick reference to a rule, given its consequent event, and easy retrieval of rules containing a given event in their antecedent. To find the rule object with failure event E as its consequent, all that is required is a single hash table lookup. To find rules with event E in their antecedent, the system only needs to look up the object for E and follow the pointers contained in that object's parent slot.
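A minimal sketch of this storage scheme is given below, using a Python dictionary as the hash table. The class name, slot names, and helper functions are hypothetical and are not taken from the FTDS implementation; the event names G1, G2, E, F, and H are placeholders.

```python
class FaultTreeObject:
    def __init__(self, name, gate=None):
        self.name = name      # consequent of the rule
        self.gate = gate      # "AND", "OR", or None for a basic event
        self.children = []    # antecedent events of this rule
        self.parents = []     # rules whose antecedent contains this event

table = {}  # hash table of fault tree objects, keyed by event name

def add_event(name, gate=None, children=()):
    # Create (or reuse) the object for this event and link it to its children.
    obj = table.setdefault(name, FaultTreeObject(name))
    obj.gate = gate
    for child_name in children:
        child = table.setdefault(child_name, FaultTreeObject(child_name))
        obj.children.append(child)
        child.parents.append(obj)
    return obj

def rule_for(event):
    # Rule with `event` as its consequent: a single hash table lookup.
    return table[event]

def rules_containing(event):
    # Rules with `event` in their antecedent: one lookup, then the parent slot.
    return table[event].parents

# Event E is repeated: it appears in the antecedents of both G1 and G2.
add_event("G1", "AND", ["E", "F"])
add_event("G2", "OR", ["E", "H"])
antecedent_of_g1 = [c.name for c in rule_for("G1").children]
rules_with_e = [r.name for r in rules_containing("E")]
```

Because a repeated event is stored once and referenced by name, both of its parent rules are reachable from the same object.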

Object Descriptions
The information used by the diagnostic system that needs to be added to the fault tree objects includes contributory factors, or C-factors, and time intervals. The C-factor associated with a failure event in a fault tree is an heuristic measure of the likelihood that the occurrence of the parent fault of
This information can be entered into the environment using the graphical fault tree editor described above, which generates fault tree object descriptions.

Diagnosis Process
The diagnostic system is initially given information about the system being diagnosed in the form of normal and abnormal alarms (the nomenclature used by Narayanan and Viswanadham).
Each possible alarm corresponds to a system failure event and is referred to by the name of that event. A normal alarm indicates that the failure event it is monitoring has not occurred, and an abnormal alarm indicates that the failure event has occurred. In addition, each abnormal alarm includes the suspected time at which the failure event occurred, and each normal alarm includes the latest time the specified event was known to have not occurred.
The diagnostic process is initiated by specifying the estimated time of occurrence of a failure, the current set of normal alarms and the time that each normal alarm was last confirmed, and a set of abnormal alarms with estimated failure times. The diagnosis begins by inferring the relevant failure events that must have occurred and those that could not have occurred based on the information in the normal and abnormal alarm sets. The alarm sets are updated accordingly.
The system uses the alarm sets to guide its search of the diagnosis space. It does not consider those portions of the diagnosis space with diagnoses containing sets of basic failure events that would cause the occurrence of a failure in the normal alarms set. Also, those portions of the search space with diagnoses containing abnormal alarms are searched early in the diagnosis process. The system also checks possible diagnoses for temporal and causal consistency. The time-of-occurrence information provided for each alarm is used to propagate temporal constraints throughout the fault tree.
Using the abnormal alarm information, the system selects starting points for the diagnostic process, and builds constraint sets that help to narrow the diagnosis search space. After this information has been gathered, the system uses heuristically driven backward chaining to find a set of basic
Suppose that at time 10 the DMS system goes down. A record of sensor data provides the information that there was a failure in cluster C at time 8, and it was known that the cluster-C NIUs were functioning at time 8 and that cluster A was functioning at time 9.5. The initial normal alarms set is {(Cluster A, 9.5) (NIU Cluster C, 8)}, and the initial abnormal alarms set is {(Cluster C, 8)}. With this information the diagnostician reasons from the fault tree that the failure of cluster C was sufficient to cause the failure of the entire DMS system. This conclusion is reached by considering the object representing the DMS system failure. That object is an OR gate with a child pointer to the object representing the failure of cluster C. Since it is known that cluster C failed, it is assumed that its failure is the cause of the DMS failure. The accuracy of this assumption depends, of course, on the completeness of the fault tree in representing all possible causes of system failure. Any failures inferred in this way from information in the abnormal alarms set will be added to the abnormal alarms set along with their estimated failure times.
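The inference from the abnormal alarms set can be sketched as a simple fixed-point iteration over the gates. The tree fragment, the propagation time of 2, and the rule used to estimate the failure time of an inferred event (child time plus the gate's propagation time) are assumptions chosen so that the sketch reproduces the times in the example; the text does not state how FTDS estimates these times.

```python
# Hypothetical fragment: event -> (gate type, child events, propagation time).
tree = {
    "DMS": ("OR", ["Cluster A", "Cluster B", "Cluster C"], 2.0),
}

def infer_abnormal(tree, abnormal):
    """Return the abnormal alarms set enlarged with every implied gate output."""
    inferred = dict(abnormal)
    changed = True
    while changed:
        changed = False
        for event, (gate, children, delay) in tree.items():
            if event in inferred:
                continue
            failed = [c for c in children if c in inferred]
            if gate == "OR" and failed:
                # One failed child is sufficient; use the earliest one (assumed rule).
                inferred[event] = min(inferred[c] for c in failed) + delay
                changed = True
            elif gate == "AND" and len(failed) == len(children):
                # All children must have failed; use the latest one (assumed rule).
                inferred[event] = max(inferred[c] for c in failed) + delay
                changed = True
    return inferred

abnormal = infer_abnormal(tree, {"Cluster C": 8.0})
```

Under these assumptions the single abnormal alarm (Cluster C, 8) yields the inferred entry (DMS, 10), matching the observed time at which the DMS went down.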
The diagnostic system then uses the information in the normal alarms set to determine which other failure events have not occurred.
Since the cluster-C NIU system is represented by an AND gate and the cluster-C NIUs were known to be functioning at time 8, it reasons that all the child events of the object representing that gate could not have occurred at a time before time 8 minus the error-propagation time recorded in the time-interval slot in the cluster-C object. In other words, at time 8 - t, where t is the error-propagation time, the system knows that both NIU-C1 and NIU-C2 were working properly. Similar reasoning is done based on the fact that cluster A is in the normal alarms set. By backward chaining from this fact, the system infers that the cluster-A NIUs are functioning properly, that the processors in cluster A are all running correctly, and that the information from cluster B is reaching cluster A. This backward chaining continues until all relevant failure events in the fault tree that could not have occurred are recognized. These events are added to the normal alarms set, and the information in that set is used to guide the diagnosis. Notice that the information obtained in backward chaining from a normal alarm is not necessarily restricted to the branch of the fault tree under the original normal alarm. When a given failure event can contribute to the cause of more than one other failure event it will appear in several places in the fault tree. For instance, in this case it is determined that cluster B is operating correctly since cluster A is receiving information from it. This inference provides the additional information that the failure in cluster C was not caused by a loss of needed information from cluster B resulting from cluster B being down. Such repeated events can help narrow the diagnosis search space considerably.
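The backward chaining from normal alarms can be sketched as below. The tree fragment, the event name "Info from Cluster B", and the propagation times are illustrative assumptions. The sketch applies the rule the text states for the cluster-C NIU gate (children known to be working at the alarm time minus t) to every gate; whether FTDS treats AND and OR gates differently at this step is not stated.

```python
# Hypothetical fragment: event -> (child events, error-propagation time t).
tree = {
    "Cluster A": (["NIU Cluster A", "Processors A", "Info from Cluster B"], 0.5),
    "NIU Cluster C": (["NIU-C1", "NIU-C2"], 1.0),
    "Cluster C": (["NIU Cluster C", "Processors C", "Info from Cluster B"], 2.0),
}

def propagate_normal(tree, normal_alarms):
    """Backward chain from normal alarms; return event -> latest known-good time."""
    known_good = dict(normal_alarms)
    stack = list(normal_alarms.items())
    while stack:
        event, time = stack.pop()
        if event not in tree:
            continue  # basic event: nothing below it to chain through
        children, t = tree[event]
        for child in children:
            child_time = time - t
            # Keep the latest time at which the child was known to be working.
            if child_time > known_good.get(child, float("-inf")):
                known_good[child] = child_time
                stack.append((child, child_time))
    return known_good

normal = propagate_normal(tree, {"Cluster A": 9.5, "NIU Cluster C": 8.0})
```

Because events are keyed by name, the repeated event "Info from Cluster B" is marked normal once, from the cluster-A branch, and is then also excluded as a cause wherever else it appears, including under cluster C.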
Now that the diagnostic system has determined a high-level cause of the DMS failure, as well as which failures definitely have and have not occurred, it goes on to find a set of basic events that were a likely cause of the top-level failure. In this case it starts by backward chaining from the cluster-C failure event. The first event it considers as a cause for the cluster-C failure is the failure of the cluster-C processors. This is because the cluster-C processors node has the highest C-factor of all of cluster C's children that are not contained in the normal alarms set. Continuing the reasoning from there, the diagnostic system reaches the conclusion that at least one of the cluster-C processors must have failed sometime before time 6, and that this failure propagated through the system and caused the entire DMS system to fail at time 10.
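The choice of the next event to examine can be sketched as a ranking of a gate's children by C-factor after removing those in the normal alarms set. The C-factor values and the child list below are invented for illustration, and the function name `ranked_candidates` is hypothetical.

```python
def ranked_candidates(children, c_factor, normal_alarms):
    """Return children not in the normal alarms set, highest C-factor first."""
    candidates = [c for c in children if c not in normal_alarms]
    return sorted(candidates, key=lambda c: c_factor[c], reverse=True)

# Assumed children of the cluster-C failure event and assumed C-factors.
children_of_cluster_c = ["NIU Cluster C", "Processors C", "Info from Cluster B"]
c_factor = {"NIU Cluster C": 0.3, "Processors C": 0.6, "Info from Cluster B": 0.8}

# Events already known not to have occurred are excluded from the search.
normal_alarms = {"NIU Cluster C", "Info from Cluster B"}

order = ranked_candidates(children_of_cluster_c, c_factor, normal_alarms)
first = order[0] if order else None
```

In this sketch the child with the largest assumed C-factor is skipped because it is in the normal alarms set, so the cluster-C processors are examined first, as in the example.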

FUTURE WORK
Current efforts are being focused on specific aspects of the evaluation of the reliability of inte-
and correlated faults must be taken into account, and the appropriate model in that case is the Markov model (refs. 10, 11). Methods for including Markov models for the software components in the fault tree model of the overall system are also being studied.