A systematic risk management approach employed on the CloudSat project

The CloudSat Project has developed a simplified approach for fault tree analysis and probabilistic risk assessment. A system-level fault tree has been constructed to identify credible fault scenarios and failure modes leading up to a potential failure to meet the nominal mission success criteria. Risk ratings and fault categories have been defined for each low-level event (failure mode) and a streamlined probabilistic risk assessment has been completed. Although this technique or process will mature and evolve on a schedule that emphasizes added value throughout the development life cycle, it has already served to confirm that project personnel are concentrating risk reduction or elimination/retirement measures in the appropriate areas. A cursory evaluation with an existing fault tree analysis and probabilistic risk assessment software application has helped to validate this simplified approach. It is hoped that this will serve as a model for other NASA flight projects.


INTRODUCTION
As the prime NASA (National Aeronautics and Space Administration) center for the unmanned, robotic exploration of space, the Jet Propulsion Laboratory has employed a number of traditional techniques to identify and mitigate risks that could have deleterious effects on overall mission success.
Typically, extensive ground-based verification and validation testing is combined with reliability evaluation techniques such as worst-case analyses and failure modes, effects, and criticality analyses.
In an effort to improve overall mission success probability, NASA now requires each project to enhance and augment its standard set of risk identification methods and mitigation procedures to include preparing both a system-level fault tree analysis and probabilistic risk assessment. The fault tree analysis provides for a systematic trace of the failure modes leading up to an undesired top-level event (i.e. mission failure); and the probabilistic risk assessment provides for a quantitative comparison of potential risks, so that trade studies can be conducted and mitigation options examined.

Purpose
All space flight projects are challenged with developing and/or following a technique or approach that accommodates the enhanced risk mitigation initiative consistent with NASA's overall intent. To this end, the CloudSat Project has developed a simplified approach to both fault tree analysis and probabilistic risk assessment that enhances or augments the overall risk management program. This fault tree will evolve and mature to include links to worst-case analyses and failure modes, effects, and criticality analyses. Risks that can be reduced or retired through spacecraft system design changes (to include the addition of physical and/or functional redundancy) receive the most visibility at this stage of project development. Risks that remain after the design baseline is finalized are later addressed through development of contingency plans.
In addition, any remaining, miscellaneous residual risks are also included on the project's significant risk list for close monitoring. This process will be completed on a schedule that emphasizes added value throughout the entire development life cycle, and takes into consideration both cost and schedule constraints. Before delving directly into CloudSat specifics, a general introduction to both fault tree analysis and probabilistic risk assessment is provided below.

Fault Tree Definition
A fault tree is a graphical representation of the known faults or combinations of faults that will result in an undesired top-level event. Subordinate faults are linked through a series of logic "gates" that are similar to the logic gates that are frequently used in a typical engineering analysis. These gates permit or inhibit fault propagation to the next higher level. There are a number of different logic gates, but two of the more frequently used are the OR gate and the AND gate. The OR gate is used to indicate that output to a fault event or transfer function occurs if one or more of the input events occurs. The AND gate is used to indicate that the output event occurs only if all input events occur. Fault tree generation and analysis is regarded as a top-down, systematic approach that entails the use of deductive reasoning. This approach involves the identification of a general top-level event and then the development of a detailed set of possible causal events that eventually surface or manifest themselves as the top-level event. As with many other types of analysis, this is a qualitative technique requiring one to understand the environment and the operations of the system, subsystem, and/or assembly being examined, so as to identify only credible scenarios. Identification and analysis of unrealistic events with only a remote possibility of occurrence may not only compromise the validity of results, but also overburden usually constrained cost and schedule resources.
U.S. Government work not protected by U.S. copyright.
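The OR and AND gate behavior described above can be sketched in a few lines of Python. This is only an illustration, not the CloudSat tool: the event names and the shape of the tree are hypothetical, chosen to show how a gate permits or inhibits fault propagation to the next higher level.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Event:
    """A lowest-level event; its state is looked up by name."""
    name: str


@dataclass
class Gate:
    """A logic gate that combines subordinate events or gates."""
    kind: str  # "OR" or "AND"
    inputs: List[Union["Gate", Event]] = field(default_factory=list)


def occurs(node: Union[Gate, Event], state: Dict[str, bool]) -> bool:
    """Return True if the fault at this node propagates upward."""
    if isinstance(node, Event):
        return state[node.name]
    results = [occurs(child, state) for child in node.inputs]
    if node.kind == "OR":
        return any(results)   # one or more inputs occur
    if node.kind == "AND":
        return all(results)   # every input must occur
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree: the top-level event occurs if a single-string unit
# fails, OR if both halves of a redundant pair fail.
top = Gate("OR", [
    Event("single_string_unit_fails"),
    Gate("AND", [Event("primary_unit_fails"), Event("backup_unit_fails")]),
])

one_side_failed = {
    "single_string_unit_fails": False,
    "primary_unit_fails": True,
    "backup_unit_fails": False,
}
print(occurs(top, one_side_failed))
```

With only one half of the redundant pair failed, the AND gate inhibits propagation and the top-level event does not occur; failing the backup as well, or the single-string unit alone, lets the fault through.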
Fault tree analysis enhances overall risk management by bringing to light likely, potential "show-stoppers". This technique is most effective when done early in the development life cycle (e.g. adding physical redundancy), but may still add significant value when completed later (e.g. development of contingency plans). More recently it has been used for the functional analysis of highly complex systems, evaluating system reliability, evaluating software interfaces, and identification of potential design defects and safety hazards. An example fault tree taken from the Reliability Toolkit: Commercial Practices Edition [2] is shown in Figure 1. It shows the logical linking of events leading up to elevator passenger injury. One can readily see that passenger injury can result from one of two events: the elevator car (box) free falls or the elevator door opens without the car being present. Let's examine the fault tree more closely. In the case where the car free falls, there are three subordinate or causal faults: the cable slips off the pulley, the holding brake fails, and the cable breaks. A diamond symbol is used for the first and third faults to denote them as "undeveloped events", which means that even though the faults can be further decomposed they are regarded as the lowest level of examination for this purpose. In contrast, the "holding brake failure" has been decomposed to the point of identifying three basic events denoted by the circle symbol: worn friction material, stuck brake solenoid, and the control unit disengages the brake. One additional symbol that has not yet been described is the pentagon or "house". It contains a normal system operating input, but is regarded as an external event.
Finally, even though fault tree generation was regarded as an art form for a number of years after its inception, it has been demonstrated time and time again that the most accurate fault trees appear to conform to a set of guidelines. The Nuclear Regulatory Commission's Fault Tree Handbook [3] states five basic ground rules for fault tree construction, which are given below.
Complete-the-Gate Rule - All inputs to a particular gate should be completely defined before further analysis of any one of them is undertaken.

Probabilistic Risk Assessment Definition
Although fault tree analysis yields significant benefits it has limitations, not the least of which is the inability to predict likelihood of occurrence. Probabilistic risk assessment, which is a quantitative analysis technique, provides a complement to the more qualitative fault tree analysis. This technique involves the creation of a reliability model that utilizes the fault tree as an input. Expert judgement, operational experience, test data, and/or analysis are used to assign probabilities of occurrence and standard deviations. Probabilities are then assessed individually or combined according to the fault tree to identify "weak spots" and where to concentrate reliability options. Therefore, the key benefit in completing a probabilistic risk assessment is that it assists personnel with comparison or trade studies, so that resources can be allocated accordingly.
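One common way to combine probabilities "according to the fault tree" is shown in the sketch below. It assumes that the input events of each gate are statistically independent, which the text above does not state, and it ignores the standard deviations mentioned there. The function names are illustrative only.

```python
from math import prod
from typing import Sequence


def and_gate_probability(probabilities: Sequence[float]) -> float:
    """Probability that ALL independent input events occur."""
    return prod(probabilities)


def or_gate_probability(probabilities: Sequence[float]) -> float:
    """Probability that AT LEAST ONE independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Assumed example: a redundant pair in which each unit fails with
# probability 1E-3, next to a single-string unit that fails with 1E-3.
redundant_pair = and_gate_probability([1e-3, 1e-3])
top_level = or_gate_probability([1e-3, redundant_pair])
print(f"redundant pair: {redundant_pair:.3e}, top level: {top_level:.3e}")
```

For small input probabilities the OR gate result is very close to the simple sum of its inputs, which is why a rare event approximation is often acceptable in practice.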

MISSION ASSURANCE PROGRAM
The objective of the CloudSat Project's mission assurance program is to identify, communicate, control, and mitigate potential risks to mission success. The challenge lies in achieving this objective in an efficient, yet effective manner that successfully accommodates the dual realities of finite program resources and high customer expectations.
The traditional approach to improving the efficiency of a project's risk mitigation activity focuses on streamlining planned verification and validation testing activities. This path is straightforward, conveniently lends itself to logic and the application of lessons learned, and usually produces tangible, quantifiable results. Often overlooked are the unrealized, potential gains in risk mitigation that may be obtained by improving upon analytical tools such as system-level fault tree analysis and probabilistic risk assessment. In the following sections we describe the CloudSat Project's attempts to improve these tools and their methods of application.

SYSTEM-LEVEL FAULT TREE
In response to the NASA directive requiring the use of formal risk management processes and technologies, the CloudSat Project will construct a system-level fault tree to identify credible fault scenarios and failure modes leading up to a postulated failure to meet the nominal mission objectives. Fault scenarios and failure modes to be considered include, but are not limited to, interface faults (i.e. fault propagation), computer logic errors, environmental exposure, test, and test configuration errors.
The high-level procedure for constructing the system-level fault tree is described below. Since construction of a system-level fault tree involves the definition of a considerable number of events and logic gates, the fault tree itself could not fit on a single English standard 8-1/2 in x 11 in or metric A4 sheet of paper. In addition, the use of multiple sheets could lead one to become lost in a sea of paper. Finally, constructing the fault tree on larger "poster" size paper does not lend itself to portability and convenience. Therefore, a decision was made to construct the tree using a hyper-linked version of the Microsoft PowerPoint software application. This allowed each of the events and logic gates to remain legible and allowed one to move up and down the tree with relative ease using the hyperlinks.
In the case of the spacecraft, the fault would originate from either the spacecraft bus or the payload instrument. If we take the spacecraft branch, this will lead us to the specific subsystems shown in Figure 3. From this point, we will investigate three subsystem faults in more detail: a power subsystem fault (negative power balance), a propulsion subsystem fault, and an attitude determination and control subsystem fault to assist in providing a better understanding of the subordinate fault scenarios and low-level failure modes. Figure 4 shows the negative electrical power balance fault tree branch. One can see that in addition to the solar array, battery, and electronics fault scenarios, a test configuration fault case involving the application of external power is also included for completeness. The solar array fault scenario contains primarily active mechanical failure modes, while the three other intermediate fault scenarios contain primarily active electrical failure modes. Events denoted by the circle again are those that are basic, low-level failure modes, while

PROBABILISTIC RISK ASSESSMENT
In response to the NASA directive requiring the use of formal risk management processes and technologies, the CloudSat Project will complete a relativistic probabilistic risk assessment to assist with the identification of residual risks and the application of 'appropriate' risk reduction or elimination/retirement actions.
The probabilistic risk assessment will be conducted at the subsystem level. The high-level procedure for completing the probabilistic risk assessment is described below.
"0": There is a 1E0 or one-in-one probability of occurrence for this completely deterministic event.
"1": There is a 1E-1 or one-in-ten probability of occurrence for this event.
"2": There is a 1E-2 or one-in-one-hundred probability of occurrence for this event.
"3": There is a 1E-3 or one-in-one-thousand probability of occurrence for this event. This is the threshold for 'active' mechanical components (e.g. actuators and springs).
"4": There is a 1E-4 or one-in-ten-thousand probability of occurrence for this event. This is the threshold for 'active' electrical components (e.g. relays and transistors).
"5": There is a 1E-5 or one-in-one-hundred-thousand probability of occurrence for this event.
"6": There is a 1E-6 or one-in-one-million probability of occurrence for this event. This is the threshold for 'passive' mechanical components (e.g. structural members).
IV. Actions Taken - State any risk reduction or elimination measure(s) already taken, if any.
V. Fault Categorization - Determine the appropriate fault category. It will be evident later that this is a key step in streamlining the overall process. Definitions for each of the four fault categories are derived from the Nuclear Regulatory Commission's Fault Tree Handbook [3] and are provided below.
"Level I": A negligible fault producing no impact to system performance or capability/capacity.
"Level II": A fault reducing overall system capacity, but not impacting on-line performance. An example of this is a failure of a physically redundant unit.
"Level III": A fault reducing overall system capacity and degrading on-line performance. An example of this is dual versus single-star tracker operations. The latter scenario will result in reduced knowledge and pointing accuracy.
"Level IV": A catastrophic fault rendering the system being analyzed completely non-operational.

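A minimal sketch of how a low-level failure mode might be recorded with its risk rating, actions taken, and fault category follows. The class and field names, the example failure modes, and their ratings are all assumptions made for illustration; only the 0 to 6 rating scale and the four category levels come from the text above.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class FaultCategory(IntEnum):
    LEVEL_I = 1    # negligible, no impact
    LEVEL_II = 2   # reduced capacity, on-line performance unaffected
    LEVEL_III = 3  # reduced capacity and degraded on-line performance
    LEVEL_IV = 4   # catastrophic, system non-operational


@dataclass(frozen=True)
class FailureMode:
    name: str
    risk_rating: int            # 0..6, i.e. probability 1E-0 .. 1E-6
    category: FaultCategory
    actions_taken: str = ""

    def __post_init__(self) -> None:
        if not 0 <= self.risk_rating <= 6:
            raise ValueError("risk rating must be between 0 and 6")

    @property
    def probability(self) -> float:
        """Rough order of magnitude probability implied by the rating."""
        return 10.0 ** -self.risk_rating


def level_iv_only(modes: List[FailureMode]) -> List[FailureMode]:
    """Keep only the failure modes that need further examination."""
    return [m for m in modes if m.category is FaultCategory.LEVEL_IV]


# Hypothetical entries; names and ratings are not taken from the tables.
modes = [
    FailureMode("redundant_unit_failure", 4, FaultCategory.LEVEL_II,
                "physical redundancy added"),
    FailureMode("single_star_tracker_operations", 4, FaultCategory.LEVEL_III),
    FailureMode("hypothetical_actuator_failure", 3, FaultCategory.LEVEL_IV),
]
for mode in level_iv_only(modes):
    print(mode.name, mode.probability)
```

Filtering on the category at this point is what keeps the later steps small: only the Level IV entries are carried forward.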
For Each Level IV Fault...
Since we are only interested in faults that would cause a potential failure to meet the nominal mission success criteria, only level IV faults need be examined further. The steps to be taken at this stage are as follows.
VI. Minimal Cut Sets - Identify "minimal cut sets", each one being an intersection of one or more events (lowest level failure modes), and calculate the failure probability for each. In this case, the latter would simply require that the rough order of magnitude failure probabilities of the events in a given cut set be multiplied together.
VII. Rare Event Approximation -Use rare event approximation to calculate the failure probability at the subsystem level (i.e. the sum of minimal cut set probabilities).
VIII. Levels of Importance - Calculate the relative quantitative importance of each minimal cut set (i.e. the ratio of the minimal cut set failure probability and the sum).
IX. Focus Areas -Highlight minimal cut sets with the highest relative quantitative importance percentages, and focus reliability actions on these items.
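Steps VI through IX can be sketched as follows. The event names, risk ratings, and cut sets are hypothetical and are not taken from the CloudSat tables; the arithmetic follows the steps as written (product within a cut set, sum across cut sets, ratio for importance).

```python
from math import prod
from typing import Dict, List, Sequence, Tuple


def rating_to_probability(risk_rating: int) -> float:
    """A risk rating of n corresponds to a probability of 1E-n."""
    return 10.0 ** -risk_rating


def cut_set_probability(cut_set: Sequence[str],
                        ratings: Dict[str, int]) -> float:
    """Step VI: multiply the probabilities of the events in one cut set."""
    return prod(rating_to_probability(ratings[event]) for event in cut_set)


def assess(cut_sets: List[Tuple[str, ...]], ratings: Dict[str, int]):
    probabilities = [cut_set_probability(cs, ratings) for cs in cut_sets]
    # Step VII: rare event approximation, sum of cut set probabilities.
    subsystem_probability = sum(probabilities)
    # Step VIII: importance is each cut set's share of that sum.
    # Step IX: sort so the largest contributors come first.
    ranked = sorted(zip(cut_sets, probabilities),
                    key=lambda item: item[1], reverse=True)
    report = [(cs, p, p / subsystem_probability) for cs, p in ranked]
    return subsystem_probability, report


# Hypothetical inputs: three single-event cut sets and one two-event cut set.
ratings = {"actuator": 3, "spring": 3, "relay": 4, "unit_a": 2, "unit_b": 2}
cut_sets = [("actuator",), ("spring",), ("relay",), ("unit_a", "unit_b")]

total, report = assess(cut_sets, ratings)
print(f"subsystem failure probability ~ {total:.2e}")
for cut_set, probability, share in report:
    print(f"{' & '.join(cut_set):20s} {probability:.1e} {share:6.1%}")
```

Because the ratings are orders of magnitude, a single rating-3 event dominates any number of rating-4 events, so the ranking is usually insensitive to small changes in the inputs.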

In order to gain a better understanding of the seven steps required to complete the probabilistic risk assessment, we'll turn our attention back to the three subsystem fault tree 'branches' examined earlier and then apply the defined series of activities on each one.

Risk ratings and fault categories for the negative power balance fault tree branch are shown in Table 1a. One can see that in almost all instances the lowest level events are either active mechanical or electrical failure modes; therefore, risk ratings are either 3 or 4. The next step is to list any and all fault reduction or elimination/retirement actions already taken. This way, when fault categorization is done, the assessment takes into consideration as many factors as possible. The failure modes that could potentially result in a level IV catastrophic fault are the three active mechanical components in the solar array fault scenario and the battery charge/discharge failure. These four failure modes are defined as the "minimal cut sets". Table 1b shows the relative quantitative importance of each of these cut sets. From these percentage levels it is obvious that the three active mechanical failure modes in the solar array fault scenario are the ones that merit the most attention.
Risk ratings and fault categories for the propulsion subsystem fault tree branch are shown in Table 2a. One can see that active mechanical failure modes are contrasted by passive failure modes. Therefore, most risk ratings are either 3 or 6. The next step is to list any and all fault reduction or elimination/retirement actions already taken. The failure modes that could potentially result in a level IV catastrophic fault are four of the five failure modes in the propulsion tank fault scenario and the structural integrity failure mode of the plumbing fault scenario. These five failure modes are defined as the "minimal cut sets". Table 2b shows the relative quantitative importance of each of these cut sets. From these percentage levels it is obvious that the two active mechanical failure modes, fill/drain valve and fill/vent valve, in the propulsion tank fault scenario are the ones that merit the most attention.

Risk ratings and fault categories for the attitude determination and control fault tree branch are shown in Table 3a. One can see that most of the lowest level events are either active mechanical or electrical failure modes. Therefore, most risk ratings are either 3 or 4. The next step [...] associated with a command build or interpretation failure mode. These two failure modes are defined as the "minimal cut sets". Table 3b shows the relative quantitative importance of each of these cut sets. From these percentage levels it is the command build failure that merits the most attention.

CURRENT ASSESSMENT
The goal at the conclusion of the project formulation phase was to have a preliminary version of the CloudSat system-level fault tree and the relativistic probabilistic risk assessment prepared. This would enable the team to identify residual risks and to reduce or eliminate/retire them through changes in design. After meeting the goal and completing an initial assessment, the following conclusions were drawn. Firstly, most failure mode risk ratings were determined to be either "3" or "4" (one-in-one-thousand or one-in-ten-thousand probability of occurrence), indicating that most faults are potential failures of active mechanical or electrical components.
Secondly, most spacecraft failure modes fell into the level II category (a fault reducing overall system capacity, but not impacting on-line performance) due to the extensive use of physical and functional redundancy.
Thirdly, based on minimal cut set levels of importance the focus should be on:
- Single-string components
- Components with little to no flight heritage
- (Human) error-prone processes (e.g. command generation)
Fortunately, all three of these items were already identified as focus areas for the project team. Therefore, the added value to date from completing the preliminary version of the CloudSat system-level fault tree and the relativistic probabilistic risk assessment is confirmation that the project team is focusing attention in the proper areas.
Finally, there had been some discussion about the possibility of all NASA flight projects being required to use a standard software application or tool suite in preparing a fault tree and conducting a probabilistic risk assessment. In addition, the CloudSat team was also interested in knowing whether or not the streamlined process was valid or completely orthogonal to more traditional methods. After being informed about one such tool suite being considered, SAPHIRE (Systems Analysis Programs for Hands-On Integrated Reliability Evaluations), a request was made to obtain user manuals and demonstration software, so that a test case could be run and compared with the results of the CloudSat process. The negative power balance 'branch' was selected. After carefully inputting the fault tree and the probabilities of occurrence for each of the low-level failure modes into the software application, the output showed that the results were very similar. For example, the importance levels of the minimal cut sets with the highest levels of importance were demonstrated to be within 3 percentage points of those resulting from the CloudSat process. The reason for this difference is clear. The SAPHIRE tool takes into consideration all of the failure modes identified in the fault tree, while the streamlined CloudSat process only considers those failure modes that could potentially result in a level IV catastrophic failure.
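The effect of considering all failure modes rather than only the level IV ones can be illustrated with a short sketch. Every value below is hypothetical, and the sketch makes the simplifying assumption that each failure mode is its own single-event cut set; it does not reproduce SAPHIRE's actual algorithm or the CloudSat tables.

```python
from typing import Dict, List, Tuple

# (name, risk rating, fault category level); all values are hypothetical.
FAILURE_MODES: List[Tuple[str, int, int]] = [
    ("mode_a", 3, 4),
    ("mode_b", 3, 4),
    ("mode_c", 3, 4),
    ("mode_d", 4, 4),
    ("mode_e", 4, 2),
    ("mode_f", 4, 2),
    ("mode_g", 4, 3),
]


def importance_percentages(
        modes: List[Tuple[str, int, int]]) -> Dict[str, float]:
    """Share of the summed probability carried by each failure mode."""
    probabilities = {name: 10.0 ** -rating for name, rating, _ in modes}
    total = sum(probabilities.values())
    return {name: 100.0 * p / total for name, p in probabilities.items()}


all_modes = importance_percentages(FAILURE_MODES)
level_iv = importance_percentages([m for m in FAILURE_MODES if m[2] == 4])

for name, streamlined in level_iv.items():
    shift = streamlined - all_modes[name]
    print(f"{name}: level IV only {streamlined:.1f}%, "
          f"all modes {all_modes[name]:.1f}%, shift {shift:.1f} points")
```

Including the lower-category failure modes enlarges the sum in the denominator, so each level IV cut set carries a slightly smaller share, while the ranking of the cut sets is unchanged.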

FUTURE WORK
The commitment through the project implementation phase is to construct the final CloudSat system-level fault tree and complete the final probabilistic risk assessment. Links to bottom-up analyses such as FMECAs (Failure Modes, Effects, and Criticality Analyses) and WCAs (Worst-Case Analyses) will also be demonstrated to ensure that an accurate and valid investigation was completed. The team would still attempt to identify residual risks, but the

CONCLUSIONS
The CloudSat Project has taken to heart the NASA directive to augment the traditional risk management approach with fault tree analysis and probabilistic risk assessment. In order to comply with current budget and schedule constraints and still respond positively to the directive, a streamlined approach has been developed that will yield value-added results throughout the development life cycle. Initially, this will be used to assist with design improvements and later to assist with contingency planning.
To date, the results of the preliminary CloudSat Project fault tree analysis and probabilistic risk assessment have confirmed that the team is concentrating limited risk mitigation resources in the proper areas. However, final versions of each are to be prepared and made available at the critical design review, and future "as needed" revisions will be generated through the remainder of the development life cycle. Analyses will be made periodically, and these may or may not result in the same assessment.
It is hoped that this streamlined approach will encourage other NASA flight projects to look at both fault tree analysis and probabilistic risk analysis as additional tools or methods to achieve mission success.