Fault tolerant system design in the concept exploration stage of a mission critical computing system

Abstract-As the DoD enters a new era of weapon systems procurement it faces the critical question of how to manage and procure dependable and cost effective, mission critical computing systems. Program offices must, within the extreme time pressures of modern weapon systems development, apply fault tolerant systems design principles early in the design cycle before major resources are committed to a particular systems architecture.
This paper describes the dependability evaluations and trade studies that should be done in the concept exploration stage of a mission critical computing system. These studies are considered from the point of view of a government program office in charge of the RFP requirements and Statement Of Work (SOW) design review and milestone deliverables. This paper will describe a dependability paradigm that involves the interplay between the analysis of field failure data, analytic and functional modeling and fault injection experiments. The paper will then outline the dependability requirements and evaluation criteria for the System Requirements Review (SRR) and System Design Review (SDR) that flow from this paradigm.

BACKGROUND
The DoD is entering a radically new era of weapon systems procurement. The Cold War is over and new scenarios and radical force structure changes are altering the requirements of the next generation of weapon systems. Also, budgets are decreasing, development cycles are being stretched out, prototypes and demonstrations are being emphasized, open systems standards and Commercial Off-The-Shelf (COTS) parts will dominate, and the next generation of systems will be intensively modeled and simulated before any hardware is built. Future complex weapon systems will increasingly rely on digital systems, and the dependability of these digital systems will play a critical role in the effectiveness of those systems in the field. Hard fault rates are decreasing for both military and commercial digital parts, but transient fault rates are increasing. The next generation of systems needs to automatically handle transient faults and notify the operator only when recovering from hard faults, yet record all fault activity to ensure quick diagnosis of failing parts.
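The fault-handling policy described above can be sketched in a few lines. This is a minimal illustration, not any fielded system's design; all class, field, and unit names below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FaultHandler:
    """Illustrative policy sketch: transients are recovered automatically and
    silently, the operator is notified only for hard faults, and every fault
    event is recorded to speed diagnosis of failing parts."""
    fault_log: list = field(default_factory=list)
    operator_alerts: list = field(default_factory=list)

    def report(self, unit: str, kind: str) -> str:
        self.fault_log.append((unit, kind))    # all fault activity is recorded
        if kind == "transient":
            return "retry"                     # automatic recovery, no alert
        self.operator_alerts.append(unit)      # hard fault: notify the operator
        return "switch_to_spare"

handler = FaultHandler()
r_transient = handler.report("bus_A", "transient")
r_hard = handler.report("cpu_2", "hard")
```

The point of the sketch is the asymmetry: both fault kinds land in the log, but only the hard fault reaches the operator.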
A dependable computing system is one that reliably delivers its expected and specified service. The two methods of achieving dependability are fault avoidance and fault tolerance.
Fault avoidance [1] is achieved by adequate testing at each stage of a system's development. Fault tolerance [2] is a designed-in attribute that enables an operational system to deliver its expected service even when faults are manifested in the system.
The Rand Corporation in 1988-89 studied the military's process of acquiring weapon systems and concluded: (1) there was no meaningful set of management measures for Reliability and Maintainability (R&M), (2) there was no adequate process for setting rationally based R&M goals for future systems, and (3) there was no strong assurance that needed levels of R&M would be delivered in the systems then under development. The Rand report concluded that R&M has many dimensions (it is not an easily quantified attribute) and that the elusive quality of R&M will grow more elusive as system complexity grows. [3,4] In response to the Rand study the Navy, in coordination with the Air Force, set up a number of programs and working groups to clarify the management milestones and metrics for the design of dependable mission computer systems. Two key efforts were the Advanced Avionics Subsystem Technology (AAST) Fault Tolerant Demonstration and the tri-service Dependability Working Group (DWG).
The AAST Fault Tolerant Demonstration set out to clarify the Navy's fault tolerant avionics specifications, validation methods and acceptance tests. The program demonstrated clear and precise fault tolerant requirements for specifications and SOW validation metrics and acceptance tests, from initial concept exploration to the brassboard Demonstration and Validation of the system (see Figure 1), as the government oversees the design and procurement of a modern, complex, computer based weapon system. The DWG was composed of leading researchers in the fault tolerant community and leading industry developers of fault tolerant computer systems who, under the leadership of the DoD, addressed the topic of the dependability validation of mission critical computer systems. The goal of the DWG was to coordinate industry and the research community to come to a consensus on: a) the specifications of the key dependability factors in computer based weapon systems, and b) the necessary and sufficient dependability validation criteria for these systems. This paper is based on the findings and documentation resulting from these efforts and their initial application in selected Navy and Air Force platforms.
[5] By Milestone I most of the decisions that will determine a system's life cycle costs have been made (see the top graph of Figure 1), while less than 5% of the Life Cycle Cost (LCC) has been spent (the lower graph of Figure 2). Thus the cycle of evaluations, candidate designs, analysis and trade studies done in the concept exploration phase of a mission critical computer system is the critical phase in the system's life cycle cost-benefit analysis.
For a program office to successfully oversee the development of a fault tolerant, mission critical, computer system it must specify adequate dependability and fault tolerant metrics (SOW exit criteria) and follow a rigorous and thorough fault tolerant design methodology that serves as the basis of the government and contractor interactions throughout the system's development. The government does not design the mission critical computer system or directly supervise the ongoing work. The government's role is to specify and validate the work. The SOW contains the government's specifications and the validation testing that will be performed. The system specification requirements and the SOW will call for specific information as deliverables in reports and demonstrations throughout the design and development process. The design review meetings will focus on this deliverable information to ensure that the contractors have considered all relevant trade offs for dependability and fault tolerance. The courts have generally held any ambiguities in the SOW against the government. Thus, for the government to ensure a dependable computing system it is essential to have accurate dependability specifications and validation methods, and for the government and the contractor to have a clear, agreed upon model of the system design methodology that will be followed.
A mission critical computer system development methodology is a collection of methods (the methods being some disciplined process for generating the models of the system) that apply a unified approach across the development life cycle of the system. This collection of methods serves a number of purposes: (1) it instills an ordered discipline into the development process that ensures that each element of the problem will be incrementally addressed, (2) it provides a common language for all members of the development team, and (3) it provides management with clear milestones to measure the progress of the system's development. [6] Failure to manage the complex development of a mission critical computer system with some disciplined approach is the cause of expensive cost overruns, of patching and repairing a system late into its life cycle, and of the eventual failure of the fielded system to fully perform up to its specified performance.
The design decisions of the concept exploration stage progress from a definition of the algorithms, capabilities and services of the system, to exploring initial requirements and architectures, to selecting candidates and testing them with a variety of analysis tools, and end with the selection of a particular architecture and system design approach. The rationale for these choices should be provided in the deliverables of the SDR.

A DEPENDABILITY PARADIGM
"A model is an abstraction of something for the purpose of understanding it before building it." [7] It omits nonessential details and is thus easier to manipulate than a complete system. A model extracts certain aspects of the system for closer examination. The first models of the new system are either analytic, mathematical models, or functional, behavioral models. In analytic modeling the first models of the new system are based on observations of past similar systems and components. From this measured data of past similar systems the developer formulates a theory that aims at explaining the behavior of the current, proposed system, and attempts to forecast that future system's behavior. This theory is the basis of the analytic model. The model is a check on the theory in practice. Laboratory fault injection experiments on the early prototypes or behavioral models of the system will further validate and refine the analytic model. The analytic model will also reveal which system parameters are most sensitive to change and need closer laboratory examination, fault injection experiments or field measurements. Figure 3 shows this overall flow and feedback from field data, to analytic model, to fault injection experiments.
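The field-data-to-analytic-model step above can be illustrated with the simplest such model: estimating a constant failure rate from measured data on a past, similar component and forecasting the proposed system's mission reliability. The fleet numbers and function names below are hypothetical, and the exponential (constant failure rate) form is only an assumed model.

```python
import math

# Hypothetical field measurements from a past, similar component:
# total fleet operating hours and observed failures (illustrative numbers).
fleet_hours = 50_000.0
observed_failures = 4

lam = observed_failures / fleet_hours    # estimated failure rate (per hour)
mtbf = 1.0 / lam                         # mean time between failures

def forecast_reliability(mission_hours: float) -> float:
    """Forecast the probability the proposed system survives a mission,
    under the exponential (constant failure rate) modeling assumption."""
    return math.exp(-lam * mission_hours)

r_mission = forecast_reliability(10.0)
```

Laboratory fault injection and later field measurements then serve as the check on whether the constant-rate theory holds for the new system.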
The dependability examination of this transforming stream of simulations is fault injection. Fault injection is available at each stage of the system simulations and into the hardware breadboard and brassboard stages.
Due to the vast number of possible failure modes of a system, fault injection can never completely validate a dependable design.
The system behavior modeling elements can be combinatorial or noncombinatorial. The combinatorial elements are reliability graphs, which predict the hard failure rate of the current system from the statistical past failure rates of similar components, and fault trees, trees of conditions that lead to certain system failures. Fault trees describe the hierarchical flow of faults that will trigger certain critical system faults. Fault trees come later in the design process as they are heavily dependent on a detailed knowledge of the system. The noncombinatorial modeling elements are Markov Chains and Petri Nets. A Markov Chain is a graph of failure states and transition probabilities that determine the system's failure rates, repair rates, and costs of repair and parts. Petri Nets, directed graphs of place and transition nodes, are useful when the limits of Markov Chains are reached. Petri Nets can be used to examine inherent concurrencies such as coincident faults (a second fault arriving while the system is recovering from the first fault). Monte Carlo simulations are useful as more system reality is put into the model and the analytic solutions become intractable. This cycle of field data to analytic modeling to functional simulation and fault injection experiments is the necessary dependability validation paradigm that forms a framework for the information to be provided to the government in the SRR, SDR and subsequent design reviews.
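As a concrete sketch of the noncombinatorial style, the fragment below steps a tiny Markov Chain for a duplex processor pair with imperfect fault coverage. The rates, coverage value, and time step are hypothetical, chosen only to show the mechanics of moving probability mass between failure states.

```python
# Markov Chain sketch (hypothetical rates): a duplex processor pair.
# States: 2 = both units up, 1 = one unit up (simplex), 0 = system failed.
# Each unit fails at rate lam per hour; a fault on the duplex is recovered
# with probability c (coverage); an uncovered fault fails the system.
lam, c = 1e-3, 0.95
dt, hours = 0.01, 10.0

p = {2: 1.0, 1: 0.0, 0: 0.0}          # initial state probabilities
for _ in range(int(hours / dt)):
    f2 = 2 * lam * dt * p[2]          # mass leaving state 2 this step
    f1 = lam * dt * p[1]              # mass leaving state 1 (second unit fails)
    p[2] -= f2
    p[1] += c * f2 - f1               # covered faults degrade to simplex
    p[0] += (1 - c) * f2 + f1         # uncovered or double faults fail

unreliability = p[0]                  # mission unreliability estimate
```

Even this toy model exposes the qualitative point that matters at concept exploration: with realistic rates, the uncovered-fault term dominates the mission unreliability.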

SRR (SYSTEM REQUIREMENTS REVIEW)
The object of the SRR is to specify the initial computing and dependability models of the system. A statement of the need, and of the desired result that will satisfy the need, starts the process. This initial specification should include a clear description of the services, algorithms, system responsibilities, and constraints of the system. The application of the system drives these requirements. The contractor derives a performance model from the specifications and begins the analysis of the specification in light of the dependability requirements of the system. The system performance model and dependability models are developed and analyzed in parallel with constant feedback between them (Figure 5, taken from [9]). These models will be refined in subsequent phases and are used to make the architectural trade offs. Figure 5 shows a global view of this modeling and feedback between the initial performance and dependability models of the system. (The application is from mission avionics, where the key scenario/specification system drivers are: mission phases, target tracking, engagement of targets and planning for the next targets.) The design of a dependable mission critical computer system begins with a mission scenario and the associated sensor processing requirements (the top bubble in Figure 5). Added to these requirements would be the security and fault tolerant system requirements.
From a coarse workload description, a coarse performance model and an initial architecture description are derived; an interaction between the coarse performance and dependability models then begins and continues until the initial performance/architecture and dependability requirements are met.
This combined modeling of performance and dependability continues throughout the design cycle. Even at the earliest stages of a system's development, reliability modeling can be used to analyze parametric sensitivities in order to determine which factors have the strongest impact on the system's reliability. [10] The range of possible dependability parameters is wide and a reliable validation of them is an open question. To ensure that the delivered system is dependable and fault tolerant the government must be an informed customer, active in the evaluation of the current models of the design, and not merely a passive customer using broad and vague language to describe the system's requirements.
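A minimal example of such a parametric sensitivity check is sketched below, using a first-order approximation for the mission unreliability of a duplex system. All numbers are hypothetical, and the formula is a toy approximation rather than any program's actual model: the first term is uncovered single faults, the second near-coincident double faults.

```python
# Sensitivity sketch (hypothetical figures): mission unreliability of a
# duplex system with per-unit failure rate lam (per hour), fault coverage c,
# and mission time T hours.
def unreliability(lam: float, c: float, T: float) -> float:
    return (1.0 - c) * 2.0 * lam * T + (lam * T) ** 2

baseline   = unreliability(1e-4, 0.990, 10.0)
better_cov = unreliability(1e-4, 0.999, 10.0)  # tenfold fewer uncovered faults
better_lam = unreliability(5e-5, 0.990, 10.0)  # components twice as reliable
```

In this toy comparison, improving fault coverage lowers mission unreliability more than halving the component failure rate does; surfacing exactly this kind of dominance is what early sensitivity analysis is for.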
Typical dependability requirements for the system might include: mission life, reliability, readiness or availability goals, identification of critical information to be protected, recovery times, requirements for testing, maintenance and checkout, minimum performance levels, safety goals, and subsystem fault containment regions. The modes of service should also be specified: full, reduced, degraded, emergency, and safe shut down. And a generic set of faults should be identified according to timing (how often), duration (how long), and extent of damage [11, 12], and some indication of how the final design will be validated against these faults should be given. Choosing a reasonable set of trade-off criteria and fault tolerant metrics is critical to ensuring that the evolving system is capable of being verified analytically and experimentally at each stage of the design process.
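Even at this early stage, the generic fault set called for above can be captured as a simple table. The entries and classification values below are hypothetical placeholders, shown only to illustrate the timing/duration/extent classification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FaultClass:
    """One entry of a generic fault set, classified along the three axes the
    requirements call out: timing (how often), duration (how long), and the
    extent of the damage. All values here are illustrative placeholders."""
    name: str
    timing: str     # expected arrival rate class
    duration: str   # "transient", "intermittent", or "permanent"
    extent: str     # fault containment region affected

fault_set = [
    FaultClass("SEU in memory word", "frequent", "transient",    "single word"),
    FaultClass("stuck-at bus line",  "rare",     "permanent",    "bus segment"),
    FaultClass("flaky connector",    "sporadic", "intermittent", "one module"),
]

transients = [f.name for f in fault_set if f.duration == "transient"]
```

A table like this gives the validation plan something concrete to test against: each row implies an injection experiment and an expected containment boundary.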
The SRR lays the groundwork for establishing a chain of traceability for the contractor's design decisions in subsequent reviews. The RFP/SOW packages should specify a careful series of checks on the design development that track the contractor's progress and validate that the resulting system will perform its specified services within its functional bounds (specified fault environment). The goal of this stage is to begin exploring the design implications of the high level dependability requirements, which distinguish between broad classes of design. As the dependability models mature they will produce a finer discrimination between designs.
The qualitative analysis will become more thorough and the quantitative analysis more exact.
In addition to the exploratory work of this phase a preliminary implementation plan should be specified. The reliability aspects of this plan should include: Computer Aided Engineering (CAE) and Computer Aided Design (CAD) tools and reliability prediction tools that will be used.

DELIVERABLES of SRR
(These deliverables are a partial list, with a focus on the dependability information.) The government, in this phase, will assess the contractor's knowledge of the problems being addressed, their knowledge of the algorithms needed and their ability to use this knowledge to make the trade offs and justify the architecture that will be chosen in the second phase of the conceptual exploration.
A list of deliverables, with a special focus on dependability, follows:

Dependability Report
The focus of this report is the dependability features of the proposed system. It should contain the dependability requirements: mission life, reliability, critical state, recovery time, testability, safety goals, availability, and modes of service.
The dependability requirements should flow down from the mission of the system to justify the particular dependable design strategies that are being considered. There should also be a list of the hardware and software fault types that the system will be designed to protect against. This report should also contain the rationale for the trade offs and criteria for selecting the fault handling techniques. This methodology will be used in the Phase II trade off studies and selection of the architecture and design approach.

Implementation Strategy Report
This report outlines the management plan for the selection and implementation of the selected architecture.

Review and Evaluation Criteria at the System Requirements Review
The questioning at the design review should probe the work done so far with questions like: Are the specified computer requirements consistent with the mission requirements?
Does the contractor have the adequate experience base, tools, and methodology to select the best architecture and implement it?
Is the fault set complete and accurate, and is the methodology for testing against it adequate? Is the computational model complete and accurate?
Are the candidate architectures of a high quality and do they have serious potential to meet the objectives? Is the management plan complete, including the plans to acquire the tools and develop the needed technology?

SDR (SYSTEM DESIGN REVIEW)
The objective of the SDR is to conduct trade off studies on the alternative architectures and select the architecture that best meets the requirements. The dependability modeling at the SDR stage is exploring the high level dependability requirements. Thus for each potential architecture a preliminary functional design and preliminary dependability modeling should be used to evaluate the architecture against the system dependability requirements.
The selected architecture should have a fuller evaluation and justification, including a block diagram partition and allocation of the computing functions, identification of the fault types, containment approaches, recovery mechanisms, redundant sparing needed, assumptions made in the analytic modeling, and executive strategies for recovery of critical state information.
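One piece of that evaluation, the redundant sparing needed, can be estimated with elementary combinatorics. The sketch below uses a hypothetical requirement and module reliability, and treats spares as powered (hot) so that they fail at the same rate as active modules.

```python
from math import comb

def mission_success(r: float, needed: int = 4, spares: int = 2) -> float:
    """Probability that at least `needed` of `needed + spares` identical
    modules survive the mission, each independently with reliability r.
    Spares are assumed powered (hot), failing at the same rate."""
    n = needed + spares
    return sum(comb(n, k) * r**k * (1.0 - r)**(n - k)
               for k in range(needed, n + 1))

no_spares = mission_success(0.99, spares=0)   # bare 4-module partition
two_spares = mission_success(0.99, spares=2)  # same partition with 2 spares
```

Running the numbers for several sparing levels, and comparing the reliability gain against weight, power, and cost, is exactly the kind of trade study the SDR deliverables should document.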
Validation is the demonstration that a given system meets its specification. This validation in the concept exploration consists of the quantitative and qualitative evaluation of the selected candidate designs.
The quantitative evaluation of a design is done by using reliability prediction tools (HARP, SURE, etc.) and fault injection. Fault injection and reliability prediction research is of vital concern to the government. Currently there is no agreed upon way to validate a complex computing system.
The FAA uses a combination of Failure Modes and Effects Analysis (FMEA), fault tree analysis and pin level fault injection on the actual hardware.
[13] Research is ongoing with a wide variety of fault injection methods and levels of injection: hardware pin faults and gate level injection [14,15], hardware pin faults with gates at the Register Transfer (RT) level [16,17,18], hardware pin faults at the bus RT level [19], simulation at the opcode RT level [20], simulation and fault injection at the gate and chip level [21], fault emulation at the memory word and RT level [22], fault emulation at the RT level [23], memory corruption at the RT level [24], and ion radiation at the device level [25].
At the concept exploration stage the architectures are defined at a functional level. Fault injection in the concept exploration would better be characterized as failure injection in the functional descriptions of the architectures being considered. Simulations at this stage are experiments with high level architectural descriptions of the proposed system. This description is a compromise between the full, complex description of the system and the fault handling features that are the focus of the simulation.
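A failure-injection run at this functional level can be as simple as the sketch below: the "architecture" is an abstract pipeline with a voted triplex compute stage, and a failure is injected by naming the functional element that misbehaves. All stage names are hypothetical.

```python
# Functional-level failure injection sketch (all stage names hypothetical):
# a sensor stage feeds a voted triplex compute stage feeding a display stage.
def run_mission(failed_stage=None):
    """Return True if the functional pipeline delivers its service
    with the named stage injected as failed."""
    sensors_ok = failed_stage != "sensor"
    # triplex compute stage: majority vote masks any single channel failure
    channels = [failed_stage != f"compute_{i}" for i in range(3)]
    compute_ok = sum(channels) >= 2
    display_ok = failed_stage != "display"
    return sensors_ok and compute_ok and display_ok

fault_free = run_mission()
masked = run_mission("compute_1")     # single channel failure is voted out
unmasked = run_mission("sensor")      # simplex stage failure loses the service
```

Even a description this coarse separates architectures: any candidate that leaves the sensor stage simplex fails this injected case, regardless of how well its compute stage is protected.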
The goal of the qualitative evaluation of the fault tolerant design in the concept exploration and subsequent stages of the design is to evaluate the completeness of the design and its potential to meet the stated goals. This qualitative evaluation comprises a series of probing questions at design reviews or working group meetings before the formal design review. [26] This type of examination of the system requires expert knowledge. The reviewers should question such things as: the fault assumptions, man-machine interaction faults, design faults, the occurrence of a fault while the system is still in the process of recovering from the first fault, and faults in the fault tolerance functions themselves.
The Air Force (using the services of The Aerospace Corporation, a Federally Funded Research and Development Center (FFRDC)) has incorporated a successful "expert team" approach to the qualitative evaluation of proposed fault tolerant designs for a number of critical satellite projects and in the SDIO BM/C3 project. [27] This expert team was comprised of leaders in the field of fault tolerance whom the government asked to meet with the contractor and evaluate the design in informal working group meetings. These experts in the field, drawn from university researchers, technical experts from FFRDCs and industry, provide the necessary team synergy to give the government an excellent qualitative evaluation of the proposed design. Their comments and criticisms are aimed at improving the design and ensuring that the future government owned mission critical computer system is state of the art and not simply the product of the contractor's in-house efforts. This expert knowledge base of the team can be applied at every stage of the development but is especially important in the conceptual exploration down-select of the final design approach and architecture.

The report delivered at the SDR presents the rationale behind the architecture chosen. It lists the selection criteria, describes the alternative architectures and the evaluation of those architectures using the analysis and reliability tools, and gives a definition and evaluation of the architecture chosen.
The fault tolerant features of this full definition of the selected architecture should include: (1) a block diagram of the architecture with the allocation of applications processing to partitions and the allocation of redundant sparing; (2) error detection mechanisms and error containment strategies used in each partition; (3) a description of the error and fault recovery mechanisms and estimates of their effectiveness; (4) the software executive strategy for preserving critical information during fault recovery; (5) design features to support initial testing and periodic diagnosis and repair.

CONCLUSION
This paper presented a description of a dependability paradigm and its ramifications in the concept exploration stage of a mission critical computing system. The purpose of the paper was not to give an exhaustive list of the government evaluations of a mission critical computer system in the concept exploration stage, but an outline of those evaluations when a paradigm similar to the one presented in this paper is the agreed upon design and validation process of the system. The paradigm described is one of constant checks between field data, analytic modeling and experimental fault injection on the functional system (simulation to brassboard). For successful oversight of the development of a fault tolerant, mission critical computer system, the government must specify adequate dependability and fault tolerant metrics and, with the contractor, follow a rigorous and thorough fault tolerant design methodology that serves as the basis of the government and contractor interactions throughout the system's development. The government does not design the mission critical computer system or give direct supervision of the ongoing work.
The government's role is to specify and validate the work. But this specification and validation should be done as informed customers, active in the evaluation of the current models of the system's design, and as cooperative members of the integrated team that is developing the complex weapon systems of the future.