Reliability predictions - continued reliance on a misleading approach

Reliability prediction methodologies, especially those centered on Military Handbook (MIL-HDBK) 217 and its progeny, are highly controversial in their application. Reliability predictions have been used in the design and operation of military applications since the 1950's. Various textbooks, articles, and workshops have provided insight into the pros and cons of these prediction methodologies. Recent research shows that these methods have produced highly inaccurate results when compared to actual test data for a number of military programs. These inaccuracies promote poor programmatic and design decisions, and often lead to reliability problems later in development. Major reasons for handbook prediction inaccuracies include, but are not limited to: 1) The handbook database cannot keep pace with the rapid advances in the electronics industry. 2) Only a small portion of the overall system failure rate is addressed. 3) Prediction methodologies rely solely on simple heuristics rather than considering sound engineering design principles. Rather than rely on inaccurate handbook methodologies, a reliability assessment methodology is recommended. The reliability assessment methodology draws on reliability data from comparable systems, historical test data, and subject-matter-expert input. System developers then apply fault-tree analysis (or similar analyses) to identify weaknesses in the system design. The elements of the fault tree are assessed against well-defined criteria to determine where additional testing and design for reliability efforts are needed. This assessment methodology becomes a tool for reliability engineers, and ultimately program managers, to manage the risk of their reliability program early in the design phase when information is limited.


INTRODUCTION
The use of reliability predictions in military applications produces misleading and inaccurate results [1]. The National Academy of Sciences, along with lessons learned from the US Department of Defense (DoD) over the past decade, suggests several reasons why military systems fail to achieve their reliability requirements. These reasons include a "reliance on predictions instead of conducting engineering design analysis [2]." Reliability predictions represent a single "number" that attempts to describe a complex system through the estimation of its failure rate. Although predictions can be a valuable tool in the design process, they are often improperly developed, misreported, and/or misinterpreted. A main reason for this problem is the use of MIL-HDBK 217 and associated methods.
These methods include any handbooks or commercial applications based on MIL-HDBK-217 (e.g. Telcordia/Bellcore, HRD, PRISM, 217Plus, etc). MIL-HDBK-217 uses historical data of electronic systems to determine a constant failure rate of electronic parts. The associated part prediction is a function of a generic failure rate and a series of adjustment factors. The final system-level prediction assumes a series structure and is a summation of the individual electronic parts.
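The structure of a handbook-style prediction described above can be sketched in a few lines of code. This is a minimal illustration only: the part names, generic failure rates, and adjustment-factor values are invented for the example and are not taken from MIL-HDBK-217 or any of its derivatives.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Part:
    name: str
    generic_failure_rate: float   # failures per 1e6 hours (hypothetical value)
    adjustment_factors: tuple     # multiplicative factors (hypothetical values)


def part_failure_rate(part):
    """Part prediction: generic failure rate scaled by its adjustment factors."""
    return part.generic_failure_rate * prod(part.adjustment_factors)


def system_failure_rate(parts):
    """Series assumption: the system rate is the sum of the part rates."""
    return sum(part_failure_rate(p) for p in parts)


def system_mtbf_hours(parts):
    """With a constant failure rate, MTBF is the reciprocal of the rate."""
    return 1e6 / system_failure_rate(parts)


# Hypothetical bill of materials, for illustration only.
parts = [
    Part("resistor", 0.002, (1.5, 3.0)),
    Part("capacitor", 0.004, (2.0, 3.0)),
    Part("microcircuit", 0.050, (1.2, 3.0, 2.0)),
]

print(f"system failure rate: {system_failure_rate(parts):.3f} per 1e6 h")
print(f"system MTBF: {system_mtbf_hours(parts):.0f} h")
```

Note that nothing in this calculation depends on how the parts are mounted, loaded, or used together; the only inputs are the part list and the tabulated factors.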
Because of the technical limitations associated with the prediction documents, as discussed in this paper, the handbook results have no connection to real product reliability and can, in fact, promote poor reliability practices and reliability decisions. This paper discusses the limitations of the MIL-HDBK-217 methodology, its continued misuse in military applications, and an alternative method for assessing reliability early in system development that provides more valuable insight to both the system developer and customer.

PREDICTION HISTORY
Reliability prediction approaches started soon after World War II with the formation of several ad hoc reliability groups. The desire of these groups was to standardize requirements and improve the reliability of increasingly complex electronic components. The original version of MIL-HDBK-217 was published in April 1962 by the US Navy. The first revision, MIL-HDBK-217A, occurred in December 1965.
MIL-HDBK-217A became the standard for reliability predictions. The main reason for its ascension was that it was often cited in contractual documents [3].
In 1974 the responsibility for preparing MIL-HDBK-217 was transferred to RADC, under the preparing activity of the US Air Force. They published Revision B and addressed rapidly changing technology. They also incorporated overly simplified versions of the RCA models, which are still in the handbook nearly 40 years later [1]. As electronics grew more complex, MIL-HDBK-217B received several changes, eventually leading to MIL-HDBK-217 revision C. The 1980's brought about revisions D and E of MIL-HDBK-217, which attempted to keep pace with the changes in technology. The 1980's also brought several reliability prediction models unique to select industries. Examples include those of the Society of Automotive Engineers Reliability Standards Committee and Bell Communications Research (now Telcordia). These industries, along with others, based their prediction techniques on the MIL-HDBK-217 models.
In December 1991, RADC (now renamed Rome Laboratory) released MIL-HDBK-217 revision F. In 1994, the former US Secretary of Defense, Dr. William J. Perry, announced the reduction of reliance on military specifications and standards and encouraged the development of commercial standards that could be used by the military in his memorandum, "Specifications & Standards - A New Way of Doing Business". In 1995 the redistribution of MIL-HDBK-217F contained the following notice, "This handbook is for guidance only. This handbook shall not be cited as a requirement. If it is, the contractor does not have to comply." The following year the Assistant Secretary of the Army for Research, Development and Acquisition, Gilbert F. Decker, declared that MIL-HDBK-217 was not to appear in any Army request for proposal acquisition requirements [4].
Since 1995 there has been no update to MIL-HDBK-217. However, there have been efforts by an industry working group to update the standard. The working group, which was led by the Naval Surface Warfare Center (NSWC) at Crane, IN and consisted of government and private industry personnel, developed a three-phase plan for revisions. All three phases were planned to be completed by December 2011 [5]. However, difficulty acquiring appropriate data and differences of opinion on the methodologies to incorporate have led to significant delays, with no revisions published.

TECHNICAL LIMITATIONS
Reliability predictions can be useful when determining early reliability allocations or forecasting life-cycle costs. However, the technical limitations of MIL-HDBK-217 methodologies misrepresent a system's true reliability metric (e.g. reliability, mean time between failure, mean miles between system abort, etc). Technical limitations of MIL-HDBK-217 have been a topic of debate since its development in the 1960's, with copious research examining its strengths and weaknesses.
Four major limitations of these methodologies that impact DoD system design and development are discussed in the following sections.

Keeping Handbooks Up-to-Date
MIL-HDBK-217 has not been updated since 1995. When a developer uses it for predicting a system's reliability today, over 15 years of technology is not included. Prior to 1995 there were only 6 major updates since its original release in 1962. During this time new devices were not covered for approximately five to eight years, penalizing system developers for utilizing new technology. Revisions also failed to update connector models for over 35 years. Handbook models also require historical field data. These data are acquired from a variety of sources, over different periods of time, and under various field conditions. No standard for verification or statistical control of these data exists. The handbooks do not supply information regarding any of these factors.
Given these limitations the handbook databases cannot keep pace with the rapid advances in electronics technology and products. Any plans to simply update the database and models would exclude any emerging technology.

System Failure Rate
Reliability estimates of MIL-HDBK-217 methodologies assume a constant failure rate.
However, electronic components' failure rates can vary depending on many factors, including the usage conditions and the remaining life of the component. Instead of treating the system or the component as a black box, a better understanding of how and why components fail can be obtained by studying the physics of failure [6]. For example, for power electronic modules and Insulated Gate Bipolar Transistors (IGBTs), wire bond failure and die attach failure have been found to be the two most dominant and critical failure mechanisms [7]. These mechanisms could induce failures in the package depending on the usage and loading conditions and thus cannot be represented by a constant failure rate.
However, the mechanisms and their associated time to failure can be characterized by well-established models and equations.
While power electronics are specifically addressed above, Pecht et al. [8] have discussed failure mechanisms found in other applications in the field. Similarly, based on the field returns, the manufacturers can identify the dominant failure mechanisms, identify the associated models and use them to estimate the lifetime of components being used in a particular application, under certain conditions.
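The difference between a constant failure rate and a rate that changes over a component's life can be shown with a short sketch. The Weibull distribution is used here purely as an illustrative assumption; the discussion above does not name a specific model, and the shape and scale values below are invented. A shape parameter of 1 reproduces the constant-rate (exponential) case, while a shape parameter greater than 1 gives the increasing hazard typical of wear-out.

```python
import math


def weibull_hazard(shape, scale, t):
    """Instantaneous failure rate h(t) of a Weibull life distribution, t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_reliability(shape, scale, t):
    """Probability of surviving to time t."""
    return math.exp(-((t / scale) ** shape))


# Hypothetical parameters: both cases share the same characteristic life.
scale_hours = 10_000.0
for t in (1_000.0, 5_000.0, 10_000.0):
    constant = weibull_hazard(1.0, scale_hours, t)   # handbook-style assumption
    wear_out = weibull_hazard(3.0, scale_hours, t)   # hazard grows with age
    print(f"t={t:>8.0f} h  constant={constant:.2e}  wear-out={wear_out:.2e}")
```

Under these assumed values the wear-out hazard is far below the constant rate early in life and well above it near the characteristic life, which is the behavior a single constant number cannot express.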
Even if MIL-HDBK-217 methodologies accurately depicted electronic parts failure rates, they would only account for a small portion of the overall system's failure rate, as depicted in Table 1. DoD systems follow the same trend. Figure 1 displays the chargeability (determined cause for a failure) for a DoD network and aviation system. Hardware failures account for only 7% and 47% of the overall system's failure rate, respectively. It should be cautioned that the hardware failures represent both mechanical and electronic failures and therefore the failure rate due to electronic components may be even smaller. System predictions should not only account for electronic components, but must also factor in failure models due to design, manufacturing, wear out, software, and external factors (crew/maintainers).

Critical Design Factors
Prediction methodologies do not consider sound engineering design principles.
For example, handbook predictions for a circuit card are not affected by how the device is mounted and supported, the natural frequency of the board, or where the largest deflections are located in relation to the components. They do not consider the impact of temperature cycling, humidity cycling, vibrations, and/or mechanical shock throughout the components' life-cycle. The life-cycle of a product consists of manufacturing, storage, handling, operating and non-operating conditions. The life-cycle loads, either individually or in various combinations, may lead to performance or physical degradation of the product [9]. Extensive research shows the effects of thermal aging and thermal cycling. This research demonstrates the need to account for multiple deployments with sequential thermal stresses and uncontrolled thermal environments [10]. Handbook methodologies overemphasize steady-state temperature and voltage as operational stresses and do not take into account any of these engineering design decisions. For example, the use of MIL-HDBK-217 methods led to poor design decisions on the F-22 advanced tactical fighter and the Comanche helicopter [11]. In both cases the designs indicated the need for significantly lower temperatures of the avionics components. The resulting temperature cycling created unique failure mechanisms that ultimately impacted both programs' cost and schedule.

Insight into How or Why a Failure Occurs
Practitioners use handbook predictions as a design tool. The pitfall of using predictions is that the methodology does not give insight into the actual causes of failure since the cause-effect relationships impacting reliability are not captured. Therefore, the developers cannot implement the appropriate corrective action or mitigation plan. Handbooks simply sum the failure rate from the total parts on a given component. An example of this can be seen by examining the vibration displacement for a circuit board. Although the components and their placement on the two circuit boards in Figure 2 are the same, the reliabilities are significantly different. In this example circuit board (a) is a four screw configuration versus circuit board (b), a six screw configuration. The difference in design (four screws versus six) impacts the vibration displacement and consequently impacts the reliability. The addition of two screws to the design significantly increased the circuit board's reliability. However, both designs would have the exact same reliability prediction using MIL-HDBK-217.

Figure 2 - Comparison of Vibration Displacement: (a) Not Reliable (4 Screws), (b) Reliable (6 Screws)
The placement of components is another crucial design consideration. Figure 3 Table 2 displays the results shown in that paper. These results were one of the first examples of how handbook predictions produce misleading results on DoD systems.
In another study, Jones and Hayes [13] compared circuit board field data from commercial electronics manufacturers to handbook predictions. They not only found a difference between the prediction and the field failure rate, but also found significant differences between handbook methodologies. Figure 4 shows the results discussed in the paper. Despite these results and the documented technical limitations of predictions, there are still several reports that support the use of the current handbook methodologies. Brown [14] used the Modular Airborne Radar program (a U.S. Air Force system) to compare field data from Plastic Encapsulated Microcircuits to two prediction tools (MIL-HDBK-217 and a commercial tool based on MIL-HDBK-217). Initial findings revealed the predictions were optimistic in comparison to the observed field performance. Further evaluation showed that modifying the default values of the model improved the accuracy of the prediction. She also noted that the use of experience data (field data) proved valuable in refining the prediction results. In addition, Smith and Womack [15] compared a commercial prediction tool (based on MIL-HDBK-217 methodologies) to the observed field failure rates of three military electronic units. The initial results showed the predictions were approximately one-half of the observed field failure rate. This was in contrast to an earlier study by TRW Automotive, which showed the predicted failure rates were approximately twice the actual field values. Just as in Brown's study, they found that experience data aided in refining their prediction estimates.
The US Army Materiel Systems Analysis Activity (AMSAA) recently surveyed various agencies throughout DoD requesting system level predictions and demonstrated results (either from testing or fielding). When compiling the data, only those systems whose predictions were solely developed using MIL-HDBK-217 or its progeny were examined. If the prediction was a combination of field data and predictions, it was excluded from the final analysis. Figure 5 displays the results of the survey.
In total the survey explored 15 systems. One missile system is excluded from Figure 5 for graphical purposes (the only system without a mean time between failure metric). These systems represent a variety of platforms to include communications devices, networks command and control, ground systems, missile launchers, air command and control, aviation warning, and aviation training systems.
The ratio of predictions to demonstrated values ranges from 1.2:1 to 218:1. This shows that original contractor predictions for DoD systems greatly exceed the demonstrated results. In addition, statistical analysis of the data using Spearman's Rank Order Correlation Coefficient shows that MIL-HDBK-217 based predictions cannot support comparisons between systems. These data demonstrate how inaccurate handbook-based reliability predictions are relative to demonstrated results. It should also be noted that these predictions could lead to improper programmatic decisions impacting reliability (minimizing growth testing, Design for Reliability (DfR) activities, etc).
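For readers unfamiliar with the rank correlation test mentioned above, the following sketch shows how Spearman's coefficient and the predicted-to-demonstrated ratios might be computed. The MTBF values are invented for illustration and are not the survey data; the function names are likewise hypothetical. A coefficient near +1 would mean the predictions at least rank the systems in the right order; a value near zero or negative means they do not.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical MTBF values in hours for five systems (not the survey data).
predicted = [5000.0, 1200.0, 800.0, 3000.0, 400.0]
demonstrated = [60.0, 300.0, 150.0, 25.0, 200.0]

ratios = [p / d for p, d in zip(predicted, demonstrated)]
print("predicted:demonstrated ratios:", [round(r, 1) for r in ratios])
print("Spearman rho:", round(spearman_rho(predicted, demonstrated), 3))
```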

Figure 5 - Comparison of Predicted Versus Demonstrated Values for DoD Systems
These results demonstrate the misuse of predictions in the DoD with the same consequences (unreliable systems with high operating and sustainment (O&S) costs) as documented in the DoD "Guide for Achieving Reliability, Availability, And Maintainability." This raises the question, "Despite its known technical inadequacies and misleading results, why is MIL-HDBK-217 still being used in Department of Defense Acquisition?" There are several potential answers, but the most prominent is that despite its shortcomings, system developers are familiar with MIL-HDBK-217 and its progeny. It allows them a "one size fits all" tool that does not require additional analysis or engineering expertise. The lack of direction in contractual language also leaves government agencies open to its use.

RELIABILITY ASSESSMENT
When system developers are asked to provide a reliability prediction as part of the contract there are two issues: 1) The source of the prediction and 2) The method for the prediction.
Based upon data from the Naval Surface Warfare Center Crane Division [16] approximately 50 percent of reliability predictions have no traceable source. The 23 percent that had a traceable prediction turned to MIL-HDBK-217 or its progeny 44 percent of the time despite the limitations and inaccurate results (as demonstrated in the previous sections).
The purpose of predictions is more than just a need for a "reliability number." It should be cautioned that simply updating MIL-HDBK-217 based upon current technology does not alleviate the underlying fundamental technical limitations addressed in the earlier sections. Predictions should provide design information on failure modes and mechanisms that can be used to mitigate the risk of failure by implementing design changes. This places the onus on the government to request the appropriate data in the contract. The government or customer should ask the contractor to perform a reliability assessment which consists of two components:
• System Reliability Model (SRM)
• An assessment of the contractor's planned reliability activities
The government or customer should use these two components to shape the reliability program for the system. They can then leverage design for reliability techniques to manage areas of medium to high risk. The reliability assessment then becomes a bi-directional tool that the reliability team uses to influence and trace changes to the system design and O&S costs.

SRM
The SRM is a graphical depiction of the system with an underlying analysis such as a Reliability Block Diagram, Fault Tree, or Event Tree. The analysis should identify critical weaknesses in the system design. Critical weaknesses are defined as those elements whose failure impacts mission completion, essential functions, safety, or those elements whose failure rates contribute significantly to the overall system (i.e. drive O&S costs). The SRM shall consist of the lowest identifiable functions/elements of the system and their relationship to each other. The SRM shall encompass all hardware and non-hardware elements including, but not limited to, commercial off-the-shelf, non-developmental items, government furnished equipment, software, human factors, and manufacturing.
Once the SRM is developed each element should be assigned an assessed and consistent reliability metric (such as reliability, MTBF, failure rate, etc). When assigning these values the system developer should rely on the following methods:
• Reliability analysis from comparable systems
• Historical reliability from predecessor systems (to include test or field data)
• Documented subject matter expert engineering opinion
All assumptions, sources of information, and justifications for methods selected should be documented. As the design matures, so should the SRM. Once the initial analysis is complete, each element should be assigned a level of risk based upon the guidance in Table 3. Once the risk assessment is complete, a mitigation plan for all critical elements rated as high or medium risk should be completed. In addition to this mitigation plan, the system developer or program manager should provide an assessment of their planned reliability activities and capture useful programmatic information for engineering reviews. The AMSAA Reliability Scorecard [17] is the proposed tool for conducting this assessment. This methodology has its limitations, relying on subject matter expert opinion when similar elements do not exist; however, rather than hide this uncertainty with additional adjustment factors, this assessment method promotes a transparent product and a better understanding of the system. This method captures the risk associated with a particular system design and becomes a management tool for the program and its decision makers.
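One way to picture the bookkeeping behind an SRM is the sketch below, which treats the model as a simple series Reliability Block Diagram. Everything in it is an assumption made for illustration: the element names, the mission reliability values, the recorded sources, and the risk ratings are invented, and the risk levels are entered by the analyst rather than computed because the criteria of Table 3 are not reproduced here. A real SRM may use a fault tree or event tree and will rarely be purely series.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Element:
    name: str
    reliability: float   # assessed probability of completing the mission (hypothetical)
    source: str          # where the value came from; kept for traceability
    risk: str            # "low", "medium" or "high", assigned by the analyst


def series_reliability(elements):
    """Series structure: the system works only if every element works."""
    return prod(e.reliability for e in elements)


def weakest_first(elements):
    """Order elements by unreliability so the main drivers appear first."""
    return sorted(elements, key=lambda e: e.reliability)


def needs_mitigation(elements):
    """Elements rated medium or high risk require a mitigation plan."""
    return [e.name for e in elements if e.risk in ("medium", "high")]


# Hypothetical SRM covering hardware and non-hardware elements.
srm = [
    Element("power supply", 0.99, "historical test data, predecessor system", "low"),
    Element("mission software", 0.95, "subject matter expert opinion", "high"),
    Element("operator interface", 0.98, "comparable system analysis", "medium"),
]

print(f"system mission reliability: {series_reliability(srm):.4f}")
print("drivers:", [e.name for e in weakest_first(srm)])
print("mitigation plans needed for:", needs_mitigation(srm))
```

Keeping the source and risk rating alongside each value is the design choice that matters here: it makes the uncertainty in each number visible instead of folding it into a single system-level figure.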

PATH FORWARD
MIL-HDBK-217 prediction methods produce inaccurate and technically invalid results that negatively influence design and programmatic decisions. These inherent limitations cannot be addressed by simply updating handbook versions or databases. The more technically sound approach (reliability assessment) discussed above is the start of a closed-loop failure mitigation process. This assessment promotes a transparent product and, when coupled with DfR activities, ensures a better understanding of the system and its reliability. The use of this assessment methodology requires the DoD acquisition community to develop and adopt contractual language that eliminates the use of MIL-HDBK-217 methods and requires system designers to provide a reliability assessment based on sound engineering analysis and practices.