
1 Introduction

Working in the UK defence industry on helicopter logistic support contracts, it was observed that helicopters required daily maintenance test (MT) action to find and replace failed Line Replaceable Units (LRUs). This is usually achieved by conducting a test using Built-in Test Equipment (BITE) and/or Automatic Test Equipment (ATE), together with the maintainers' knowledge and practices. However, with all of these, No Fault Found (NFF) events still occur.

There is an assumption, based upon experience, that these maintenance techniques are 100% effective [E] at detecting and replacing the failed component/LRU. Moreover, it is also incorrectly assumed that the BITE and ATE tests (from this point forward referred to as Test Equipment (TE)) cover 100% of the component/system and thus will identify all failure modes.

It is suggested that this is not the case: the MT has errors, such as NFF, and can leave a failed component/LRU on the system. This chapter proposes a model to give insight into MT effectiveness and errors, and their effect on component/system reliability from before the test (R) to after the test (RL), as well as on NFFs. The effectiveness of the maintenance test is not usually considered [1], and neither is the condition of the component before the test; often the component is disturbed before the whole system is tested.

2 System and Component Test

TE is designed to test certain parameters, which may not amount to a complete test of the whole component or system. This is the effectiveness of the test. For this example, imagine an LRU tested with TE where the effectiveness of the test is only 80%; that means 20% has not been tested. A fault may still exist in this untested 20%, as shown in Fig. 16.1.

Fig. 16.1 Untested part of a component

3 Analysis

To develop an MT model, the component/system reliability [failure rate] is partitioned into two parts. The reliability of the tested part is denoted \( R^{E} \), whilst the reliability of the NOT tested part is denoted \( R^{(1-E)} \). The overall reliability of the system/component is given by Eq. 16.1.

$$ R = R^{E} \cdot R^{(1-E)} $$
(16.1)
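As a minimal numeric sketch of Eq. 16.1 (the R and E values are assumptions, of the order quoted later in the chapter):

```python
# Assumed values for illustration: overall reliability R and fault coverage E.
R, E = 0.8, 0.8

r_tested = R ** E          # reliability of the tested part, R^E (~0.837)
r_untested = R ** (1 - E)  # reliability of the NOT tested part, R^(1-E) (~0.956)

# Eq. 16.1: the two parts multiply back to the overall reliability.
assert abs(r_tested * r_untested - R) < 1e-12
print(r_tested, r_untested)
```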

For the analysis it was assumed that the component was still fitted to the aircraft when the system was initially tested. An MT has only two possible initial outcomes, a PASS or a FAIL, and these apply only to the tested part \( \left( R^{E} \right) \) of the component/system. Each of these outcomes can itself be TRUE or FALSE:

PASS

  • TRUE—A correct result. The item is serviceable.

  • FALSE—An incorrect result. An unserviceable item has been passed. This occurs with probability β.

FAIL

  • FALSE—An incorrect result. A serviceable item has been failed. This occurs with probability α and will later be identified as an NFF.

  • TRUE—A correct result. The item is unserviceable.

A FAIL–FALSE item, when tested at depth maintenance, may result in an NFF, as the fault is not within the component.

It is also assumed that failures in the tested part of the component/system will be successfully repaired and that the reliability of the NOT tested part \( \left( R^{(1-E)} \right) \) will remain the same as before the MT. Figure 16.2 shows the MT results and their probabilities.

Fig. 16.2 Maintenance test possible outcomes
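The four outcome probabilities of Fig. 16.2, which feed Eqs. 16.4 and 16.8, can be sketched as follows (the parameter values are assumptions for illustration):

```python
R, E, alpha, beta = 0.8, 0.8, 0.05, 0.05  # assumed values for illustration
r_e = R ** E                              # reliability of the tested part

pass_true  = r_e * (1 - alpha)            # serviceable, correctly passed
fail_false = r_e * alpha                  # serviceable, incorrectly failed (NFF)
pass_false = (1 - r_e) * beta             # unserviceable, incorrectly passed
fail_true  = (1 - r_e) * (1 - beta)       # unserviceable, correctly failed

# The four outcomes partition the tested part's probability space.
assert abs(pass_true + fail_false + pass_false + fail_true - 1.0) < 1e-12
```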

4 Discussion of the MT Model Parameters

R shall be used to indicate the reliability of the system or component before a test has been conducted. From experience, values range from 0.5 for older, in-service complex machines to 0.999999 for weapon systems. R can be assessed by a simple count of successful uses divided by the total number of uses, or by use of distribution models of random failures, from the exponential (a constant failure rate) to the Weibull (with, say, an increasing failure rate).
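As a rough sketch of these two estimation routes (the counts and distribution parameters below are illustrative, not from the source):

```python
import math

# Point estimate: successful uses divided by total uses (illustrative counts).
successes, total = 470, 500
r_count = successes / total                 # 0.94

# Exponential model: constant failure rate lam over mission time t (assumed values).
lam, t = 2.0e-4, 100.0                      # failures/hour, hours
r_exp = math.exp(-lam * t)                  # ~0.980

# Weibull model: shape k > 1 gives an increasing failure rate (assumed values).
k, eta = 2.5, 1500.0                        # shape, characteristic life in hours
r_weib = math.exp(-((t / eta) ** k))        # ~0.999

print(f"count R={r_count:.3f}  exponential R={r_exp:.3f}  Weibull R={r_weib:.3f}")
```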

\( R^{E} \) shall be used to indicate the reliability of the tested part of the component/system. As previously stated, the whole system will not necessarily be tested by the test equipment.

α shall be used to indicate the probability of a FAIL–FALSE (i.e. a no fault found: the component is serviceable but is removed for repair). From experience, this is a subjective probability, usually in the range 0.01–0.20 (1–20%). It includes the probability that the test equipment is too sensitive and incorrectly detects faults, or that the judgement of the maintainer is too harsh. This probability will be affected by maintenance practice. If the maintainers have to ensure the system will be serviceable and available [2], and there is any doubt about the test equipment result, the item may be removed. They may even replace most of the components (LRUs) in a subsystem, up to four components at a time. Thus, assuming only one LRU actually failed, α is now 0.75 (75%), a high value. Alternatively, a technical manual may narrow the fault down to four LRUs, and so the maintainer will remove all four just to be on the safe side. This reduces the equipment downtime but should maintain or improve the operational availability [2], so long as there are sufficient spares within the supply chain.

β shall be used to indicate the probability of a PASS–FALSE: the component has failed but remains on the aircraft. There is a fault, but it has remained undetected (in the tested part) by the TE. From experience, this is a subjective probability, usually in the range 0.01–0.20 (1–20%). It includes the probability that the test equipment is not sensitive enough to detect failures, perhaps because a threshold has been set too high. Again, this probability is affected by maintenance practice. If the maintainers are poorly trained and under stress, they may pass the aircraft/component as serviceable when an LRU has actually failed. Questions must be asked: do the maintainers know how to use the test set? How frequently do they use it? What test set refresher training is provided, and at what periodicity [3]?

There is a relationship between α and β: they cannot both be above 0.5 at the same time, because the total probability would then be greater than 1.0. Figure 16.3 shows this relationship as a guide.

Fig. 16.3 Alpha–Beta relationship

E is the effectiveness of the fault coverage of the test and must lie between 0.0 and 1.0. It is how much of the machine/system failure rate is tested, or even testable, by the TE and the maintainers' knowledge. It is usually in the range 0.5, for in-service systems with poor test equipment, to 0.99 at the original equipment manufacturer (OEM), who will normally have the best test equipment and engineering knowledge. The maintainers' knowledge will be affected by the training on the system and the test equipment, the currency of the manuals, the time available to conduct the diagnostics and testing and, finally, the culture of the organization [3].

An estimate of the system E is

$$ E = 1 - \frac{\log\left(R_{L}\right)}{\log\left(R\right)} $$
(16.2)

where R is the reliability of the system/component before the test and RL is the reliability after the test. This estimate suggests the effectiveness is typically in the range E = 0.50 to 0.80.
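Equation 16.2 can be applied directly; the R and RL values below are illustrative only:

```python
import math

def estimate_e(r_before: float, r_after: float) -> float:
    """Estimate fault coverage E from Eq. 16.2 (assumes no test errors)."""
    return 1.0 - math.log(r_after) / math.log(r_before)

# Illustrative values only: R = 0.6 before the test, RL = 0.85 after.
print(estimate_e(0.6, 0.85))  # ~0.68, inside the quoted 0.50-0.80 range
```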

5 The Maintenance Test Model

We now have an MT model with which to investigate the reliability of the system after the test (RL) and the quantity of NFF produced by the test. For the following explanation it is assumed that the system contains four components (LRUs), as depicted in Fig. 16.4, and that the TE can only effectively test LRUs A and B.

Fig. 16.4 The system containing four LRUs being tested

If α and β for LRUs A and B are zero, these would be serviced or repaired as necessary, and it is assumed that they are then 100% serviceable (i.e. a perfect/correct repair). The reliability of the system is then limited by the reliability of LRUs C and D, or at least the untested aspects of those LRUs. The crux of the problem is: what percentage of the system does the TE test? The reliability of the system can be shown by Fig. 16.5.

Fig. 16.5 Reliability of system
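For the four-LRU example, a simple series-system sketch (with assumed, illustrative LRU reliabilities) shows how a perfect repair of the tested LRUs A and B leaves RL limited by C and D:

```python
# Assumed individual LRU reliabilities for illustration.
lru = {"A": 0.90, "B": 0.92, "C": 0.95, "D": 0.97}

# After a perfect test and repair of A and B (alpha = beta = 0),
# they are treated as fully serviceable (reliability 1.0).
r_l = 1.0 * 1.0 * lru["C"] * lru["D"]   # RL = RC x RD
print(f"RL = {r_l:.4f}")                # 0.9215
```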

Firstly, the reliability of the tested part after the test (RT):

$$ R_{T} = \frac{\text{PASS-TRUE}}{\text{PASS-TRUE} + \text{PASS-FALSE}} $$
(16.3)
$$ R_{T} = \frac{R^{E}\,(1 - \alpha)}{R^{E}\,(1 - \alpha) + \beta\,(1 - R^{E})} $$
(16.4)

But this reliability has to be multiplied by the reliability of the NOT tested part of the system/component, \( R^{(1-E)} \) (see Fig. 16.1). So,

$$ R_{L} = R_{T} \times R^{(1-E)} $$
(16.5)
$$ R_{L} = \frac{R\,(1 - \alpha)}{R^{E}\,(1 - \alpha) + \beta\,(1 - R^{E})} $$
(16.6)

Therefore, in the four-LRU example, RL = (RC × RD); there is unknown confidence in LRUs C and D. If α = β = 0 there are no test error terms.

Then \( R_{L} = R^{(1-E)} \): when α = β = 0 (no test errors), the reliability after the test is simply the reliability of the NOT tested part of the component/system.

So if there is 100% fault coverage [E = 1], then RL = 1.0, a perfect test and repair. If no part of the system can be tested, there is 0% fault coverage [E = 0] and RL = R: the reliability after the test is the reliability of the system/component before the test, and the test has no effect.
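As a minimal sketch of Eq. 16.6 and these limiting cases (the numeric values are assumptions for illustration):

```python
def r_after_test(r: float, e: float, alpha: float, beta: float) -> float:
    """Reliability after the maintenance test, Eq. 16.6."""
    r_e = r ** e
    return (r * (1 - alpha)) / (r_e * (1 - alpha) + beta * (1 - r_e))

# Limiting cases from the text (alpha = beta = 0):
print(r_after_test(0.8, 1.0, 0.0, 0.0))    # 1.0 - full coverage, perfect repair
print(r_after_test(0.8, 0.0, 0.0, 0.0))    # 0.8 - no coverage, test has no effect
print(r_after_test(0.8, 0.8, 0.05, 0.05))  # ~0.947 with assumed test errors
```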

Figure 16.6 is a plot of reliability after the test (RL) against fault coverage E for various reliabilities before the test (R).

Fig. 16.6 Reliability of whole system after testing (RL) against fault coverage (E) for various reliabilities before the test (R)

5.1 Probability of No Fault Founds

The probability of NFF can now be determined.

$$ \text{NFF} = \frac{\text{FAIL-FALSE}}{\text{FAIL-FALSE} + \text{FAIL-TRUE}} $$
(16.7)
$$ \text{NFF} = \frac{R^{E}\,\alpha}{R^{E}\,\alpha + (1 - R^{E})(1 - \beta)} $$
(16.8)
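A minimal sketch of Eq. 16.8, with illustrative parameter values:

```python
def nff_fraction(r: float, e: float, alpha: float, beta: float) -> float:
    """Probability that a FAIL result is a No Fault Found, Eq. 16.8."""
    r_e = r ** e
    return (r_e * alpha) / (r_e * alpha + (1 - r_e) * (1 - beta))

# Illustrative values: R = 0.8, E = 0.8, alpha = beta = 0.05.
print(f"NFF = {nff_fraction(0.8, 0.8, 0.05, 0.05):.1%}")  # ~21%
```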

Figure 16.7 shows the percentage NFF against MT effectiveness [E] for a range of reliabilities [R] with α = β = 0.05. For the same reliabilities, Fig. 16.8 shows what happens when α is increased to 0.3 and β to 0.7: the percentage of NFF increases quite substantially.

Fig. 16.7 Percentage NFF versus effectiveness for a range of reliabilities and α = β = 0.05

Fig. 16.8 Percentage NFF versus effectiveness for a range of reliabilities and α = 0.3 and β = 0.7

6 NFF and Effectiveness

With an average MT efficiency [failure rate coverage] [E] of 80% and reliabilities between 0.6 and 0.8, the percentage of NFF is in the range 10–22%. This is the typical range of NFF percentages observed in the helicopter support contracts for military avionic components. If a component has a reliability of 0.2 but an NFF rate of only 5–10%, this should be queried: something does not tally; either α and β are very high or the coverage is not as expected.

When the reliability of the component/system is high, say above 0.99, the quantity of failures is small and there are few failures to detect. In contrast, the percentage of NFF approaches 100%. Therefore, when measuring the percentage of NFF, the reliability of the system should be taken into account: a high percentage NFF is not a major cost when the system reliability is high. This suggests that for a very high-reliability system, above say 0.999, each MT will produce almost 100% NFF, at probability α. There may need to be a minimum time between MTs to ensure the probability of failure is high enough for the MT to detect real failures, i.e. so that the probability of a failure is at least equal to α, the probability of an NFF.
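A quick numeric check of this high-reliability behaviour, using Eq. 16.8 with assumed test-error values:

```python
def nff_fraction(r, e, alpha, beta):
    """Probability that a FAIL result is a No Fault Found, Eq. 16.8."""
    r_e = r ** e
    return (r_e * alpha) / (r_e * alpha + (1 - r_e) * (1 - beta))

# As R approaches 1 there are almost no true failures to find, so almost
# every FAIL is a false alarm and the NFF fraction tends towards 100%.
for r in (0.8, 0.99, 0.999):
    print(r, f"{nff_fraction(r, 0.8, 0.05, 0.05):.1%}")
# 0.8 -> ~21%, 0.99 -> ~87%, 0.999 -> ~98%
```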

6.1 So Why Carry Out Unnecessary Tests?

The periodicity of the test needs to be queried. The question to ask is: how often does a system (e.g. a missile) need to be tested if it has a proven high reliability? For every test there is a probability α that the result is a FAIL–FALSE. This is not an actual failure, but as a consequence of the test result additional maintenance is required.

6.2 NFF and In-Service Failure Modes' Effect on Reliability [R]

A usual assumption for a component failure rate (MTBF) prediction is that the failure rate is 'random', with a constant hazard function, and that all failure modes are attributable, i.e. there is no NFF. This assumes that any design or manufacturing defects have been found and corrected in the product development phase. In defence products, the authors' experience is that up to an additional 40% has to be added to the 'random' failure rate to account for design or manufacturing defects in the product's in-service phase. These might have been detected with greater Production Reliability Assessment Testing [2]. Did the prediction include a percentage of random failures, and was this the same as the actual in-service number of failures? If not, it will need to be adjusted to account for reality.

6.3 Design, Manufacturing and Testing Error Examples

The following are some examples of design, manufacturing and testing errors that only became evident during operations.

6.3.1 Example—Runway Denial System

A major design failure mode occurred in the mine of a runway denial system: during trials, when the mine landed it detonated prematurely. The fault was traced to a gas motor that, on operation, generated an electromagnetic pulse which fired the trigger circuit. The corrective action was to put a Faraday cage around the trigger circuit.

6.3.2 Example Military Helicopter Main Rotor Blades

Another design failure was on a military helicopter's main-rotor blades' leading-edge shields. There was no design requirement for the shield butt joints to be parallel and to have a specified minimum clearance distance. The butt joints are not visible during operations, as they are covered with a butt strap. During flights, temperature changes caused the leading-edge shields to expand, touch and buckle at the joint, making the butt strap loose. Corrective action was therefore instigated: to specify a minimum distance between the two sides of the butt joint and how parallel (or not) they should be. During manufacturing it was not possible to assemble the shields with a perfectly parallel butt joint of, say, 2 mm apart; for two 'parallel' butt joint lengths of 50 mm, the original assemblies permitted a 1 mm difference between the joint's two ends, and thus 2 mm over 100 mm, and so on. However, with a temperature change this 1 mm gap closed and the shields touched. The design engineers had to set an off-parallel limit. For example, for a 50 mm length of shield, the gap between the two butt joints had to be within 0.5 mm in 50 mm, but with a minimum distance of 2 mm; the maximum off-parallel difference between the two butt joint ends would therefore be 2.5 mm and the minimum distance 2 mm over 50 mm, and 5 mm and 4 mm respectively for a 100 mm length of butt joint.

6.3.3 Example Military Helicopter Common Control Unit

An example of a major manufacturing fault concerned the common control unit, a keypad/screen interface to the military helicopter's computers. On a single key press the screen went through two menu screens [key bounce] instead of one. The fault was traced to a small pip (a small protrusion) on the underside of the key moulding, left over from the manufacturing process. The corrective action was to remove the pip.

6.3.4 Example Military Helicopter Main Windscreen

Another was a military helicopter main windscreen made of six alternating layers of glass and polycarbonate. The screen is not a load-carrying part of the airframe. One of the layers cracked in the hangar overnight. The fault was traced to the manufacturing process, which produced the screens on a horizontal jig; this placed unwanted stresses into the screen when it was fitted to the airframe in the vertical position. The corrective action was to change the jig to the vertical position.

Also, large military products are usually integrated with their subsystems by the prime contractor. This introduces additional failure modes when the integration process fails: the system fails even though all the subsystem LRUs are functioning correctly, i.e. the interface design documents between the subsystems are in error. By definition, this produces NFF at the original equipment manufacturer (OEM).

6.3.5 Example Military Helicopter Tail Oscillation

Another example was a military helicopter that had an oscillation of the tail of the airframe during the development phase. The investigation looked into the stiffness of the airframe, the airflow around the airframe, the hydraulic servos in the flight control system and the flight control auto-stabilization system. None of these were found to be at fault; they were all operating as designed and were built to the design interface specification from the system integrator. It was an integration fault. The corrective action, chosen by the system integrator, was to change the software in the flight control auto-stabilization system. This was also the cheapest solution.

6.3.6 Example Air-to-Air Missile

An example of a major integration fault occurred on an air-to-air missile. When the electronic units from different vendors were tested together, it was found that high electrical currents flowed between the units. The fault was traced to the electronic units themselves: one had been designed for a 4.5 V supply whilst the other was designed for 5.0 V. Unfortunately, the design specification did not specify the power supply voltage.

7 Calculating In-Service Reliability

These examples show that the additional failure modes arising from extant design, manufacturing and subsystem integration faults have to be taken into account in the in-service reliability (R) measurement methods and in the definition of NFF. Training in the analysis of component record cards and maintenance job cards for trending may be necessary.

Failure modes generated by use of the system outside its design envelope, and any failure modes introduced by maintenance action or by disturbance of an LRU/system, will degrade its reliability. These failure modes are not random; they are strictly NFF and are classed as non-attributable [4].

The examples above are failure modes caused by design, manufacturing or integration issues, for which corrective action will be devised and implemented. Similar failure modes of a minor nature, however, will usually be classed as 'random' and have no corrective action devised. This suggests that a repairable system, e.g. a helicopter, will not have a constant hazard function.

When modelling the life-cycle costs (LCC) of a helicopter fleet, it was common practice to ensure that the reliability (R) used for LRU modelling included the extant design/manufacture and non-attributable failure modes. From experience, it was common practice for the helicopter mean time between failures (MTBF) to be reduced by up to 40%, with a further reduction of 15% to account for non-attributable failure modes. This 15% was considered to be the percentage generated by the normal behaviour of the helicopter operator.

The reason for this is that the support contract was fixed-price for the cost of all repairs and of new-buy spares to replace scrapped items. Including these normal-behaviour, non-attributable failure modes minimized the prime contractor's risk of losing money. While the support contract was in operation, the normal-behaviour non-attributable failure modes were monitored, along with the design and manufacture failure modes, to ensure compliance. Again, better trending of data may be required.

The following example demonstrates what happens if more time, money and effort are invested in the test equipment, manuals and engineers' training, and the effect on RL and NFF.

Table 16.1 shows the effect if the MT coverage (E) is increased from 0.8 to 0.9. It should be assumed that there are 150 maintenance tests in any period.

Table 16.1 Relationship between reliability and increased test coverage

The results show an increase in reliability (RL) from 0.948 to 0.968 and a decrease in NFF of 2%.
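The inputs behind Table 16.1 are not stated; as a hedged sketch, assuming R = 0.8 and α = β = 0.05 approximately reproduces the quoted results:

```python
def r_after_test(r, e, alpha, beta):
    """Reliability after the maintenance test, Eq. 16.6."""
    r_e = r ** e
    return (r * (1 - alpha)) / (r_e * (1 - alpha) + beta * (1 - r_e))

def nff_fraction(r, e, alpha, beta):
    """Probability that a FAIL result is a No Fault Found, Eq. 16.8."""
    r_e = r ** e
    return (r_e * alpha) / (r_e * alpha + (1 - r_e) * (1 - beta))

# Assumed inputs (not stated in the source): R = 0.8, alpha = beta = 0.05.
for e in (0.8, 0.9):
    print(f"E={e}: RL={r_after_test(0.8, e, 0.05, 0.05):.3f}, "
          f"NFF={nff_fraction(0.8, e, 0.05, 0.05):.1%}")
# E=0.8: RL~0.947, NFF~21.2%; E=0.9: RL~0.967, NFF~19.1%
# i.e. RL rises by about 0.02 and NFF falls by about 2 percentage points,
# close to the quoted change from 0.948 to 0.968 and the 2% NFF reduction.
```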

8 Conclusion

The MT model can be used when assessing whether to invest in better test equipment and engineers' training to reduce the α and β probabilities. This would provide better test results, benefit the overall operational availability and reduce the occurrence of NFF: a good result for all involved.

Furthermore, the MT model can be used to analyze current maintenance scenarios for their effectiveness.

NFF is related to E, α and R, whereas RL is related to E, β and R. E therefore has a major effect on the MT results and should not be disregarded when considering system and component reliability.

Highly reliable systems may need a minimum time between MTs to reduce the quantity of NFF. Due consideration should be given to the periodicity between MTs: too-frequent testing will increase the number of FAIL–FALSE results and NFFs, and consequently will impact the operational availability.