Failure Analysis

Failure analysis is an investigative engineering approach to determine how and why equipment or a component has failed. It has the following four major objectives:

  • Verifying that a failure occurred.

  • Determining the mechanism of failure.

  • Determining the underlying cause(s) of the failure.

  • Recommending corrective and preventive action.

However, it has often been observed that even after a robust and systematic failure analysis (including both determining the cause of the failure and recommending preventive measures), the same or similar failures have recurred. There are two likely reasons for such recurrence:

  1. Preventive measures were based on the proximate cause of the failure, while the root cause was not sufficiently determined and addressed. Such measures may serve as a stopgap but fail to address the underlying systemic issues.

  2. The failure analysis approach was not combined with a robust failure management process.

Primary Cause Versus Root Cause Determination

While performing failure analysis to prevent future failure, it is very important that the primary cause of the failure not be considered as the root cause of the problem.

The primary cause is the set of conditions or parameters from which the failure originated. The old saying, “For want of a nail the shoe was lost, for want of a shoe the horse was lost, for want of a horse the battle was lost, for want of the battle the kingdom was lost,” illustrates a classic example of primary cause determination. The failure analyst must discover what is fundamentally responsible for the failure and determine the sequence of events that led to the final failure.

By contrast, the root cause of a failure is a process or procedure that went wrong (failed). For example, the root cause of the failures mentioned above would be the horse’s groom not checking that the horse’s shoes were properly nailed on before sending it into battle. Most failure analysis stops short of this final step. Instead, the primary cause of the failure is presented to the client, which limits the focus to short-term solutions (an approach of “fix the immediate problem and move on”).
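The distinction can be made concrete with a small Python sketch. It models the proverb as a chain of events and walks back from the final failure. The `Event` class, the `kind` labels ("condition" for a physical state or loss, "process" for a procedure that failed), and the function names are illustrative assumptions, not part of any established failure analysis method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """One link in a failure chain. Class and field names are illustrative."""
    description: str
    kind: str  # "condition" (physical state or loss) or "process" (procedure that failed)
    caused_by: Optional["Event"] = None


def earliest(final_failure: Event, kind: str) -> Optional[Event]:
    """Follow caused_by links back from the final failure and return the
    earliest event of the given kind, or None if the chain has none."""
    found = None
    event: Optional[Event] = final_failure
    while event is not None:
        if event.kind == kind:
            found = event
        event = event.caused_by
    return found


def primary_cause(final_failure: Event) -> Optional[Event]:
    """Earliest physical condition from which the failure originated."""
    return earliest(final_failure, "condition")


def root_cause(final_failure: Event) -> Optional[Event]:
    """Earliest failed process or procedure recorded in the chain."""
    return earliest(final_failure, "process")


# The proverb's chain, extended back to the groom's missed check.
groom_check = Event("groom did not check that the shoes were nailed on", "process")
nail = Event("nail missing", "condition", groom_check)
shoe = Event("shoe lost", "condition", nail)
horse = Event("horse lost", "condition", shoe)
battle = Event("battle lost", "condition", horse)
kingdom = Event("kingdom lost", "condition", battle)

print(primary_cause(kingdom).description)
print(root_cause(kingdom).description)
```

An analysis that records only the physical events never has a "process" link to find, so `root_cause` returns `None`. This mirrors an investigation that stops at the primary cause.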

The examples given in Table 1 illustrate the critical difference between the primary cause and the root cause of a failure:

Table 1 Difference between primary cause and root cause

To avoid recurrence and to determine the root cause of the failure, the primary cause must be supplemented by an intimate understanding of the entire history of the failed system or part, including its design, manufacturing, and use. From this information, which establishes the root cause, a new procedure should be crafted that prevents repetition of the failure and of failures of a similar or related nature.

Total Failure Management

Identifying the nature of failure and determining its cause in a way that properly guides mitigation, monitoring, and integrity management activities is vital to having effective failure management. Total failure management (TFM) is a process through which knowledge of failure mechanisms can be applied and lessons learned can be implemented to multiple areas of design and manufacturing.

TFM should focus on the physical and circumstantial evidence associated with materials degradation or failures caused by mechanical loading or corrosion. The objectives of TFM should include characterizing the basic failure mechanisms, identifying the environmental and operational factors that contribute to each failure mechanism, and proposing effective means of mitigating failure for similar facilities or equipment.

Thus, TFM seeks to apply the “lessons learned” to reduce or prevent similar failures in the future by developing engineering solutions. TFM should also provide a predefined process for applying the lessons learned to the immediate problem and for implementing them across the organization, so that the organization truly learns from the failure. When failures repeat themselves in any organization, it is clear that the organization is not learning from these incidents.

The three main components of TFM are failure mechanism investigation, root cause analysis, and implementation of the lessons learned. These components can be divided into two domains: failure analysis and failure management. TFM should use failure analysis to determine the physical failure mechanism, and root cause analysis to identify the underlying reasons (process or procedure) why the failure occurred. The output from these analyses is then used to develop engineering solutions that improve future failure mitigation, monitoring activities, and reliability management efforts. To employ TFM effectively, procedures should be established for each step as outlined in Table 2.

Table 2 Total failure management

The procedures used for each step will vary based on the complexity and size of the organization and the system being maintained and managed, but the general progression and objective will remain the same.
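As a rough illustration of that progression, the Python sketch below tracks the three TFM components for each failure and flags root causes that appear more than once, which is the sign of an organization not learning from its failures. The `FailureCase` record, its field names, and the sample pump and valve entries are all hypothetical, invented for this example. They are not taken from Table 2 or from any real case history.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureCase:
    """Hypothetical record of one failure; field names are assumptions."""
    equipment: str
    failure_mechanism: str = ""  # result of the failure mechanism investigation
    root_cause: str = ""         # failed process or procedure, from root cause analysis
    lessons_implemented: List[str] = field(default_factory=list)


def missing_components(case: FailureCase) -> List[str]:
    """List which of the three TFM components are still open for a case."""
    missing = []
    if not case.failure_mechanism:
        missing.append("failure mechanism investigation")
    if not case.root_cause:
        missing.append("root cause analysis")
    if not case.lessons_implemented:
        missing.append("implementation of lessons learned")
    return missing


def recurring_root_causes(cases: List[FailureCase]) -> List[str]:
    """Return root causes recorded for more than one case, sorted by name."""
    counts = Counter(c.root_cause for c in cases if c.root_cause)
    return sorted(cause for cause, n in counts.items() if n > 1)


# Invented sample data, for illustration only.
cases = [
    FailureCase("pump A", "fatigue cracking",
                "no torque check in assembly procedure",
                ["added torque check to assembly procedure"]),
    FailureCase("pump B", "fatigue cracking",
                "no torque check in assembly procedure"),
    FailureCase("valve C", "pitting corrosion"),
]

for case in cases:
    print(case.equipment, missing_components(case))
print(recurring_root_causes(cases))
```

In this invented data set, the lesson from pump A was applied only locally, so the same root cause reappears on pump B. A check of this kind is one simple way to surface that gap.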

In the context of risk, failure analysis reduces the likelihood of a failure by verifying the failure threats that are present, ensuring that the recommended mitigating measures are appropriate and directed toward the specific threats, and suggesting monitoring so that meaningful performance measures are in place. In short, failure analysis should be part of the continuous improvement process, providing feedback from failures that helps prevent future occurrences not only in the equipment that failed but also in similar products.

Since failure management contributes to managing risk by reducing the likelihood of failures, some discussion of the contribution of failure analysis to failure management is worthwhile. There are a number of ways in which the outcome of failure analysis supports the elements of failure management. Figure 1 presents some of the typical elements or activities covered by a failure management system and shows how failure analysis supports these elements.

Fig. 1 Role of total failure management in failure risk reduction

A plan for implementing lessons learned must address how solutions are determined, communicated to the proper audience, and fed back into the failure management process. Adequate resources must be committed to support these activities for the program to be successful.

Conclusions

Confusion between primary cause and root cause, and the blurred boundary between the failure analysis (FA) process and TFM, has had dire implications for many products, parts, and equipment. To tackle this problem, it is recommended to adopt the concept of TFM, in which the FA work process supports the failure management objectives of continuous learning and improvement by providing a structured process for understanding the root cause of the problem and how it (or similar failures) can be prevented in the future. Consequently, implementing a TFM system should bring integrity, cost-saving, and time-saving benefits.