A learning process fails when one or more of its stages are deficient and when the same events or similar events recur. Learning deficiencies always involve different levels of the socio-technical system in a hierarchical dimension.

In mature industries, accidents often act as a trigger, showing that certain beliefs were incorrect and that some fundamental, implicit assumptions concerning the safety of the system were wrong. This requires a search for new models that better represent reality, a process that can turn out to be both painful and expensive, and may therefore be rejected. As a result, people are in denial. They tell themselves that accidents could not happen to them and refuse to accept the risk to which they may be exposed. At an organizational and institutional level, group-think phenomena or commitment biases can lead to collective denials.

However, a number of major accidents have been preceded by warnings raised by people familiar with the respective systems and who attempted, unsuccessfully, to alert actors who had the ability to prevent a danger they perceived. Very often, the dissenting opinions and whistleblowers were not heard due to cultures in which bad news was not welcome, criticism was frowned upon, or where a “shoot the messenger” attitude prevailed.

Although there are many situations in system design and daily operations in which engineers and managers prioritize safety over production/cost goals, a lot of evidence in accident cases suggests that safety is not receiving sufficient attention. Performance pressures and individual adaptation push systems in the direction of failure and lead organizations to gradually reduce their safety margins and take on more risk. This migration (or drift into failure, normalization of deviance, and the associated erosion of safety margins) tends to be a slow process, during which multiple steps occur over an extended period of time. As these steps are usually small, they often go unnoticed, a “new normal” is repeatedly established, and no significant problems are noticed until it is too late.

This chapter describes why the failure to learn the lessons from incidents and accidents is a common weakness in safety management and a significant causal factor of many accidents. Most major accidents are caused by a combination of multiple direct and root causes, but the failure to learn is, in fact, often one of the recurring root causes.

In safety management, the goal of analyzing an incident is to understand what, how, and why it happened a certain way; the goal of learning from an incident is to avoid the recurrence of a similar one. This may happen within or outside of the organization, within another organization, or even in another industrial sector.

Since the 1980s, high-hazard industries have established processes to learn lessons from incidents and implement these findings in system designs and operations. This is recognized today to be a key safety function that is requested within regulatory safety management systems. Most industries devote significant resources to these processes and have developed several tools and methods to investigate the events and learn lessons (Carroll et al. 2003; Frei et al. 2003; Sklet 2004; Kingston et al. 2007; ESReDA 2009; Ferjencik 2011; Dechy et al. 2012; Dien et al. 2012; Hagen 2013; Ramanujam and Carroll 2013; Marsden 2014; Drupsteen and Guldenmund 2014; Rousseau et al. 2014; Blatter et al. 2016).

However, this pillar of prevention strategy is not always as effective as expected. There are safety-relevant incidents that are not detected, notified, or reported, internally or externally. There can be a lack of depth in the analysis. Corrective actions may be superficial or not implemented in time to prevent the recurrence of a similar event. There may be whistleblowers no one listens to. It can also be that the memory of some lessons is lost or that the resources dedicated to the learning process are inadequate.

Accident investigations frequently find the learning processes to be deficient. This weakness is a recurring root cause, among others such as production pressures, organizational complexity, regulatory complacency, human-resources deficiencies, and blindness, deafness, or denial on the part of management (Dien et al. 2004, 2012; Rousseau and Largier 2008; Dechy et al. 2011a; ESReDA 2015; Starbuck and Baumard 2005).

Let us illustrate some of the failures to learn, especially the weaknesses in a company’s internal reporting processes and in its interactions with the control authorities. Indeed, an important characteristic of a learning organization is its ability to create a culture and work climate in which reporting potentially safety-relevant events—including errors and mistakes—becomes systematic. If the work climate and culture lead to fear of blame and discourage people from raising questions and expressing concerns, the basis for learning is missing.

In all industries, severe incidents have to be reported to the control authorities. In some industries—such as aviation and nuclear power—the control authorities conduct inspections and regulatory assessments of the safety management systems, including the learning systems. Open and transparent discussions are necessary for the regulators to exercise control in an informed manner. However, reporting is not always systematic and does not always lead to adequate measures being taken by a company’s management and the regulatory authority. We explain some causes of this problem later.

The following four case studies, chosen as examples of (internal and external) underreporting, comprise major accidents in different industries, countries, and periods. The accident-investigation reports often provide very detailed accounts of root causes and connect issues that, when brought together, allow us to identify and learn from some organizational patterns of failure.

Industrial Accidents

The Crash of the McDonnell Douglas DC-10 at Ermenonville, 1974

On March 3, 1974, a DC-10 crashed in the Ermenonville forest shortly after takeoff from Paris Orly International Airport. The cargo compartment door had opened suddenly, causing a decompression of the cargo compartment that led to the rupture of the cabin floor and damaged vital control cables. A further cause of this accident, however, was the failure to learn, in particular problems with internal and external reporting (Eddy et al. 1976; McIntyre 2000; Llory 1996).

  • Already during pre-certification ground tests of the DC-10 in 1970, a sudden, explosive opening of the cargo compartment door had taken place.

  • A major near miss due to almost the same problem occurred over Windsor (Canada) in June 1972. A crash was avoided because the plane carried only 56 passengers and the pilot was highly experienced and trained. The damage to the floor caused by the depressurization of the cargo compartment was limited, and some electrical controls remained operational, allowing the pilot to maintain control.

  • After the 1970 ground test and the 1972 Windsor incident, a manager at Convair—the McDonnell Douglas subcontractor in charge of the engineering design of the door-locking mechanism—wrote the so-called Applegate memorandum, pointing to the risks relating to the cargo compartment door.

  • Some training pilots at McDonnell Douglas warned their management that the start of operations of the DC-10s sold to Turkish Airlines was premature.

  • After the 1972 near miss in Windsor, under pressure from the American and Dutch control authorities, McDonnell Douglas had to divulge to the Federal Aviation Administration (FAA) that “there had been about one hundred airline reports of the door failing to close properly during the 10 months of DC-10 service” (McIntyre 2000). The FAA was found to have been lax in analyzing McDonnell Douglas’s answers regarding this situation.

  • After the accident near Paris, detailed statistics showed that 100 DC-10 incidents involving the door mechanism had been recorded in the six months prior to the 1974 crash. This is evidence of a very high incident rate: 3.3 incidents per plane per year related to the opening or closing of the door, and 20 incidents per plane per year related to the door in general.

Learning Failures in Radiotherapy: Therac 25

Between 1985 and 1987, six severe incidents, four of them fatal, occurred in the United States and Canada involving excessive radiation emitted by a cancer treatment machine, the Therac 25. The machine had been installed in 11 treatment centers (5 in the United States, 6 in Canada) since 1983 and had operated without incident. There were several causes of these accidents, including software failures (Leveson and Turner 1993; Leveson 1995; Llory and Montmayeul 2010).

We do not know much about the first three incidents, which occurred in three different centers. After the fourth incident, the center’s investigation neither found the cause nor managed to re-create the incident, despite the presence of engineers from the manufacturer, Atomic Energy of Canada Limited (AECL). Instead, the engineers assured everybody that the machine could not cause over-irradiation and that no accidents had occurred.

However, three weeks later, another incident in the same center allowed the hospital radiologist to find the anomaly and re-create it in front of AECL engineers. After the sixth accident in 1987, other software bugs were found.

The AECL engineers were convinced that no over-irradiation could occur as a result of their design. For them, the incidents had to do with incorrect use of the machine. As it turned out, the software bug was a new problem, underestimated due to the engineers’ overconfidence in technology.

By now we know that software is difficult to test and often impossible to check fully. However, the AECL engineers ignored the most basic quality assurance procedures and safe-design principles, such as the redundancy and defense-in-depth applied in other industries.
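To make the redundancy and defense-in-depth principles mentioned above concrete, here is a minimal Python sketch in which the beam can be enabled only if a software consistency check and an independent hardware interlock both agree, so that a single software fault cannot trigger an overdose on its own. The mode names, turntable positions, and interlock signal are purely hypothetical illustrations of the principle, not a description of the actual Therac 25 (or Therac 20) design.

```python
# Minimal, purely illustrative sketch of defense-in-depth: the beam is enabled
# only if BOTH an independent hardware interlock and the software state check
# agree. All names and values are hypothetical, not the actual Therac design.

SAFE_CONFIGURATIONS = {
    ("electron", "scan_magnets"),    # low-current electron mode
    ("xray", "flattening_filter"),   # high-current X-ray mode requires the filter in place
}

def software_check(mode: str, turntable_position: str) -> bool:
    """Software consistency check: the selected mode must match the turntable position."""
    return (mode, turntable_position) in SAFE_CONFIGURATIONS

def hardware_interlock(turntable_switch_closed: bool) -> bool:
    """Independent hardware interlock, e.g. a microswitch sensing the turntable position."""
    return turntable_switch_closed

def beam_enabled(mode: str, turntable_position: str, turntable_switch_closed: bool) -> bool:
    # Redundancy: a single software fault (e.g. a race condition that corrupts the
    # mode variable) cannot enable the beam on its own.
    return software_check(mode, turntable_position) and hardware_interlock(turntable_switch_closed)

if __name__ == "__main__":
    print(beam_enabled("xray", "scan_magnets", True))        # False: software check blocks the beam
    print(beam_enabled("xray", "flattening_filter", False))  # False: hardware interlock blocks the beam
    print(beam_enabled("xray", "flattening_filter", True))   # True: both layers agree
```

By way of contrast, the Therac 25 relied on software checks alone, whereas its predecessor had included independent hardware interlocks (Leveson and Turner 1993).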

Since it denied having problems, AECL failed to inform other user centers about the incidents or anomalies when they occurred—an attitude that prevented the emergence of collective risk awareness.

After the fifth incident, the American Association of Physicists in Medicine set up a user group and organized meetings with the three stakeholders—that is, users, designers, and control authorities from the United States and Canada—to exchange experiences and discuss the incidents. They learned that some users had implemented supplementary mechanical barriers as additional controls.

The lack of reaction of the US control authorities was at first explained by their lack of competence and risk awareness regarding software issues. However, the geographical dispersion of the users in the United States, together with the rule that it was the system designer rather than the users who was required to inform the control authority, had diluted the warning signs coming from users. Later, the control authorities changed their reporting requirements so that users could report events both to the control authority and to the system designers.

Learning Failures at the Davis-Besse Nuclear Power Plant

In March 2002, more than 20 years after the Three Mile Island accident in 1979, another incident in the US nuclear industry provided striking safety lessons.

A cavity the size of a football was found in the upper section of the nuclear reactor vessel during a planned outage. Over several years, corrosion had perforated 6.63 inches of carbon steel, and the cavity reached the stainless steel cladding of the vessel (Department of Energy 2005). This last layer had withstood the primary circuit pressure of 2500 psi (172 bar), though it was not designed to do so. The corrosion was driven by the boric acid used in the primary circuit, which was leaking through a crack where a control-rod nozzle penetrates the vessel head. The plant was within two inches of a severe accident, namely, a loss of primary coolant.

The task force of the operator, FirstEnergy Nuclear Operating Company (FENOC), investigated the causes of the incident and identified several organizational and managerial flaws (Myers 2002), showing the multiplicity of causes. Among them were that FENOC’s top management was production-oriented rather than safety-oriented, and that organizational changes were conducted without assessing their safety impact. The deficiencies in the learning process were numerous (Department of Energy 2005) and contributed to the underestimation of risks, even though the vessel corrosion had been known about for years.

Already in the 1980s, the risk that the control-rod penetration nozzles could crack was known. In 1991, such a crack led to an incident in a French nuclear power plant (Llory and Montmayeul 2010). In 1993, the Nuclear Regulatory Commission (NRC) asked for a corrective action plan. The industry answered that there was hardly any risk involved and that problems would be discovered through inspections. The NRC accepted this statement but required more monitoring and asked operators to develop techniques to better detect irregularities.

In 2001, a group of inspectors found extensive cracks at the Oconee Nuclear Power Plant. Mandated by the NRC, the Electric Power Research Institute then classified the plants by vulnerability. The most vulnerable were those designed by Babcock and Wilcox, as both the Oconee and Davis-Besse plants had been.

By November 2001, cracking had been identified in all the Babcock and Wilcox nuclear power plants except Davis-Besse, which had decided not to shut down for an inspection, even though the NRC had required one before the end of 2001. FENOC wanted to postpone the inspection until the planned outage in March 2002 and convinced the NRC to wait until that outage to inspect for cracking-related risks.

After the incident, FENOC, the NRC (2002), and the Department of Energy (2005) enumerated the learning failures:

  • A lack of learning from internal events: There was a long list of leaks at Davis-Besse, a large number of which were not examined in depth, evaluated, or corrected. The focus was on treating symptoms rather than identifying root causes. In addition, reports of leaks were not kept in the archives, and no adequate risk analyses were performed.

  • A lack of external learning on the national and international levels with poor benchmarking on the boric acid issues: Interviews with operators showed that they did not know about some of the lessons of the incidents that occurred in the 1980s in other US plants and, as a consequence, believed that the corrosion risk was rather low, despite the deposit of dry boric acid on the vessel.

  • Some employees were aware of the isolated symptoms and the risks related to extensive corrosion of the vessel but did not alert managers to take precautionary measures: The management relied excessively on the findings of the resident inspectors to identify (serious) issues. All independent control functions (internal quality assurance, system engineers, resident inspectors, local commission of the plant) missed the degradation of the reactor vessel signaled by dry boric acid deposits as well as the increase of primary circuit leaks between 1996 and 2002.

  • An ineffective corrective action program in which recurring problems were not treated, including the underestimation of deficiencies: There were superficial analyses of the causes, in particular of the boric acid deposits that indicated leaks in 1996, 1998, and 2000. All pertinent reports were downgraded as “normal,” which implied that no root-cause identification or corrective action was necessary. Within the operating company, there was agreement on a “well-defined” problem, namely leaks at the flange, without verification through inspection; this was key to downgrading the risk. A corrective action item could be considered completed and closed by referring to a document on recurring issues such as boric acid deposits, or by work limited to a removal of the deposits. As a result, the evidence was judged insufficient to warrant an unplanned shutdown.

  • All these severe and recurring deficiencies, in particular the final delay of the inspection, were allowed by the regulator, the NRC: The NRC had sufficient evidence to require an inspection or a shutdown, but it accepted the compromise for a delay requested by FENOC. This is what we can call regulatory complacency.

Learning Failures at the Texas City Refinery

At BP’s Texas City Refinery, an explosion, followed by a fire, occurred in March 2005 during the startup of an isomerization unit (ISOM), killing 15 workers. Several components, including safety devices, had not been functioning adequately, and a sequence of actions had been taken that led to the accident. Several failures to learn were identified after the accident (U.S. Chemical Safety and Hazard Investigation Board 2007; Hopkins 2010):

  • The US Chemical Safety and Hazard Investigation Board (CSB) noted that “[m]any of the safety problems that led to the March 23, 2005, disaster were recurring problems that had been previously identified in audits and investigations. […] In the 30 years before the ISOM incident, the Texas City site suffered 23 fatalities. In 2004 alone three major incidents caused three fatalities. Shortly after the ISOM incident, two additional incidents occurred […].”

  • There was a repeated failure to analyze (exhaustively and in depth) severe incidents that could have, in other circumstances, caused catastrophic effects.

  • The CSB also noted a failure to implement an effective learning system, despite several audits pointing to its deficiencies: “BP had not implemented an effective incident investigation management system to capture appropriate lessons learned and implement needed changes.”

  • BP and the petroleum industry did not learn from their incidents and violated a number of standards.

  • BP did not learn from a series of incidents nor from an accident that occurred at BP Grangemouth in Scotland. All were investigated by the Health and Safety Executive, the UK control authority. Several of the root causes identified were similar to those of the Texas City accident.

  • The CSB observed that several managers were aware of the degraded state of the refinery. For example, the new director of BP’s South Houston Integrated Site (consisting of five BP businesses, including the Texas City site) observed in 2002 that the Texas City Refinery infrastructure and equipment were “in complete decline.”

  • In March 2004, a $30 million accident occurred on site, and many learning failures were identified by the dedicated investigators, especially in reporting (see the section on reporting and learning culture below).

  • Corrective actions and change management were poor and declining at Texas City, as noted by CSB: “Texas City had serious problems with unresolved PSM [process safety management] action items. […] At the end of 2004, the Texas City site had closed only 33 percent of its PSM incident investigation action items; the ISOM unit closed 31 percent.”

  • Concerns were growing and were also shared by the BP management. A November 2004 internal presentation made for BP management titled “Safety Reality” was intended as a wakeup call for the Texas City site supervisors. It stated that the plant needed a safety transformation and included a slide titled “Texas City is not a safe place to work.”

  • In late 2004, the BP management called for a safety culture assessment to be performed by a consulting company called Telos. The assessment identified some of the key root causes that would lead to the March 2005 accident. Here are extracts from the CSB report (U.S. Chemical Safety and Hazard Investigation Board 2007): “Production and budget compliance gets recognized and rewarded before anything else at Texas City. […] The pressure for production, time pressure, and understaffing are the major causes of accidents at Texas City. […] There is an exceptional degree of fear of catastrophic incidents at Texas City.”

Analysis

Structuring the Numerous Failures to Learn

The accidents described are far from being the only ones that highlight the difficulties of learning from failure. Other accidents, such as Three Mile Island (1979), the Ladbroke Grove train collision (1999), and the Space Shuttle Columbia accident (2003), show additional features of the failure to learn (Cullen 2000; Columbia Accident Investigation Board 2003; Llory 1996, 1999; Dechy et al. 2011a).

When analyzing learning failures as contributing causes of accidents, we found deficiencies at nine key stages of the learning process (Dechy et al. 2009), listed below (a schematic checklist sketch follows the list):

  1. the definition of the learning system and policy (Footnote 1)
  2. the detection of the event or recognition of the safety threat
  3. the collection of adequate data
  4. the analysis (Footnote 2) of the event(s)
  5. the definition of the corrective measures
  6. the implementation of the corrective measures
  7. the assessment and long-term monitoring of the effectiveness of corrective measures
  8. the memorizing and recording of the event, its lessons, its treatment, and its follow-up
  9. the communication of the lessons to be learned by stakeholders and potentially interested parties
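As a minimal illustration of how these nine stages can serve as an audit checklist, the following Python sketch models the stages and flags those judged deficient in a hypothetical assessment. The stage identifiers and the boolean scoring format are our own shorthand, not part of the framework cited above.

```python
from enum import Enum, auto

class LearningStage(Enum):
    """The nine stages of the incident-learning process (after Dechy et al. 2009)."""
    POLICY_DEFINITION = auto()                  # 1. learning system and policy
    EVENT_DETECTION = auto()                    # 2. detection of the event / safety threat
    DATA_COLLECTION = auto()                    # 3. collection of adequate data
    ANALYSIS = auto()                           # 4. analysis of the event(s)
    CORRECTIVE_MEASURE_DEFINITION = auto()      # 5. definition of corrective measures
    CORRECTIVE_MEASURE_IMPLEMENTATION = auto()  # 6. implementation of corrective measures
    EFFECTIVENESS_MONITORING = auto()           # 7. assessment and long-term monitoring
    MEMORY_AND_RECORDING = auto()               # 8. memorizing and recording of the event
    COMMUNICATION = auto()                      # 9. communication of the lessons to stakeholders

def deficient_stages(assessment):
    """Return the stages judged deficient (scored False) in an audit assessment."""
    return [stage for stage, adequate in assessment.items() if not adequate]

# Hypothetical audit: analysis and implementation are judged weak, all else adequate.
audit = {stage: True for stage in LearningStage}
audit[LearningStage.ANALYSIS] = False
audit[LearningStage.CORRECTIVE_MEASURE_IMPLEMENTATION] = False

print([stage.name for stage in deficient_stages(audit)])
# ['ANALYSIS', 'CORRECTIVE_MEASURE_IMPLEMENTATION']
```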

A learning process fails when one or more of its stages are deficient and when the same or similar events recur, since one of the objectives assigned to the learning process has not been reached (Dien and Llory 2005). This was clearly demonstrated in the accidents analyzed above, which involved not just one but multiple failures within the learning process, at different stages.

At the very least, there was the inability to implement an adequate learning policy (Texas City); detect events (Davis-Besse); analyze events or trends (DC-10, Therac 25, Davis-Besse, Texas City); or implement effective corrective actions (DC-10, Therac 25, Davis-Besse, Texas City).

The learning steps always involve different levels of the socio-technical system in a hierarchical dimension, but other elements as well. Indeed, we also highlighted deficiencies in learning from external events, systems, and countries (Therac 25, Davis-Besse, Texas City), which led us to consider that a second learning policy and a full learning process are needed for external lessons. Inter-organizational learning requires a dedicated will and the ability to translate the lessons and corrective actions from a first system into the context of a second system in order to compensate for the loss of context (Koornneef 2000).

In addition, we have found repeated learning deficiencies over long periods of time, based on responses to single incidents rather than to groups of similar ones (DC-10, Therac 25, Davis-Besse, Texas City). High-hazard companies should be able to gain insights from history in order to identify trends, patterns, recurring events, or differences and thereby detect new lessons that could have been missed during the first analysis. Such analyses mean reopening old cases and require a system, such as an event database, to maintain records. These types of deficiencies point to the importance of a third dimension of the learning policy: the historical dimension (Dechy et al. 2009), which corresponds to the third dimension of organizational analysis (Dien et al. 2004, 2012).

We suggest that the learning process should also be analyzed along a fourth dimension, namely communication (Dechy et al. 2009), a dimension that cuts across and is shared by the three others (vertical, transversal, historical). Indeed, learning systems exist to process information, extract it from the context of the event, and formalize lessons in different formats such as reports, databases, safety alerts, and stories. They involve several actors, each with their own inputs and biases. Expert judgment also means using the right rhetoric to convince others of a threat to safety.

These dimensions are summarized in Fig. 6.1, which provides a framework for the improvement of actions at each stage of the learning process and an awareness of potential deficiencies to address in an audit, for example.

Fig. 6.1 The learning-process issues in four organizational dimensions. (Source: Adapted from Dechy et al. 2009)

Underlying Patterns of Underreporting and Lack of Reaction to Warning Signs

Above, we provided examples and a framework of what, where, and how the failure to learn can happen and be observed. These elements can be seen as symptoms of underlying problems, similar to the distinction between symptoms and pathologies of learning barriers made by the European Safety, Reliability & Data Association project group on dynamic learning (see Fig. 6.2; for further developments, see chapters 3 and 4 of ESReDA 2015).

Fig. 6.2 Symptoms and pathogens related to failure to learn. (Source: ESReDA 2015)

The objective of the following discussion, then, is to deepen the analysis and look for some underlying patterns and potential syndromes of learning failures. We have therefore divided the remaining analysis into four parts: the role of beliefs and safety models; the role of reporting and learning culture; the role of the regulators; and the way in which safety concerns are integrated into decision-making processes and trade-offs, addressing the question: Is safety really first?

Beliefs and Safety Models

In mature industries, accidents too often act as a trigger, showing us that our beliefs were incorrect and that some fundamental, implicit assumptions we made concerning the safety of the system were wrong.

Accidents are often defined in terms of their technical origins or their physical impacts (damage, emergency response). Turner (1978, in Turner and Pidgeon 1997) defines them in sociological terms: a significant disruption or collapse of the existing cultural beliefs and norms about hazards. For Turner, an accident only ends when the wrong beliefs about safety have been changed.

More generally, failure indicates that our existing models of the world are inadequate, requiring a search for new models that better represent reality (Cyert and March 1963). Challenging the status quo can turn out to be expensive, which can lead people not to look closely enough into warnings that something is not as it should be.

Often, there are biases we have and share with colleagues. Sometimes we are in denial and tell ourselves that “it couldn’t happen to us.” On an individual level, denial is related to cognitive dissonance, a psychological phenomenon in which people refuse to accept or face the risk to which they are exposed. At an organizational and institutional level, group-think phenomena or commitment biases can lead to collective denials.

A general trend observed in several process-driven industries is the pursuit of the wrong kind of excellence. System, industrial, or process safety is too complex to be monitored through safety or key performance indicators alone. Several organizations have confused safety excellence with occupational safety indicators, such as injury rates, which are easier to measure (see McNamara’s fallacy in Kingston et al. 2011).

In 2000, there was a severe accident at the BP refinery of Grangemouth in Scotland. It was investigated in depth by the UK Health and Safety Executive (HSE). Three senior BP process safety engineers, including the BP Texas City process safety manager, were aware that “traditional indicators such as ‘days away from work’ do not provide a good indication of process safety performance” (U.S. Chemical Safety and Hazard Investigation Board 2007).

However, this lesson was not learned by BP, because its Texas City Refinery was perceived by top managers as having a good safety record, based on the injury-rate indicator, which was improving. At the same time though, there was contradictory evidence on process safety, which was undergoing a severe degradation, as noted in the CSB report on the accident: “During this same period, loss of containment incidents, a process safety metric tracked but not managed by BP, increased 52 percent from 399 to 607 per year.”

Reporting and Learning Culture

A learning policy defines which types of events must be reported. Criteria are determined—some of them with the regulatory authority. Most of them are objective, but others rest on a subjective judgment of their relevance for reporting or for learning. Reporting depends on the voluntary input of people who work under time constraints. Both the formal and informal reporting channels are affected by the organization’s work climate and, more broadly, its culture. This can cause underreporting of safety-relevant events, create biases in the information provided, and affect the way in which people discuss safety concerns.

A number of major accidents have been preceded by warnings raised by people familiar with the respective systems who attempted, unsuccessfully, to alert actors who had the ability to prevent the danger they perceived. Very often, the dissenting opinions and whistleblowers were not heard due to cultures in which bad news was not welcome, criticism was frowned upon, or a “shoot the messenger” attitude prevailed. The DC-10 accident provides an example: Both the Applegate memorandum from the subcontractor and the warnings from the McDonnell Douglas trainers were ignored by the company’s management.

In addition, in the absence of psychological safety (Edmondson 1999), people will hesitate to speak up when they have questions or concerns related to safety. This can lead to the underreporting of incidents and to investigation reports of poor quality since people do not feel safe mentioning possible anomalies that may have contributed to the event. It can create poor analyses of underlying factors, as it is easier to point the finger at faulty equipment rather than a poor decision made by a unit manager.

In many workplace situations, people do not dare to raise their concerns. They prefer to be silent and to withhold their ideas as well as concerns about procedures and processes. They have developed “self-protective implicit voice theories”—that is, self-censorship. This is usually based on taken-for-granted beliefs about speaking up at work and whether it is accepted or not—beliefs that they have internalized as a result of their interactions with authority over the years (Detert and Edmondson 2011).

One year before the 2005 Texas City Refinery accident, a $30 million accident occurred. Investigators found that “[t]he incentives used in this workplace may encourage hiding mistakes. […] We work under pressures that lead us to miss or ignore early indicators of potential problems. […] Bad news is not encouraged” (U.S. Chemical Safety and Hazard Investigation Board 2007). As an additional indication of the direct effects of financial pressures, the PSM manager indicated that the closure rate of corrective actions had fallen since the incentive metric had been removed in 2003 from the formula used to calculate bonuses (U.S. Chemical Safety and Hazard Investigation Board 2007; Hopkins 2010).

Although it is known that to err is human, we still find it difficult to be confronted with mistakes, as they are mostly associated with shame and embarrassment. This explains why 88 percent of managers prefer to talk privately about their employees’ mistakes rather than have an open discussion (Hagen 2013). Learning from mistakes requires an open and just culture in which blame is considered counterproductive (Reason 1997; Dekker 2008).

In addition, some national cultures contain strong norms about voicing or receiving criticism. The same goes for suggestions for improvement, which can be seen as implicit criticism of the people who created a system or structure. In this respect, Fukushima comes to mind (Diet 2012).

The Regulators

Control authorities react both to mandatory reportable events and to the findings of their inspections (scheduled and unscheduled). The way they react to reportable events and safety alerts and how they prioritize them are key aspects of their role. Organizational learning is mostly thought of as a component of risk management, but it is also a component of risk governance.

At the Davis-Besse Nuclear Power Plant, plant managers relied on the resident inspector to take corrective actions. This lack of a proactive attitude, which nearly led to a severe accident, was a key factor.

Some regulatory authorities promote the wrong safety models; for instance, some regulators use indicators of occupational safety to determine inspection priorities for hazardous plants. At Texas City, the US Occupational Safety and Health Administration decided, on the basis of these indicators, to focus on the construction work that had led to injuries and fatalities rather than inspect the refineries. More generally, norms and regulations may lead some actors to play the compliance game, as found in bureaucratic quality approaches, but compliance is not safety. Inspectors do not have sufficient time to go deeper than what is formalized and shown to them.

Another issue that shows up at the regulatory level in the accidents outlined above is organizational complexity, which hampers the learning process. It has to do with the inter-organizational dimension that arises when several operators or several countries are involved.

A particular task of control authorities is to make sure that operators learn from each other, not just to check whether internal learning is good enough. In fact, there are many institutional and cultural obstacles to the sharing of information and lessons between sites, between firms in the same industry, and (even more so) between industry sectors. The pathology is self-centeredness (ESReDA 2015). Several factors contribute to it, the major ones being geographic dispersion, which fragments the safety alerts reported to the control authority, and the way integration processes are handled. Among our case studies, Therac 25, Davis-Besse, and the DC-10 are good examples.

Finally, complacency among the control authorities is often a root cause of accidents. The Davis-Besse case is particularly intriguing, as the US NRC had all the evidence needed to conclude that the level of risk was high and that it should require an immediate shutdown. At the time, the Davis-Besse near miss had a strong negative impact on the credibility of the NRC. Similarly, the US FAA did not react after the first incidents involving failures of the DC-10 cargo door locks and was found to have been complacent.

Is Safety Really First?

Given the accident case study approach we have chosen, the answer to this question is no. Although there are likely many situations in system design and daily operations in which engineers and managers prioritize safety over production/cost goals (Rousseau 2008; Hayes 2015), a lot of evidence in accident cases suggests that safety is not receiving sufficient attention. Let us look at some patterns related to the “safety first” motto and the extent to which it is implemented.

A first issue is the effect of conflicting messages. When management’s “front-stage” (Goffman 1959) slogans concerning safety and the reality of decisions or “back-stage” actions do not match, management messages lose their credibility. Langåker (2007) has analyzed the importance of compatibility between front-stage and back-stage messages for the effectiveness of organizational learning. In fact, mottos such as “safety first” can be counterproductive, as only a few people really believe in them. In reality, production comes first. Safety induces direct costs, which influence short-term economic results. It should be remembered, though, that in high-risk industries, non-safety can eventually induce higher costs. Trevor Kletz used to recall: “[T]here’s an old saying, if you think safety is expensive, try an accident. Accident[s] cost a lot of money. And not only in damage to plant and in claims for injury, but also for the loss of the company’s reputation.” At Texas City, BP faced a $1.5 billion accident, and it faced several billion dollars in claims after the Macondo blowout and the Gulf of Mexico oil spill in 2010.

Let us look again at the accident at BP’s Grangemouth refinery in Scotland. The UK HSE investigation showed an overemphasis on short-term costs and production, which led to unsafe compromises and long-term problems with, for example, plant reliability.

Only a few years after the accident, the US CSB found that BP managers—including the company’s top management—were not aware of, or had not understood, the lessons of Grangemouth. No changes were made to BP’s approach to safety.

Similar findings emerged later from the safety culture assessment conducted by the Telos consulting group weeks before the 2005 accident at Texas City: “The Business Unit Leader said that seeing the brutal facts so clearly defined was hard to digest, including the concern around the conflict between production and safety. ‘The evidence was strong and clear and I accept my responsibility for the results’” (U.S. Chemical Safety and Hazard Investigation Board 2007).

Surprisingly, when presenting the results to all plant supervisors on March 17, 2005 (a week before the major accident), the same business unit leader stated (U.S. Chemical Safety and Hazard Investigation Board 2007) that the site had gotten off to a good start in 2005, with safety performance (Footnote 3) that might “be the best ever.” He added that Texas City had had “the best profitability ever in its history last year,” with more than $1 billion in profit, “more than any other refinery in the BP system.” As the board chairperson of the US Chemical Safety Board concluded before the US House of Representatives, the levels of investment in and maintenance of infrastructure were too low, which explained the profitability and “left it vulnerable to a catastrophe” (Merritt 2007).

The 2005 Texas City Refinery explosion is a particularly pathologic example of the impact of production and financial pressures on safety, but the accidents of Therac 25, Davis-Besse, and the DC-10s also show that years of production pressure can defeat safety measures, and that warning signs are often ignored. Most of these severe problems were recognized by employees and disclosed by some. Managerial actions could have taken place long before the accidents.

These cases may look extreme. Indeed, in daily management, things are more complex and the boundaries are blurred, as technical, human, and organizational factors are levers of global performance, including production, quality, reliability, and safety. Also, some problems can be difficult to uncover, but even these can be found, investigated, and assessed (e.g., Rousseau 2008; Dechy et al. 2011b, 2016).

Performance pressures and individual adaptation push systems in the direction of failure and lead organizations to gradually reduce their safety margins and take on more risk. This migration (Rasmussen 1997; Rasmussen and Svedung 2000; Amalberti et al. 2006)—or drift into failure (Snook 2000), normalization of deviance (Vaughan 1996), and the associated erosion of safety margins—tends to be a slow process, during which multiple steps occur over an extended period of time. As these steps are usually small, they often go unnoticed, a “new normal” is repeatedly established, and no significant problems are noticed until it is too late.

At Davis-Besse, top managers relied too much on past successes to be able to consider that their decision making might have been inadequate. They did not distinguish between a reliable and a safe system (Llory and Dien 2006).

A common issue is the level of evidence of a threat to safety and when it demands action. Unfortunately, and as exemplified by the NASA space shuttle accidents of Challenger in 1986 (Vaughan 1996) and Columbia in 2003 (Columbia Accident Investigation Board 2003), the burden of proof is put on the “safety attorneys” who want to stop production, instead of requiring the “production attorneys” to provide evidence that everything is under control and “safe to fail.”

Conclusion

In order to study an activity as complex as learning safety lessons from critical incidents—which combines technical, human, organizational, and societal dimensions—we should look empirically at the detailed accounts provided, in particular, by accident investigators. This case study approach may seem tedious, but reports of crises and accidents shed light on performance features and deficiencies that were partly hidden in the “dark side” of organizations (Vaughan 1996) or hard for most, though not all, to see.

Some researchers (Llory 1996) even argue that accidents are the “royal road” (referring to Freud’s metaphor about dreams being the royal road to the unconscious) to discovering the real (mal)functioning of organizations. Other researchers refer to the “gift of failure” (Wilpert in Carroll and Fahlbruch 2011), because incidents offer an opportunity to learn about safe and unsafe operations, generate productive conversations across stakeholders, and bring about beneficial changes to technology, organizations, and mental models. This strategy is at least complementary to the study of normal operations advocated by high-reliability organization theory and resilience engineering (e.g., Bourrier 2011; Hollnagel et al. 2006).

The systematic study of more than a hundred accidents shows a disturbing pattern: The root causes of accidents recur independently of the industrial sector, the organizational culture, and the period in which the accident occurred. This empirical finding is important, as the recurrence opens the possibility of capitalizing on the lessons from accidents. These recurring root causes of industrial accidents have been analyzed and defined as pathogenic (organizational) factors (Reason 1997; Dien et al. 2004; Rousseau and Largier 2008; Llory and Montmayeul 2010).

We propose to develop the “knowledge and culture of accidents” and promote its transfer through a culture of learning, especially for organizational analysis and diagnosis of safety management (Dechy et al. 2010, 2016; Dien et al. 2012). In our view, the knowledge provided by the lessons of accidents is put to insufficient use. This process has just started and should receive more support from high-risk industries and regulators.

By discussing accident case studies and emphasizing common factors, we want to demonstrate how we can better use the lessons from accidents instead of failing to learn from them. We should remember Santayana’s (1905) warning: “Those who cannot remember the past are condemned to repeat it.”

One of the main questions organizations need to answer is: Does a given learning failure stem from a failure of reporting, or from a failure to analyze and to implement corrective measures? We have provided examples and a framework for monitoring learning deficiencies that should help managers to improve their learning processes. The framework shows that the problem can shift from blindness—linked to reporting and analysis deficiencies—to deafness or even denial when it comes to responding to warning signs, implementing corrective actions, and avoiding the “too little, too late” syndrome recalled by Merritt (2007).

Safety through lessons learned from incidents requires avoiding the cultivation of a bureaucratic approach (an office mentality and a strict adherence to procedures) and instead going beyond official rules and transcending boundaries. Should we become complacent, we should remind ourselves that, prior to Three Mile Island, a similar accident occurred in Switzerland, but there was no requirement to inform the American safety authorities about events occurring abroad. This task of drawing generic lessons has still not been fully achieved within industries, and even less so between industries.