1 Introduction

Patient safety relies on the interplay between humans, technology and organisations. Patient safety challenges in healthcare have been described previously, for example in anaesthesia (Cooper et al. 2002; Gaba 2000). The IOM report “To err is human” (Kohn et al. 1999) and the follow-up report “Crossing the quality chasm” (Corrigan 2005) made it apparent that even though there is increased focus on patient safety, the progress is far from impressive and that patient safety problems in healthcare seem to persist (Sevdalis et al. 2012).

Organisational changes and the use of technology have improved some safety indicators but the health professionals’ medical expertise, technical abilities and non-technical skills (NTS) play a major role in improving patient safety (Sevdalis et al. 2012). Traditionally, health professionals are trained in silos and are taught medical knowledge and technical skills, whereas NTS, such as communication, cooperation and leadership, are seldom taught. Yet, these skills are expected to be attained since they are indeed needed in the ill-structured world of clinical work (Rasmussen et al. 2012). NTS are important for safe and efficient teamwork, especially in emergency situations, where time is an important factor. Training of NTS in healthcare was first systematically introduced with the crisis resource management (CRM) training in order to address human factor issues and improve patient safety by building up team skills and improving individual cognitive abilities (Gaba et al. 1998). Training and assessment of NTS for individuals or teams in healthcare have been shown to improve safe and efficient performance in the operating room (OR) (Neily et al. 2010). Specific positive effects have been reported regarding communication, teamwork and technical performance. Healthcare professionals value such training highly; they learn and change behaviour (Andersen et al. 2010; Fuhrmann et al. 2009), and they apply what they learn in practice (Morey et al. 2002).

Although there is an evident need for NTS training in healthcare to improve patient safety, the implementation of training and assessment of NTS has been slow (Flin and Patey 2011). One reason for the poor implementation of NTS training might be the lack of a shared understanding of the underlying concepts. Much can be learned from high-risk organisations that are further ahead in working with such a shared understanding, but the culture, context and organisation of work differ between domains and organisations (Klampfer et al. 2001; Kontogiannis and Malakis 2013). Healthcare cannot simply apply the understanding from other domains (Glavin 2011; Nestel et al. 2011; Yule et al. 2009). Last but not least, there has been a tendency to view behavioural rating instruments as easy to use. Combined with financial constraints, this assumption has led to insufficient training of the raters. Consequently, the ability to rate and provide feedback to the learners might not be optimal (Sevdalis et al. 2012).

Behavioural rating instruments are one attempt to provide a conceptual framework for NTS and make the underlying topics more accessible for healthcare professionals in educational and clinical settings. Some of these instruments include overarching categories, with a number of elements and examples of observable behaviours. The instruments are typically defined based on job analysis studies using different methods, like interviews and observations with respective stakeholders. Thus, these are based on the definition of what is seen as the non-technical expertise of the profession for which the instrument was designed.

To our knowledge, there is no overview of the studies describing the construction of individual and team behavioural rating instruments in healthcare on a detailed level. Such an overview is needed in order to interpret the results of the instruments and use these in the best interest of patients, healthcare professionals and the organisations that exist to provide the framework for this care. Different stakeholders’ views might vary considerably so analysing the data basis for any instrument might help in identifying biases and in defining the scope of application.

The aim of this paper is to give an overview of the development of different behavioural rating instruments to assess NTS at individual and team level within different medical specialities and settings.

2 Methods

This paper is a critical review, identifying papers that represent certain types of instruments to assess healthcare professionals’ NTS at individual or team level. This review does not constitute an exhaustive list but provides an overview of different instruments for different professions and teams.

During our work with behavioural rating instruments to assess NTS (Jepsen et al. 2012; Lyk-Jensen et al. 2014; Spanager et al. 2012, 2013), we have gained insight into different aspects of these instruments for assessment of NTS. We used this knowledge to select instruments for the review. In addition, we searched online resources including PubMed Medline, EMBASE and Google Scholar for relevant research papers. Bibliographies from relevant research papers were consulted.

2.1 Selection of articles

Articles were selected if they fulfilled both of the following criteria:

  • The subjects of study were physicians and/or trainee physicians alone or in combination with nurses and/or operating department practitioners and/or midwives.

  • They described instruments assessing NTS or behaviours of individuals or teams in simulated or clinical healthcare settings.

2.2 Data extraction

Two authors (DO, RMHGJ) reviewed the articles to assess their eligibility based on the selection criteria. Supported by the third author (PD), they reviewed all of the selected articles and decided which to include in order to illustrate the variability in development methods and validation procedures.

A coding sheet was developed focusing on relevant parameters that described:

  • The development of the instruments; the purpose of the instrument; the profession(s) that the instrument was designed for; the other stakeholders involved in the development; psychologists; other instruments they were based on; and methods of data collection.

  • The scoring system and scales.

  • The validation process; the training of the raters; methods used in assessing reliability and validity.

3 Results

Tables 1 and 2 show an overview of the 23 included instruments for assessment of teams’ and individuals’ NTS and include 16 and 7 instruments, respectively. A large variety of behavioural rating instruments for different target groups in different specialities was found.

Table 1 Instruments for assessment of teams’ NTS
Table 2 Instruments for assessment of individuals’ NTS

3.1 Purpose of the instruments

The instruments have all been designed for observation of NTS in different situations, mainly in real time, from retrospective recall or from video recordings of operations or simulations. Of all instruments, 11 are designed for assessment in the OR; five of these are for OR teams (Healey et al. 2004; Hull et al. 2011; Mishra et al. 2009; Schraagen et al. 2010; Sevdalis et al. 2008; Undre et al. 2007), three for anaesthesiologists (Crossingham et al. 2012; Fletcher et al. 2003, 2004; Jepsen et al. 2012) and three for surgeons (Parker et al. 2013; Spanager et al. 2012; Yule et al. 2006, 2008). Four instruments are developed for assessment of NTS during resuscitations (Cooper et al. 2010; Plant et al. 2011; Thomas et al. 2004; Walker et al. 2011), two for obstetric teams (Guise et al. 2008; Morgan et al. 2012), two for trauma teams (Steinemann et al. 2012; Westli et al. 2010), two for teams in acute settings (Kim et al. 2006; Malec et al. 2007) and two for critical care teams (Lambden et al. 2013; Weller et al. 2011). Six out of seven instruments for assessment of individuals are designed for the OR.

3.2 Involvement of other stakeholders in the development besides target group and psychologists

Most researchers involved in developing the instruments have the same professional background as the target group of healthcare personnel involved, and most often psychologists are part of the team. The development teams usually consist of 3–6 persons. Table 2 shows that the research groups for the individual instruments for surgeons and anaesthesiologists, Non-Technical Skills for Surgeons (NOTSS), NOTSS customised for Danish surgeons (NOTSSdk) and Anaesthesiologists Non-Technical Skills customised for Danish anaesthesiologists (ANTSdk), have included health professionals from the different professions and specialties in the OR team. Two of the team tools in Table 1 state clearly that human factor experts were involved in the development process (Mishra et al. 2009; Schraagen et al. 2010).

3.3 Main sources of data which the instruments are based on

At least 19 of the instruments have been inspired by aviation instruments like Non-technical Skills system for assessing pilots’ CRM skills (NOTECHS) (Flin et al. 2005) and Line Operations Safety Audit (LOSA) (Klinect et al. 2003) or have been developed directly or indirectly on the basis of these instruments (Cooper et al. 2010; Fletcher et al. 2004; Guise et al. 2008; Healey et al. 2004; Jepsen et al. 2012; Lambden et al. 2013; Malec et al. 2007; Mishra et al. 2009; Plant et al. 2011; Schraagen et al. 2010; Sevdalis et al. 2008; Spanager et al. 2012; Steinemann et al. 2012; Thomas et al. 2004; Undre et al. 2007; Walker et al. 2011; Weller et al. 2011; Westli et al. 2010; Yule et al. 2006).

In Table 2, it can be seen that the Danish customised instruments for assessment of surgeons’ and anaesthesiologists’ NTS (Jepsen et al. 2012; Spanager et al. 2012) are developed on the basis of interviews with all members of the OR team; this is in contrast to the development of the original UK-developed instruments (Fletcher et al. 2004; Yule et al. 2006) which were developed on the basis of mono-disciplinary interviews. It is also in contrast to the other instruments for assessment of individuals in the OR (Crossingham et al. 2012; Parker et al. 2013) which were developed without involvement from other OR members than the observed physicians.

3.4 Structure

Most instruments consist of two or three levels with four to eight overarching categories/dimensions and underlying examples of skills or behaviours. The skills or behaviours can be rated after observation using a numerical scoring scale or a set of anchors, which are examples of different expressions of the NTS. The overarching categories comprise both cognitive and social skills. Examples of cognitive categories are ‘situation awareness’, ‘decision making’ and ‘empathy and sensitivity’. Examples of social categories are ‘communication’, ‘team work’, (shared) ‘leadership’, ‘task management’, ‘organisation’ and ‘working under pressure’. The same categories in different instruments can encompass different concepts and also overlap with other categories in other instruments. Four of the instruments also assess technical skills (Healey et al. 2004; Lambden et al. 2013; Schraagen et al. 2010; Undre et al. 2007).

3.5 Validation of instruments

Many of the instruments are based not only on aviation instruments but also on process mapping and cognitive task analysis. They have gone through multiple iterations and adaptations to work in healthcare. Most of the instruments are well validated for the setting that they are developed for; inter-rater/observer reliability has been tested for 17 instruments, internal consistency (nine instruments), test–retest reliability and generalisability coefficient (three instruments), construct validity (eight instruments), construct validity (six instruments), face validity (five instruments), concurrent validity (four instruments) and convergent validity (two instruments). Examples of the adaptation of instruments to other settings and cultures are also seen (Jepsen et al. 2012; Lambden et al. 2013; Spanager et al. 2012; Undre et al. 2007).
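
As an illustration of what an inter-rater reliability check can look like, the following minimal sketch computes Cohen's kappa for two raters who scored the same observed episodes. Cohen's kappa is only one common agreement statistic and is used here as an assumption; the reviewed studies may have used other measures. The function name, the rating scale and the example scores are hypothetical and are not taken from any of the instruments.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters (Cohen's kappa).

    rater_a, rater_b: equally long sequences of categorical scores,
    one score per observed episode (hypothetical data).
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must score the same, non-empty set of episodes")

    n = len(rater_a)
    # Proportion of episodes on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal score frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one and the same single category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical scores on a four-point scale for six observed episodes.
scores_rater_1 = [4, 3, 3, 2, 4, 1]
scores_rater_2 = [4, 3, 2, 2, 4, 1]
kappa = cohens_kappa(scores_rater_1, scores_rater_2)
```

A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance. Because kappa treats the scores as unordered categories, a weighted variant or an intraclass correlation may be more suitable for ordinal rating scales.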

4 Discussion

In this review, the development of 23 instruments is presented illustrating different instruments used to assess NTS in different healthcare settings. Many of the teams for which these instruments were developed handle emergencies. Some of the research groups have looked beyond their own speciality and included other members in the development process (Jepsen et al. 2012; Mishra et al. 2009; Schraagen et al. 2010; Spanager et al. 2012; Yule et al. 2006). Most instruments are based on experiences learned in aviation, but are now well validated and adapted to different healthcare settings. All the instruments consist of very similar categories of NTS, with few exceptions. Similar categories have been found in many high-risk organisations, including different medical specialities (Flin et al. 2005).

In this discussion, we reflect upon the development of the instruments described and the way they are used.

4.1 The use of behavioural rating instruments

The behavioural rating instruments for individuals are primarily developed for assessment of competence in the clinical setting in order to structure learning and illustrate a development over time. The instruments can be used for assessment several times during specialist training. The ability of raters to rate and provide feedback to the learner is vital. The risk that the assessment of NTS seems deceptively simple has been discussed before (Flin and Patey 2011; Schraagen et al. 2010; Sevdalis et al. 2012), but it is still important to emphasise a focus on the training of raters. Thus, the implementation of the instruments would depend on helping people to acquire the ability to use them. This aspect, however important, was not part of our review.

One possible advantage of the NTS behavioural rating instruments on a higher level could be that they allow for aligning the definition of educational aims and objectives, the selection of contents and methods to deliver them and the analysis of the progress in the learning.

The behavioural rating instruments for teams of health professionals are mainly used to identify weaknesses (learning objectives) and to evaluate the effect of training by comparing pre- and post-values, ideally repeatedly over a longer period of time. This implies that the instruments are mainly used for formative assessment of the team. A summative assessment of teams of health professionals seldom makes sense, since the team might seldom work in the same constellation again.

Our basic view is that the behavioural rating instruments described above should be integrated into the actual clinical environment. Whether they are used in a formative or summative way, on an individual basis or at team level, the instruments are used to record the raters’ observations and ratings. What actually happens with the recorded values and how they are interpreted is important for the usefulness of the instruments in the larger context of healthcare, that is, for patients, their relatives and those involved in their care.

The instruments should be used in a way that is intended by their designers, but they may also be used against this intention or without considering this intention. An instrument that was designed for repeated formative feedback during training might be used in a research setting to evaluate the effect of training. While it might not have the best psychometric qualities, it will still produce results of some value.

All assessment situations should be followed by feedback to the learner. There might be differences in how helpful the feedback conversation is, depending on skills and attitudes of the rater as feedback provider, based on training, personal preferences etc. (Sevdalis et al. 2012). All these surrounding factors of integrating the instruments into context are not inherent in the instruments themselves, but need to be learned by the one using the instruments. The description of the context in which an instrument is to be used, the scope of application and warnings against using them outside of this scope are thus important (Flin and Patey 2011).

In any case, each measurement provides a snapshot only. Combining several measurements over time will increase the validity of the result. Ratings performed over time, involving different raters and learners and thus combining the views of different people about different people, will help to overcome biases and to analyse the “normal variation” amongst those measured and those measuring. The results from the different ratings would need to be integrated in performance assessment portfolios to allow collection of data over time.
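
A minimal sketch of such a portfolio summary is shown below. It assumes that each rating is stored as a record of learner, rater, occasion and score; this record layout, the function name and the example values are hypothetical. The sketch only pools scores per learner and reports their mean and spread; it is not a validated scoring procedure.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarise_portfolio(records):
    """Summarise repeated ratings per learner.

    records: iterable of (learner, rater, occasion, score) tuples
    (a hypothetical layout for a performance assessment portfolio).
    Returns, per learner, the number of ratings, their mean and their
    spread (population standard deviation) across raters and occasions.
    """
    scores_by_learner = defaultdict(list)
    for learner, _rater, _occasion, score in records:
        scores_by_learner[learner].append(score)

    return {
        learner: {
            "n": len(scores),
            "mean": mean(scores),
            "spread": pstdev(scores),
        }
        for learner, scores in scores_by_learner.items()
    }


# Hypothetical ratings collected on different occasions by different raters.
portfolio = [
    ("learner_A", "rater_1", "2014-01", 2),
    ("learner_A", "rater_2", "2014-03", 4),
    ("learner_B", "rater_1", "2014-01", 3),
]
summary = summarise_portfolio(portfolio)
```

Reporting the spread next to the mean keeps the variation between raters and occasions visible instead of hiding it behind a single value, and a single rating, as for the second learner here, remains recognisable as a snapshot.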

Considerable skill is required to make observations and ratings and to provide constructive feedback to those being rated. The amount of training provided to the raters varies considerably in the papers included in this review. In addition, the methods used for training vary. In recent years, training has included rating of simulation-based scenarios, but there are no standards for how to do this. It might, however, be necessary to provide a more focused description and to demand, even more clearly, that raters are trained for the whole process in which the instruments are used before they are allowed to perform ratings. Training that merely focuses on understanding the development and the dimensions of the instrument might simply fall short when it comes to providing useful feedback to learners (Sevdalis et al. 2012). The feedback provided based on the values achieved is what helps learners develop.

Another aspect worth considering is the evident lack of a gold standard for ratings of NTS (Flin and Patey 2011; Graham et al. 2010; Malec et al. 2007; Schraagen et al. 2010), which further challenges rater training and implementation. However, one might question whether there can ever be a gold standard for NTS in healthcare, as the patient safety culture develops continuously. One challenge for progress is the complex connection between NTS and technical skills in healthcare.

In the future, the instruments can be used in the simulated setting and can help in structuring debriefing and learning. The instruments might also be used for summative or “high stakes” assessments. In these situations, the training of the raters is extremely important. It is important to use the instruments in a relevant way, in realistic conditions or scenarios in which the behaviour is observed.

4.2 Criteria and their definition and ratings

The instruments aim to be based on observable behaviours, which form the criteria on which the assessment is based. The numeric values assigned to the criteria reflect how the rater mapped her impression onto the scales provided. Much is written about the mapping process and its biases. Raters may be too strict or too lenient, fail to look behind the halo of specific episodes, disagree on what they saw or how they should evaluate it, or miss points completely (Bested et al. 2011). There seems to be less reflection about the definition of the criteria, or in other words, the basis on which they would be called valid or not.

It has been shown that what is seen as an (in-)competent healthcare professional changes over time (Hodges 2006). Requiring such a thing as NTS is rather new, stimulated by the 2000 IOM report and the pioneering work in patient safety and simulation (Cooper et al. 2002; Gaba and DeAnda 1988; Gaba et al. 1987). The underlying skills might have been discussed for much longer, but for example, the instruments discussed in this paper help in creating terms that can be used to describe phenomena in a more consistent way. Certain problems can now be termed ‘a loss of situational awareness’, others ‘breakdown in communication’, where formerly there might have been just ‘problems’. Helping in making NTS a relevant part of the professional roles of current healthcare professionals, stimulating the discussion, and providing terms for it is a key achievement of these and other instruments. The evidence that the NTS do have an influence on the technical performance has been collected (Lingard et al. 2002; Manser 2009).

Yet, it might not be the final version of the criteria; there might not even be a final version in the ever-changing flow of thinking about healthcare professionals. The current versions are one view on various professional roles and their non-technical parts. As the patient safety culture in healthcare organisations develops (improves), the criteria on which the behavioural markers are developed will change. This has been seen in other high-reliability organisations. Therefore, it remains to be seen how much more refinement there will be over time.

Besides changes over time, these criteria are also subject to changes based on the frame of reference from which they are seen. One might stress health models that look beyond the absence of illness and consider a salutogenic model (Antonovsky 1996). For patients, the experience in the healthcare system might still not be “good” despite their outcome being recorded as “good”, e.g. all organs functioning within reasonable limits in a highly depressive patient. The underlying question is who should actually be part of forming the criteria against which both non-technical and technical performance should be evaluated. Who are the stakeholders who should be heard? How many stakeholders can be heard and in what way, so that development processes are not stalled in overwhelming complexity? Neither the instruments reviewed here nor we have an answer to these questions. Yet, one of the reflection points for their use, for new developments and further refinements should be based on this question. Can anaesthesiologists (alone) provide the criteria against which anaesthesiologists’ work is evaluated in a non-technical sense? Is it enough if psychologists help in refining those? Our group and others have extended the definition of the criteria by asking other healthcare professionals as well. Is this enough? How about the allied health professionals and, last but not least, the patients and their relatives? The current instruments do reflect a starting point and that is laudable. We think, however, that all reports about the instruments should (a) describe in more detail who was involved in defining the criteria contained and (b) state which limitations naturally follow from selecting one frame of reference.

Further, one might need to consider cultural differences, be they national, organisational or departmental cultures, maybe even team cultures (Klampfer et al. 2001). In principle, these should play a modifying part when mapping the observations onto the scale values.

The instruments cannot just be transferred from one context to another (Flin and Patey 2011). We should evaluate how well a specific observable behaviour supported the element and category to which it belongs in this specific team and context. In the logic of reliability and validity testing, all these context influences are hoped to average out to zero, as they are thought to be random errors. For an individual, however, this might mean a lot, possibly high-stakes decisions based on observations. In a training context, any feedback provider would need to be trained to consider those issues. In the research context, it might be necessary to enlarge the description of the research context and to describe how it was accounted for. When adapting NOTSS to NOTSSdk and ANTS to ANTSdk, differences were found not only on the level of the behavioural markers, but also on the level of elements (Jepsen et al. 2012; Spanager et al. 2012). This might reflect a difference in the methods used, but also differences in cultures between Scotland and Denmark. Over time, it might be necessary to adapt the rating instruments to the local context while still balancing the need for criterion-based discussion of healthcare.

4.3 Ratings on scales versus ability of people

One challenge that underlies each assessment instrument is that its values might be taken for the real thing. It is important to note that each of the instruments contains constructs that aim to describe complex socio-technical systems. They do so by reducing complexity, by levelling fine-grained differences and by forcing raters into perceiving and thinking in a standardised way, which could be perceived as the downside of bringing the topics into the discussion by using standardised terms. Bateson described a similar line of thought with his distinction between an actual territory and a map depicting this territory and pointed out that we should not mistake the map for the territory (Bateson and Bateson 2000). A worst-case scenario could be that the values on the assessment instrument replace the real thing, the professional capabilities of a human being. There are elements that these instruments have difficulty in capturing, such as sense of responsibility and thinking processes, in short all that is inside the head of the learner.

5 Conclusion

In this review, we discussed the development and the use of 23 behavioural rating instruments to assess NTS at the individual and the team level. There are several caveats to be aware of when using the instruments. There is a need for increased knowledge of how to validate these instruments, how to train the raters and how to continuously refine these instruments in order to help health professionals develop their NTS. Overall, we recommend the continuous development and implementation of these instruments in healthcare to increase the awareness of the importance of human factors and to facilitate the training and assessment of NTS and the quality of the feedback provided to health professionals, with the long-term goal of improving patient safety.