Introduction

With advances in medical technology, teaching in medicine is often shifting from the traditional apprentice-based model to more active engagement through proficiency-based training [1]. Often, this training includes simulated situations of short duration that allow for immediate feedback, reflection, and corrective actions. The evaluation of performance plays a key role in adult learning, facilitating reflection and refinement of behavior [24].

Levine et al. suggest that simulation is important for real-time assessment of physician competency in the modern age of medicine [5]. Indeed, over the past two decades, the role of simulation has grown in certification and credentialing [6]. Evidence suggests that simulation scenarios perform as well as OSCE (Objective Structured Clinical Evaluation) or observation of actual patient encounters [7, 8].

Unfortunately, it is all too common in medicine for evaluation tools to be used without first proving their reliability and validity. In “The Metric of Medical Education,”, Downing states that, like all scientific data, assessment data must be reliable to be meaningfully interpreted [9]. In addition, he describes validity as the “sine qua non of assessment,” and cites that educational assessment tools have little intrinsic value without evidence of validity [10]. Thus, the burden of proof of reliability and validity of any evaluation instrument clearly rests on the examiner. The adoption of objective and validated assessment tools is paramount to ensure the correct measurement of achieved competencies.

The authors were unable to find any published Italian language simulation evaluation tools that have been documented to be reliable and valid. The English language Ottawa Crisis Resource Management Global Rating Scale (Ottawa GRS) has been thoroughly assessed for validity and reliability, and appears to contain many features that suggest an ideal evaluation tool. The overall performance field of the Ottawa GRS is rated on a 1–7 semi-anchored scale, where 1 is “Novice” and 7 is “Clearly Superior.”. The scale includes anchored text at the 1, 3, 5, and 7 values that describe in more detail the requirements needed to reach each score. In addition, the Ottawa GRS includes five specific domains: leadership skills, problem-solving skills, situational awareness skills, resource utilization skills, and communication skills. The five domains use the same 1–7 scale, but with anchors at the 1, 3, 5, and 7 positions clearly describing the details in each domain to reach each score. The authors have investigated the inter-rater reliability and construct validity in a very detailed study. In this study, intraclass reliability was found to be 0.590–0.613 for the overall score [11]. In addition, the inter-rater reliability of the problem-solving, leadership, and situational awareness fields were 0.474–0.626 [11]. The fields of resource utilization and communication showed lower intraclass coefficients ranging from 0.236 to 0.384 [11].

Translation of the Ottawa GRS into Italian may offer a solution. However, translation of an assessment instrument can be problematic. Translation of a reliable and valid English language instrument into Italian—no matter how carefully the translation is performed—is no guarantee of producing a reliable and valid Italian language tool. The subtleties of language can be extremely important, particularly when evaluation involves the very nebulous principle of assigning an objective measurement to an observed human performance. Unfortunately, direct assessment of reliability and validity of a new grading instrument can be very difficult and time consuming. Fortunately, however, comparison of an Italian translation to a proven English language tool offers a much simpler method: if the Italian language tool is reliable compared to the English language tool, then the Italian language tool can be expected to have a reliability and validity comparable to the English language tool.

The primary objective of the current study was to assess the reliability of an Italian language Ottawa GRS translation by comparing it to the English language GRS. The null hypothesis of no correlation between the Italian language GRS and English language GRS scores was tested against the two-sided alternative hypothesis of significant correlation between the scores in different languages.

A secondary objective of the study was to assess the validity of the Italian language GRS by comparing it to a skills score. The null hypothesis of no correlation between the Italian language GRS and the skills score was tested against the two-sided alternative hypothesis of significant correlation between the two scores.

Methods

The overall study design was a measure of the strength of association between the Italian language GRS compared to both the English GRS and the skills score. This association was measured by applying each of three scoring tools concurrently by separate examiners to simulation scenarios.

As all data were identified and reported in aggregate, the study was deemed exempt from formal institutional review by the Ethics Committee of the Universita’ del Piemonte Orientale where all simulations took place.

To ensure face validity of the Italian language tool, the translation of the original GRS from English to Italian was performed through an iterative group process involving all authors. All authors are fluent in both English and Italian: three are native first language Italian (MV, LC, PLI) and one is native first language English (JMF). Appendix 1 shows the final tool produced by the iterative process.

The English language GRS tool was used exactly as originally published with no modifications.

A technical skills scoring sheet was also developed for each simulation scenario. These scoring sheets were simple checklists scored as 0 (no), 2 (yes), or 1 (yes, but incomplete). A sample technical skills scoring sheet is shown in Appendix 2.

The overall structure of the simulations was a convenience sample of a previously scheduled simulation session in the SimWar format as detailed by Okuda et al., where simulation participants work as a team, and teams advance to the next round of competition based on judgment using pre-determined scoring tools [12]. Study participants were recruited through mailing lists to reach residents and program directors throughout Italy. Participants registered on a voluntary basis in teams of four. Each team rotated through seven initial scenarios. Five of these scenarios involved simultaneous measurement of English and Italian GRS and were included in the study. This included three high-fidelity scenarios: neonatal (birth asphyxia), adult (acute pulmonary edema), and obstetric (eclampsia). In addition, there was a single medium fidelity station of adult cardiac life support (cardiac arrest due to hyperkalemia). A virtual reality station scenario of triage of ten casualties following a motor vehicle collision was also included using the XVR simulator (E-semble, Delft, The Netherlands). The four highest ranked teams following the initial round of scenarios advanced to the semi-final high-fidelity simulation of traumatic brain injury management. The two highest ranking teams from the semi-final proceeded to the final scenario of cardiac arrest with peri-mortem c-section and neonatal resuscitation. Simulations were conducted in Italian in all cases.

During the simulation scenarios, teams were independently evaluated by three examiners using three separate tools: (1) Italian GRS, (2) original (English) GRS, and (3) skills score. All raters were faculty at the Universita’ del Piemonte Orientale, and none were familiar with the GRS tool. Each rater was given a single tool for the entire study. Raters of the GRS in each language were not given access to the GRS of the other language. Raters using the Italian GRS were all first language Italian speakers. Raters using the English language GRS tool were all fluent in both Italian and English. Although for each scenario the three raters were in the same room, they were asked to complete their evaluations independently without consulting one another or discussing the content of their tool. The simulation raters did not change as the teams rotated through each scenario, so that in the end for each scenario all teams were rated by the same raters. Raters completed their ratings on paper that was then collected after each scenario.

Data were entered into a MySQL (Oracle, Redwood, California, USA) database by one of the study authors (MV). Statistical analysis was performed using “R: A language and environment for statistical computing.” (R Development Core Team, Vienna, Austria).

To assess the reliability of the Italian GRS, the results were compared to the English GRS using the Pearson–product correlation [13]. To assess the validity of the Italian GRS, the results were compared to the skill score, again using the Pearson–product correlation.

As suggested by Litwin, in ‘How to Measure Survey Reliability and Validity’, correlation coefficients greater than 0.7 were considered adequate [14]. Two-sided alternative hypotheses were used in all cases.

In total, seven statistical tests were performed. To account for multiple testing, p values were adjusted using the Holm method, and confidence intervals were adjusted using the Bonferroni method to ensure that a family-wise error rate of 0.05 was maintained [15]. The statistical hypotheses and methodology were completely specified prior to any data collection, and no modification was permitted after data collection. Post hoc exploratory data analysis was performed for hypothesis generation only and clearly identified as such. Thus, no confidence intervals or p values were obtained during the post hoc analysis.

To hold to the highest standards for reproducibility as described by Peng and to encourage further research in this area, both the raw data and the R code for the statistical analysis are available for download [16, 17].

Results

Twenty-eight residents took part in the study: seven teams of four residents each. Residents came from eight different Italian universities (Bari, Chieti, Genova, L’Aquila, Messina, Novara, Padova, and Sassari). Median resident age was 30 years and median postgraduate training years was three. Residents came from a variety of training programs including anesthesia (16 residents), emergency medicine (7), cardiology (2), internal medicine (1), pulmonary (1), and geriatrics (1).

In total, 41 simulations were evaluated: seven teams completed each of the simulations in the initial round, four teams participated in the semi-final, and two teams participated in the final. There were 123 completed scoring observations: three for each simulation. In total, 19 raters participated: 12 raters completed GRS scores (6 in Italian and 6 in English), while 7 raters performed skill scores. As not all teams completed the same number of simulations, and not all raters performed the same number of ratings, the final study design was not strictly orthogonal [18].

The correlation coefficient between the Italian language overall GRS and the English language overall GRS was 0.82 (adjusted 95 % confidence interval 0.62–0.92) as shown in Fig. 1. The adjusted p value for significant correlation was less than 0.000001. The figure also shows that the range of assigned values was distributed well across all values from 1 to 6, and that correlation appears to hold for values throughout the range. Correlations between the other five domains of the Italian GRS and the English GRS are seen in Table 1.

Fig. 1
figure 1

Correlation between Italian and English overall global rating scale (GRS)

Table 1 Correlation between Italian and English global rating scale (GRS)

The correlation coefficient between the Italian language overall GRS and the skill score was 0.85 (adjusted 95 % confidence interval 0.68–0.94) as shown in Fig. 2. The adjusted p value for significant correlation was <0.000001. As the total number of questions on each technical skills sheet was not the same, the sheets were scored as a percentage of the maximum total marks. The figure also shows that values on the skill score were also distributed quite evenly across the range from low to high and that correlation tends to hold for values throughout the range.

Fig. 2
figure 2

Correlations between Italian overall global rating scale (GRS) and skill score

Chronbach’s alpha for the Italian language GRS was 0.97, although this calculation was not part of the originally planned study analysis.

Discussion

For the primary study objective of assessing the reliability of the Italian language GRS in comparison to the English language GRS, the null hypothesis of no correlation was rejected. Thus, the study demonstrates that the Italian version of the Ottawa GRS is sufficiently reliable in comparison to the English language tool. Given the lack of current Italian language gold standard, this represents an important achievement.

Interpretation of the value of the correlation coefficient is somewhat problematic. As there was only one rater in each language, it is impossible to determine if differences between the scoring of the two raters is due to differences in the tool (different language), or due to normal inter-rater differences. While the estimates of correlation between the Italian language and English language varied in this study from 0.71 to 0.82, the English language GRS alone was found to have intraclass correlation coefficients between 0.24 and 0.62 [11]. Notably, although the authors stated that the Ottawa GRS had “acceptable inter-rater reliability,” in none of the evaluation fields was the intraclass correlation greater than 0.7 [11]. Thus, this study suggests that correlation between the Italian language and English language GRS is similar or better than the correlation between two English language observers, suggesting that the Italian language GRS is at least as reliable as the English language tool. The adoption of objective and validated assessment tools is paramount for correct measurement of performance in any medical field. Downing (2004) suggests that evaluation instruments should have a reliability of >0.9 for “high-stake” evaluations such as licensing, 0.8–0.89 for summative examinations such as end of year evaluations, and 0.7–0.79 for informal formative classroom-type examinations. Overall, it appears that the Italian language GRS has acceptable reliability which is equal to the English language Ottawa GRS [9].

For the secondary study objective of assessing validity of the Italian language GRS, the null hypothesis of no correlation between the Italian language GRS and the skill scores was rejected. Thus, it appears that there is significant correlation between the two. The point estimate of correlation (0.85) suggests good correlation, although the lower limit of the confidence interval (0.68) is slightly below the desired lower limit of 0.7. Behavioral performance and adequacy of actions taken from a technical perspective have been studied in detail in the past decade [1921]. The findings of this study are consistent with those of Riem et al., who showed that residents who demonstrated good performance in technical skills also showed good performance when exposed to an inter-operative crisis scenario [22]. Brunckhorst et al. demonstrated a similar correlation between technical and non-technical skills [23]. Overall, the study suggests that the Italian language GRS tool has adequate reliability for assessing simulation performance.

Post hoc analysis of Chronbach’s alpha, which should be used for hypothesis generation only as it was not initially part of the planned study analysis, was high (0.97). This suggests that the Italian language GRS may meet the criteria for excellent reliability in the form of internal consistency, adequate for even high-stake examinations [9]. These post hoc analyses should be of course confirmed in dedicated studies.

Unfortunately, there are several limits to the present study. Firstly, although the null hypotheses were rejected, and the point estimates for all correlation coefficients were greater than 0.7, the confidence intervals were very wide, and in all cases the lower limit of the estimated correlation was below 0.7. As the p values were small in all cases, the study suggests that although significant correlation exists, the present study may have been too small to be entirely confident that the correlation coefficient was above 0.7. Clearly, further research would be needed. In addition, the study was performed in the context of team simulations, where the raters graded the simulation team as a whole. It is possible that there is additional variability introduced into the study, as different team members likely had different levels of performance. Repeating the study with evaluation of individuals rather than teams may be insightful to further categorize this variability. Finally, although the study shows that the Italian GRS tool has good reliability in comparison to the English language tool—and should share similar validity and reliability—further studies to measure directly the reliability and validity of the Italian tool are clearly required before use in high-stake situations.

Conclusion

The present study suggests that the Italian language GRS translation has reasonable reliability when compared with the English language GRS, and reasonable validity when compared with the assessment of the skills scores. Data suggest that the instrument is adequately reliable for informal and formative type of examinations such as classroom assessment, and perhaps end of year examinations, but may require further confirmation before used for high-stakes examinations such as licensing.