Introduction

Central venous catheterization (CVC) is a commonly performed bedside medical procedure. Competency in this procedure is an explicit objective for a number of postgraduate training programs, including emergency medicine, internal medicine, critical care medicine, and general surgery (ACGME 2007; RCPSC 2003, 2005, 2008, 2010; Joint Royal Colleges of Physicians Training Board 2009). The Accreditation Council for Graduate Medical Education (ACGME) recommends the use of simulation and checklists as the “most desirable” evaluation methods for the assessment of competency in procedural skills (ACGME 2000). The use of a global rating scale, in contrast, is listed only as a “potentially applicable method”. Perhaps related in part to these recommendations, procedural checklists for CVC have become commonly used (Barsuk et al. 2009a; Dong et al. 2010; Evans and Dodge 2010; Velmahos et al. 2004). Indeed, in our systematic review of twenty studies examining the use of simulation-based education for CVC (Ma et al. 2011), only two used global rating scales for the evaluation of procedural performances (Lee et al. 2009; Millington et al. 2009).

Despite the frequency with which checklists are used to evaluate CVC skills, the rationale for recommending their use is unclear. A common assumption is that checklists are more objective and therefore yield more reliable ratings than global rating scales (Cohen et al. 1996). This assumption has been challenged previously (Norman et al. 1991; Van Der Vleuten et al. 1991), and the use of checklists has not been shown to consistently improve reliability (Cohen et al. 1996; Van Der Vleuten et al. 1991). Furthermore, ratings by experts using global rating scales can outperform checklists on measures of both reliability and validity (Hodges and McIlroy 2003; Regehr et al. 1998). Moreover, in objective structured clinical examinations (OSCEs) of clinical skills, checklists, unlike global rating scales, have shown low sensitivity to increasing levels of expertise (Hodges et al. 1999; Hodges and McIlroy 2003). It has been postulated that the use of checklists risks trivializing steps by rewarding thoroughness rather than clinical competence (Cunnington et al. 1996; Norman et al. 1991). Therefore, rather than automatically adopting objectified methods of assessment such as checklists (Van Der Vleuten et al. 1991), the choice of assessment tools should be based on the best available evidence (Norman 2005).

The use of subjective expert judgments in defining competency is not unprecedented. For example, pass/fail scores for the national high-stakes OSCE for the Licentiate of the Medical Council of Canada (LMCC) are based on experts’ overall judgments of global performance (Dauphinee et al. 1997). Borderline checklist scores are then calculated from the performances of candidates whose overall global performance was judged to be borderline. Expert judgment using global rating scales has also been used previously in the assessment of surgical skills (Reznick et al. 1997).

To explore the use of a global rating scale in the assessment of bedside CVC skills, this study seeks to compare its use with two checklists, within the context of a formative examination using simulation. To do so, we first explored the dimensions captured by our constructed global rating scale. We then evaluated the correlations of scores obtained among the different tools. Lastly, we evaluated the diagnostic performance of checklist scores in identifying competence, based on expert physician global judgment of candidates’ performances.

Method

Participants

During the 2008–2009 academic year, all first-year internal medicine residents at the University of British Columbia who provided written informed consent were included in the study. The study was approved by our university ethics review board.

Participants were enrolled in a 2-h simulator training session on CVC. Details of this curriculum, involving a different cohort of participants, have been described previously (Millington et al. 2009). At the end of the simulator training session, participants underwent a formative examination using simulators. The examination consisted of performing an internal jugular CVC on a simulator (Laerdal IV Torso; Laerdal Medical Corp, Wappingers Falls, New York) using a standard kit provided. Participants were instructed to perform the CVC as they would in real life, without externally imposed time limits. Feedback about examination performance was given only at the end of the procedure. Participants who failed the formative examination were asked to enroll in additional practice sessions. Each examination performance was video-recorded in a blinded fashion, with no personal identifying information recorded.

Evaluation tools

Three tools were used for this study. The global rating scale is an eight-item scale with an additional ninth summary item on “overall ability to perform procedure” (“Appendix 1”). The eight items were adapted from two validated global rating scales: the Direct Observation of Procedural Skills (DOPS; The Foundation Programme 2009) and the Objective Structured Assessment of Technical Skills (OSATS; Reznick et al. 1997). Items on the original scales not applicable to our simulator examination were removed. The rating scale was piloted, modified, and finalized by group consensus. The summary item was a 6-point Likert scale with descriptive anchors ranging from 1 = “not competent to perform independently” to 6 = “above average competence to perform independently”. We dichotomized the scores for this item such that a score of three or more was considered “competent to perform the procedure”, while a score of two or less was considered “not competent to perform the procedure”.

Two checklists were used for this study. The first checklist (“Appendix 2”) consists of ten items, adapted from a previously published checklist (Velmahos et al. 2004).

The second checklist (“Appendix 3”) consists of 21 items, adapted from a previously validated twenty-seven item checklist (Barsuk et al. 2009b). This checklist was published after the completion of our initial simulation assessments and was incorporated into the study once it became available. Items on both original checklists deemed not applicable to our simulator examination were removed. The overall score for each checklist was calculated as the number of completed items divided by the total number of items, expressed as a percentage.
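As a concrete illustration of this scoring rule, the short Python sketch below computes an overall checklist score from per-item completion flags; it is our own illustrative code (the function name and example data are hypothetical) and was not part of the study protocol.

```python
def checklist_score(items_completed):
    """Overall checklist score: completed items / total items, as a percentage.

    `items_completed` holds one boolean per checklist item (True = completed).
    Illustrative sketch only; not the scoring software used in the study.
    """
    return 100.0 * sum(items_completed) / len(items_completed)

# Example: 8 of 10 items completed on the ten-item checklist -> 80.0%
print(checklist_score([True] * 8 + [False] * 2))
```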

Rather than assuming the validity of our modified assessment tools, we re-examined the content validity of the final global rating scale and checklists through input from an expert panel consisting of one nephrologist, two internists, one intensivist, and one general surgeon. Consensus was reached on the final items.

Video performance evaluation

All video-recorded performances were evaluated by two independent trained evaluators, both faculty members with experience in simulator teaching. Evaluators were trained for 3 h on the use of each assessment tool by reviewing four videos recorded specifically for training purposes. After training, the intraclass correlation coefficients of the evaluators were >0.80. For each evaluator, 50% of the videos were rated first using the ten-item checklist, while the remaining 50% were rated first using the global rating scale.

To assess the extent to which one tool may have systematically influenced ratings on a subsequent tool, the same raters re-analyzed each video approximately 2 years after the completion of the study. During this re-analysis, all assessment tools that had initially been rated after completion of another tool were re-rated independently. This allowed us to compare the reliability of ratings completed with and without the potential influence of another tool. The average intraclass correlation coefficient for the global rating scale between ratings completed with and without the influence of another tool was 0.92. The overall Kappa score for the ten-item checklist was 0.92, with a summary checklist score reliability of 0.97. Evaluation using the 21-item checklist was done independently by two raters 1 year after the initial evaluation and was therefore not subject to the influence of rating by another tool.
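For readers wishing to reproduce reliability estimates of this kind, the sketch below shows how intraclass correlation coefficients and Cohen’s Kappa can be computed in Python using the pingouin and scikit-learn packages; the study’s analyses were run in PASW and Stata, and the data layout, column names, and values shown here are illustrative assumptions only.

```python
# Illustrative reliability computations (not the study's actual code); the
# data frame layout, column names, and ratings below are assumed examples.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Long-format global rating data: one row per (video, rater) pair.
ratings = pd.DataFrame({
    "video":   [1, 1, 2, 2, 3, 3, 4, 4],
    "rater":   ["A", "B"] * 4,
    "overall": [5, 4, 3, 3, 2, 2, 6, 5],   # summary item, 1-6 scale
})

# Intraclass correlation coefficients for the continuous summary rating.
icc = pg.intraclass_corr(data=ratings, targets="video",
                         raters="rater", ratings="overall")
print(icc[["Type", "ICC", "CI95%"]])

# Cohen's Kappa for a dichotomous checklist item scored by the two raters.
rater_a = [1, 1, 0, 1, 0, 1]
rater_b = [1, 1, 0, 0, 0, 1]
print("kappa:", cohen_kappa_score(rater_a, rater_b))
```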

Statistical analysis

To explore the dimensions assessed by the global rating scale, we first confirmed the appropriateness of factor analysis using Bartlett’s test of sphericity (Chi-square = 114.7, p < 0.001) and the Kaiser–Meyer–Olkin measure of sampling adequacy (0.60). Principal components analysis was then performed on the eight items of the global rating scale using a VARIMAX rotation. A scree plot was inspected (Cattell 1966), and factors with eigenvalues greater than 1 were retained. Item loadings ≥0.40 are reported. Inter-rater reliability was evaluated using the intraclass correlation coefficient (ICC), Pearson’s correlation coefficient, and Cohen’s Kappa where appropriate. Both correlation coefficients and disattenuated coefficients are reported (Spearman 1904); disattenuated coefficients represent the hypothetical correlation between two measures assuming each measure is perfectly reliable.
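The sketch below illustrates these dimensionality and disattenuation calculations in Python using the factor_analyzer package; the study itself used PASW and Stata, so the input file, data frame, and variable names here are illustrative assumptions rather than the actual analysis code.

```python
# Illustrative sketch of the factor analysis and disattenuation steps (the
# study used PASW/Stata); the input file and column names are hypothetical.
import math
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import (calculate_bartlett_sphericity,
                                              calculate_kmo)

# `grs` holds the eight global rating scale items, one column per item.
grs = pd.read_csv("global_rating_items.csv")  # hypothetical file name

chi_square, p_value = calculate_bartlett_sphericity(grs)   # test of sphericity
kmo_per_item, kmo_overall = calculate_kmo(grs)             # sampling adequacy

# Principal components extraction with VARIMAX rotation; factors with
# eigenvalues > 1 are retained after inspecting the scree plot.
fa = FactorAnalyzer(n_factors=2, rotation="varimax", method="principal")
fa.fit(grs)
eigenvalues, _ = fa.get_eigenvalues()
loadings = pd.DataFrame(fa.loadings_, index=grs.columns)
print(loadings.where(loadings.abs() >= 0.40))   # report loadings >= 0.40

def disattenuate(r_xy, r_xx, r_yy):
    """Spearman's correction: correlation assuming perfectly reliable measures."""
    return r_xy / math.sqrt(r_xx * r_yy)
```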

The sensitivity and specificity of the overall scores on the checklists against the dichotomous measure of competence on the global rating scale (a score of three or more) were evaluated at various checklist cutpoints with a Receiver Operating Characteristic (ROC) analysis. The area under the curve (AUC) was then estimated and used as an index of diagnostic accuracy.
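A corresponding ROC calculation can be sketched with scikit-learn as shown below; the variable names and the small example data set are our own assumptions for illustration and do not reproduce the study data.

```python
# Illustrative ROC analysis (scikit-learn); `competent` and `checklist_pct`
# are assumed variable names with made-up example values, not study data.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

competent     = np.array([1, 1, 1, 0, 1, 0, 0, 1, 1, 0])         # expert global judgment
checklist_pct = np.array([95, 100, 90, 70, 85, 80, 75, 100, 90, 85])

print("AUC:", roc_auc_score(competent, checklist_pct))

# Sensitivity and specificity at each candidate cutpoint (score >= threshold
# classified as "competent"); a cutpoint can then be chosen to maximize sensitivity.
fpr, tpr, thresholds = roc_curve(competent, checklist_pct)
for thr, sens, fp in zip(thresholds, tpr, fpr):
    print(f"cutpoint >= {thr}: sensitivity {sens:.2f}, specificity {1 - fp:.2f}")
```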

Comparisons between groups were made using Student’s t tests, Chi-square, and Wilcoxon rank-sum tests where appropriate. All analyses were performed using PASW Statistics software, version 18.0 for Windows (PASW, IBM Corporation, Somers, NY) and Stata 11.0 (StataCorp LP, College Station, TX).

Results

Thirty-five participants were invited and 34 (97%) consented to and completed the study protocol (Table 1).

Table 1 Participants’ demographic characteristics (total N = 34)

Dimensions assessed by the global rating scale

We identified two factors with eigenvalues greater than 1, together accounting for 84.1% of the overall variance. Post-rotation, five global rating scale items loaded on the first factor, while two items loaded on the second (Table 2). The first factor consisted primarily of behaviors that can be characterized as relating to “technical ability” (α = 0.78). The second factor consisted of behaviors relating to procedural “safety” (α = 0.76). The item “appropriate preparation of instruments pre-procedure” did not load on either factor.

Table 2 Rotated factor loadings for scale items

Correlation of the checklist scores with factor scores on the global rating scale

Inter-rater reliability of individual items and overall rating of the global rating scale and the two checklists is shown in Table 3.

Table 3 Inter-rater Reliability (intraclass correlation coefficients or Kappa statistics) for items in the global rating scale and the checklist

The correlation between the overall ten-item checklist score and the weighted factor score on technical ability was positive (0.49; 95% confidence interval 0.17 to 0.71), while the correlation between the ten-item checklist score and the weighted factor score on procedural safety was negative (−0.17; 95% confidence interval −0.48 to 0.18).

The correlation between the overall 21-item checklist score and the weighted factor score on technical ability was also positive (0.43; 95% confidence interval 0.10 to 0.67). The correlation between the 21-item checklist score and the weighted factor score on procedural safety was −0.13 (95% confidence interval −0.45 to 0.22).

Diagnostic performance of checklist scores in identifying competence based on expert global judgment

Based on expert global judgment of competence, 21 participants (62%) were rated as competent overall, while 13 (38%) were rated as not competent. There were no significant baseline differences between the two groups (Table 1).

The mean overall score on the ten-item checklist for those deemed competent was 95.2 ± 8.1%, significantly higher than that for those deemed not competent (81.0 ± 18.6%, p = 0.002). The correlation between the overall ten-item checklist score and the summary measure on the global rating scale was high (r = 0.58, p = 0.0003). Corrected for attenuation, the correlation was 0.80.

Using the 21-item checklist, the mean overall score for those deemed competent was not significantly different from that for those deemed not competent (92.0 ± 5.2% vs. 84.1 ± 14.7%, respectively; p = 0.08). The correlation between the overall 21-item checklist score and the summary measure on the global rating scale was high (r = 0.60, p = 0.0002). Corrected for attenuation, the correlation was 0.79.

On ROC analyses, the overall ten-item checklist score showed acceptable discrimination (AUC = 0.79, standard error = 0.078, 95% confidence interval [0.64, 0.94]) (Hosmer and Lemeshow 2000), while the AUC for the 21-item checklist was 0.68 (standard error = 0.098, 95% confidence interval [0.48, 0.87]). Table 4 shows the sensitivity and specificity at different cut-off points for the checklist scores. To maximize sensitivity (100%), a cut-off point of 80% was chosen as optimal for both checklists. At this threshold, a sensitivity of 100% allowed us to reliably “rule out” competence for individuals with a checklist score of <80%. However, the poor specificity of both checklists did not allow us to “rule in” competence despite high checklist scores.

Table 4 Sensitivity and specificity of various cutpoints on the overall checklist scores in determining competence

Of the 13 participants deemed incompetent on the global rating scale, 11 scored ≥80% on the checklists. Reasons for incompetence in these 11 participants included significant breaches in sterility (n = 5), loss of wire control (n = 4; median duration without wire control 35 s, range 17–38 s), and multiple attempts (n = 2; in both cases >6 attempts were made).

Discussion

For the assessment of competence in CVC during a formative examination using simulators, this study evaluated the use of three assessment tools: a global rating scale and two checklists. Our results suggest that, for the assessment of competency in CVC skills, the use of checklists is not always the “most desirable” evaluation method. First, with respect to content validity, our results indicate that while the dimensions captured by the global rating scale were technical ability and safety, neither checklist adequately captured errors relating to safety. This finding is consistent with the literature on procedural checklists in general: a systematic review of procedural checklists found that 30–50% of checklists did not assess competencies in the areas of ‘infection control’ and ‘safety’ (McKinley et al. 2008). The errors identified in our study were serious in nature. In particular, breaches in sterility, loss of wire control, and an unsafe number of attempts at venous access are all errors with patient safety implications. The rate of infectious complications relating to CVC can be as high as 26% (McGee and Gould 2003), so meticulous attention to sterility is an important aspect of the procedure. In a study evaluating malpractice claims for CVC, the most common complication was wire/catheter embolism (Domino et al. 2004); wire control therefore has important safety implications. Lastly, the incidence of mechanical complications increases significantly with three or more attempts at insertion (Mansfield et al. 1994; McGee and Gould 2003), so multiple attempts by a trainee should be flagged as problematic.

While the commission of these serious errors appeared to result in an overall global impression of incompetence, committing the same errors resulted in only a minimal reduction in checklist scores. Our study identified a positive correlation of both checklist scores with the technical ability factor score on the global rating scale, but a negative correlation of both checklist scores with the safety factor score. Although the differences between these correlations were not statistically significant, they differed in direction, lending support to the impression that these checklists capture items relating to technical ability more than they do safety parameters.

Lastly, our results indicate that the use of checklist scores in predicting competence was associated with a higher sensitivity than specificity. A low checklist score (<80%) was uniformly associated with procedural incompetence, while a number of individuals with high checklist scores (≥80%) were still deemed incompetent. All of these candidates committed errors that were considered serious in nature.

What are the implications of these results? Consistent with the OSCE literature, our results suggest that checklists should not automatically be assumed to be the preferred method of assessment. While we do not advocate abandoning checklists altogether for the assessment of procedural skills, we do recommend that their use be evaluated prior to adoption.

Both of the checklists evaluated in our study were constructed carefully; one used a cognitive task analysis approach (Velmahos et al. 2004), while the other used a rigorous checklist development procedure (Barsuk et al. 2009b). Nonetheless, the content validity of these tools may be improved, either by including additional items on safety parameters or by differentially weighting critical items.

Limitations

Our study has several important limitations. First, it was performed at a single centre with a relatively small sample size. Secondly, in the absence of an independent gold standard, it may be problematic to use physicians’ subjective judgment on the global rating scale as the standard against which checklist scores were compared; one could just as easily argue for using checklist scores as the standard.

We chose the global rating scale as the standard in order to maximize the number of trainees identified as potentially benefiting from additional practice. For a formative examination, we were willing to accept some false positives in the identification of incompetence, but were less willing to accept false negatives (i.e., missing individuals who may need additional instruction or practice). Indeed, all candidates deemed incompetent on the basis of a low checklist score were also deemed incompetent by the global rating scale, while the global rating scale identified additional incompetent performances that received high scores on both checklists. Furthermore, these additional identifications did not appear to be false positives, as all of these individuals committed serious errors considered to pose significant potential harm to patients.

Thirdly, our conclusions regarding the use of these assessment tools are context-specific: they relate to the two checklists and the global rating scale used in our study, in a formative simulation examination of CVC performance by first-year internal medicine residents using the landmark technique, evaluated by trained expert raters. Checklists constructed in a different manner may outperform our global rating scale. Likewise, the reliability of scores from these assessment tools is unknown in the hands of untrained raters or for CVC performed on patients using an ultrasound-guided technique (NHS 2002). Context therefore needs to be taken into account when interpreting our results.

Fourthly, although we attempted to estimate the degree to which the use of one tool influenced the next, our raters were trained on the use of both checklists and the global rating scale. We therefore cannot exclude the possibility that intimate knowledge of the checklist items influenced assessments using the global rating scale, and vice versa, even when the tools were not completed at the same time.

Fifthly, we did not explore the effects of modifying the checklists, such as including additional items on safety parameters, differentially weighting critical items, or using a three-point scale for each item rather than the current binary format.

Finally, the validity of scores from our constructed global rating scale cannot be assumed, despite the fact that it was constructed from two previously validated tools (The Foundation Programme 2009; Reznick et al. 1997). Compiling two tools into one resulted in the inclusion of behaviorally anchored scales for some items but not others. This uneven distribution of anchors may account for the variable inter-rater reliability observed among items on the global rating scale, as well as the differential weighting on the factor scales. Furthermore, the cut-point for competence was chosen arbitrarily, albeit by consensus, based on concerns that evaluations may be positively skewed (Streiner and Norman 2003). As a result, three categories were available to assist examiners in differentiating among above-average performances. The trained evaluators in our study, however, ultimately deemed 38% of the candidates incompetent. Future studies should therefore consider including additional categories to help examiners differentiate among below-average rather than above-average performances.

Conclusion

Despite these limitations, results from our study raise an important question regarding the practice of automatically adopting checklists as the preferred method for assessing procedural skills. Our study provides an example in which the use of a global rating scale may in fact be preferable to the use of two currently available checklists. Future studies should focus on further optimizing the construction of assessment tools and on correlating assessment results with clinical outcomes.