Introduction

Knee arthrocentesis is a procedure performed to aspirate fluids or inject medications into the knee joint cavity. It can be used as a diagnostic or therapeutic procedure [1, 2]. Diagnostically, knee arthrocentesis is a cornerstone in the differential diagnosis of inflammatory knee effusions [3]. As a treatment, joint injection of steroids or other medications plays a key role in the symptomatic relief of knee osteoarthritis (OA).

While the American Board of Orthopaedic Surgery (ABOS) includes arthrocentesis as a core competency for orthopedic residents [4], other guidelines advise that general physicians also acquire this competency, regarding it as a significant milestone to reach during undergraduate medical training [5]. In our postgraduate orthopedic program, only 2 of the last 16 accepted residents (13%) had performed a knee arthrocentesis at the undergraduate level. Most last-year medical students feel ill-prepared to execute the procedure [5]. Furthermore, previous studies have shown that the success rate of palpation-guided therapeutic knee arthrocentesis among orthopedic surgery residents can be as low as 55% [6, 7].

Several medical education methodologies have been described to train knee arthrocentesis. These models have focused on the procedure’s technical aspects, leaving nontechnical dimensions out of the training scenario [8, 9]. This leads to an incomplete and underperforming educational model [10, 11]. The inclusion of nontechnical skills promotes meaningful experiences that have been shown to improve learning [12]. Moreover, their exclusion can induce the belief that nontechnical competencies are less relevant when approaching a patient for a procedure [12]. In addition, previously published teaching models have lacked explicit instruments to assess trainees, limiting their reproducibility and external validity [4, 9].

The aims of this study were to (1) design and implement a high-fidelity hybrid simulation scenario of knee arthrocentesis, (2) compare the performance of last-year medical students and general physicians over a four-session workshop with that of experts in the knee arthrocentesis procedure in a simulated environment, and (3) adapt a DOPS scale to assess technical and nontechnical skills related to knee arthrocentesis.

Materials and methods

Institutional review board approval was obtained (no. 190107005). Last-year medical students and general physicians were recruited to complete four nonconsecutive training sessions in a 3-month period. We recruited six orthopedic surgeons with experience in knee arthrocentesis as experts against whom we could compare trainee improvement. Five days before their first training session, trainees received instructional documentation including written directions and rationale for performing a diagnostic knee arthrocentesis, a description of the training scenario including the assessment tool used to evaluate performance in technical and nontechnical competencies, and a video describing the procedure step by step.

Training scenario

A high-fidelity hybrid simulation scenario was created. A patient-actor was trained with a script portraying a 30-year-old patient arriving in the emergency department (ED) with a two-day history of left knee pain associated with fever and joint inflammation. The patient-actor was stationed on a gurney; upon uncovering her left knee, trainees encountered a simulated knee (Sawbones©, Pacific Research Laboratories; Vashon, WA, USA). The model joint was a non-articulated knee with a partially mobile patella. On a side table, trainees had to select the required materials to perform the procedure, including hospital paperwork and informed consent. A health care assistant was posted to assist the trainee upon request but limited his participation to orders given by the trainee.

During each session, trainees were required to take an abbreviated history and perform a physical examination of the patient. They had to explain the procedure and obtain written informed consent. After preparing the required instruments, they had to execute the procedure and load the laboratory test tubes. Finally, they completed hospital exam forms and gave the patient postprocedure recommendations (Fig. 1).

A single orthopedic surgeon evaluated all trainees, and sessions were recorded for secondary evaluation to determine the inter-rater reliability of the evaluation tool. We used a specific direct observation of procedural skills (DOPS) scale designed for the scenario (supplemental material 1 and 2). After the procedure, all trainees were immediately taken to a debriefing room to receive feedback from the surgeon who evaluated their performance. The surgeon had been previously trained to give effective feedback using the Pendleton model [13, 14]. Feedback was also recorded. Each trainee’s DOPS result constituted a point in their individual learning curve. Trainee learning curves were compared with expert performance to measure student proficiency in the training scenario. Proficiency was defined as the trainee’s ability to safely conduct the procedure with careful consideration for the patient and following the best practices outlined in the educational material they received [15], as measured through the de novo DOPS scale.

Fig. 1 Key steps of the training scenario following the medical history and informed consent: (a) selecting required materials; (b) sterile environment preparation; (c) patient and skin preparation; (d) puncture site selection; (e) puncture and fluid extraction; and (f) tube filling and paperwork

After feedback, each trainee completed a validated satisfaction scale [16]. This tool measured the trainees’ perceptions regarding scenario realism, quality of the instructional material sent beforehand, feedback received, and perceived utility of the training session. One year after training, all participants were contacted to determine if they had performed any knee arthrocentesis. Those who had performed the procedure were given a questionnaire to measure how confident and prepared they felt to undertake the real-life procedure. Specifically, we asked if training had allowed them to obtain patient consent and provide patient education, perform a safe knee arthrocentesis, fill laboratory tubes and complete paperwork, and explain postprocedure care to the patient. In addition, we asked the trainees to assess the perceived utility of participating in the training sessions.

DOPS adaptation and validation

We adapted the new score from a DOPS previously validated in the same cultural setting [17]. The adaptation maintained the 11 items included in the original DOPS but adjusted their descriptors to assess knee arthrocentesis. We determined the content validity of the de novo DOPS by conducting a Delphi panel composed of experts in Orthopedic Surgery, Rheumatology and Emergency Medicine. Ratings and comments for each item were recorded, and modifications were made for repeat expert assessment. We repeated expert consultation through the Delphi panel until we obtained at least 80% agreement on all items.

Inter-rater reliability was assessed using a second evaluation performed by another orthopedic surgeon [18]. Validity analysis was carried out using the DOPS scores from each trainee’s consecutive sessions. Construct validity was determined through exploratory and confirmatory factor analyses. The exploratory factor analysis detected latent variables or constructs underlying the observed variables [19, 20]. A confirmatory factor analysis was then performed to validate the factor structure identified in the exploratory analysis [20]. The final scale graded each of the eleven items from 1 to 7 (with grades 1–3 as insufficient; 4 as standard setting; 5–6 as partially accomplished; and 7 as completely accomplished).

Statistical analysis

Adaptation and validation of the DOPS scale

Inter-rater reliability was measured with the weighted Kappa (wK) coefficient. Levels of agreement for wK were determined as proposed by Landis and Koch, considering wK values 0.00–0.20 as slight agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, substantial agreement; and 0.81–1.00, almost perfect agreement [18].
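As an illustration of the agreement measure described above, the following minimal Python sketch computes a weighted kappa for two raters and maps it to the agreement bands listed in the text. It is not the analysis code used in the study (which was run in Stata). The function names and the choice of linear disagreement weights are assumptions made for this example; the text does not state which weighting scheme was applied.

```python
def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Cohen's weighted kappa for two raters scoring the same items.

    `categories` is the ordered list of possible grades (e.g. 1..7).
    The weighting scheme is an assumption; "linear" and "quadratic"
    are the two common choices.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")

    index = {category: i for i, category in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions of (grade by rater A, grade by rater B).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    observed_dis = sum(disagreement(i, j) * observed[i][j]
                       for i in range(k) for j in range(k))
    expected_dis = sum(disagreement(i, j) * row[i] * col[j]
                       for i in range(k) for j in range(k))
    if expected_dis == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1.0 - observed_dis / expected_dis


def landis_koch_label(kappa):
    """Map a kappa value to the agreement bands quoted in the text."""
    if kappa < 0:
        return "poor agreement"  # band below zero in Landis and Koch's scheme
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label + " agreement"
    return "almost perfect agreement"
```

With the 1 to 7 grading used by the scale, `categories` would be `[1, 2, 3, 4, 5, 6, 7]`, and each rater's list would hold one grade per rated item.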

In the exploratory factor analysis, the number of factors (or dimensions) was selected considering the Kaiser–Guttman [21] and Cattell [22] criteria. Thus, factors with an eigenvalue above one and those above the inflection point in the scree plot were retained. The coefficient of determination (R²) was estimated to quantify the percentage of the variance of the scale’s items explained by the two factors identified in the exploratory analysis. Internal consistency for each dimension detected in the factor analysis was assessed using Cronbach’s alpha.
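To make two of the quantities above concrete, the sketch below shows how the Kaiser–Guttman rule and Cronbach’s alpha can be computed. It is a minimal illustration using only the Python standard library, not the study’s Stata code. The function names are hypothetical, and the eigenvalues are assumed to come from a factor analysis performed elsewhere.

```python
from statistics import variance


def kaiser_guttman_retained(eigenvalues):
    """Number of factors retained under the Kaiser-Guttman rule (eigenvalue > 1)."""
    return sum(1 for value in eigenvalues if value > 1.0)


def cronbach_alpha(scores):
    """Cronbach's alpha for one dimension of a scale.

    `scores` is a list of rows, one row per assessment, each row holding
    the grades of the items that belong to the dimension.
    """
    if len(scores) < 2:
        raise ValueError("at least two assessments are required")
    k = len(scores[0])
    if k < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != k for row in scores):
        raise ValueError("every assessment must grade the same items")

    # Sample variance of each item across assessments.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the summed score across assessments.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

Alpha approaches 1 when the items of a dimension rise and fall together across assessments, which is why it is reported separately for each dimension rather than for the whole scale.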

Learning curve analysis

A mixed-effects/multilevel model with a random intercept was constructed to study differences in consecutive DOPS results of each trainee. The use of multilevel models was based on the fact that each trainee’s performance was assessed in repeated training sessions. Because DOPS scores from consecutive sessions of the same subject are compared, a correlation among them is expected; if ignored, this correlation would produce biased estimates of the standard errors and confidence intervals. Mixed-effects/multilevel models can be used to obtain standard errors that take the clustering within subjects into account. Multilevel statistical modeling enables quantitative analysis of learning curves and has been shown to have higher statistical power than conventional repeated-measures analysis of variance (ANOVA) [19]. This statistical method has also been used in previous research to analyze how trainees acquire skills [28,29,30].
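For readers less familiar with this class of models, a random-intercept model of the kind described above can be written as follows. This is a generic textbook formulation given for illustration; the symbols and the coding of session as a categorical predictor with the first session as reference are assumptions, since the exact parameterization is not stated in the text.

```latex
y_{ij} = \beta_0 + \sum_{k=2}^{4} \beta_k \, \mathbf{1}[j = k] + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)

\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}
```

Here y_{ij} is the DOPS score of trainee i in session j, \beta_0 is the mean score in the first session, each \beta_k is the difference between session k and the first session, u_i is the trainee-specific random intercept, and \varepsilon_{ij} is the residual. The intraclass correlation \rho is the share of total variance attributable to differences between trainees, and it quantifies the within-subject correlation that the model accounts for.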

Given that residuals of the mixed-effects model did not have a normal distribution, the standard error was estimated using bootstrapping (10,000 replications). Thus, a 95% confidence interval (95% CI) was obtained using the bias-corrected and accelerated method. Mean scores and 95% CI were expressed for each training session.
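The resampling idea can be sketched in a few lines of Python. Note that this example is deliberately simpler than the analysis described above: it uses the plain percentile method rather than the bias-corrected and accelerated method, and it resamples the scores of a single session rather than refitting a mixed-effects model. The function name, the default seed, and the input layout are assumptions made for illustration.

```python
import random


def bootstrap_mean_ci(scores, replications=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a mean.

    `scores` holds one score per trainee for a single session.
    Returns (mean, lower bound, upper bound).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < level < 1:
        raise ValueError("level must be between 0 and 1")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)

    # Resample trainees with replacement and record the mean each time.
    boot_means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(replications)
    )

    alpha = 1 - level
    lower_index = int((alpha / 2) * replications)
    upper_index = min(replications - 1, int((1 - alpha / 2) * replications))
    return sum(scores) / n, boot_means[lower_index], boot_means[upper_index]
```

The bias-corrected and accelerated method adjusts the two percentile cut points using a bias term and a jackknife-based acceleration term, which matters most when the bootstrap distribution is skewed, as can happen when scores cluster near the top of a bounded scale.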

All analyses were conducted in Stata version 16 (StataCorp. 2019. Stata Statistical Software: Release 16. College Station, TX: StataCorp LLC).

Results

Adaptation and validation of the DOPS scale

The Delphi panel was composed of 17 orthopedic surgeons, 4 rheumatologists and 2 emergency department physicians. They received the 11-item DOPS with knee arthrocentesis descriptors. Two rounds were necessary to obtain over 80% agreement on every item.

In the exploratory factor analysis, two factors (or dimensions) with an eigenvalue greater than one were identified (bidimensional). These factors explained 91.7% of the variance observed in the de novo DOPS, and they could be classified into a technical domain (two items) and a nontechnical domain (nine items). In the confirmatory factor analysis, all standardized factor loadings were above 0.3 (cutoff point) and statistically significant (Table 1). The R² ranged from 0.28 (item one) to 0.93 (item ten). This means, for example, that 93% of the variability in item ten’s scores was explained by the factor identified. Internal consistency for each dimension was α = 0.86 in the technical domain and α = 0.70 in the nontechnical domain. Inter-rater reliability was almost perfect (wK 0.87).

Learning curve analysis

Twenty-eight trainees were recruited (10 last-year medical students and 18 general physicians). Three had clinical experience as general physicians, with 4 years of experience each, resulting in a mean experience of 0.43 years among the 28 trainees. Only two of them (7%; both general physicians) had performed a single knee arthrocentesis prior to the training sessions. We also recruited six orthopedic surgeons to serve as experts.

Performance significantly improved between the first session (mean score 5.89 [95% CI 5.70–6.07]) and second session (mean score 6.51 [95% CI 6.30–6.72]; p < 0.01) and between the second and third session (mean score 6.80 [95% CI 6.59–7.00]; p < 0.05). We found no difference between the third session and the fourth (mean score 6.78 [95% CI 6.48–7.00]; p = 0.94) (Fig. 2).

Fig. 2 Mean performance and standard error (vertical axis) for each trainee session (one through four on the horizontal axis) and the experts’ performance (“Experts” on the horizontal axis)

A second mixed-effects model was designed to compare each trainee session with the performance of the six expert orthopedic surgeons. We found that surgeons had a significantly higher mean score in their session (mean score 6.94 [95% CI 6.88–7.00]) than trainees did in their first (p < 0.01) and second sessions (p < 0.01). The third and fourth trainee sessions did not differ significantly from the experts’ performance (p = 0.21 and 0.31, respectively).

Training satisfaction survey and follow-up

After debriefing, 24 of the 28 trainees (86%) answered a questionnaire to evaluate the training experience. They all agreed that (1) the instructional material they received prior to the training sessions was useful, (2) the training scenario allowed them adequate training, (3) they perceived the feedback received as useful, (4) the assessment tool allowed them to focus on improving specific tasks, (5) the addition of a clinical case at the beginning improved scenario fidelity, and (6) the use of a trained patient-actor allowed improvement of nontechnical (communication) skills. The only question that received less than 100% agreement was whether the knee model used was realistic (only 19 of 24 agreed; 79%).

Seven of the 28 trainees (25%) had performed a real-patient knee arthrocentesis by the 1-year follow-up. All seven thought that the training was useful and allowed them to perform the procedure safely. They felt confident regarding patient consent and education, performing the procedure, and explaining postprocedure care to the patient. The only item where two trainees (29%) felt insecure was selecting the correct laboratory tubes and completing paperwork.

Discussion

Our high-fidelity simulation scenario allowed student trainees to improve the technical and nontechnical skills required to perform a safe knee arthrocentesis.

A key to training and learning is repetition. Previous studies have focused on proving differences between students prior to training and after training [4, 9] or on measuring satisfaction after a single use [20]. Repeated training and evaluation increase student performance and guard against drawing conclusions from results that could be due to chance [21, 22]. This is the first study to determine how many training sessions are required to achieve proficiency in knee arthrocentesis. Students’ learning curves showed a learning plateau after three sessions. Measuring the fourth session allowed us to confirm that performance was sustained and not just a one-session peak [21]. The short learning curve could be explained by the detailed training documentation received prior to the sessions, the direct observation scale known to the trainees, and the personalized feedback received after each session, which allowed the trainees to focus on their mistakes and gave them advice on how to improve. This learning curve reflects the achievement of proficiency in the scenario. Expertise in knee arthrocentesis requires experience and hence cannot be acquired in a brief simulated training course [15].

Teaching combined technical and nontechnical abilities has also proven to be effective through simulation [12, 23]. Simulated procedures and situations offer a structured teaching method that is replicable, objective, and safe (for the patient and student) [24, 25]. The inclusion of nontechnical aspects in a procedure usually considered technical has many proven benefits [12]. First, it allows for the teaching and learning of nontechnical aspects that are usually left out of technical training. Doctor–patient communication plays a key role in the doctor–patient relationship and in ideal clinical practice [26]. Furthermore, communication and nontechnical skills have been proven trainable [27]. Second, including these skills allowed us to incorporate the written consent as part of the procedure. The task of explaining the risks, benefits, and steps of the procedure to the patient helped our trainees incorporate significant theoretical concepts of knee arthrocentesis. Finally, as mentioned previously, hybrid training based on clinical scenarios adds meaning to the interaction. The presence of a real patient-actor increases the stakes for the student while remaining teaching friendly. This method has been proven to improve the learning experience [28]. Nontechnical aspects had not been trained in previous studies regarding knee arthrocentesis [8, 9, 20]. In our study, the nontechnical dimension also improved between training sessions.

The scenario required adapting a direct observation of procedural skills (DOPS) scale, which proved to be consistent, reliable, and useful for trainee learning (assessment for learning) [29, 30]. Designing a simulated training scenario requires an objective measurement tool. Other studies have used generic observation tools or unvalidated checklists [31]. The main limitation of a non-procedure-specific tool is that it lacks the nuance of the procedure and does not incorporate nontechnical skills, limiting specific item-by-item feedback and therefore improvement. We therefore decided to adapt an existing procedure-specific tool. Delfino et al. had already created a direct observation tool for a specific technical procedure (tracheal intubation) that included nontechnical skills [17]. We adapted their scale, maintaining the 11 items but changing the descriptors used in each of them. First, we determined that the tool had internal consistency and inter-rater reliability. This determination is important because it supports the reproducibility of our results [32, 33]. The tool we created also proved to be bidimensional. Obtaining two dimensions through exploratory and confirmatory factor analysis reaffirmed the notion that we were measuring technical and nontechnical competencies in the same procedure. Finally, we decided to make the DOPS scale available to trainees prior to the sessions. Traditionally, training and teaching scenarios keep the items and descriptors of the scale hidden from the students. Using the assessment tool for session preparation improves understanding of the key elements of each step, allowing students to learn from the assessment tool and thereby streamlining feedback and improvement [34]. This might help explain the high initial rating for trainees’ first session (mean score 5.89) and probably contributed to the steep learning curve they followed.

A 1-year follow-up found that trainees who had performed the real-life procedure did so confidently, using the skills they had trained. Student perception has been proven to impact learning [35]. Positive perception leads to improved performance and increases the completion of training programs [36]. Our scenario had a positive trainee perception both immediately after training and 1 year after training. Trainees positively valued the feedback given to them in the debriefing room. The feedback was structured, and the surgeon giving the feedback had received prior training regarding feedback structure and technique. We believe our experience and the previous literature [37,38,39,40] make the inclusion of personalized feedback a key element in simulated training.

One of our main limitations was that the training sessions were not evenly spaced, either between students or for individual students. This is a frequent limitation when training has no curricular integration. After its initial success, the program will be included in student training, allowing for sessions at regular intervals. Second, although subjective trainee evaluations rated the model favorably, the knee model used is relatively simple and did not allow the students to appreciate the impact that knee variability has on increasing arthrocentesis difficulty.

In this research, we applied multilevel statistical models to account for the repeated measures obtained from each trainee and to compare groups. These models allowed us to estimate differences in DOPS performance between sessions and between trainees and experts. Identifying the point in the learning curve where the slope flattens out (inflection point) is critical [26]. Beyond this point, progressively greater effort is needed to achieve further learning gains before reaching a performance level similar to that of an expert [27]. Thus, using this statistical approach, we could graphically represent the learning curves and identify their critical points. Future research could incorporate an interaction between individual-level variables (e.g., last-year medical student versus specialist resident) and group-level variables (e.g., different simulation scenarios) through multilevel statistical models.

Conclusion

A high-fidelity simulation scenario allowed student trainees to improve the technical and nontechnical skills required to perform a safe knee arthrocentesis. After three training sessions, trainee performance was similar to that of the experts we assessed. The scenario required creating an adaptation of a direct observation of procedural skills (DOPS) scale that proved to be consistent and reliable.