Introduction

Psychotherapeutic competence is conceptualized as a therapist’s general and treatment-specific knowledge, skills, and values or attitudes when implementing therapeutic interventions (Muse and McManus 2016; Roth and Pilling 2007; Waltz et al. 1993). Barber et al. (2007) refer to psychotherapeutic competence more comprehensively as “the judicious application of communication, knowledge, technical skills, clinical reasoning, emotions, values, and contextual understanding for the benefit of the individual and community being served” (p. 494). Waltz et al. (1993) describe patient-specific aspects (such as symptoms, impairment or life situation) and treatment-specific variables (such as therapy stage, improvement or timing of interventions) that need to be considered for a broad perspective on competence.

The assessment of competencies not only plays a crucial role in treatment integrity in general but may also facilitate quality control during training, licensure and ongoing practice, provide therapists with formative and summative feedback, and guide self-reflection (Muse and McManus 2013). However, meta-analyses on the association between therapeutic competence and patient outcomes yield results ranging from no to small effects (Webb et al. 2010) or from small to moderate effects (Zarafonitis-Müller et al. 2014). Further, these reviews report on a variety of competence measures, ranging from the Cognitive Therapy Scale to the Collaborative Study Psychotherapy Rating Scale and study-specific developments, and they depict enormous variability in reliability, from no to nearly perfect agreement (Muse and McManus 2013; Zarafonitis-Müller et al. 2014). The Cognitive Therapy Scale (CTS; Young and Beck 1980), or Cognitive Therapy Rating Scale (CTRS; Beck Institute for Cognitive Behavior Therapy 2019), is a commonly used measure (Kazantzis et al. 2018). It has been revised repeatedly, with the most prominent version being the Cognitive Therapy Scale-Revised (CTS-R; Blackburn et al. 2001; for a detailed description of the different versions, see Muse and McManus 2013; Kazantzis et al. 2018).

Since therapeutic competence depends on the complexity of the patient’s presentation, patient outcomes are not recommended unreservedly as a proxy for competence (Muse and McManus 2013). While ratings of in-session therapeutic skills performed by independent raters are highly recommended, Muse and McManus (2013) note that research on the reliability and validity of such competence assessments is sparse and that specifically regarding cognitive-behavioral therapy (CBT), “it is currently not possible to make evidence-based recommendations about how best to assess CBT competence” (p. 496).

The reliability of competence ratings is often considered to be in need of improvement (Fairburn and Cooper 2011; Muse and McManus 2013), and a number of variables are theorized to influence it. Rater training is consistently deemed central (Barber et al. 2007; Fairburn and Cooper 2011; Muse and McManus 2013, 2016). The same holds for rater expertise, although the definition and amount of expertise required are not always clear (Barber et al. 2007; Muse and McManus 2013, 2016). Other variables discussed in the literature are the number of raters, rater independence, the number of sessions rated per patient, the number of patients rated per therapist, the form of treatment, the stage of therapy, patient diagnosis and the competence scale used (Barber et al. 2007; Dennhag et al. 2012b; Fairburn and Cooper 2011; Muse and McManus 2013, 2016; Webb et al. 2010). In sum, measurement quality (and thus reliability) is influenced by aspects related to the raters, the sample and the instrument used (Kottner et al. 2011).

Moreover, there are various reliability measures, e.g., Cohen’s κ for nominal data, Kendall’s τ for ordinal data, or, depending on the model and number of raters, different forms of intraclass correlations for continuous data, to name only a few (Wirtz and Caspar 2002). Although a variety of primary studies examine psychotherapeutic competence ratings, to our knowledge, no evidence synthesis on their reliability has yet been published. Therefore, the first aim of the current study was to map the evidence regarding the interrater reliability (IRR) of psychotherapeutic competence ratings, and the second was to estimate the pooled IRR across methodologically sound studies. The third, explorative aim of the study was to investigate moderators of the IRR of those competence ratings.

Method

We conducted our systematic review in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement (Moher et al. 2009). The review protocol was pre-registered and published with the International Prospective Register of Systematic Reviews (PROSPERO; CRD42018111752).

Inclusion Criteria

Participants in the original studies had to be mental health patients diagnosed via a formal classification system (i.e., any edition of the International Statistical Classification of Diseases and Related Health Problems (ICD; WHO 1992) or the Diagnostic and Statistical Manual of Mental Disorders (DSM; APA 2013)). Given that competence scales have been developed mainly for adults (Webb et al. 2010), we concentrated on studies with patients aged 18 and over. To address the performance of a therapist or mental health care provider within a real clinical encounter, we included any studies focusing on individual, face-to-face bona fide psychotherapy (APA 2017). To ensure a focus on psychotherapy rather than counseling alone, at least 50% of therapists in the studies were expected to be licensed and to have a minimum of 1 year of any clinical experience. Studies were included if at least two external judges performed the ratings. We allowed any person to be an external rater (e.g., supervisor, peer, independent researcher) and any competence scale to be included.

The outcome was the IRR of the total scores of therapeutic competence ratings. IRR refers to the variation between different raters measuring the same subjects under similar conditions (Koo and Li 2016; Kottner et al. 2011; Santelmann et al. 2016). We included IRR as measured by the intraclass correlation coefficient (ICC), since this coefficient is used most frequently for continuous competence outcomes (Kottner et al. 2011), but we also included other IRR coefficients (e.g., Pearson’s correlation coefficient). To enable proper interpretation, we only included studies reporting the size of the (sub-)sample used for calculating the IRR (cf. Trajković et al. 2011).

Empirical original studies published through a peer-review process were considered (i.e., commentaries and reviews were excluded). There were no restrictions regarding language or publication date.

Search Strategy

The PubMed (NCBI; 17th September 2018) and PsycInfo (EBSCOhost; 20th September 2018) databases were searched adapting the following search terms to the respective platforms: (mental* OR psych* OR therap*; TI/AB) AND (competenc*; TI/AB) AND (reliability OR ICC; all fields) AND (assessment* OR rater* OR rating*; all fields; humans). We did not exclude grey literature such as dissertations or conference abstracts. Further, we inspected the reference lists of relevant review papers (backward search; Barber et al. 2007; Kazantzis 2003; Muse and McManus 2013; Webb et al. 2010; Zarafonitis-Müller et al. 2014) and finished our search in November 2018.

Screening and Data Extraction

First, titles and abstracts were screened independently for inclusion (TP, RL). Then, full texts were retrieved and again screened independently by two researchers (TP, RL). Disagreements were resolved through discussion or by including a third reviewer (FK). Interrater agreement was determined for all full texts and amounted to κ = .65, which reflects good agreement (Higgins and Green 2011). For data extraction, we used a structured form including study, patient, therapy, therapist, rater and rating aspects (e.g., rater training or rating material). The form was piloted by two reviewers (TP, RL) on five publications. After the form was finalized, two master’s student reviewers (B.Sc. psych.; TP, RL) first extracted all data, and two licensed psychotherapists (FK, UM) independently double-checked all results.

Quality of Reporting

Drawing on the Guidelines for Reporting Reliability and Agreement Studies (GRRAS; Kottner et al. 2011), Duffy et al. (2013) proposed, in their review on the reliability of a specific measure of patients’ activities of daily living, a 7-item reporting tool. We adapted their tool to our research question and cross-checked it against the GRRAS. The final reporting checklist comprised the following aspects: (1) therapist sample (number, recruitment, qualification, experience, psychotherapy approach), (2) rater sample (number, recruitment, qualification, experience, psychotherapy approach), (3) administration of the ratings (rating material), (4) independence of the ratings, (5) rater training (form and amount of training), (6) patient sample (number, recruitment, diagnosis), and (7) blinding of the raters (availability of this information, no supervisors). Each of the seven aspects was rated as follows: 0 = insufficient, 1 = partly sufficient, and 2 = sufficient description in the primary study. Again, two independent raters (TP, RL) assessed the quality of reporting. For the sum scores, and before the resolution of disagreements, the IRR reached ICC(1,2) = .88 [CI = .70 – .95], which is considered high (Wirtz and Caspar 2002).
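For illustration only, the following minimal R sketch (with hypothetical scores, not the checklist data) shows how an ICC of this kind could be computed with the psych package; in that package’s output, ICC(1,2) corresponds to the average-measures coefficient of the one-way model (the “ICC1k” row when two raters are averaged).

```r
# Minimal sketch with hypothetical data: ICC for the sum scores of two raters.
# Assumes the 'psych' package; rows are rated publications, columns are raters.
library(psych)

quality_scores <- cbind(
  rater_1 = c(12, 9, 14, 7, 11, 13),  # hypothetical checklist sum scores (0-14)
  rater_2 = c(11, 10, 13, 8, 12, 13)
)

# psych::ICC() returns single- and average-measures coefficients for the
# one-way and two-way models; ICC(1,2) corresponds to the "ICC1k" row here.
ICC(quality_scores)
```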

Statistical Analysis

The outcome was the IRR, and different coefficients were reported across studies. The ICC, Finn’s r and the generalizability coefficient all refer to the same statistical model and were thus combined within one meta-analysis. Since data can only be pooled statistically if at least two comparable coefficients are available, the meta-analysis could be conducted only on the ICC (and not on the kappa and Pearson coefficients presented in Table 1) once the unit of analysis was defined. To avoid dependent data, the study, and not the publication, was the unit of analysis (Higgins and Green 2011). If multiple outcomes refer to the same study, the most straightforward procedure to avoid statistical dependency is to include only one outcome per study using pre-defined criteria (Quintana 2015). If multiple publications were based on the same study, we chose the one with the most comprehensive data. The same applied to multiple ICCs reported within one study: we chose the ICC based on the more comprehensive and less confounded data. Two studies reported two ICCs for the subscales of the instruments used (Brueck et al. 2009; Wittorf et al. 2013). In these cases, we transformed the ICCs to Fisher’s z values and then used the mean coefficient for further analyses.
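As a minimal illustration of this averaging step (the values below are hypothetical and are not the coefficients reported by Brueck et al. 2009 or Wittorf et al. 2013):

```r
# Sketch: averaging two subscale ICCs on the Fisher's z scale (hypothetical values).
icc_subscales <- c(0.75, 0.85)

z_values <- atanh(icc_subscales)  # Fisher's z transformation, z = atanh(r)
mean_icc <- tanh(mean(z_values))  # back-transform the mean z to the ICC metric
mean_icc                          # approximately 0.81
```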

Table 1 Complete evidence map of studies included in the qualitative summary

If multiple data sources were available, we gave priority to video (instead of audio) data, as they enable more comprehensive judgments, and to Cognitive Therapy Scale (CTS) data (instead of CTS-R data), since the CTS was much more common and thus easier to combine. As the terms CTS and CTRS are often used interchangeably (Muse and McManus 2013), we use the abbreviation “CTS” for both to avoid confusion.

If the study authors did not explicitly report that whole sessions were rated, we documented that the sessions were “probably complete”. However, since we concluded from most descriptions that whole sessions had been rated, whole-session ratings comprised the data points used for meta-analysis. If it was unclear how many sessions were rated per patient (e.g., 1–2), we chose the more conservative value (i.e., 1). Furthermore, if multiple options existed, we opted for expert raters, the higher number of ratings and entire sessions.

We performed a random-effects meta-analysis using the restricted maximum likelihood estimator. Correlations were converted into Fisher’s z values for all analyses and back-transformed for interpretation. We defined the sample size as the number of tapes that were rated. We tested for statistical heterogeneity using Cochran’s Q and the I² statistic (Higgins and Green 2011). A Baujat plot (Baujat et al. 2002) was used to examine potential outliers (Quintana 2015). To test for reporting bias, we used Egger’s test (Egger et al. 1997) and visually examined the funnel plots. Following the script by Quintana (2015), we used the “metafor” (Viechtbauer 2010), “robumeta” (Fisher and Tipton 2015) and “dplyr” (Wickham et al. 2015) packages for R (R Core Team 2018).
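A condensed sketch of this workflow with the metafor package is given below; the data frame, its values and the column names (icc, n_tapes) are hypothetical, and the calls follow the metafor documentation rather than the authors’ original script.

```r
# Sketch of the meta-analytic workflow described above (hypothetical data and column names).
library(metafor)

# One row per study: the selected ICC and the number of rated tapes
dat <- data.frame(
  study   = paste("Study", 1:5),
  icc     = c(0.82, 0.65, 0.90, 0.74, 0.88),  # hypothetical ICCs
  n_tapes = c(40, 25, 120, 30, 60)            # hypothetical numbers of rated tapes
)

dat <- escalc(measure = "ZCOR", ri = icc, ni = n_tapes, data = dat)  # Fisher's z and sampling variances

res <- rma(yi, vi, data = dat, method = "REML")  # random-effects model, REML estimator
predict(res, transf = transf.ztor)               # pooled estimate back-transformed to the correlation metric

res$I2; res$QE; res$QEp                          # I^2 and Cochran's Q heterogeneity statistics
baujat(res)                                      # Baujat plot to inspect potential outliers
funnel(res)                                      # funnel plot
regtest(res)                                     # Egger-type regression test for funnel plot asymmetry
```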

We explored each moderator separately using a series of meta-regression analyses (Quintana 2015). We derived the moderators from the previous literature (Muse and McManus 2013), specifically the number of raters, the quality of reporting, the number of sessions rated per patient, the number of therapists, the number of patients, the study design [randomized controlled trials (RCTs) vs. other], the rating material (audio vs. video/both), rater training (yes vs. no/not specified), independence of raters (yes vs. no/not specified), the scale used for the ratings (CTS-based vs. other), the form of therapy (CBT-related vs. other), therapist trainees (yes vs. partly/no/not specified) and patients’ diagnosis (depression & anxiety vs. other).
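For completeness, a minimal metafor sketch of such univariate meta-regressions, continuing the hypothetical data frame from the previous sketch; the moderator columns and their values are illustrative only.

```r
# Sketch: univariate meta-regressions, one moderator per model
# (continuing the hypothetical 'dat' from the previous sketch).
dat$n_raters       <- c(2, 2, 3, 2, 4)                   # hypothetical moderator values
dat$rater_training <- c("yes", "no", "yes", "yes", "no")

rma(yi, vi, mods = ~ n_raters,       data = dat, method = "REML")  # continuous moderator
rma(yi, vi, mods = ~ rater_training, data = dat, method = "REML")  # categorical moderator

# The QM statistic and its p value in each output provide the omnibus test of the moderator.
```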

Results

Characteristics of Included Studies

Through our literature search, we identified 1286 records. After we removed duplicates and added records from the reference lists of the included reviews, we screened 908 records by title and abstract. We finally included 20 studies reported in 32 publications in the narrative synthesis. The study flow chart and reasons for exclusion are illustrated in Fig. 1. A detailed description of the reasons for inclusion in and exclusion from the statistical analysis is presented in Supplement 1. The 20 IRRs that could be combined quantitatively are highlighted in bold in the evidence map (Table 1), which also provides further information. Since one publication (Schmidt et al. 2018) reported two studies and another referred to two samples (Dennhag et al. 2012a), the total numbers and percentages may vary within the following narrative synthesis.

Fig. 1 PRISMA flow chart of study inclusion

The included studies were conducted between 1983 (Chevron and Rounsaville 1983) and 2018 (Kazantzis et al. 2018; Schmidt et al. 2018), and 17 of the original studies were RCTs. Most studies focused on cognitive therapy (CT), CBT, comparisons with so-called third-wave interventions (Hoffart et al. 2005; McGrath 2013), psychoeducation (Weck, Hautzinger, et al. 2011; Weck, Weigel, et al. 2011), maintenance treatment (Weck, Hilling, et al. 2011) or a CBT-related intervention (motivational interviewing; Brueck et al. 2009). A minority of studies addressed psychodynamic therapy (Svartberg 1989; Tadic et al. 2003) and related interventions such as mentalisation-based treatment (Karterud et al. 2013) or interpersonal therapy (Chevron and Rounsaville 1983). In contrast, the seven publications by Barber et al. (see Table 1, superscripts 1 and 2) compared cognitive and psychodynamic therapy with counseling as well as individual versus group interventions. In most publications, patients were diagnosed with depression (n = 12, 37.5%); others included patients with substance dependence (n = 6, 18.75%), anxiety and depression (n = 2, 6.24%), anxiety alone (n = 3, 9.38%) or other diagnoses (n = 7, 21.89%), or no diagnosis was specified (n = 2, 6.24%). The number of included patients ranged from 6 (Schmidt et al. 2018, study 1) to 400 (Barber et al. 2004).

The study therapists were licensed (n = 11, 33.33%), in training (n = 10, 30.31%), or both (n = 6, 18.18%), or their qualification was not described in detail (n = 6, 18.18%). The number of therapists ranged from 5 (Svartberg 1989) to 51 (von Consbruch et al. 2012). In 16 publications, the ratings were based on video tapes (50%), in 11 on audio tapes (34.38%), in three on both (9.37%), and in two on other sources (6.25%). The number of tapes that were rated ranged from 10 (Vallis et al. 1986) to several hundred (see Table 1; Dennhag et al. 2012a). One to four sessions were rated per patient, with ratings mostly based on one session (n = 12, 36.36%) and mostly performed by two raters (n = 23, 63.88%). In most cases (n = 18, 56.25%), raters were trained; in five cases (15.63%), they received no training; and in nine publications (28.12%), this aspect was not specified. The raters were mostly (n = 24, 72.72%) described as independent of each other, whereas in some cases they were not independent (n = 4, 12.12%) or this facet was not specified (n = 5, 15.16%).

The quality of reporting of the respective studies was above average (i.e., 8–14 points) in 27 publications (84.38%) and below average (≤ 7 points) in five (15.62%). In contrast, using the dichotomous scaling (i.e., either 0 or 1) proposed by Duffy et al. (2013), the quality of reporting was rated as “sufficient” (i.e., scores ≥ 5) in n = 31 (96.88%) of the publications (see Table 1).

Most often (n = 16, 50%), the CTS or CTS-based instruments were used for assessing competence. As an IRR coefficient, most often (n = 27, 79.41%), the authors calculated different forms of the ICC. Less often, the generalizability coefficient (Karterud et al. 2013), Pearson’s r (Chevron and Rounsaville 1983; Vallis et al. 1988), Finn’s r (Kazantzis et al. 2018), the kappa coefficient (Kuyken and Tsivrikos 2009) or so-called inter-rater correlation (Dobson et al. 1985) were used.

Quantitative Synthesis

We conducted a meta-analysis of 20 studies referring to a total sample of n = 1272 tapes. The summary correlation was ICC = 0.82 [95% CI (0.74, 0.87), p < 0.001], which, at first glance, could be interpreted as appropriate (≥ .70; Wirtz 2017) or good IRR (≥ .75; Portney and Watkins 2009; Fig. 2). Still, statistical heterogeneity was considerable (I² = 90.39%; Q = 163.06, p < .0001; Higgins and Green 2011). According to the Baujat plot (Supplement 2), the studies by Barber and Crits-Christoph (1996, study 1) and by Dittmann et al. (2017, study 6) were potential outliers. Although these were the studies with the lowest (ICC = 0.42; Barber and Crits-Christoph 1996) and the highest (ICC = 0.97; Dittmann et al. 2017) IRRs, excluding these two studies changed the results only marginally.

Fig. 2 Forest plot of the average interrater reliability (ICCs with CIs)

Visual examination of the funnel plot (Fig. 3), which appeared symmetrical, yielded no indication of publication bias. Accordingly, Egger’s test for publication bias was not significant (p = 0.56). However, only 65% (n = 13, instead of the expected 95%) of the studies lay within the triangular region of the funnel plot, which again clearly indicates heterogeneity (Higgins and Green 2011).

Fig. 3 Funnel plot

The Role of Moderators

None of the investigated variables had an individual moderating effect: number of raters [Q(1) = 0.06; p = 0.80], quality of reporting [Q(1) = 0.75; p = 0.39], sessions rated per patient [Q(1) = 0.77; p = 0.38], number of therapists [Q(1) = 0.04; p = 0.84], number of patients [Q(1) = 0.05; p = 0.82], study design [Q(1) = 2.59; p = 0.11], form of therapy [Q(1) = 0.89; p = 0.35], therapist trainees [Q(1) = 0.09; p = 0.77], rating material [Q(1) = 1.11; p = 0.29], independence of raters [Q(1) = 0.0005; p = 0.98] and patients’ diagnosis [Q(1) = 0.82; p = 0.37]. Two variables, namely rater training [Q(1) = 2.96; p = 0.09] and the scale used for the ratings [Q(1) = 3.59; p = 0.06], had p values < .10.

Discussion

To the best of our knowledge, this is the first evidence synthesis on the reliability of psychotherapeutic competence ratings. The aims of this study were to provide a map of the current evidence, to estimate a pooled IRR, and to investigate moderators of the IRR of psychotherapeutic competence ratings.

In their narrative review, Muse and McManus (2013) reported ICCs for total CTS scores between .01 (no agreement) and .94 (nearly perfect agreement), which left uncertainty regarding the ability to rate psychotherapeutic competence reliably. Our meta-analysis revealed a pooled ICC of 0.82, indicative of appropriate reliability, but also severe heterogeneity; since both aspects are related to each other, they need to be interpreted jointly (Wirtz and Caspar 2002). Coefficients ranged from ICC = 0.42 (Barber and Crits-Christoph 1996) to ICC = 0.97 (Dittmann et al. 2017). Although this range might be attributable to the file drawer problem (Higgins and Green 2011), that is, the paucity of published studies showing small or no reliability, our results did not support publication bias. Moreover, the majority of study authors adhered to basic principles to improve the reliability of ratings, i.e., training raters or using video tapes to maximize the available information (Muse and McManus 2013).

Our qualitative synthesis yielded an evidence map that is more detailed (Dennhag et al. 2012b) and more systematic (Muse and McManus 2013) than the overviews given in previous reviews. Not surprisingly, it showed that most empirical studies referred to CBT and to patients diagnosed with depression. Consequently, the CTS was used most often, as it was developed specifically for CBT in the context of depression (e.g., Vallis et al. 1988). Although criticized for this specific focus, it is now also used for other diagnoses, such as psychosis, anxiety or personality disorders (Muse and McManus 2013). In addition, other comprehensive measures (e.g., Muse et al. 2017) or treatment-specific instruments (e.g., Machmutow et al. 2018) have been published but are still less commonly used than the CTS. Another perspective may be to successively improve established procedures.

According to our results, the number of tapes that were used ranged from ten to several hundred per study, and ratings were mostly based on a single session. In contrast, Dennhag et al. (2012b) show that, for example, for CT, three patients per therapist and four sessions per patient would be necessary to achieve appropriate reliability, which is far above the actual number. However, since competence ratings by trained raters are rather cost intensive, resource constraints may play a major role (Muse and McManus 2013).

Whereas older studies used Pearson correlation coefficients, which do not control for differing variances between raters (Wirtz 2017), the ICC has become the most prevalent reliability measure. In their recent publication, Kazantzis et al. (2018) proposed Finn’s r as a potentially useful alternative to some forms of the ICC when data are markedly non-normal and the number of categories actually used is restricted (e.g., when a 7-point scale exists but raters tend to use only four of the options).

Although these results raise confidence in the utility of competence scales, there are still unanswered research questions. Addressing these issues, and thus improving established procedures, may reduce the clinical and methodological diversity of primary studies and thereby facilitate statistical pooling in the future (Higgins and Green 2011). For example, raters were often described as independent of each other, but authors varied in how they explained that this independence was achieved, with studies reporting more (Dennhag et al. 2012a) or less detailed information (Kuyken and Tsivrikos 2009). One strategy to enhance rater independence is to view video tapes and give evaluations separately. Another is to view videos and discuss ratings at intervals in order to reduce rater drift, which refers to changing rating criteria over time (Warshaw et al. 2001). Apart from rater drift, other judgment and observational biases (Wirtz 2017) have rarely been investigated in the competence literature thus far; this could be another focus of future research.

Furthermore, the amount of rater expertise necessary remains an empirical question, with some arguing for more experienced raters and others arguing that, provided adequate training, novice raters may also produce reliable ratings (Muse and McManus 2016; Weck, Weigel et al., 2011). Moreover, the study purpose guides the choice of raters, that is, choosing supervisors if broader knowledge about therapists is necessary or independent judges if objectivity is to be maximized (Muse and McManus 2013).

Although no moderators proved significant in our first exploration, this finding does not indicate their unimportance; moderator analyses require larger samples, especially if studies of varying quality are included (Hempel et al. 2013). The same caveat applies to the fact that nine publications included small samples of ≤ 30 tapes. Due to power considerations, we only conducted univariate meta-regression analyses and thus could not control simultaneously for other variables (Meister et al. 2017). Other limitations of our meta-analysis include the inclusion of rather experienced therapists and the restriction to a subsample of 20 studies for the quantitative synthesis. Combining only comparable coefficients for meta-analysis was important to reduce statistical dependency among the coefficients (Quintana 2015).

Despite this strategy, there was considerable between-study heterogeneity, limiting the interpretability of our results. First, heterogeneity might be attributable to conceptual differences, as psychotherapeutic competence was defined in different ways in the primary studies. Second, it may be ascribed to differences in the methods used in the original studies, as evidenced by the fact that only about half of the original studies were RCTs, by the diversity in the quality of reporting, and by the diverse numbers of tapes, patients and therapists included. Adherence and competence ratings are often a by-product of clinical trials. Presumably, researchers invest in basic strategies to ensure reliable ratings in support of the main trial but may not be acquainted with the pitfalls and details accompanying proper competence ratings. Therefore, referring to important standards for rater training, such as clarification of raters’ implicit concepts, supervisor feedback, discussion of disagreements, discussion of (a)typical cases or the provision of category definitions (Wirtz 2017), as well as publishing manuals for rater training and using reporting guidelines (Kottner et al. 2011), will further contribute to advancements in this field of study.

In conclusion, the current meta-analysis provides first pooled results on the reliability of competence ratings and highlights considerable heterogeneity within the data. However, meta-analyses are restricted to the results published in primary research (Borenstein et al. 2009), which is why further experimental studies could extend the current results and directly compare relevant competence variables (e.g., contrasting ratings obtained via the CTS, the CTS-R or another instrument). Future studies could also investigate the validity of competence ratings to determine, for example, how to maximize it (e.g., in relation to a grade received after psychotherapy training or in relation to patient-related outcomes). It remains a vital part of process research to determine the specific bodies of knowledge, skills and attitudes that constitute an individually competent psychotherapist.