Introduction

Osteoarthritis (OA) of the knee is a debilitating disease of increasing prevalence [1]. The Australian Orthopaedic Association National Joint Replacement Registry (AOANJRR) has reported a 156% increase in the incidence of primary total knee arthroplasty (TKA) over the past 15 years, with OA recorded as the indication for TKA in 97.7% of cases [2]. The incidence of TKA in Australia is projected to increase by a further 276% by the year 2030 [3].

Radiography plays an integral role in the diagnosis and monitoring of knee OA. Several radiographic classification systems have been described in the literature to categorise the severity of OA. These systems have applications in disease monitoring, in epidemiological studies and clinical trials, and as adjuncts to clinical assessment to guide treatment [4,5,6].

The clinical value of grading OA severity radiographically remains contentious. Several studies have shown that patients with milder radiographic disease have lower satisfaction and poorer patient-reported outcome measures (PROMs) after TKA [7,8,9]. However, the definition of disease severity in these studies is heterogeneous, with no specific reference to defined radiographic criteria. Other studies have concluded that radiographic grading systems are rudimentary and are not necessarily correlated with pain [10,11,12].

Several classification systems for knee osteoarthritis have been described [4,5,6]. However, there is a paucity of studies that directly compare the intra- and inter-observer reliability of these classification systems in the older patient cohort being evaluated for arthroplasty. Furthermore, studies that have evaluated the reliability of different classification systems in other cohorts have, in the main, utilised specious statistical analysis, calling into question the validity of their reported inter- and intra-observer reliability values [13, 14].

In the outpatient setting, patients being evaluated for knee arthroplasty procedures may be seen by either a resident, registrar, or consultant surgeon. Reliability across a spectrum of experience levels is therefore required for a radiographic classification system to have clinical utility. The aim of this study was to determine the intra- and inter-observer reliability of three commonly used radiographic classification systems for knee OA by investigators of varying levels of experience in a cohort of arthroplasty candidates.

Materials and methods

Patient population

Ethical approval was obtained from the institutional ethics committee and informed written consent was acquired from all study participants. One hundred and twelve patients undergoing their first elective primary TKA were consecutively recruited between July 2018 and August 2019 at a regional public hospital in Australia. Patients undergoing revision procedures or those undergoing a second primary TKA were excluded.

Imaging and analysis

As a part of routine clinical care, pre-operative weight-bearing anteroposterior templating radiographs were obtained using Siemens Ysio classic x-ray systems; all studies were acquired to a technically equivalent protocol. Radiographs were taken no more than 6 weeks prior to a patient’s scheduled TKA procedure. All x-rays were de-identified before being submitted for analysis.

These 112 x-rays were reviewed by four observers: a lower limb arthroplasty surgeon with over 4 years of experience averaging > 75 TKAs annually (observer 1), an orthopaedic registrar (observer 2), an orthopaedic intern (observer 3), and a medical student (observer 4). The x-rays were reviewed on the same viewing equipment and platform (AGFA IMPAX 6.7.0.2530). Prior to reviewing the radiographs, all observers were provided with the original descriptions of the three classification systems utilised. Each observer reviewed the radiographs independently, with no knowledge of each other’s scoring. For classification systems that differentiate grades based on specific measurements, the IMPAX ruler tool was used to take these measurements. After a minimum interval of 2 weeks, the radiographs were re-ordered at random and re-classified independently by each observer. Observers were permitted to review the descriptions of each classification system again prior to the second round if they so desired. No discussion of results was permitted during the evaluation period.

X-ray classification

Radiographs were graded according to three of the most commonly used OA classification systems: the Kellgren-Lawrence, International Knee Documentation Committee (IKDC), and Ahlbäck classifications (Fig. 1; Table 1) [4,5,6]. These three classification systems were chosen as they were the most frequently referenced classification systems in our initial literature review.

Fig. 1

Representative x-rays that were uniformly classified as Ahlbäck grades 1 (a), 2 (b), and 3 (c) among reviewers

Table 1 Grading scales for the radiographic osteoarthritis classification systems [4,5,6]

Statistical analysis

Inter-observer reliability was calculated using Gwet’s AC2 agreement coefficient, which accommodates weighted (ordinal) data. Intra-observer reliability was calculated using the intraclass correlation coefficient (ICC). Analyses were conducted using Microsoft Excel.
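For readers wishing to reproduce these calculations outside of a spreadsheet, a minimal sketch of Gwet’s AC2 for multiple raters is given below. It follows the published formulas for weighted agreement; the linear weighting scheme for the ordinal grades and the example data are illustrative assumptions, not details of our Excel workbook.

```python
import numpy as np

def gwet_ac2(ratings, q, weights="linear"):
    """Gwet's AC2 for n subjects each rated by the same r raters
    on q ordinal categories coded 0..q-1 (no missing data)."""
    ratings = np.asarray(ratings)
    n, r = ratings.shape
    cats = np.arange(q)

    # Agreement weights: full credit for exact agreement, partial
    # credit for near-misses (linear or quadratic penalty).
    diff = np.abs(cats[:, None] - cats[None, :])
    if weights == "linear":
        w = 1.0 - diff / (q - 1)
    else:
        w = 1.0 - (diff / (q - 1)) ** 2

    # r_ik: number of raters assigning subject i to category k.
    r_ik = (ratings[:, :, None] == cats).sum(axis=1)

    # Weighted rater counts: r*_ik = sum_l w_kl * r_il.
    r_star = r_ik @ w.T

    # Observed weighted agreement, averaged over subjects.
    p_a = ((r_ik * (r_star - 1)).sum(axis=1) / (r * (r - 1))).mean()

    # AC2 chance agreement: pi_k is the mean share of ratings in category k.
    pi = r_ik.mean(axis=0) / r
    p_e = w.sum() / (q * (q - 1)) * (pi * (1.0 - pi)).sum()

    return (p_a - p_e) / (1.0 - p_e)

# Illustrative example: 4 observers grading 5 radiographs on a 5-point scale.
scores = np.array([[4, 4, 3, 4],
                   [2, 2, 2, 3],
                   [4, 4, 4, 4],
                   [1, 2, 1, 1],
                   [3, 3, 4, 3]])
print(round(gwet_ac2(scores, q=5), 3))
```

Any such implementation should be checked against a reference implementation (e.g. Gwet’s AgreeStat or the R package irrCAC) before use.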

Results

Patient demographics

Fifty-nine of the 112 radiographs were of left knees (53%). The mean patient age was 68 years (range 54–90). Forty-five percent of participants were female, and the mean body mass index (BMI) was 34.1 kg/m2 (range 22.1–48.0).

Inter-observer reliability

Agreement between observers was calculated for each classification system during both phases of the study. Inter-observer reliability for the Ahlbäck and Kellgren-Lawrence classifications demonstrated ‘substantial agreement’ for both phases of the study, and the IKDC demonstrated ‘almost perfect agreement’ for both phases (Tables 2 and 3) [15].

Table 2 Inter-observer reliability scores for each classification system

Intra-observer reliability

The ICCs for intra-observer reliability are shown in Table 4. The two more experienced observers (observers 1 and 2) demonstrated higher ICCs compared with the less experienced observers for all three classification systems, with observer 1 demonstrating ‘good reliability’ and observer 2 demonstrating ‘excellent reliability’. Observers 3 and 4 demonstrated ‘moderate reliability’ across all three classification systems (Tables 4 and 5) [16].

Table 3 Interpretation of correlation coefficients
Table 4 Intra-observer reliability scores for each classification system
Table 5 Interpretation of intraclass correlation coefficients (ICC)

Distribution of scores

The distribution of scores for each classification system during each phase of the study is presented in Fig. 2a–c. Between the first and second rounds of grading, there was a trend towards higher grades with the Ahlbäck classification in the second round; no comparable shift was evident for the other two classification systems.

Fig. 2

a Distribution of scores for the IKDC classification system for both rounds of grading. b Distribution of scores for the Kellgren-Lawrence classification system for both rounds of grading. c Distribution of scores for the Ahlbäck classification system for both rounds of grading

Discussion

This study has demonstrated substantial to almost perfect inter-observer reliability for all three considered radiographic classification systems of knee OA across all experience levels. The Ahlbäck classification yielded inter-observer reliability coefficients of 0.79 for both rounds of grading, the Kellgren-Lawrence system yielded coefficients of 0.82 and 0.85, and the IKDC yielded coefficients of 0.96 and 0.97. Experience appeared to influence the intra-observer reliability of the classification systems, with observer 1 demonstrating ‘good reliability’ for all three classifications, observer 2 demonstrating ‘excellent reliability’ for all three classifications, and observers 3 and 4 demonstrating ‘moderate reliability’ for all classifications (Tables 4 and 5).

The three classification systems evaluated in this study vary in the method by which they differentiate between grades of disease (Table 1). The IKDC classification emphasises the objective measurement of remaining joint space [5]. This is in contrast to the more traditional Kellgren-Lawrence classification, which focusses on the presence and size of osteophytes [4]. This differs again from the Ahlbäck classification, which is based on joint space narrowing and bone loss [6]. These fundamental differences determine how objectively measurable each system is, and ultimately its statistical reliability.

The IKDC classification demonstrated the greatest intra- and inter-observer reliability in this study. As this classification system is based upon the objective measurement of the remaining joint space, it may leave less room for subjective observer interpretation, and therefore less potential for disagreement between gradings. The Ahlbäck classification also uses objective measurements. However, these measurements rely on the observer identifying the pre-morbid native joint line, which introduces a potential source of discordance between observers. The Kellgren-Lawrence system is the most subjective of the three, requiring the observer to assess and quantify the size or degree of osteophytes, joint space narrowing, sclerosis, and bone end deformity. This may have contributed to it having the lowest inter-observer reliability score of the three systems in this study.

Of the three classification systems studied, the IKDC system reaches its ceiling earliest, with the highest grade of disease defined as a joint space of less than 2 mm [5]. This differs markedly from the Ahlbäck classification system, which assigns only grade 2 to joint space obliteration and reserves grade 5 for bone loss of greater than 10 mm [6] (see Table 1). Given that a large proportion of elderly patients awaiting total knee arthroplasty procedures in the Australian public health setting are likely to have significant joint space narrowing, this may have contributed to the high degree of inter-observer reliability observed with the IKDC classification system. This is highlighted by the distribution of scores obtained during this study. A total of 896 scores were obtained across both rounds (112 radiographs × 4 observers × 2 rounds), of which 725 received the maximum IKDC score of D. In comparison, 473 received the maximum Kellgren-Lawrence score of 4, and only 23 received the maximum Ahlbäck score of 5. The wider range of deformity encompassed by the Ahlbäck classification system appears more reflective of the spectrum of radiographic disease observed in the arthroplasty cohort, and it may therefore have greater clinical utility than the other classification systems in this patient group.

Several published studies have evaluated the inter-observer reliability of various knee osteoarthritis classification systems, with the majority concluding that they had poor reliability and should be utilised with caution [13, 17, 18]. However, no available studies compare the utility of the three classification systems investigated here in a specific cohort of arthroplasty patients. The largest of the existing studies is the MARS study, which evaluated six commonly used classification systems in a cohort of patients with a mean age of 28 years awaiting revision anterior cruciate ligament procedures [17]. This is in stark contrast to the average arthroplasty patient, and the cohort to whom a classification system is applied may be a key factor influencing its reliability. The results of our study suggest that inter-observer reliability may be greater in patients with more severe radiographic OA.

The conclusions of several of the published studies have also been based upon potentially specious statistical analysis [13, 14]. Some studies have used Fleiss’ kappa, a multi-rater agreement statistic, to calculate their inter-observer reliability. One of the assumptions made with Fleiss’ kappa is that the observers are ‘non-unique’. For observers to be ‘non-unique’, each individual patient x-ray must be reviewed by a random observer selected from a large pool of observers, rather than the same observers reviewing all x-rays [19]. These existing studies utilised unique observers in their design, and hence should not have used Fleiss’ kappa to assess inter-observer reliability. Fleiss’ kappa and other derivations of Cohen’s kappa are also known to be vulnerable to paradoxes in which calculated agreement coefficients are low even when agreement is known to be high [20]. This could explain the comparatively low inter-observer reliabilities demonstrated in the aforementioned studies when compared with this study. We elected to use Gwet’s agreement coefficient for ordinal data, as it has been shown to be comparatively paradox-resistant when compared with Cohen’s kappa [21]. It can also be applied in the setting of four unique observers, unlike Fleiss’ kappa [22].
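The kappa paradox is straightforward to reproduce numerically. The sketch below uses hypothetical counts (not data from this study) for two raters grading 100 knees as ‘severe’ or ‘not severe’: raw agreement is 90%, yet Cohen’s kappa is slightly negative because the marginal prevalence is heavily skewed, while Gwet’s AC1 remains high.

```python
# Hypothetical 2x2 contingency table: rows = rater A, columns = rater B.
n11, n10, n01, n00 = 90, 5, 5, 0   # 90 'severe/severe', 0 'not/not'
n = n11 + n10 + n01 + n00

p_o = (n11 + n00) / n              # observed agreement = 0.90

# Cohen's kappa: chance agreement from each rater's own marginals.
pA1, pB1 = (n11 + n10) / n, (n11 + n01) / n
p_e_kappa = pA1 * pB1 + (1 - pA1) * (1 - pB1)
kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)    # approx. -0.05

# Gwet's AC1: chance agreement from the pooled category prevalence.
pi1 = (pA1 + pB1) / 2
p_e_ac1 = 2 * pi1 * (1 - pi1)
ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)          # approx. 0.89

print(f"agreement = {p_o:.2f}, kappa = {kappa:.2f}, AC1 = {ac1:.2f}")
```

Prevalence patterns like this, where most arthroplasty candidates cluster at the severe end of a grading scale, are precisely the setting in which kappa-type statistics can understate true agreement.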

Our study demonstrated an association between intra-observer reliability and clinical experience. The more experienced observers 1 and 2, a lower limb arthroplasty surgeon and an orthopaedic registrar, demonstrated ‘good’ and ‘excellent’ intra-observer reliability for all classification systems (observer 1: ICC = 0.87 (Ahlbäck), 0.88 (K-L), and 0.88 (IKDC); observer 2: ICC = 0.93 (Ahlbäck), 0.91 (K-L), and 0.92 (IKDC)). This contrasts with the less experienced observers 3 and 4, an orthopaedic intern and a medical student, who achieved ‘moderate’ reliability (observer 3: ICC = 0.59 (Ahlbäck), 0.69 (K-L), and 0.56 (IKDC); observer 4: ICC = 0.60 (Ahlbäck), 0.52 (K-L), and 0.70 (IKDC); refer to Table 4). This implies that reliability improves with experience, a finding supported by other studies considering the intra-observer reliability of radiographic classification systems [13, 18].
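Because the ICC has several variants, the exact model matters when comparing values across studies. As an illustration only, the sketch below computes a two-way mixed-effects, consistency, single-measure ICC(3,1), one common choice for test-retest designs of this kind; the specific variant is an assumption, not a statement of the exact spreadsheet formula used.

```python
import numpy as np

def icc_3_1(x):
    """Two-way mixed-effects, consistency, single-measure ICC(3,1).
    x : (n, k) array; here n radiographs graded in k = 2 rounds
    by the same observer."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()    # between-subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()    # between-rounds
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_r = ss_rows / (n - 1)                               # subjects mean square
    ms_e = ss_err / ((n - 1) * (k - 1))                    # error mean square
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)

# Illustrative example: one observer's grades for 6 knees over two rounds.
rounds = np.array([[4, 4], [3, 4], [2, 2], [4, 3], [1, 1], [3, 3]])
print(round(icc_3_1(rounds), 2))
```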

This study’s strengths lie in its inclusion of observers of four different experience levels, which to our knowledge has not previously been reported; its generalisability to an older cohort of arthroplasty candidates; and its appropriate statistical analysis. However, it was a retrospective study with only one observer at each experience level. Whilst the findings remain generalisable to an older cohort of patients with relatively advanced OA, they are less generalisable to all patients presenting with knee pain, and are not necessarily reflective of the entire spectrum of disease, such as might be encountered in a general practice setting. The inclusion of a radiologist as an observer would also have provided a benchmark for comparison with the other observers. The radiographic classification systems considered assess only tibiofemoral OA and do not consider patellofemoral OA. Finally, this study investigated only the reliability of the three classification systems, not their validity, which is an area that would benefit from further research.

The clinical utility of orthopaedic classification systems is dependent upon their reliability, prognostic ability, and capacity to guide management of a particular condition [23]. The inter-observer reliability of the classification systems examined in this study provides a platform for future clinical and epidemiological studies in patients with knee osteoarthritis. Patient satisfaction and PROMs are becoming increasingly topical, particularly given the known incidence of patient dissatisfaction following TKA [24]. Future studies comparing pre-operative radiographic classification of knee OA with post-operative PROMs following TKA may benefit surgeons and patients alike in optimising treatment algorithms. Our research group is currently investigating whether pre-operative factors, such as radiographic disease severity, patient expectations, and private insurance status, predict post-operative satisfaction and PROMs in the same cohort of patients.

Conclusion

In contrast to previously published data, this study has demonstrated that the Ahlbäck, Kellgren-Lawrence, and IKDC radiographic classification systems have substantial to almost perfect inter-observer reliability, and moderate to excellent intra-observer reliability across varying experience levels, in an older cohort of arthroplasty patients.