Osteoarthritis (OA) is a chronic musculoskeletal condition that frequently results in varying degrees of pain, stiffness and activity limitations [1]. The weight-bearing joint most commonly affected by OA is the knee; approximately 24% of adults (aged 15 to 99 years) have knee OA [2]. Research on knee OA has predominantly focused on the tibiofemoral joint or considered the knee as one joint [3], despite studies showing that OA in the patellofemoral compartment is just as prevalent as, if not more prevalent than, OA in the medial or lateral tibiofemoral compartments [4,5,6,7,8]. In fact, a systematic review of population-based studies that recruited at least 215 people through “random sampling or convenience sampling from the community” found that around 25% of people aged 20–99 years have patellofemoral OA [9]. Furthermore, one study reported that, of those with radiographic knee OA and knee pain, 31% had isolated patellofemoral OA, 24% had isolated medial tibiofemoral OA and 20% had combined patellofemoral and tibiofemoral OA [4]. These findings emphasise the need for a greater focus on the patellofemoral joint in OA research.

More recently, magnetic resonance imaging (MRI) has been used in OA research to comprehensively quantify structural joint changes that may be difficult to visualise in radiographic imaging. Although MRI is expensive and not advised for OA diagnosis [10], it allows better visualisation of joint structural pathologies than radiography [11]. Bone marrow lesions, subchondral cysts, sclerosis, synovitis and effusion are a few of the features that are believed to contribute to, or be associated with, the pathophysiology and symptoms of knee OA [12,13,14,15]. Scoring systems, such as the Boston-Leeds Osteoarthritis Knee Score, Whole Organ MRI Score and MRI Osteoarthritis Knee Score (MOAKS), have been established to standardise the grading and reporting of OA features visible on MRI scans in all compartments of the knee [12, 16, 17]. In particular, the MOAKS is a comprehensive assessment tool for knee OA MRI evaluation. It was developed by building on the strengths and addressing the weaknesses of previous MRI-based assessment tools for OA evaluation [12, 16, 18]; it is reliable [12], and it has been widely used in research [11, 15]. However, the MOAKS is complex and requires specialised training, which limits its practical use in the research environment.

As a potential alternative to these more complex MRI-based OA assessment tools, this paper aims to evaluate an MRI-based Kellgren & Lawrence (K&L) grading tool for the patellofemoral joint. The K&L assessment tool is widely used for radiographic OA assessment. It evaluates the presence of osteophytes, joint space narrowing (JSN), sclerosis and bony deformity on radiographs [19]. Although Kellgren and Lawrence [19] described its use only for the tibiofemoral joint, it has previously been used to assess the patellofemoral joint in lateral and skyline/axial radiographic views [6, 20,21,22,23]. Riddle et al. [24] developed an MRI-based K&L grading for the patellofemoral joint within a cohort study evaluating the appropriateness of joint replacement surgery, because no axial radiographic views were available. The MRI-based K&L grading assesses the patellofemoral joint mostly using two OA features: osteophytes and cartilage loss. The intra-rater reliability appeared excellent [24]; however, inter-rater reliability and agreement were not evaluated. The primary aim of this study was to assess inter-rater reliability and agreement of an MRI-based K&L grading of the patellofemoral joint. The secondary aim was to validate the MRI-based K&L grading by comparing it with the reliable and validated MOAKS.

Methods

Overview

The present study is a secondary analysis of MRI scans undertaken at baseline during a double-blind randomised placebo-controlled trial, which was conducted over a 2-year period (2007–2009). The trial investigated the effect of glucosamine sulphate, chondroitin sulphate or the combination of both on disease progression in people aged 45–75 years with chronic knee pain [25].

Participants

Participants were recruited through general media advertising and general practices in New South Wales, Australia. Eligible participants were aged 45 years and over and had joint space narrowing in the medial tibiofemoral joint of a symptomatic knee. Exclusion criteria included rheumatoid arthritis or other inflammatory joint diseases, lower limb surgery within the last 6 months, bilateral knee replacements or plans for knee replacements during the study period. Participants gave informed written consent. The study was conducted in accordance with the Declaration of Helsinki and was approved by the local human research ethics committee.

Baseline demographics

Age, height, weight, analgesic use and knee pain duration (years) were collected. Participants completed the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC); pain (0–20) and physical function (0–68) subscale scores were calculated, with higher scores representing greater knee pain and activity limitations.

MRI technique

The participants’ knees were imaged using a dedicated knee coil in a 3-T magnet (GE Signa HDx). Each examination consisted of axial proton density-weighted turbo spin echo images (with repetition time (TR) of 3900; echo time (TE) of 40; echo train length (ETL) 8; 3 mm slice thickness; 0.3 mm intersection gap; 13 cm field of view (FOV); 384 × 320 matrix), sagittal proton density-weighted fat suppressed turbo spin echo images (TR 3400; TE 40; ETL 7; 3 mm slice thickness; 0.3 mm gap; 14 cm FOV; 384 × 320 matrix) and sagittal T2-weighted turbo spin echo images (TR 1060; TE 6.5; ETL 1; 3 mm slice thickness; no gap; 16 cm FOV; 320 × 224 matrix). Total acquisition time (including the initial survey sequence) was 30 min.

MRI-based K&L grading

The MRI-based K&L grading is a surrogate for the radiographic K&L scale. The MRI-based K&L grading ranges from 0 to 4:

  • 0: Normal

  • 1: No definite osteophytes or joint space narrowing, but there may be minimal cartilage, bone or periarticular changes

  • 2: Definite osteophyte with focal cartilage loss but no extensive cartilage involvement/no joint space narrowing

  • 3: Osteophytes with significant cartilage loss at either the medial or lateral patellar and/or trochlear surfaces

  • 4: Osteophytes with complete cartilage loss involving more than 50% of the medial and/or lateral patellofemoral joint

The axial and sagittal views of the patellofemoral joint were assessed.

MRI osteoarthritis knee score

All participant MRI scans were also graded using the MOAKS, a reliable and validated scoring system that was developed specifically for knee OA assessment using MRI scans [12]. There are 12 subscales in the MOAKS, including size of osteophyte, percentage of any cartilage loss (partial and full-thickness loss), percentage of full-thickness cartilage loss, volume of bone marrow lesions and effusion-synovitis. Raters used this assessment tool to grade individual OA features from normal (0) to severe (3). In order to validate the MRI-based K&L grading for the patellofemoral joint, cartilage and osteophyte MOAKS scores in the patellofemoral joint (medial patellar, lateral patellar, medial trochlear and lateral trochlear regions) were compared with the scores attained from the MRI-based K&L grading. Other subscales of the MOAKS, such as bone marrow lesions, cysts and synovitis-effusion, were excluded from this study because these features are not assessed in the MRI-based K&L grading. The cartilage and osteophyte MOAKS scores were then averaged to obtain an overall patellofemoral joint score, which was compared with the MRI-based K&L grading scores.
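
As an illustration only, the averaging and comparison described above might be sketched as follows. This is not the code used in the study: the function names (`composite_score`, `average_ranks`, `spearman_rho`) are hypothetical, the comparison statistic is assumed to be Spearman's rank correlation as named under Statistical analysis, tied scores are assumed to receive the average of their ranks, and no p-value is computed.

```python
import math


def composite_score(region_scores):
    """Average the regional MOAKS scores (each 0-3) into one joint-level score."""
    return sum(region_scores) / len(region_scores)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Under these assumptions, each participant's four regional scores would be passed to `composite_score`, and the resulting list of composite scores would be passed to `spearman_rho` together with the corresponding MRI-based K&L grades.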

Procedure

A senior radiologist (A.P), who was experienced with using the MRI-based K&L grading, conducted the training, which consisted of evaluating images from 10 randomly selected participants with each rater according to the MRI-based K&L grading. The raters were then given another set of MRI scans from 20 randomly selected participants to assess independently. Disagreements and inconsistencies among the readers were addressed and discussed until a consensus was reached. An atlas was created from the 30 sets of MRI scans for training to visually demonstrate each grade (Appendix).

Of the 304 participants with available knee MRI scans, 50 MRI scans were randomly selected. The 30 MRI scans that were used for training were excluded from the selection. The images were graded by three raters: the primary investigator (S.K) and two radiologists (A.P, J.M). All raters were blinded to clinical information and radiologic reports. The primary investigator was a health researcher and a novice reader with no formal radiology training. RadiAnt DICOM viewer [26] was used to view the MRI scans for assessment. Sagittal and axial MR images were used together to attain a total view of the patellofemoral joint. When the raters gave discordant grades to the views of the same participant, the worst grade was used. There was a 2-week interval between the first and second readings for the assessment of intra-rater reliability. The primary investigator also assessed the MRI scans using the MOAKS approximately 1 year after the MRI-based K&L grading assessments.

Statistical analysis

The intra-class correlation coefficient (ICC) was used to assess intra- and inter-rater reliability, using SPSS (SPSS Inc., Chicago, IL). ICC assesses intra- and inter-rater reliability by measuring the variance of scores between the raters [27]. Model 3 was used to calculate the ICC because all three raters assessed each participant and the raters were fixed. The average measure across the scores of the three raters was used to determine the ICC for the test. Reliability is considered to be poor when the ICC is below 0.40, fair when it is 0.40–0.59, good when it is 0.60–0.74 and excellent when it is 0.75 or above [28].
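
For readers without access to SPSS, the calculation can be sketched as below. The sketch assumes that "Model 3" with an average measure corresponds to the Shrout and Fleiss ICC(3,k), that is, a two-way mixed-effects, consistency-type coefficient; the text does not state whether a consistency or absolute-agreement definition was selected, so this is an assumption. The function names are hypothetical and confidence intervals are not computed.

```python
def icc3k(scores):
    """ICC(3,k): two-way mixed effects, consistency, average of k fixed raters.

    scores: list of rows, one row per participant, one column per rater.
    """
    n = len(scores)        # number of participants
    k = len(scores[0])     # number of raters
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Partition the total sum of squares into participants, raters and residual.
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


def interpret_icc(icc):
    """Map an ICC to the qualitative bands cited in the text [28]."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"
```

Because this coefficient measures consistency, a rater who systematically grades one step higher than the others does not lower it, which is one reason it can remain high while exact agreement between raters is only moderate.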

Intra-rater (S.K) and inter-rater agreement were measured using Cohen’s weighted kappa in Excel. A predefined table of weights (Table 1) was used to measure the degree of disagreement between the two raters (linear weighted kappa). The observed frequencies of scores were cross-tabulated in a two-way contingency table. A resource package provided by real-statistics.com [29] was then used to calculate the weighted kappa, standard error and 95% confidence intervals (CI), using the two constructed tables. A weighted kappa of less than 0.20 indicates poor agreement, 0.21–0.40 indicates fair agreement, 0.41–0.60 indicates moderate agreement, 0.61–0.80 indicates good agreement and a weighted kappa greater than 0.80 is interpreted as very good agreement [30]. Spearman’s correlation coefficient for ordinal scales was used to statistically compare the MRI-based K&L grading and the MOAKS for validity. The strength of a correlation is considered to be small when ρ = 0.10, medium when ρ = 0.30 and large when ρ = 0.50 [31].
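
The weighted kappa calculation can be illustrated with the short sketch below. It is not the Excel procedure used in this study: the function names are hypothetical, grades are assumed to be integers from 0 to 4, and the linear disagreement weight for a pair of grades is assumed to be their absolute difference, which may be scaled differently from the values in Table 1 (a constant scaling does not change kappa). The standard error and confidence interval are omitted.

```python
def linear_weighted_kappa(rater_a, rater_b, n_categories=5):
    """Cohen's linear weighted kappa for two raters grading the same scans."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must grade the same number of scans")
    n = len(rater_a)
    c = n_categories

    # Observed contingency table: rows are rater A grades, columns rater B.
    observed = [[0] * c for _ in range(c)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(c)) for j in range(c)]

    # Weighted disagreement actually observed versus expected by chance.
    observed_dis = sum(
        abs(i - j) * observed[i][j] for i in range(c) for j in range(c)
    )
    expected_dis = sum(
        abs(i - j) * row_totals[i] * col_totals[j] / n
        for i in range(c)
        for j in range(c)
    )
    if expected_dis == 0:
        raise ValueError("kappa is undefined when all grades are identical")
    return 1 - observed_dis / expected_dis


def interpret_kappa(kappa):
    """Map a weighted kappa to the qualitative bands cited in the text [30]."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"
```

With linear weights, a disagreement of two grades counts twice as much as a disagreement of one grade, so near misses between adjacent grades are penalised less than large discrepancies.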

Table 1 Predefined table of weights to calculate Cohen’s weighted kappa

Results

Study sample

The mean age for the sample (26 females) was 61.1 years (SD 8.4), and the mean BMI was 27.4 kg/m² (SD 4.3). On average, participants had mild symptoms at baseline, with mean WOMAC pain score of 6.0 (SD 3.9) and physical function score of 15.6 (SD 12.1) (Table 2).

Table 2 Demographics (n = 50): mean (standard deviation)

Intra-rater reliability

Two-week intra-rater reliability was excellent (ICC = 0.91; 95% CI: 0.82–0.95). Intra-rater agreement was good (κ = 0.69).

Inter-rater reliability

Inter-rater reliability among all three raters was excellent (ICC = 0.88; Table 3). When pairs of readers were compared, the ICC remained above 0.75 for each pair of raters (Table 3). However, Cohen’s weighted kappa showed that inter-rater agreement was only moderate, ranging between 0.49 and 0.57 (Table 3).

Table 3 Agreement for MRI-based K&L grading between three raters using weighted kappa and ICC, as well as standard errors and 95% CIs

Validity

Our results reveal that the MRI-based K&L grading was correlated with MOAKS osteophytes in the superior patellar, inferior patellar, medial trochlear and lateral trochlear regions (Table 4). The percentage of cartilage loss (both partial and full thickness) and the percentage of full-thickness cartilage loss in all regions of the patellofemoral joint were also correlated with MRI-based K&L grading scores (Table 4). Although these correlations were statistically significant, the coefficients were not large and the strength of the correlations was medium (ρ = 0.37–0.58) (Table 4).

Table 4 Spearman’s correlation coefficients between MRI-based K&L grading for the patellofemoral joint scores and MOAKS features (individual and combined)

When the average of the MOAKS scores was calculated and compared with the MRI-based K&L grading scores, the correlation coefficients were strong and remained statistically significant (Table 4). The strongest correlation occurred when the average percentage of full-thickness cartilage loss in all of the patellofemoral joint regions was calculated (ρ = 0.65, p < 0.001) (Table 4). The correlation coefficient decreased when osteophytes were included in the analysis (ρ = 0.55, p < 0.01) (Table 4).

Discussion

We evaluated intra- and inter-rater reliability and agreement of a recently described MRI-based K&L grading [24]. The grading from MRI scans demonstrated excellent intra-rater reliability and good intra-rater agreement, even when performed by an inexperienced reader. We found that while inter-rater agreement was moderate, there was excellent inter-rater reliability between the three raters. We also demonstrated validity of the scale, with a strong correlation between the total MOAKS scores and the MRI-based K&L score. These findings indicate that the MRI-based K&L grading for the patellofemoral joint examined in this study could be a useful tool for researchers and clinicians to assess and monitor patellofemoral OA.

The MRI-based K&L grading was originally developed so that the patellofemoral joint could be assessed using MRI, in conjunction with the radiographic K&L assessment of the tibiofemoral joint [24]. Unlike the original study, where an experienced radiologist was employed, a novice reader performed the intra-rater assessments in the present study. Despite the reader’s limited experience with reading MRI scans, good agreement was obtained (weighted κ = 0.69). However, our intra-rater agreement was not as high as that reported in the original study, in which an experienced radiologist assessed the MRI scans (weighted κ = 0.80) [24]. This finding suggests that when a novice rater receives extensive training, as in our study, acceptable consistency in the grading of patellofemoral OA severity by one rater can be achieved.

In contrast to the intra-rater results, the inter-rater reliability and agreement results differed markedly from each other, which may be due to the conceptual differences between agreement and reliability. Reliability (using ICC) is an assessment of the variability of the selected study objects (i.e. participant MRI scans) [32]. Agreement (using weighted kappa) assesses how closely the raters agree on the same measures (i.e. measurement error) [32]. The excellent ICC and moderate weighted kappa results reveal that, although the raters are able to differentiate between the different severities consistently, they do not always assign the same grade. Perhaps more training is required for novice readers so that they can identify more subtle features that could result in more severe OA grades and, therefore, be more consistent with expert readers. However, only moderate agreement was also seen between the expert readers (Table 3), suggesting that disagreement between the raters could reflect the limitations of the MRI-based K&L grading rather than the raters’ experience. To overcome these limitations, the assessment tool may need to be refined to improve the agreement between raters. For now, any disagreements need to be discussed further to reinforce the distinguishing features of each grade. Furthermore, in clinics, it would be preferable for the same reader to evaluate the MRI scans when following a patient’s structural disease progression.

The validity findings suggest that the MRI-based K&L grading is not a good alternative to the MOAKS if the assessor is evaluating individual structural pathologies of OA in the patellofemoral joint. However, when the MOAKS scores were combined, the correlation coefficient between the total MOAKS and the MRI-based K&L grading was stronger. Since the MRI-based K&L grading evaluates osteophytes and cartilage in combination to yield an overall patellofemoral joint score, it would be appropriate to compare it with a similar outcome (that is, combining MOAKS osteophytes and cartilage loss subscale scores). However, to the authors’ knowledge, the MOAKS scores have not previously been combined, and the clinical significance of a total MOAKS score is unknown. Furthermore, a total patellofemoral joint score may not be ideal for MRI-based OA assessment, as severe OA changes in some regions may outweigh milder changes in other regions. That is, a total patellofemoral joint score may be reported as more severe than the changes in specific areas of the joint (e.g. medial and lateral patellar surfaces). Yet, changes in a small region of the joint may not be clinically meaningful. With the large number of subregions and subscales in the MOAKS and other MRI-based OA assessment tools, it may be difficult to reach a sound conclusion that is clinically meaningful for the patient. It is up to the clinician or researcher to decide whether to present to their patients an overall patellofemoral joint score or all the findings in each region and each subscale. The latter could be detrimental to the patient, as medical terms that imply the presence of disease could result in less understanding and more fear [33], and perhaps a poorer prognosis. A previous study has shown that “degenerative terms” on radiological reports lead to poorer perceived prognosis among people with low back pain [34]. Reports using complex MRI-based OA assessment tools may lead patients to catastrophise their OA, provoking fear and potentially leading to a poorer prognosis. The MRI-based K&L grading provides a simplified score, which may be clinically useful and more meaningful to patients.

The strengths of this study include a training protocol for all raters in the study, the development of an atlas and detailed statistical consideration. The most experienced radiologist trained the other raters to promote consistency, as the three raters had varying degrees of training and experience with assessing MRI scans. All raters were blinded to any clinical data and radiological reports during the assessments. The raters had no prior knowledge of the participants’ patellofemoral and/or tibiofemoral joint status, which minimised this source of bias in the assessments. Furthermore, participants’ MRIs were randomly selected. For the statistical analyses, we employed two statistical tests to ensure evaluation of both reliability and agreement. Additionally, we assessed the validity of the MRI-based K&L grading by comparing it to a reliable MRI-based assessment tool.

Study limitations

Limitations of this study include the restriction of the evaluation to the patellofemoral joint, excluding the tibiofemoral joint, and the inclusion of only participants with knee pain. The inclusion of tibiofemoral joint assessment and a “normal” participant cohort (no knee pain) would strengthen the usefulness of the MRI-based K&L grading for knee OA assessment using MRI scans. Future studies should consider assessing the MRI-based K&L grading for the tibiofemoral joint to allow a whole joint assessment to be conducted. Furthermore, future studies should also consider comparing the MRI-based K&L grading with radiographic K&L evaluation of the patellofemoral joint to determine comparability. Since MRI could be more sensitive than radiographic evaluations, raters may identify more people with mild knee OA using MRI than with radiography. It would also be valuable to evaluate the MRI-based K&L grading in the clinical setting and to evaluate patients’ responses to standardised terminology or definitions of OA as seen on radiological reports. It would be interesting to see whether the terminology used in the MRI-based K&L grading and other MRI-based OA assessment tools elicits fear and potentially leads to a poorer perceived prognosis.

Conclusions

The radiographic K&L is a simple and well-recognised assessment of OA structural disease severity. Therefore, this MRI-based K&L grading could be adopted by those who work with radiography and are less familiar with MRI, when radiographs are not available for assessment. Our results demonstrate that researchers and clinicians with different levels of experience can use the grading to assess OA. As it assesses only two MRI features of OA, it is simple and easy to follow and understand. Furthermore, this grading assessment is another option for the assessment of the patellofemoral joint [35]. Although disease in the patellofemoral joint (combined with tibiofemoral OA) contributes to more pain and functional limitations than isolated tibiofemoral OA [6, 36, 37], fewer MRI studies have been conducted on the patellofemoral joint than on either the medial or lateral tibiofemoral compartments [11]. Therefore, the MRI-based K&L grading is an important contribution, as it provides a less time-consuming score that could be used to monitor OA progression in larger cohorts.