Introduction

Osteoarthritis (OA) is a chronic, generally progressive disease characterized by joint pain and functional disability [1]. As the disease progresses, joint pain worsens and performing the activities of daily life becomes difficult. Health-related quality of life (HRQL) worsens, commonly affecting one’s mood, the ability to sleep comfortably, and the ability to enjoy social and recreational activities. When conservative treatment is no longer effective in providing pain relief and functional improvement, a decision may be made for total joint replacement (TJR).

Increasing emphasis is being placed on patient-reported outcome measures (PROMs) to evaluate the impact of TJR on HRQL [2, 3]. On average, patients undergoing TJR achieve significant improvement during the first 6–12 months following surgery [2, 4]; unfortunately, 7–30% of patients report little or no improvement or are not satisfied with the surgical results [1, 5,6,7,8], with poorer results for knee than for hip surgery.

Condition-specific and generic PROMs are often used in combination in arthroplasty registries to assess the HRQL of patients with TJR [9]. PROMs must be tested in the intended population to ensure that they meet acceptable standards of reliability and validity. Validity is the degree to which empirical evidence and theoretical rationales support the interpretability, appropriateness, and usefulness of a score [10]. One approach to assessing the validity of a new HRQL tool is to compare it with generic and condition-specific tools recognized as being reliable and valid in the population of interest. Convergent validity is the degree to which the scores correlate with other measures that assess a similar construct. An important aspect of validity in any outcome measure is the instrument’s responsiveness, its ability to detect change in health status when it has occurred. While condition-specific PROMs usually have higher responsiveness, it is important to include generic PROMs to compare outcomes across different populations [11].

The EQ-5D-3L is a generic preference-based HRQL instrument for describing and valuing health [12]. It is one of the most common generic instruments used in arthroplasty registries [9] and is useful in economic evaluations where utility values are used to calculate quality-adjusted life years (QALYs) [13, 14]. It is based on a descriptive system that defines health in terms of five dimensions (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression), each with three response options. Studies comparing the responsiveness of the EQ-5D-3L with that of condition-specific measures in TJR show that the EQ-5D-3L is less responsive than the Oxford hip (OHS) and knee (OKS) scores [3, 13] and other joint-specific measures [13, 15]. The inability of the EQ-5D-3L to discriminate small to moderate differences in health status and its poor responsiveness were attributed to the wording of mobility [13] and a strong ceiling effect [15]. The three response options limit the range of responses for individuals with moderate to severe disability [16, 17]. For example, the three response options for mobility lack the flexibility necessary to describe the mobility limitations of OA patients who may need a walking aid but are not confined to bed.

The recently developed 5-level EQ-5D, the EQ-5D-5L, addresses these issues by expanding the range of responses for each dimension from three to five levels and modifying the labels for some of the options. For example, the wording for the most severe level of mobility was changed from “I am confined to bed” to “I am unable to walk about” [18]. Respondents also rate their overall health on a 0–100 visual analogue scale (EQ-VAS). The EQ-5D-5L has been compared to the EQ-5D-3L in population surveys [19,20,21] and clinical groups [22,23,24,25,26]. Its responsiveness has not been reported in longitudinal studies of patients having TJR.

The purpose of this study is twofold: (1) to assess the validity of the generic EQ-5D-5L and compare it to two commonly used PROMs (condition-specific and generic) in patients with OA with total hip (THR) or knee (TKR) replacement surgery; validity evidence included ceiling effects, convergent validity, and responsiveness; (2) to compare distribution-based and anchor-based methods of assessing responsiveness one year following TJR.

Methods

A questionnaire was administered to consecutive patients referred to orthopaedic practices in Halifax, Nova Scotia (QEII Health Sciences Centre) and Winnipeg, Manitoba (Concordia Joint Replacement Group) from March 2012 to February 2014. In Halifax, the orthopaedic surgeons at the QEII work loosely as a group. They receive about 95% of referrals from the Halifax area (Capital District) and approximately 50% of the referrals from the rest of the province. The Concordia Joint Replacement Group consists of four orthopaedic surgeons who practice as a group. They receive approximately 50–70% of all referrals in Winnipeg.

Patients meeting the inclusion criteria were adults (> 18 years) who had been referred to an orthopaedic surgeon as likely candidates for a primary TJR due to OA. Ethics approval was obtained from the University of Calgary Conjoint Health Research Ethics Board, the University of Manitoba Health Research Ethics Board, and the Nova Scotia Health Authority Research Ethics Board.

An information sheet and the questionnaire with a postage-paid return envelope were mailed shortly after the clinic received the referral form. Non-responders received a 2-week follow-up reminder and a 4-week repeat survey package. Patients were assessed following referral to the orthopaedic surgeon to obtain baseline measures and at 12 months following surgery. A questionnaire was also sent following the orthopaedic consultation to assess test–retest reliability of the outcome measures.

Outcome measures

Patients completed HRQL questionnaires in the following order: the OHS or OKS, the EQ-5D-5L, EQ-VAS, and the Short Form 12 (SF-12). The OHS and OKS are 12-item questionnaires designed to assess the pain and symptoms of patients having THR [27] or TKR [28, 29]. The 12 items, each measured on a 5-point scale, assess hip or knee pain, mobility (walking, climbing, changing position), self-care (washing, drying, putting on socks), and activities such as shopping. If two or fewer items were left unanswered, we replaced the missing data with the mean value of the other responses, as per the scoring protocol for the OHS and OKS [29]. Summary scores range from 0 (most severe symptoms) to 48 (least symptoms). Because the OHS and OKS were designed specifically for patients having a THR or TKR, they were used as the ‘gold standard’ by which to assess responsiveness of the EQ-5D-5L. To calculate the EQ-5D-5L index, we used the value set described by van Hout et al. [30] to convert the descriptive system to a summary index score. The SF-12 is a generic HRQL questionnaire which produces two standardized summary scores, the Physical (PCS-12) and Mental Component Summary (MCS-12) [31], that range from 0 (worst) to 100 (best). The PCS-12 includes items related to general health, daily activities, and the effect of pain on work, both outside and inside the home. The MCS-12 items assess emotional health and social activities. The SF-12 has been used to assess HRQL in TJR [11, 13, 32, 33]. Missing values for the EQ-5D-5L and SF-12 were not imputed. Demographic variables included age, sex, marital status, education, the problem joint (hip or knee), and previous TJR in another joint.
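As an illustration only, the Oxford scoring rule described above could be sketched in Python as follows. The function name oxford_score and the coding of each item from 0 to 4 are assumptions made for this sketch (the 0–4 coding is chosen because it is consistent with a 12-item summary score ranging from 0 to 48); this is not the study's analysis code.

```python
from typing import Optional, Sequence


def oxford_score(items: Sequence[Optional[int]],
                 max_missing: int = 2) -> Optional[float]:
    """Sum 12 Oxford items, imputing up to `max_missing` unanswered items.

    Assumption for this sketch: each item is coded 0 (worst) to 4 (best),
    and an unanswered item is represented by None.
    """
    if len(items) != 12:
        raise ValueError("the OHS and OKS each have 12 items")

    answered = [v for v in items if v is not None]
    if any(v < 0 or v > 4 for v in answered):
        raise ValueError("item scores must lie between 0 and 4")

    n_missing = len(items) - len(answered)
    if n_missing > max_missing:
        # Too many unanswered items: no summary score is produced.
        return None

    # Replace each missing item with the mean of the answered items.
    item_mean = sum(answered) / len(answered)
    return sum(answered) + n_missing * item_mean
```

For example, a questionnaire with eleven items answered as 2 and one item unanswered yields a summary score of 24.0 under these assumptions, whereas a questionnaire with three unanswered items yields no score.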

Data analysis

The distribution of the EQ-5D-5L responses across the five levels at baseline and 12 months post-surgery was assessed. This assessment is important to ensure that the scale captures the full range of potential responses and that changes over time can be detected. Significant ceiling or floor effects weaken the ability of a scale to detect deterioration or improvement in health status. Ceiling and floor effects were calculated as the percentage of patients showing the best and worst health status for each outcome measure at baseline and 12 months post-surgery. Floor or ceiling effects were defined as small (1–15%) and moderate (> 15%) [34].
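The floor and ceiling calculation described above can be sketched as follows. The function names and the "none" label for percentages below 1% are assumptions introduced for this illustration; only the small (1–15%) and moderate (> 15%) bands come from the text.

```python
from typing import Sequence, Tuple


def floor_ceiling(scores: Sequence[float],
                  worst: float, best: float) -> Tuple[float, float]:
    """Return (floor %, ceiling %): the share of patients at the worst
    and at the best possible score of an outcome measure."""
    if not scores:
        raise ValueError("no scores supplied")
    n = len(scores)
    floor_pct = 100 * sum(s == worst for s in scores) / n
    ceiling_pct = 100 * sum(s == best for s in scores) / n
    return floor_pct, ceiling_pct


def effect_label(pct: float) -> str:
    """Classify a floor or ceiling percentage.

    'small' and 'moderate' follow the bands stated in the text;
    'none' for values below 1% is an assumption of this sketch.
    """
    if pct > 15:
        return "moderate"
    if pct >= 1:
        return "small"
    return "none"
```

With four hypothetical Oxford scores of 48, 48, 30, and 0, the sketch returns a floor of 25% and a ceiling of 50%, both of which would be labelled moderate.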

Convergent validity

We assessed the convergent validity between the EQ-5D-5L and the other measures with Pearson correlation coefficients. Based on the literature [3, 13, 35, 36] and item content, we hypothesized that the EQ-5D-5L and the Oxford scores would correlate strongly (> 0.5), as both tools measure similar constructs relevant to OA patients: pain, mobility, self-care, and activities. We hypothesized that correlations of the EQ-5D-5L with the PCS-12 and EQ-VAS would be moderate (0.3–0.5) as both the PCS-12 and EQ-VAS assess general health, and that correlations with the MCS-12 would be moderate, as only one EQ-5D-5L item assesses mental health.
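A minimal sketch of how such correlations could be computed and compared with the hypothesized bands is given below. The function names are hypothetical, and the "weak" label for coefficients below 0.3 is an assumption added for completeness; the strong (> 0.5) and moderate (0.3–0.5) bands are those stated above.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (n >= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strength(r: float) -> str:
    """Label a coefficient using the bands in the text.

    'weak' (below 0.3) is an assumption of this sketch.
    """
    size = abs(r)
    if size > 0.5:
        return "strong"
    if size >= 0.3:
        return "moderate"
    return "weak"
```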

Responsiveness

Because responsiveness coefficients vary by population and context, multiple approaches to assessing responsiveness are recommended [37]. We used both distribution-based and anchor-based approaches. Distribution-based approaches include the effect size (ES), the standardized response mean (SRM), and the standard error of measurement (SEM). The ES was calculated by dividing the difference in baseline and 1-year post-surgery scores by the standard deviation of the baseline score. The SRM was calculated by dividing the difference score by the standard deviation of the difference. Effect sizes were interpreted as small (0.2), medium (0.5), and large (0.8) using Cohen’s conventions [38].
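The two ratios defined above can be written out as a short sketch. The function names are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify which estimator was used.

```python
import statistics
from typing import List, Sequence


def change_scores(baseline: Sequence[float],
                  followup: Sequence[float]) -> List[float]:
    """Paired change scores (follow-up minus baseline)."""
    if len(baseline) != len(followup):
        raise ValueError("baseline and follow-up must be paired")
    return [f - b for b, f in zip(baseline, followup)]


def effect_size(baseline: Sequence[float],
                followup: Sequence[float]) -> float:
    """ES: mean change divided by the SD of the baseline scores."""
    changes = change_scores(baseline, followup)
    return statistics.mean(changes) / statistics.stdev(baseline)


def standardized_response_mean(baseline: Sequence[float],
                               followup: Sequence[float]) -> float:
    """SRM: mean change divided by the SD of the change scores."""
    changes = change_scores(baseline, followup)
    return statistics.mean(changes) / statistics.stdev(changes)
```

For hypothetical baseline scores of 10, 20, and 30 and follow-up scores of 30, 35, and 55, the mean change is 20, the baseline SD is 10, and the SD of the change scores is 5, giving an ES of 2.0 and an SRM of 4.0. The example shows why the two indices can differ markedly for the same data: the SRM depends on how uniformly patients change rather than on baseline variability.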

All PROM assessments have some measurement error [39]. The SEM is the standard deviation of errors of measurement associated with scores from a particular group of respondents. The SEM is a function of the reliability of the score and the standard deviation of scores. As reliability coefficients are sample dependent, we used intraclass correlation coefficients (ICCs) to assess the test–retest reliability of the outcome measures at 2 timepoints: at baseline and following surgeon consultation. The SEM was calculated by taking the square root of one minus the reliability and multiplying the results by the standard deviation of each outcome measure at baseline. A 95% confidence band (2SEM) was used around individual scores to provide ‘reasonable limits’ for estimating the true score [40]. The 2SEM criterion was applied to the change scores to classify patients as improved (> 2SEM) versus the same or worse. The procedure was repeated using a 68% confidence band (1SEM). Both 2SEM [41, 42] and 1SEM [43] have been used in the HRQL literature to classify patients into those who improved following surgery vs. those who did not.
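The SEM calculation and the resulting classification rule can be sketched as follows. The function names and the numeric values in the usage note are hypothetical and are not results from this study.

```python
import math


def standard_error_of_measurement(sd_baseline: float,
                                  reliability: float) -> float:
    """SEM = baseline SD * sqrt(1 - reliability), where reliability
    is the test-retest ICC of the outcome measure."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_baseline * math.sqrt(1.0 - reliability)


def improved(change: float, sem: float, multiplier: int = 2) -> bool:
    """True if the change score exceeds multiplier * SEM.

    multiplier=2 corresponds to the 2SEM criterion and
    multiplier=1 to the 1SEM criterion described in the text.
    """
    return change > multiplier * sem
```

With a hypothetical baseline SD of 0.2 and an ICC of 0.75, the SEM is 0.1. A patient whose index score improved by 0.15 would then be classified as improved under the 1SEM criterion but not under the 2SEM criterion.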

The anchor-based method examines the relationship between a PROM and an external criterion or anchor, either clinical or patient-based, to assign people into several groups reflecting their degree of change [39, 44,45,46]. The minimum important difference (MID) is defined as the smallest difference in a PROM that is perceived by patients as beneficial or harmful [37, 39, 47, 48]. MIDs vary by population, context, method, the choice of anchor, the length of follow-up, and the strength of the relationship between the HRQL measure and the anchor [37, 39]. A fundamental aspect of estimating the MID is to define the subgroup of people who have changed by a minimal amount [37, 39, 47]. The most commonly used method to identify this subgroup is to ask patients at follow-up how much they have changed since baseline; this is also referred to as a global transition question [37, 46]. Typically, the MID is assessed by asking patients to provide a retrospective judgment about whether they have improved, stayed the same, or worsened over a period of time; the mean change of the group that reports getting a little better or a little worse is used to estimate the MID [37, 46, 47]. This is the method that we used. As recommended [37, 39], we calculated the MID using three different anchor questions, each with five response options, 1 year following surgery: (1) How would you rate your hip or knee now compared to when you first saw the orthopaedic surgeon? (2) How would you rate your health in general now compared to when you first saw the orthopaedic surgeon? (3) How satisfied are you with your hip or knee replacement? The response options for the first two questions were: much better, somewhat better, about the same, somewhat worse, and much worse. The response options for the satisfaction question were: very satisfied, satisfied, neither satisfied nor dissatisfied, dissatisfied, and very dissatisfied. We calculated the MID as the mean change score between baseline and one year post-surgery in patients who rated (1) their hip or knee as “somewhat better”, (2) their health in general as “somewhat better”, and (3) their satisfaction as “satisfied”.
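The anchor-based MID described above amounts to a mean change score within one anchor category. A minimal sketch follows; the function name, the argument layout, and the example values are hypothetical.

```python
import statistics
from typing import Optional, Sequence


def mid_from_anchor(changes: Sequence[float],
                    anchor_responses: Sequence[str],
                    minimal_category: str = "somewhat better"
                    ) -> Optional[float]:
    """Mean change score among patients whose anchor response equals
    `minimal_category`; None if no patient chose that category."""
    if len(changes) != len(anchor_responses):
        raise ValueError("one anchor response is needed per change score")
    selected = [c for c, a in zip(changes, anchor_responses)
                if a == minimal_category]
    if not selected:
        return None
    return statistics.mean(selected)
```

For the satisfaction anchor, the same function would be called with minimal_category set to "satisfied". Because the estimate is a subgroup mean, it becomes unstable when few patients choose the minimal category.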

Any evaluation of the effectiveness of TJR depends on the definition of a “successful” treatment [49]. In TJR, the MID has been used as a threshold to determine the number of patients who are classified as responders to surgery [8, 50,51,52,53,54] or as having a successful outcome [55]. We used the MID as a cutpoint to dichotomize patients into responders and non-responders to TJR [8, 51]. Patients with a change score greater than the MID were classified as “responders”. Finally, we cross-tabulated the responder categories defined by the MID with those defined by patients who improved using the 1SEM and 2SEM criteria. We calculated the percent agreement for each comparison.

Results

Questionnaires were mailed to 1738 patients who were referred to an orthopaedic surgeon; 975 patients returned the baseline questionnaire (Fig. 1). There was no significant difference in age or sex in responders vs. non-responders. Of the responders, 592 were offered and had a TJR; of these, 541 completed both the baseline and 12-month post-surgery questionnaire. Our sample included 537 patients with baseline and one-year post-surgery EQ-5D-5L data. Eighty-two percent of patients had their surgery in Winnipeg and 18% in Halifax. Fifty percent (n = 268) had a knee replacement, 19% (n = 100) had had a previous TJR on another joint, 56% (n = 302) were female, and the mean age was 64.1 years (SD 10.3). Table 1 provides baseline descriptive statistics for the demographics and outcome measures.

Fig. 1 Flow diagram showing recruitment of patients

Table 1 Baseline statistics for outcome measures and demographics for hip and knee replacement patients

Table 2 shows the EQ-5D-5L responses for hip and knee replacement at baseline and one year post-surgery. All response options were used for mobility and self-care at both time points. The majority of patients reported “moderate” or “severe” problems at baseline for mobility, usual activities, and pain, and “slight” or “no” problems one year post-surgery for all five dimensions. Ceiling effects for the EQ-5D-5L post-surgery ranged from 18% (TKR) to 36% (THR) compared to 5% and 23%, respectively, for the Oxford scores (Table 3). There were no floor effects for the EQ-5D-5L or the SF-12.

Table 2 Distribution of EQ-5D-5L responses at baseline and 1 year post-surgery
Table 3 Floor and ceiling effects for outcome measures for hip and knee (percent)

As hypothesized, correlation coefficients of the EQ-5D-5L with the Oxford scores were strong; correlations of change scores were 0.72 for THR and 0.74 for TKR (Table 4). Correlations of EQ-5D-5L change scores with the other measures ranged from 0.32 (EQ-5D-5L and MCS-12) to 0.53 (EQ-5D-5L and PCS-12).

Table 4 Correlation coefficients of EQ-5D-5L with other outcome measures

The effect sizes for the EQ-5D-5L were 1.86 (hip) and 1.19 (knee) (Table 5). These were smaller than effect sizes for the Oxford scores (3.00 and 2.05) and the PCS-12 (2.05 and 1.62), respectively.

Table 5 Effect size (ES) and standardized response mean (SRM) for outcome measures for hip and knee

The percentage improved with the EQ-5D-5L at one year post-surgery based on 2SEM was 77% for THR and 55% for TKR. This percentage was similar to the PCS-12 for THR and lower than the Oxford scores (Table 6).

Table 6 Percent of patients improved > 1 and > 2 standard errors of measurement (SEM)

The MID for the EQ-5D-5L was 0.22 (hip) and 0.20 (knee) for patients who rated their hip or knee, respectively, as somewhat better (Table 7). We did not assess the MID for patients who reported their outcomes as somewhat worse, as there were insufficient numbers in this category (five or fewer). The percentage of patients classified as “responders” one year following surgery varied with the outcome measure, the anchor, and the surgical joint. The percentage of patients who improved using the EQ-5D-5L ranged from 57% (health now) to 81% (hip or knee now) for THR and from 54% to 68%, respectively, for TKR. The percentage improved was higher for hip patients than knee patients with both the EQ-5D-5L and Oxford scores. Similar to the percentage of patients who improved based on exceeding 1 or 2SEM, the percentage of patients categorized as “responders” with the MID was generally the highest using the Oxford scores.

Table 7 Minimum important difference (MID) and percent improved for each outcome measure for hip and knee

On the suggestion of a reviewer, we compared our MID results to those obtained with simple linear regression, using a method described by Lee et al. [56]. Lee et al. calculated a minimal clinically important difference (MCID) as the slope of the regression line for the change in outcome measure across levels of patient satisfaction. Following this method, we used the slope of the line relating the change in EQ-5D-5L and Oxford scores to the level of perceived improvement in the hip or knee. We collapsed the levels of improvement into three groups: “much better”, “somewhat better”, and “about the same/worse” due to small numbers in the “about the same” or “worse” groups. The MCID for the EQ-5D-5L, identified by simple linear regression, was 0.24 (95% CI 0.11–0.37) and 0.20 (95% CI 0.12–0.27) for THR and TKR, respectively. The analysis was repeated for the OHS and OKS; results were 12.26 (95% CI 7.98–16.53) for the OHS and 8.78 (95% CI 6.36–11.21) for the OKS.
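One way to read the regression approach is as an ordinary least-squares slope of the change score on a numerically coded improvement level. The sketch below follows that reading. The coding of the three collapsed groups as 0 (“about the same/worse”), 1 (“somewhat better”), and 2 (“much better”) is an assumption made for illustration, as are the function name and example data; the exact specification used by Lee et al. [56] may differ.

```python
from typing import Sequence


def ols_slope(levels: Sequence[float], changes: Sequence[float]) -> float:
    """Least-squares slope of change score on improvement level.

    Under the coding assumed in this sketch, the slope is the average
    difference in change score between adjacent improvement groups.
    """
    if len(levels) != len(changes) or len(levels) < 2:
        raise ValueError("need paired observations (n >= 2)")
    mean_x = sum(levels) / len(levels)
    mean_y = sum(changes) / len(changes)
    sxx = sum((x - mean_x) ** 2 for x in levels)
    if sxx == 0:
        raise ValueError("improvement level does not vary")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(levels, changes))
    return sxy / sxx
```

For six hypothetical patients with levels 0, 0, 1, 1, 2, 2 and change scores 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, the slope is 0.2 per improvement level.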

We cross-classified the responder category (responder or non-responder) of the change scores with the MID (calculated with 3 different anchors) versus 1 and 2 SEM. Table 8 shows the percent agreement for responder category for each comparison. For the EQ-5D-5L, there was 96.3% agreement in the classification of responder status for THR and 86.9% for TKR using 2 SEM vs. the MID with hip or knee improvement as the anchor. A comparison of the percentage of patients who were classified in the same category (responder or non-responder) using 2SEM versus the MID showed agreement between 79.6 and 99.6% with the EQ-5D-5L and between 69.4% and 94.8% with the Oxford scores. The percentage agreement using the two methods was generally higher with 2SEM compared to 1SEM for the Oxford scores and EQ-5D-5L.

Table 8 Percent agreement of responder category (responder or non-responder) with minimum important difference (MID) vs. 1 and 2 standard errors of measurement (SEM)

Discussion

In this study, we assessed the convergent validity and responsiveness of the EQ-5D-5L and compared it to condition-specific and generic PROMs in patients who had either THR or TKR. We also compared both distribution-based and anchor-based methods of assessing responsiveness. Baseline Oxford scores and PCS-12 were similar to pre-TJR scores in other studies [3, 57, 58], while EQ-5D-5L index scores were similar to pre-surgery EQ-5D-3L index scores in THR [13] and TJR [3] patients. The baseline MCS-12 was similar to other pre-TJR scores [11, 57] and healthy population norms [31]. Compared to EQ-5D-3L rheumatology studies where the majority of patients used level 2 (some to moderate) for mobility, usual activities, and pain/discomfort [16, 35, 59], responses to the EQ-5D-5L were more evenly distributed across the five levels.

Ceiling effects for the EQ-5D-5L one year following surgery were moderate (> 15%) but were less than those reported for the EQ-5D-3L at 6 months [60] and one year post-surgery [13, 15], which have been reported to be as high as 84% [15]. The post-surgery ceiling effect for the OHS was also moderate; ceiling effects have been reported in other studies for the OHS and OKS [33, 61].

Convergent validity

Our results supported the convergent validity of the EQ-5D-5L in TJR. Both baseline and change score correlations between the EQ-5D-5L and Oxford scores were strong and similar to those in a cross-sectional study of TJR [35]. Correlations were stronger than those between the EQ-5D-3L and Oxford scores reported in other studies [3, 13, 35, 36].

Responsiveness

Although effect sizes for the EQ-5D-5L were large (> 0.8), they were smaller than effect sizes for the Oxford scores and the PCS-12. Compared to the few studies that have reported ES for the EQ-5D-3L in TJR, effect sizes for the EQ-5D-5L in our study were higher than those reported for the EQ-5D-3L in THR patients [13] and similar in TKR patients [15] one year following surgery. ESs for Oxford scores were similar to those in other studies of patients with TKR [3] and THR [13, 33, 58]. In a comparison of responsiveness of the OHS and EQ-5D-3L, Dawson et al. [62] also found a larger ES for the OHS one year following revision hip replacement. ICCs were similar to or lower than those in the few studies that have reported test–retest reliability in similar populations [28, 35, 63, 64]. The percentage improved using 2SEM was higher for hip patients than knee patients for all outcome measures. This is consistent with other studies that report a higher percentage of patients showing improvement with hip vs. knee surgery [4, 11].

We assessed MIDs using three patient-rated questions as anchors: perceived change in health status, change in hip or knee, and satisfaction with surgery 12 months following surgery. The MID for each PROM varied depending on the joint (hip or knee) and the anchor. For the EQ-5D-5L, the MID ranged from 0.20 (hip or knee somewhat better for TKR) to 0.41 (health somewhat better for THR). In comparison, Walters and Brazier reported an MID of 0.12 for the EQ-5D-3L for TKR patients who reported some change in health from baseline to 6 months [65]. Although there is some concern regarding the application of the MID for preference-based HRQL instruments in cost-effectiveness analysis [66], the EQ-5D-5L is also used in clinical evaluation, where the MID is well accepted. Our MID of 11 points for the OHS and OKS in patients who reported their hip/knee problem as “somewhat better” was similar to the mean values reported by Beard et al. [58] of 11 and 9 points for the OHS and OKS, respectively, for patients whose response was “a little better” to a question that compared their hip/knee problems 6 months after surgery to before surgery. Our MID results for the EQ-5D-5L and Oxford scores were similar to those using the linear regression method described by Lee et al. [56].

The percentage of responders with the EQ-5D-5L varied from 54 to 81%, and with the Oxford scores, from 53 to 89%. In comparison, other longitudinal studies using the MID to determine responders to TJR have reported the percentage of responders ranging from 54% [53] to 70% [8, 11]. However, these studies varied in method of assessment, hip or knee joint, and follow-up time. Similar to other studies [11, 53, 67], a higher percentage of hip patients compared to knee patients were classified as responders. Of the three anchors we used to calculate the MID, the anchor most specific to the hip and knee outcome resulted in the highest proportion of responders for all outcome measures except the MCS-12.

Comparison of distribution and anchor-based methods

Both distribution-based and anchor-based methods have been used to assess responsiveness, but few studies have done concurrent comparisons of both approaches [43, 46, 68]. In our study, the percent agreement (responders or non-responders) using 2SEM and MID ranged from 79.6 to 99.6% for the EQ-5D-5L and from 69.4 to 94.8% for the Oxford scores. The percent agreement varied with the measures but was generally higher with 2SEM compared to 1SEM with the EQ-5D-5L and Oxford scores. In contrast, Wyrwich et al. [43] reported that 1SEM closely approximated the MCID in patients with chronic obstructive pulmonary disease. However, they did not report a comparison of the 2SEM with the MCID. Based on a systematic literature review of the MID, Norman et al. [46] concluded that the MID estimates were close to one-half of an SD, which is equivalent to 1SEM when the test–retest reliability is 0.75. However, in one study, this method of defining responders resulted in only 54% of patients classified as having a “good TJR outcome” 6 months following surgery [53].
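The equivalence between one-half of an SD and 1SEM at a reliability of 0.75 follows directly from the SEM formula given in the Methods. As a worked check, writing r for the test–retest reliability (notation introduced here for illustration):

```latex
\[
\mathrm{SEM} = \mathrm{SD}\,\sqrt{1 - r}
\qquad\Longrightarrow\qquad
\mathrm{SEM}\big|_{r = 0.75} = \mathrm{SD}\,\sqrt{1 - 0.75} = \mathrm{SD}\,\sqrt{0.25} = 0.5\,\mathrm{SD}
\]
```

For reliabilities above 0.75 the SEM is smaller than one-half of an SD, and for reliabilities below 0.75 it is larger, so the two thresholds coincide only at that value.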

A limitation of this study was the poor response rate among patients following referral; however, there was no significant difference in demographic variables between responders and non-responders. Based on the percentage of responders who were offered surgery, not all of the non-responders would have been surgical patients. This could explain some of the non-response, as the survey questions may not have been relevant for patients who were not appropriate candidates for TJR. However, the majority of surgical patients who returned the baseline survey also completed the one year follow-up survey. In addition, the analysis for test–retest reliability may be biased by the orthopaedic consultation between the pre- and post-consultation assessments.

Conclusion

This study is the first to report on the responsiveness of the EQ-5D-5L in a longitudinal study of patients undergoing TJR. Our results support its use as a generic HRQL measure to be used alongside a condition-specific measure in assessing PROMs in primary TJR. Compared to the few studies that have assessed the responsiveness of the EQ-5D-3L in primary TJR [3, 13, 15, 60], the EQ-5D-5L had a more even distribution of responses across the five dimensions, less ceiling effect, and a larger ES for THR. Although the EQ-5D-5L was less responsive than the Oxford scores, this finding is consistent with other comparative studies of condition-specific and generic HRQL measures [15, 41]. We showed that there is wide variation in the MID and the percentage of responders, depending on the joint assessed, the method of assessment, and the outcome measure. Therefore, interpretation of a successful outcome of TJR should take these factors into consideration. Caution should be taken in the interpretation of the percentage of “responders” to TJR based on only one method of assessing the responder classification.