Introduction

For patients suffering from pain or functional limitations caused by end-stage osteoarthritis (OA) of the hip and/or knee or by joint injuries, joint replacement, including total hip replacement (THR) and total knee replacement (TKR), is one of the most effective ways to relieve symptoms and improve functioning and health-related quality of life (HRQL) [1]. In 2015, there were 166 hip replacements and 126 knee replacements per 100,000 people in the Organization for Economic Co-operation and Development (OECD) countries [2]. On average, the rate of knee replacements almost doubled and the rate of hip replacements increased by 30% in OECD countries from 2000 to 2015 [2]. In Canada, 55,981 THRs and 67,169 TKRs were performed in 2016–2017 (increases of 18% and 15%, respectively, since 2011–2012), which placed a burden of more than 1 billion Canadian dollars on the Canadian health system [3].

Given the continued growth in the number of joint replacements and the emphasis on patient-centered care in this population, the HRQL of patients following THR and TKR, specifically the change in HRQL from before to after surgery, is considered an important outcome of these procedures. Instruments used to measure HRQL (and change in HRQL) in these patients therefore need to be able to detect such change (i.e., responsiveness) [4].

HRQL of the THR/TKR patients can be measured using both OA-specific measures and generic measures [5,6,7]. Generic preference-based measures can provide utility scores that can be used in calculating quality-adjusted life years (QALYs) for use in cost-utility analyses and also allow comparisons among different interventions and health conditions. Developed by the EuroQol Group [8], the EQ-5D [9] is one of the most commonly used generic preference-based measures in clinical trials and cost-utility analyses [10,11,12,13]. It also has been routinely applied in THR/TKR programs in the UK [14], Sweden [15], and Canada (Alberta) [16]. There are two versions of the EQ-5D: the 3L and the 5L. For routine outcome measurement among the THR/TKR patients, the 3L version is used in the UK and Sweden. In Alberta, the 3L version was used initially, but has recently, and gradually, been replaced by the 5L version.

The 5L version of the EQ-5D was developed to enhance the measurement properties of the instrument, since the 3L version has important limitations, including a high ceiling effect and limited sensitivity to change [17,18,19,20]. The available evidence on the measurement properties of the 3L versus the 5L among patients following THR/TKR is mainly based on cross-sectional comparisons; longitudinal comparisons between the two versions are lacking. Greene et al. [21] and Conner-Spady et al. [17] have shown that the 5L version is superior to the 3L in terms of construct validity in patients undergoing THR/TKR. Two recent assessments of longitudinal measurement properties have focused on the 5L version. Conner-Spady et al. [22] reported that the EQ-5D-5L has appropriate responsiveness in THR/TKR patients; however, they evaluated responsiveness only for the 5L version, in a relatively small sample (n = 269 THR, n = 268 TKR), over a 12-month period. Because other factors (e.g., rehabilitation, development of other symptoms/conditions) may intervene over such a relatively long period, the observed change in HRQL may not reflect surgery-related change alone, which could affect the responsiveness results. The study by Bilbao and colleagues [23] supported the reliability, validity, and responsiveness (over a 6-month period) of the EQ-5D-5L in patients with hip or knee osteoarthritis; however, since only 20.2% of patients received joint replacement during follow-up, their findings may not generalize to the THR/TKR patient population.

Our objective was to examine and compare the responsiveness of the EQ-5D-3L and EQ-5D-5L over a 3-month time period in patients following THR/TKR.

Methods

Sample and materials

This is a retrospective cohort analysis using data from the Alberta Bone and Joint Health Data Repository, which is operated by the Alberta Bone and Joint Health Institute and was launched in 2010 [24]. The EQ-5D (3L or 5L) and the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) have been routinely administered to all THR/TKR patients before surgery and at 3 months and 1 year after surgery at 13 hospitals in Alberta, Canada. Typically, patients complete the EQ-5D and WOMAC during the first clinical assessment; alternatively, they can complete the measures before the first clinic visit through a hyperlink included in the appointment confirmation email, or after the clinic visit through a link in a follow-up email. For those who do not complete the two measures around the first clinical assessment, the pre-surgery measurement can be done during, before, or after one of the subsequent clinic visits. Since patients’ HRQL might change between the first clinical assessment and the surgery, and given our intention to examine whether the EQ-5D can capture pre- to post-surgery change in HRQL, a clear definition of the baseline measurement was necessary. We therefore defined the baseline measurement as a pre-surgery measurement completed between 14 days before and 30 days after the first clinical assessment. Patients were included in this analysis if they: (1) had a primary THR or TKR; (2) completed the EQ-5D and WOMAC in the period from 14 days before to 30 days after the first assessment; and (3) completed the WOMAC and the same version of the EQ-5D at baseline and at 3-month follow-up. All clinics used the EQ-5D-3L between 2010 and 2012 and gradually shifted to the EQ-5D-5L between 2013 and 2016; since 2016, all clinics have used the EQ-5D-5L.

In the present analysis, we used only data from baseline and the 3-month follow-up and did not use the 1-year follow-up data, because changes in HRQL caused by THR/TKR usually occur during the first three months after surgery; after that, change in HRQL is mainly driven by other factors (e.g., aging, comorbidities, and rehabilitation). Therefore, 3 months is an appropriate duration over which to assess the responsiveness of the EQ-5D in patients following THR/TKR.

Measures

The EQ-5D has five dimensions: mobility (MO), self-care (SC), usual activities (UA), pain/discomfort (PD), and anxiety/depression (AD). In the 3L version (first introduced in 1990 by the EuroQol Group), each dimension has three severity levels: “no problems” (level 1), “some problems” (level 2), and “unable to” (level 3; “confined to bed” for MO), describing 243 distinct health states [25]. The 5L version has the same five-dimension descriptive system but five severity levels for each dimension: “no problems” (level 1), “slight problems” (level 2), “moderate problems” (level 3), “severe problems” (level 4), and “extreme problems” (level 5), describing 3125 distinct health states [26]. We used the EQ-5D-3L and EQ-5D-5L Canadian value sets [27, 28] to calculate the index scores. The index scores range from −0.340 (health state “33333”) to 1 (health state “11111”) for the EQ-5D-3L [27] and from −0.148 (health state “55555”) to 0.949 (health state “11111”) for the EQ-5D-5L [28].
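The counts of 243 and 3125 health states follow directly from five dimensions with three or five levels each (3^5 and 5^5). The short Python sketch below enumerates the states to make this concrete; the function name `health_states` is a hypothetical helper introduced here for illustration and is not part of any EuroQol software.

```python
from itertools import product

# Dimension abbreviations as used in the text.
DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")


def health_states(levels: int) -> list[str]:
    """Return every health state as a five-digit string, e.g. '11111'.

    Hypothetical helper for illustration: one digit per dimension,
    each digit running from 1 (no problems) to `levels` (worst level).
    """
    digits = range(1, levels + 1)
    return ["".join(map(str, combo)) for combo in product(digits, repeat=len(DIMENSIONS))]


states_3l = health_states(3)  # 3 ** 5 states
states_5l = health_states(5)  # 5 ** 5 states
print(len(states_3l), len(states_5l))
```

Converting a health state to an index score additionally requires the published value set coefficients [27, 28], which are not reproduced here.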

The WOMAC is a commonly used, self-administered instrument to evaluate hip and knee osteoarthritis-related symptoms and functional limitations. It includes 24 items measuring pain (5 items), stiffness (2 items), and physical functioning limitations (17 items) [29]. The 5-point Likert version of the WOMAC was employed in this study. For each item, the score ranges from 0 for no symptoms or functional limitations to 4 for extreme symptoms or functional limitations. The raw scores of the three subscales range from 0–20, 0–8, and 0–68, respectively. All three raw subscale scores can be linearly converted to transformed scores ranging from 0 (worst) to 100 (best). For each patient, the transformed overall score is the average of the three subscales’ transformed scores [29]. In a sensitivity analysis, the WOMAC overall score was calculated using a simpler method: the sum of the 24 item scores was converted to a 0 (worst) to 100 (best) scale [29].
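The two WOMAC overall scores described above can be sketched in Python as follows. The text specifies only that the conversion is linear and runs from 0 (worst) to 100 (best); the exact formula used here, 100 × (maximum raw score − raw score) / maximum raw score, is an assumption, as are all function and variable names. Missing items are not handled.

```python
from statistics import mean

MAX_ITEM_SCORE = 4  # 5-point Likert items scored 0 to 4
SUBSCALE_ITEMS = {"pain": 5, "stiffness": 2, "function": 17}  # item counts from the text


def transform(raw: float, max_raw: float) -> float:
    """Linearly rescale a raw score so that 0 is worst and 100 is best (assumed formula)."""
    return 100.0 * (max_raw - raw) / max_raw


def womac_overall(items: dict[str, list[int]]) -> float:
    """Base-case overall score: mean of the three transformed subscale scores."""
    transformed = []
    for name, n_items in SUBSCALE_ITEMS.items():
        scores = items[name]
        if len(scores) != n_items:
            raise ValueError(f"{name} needs {n_items} item scores")
        transformed.append(transform(sum(scores), n_items * MAX_ITEM_SCORE))
    return mean(transformed)


def womac_overall_simple(items: dict[str, list[int]]) -> float:
    """Sensitivity-analysis score: rescale the sum of all 24 items."""
    total = sum(sum(items[name]) for name in SUBSCALE_ITEMS)
    max_total = sum(SUBSCALE_ITEMS.values()) * MAX_ITEM_SCORE  # 24 items * 4 = 96
    return transform(total, max_total)


# Made-up responses: moderate pain, extreme stiffness, no functional limitation.
example = {"pain": [2] * 5, "stiffness": [4] * 2, "function": [0] * 17}
print(womac_overall(example), womac_overall_simple(example))
```

The example illustrates why the two methods can differ: averaging subscales weights the 2-item stiffness subscale as heavily as the 17-item function subscale, whereas the simple method weights every item equally.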

Previous studies have reported that the WOMAC is responsive to change in HRQL among patients following joint replacement (or with OA) [30,31,32].

Statistical analysis

Patients’ baseline characteristics, including age, gender, pre-surgery risk factors, body mass index (BMI), number of weeks between baseline and surgery, and responses to the WOMAC and EQ-5D, were compared between the 3L and 5L samples using the Wilcoxon rank-sum test (continuous variables) and the Chi-square test (categorical variables). Ceiling and floor effects were explored at both the health state level and the dimension level for both samples. Ceiling or floor effects were considered present if more than 15% of patients reported the best or worst possible level or health state, respectively [33].
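As an illustration of the 15% rule, the following Python sketch computes the proportions of responses at the best and worst levels and flags ceiling and floor effects. The function name and the sample responses are hypothetical; the same function could be applied to health states by passing, for example, "11111" and "33333" as the best and worst values.

```python
def ceiling_floor(responses, best, worst, threshold=0.15):
    """Proportions at the best and worst value, with flags for the >15% rule.

    Hypothetical helper: `responses` is a non-empty list of levels
    (or health-state strings); `best` and `worst` are the extreme values.
    """
    n = len(responses)
    ceiling = sum(r == best for r in responses) / n
    floor = sum(r == worst for r in responses) / n
    return {
        "ceiling": ceiling,
        "floor": floor,
        "ceiling_present": ceiling > threshold,
        "floor_present": floor > threshold,
    }


# Made-up 3L mobility responses: most patients at level 2.
mo_3l = [1] * 2 + [2] * 17 + [3] * 1
print(ceiling_floor(mo_3l, best=1, worst=3))
```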

We used the WOMAC as an anchor and categorized patients into the following nine groups according to their relative change in the transformed overall WOMAC score from baseline [29]: large deterioration (decrease from baseline ≥ 70%), moderate deterioration (70% > decrease from baseline ≥ 50%), small but important deterioration (50% > decrease from baseline ≥ 20%), very mild deterioration (20% > decrease from baseline > 0), no change (no change in WOMAC overall score), very mild improvement (0 < increase from baseline < 20%), small but important improvement (20% ≤ increase from baseline < 50%), moderate improvement (50% ≤ increase from baseline < 70%), and large improvement (increase from baseline ≥ 70%). For patients whose baseline WOMAC overall score was 0, we replaced the baseline score with 0.5. Patients who achieved at least a small but important improvement were defined as responders, and those who did not were defined as non-responders. For both the 3L and 5L samples, differences in baseline characteristics between responders and non-responders were tested using the Wilcoxon rank-sum and Chi-square tests where applicable.
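The anchor grouping can be expressed compactly in code. The Python sketch below is an illustration under the thresholds stated above, with a higher transformed WOMAC score meaning better health, so that an increase is an improvement; the function names are hypothetical.

```python
def relative_change(baseline: float, followup: float) -> float:
    """Percent change in the transformed WOMAC overall score from baseline."""
    if baseline == 0:
        baseline = 0.5  # substitution described in the text for a zero baseline
    return 100.0 * (followup - baseline) / baseline


def change_group(pct: float) -> str:
    """Map a percent change to one of the nine anchor groups."""
    if pct == 0:
        return "no change"
    direction = "improvement" if pct > 0 else "deterioration"
    size = abs(pct)
    if size >= 70:
        label = "large"
    elif size >= 50:
        label = "moderate"
    elif size >= 20:
        label = "small but important"
    else:
        label = "very mild"
    return f"{label} {direction}"


def is_responder(pct: float) -> bool:
    """Responders achieve at least a small but important improvement."""
    return pct >= 20


pct = relative_change(baseline=40.0, followup=50.0)
print(pct, change_group(pct), is_responder(pct))
```

Because the change is expressed relative to baseline, patients with very low baseline scores can reach the larger categories with modest absolute changes.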

Since the anchor should measure a construct very similar to that of the target measure under investigation [34], we assessed the correlation between the anchor (change in WOMAC overall score) and the change in the EQ-5D-3L and EQ-5D-5L index scores against an a priori criterion, as recommended by Guyatt and colleagues [35]: the correlation of the anchor with the change in the EQ-5D (3L and 5L) index scores from baseline to 3 months should be larger than 0.5, i.e., at least a moderate correlation [36]. Spearman correlation coefficients between the anchor and the two versions of the EQ-5D were calculated treating the change in WOMAC overall score both as a continuous variable and as a categorical variable (based on the nine groups defined above).
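For readers unfamiliar with the statistic, the Spearman coefficient is the Pearson correlation computed on ranks, with tied values given their average rank. The following self-contained Python sketch shows the calculation and the check against the 0.5 criterion; it is an illustration only, the function names and sample data are made up, and it does not represent the software used in this analysis.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))


def meets_criterion(rho, threshold=0.5):
    """A priori criterion from the text: correlation larger than 0.5."""
    return rho > threshold


# Made-up changes in WOMAC overall score and EQ-5D index score.
womac_change = [5.0, 20.0, 35.0, 10.0, 50.0]
eq5d_change = [0.02, 0.15, 0.30, 0.10, 0.28]
rho = spearman(womac_change, eq5d_change)
print(round(rho, 3), meets_criterion(rho))
```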

The responsiveness of the EQ-5D-3L and EQ-5D-5L for patients in each of the above-defined groups was assessed using effect size (ES, change in scores divided by standard deviation of the baseline score), standardized response mean (SRM, change in scores divided by standard deviation of the change), and Guyatt’s Responsiveness Index (GRI, change in scores divided by the standard deviation of the change for the stable patients). In our analysis, the stable patients for GRI calculation were defined as patients who had “very mild improvement” or “very mild deterioration” or remained the same (“no change”) on the WOMAC overall score. The magnitudes of these coefficients were interpreted as small (0.2–0.49), moderate (0.5–0.79), or large (≥ 0.8) [37].
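The three responsiveness statistics differ only in their denominators, as the Python sketch below illustrates. It assumes that "change in scores" means the mean change in the group and that sample standard deviations (n − 1 denominator) are used; neither detail is stated in the text. The function names, the sample data, and the label "trivial" for values below 0.2 are likewise assumptions made for illustration.

```python
from statistics import mean, stdev


def changes(baseline, followup):
    """Per-patient change in index score (follow-up minus baseline)."""
    return [f - b for b, f in zip(baseline, followup)]


def effect_size(baseline, followup):
    """ES: mean change divided by the SD of the baseline scores."""
    return mean(changes(baseline, followup)) / stdev(baseline)


def standardized_response_mean(baseline, followup):
    """SRM: mean change divided by the SD of the change."""
    change = changes(baseline, followup)
    return mean(change) / stdev(change)


def guyatt_index(baseline, followup, stable_baseline, stable_followup):
    """GRI: mean change divided by the SD of the change in stable patients."""
    stable_change = changes(stable_baseline, stable_followup)
    return mean(changes(baseline, followup)) / stdev(stable_change)


def magnitude(value):
    """Interpret a coefficient with the thresholds given in the text [37]."""
    size = abs(value)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "trivial"  # label assumed here; the text does not name values below 0.2


# Made-up index scores for an improved group and for the stable group.
base, follow = [0.4, 0.5, 0.6], [0.7, 0.7, 0.7]
stable_base, stable_follow = [0.5, 0.5, 0.5], [0.4, 0.5, 0.6]
print(effect_size(base, follow))
print(standardized_response_mean(base, follow))
print(guyatt_index(base, follow, stable_base, stable_follow))
```

Because the GRI denominator comes from the stable group only, its value depends on how that group is defined, which is why the definition of stable patients is stated explicitly.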

Results

General characteristics of participants

Between April 2010 and March 2017, a total of 1998 patients had THR and completed the WOMAC and the same version of the EQ-5D at baseline and 3-month follow-up. Among these patients, 646 completed the 3L version (59.4% females, mean age 66.1 years) and 1352 completed the 5L version (58.1% females, mean age 65.3 years). During the same period, a total of 2776 patients had TKR and completed the WOMAC and the same version of the EQ-5D at baseline and 3-month follow-up. Among these patients, 948 completed the EQ-5D-3L (60.7% females, mean age 66.1 years) and 1828 completed the EQ-5D-5L (61.8% females, mean age 66.5 years). These patients were included in our analysis (Fig. 1). In both the THR and TKR samples, there were no statistically significant differences between the 3L and 5L samples in terms of age, sex, or pre-surgery body mass index (BMI) (Table 1). Patients who completed the EQ-5D-5L had a longer time interval between baseline and surgery than those who completed the EQ-5D-3L. The mean baseline WOMAC overall scores and EQ-5D index scores for THR patients (WOMAC 36.4, EQ-5D-3L 0.52, EQ-5D-5L 0.38) were lower than those for TKR patients (WOMAC 41.5, EQ-5D-3L 0.57, EQ-5D-5L 0.47) (Table 1). At the health state level, neither ceiling nor floor effects were present at baseline for either version (all < 0.3%); at the dimension level, the EQ-5D-3L consistently showed a larger ceiling effect on the SC and AD dimensions and a larger floor effect on the UA and PD dimensions than the EQ-5D-5L. Neither ceiling nor floor effects were present for the MO dimension in either version (all < 5.0%); however, for the EQ-5D-3L, more than 94% of patients selected level 2 (Table 1). For patients following TKR in both the 5L and 3L samples, there were no significant differences in age, gender, pre-surgery BMI, or number of pre-surgery risk factors between responders and non-responders, except that non-responders were younger than responders in the 5L sample. Similarly, for patients following THR, there were no significant differences between responders and non-responders, except that non-responders had more pre-surgery risk factors than responders in the 3L sample and a lower BMI in the 5L sample. Responders consistently had lower EQ-5D index scores at baseline than non-responders across all samples (see Supplementary Table 1).

Fig. 1 Flowchart for including patients for analysis. WOMAC: Western Ontario and McMaster Universities Osteoarthritis Index

Table 1 Patient baseline characteristics

Correlation between anchor and EQ-5D

For both versions of the EQ-5D in both the THR and TKR samples, the correlations between the change in EQ-5D index score and change in the WOMAC overall score (as either continuous or categorical variables) met the a priori criteria (Table 2).

Table 2 Spearman correlation coefficients between the change in EQ-5D and WOMAC (anchor)

Responsiveness

For the THR and TKR patients in the “small but important improvement” groups, the EQ-5D-5L (ES, THR 0.90, TKR 0.95; SRM, THR 1.20, TKR 1.13; GRI, THR 0.95, TKR 1.06) was consistently more responsive than the EQ-5D-3L (ES, THR 0.64, TKR 0.69; SRM, THR 0.69, TKR 0.73; GRI, THR 0.75, TKR 0.73). According to the pre-defined classification of magnitude of coefficients, for both versions, the three coefficients were “small” for the “very mild improvement” groups in both the THR and TKR samples (Table 3). The EQ-5D-5L consistently had larger ES, SRM, and GRI than the EQ-5D-3L for “moderate improvement” and “large improvement” groups in both the THR and TKR samples (Table 3).

Table 3 Change in EQ-5D index scores and coefficients of effect size

Sensitivity analysis

In the sensitivity analysis, we explored defining the WOMAC overall score using the simpler approach. More than 98% of patients were categorized into the same subgroup, defined by the relative change in WOMAC overall score, as in the base case. The correlations between the EQ-5D and the anchor and the values of the responsiveness statistics were very similar to the results of the base-case analysis (Supplementary Tables 2 and 3).

Discussion

Our study demonstrated that both versions of the EQ-5D were responsive among patients following THR or TKR; however, the EQ-5D-5L was better able to pick up changes in HRQL than the EQ-5D-3L.

Our finding that the EQ-5D-5L was responsive to change in HRQL of patients following THR or TKR is in line with a recently published study [22]. The responsiveness of the EQ-5D-5L has also been examined and confirmed in patients with stroke [38, 39], adolescent idiopathic scoliosis [40], chronic obstructive pulmonary disease [41], and meningitis [42]. The small ceiling and floor effects of the EQ-5D-5L at the health state level were in line with findings from studies in similar populations [17, 23].

However, there are very few studies that have compared the longitudinal measurement properties of the two versions of EQ-5D. The finding that the 5L version was more responsive than the 3L in our study was inconsistent with results from a similar comparison in patients with stroke that found that the EQ-5D-3L was more responsive than the EQ-5D-5L [38]. This difference may be due to different anchors used to evaluate the responsiveness, and the different patient populations and clinical context of these health conditions. Moreover, Golicki et al. used an interim Polish EQ-5D-5L value set which was estimated using the cross-walk approach [38]; however, in our study, we used the Canadian 5L value set estimated using the time trade-off method. The differences in the value sets and the approaches used to generate the value set might impact the assessment of responsiveness of the measures.

For patients following THR/TKR, the MO dimension is highly affected, and considerable changes are typically expected in this dimension after surgery. One likely reason that the 5L was more responsive than the 3L in this population is the difference in the wording of the most severe level of the MO dimension: “confined to bed” (level 3) in the 3L versus “extreme problems” (level 5) in the 5L. Previous cross-sectional studies demonstrated that, for the 3L version, most patients report levels 1 and 2, with almost none reporting level 3 [43, 44]; for the 5L, the distribution of responses was more even across the levels. This more even distribution allows more room for change. More specifically, for the 3L MO dimension, most improvements after surgery can only be from level 2 to level 1, since level 3 is barely used; for the 5L MO dimension, improvements can start from any level except level 1. Thus, both the move from three to five levels and the change in wording of the most severe level from “confined to bed” to “extreme problems” affected the measurement of mobility, which is highly relevant in the THR/TKR patient population. In addition, compared with the EQ-5D-3L, the EQ-5D-5L consistently showed lower ceiling/floor effects on the SC, UA, PD, and AD dimensions, which suggests that the 5L dimensions have better discriminative ability. This may also help explain the superior responsiveness of the EQ-5D-5L.

The WOMAC user’s guide suggests defining a meaningful change in HRQL using the following criteria [29]: (1) 20/50/70% improvement in the WOMAC pain subscale, or (2) 20/50/70% improvement in WOMAC pain and 20/50/70% improvement in the WOMAC stiffness or physical function subscale. To maximize the conceptual overlap between the anchor and the EQ-5D, we used a 20/50/70% increase or decrease from the baseline WOMAC overall score to categorize patients’ change in HRQL. Our approach captured change in all three aspects (pain, stiffness, and physical function). However, since our criteria were relatively strict compared with those in the WOMAC user’s guide, our study may have overestimated the responsiveness of the EQ-5D relative to the usual interpretation of the WOMAC. In addition, we used only the WOMAC as the anchor in this study. Benson et al. [45] reported that the EQ-5D-3L has a moderate correlation (Pearson’s r > 0.5) with howRu and the Oxford hip and knee scores in patients undergoing THR/TKR; those measures could be explored as anchors in further studies.

Our study adds evidence to the comparison of longitudinal measurement properties between the EQ-5D-3L and the EQ-5D-5L. This is important for orthopedic researchers and end-users deciding which version to use to measure change in HRQL over time in this patient population. This evidence can also support switching from routine use of the EQ-5D-3L to the EQ-5D-5L in the THR and TKR fields. Moving from the 3L version to the 5L version in this particular setting (Alberta, Canada) would potentially lead to an increase in the QALYs gained from surgery in cost-utility analyses.

Our study has several limitations. In the Alberta Hip and Knee Replacement Project, patients completed either the 3L or the 5L version of the EQ-5D, but not both, so we were not able to conduct a head-to-head comparison. The 3L and 5L samples were comparable in their main demographic characteristics; however, the 5L sample consistently had lower WOMAC scores than the 3L sample. Even though the mean differences between the 3L and 5L samples were small (1.4–3.9 on a 0–100 scale), bias from this potential heterogeneity in baseline quality of life may still have affected the results. In addition, because of the shift from the 3L to the 5L, patients who completed the EQ-5D-3L received surgery earlier than those who completed the EQ-5D-5L. Since waiting times for THR/TKR in Alberta have been getting longer in recent years [46], the time interval between baseline and surgery was longer for the EQ-5D-5L samples, which may have introduced bias. Further head-to-head comparisons of the longitudinal measurement properties of the 3L and 5L versions of the EQ-5D would add valuable information to this field. In our data set, among those who had complete WOMAC and EQ-5D records at baseline, only about one-third completed the WOMAC and the same version of the EQ-5D at 3-month follow-up. We did not apply any imputation, nor did we compare those who had 3-month records with those who did not. This was because the reasons for loss to follow-up were complex, including the switch from the 3L to the 5L, logistical issues at each clinic, and patients’ readmissions and joint repairs (which ended follow-up of the initial THR/TKR). Losing this many patients may introduce bias; however, with our current data set we were not able to estimate its direction and magnitude to inform the generalizability of this study.

We used Cohen’s criteria [37] to interpret the magnitude of the SRM; however, some researchers have reported that this can lead to over- or under-estimation of changes [47, 48]. We recognize that few patients in our samples experienced deterioration in health, since hip/knee replacement surgery is generally very effective. Therefore, the effect size estimates for the deterioration groups were less robust because of the much smaller sample sizes in these groups. Finally, we did not report a minimally important difference (MID) for the EQ-5D in this analysis, since more than 90% of the THR and TKR patients were categorized as having a “small but important” improvement or above in this data set. We applied recommended anchor-based approaches, such as receiver operating characteristic curves [35], to estimate MIDs for the EQ-5D; however, the aforementioned unbalanced samples appeared to introduce bias. Our efforts to resolve this issue are ongoing.

Conclusion

Our study demonstrated that the EQ-5D-5L is more responsive than the EQ-5D-3L in measuring changes in HRQL in patients following THR/TKR, relative to the WOMAC as a clinical anchor. When choosing a generic preference-based measure, we recommend using the EQ-5D-5L in longitudinal studies in this patient population. Further research comparing the responsiveness of the EQ-5D-3L and EQ-5D-5L against other anchors and in other populations is imperative to confirm these findings.