Introduction

Given the limited health resources, cost–utility analysis is increasingly used to inform decisions on whether to adopt new but expensive health-care interventions. Preference-based health-related quality of life (HRQOL) measures are commonly used to generate quality-of-life weights for calculating quality-adjusted life years (QALYs) in such analysis.

The EuroQol 5-dimension (EQ-5D) [1] and the Short Form 6-dimension (SF-6D) [2] are widely used preference-based HRQOL instruments. Both instruments describe a respondent’s health status using a multi-attribute health-state classification system and produce a utility value for the respondent on a scale anchored by 0 (death) and 1 (full health). However, the EQ-5D and SF-6D have exhibited important differences in empirical studies. First, the health utility measured by the EQ-5D was generally lower than that measured by the SF-6D [3–5]. Second, the two instruments showed differential sensitivity to differences in health status. The EQ-5D was more efficient than the SF-6D in detecting group differences in a Spanish general population sample [6]; on the other hand, the SF-6D showed greater discriminatory power than the EQ-5D in two general population samples [7, 8], hearing-impaired adults [9], and liver transplant patients [10].

The EQ-5D and SF-6D have been used in patients with end-stage renal disease (ESRD) [11–14], and one study assessing the validity of the two measures in hemodialysis (HD) patients found that they performed similarly except that the incompletion rate was lower for the EQ-5D [13]. However, the sensitivity of these two instruments and the impact of different index scores on health utility estimates in patients with ESRD have not been formally assessed; hence, it is unknown which of the two instruments is more suitable for use in this population. In addition, all previous studies used the 3-level EQ-5D (EQ-5D-3L), which is susceptible to poor discriminative power [15, 16] and ceiling effects [17]. The new 5-level EQ-5D (EQ-5D-5L) has been developed [18] and shown to have better sensitivity and fewer ceiling effects than the EQ-5D-3L in both cancer patients and a general population sample [19–21]. We were therefore interested in the measurement properties of the EQ-5D-5L in ESRD patients. The objective of the present study was to assess the psychometric properties of the EQ-5D-5L and SF-6D instruments in patients with ESRD in terms of agreement, construct validity, and sensitivity.

Methods

Patients

A consecutive sample of patients with ESRD was recruited while they were awaiting routine consultation or undergoing HD in the dialysis centre of National University Hospital, a tertiary referral hospital in Singapore, from June 2012 to May 2013. Inclusion criteria were: (1) a diagnosis of ESRD; (2) on HD or peritoneal dialysis (PD) for at least 3 months; (3) ability to communicate in English or Chinese; and (4) well enough to be interviewed. After providing written consent, each patient was interviewed by a trained interviewer using a standardized questionnaire (available in identical English and Chinese versions) that included the EQ-5D-5L self-report questionnaire, the 36-item Kidney Disease Quality of Life questionnaire (KDQOL-36), and questions assessing socio-demographic characteristics. Clinical data such as co-morbidity [measured as Charlson comorbidity index (CCI)], blood hemoglobin level, and dialysis adequacy [measured as Kt/V (K: dialyzer clearance of urea, t: dialysis time, V: volume of distribution of urea)] were obtained from patients’ case notes. The CCI is an age-adjusted index score of the number and severity of co-morbidities that has been shown to predict mortality in ESRD [22], with higher scores indicating the presence of multiple and/or advanced stage(s) of various medical condition(s).

Instruments and measures

EQ-5D-5L

The EQ-5D-5L self-report questionnaire has five items (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression) [23], with five descriptive levels for each item. The five levels in the EQ-5D-5L include “no problems”, “slight problems”, “moderate problems”, and “severe problems” for all five items, and “unable to do” for mobility, self-care, and usual activities or “extreme problems” for pain/discomfort and anxiety/depression. Respondents choose one level for each item to describe their health status on the day of interview. Responses to the five EQ-5D items define a health state for which an index score can be generated to indicate its value to the general public. The index score is anchored by 0 (death) and 1 (full health), with higher scores corresponding to higher utility. The English and Chinese versions of the EQ-5D-5L have been validated for use in Singapore [24, 25]. In this study, the EQ-5D-5L index scores were calculated using a mapping (“crosswalk”) function [26] to reflect the values of the described health states to the general UK population as no other EQ-5D-5L value set was available at the time of this study [1].

KDQOL-36

The KDQOL-36 [27] is a commonly used kidney disease-specific HRQOL instrument. It comprises the 12-item Short-Form Health Survey (SF-12) and three scales targeting kidney disease and dialysis: symptoms/problems (12 items), effects of kidney disease on daily life (8 items), and burden of kidney disease (4 items). Scores of these three kidney scales were calculated using disease-specific KDQOL items, ranging from 0 to 100, with higher scores indicating better perceived health. The English version of the KDQOL-36 has been validated in Singaporean HD patients [28]. The KDQOL-36 was scored using the recommended algorithm (available from: www.gim.med.ucla.edu/kdqol).

SF-12 based SF-6D index

The SF-6D is a multi-attribute health classification system consisting of six domains: physical functioning, role limitation, social functioning, pain, mental health, and vitality, with 2–6 levels for each domain. Responses to seven of the SF-12 items can be mapped to health states defined by the SF-6D classification system, from which the utility-based SF-6D index score can be generated [29]. The SF-6D index score derived from the SF-12 reflects the health preferences of the UK general population, ranging from 0.29 (the worst possible health state) to 1.00 (full health) [29]. Both English and Chinese versions of the SF-6D have been validated in Singapore [30].

Statistical analysis

Continuous variables were presented as means and standard deviations (SD), while categorical variables were shown as frequencies and proportions. For the EQ-5D-5L and SF-6D index scores, we reported the distribution of scores, the mean (SD), median, minimum, and maximum, and the percentage of respondents with the minimum and maximum scores. Agreement between the EQ-5D-5L and SF-6D scores was examined by calculating the intraclass correlation coefficient (ICC) and using a Bland–Altman plot [31]. An ICC ≥ 0.7 suggests an acceptable level of agreement [32].
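The agreement statistics described above can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study (which was run in STATA): the function names are hypothetical, the toy scores are invented, and the ICC form shown (two-way, single-measure, absolute agreement) is an assumption, since the text does not state which ICC form was used.

```python
from statistics import mean, stdev


def bland_altman(sf6d, eq5d):
    """Return (bias, lower limit, upper limit) of agreement for paired scores.

    Differences are taken as SF-6D minus EQ-5D-5L; the limits are the
    mean difference plus or minus 1.96 SD of the differences.
    """
    diffs = [s - e for s, e in zip(sf6d, eq5d)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


def icc_agreement(a, b):
    """Two-way, single-measure, absolute-agreement ICC for two instruments.

    ASSUMPTION: this ICC form is chosen for illustration only.
    """
    n, k = len(a), 2
    values = list(a) + list(b)
    grand = mean(values)
    row_means = [(x + y) / 2 for x, y in zip(a, b)]   # per-patient means
    col_means = [mean(a), mean(b)]                    # per-instrument means
    ss_rows = k * sum((r - grand) ** 2 for r in row_means)
    ss_cols = n * sum((c - grand) ** 2 for c in col_means)
    ss_total = sum((v - grand) ** 2 for v in values)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + (k / n) * (ms_cols - ms_err)
    )


# Invented toy scores, for illustration only (not study data).
toy_eq5d = [0.20, 0.55, 0.70, 0.85, 1.00]
toy_sf6d = [0.45, 0.60, 0.68, 0.80, 0.90]
bias, lower, upper = bland_altman(toy_sf6d, toy_eq5d)
print(f"bias={bias:.3f}, limits=({lower:.3f}, {upper:.3f})")
print(f"ICC={icc_agreement(toy_eq5d, toy_sf6d):.3f}")
```

Under this sketch, the width of the limits-of-agreement interval is 2 × 1.96 × SD of the paired differences, and a constant shift between the two instruments lowers the absolute-agreement ICC even when their correlation is perfect.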

Convergent construct validity was investigated by examining the correlation between the EQ-5D-5L and the SF-6D utility scores using Pearson’s correlation coefficient (r). We expected the two utility scores to show a strong correlation (r ≥ 0.5) [3, 10, 33], in support of construct validity. Known-groups validity was assessed by testing the a priori hypothesis that the utility scores would be higher in patients in better health status than in those in worse health [7, 12, 34]. The study sample was dichotomized into subgroups in better and worse health status according to patients’ co-morbidity (indicated by CCI), hemoglobin level, dialysis adequacy (indicated by Kt/V), and KDQOL-36 kidney disease-specific scale scores. We used the mean as the cut-off value for all the variables except Kt/V, for which cut-off values were defined separately for HD and PD patients. The sensitivity to detect differences in known groups was assessed using the “relative efficiency (RE)” statistic and effect size (Cohen’s d). The RE statistic is defined as the ratio of the F statistics from analysis of variance (ANOVA) tests of the differences in scores between patients who have “better” health and those who have “worse” health [32]. For each pair of known groups, we used the F statistic of the EQ-5D-5L index as the reference (RE = 1) to calculate the RE value of the SF-6D index. As higher F-statistic values correspond to higher statistical significance, the instrument with the higher RE value would be considered more efficient or discriminative than its comparator. The effect size was calculated as the difference in mean scores divided by the pooled SD [35]. We used threshold values of 0.2, 0.5, and 0.8 to define small, moderate, and large effect sizes, respectively [35]. Differences in the mean scores of the EQ-5D-5L and SF-6D between the groups known to differ in health status were also compared.
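The two sensitivity statistics defined above can be illustrated with a short sketch. The function names and the toy scores are hypothetical and are not taken from the study; the code simply follows the definitions in the text (the ratio of two-group ANOVA F statistics, and the mean difference divided by the pooled SD).

```python
from math import sqrt
from statistics import mean, variance


def pooled_variance(better, worse):
    """Pooled within-group variance of two independent groups."""
    n1, n2 = len(better), len(worse)
    return ((n1 - 1) * variance(better) + (n2 - 1) * variance(worse)) / (n1 + n2 - 2)


def anova_f(better, worse):
    """One-way ANOVA F statistic for a two-group comparison."""
    n1, n2 = len(better), len(worse)
    grand = mean(list(better) + list(worse))
    # Between-group mean square (1 degree of freedom for two groups).
    ms_between = n1 * (mean(better) - grand) ** 2 + n2 * (mean(worse) - grand) ** 2
    return ms_between / pooled_variance(better, worse)


def cohens_d(better, worse):
    """Effect size: difference in means divided by the pooled SD."""
    return (mean(better) - mean(worse)) / sqrt(pooled_variance(better, worse))


def relative_efficiency(f_comparator, f_reference):
    """RE of the comparator instrument; the reference instrument has RE = 1."""
    return f_comparator / f_reference


# Invented toy utility scores for one pair of known groups (not study data).
eq5d_better, eq5d_worse = [0.9, 0.8, 1.0, 0.7], [0.5, 0.6, 0.4, 0.3]
sf6d_better, sf6d_worse = [0.8, 0.75, 0.85, 0.7], [0.6, 0.65, 0.55, 0.7]

re_sf6d = relative_efficiency(
    anova_f(sf6d_better, sf6d_worse), anova_f(eq5d_better, eq5d_worse)
)
print(f"RE of SF-6D (EQ-5D-5L as reference): {re_sf6d:.2f}")
print(f"Cohen's d, EQ-5D-5L: {cohens_d(eq5d_better, eq5d_worse):.2f}")
print(f"Cohen's d, SF-6D: {cohens_d(sf6d_better, sf6d_worse):.2f}")
```

For two groups of sizes n1 and n2 scored on the same patients, F equals d² × n1 × n2 / (n1 + n2), so RE values and effect sizes are expected to rank the two instruments in the same order.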

All statistical analyses were performed using STATA (release 11.2; Stata Corp, College Station, TX, USA) statistical software, with p < 0.05 being considered significant.

Results

Characteristics of patients and the HRQOL scores

A total of 150 patients with ESRD and on dialysis participated in this study, including 75 on HD and 75 on PD. Demographic and clinical characteristics are shown in Table 1. Patients’ mean age was 60.1 years; nearly half of them had a CCI > 5 (47.3 %). The mean duration of dialysis, either HD or PD, was 5.65 years. The range of the dialysis adequacy (i.e., Kt/V) in HD and PD patients was 0.68–2.30/dialysis and 0.38–4.58/week, respectively, and the mean hemoglobin level was 11.2 g/dl.

Table 1 Patients’ characteristics

Distributions of the EQ-5D-5L and SF-6D are displayed in Table 2. The EQ-5D-5L score ranged from −0.59 to 1, with 27.3 % of subjects reporting perfect health. In contrast, the SF-6D score ranged from 0.37 to 1, with 2.7 % of respondents scoring the highest value. The EQ-5D-5L was skewed towards perfect health, whereas the distribution of the SF-6D was normal (Fig. 1). Although the mean scores for the EQ-5D-5L (0.68) and the SF-6D (0.70) were similar, the ICC between the EQ-5D-5L and the SF-6D utility scores was 0.36. The Bland–Altman plot demonstrated a wide limits-of-agreement interval (i.e., 1.21), and the EQ-5D-5L scores were systematically lower than the SF-6D scores in subjects with lower utility scores (Fig. 2).

Table 2 Descriptive statistics of EQ-5D-5L and SF-6D utility scores

Fig. 1 Distribution of EQ-5D-5L and SF-6D utility scores

Fig. 2 Bland–Altman plot of difference in utility scores between the SF-6D and the EQ-5D-5L

Construct validity

Pearson’s correlation coefficient between the EQ-5D-5L and the SF-6D was 0.53, indicating a strong correlation. As expected, patients with less co-morbidity (CCI ≤ 5), higher hemoglobin level (>11 g/dl), or higher Kt/V (HD: >1.45/dialysis; PD: >2.35/week) had higher mean EQ-5D-5L and SF-6D scores than patients with more co-morbidity, lower hemoglobin level, or lower dialysis adequacy although statistical significance was not achieved in some of the comparisons (Table 3). Both the EQ-5D-5L and SF-6D indices differentiated between subjects with different KDQOL-36 kidney disease-specific scale scores; the mean EQ-5D-5L and SF-6D scores were higher for the patients with higher scale scores, which was in line with expectations (Table 3).

Table 3 Known-groups validity and sensitivity of EQ-5D-5L and SF-6D

Sensitivity

Using the EQ-5D-5L index as the reference, the RE values of the SF-6D were more than 1 for comparison of known groups defined by hemoglobin level and the three KDQOL-36 kidney disease-specific scales, while the RE values were less than 1 in the known group comparisons of patients with differing CCI and Kt/V levels (Table 3). The effect sizes showed the same trend in the relative sensitivity of the two instruments (Table 3).

The EQ-5D-5L (range, 0.03–0.23) showed greater differences in utility between the known groups than the SF-6D (range, 0.002–0.12) (Table 3). For example, the differences in the mean scores of the EQ-5D-5L and SF-6D for patients with fewer dialysis-related symptoms and those with more symptoms were 0.23 and 0.12, respectively (Table 3).

Discussion

In this study, we found that the preference-based EQ-5D-5L and SF-6D instruments were both valid in patients with ESRD but the two utility measures were sensitive to different outcomes, were not interchangeable, and gave different estimates for utility differences between groups of patients. These findings highlighted the importance of in-depth investigation of different preference-based HRQOL instruments in the outcomes research of ESRD.

Both measures demonstrated known-groups validity as the mean utility scores differed in the expected directions between subgroups of patients in better and worse health status. Correlation between the EQ-5D-5L and the SF-6D was strong (≥0.5), similar to that found in previous studies of HD patients [13] and other patient groups [3, 10, 36]. Moreover, similar to previous studies [4, 7, 37], agreement between the EQ-5D-5L and SF-6D scores was poor and the EQ-5D-5L tended to generate lower scores than the SF-6D for subgroups with poorer health, suggesting that the two index scores cannot be used interchangeably.

Overall, both instruments were able to discriminate between different patient groups. The SF-6D was superior to the EQ-5D-5L in differentiating patients with different levels of self-reported health outcomes measured using the KDQOL-36 scales while the EQ-5D-5L was more sensitive to clinical outcomes, such as comorbid conditions and Kt/V.

The greater sensitivity of the SF-6D compared with the EQ-5D-5L to kidney disease-related HRQOL could be due to two reasons. First, the recall periods of the SF-12 items on which the SF-6D was based (“last 4 weeks”) and of the KDQOL items are identical, while the EQ-5D-5L items assess health problems on the day of the survey (“today”). Second, the SF-12 was administered as a component of the KDQOL-36 instrument in the study, which means the responses to the SF-12 items and the kidney disease items could be more strongly correlated due to a context effect or an order effect [32]. Hence, the better sensitivity of the SF-6D observed in this study should be interpreted with caution.

The greater sensitivity of the EQ-5D-5L than the SF-6D to clinical outcomes was consistent with the finding from a cross-sectional study of the general Spanish population [6]. However, findings from studies of English and US general populations [7, 8], hearing-impaired adults, and liver transplant patients [9, 10] showed that the SF-6D derived from the SF-36 was more discriminative than the EQ-5D. There are two possible reasons for these seemingly contradictory findings. First, the sensitivity of the EQ-5D-5L and the SF-6D instruments may be population-specific. Because of the differences in levels and dimensions, it is possible that one measure is more sensitive than the other in one population but less sensitive in another. Second, like the study of the general Spanish population [6], our study used the SF-12-derived SF-6D, which has been found to have inferior discriminative power compared with the SF-36-derived SF-6D [8].

The greater differences between ESRD patients in differing health shown by the EQ-5D-5L as compared to the SF-6D in our study were consistent with results from previous cross-sectional studies of other patient groups [4, 38, 39]. Moreover, the EQ-5D also exhibited greater utility gains than the SF-6D in longitudinal studies [37, 40, 41]. These results suggest that the use of the EQ-5D instrument, as opposed to the SF-6D instrument, in cost–utility analysis may lead to more favorable estimates of (incremental) effects and therefore more attractive incremental cost-effectiveness ratios (ICERs) and greater chances of adopting more expensive but also more effective treatment alternatives [42, 43]. The choice between the two preference-based HRQOL instruments for economic evaluations should therefore be carefully justified, as it may have an important impact on decisions based on the results of such evaluations.
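The arithmetic behind this point can be shown with a small sketch. It is purely illustrative: the incremental cost and the one-year time horizon are hypothetical assumptions, and the two utility differences (0.23 and 0.12) are the cross-sectional known-group differences reported above, borrowed here as stand-ins for a treatment effect rather than as estimates of one.

```python
def icer(delta_cost, delta_qaly):
    """Incremental cost-effectiveness ratio: extra cost per QALY gained."""
    if delta_qaly <= 0:
        raise ValueError("incremental QALYs must be positive for this sketch")
    return delta_cost / delta_qaly


# HYPOTHETICAL inputs, for illustration only.
incremental_cost = 10_000.0   # assumed extra cost of the new intervention
years = 1.0                   # assumed duration over which the gain is held

# Utility differences reported for the symptom-defined known groups.
utility_gain = {"EQ-5D-5L": 0.23, "SF-6D": 0.12}

for instrument, gain in utility_gain.items():
    qalys = gain * years      # QALYs gained = utility gain x time
    print(f"{instrument}: ICER = {icer(incremental_cost, qalys):,.0f} per QALY")
```

With the same incremental cost, the instrument that yields the larger utility difference yields the smaller cost per QALY, which is why the choice of instrument can move an intervention across a cost-effectiveness threshold.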

One possible reason for this difference is that the full range of EQ-5D-5L scores (i.e., 1.59) is more than double that of the SF-6D (i.e., 0.71). A previous study found that utility measures with a narrower scale range were more likely to yield smaller differences in health utility [44]. In our study, the greater difference in utility scores between the known groups according to the EQ-5D-5L as compared to the SF-6D arose mainly because the EQ-5D-5L scores were much lower than the SF-6D scores for the worse groups, suggesting that the SF-6D might have overestimated utility in very poor health states because of the relatively high lower bound of its scale. Indeed, previous studies found that the SF-6D produced higher utility estimates than the EQ-5D in patients with inflammatory arthritis [40, 45]. Therefore, the EQ-5D-5L might be more suitable than the SF-6D for studying patients in very poor health status.

Our results need to be interpreted in light of several study limitations. First, the EQ-5D-5L was scored using a “crosswalk” method in which the EQ-5D-5L health states are mapped to the EQ-5D-3L values [46]. EQ-5D-5L utility values directly elicited from the general population, which may be available soon, might produce different results. Second, our findings may not be generalizable to the EQ-5D-3L since the EQ-5D-5L was found to be more sensitive than the EQ-5D-3L in previous studies [19–21]. Last, we were not able to assess the sensitivity of the two instruments to change in health status in this cross-sectional study.

In conclusion, both the EQ-5D-5L and the SF-6D are valid and sensitive health utility measures for assessing ESRD patients. However, it appears that the EQ-5D-5L would lead to more favorable cost-effectiveness results than the SF-6D when they are used to quantify health benefits in economic evaluations.