Introduction

The EQ-5D-3L, the original version of the EQ-5D, is an instrument widely used to measure and evaluate general health status [1–4]. The EQ-5D-3L descriptive system describes general health in terms of five dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each dimension has three levels, indicating no problems, some or moderate problems, and extreme problems, resulting in a total of 243 (i.e., \( 3^{5} \)) unique health states. The EQ-5D-3L provides a simple descriptive profile and a single index for health status that can be used in the clinical and economic evaluation of health care, as well as in population health surveys [5].

The EQ-5D-3L has good psychometric properties and is able to detect small changes in chronic diseases [6, 7]. However, there is some evidence that the SF-6D derived from the SF-36 is more discriminative than the EQ-5D-3L index [8, 9], although Cunillera et al. [10] showed that the SF-6D derived from the SF-12 was less discriminative than the EQ-5D-3L index. In addition, the EQ-5D-3L lacks descriptive richness when compared with other generic preference-based instruments, including the Health Utilities Index Mark 2 and 3 (HUI2 and HUI3) and the Short Form 6D (SF-6D), which define 24,000, 972,000, and 18,000 unique health states, respectively [11]. Moreover, the EQ-5D-3L suffers from ceiling effects [8, 9, 12].

An expanded descriptive system, with more response categories (i.e., levels) per dimension, may improve the ability of the EQ-5D to reliably discriminate among different levels of health and to detect changes in health [13]. The EuroQol Group designed a new questionnaire, the EQ-5D-5L (i.e., the five-level version of the EQ-5D), to improve the sensitivity and reduce the ceiling effects of the EQ-5D-3L. At present, versions of the EQ-5D-5L in 57 languages are available at the EuroQol website (http://www.euroqol.org), and studies have compared the psychometric properties of the EQ-5D-3L and EQ-5D-5L [11, 13]. However, these studies used a prototype version of the EQ-5D-5L, not the official version.

We therefore aimed to assess the response redistribution of EQ-5D-3L when using the EQ-5D-5L. In addition, we compared the psychometric properties of the EQ-5D-3L and EQ-5D-5L from the perspective of informativity, validity, and reliability.

Methods

Subjects and settings

This study was approved by the Institutional Review Board of Asan Medical Center (approval number: 2010-0546), and all participants provided written informed consent.

We consecutively recruited 901 cancer patients aged over 18 years who were receiving chemotherapy at an ambulatory cancer center in Korea over a 1-month period. Patients were excluded if they had a performance status of 4 on the ECOG Scale, or if there were missing or duplicate responses on the EQ-5D-3L or EQ-5D-5L questionnaire. Participants filled in the questionnaire just before or during chemotherapy.

To assess reliability, 250 patients selected by convenience from the first survey subjects were asked to complete a brief retest questionnaire, which included the EQ-5D-3L and EQ-5D-5L but not the EORTC QLQ-C30, at an interval of 1–4 weeks, and to return it by mail.

Information

ECOG performance scale was evaluated by one research nurse. General characteristics, including gender, age, and clinical information, were obtained from the cancer registry in the center.

Every participant completed the Korean versions of the EORTC QLQ-C30, the EQ-5D-5L, and the EQ-5D-3L, consecutively. The Korean version of the EQ-5D-3L was previously validated [14]. The EQ-5D-5L used in this study was the official version provided by EuroQoL. Its dimensions are the same as those of the EQ-5D-3L, but each has five response levels. The descriptors for levels 1, 3, and 5 on the EQ-5D-5L were similar to the wording for levels 1, 2, and 3, respectively, on the EQ-5D-3L, but not identical. For instance, ‘some’ at level 2 of the EQ-5D-3L was changed to ‘moderate’ or ‘moderately’ at level 3 of the EQ-5D-5L. Level 3 in the mobility domain of the EQ-5D-3L was described as ‘I am confined to bed’, whereas level 5 in the mobility domain of the EQ-5D-5L was described as ‘I am unable to walk about’. Level 2 on the EQ-5D-5L was labeled ‘slightly’ for anxiety/depression and ‘slight’ for the remaining four dimensions. Level 4 on the EQ-5D-5L was labeled ‘severely’ for anxiety/depression and ‘severe’ for the other four dimensions. Further detailed comparisons between the two EQ-5D instruments are available at http://www.euroqol.org.

The EQ-5D-3L index was calculated using the valuation set from Lee et al. [15], whereas the EQ-5D-5L index was calculated by applying the indirect interim mapping method presented by the EuroQoL group at the 13th EU ISPOR meeting. According to the files provided by the EuroQoL group, the EQ-5D-5L Crosswalk Project was conducted in 3,691 patients across six countries. Using these data, they obtained the transition probability matrix between the EQ-5D-3L and EQ-5D-5L states. The EQ-5D-5L index for a given state was then calculated by multiplying the transition probability from each EQ-5D-3L state to that EQ-5D-5L state by the utility weight of the EQ-5D-3L state, and summing over all EQ-5D-3L states, as expressed by the following equation for the EQ-5D-5L state ‘21243’:

$$ {\text{EQ - }}5{\text{D - }}5{\text{L}}_{21243} = \sum\limits_{a = 1}^{3} {\sum\limits_{b = 1}^{3} {\sum\limits_{c = 1}^{3} {\sum\limits_{d = 1}^{3} {\sum\limits_{e = 1}^{3} {(P_{3L\_abcde \to 5L\_21243} \times U_{3L\_abcde} )} } } } } $$

\( P_{3L\_abcde \to 5L\_21243} \): transition probability from the ‘abcde’ state in EQ-5D-3L to ‘21243’ state in EQ-5D-5L

\( U_{3L\_abcde} \): utility weight in the ‘abcde’ state in EQ-5D-3L
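As a minimal sketch of the weighted sum above, the following Python function computes the index of one EQ-5D-5L state from a set of transition probabilities and EQ-5D-3L utility weights. The function name, the dictionary layout, and the toy numbers are assumptions made for illustration only; they are not the EuroQoL crosswalk files or the Korean valuation set.

```python
from itertools import product


def crosswalk_index(transition_probs, utilities_3l):
    """Probability-weighted mean of 3L utilities for a single 5L state.

    transition_probs: dict mapping a 3L state string (e.g. '11122') to the
        transition probability between that 3L state and the 5L state of
        interest; 3L states that are absent are treated as probability 0.
    utilities_3l: dict mapping every 3L state string to its utility weight.
    """
    total = 0.0
    # Loop over all 3^5 = 243 EQ-5D-3L states 'abcde' with a..e in {1, 2, 3}.
    for levels in product("123", repeat=5):
        state = "".join(levels)
        total += transition_probs.get(state, 0.0) * utilities_3l[state]
    return total


# Hypothetical toy utility: 1 minus 0.05 per level above 1 (NOT a real value set).
toy_utilities = {
    "".join(s): 1.0 - 0.05 * sum(int(ch) - 1 for ch in s)
    for s in product("123", repeat=5)
}
# Hypothetical transition probabilities for one 5L state (they sum to 1).
toy_probs = {"11122": 0.6, "21122": 0.4}

index_5l = crosswalk_index(toy_probs, toy_utilities)  # 0.6 * 0.90 + 0.4 * 0.85
```

With real inputs, the same function would be called once per EQ-5D-5L state, each with its own row of the transition probability matrix.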

The EORTC QLQ-C30 is an integrated system for assessing the HRQoL of cancer patients. It includes five functional scales (i.e., physical, role, emotional, cognitive, and social), three symptom scales, a global health status, and a number of single items [16].

All HRQoL instruments were self-administered; if necessary, the research nurse assisted participants in completing the questionnaires. Respondents could change their answer before submitting it to the research nurse.

Analyses

Response redistribution

Missing or multiple answers on any item in the EQ-5D-5L or EQ-5D-3L were excluded from analyses. Redistribution properties were described as proportions of the 3L-5L response pairs within each 3L response level, and the corresponding mean and median VAS scores were calculated for each subgroup of paired responses, except for inconsistent pairs. Inconsistency and its size were defined as in Janssen et al. [11]. Briefly, after projecting the 3L response scale on a 5L response scale (i.e., producing 3L5L by recoding 1 = 1, 2 = 3, 3 = 5), the size of inconsistency was calculated as |3L5L − 5L| − 1. If the size of inconsistency was zero or less, the response pair was considered consistent. For example, when level 1 in the EQ-5D-3L was redistributed to level 1 or 2 in the EQ-5D-5L, it was considered a consistent response. However, if the response was redistributed to level 3, 4, or 5 in the EQ-5D-5L, it was considered an inconsistent response, and the size of inconsistency was 1, 2, and 3, respectively.
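The inconsistency rule can be written as a short Python sketch. The function and constant names are hypothetical; the recoding (1 = 1, 2 = 3, 3 = 5) and the formula |3L5L − 5L| − 1 follow the definition given above.

```python
# Projection of a 3L level onto the 5L scale: 1 -> 1, 2 -> 3, 3 -> 5.
RECODE_3L_TO_5L = {1: 1, 2: 3, 3: 5}


def inconsistency_size(level_3l, level_5l):
    """Return |3L5L - 5L| - 1 for one 3L-5L response pair."""
    projected = RECODE_3L_TO_5L[level_3l]
    return abs(projected - level_5l) - 1


def is_consistent(level_3l, level_5l):
    """A pair is consistent when its inconsistency size is zero or less."""
    return inconsistency_size(level_3l, level_5l) <= 0


# Level 1 on the 3L paired with each 5L level, as in the example in the text.
sizes_for_3l_level_1 = [inconsistency_size(1, k) for k in range(1, 6)]
```

For a 3L response of level 1, the sizes for 5L levels 1 to 5 are −1, 0, 1, 2, and 3, so only the pairs with 5L level 1 or 2 count as consistent.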

Informativity

The ceiling effect was calculated as the proportion of ‘no problem’ responses on each dimension and the proportion of ‘no problem’ in all dimensions. Reduction in the ceiling effect suggested enhancement of discriminant ability. Informativity was assessed using the Shannon entropy (Shannon index) and the information efficiency (Shannon evenness index) [17, 18].
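The two ceiling-effect proportions described above can be sketched in Python as follows. The representation of each respondent as a five-character profile string, the dimension labels, and the toy data are assumptions for illustration.

```python
DIMENSIONS = (
    "mobility",
    "self_care",
    "usual_activities",
    "pain_discomfort",
    "anxiety_depression",
)


def ceiling_effects(profiles):
    """Return (per-dimension, overall) proportions of 'no problem' responses.

    profiles: list of five-character strings such as '11121', one per
        respondent, with one digit per dimension in the order of DIMENSIONS.
    """
    n = len(profiles)
    per_dimension = {
        dim: sum(1 for p in profiles if p[i] == "1") / n
        for i, dim in enumerate(DIMENSIONS)
    }
    # Overall ceiling: respondents reporting level 1 on all five dimensions.
    overall = sum(1 for p in profiles if p == "11111") / n
    return per_dimension, overall


# Hypothetical responses from four respondents.
per_dim, overall = ceiling_effects(["11111", "11121", "21232", "11111"])
```

The same function applies to both instruments, because level 1 denotes ‘no problem’ on the 3L and the 5L alike.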

The Shannon entropy is calculated as

$$ H^{\prime} = - \sum\limits_{i = 1}^{C} {P_{i} \log_{2} P_{i} } $$

where H′ is the absolute amount of informativity captured, C is the number of possible categories (or levels in this study), and \( P_{i} = n_{i}/N \) is the proportion of observations in the ith category (i = 1,…,C), where \( n_{i} \) is the observed number of scores (responses) in category i and N is the total sample size. In case of an even distribution (i.e., if all levels are evenly filled), the optimal amount of information is captured and the Shannon entropy reaches its upper limit (H′max), given by H′max = \( \log_{2} C \), which amounts to 1.58 on the EQ-5D-3L and to 2.32 on the EQ-5D-5L. H′max increases as the number of levels increases. Nevertheless, the empirical informativity H′ will increase only if the newly added categories are actually used. The Shannon entropy thus combines the number of categories defined by a system with the extent to which the information is evenly spread over those categories. The information efficiency reflects the evenness of a distribution, regardless of the number of levels. The information efficiency measure J′ = H′/H′max describes the use of a system (H′), given its potential (H′max). The Shannon entropy H′ can therefore be considered an expression of the absolute informativity of a system, whereas the information efficiency J′ expresses only the relative informativity of a system, regardless of the number of categories [4].
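For illustration, the Shannon entropy H′ (with the conventional negative sign, so that H′ is non-negative) and the information efficiency J′ can be computed for one dimension as in the Python sketch below. The function names and the toy responses are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(responses):
    """H' = -sum(P_i * log2(P_i)) over the categories that were actually used.

    Unused categories contribute nothing, since P_i * log2(P_i) tends to 0
    as P_i tends to 0.
    """
    n = len(responses)
    counts = Counter(responses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_efficiency(responses, n_levels):
    """J' = H' / H'max, where H'max = log2(number of levels)."""
    return shannon_entropy(responses) / math.log2(n_levels)


# Hypothetical responses on one dimension of a three-level instrument.
even = [1, 2, 3, 1, 2, 3]    # evenly filled: H' reaches H'max, so J' = 1
skewed = [1, 1, 2, 2]        # level 3 never used: H' = 1 bit

h_even = shannon_entropy(even)
j_even = information_efficiency(even, n_levels=3)
j_skewed = information_efficiency(skewed, n_levels=3)
```

Passing `n_levels=5` for the EQ-5D-5L raises H′max to about 2.32, so the same observed H′ yields a lower J′ unless the added levels are actually used.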

Validity

Convergent validity of the EQ-5D-5L and EQ-5D-3L was examined by comparing the EQ-VAS score, ECOG performance status, and EORTC QLQ-C30 subscales within each dimension. We assumed that each dimension in both EQ-5D instruments would be more highly correlated with the related subscales than with other subscales in the EORTC QLQ-C30 (e.g., mobility in the EQ-5D is more likely to correlate with physical function than with emotional function in the EORTC QLQ-C30). We also hypothesized that the correlation between the EQ-5D-5L and other measures would be similar to or higher than the correlation between the EQ-5D-3L and other measures. These assumptions were examined using Spearman’s rank correlation coefficient. Fisher’s Z transformations were used to determine whether correlations between EQ-5D versions and other instruments differed significantly [19].
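A comparison of two correlation coefficients via Fisher’s transformation can be sketched as follows. This is only one common variant, the test for two independent correlations, and it is an assumption here: the exact procedure is not specified in the text, and correlations measured on the same respondents are dependent, which this simple form ignores. The function names and inputs are illustrative.

```python
import math


def fisher_z(r):
    """Fisher's transformation z = atanh(r) = 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)


def compare_correlations(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two correlations.

    Assumes the two correlations come from independent samples of sizes
    n1 and n2 (a simplifying assumption for this sketch).
    """
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z_stat = (fisher_z(r1) - fisher_z(r2)) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z_stat) / math.sqrt(2.0))
    return z_stat, p_value


# Illustrative inputs only: two correlations of similar size, n = 893 each.
z_stat, p_value = compare_correlations(0.52, 893, 0.55, 893)
```

A test that accounts for the dependence between overlapping correlations would be more appropriate when both coefficients are estimated on the same sample.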

For known-group construct validity, both EQ-5D-3L and EQ-5D-5L indexes were calculated by performance status, age-group, and VAS score quartile. We assumed that both EQ-5D-3L and EQ-5D-5L indexes would be lower in groups with higher ECOG performance scores, in older than in younger patients, and in the lower VAS score groups than in the higher VAS score groups.

Test–retest reliability

The agreement of each dimension on the EQ-5D instruments was evaluated by kappa and weighted kappa statistics. The larger the number of scale categories, the greater the potential for disagreement. Therefore, we calculated weighted kappa statistics in parallel, which use weights to quantify the relative difference between categories.

We applied Fleiss’s standards for strength of agreement for the kappa values, as follows: <0.4 = poor, 0.4–0.75 = fair to good, >0.75 = excellent [20]. Test–retest reliability of both EQ-5D indexes was evaluated by the intraclass correlation coefficient (ICC, two-way random effects, absolute agreement).
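The kappa and weighted kappa statistics, together with the Fleiss cut-offs quoted above, can be sketched in Python as below. The weighting scheme is not stated in the text, so linear weights are an assumption of this sketch, as are the function names and the toy data.

```python
def cohen_kappa(test, retest, n_levels, weighted=False):
    """Cohen's kappa for paired ratings coded 1..n_levels.

    With weighted=True, linear disagreement weights |i - j| / (n_levels - 1)
    are used (an assumed scheme). The statistic is undefined when all
    ratings fall in a single category (expected disagreement is zero).
    """
    n = len(test)
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for a, b in zip(test, retest):
        observed[a - 1][b - 1] += 1.0 / n
    row = [sum(observed[i]) for i in range(n_levels)]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]

    def disagreement(i, j):
        if weighted:
            return abs(i - j) / (n_levels - 1)
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(n_levels) for j in range(n_levels)]
    observed_dis = sum(disagreement(i, j) * observed[i][j] for i, j in cells)
    expected_dis = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    return 1.0 - observed_dis / expected_dis


def fleiss_label(kappa):
    """Strength of agreement: <0.4 poor, 0.4-0.75 fair to good, >0.75 excellent."""
    if kappa < 0.4:
        return "poor"
    if kappa <= 0.75:
        return "fair to good"
    return "excellent"


# Hypothetical test and retest responses on one three-level dimension.
kappa = cohen_kappa([1, 1, 2, 2], [1, 2, 2, 2], n_levels=3)
label = fleiss_label(kappa)
```

Because weighted kappa gives partial credit to near misses, it is less sensitive than unweighted kappa to the extra opportunities for adjacent-level disagreement on a five-level scale.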

All statistical analyses were performed using SAS software version 9.1 (SAS Institute Inc., Cary, NC), and differences were considered statistically significant if the P-value was less than 0.05.

Results

Subjects

A total of 2,316 visits for chemotherapy, excluding repeated visits, were recorded during the research period. A research nurse asked the patients to participate in the study. If they consented to join the study, they were asked to fill out the questionnaire. In total, 901 questionnaires were collected. Of these 901, four were duplicates, three had missing or duplicate answers on the EQ-5D-5L or EQ-5D-3L, and one lacked the identification number. Thus, our final analysis set consisted of 893 patients. The final response rate was 38.6% (893/2,316). The distributions of gender and age-group did not differ significantly between candidates and responders (P = 0.054 and P = 0.10, respectively; data not shown). The mean age of the subjects was 53.0 (SD ± 11.2) years, and 56.8% were women. These subjects had 30 different types of cancer, the most frequent being breast (32.9%) and colorectal (20.0%) cancer.

In the second survey, 81 out of 250 patients responded to questionnaires by mail, but three questionnaires were not usable because of missing data. Therefore, responses from 78 subjects were used to analyze reliability. Their mean age was 53.9 (SD ± 10.9) years, and 56.4% were women (Table 1). The mean time interval between the initial and follow-up surveys was 11.5 days (IQR 6–15 days).

Table 1 General characteristics of the study subjects

Response redistribution

Table 2 shows the proportions of the EQ-5D-3L and EQ-5D-5L response pairs within each 3L response level, and the mean and median VAS values for each subgroup with consistent responses. The mean and median VAS values tended to decrease as the 3L-5L response pairs in each dimension moved from 3L1–5L1 (subjects who selected level 1 in the EQ-5D-3L and level 1 in the EQ-5D-5L) to 3L3–5L5 (subjects who selected level 3 in the EQ-5D-3L and level 5 in the EQ-5D-5L), that is, from the most to the least healthy subgroup. There was substantial partitioning of level 2 in the EQ-5D-3L, and the majority of level 3 responses in all dimensions of the EQ-5D-3L moved to level 4 of the EQ-5D-5L. The proportion of inconsistent responses ranged from 2.4% for anxiety/depression to 4.5% for usual activities. The average size of inconsistency was highest (1.21) for self-care and lowest (1.06) for pain/discomfort (Table 3).

Table 2 Response redistribution from the EQ-5D-3L to the EQ-5D-5L, by dimension and by level in consistent responses
Table 3 Inconsistent responses distributed from the EQ-5D-3L to the EQ-5D-5L

Informativity

We found that 150 respondents (16.8%) on the EQ-5D-3L and 87 (9.7%) on the EQ-5D-5L reported no problems on all dimensions. Eighty of the 150 respondents also reported health state 11111 on the EQ-5D-5L, whereas the other 70 reported other health states on the EQ-5D-5L. The mean VAS score was 84.3 in the former group and 78.1 in the latter; the difference was statistically significant (P = 0.007). In both EQ-5D instruments, the self-care dimension showed the highest ceiling effect, whereas pain/discomfort showed the lowest ceiling effect. The proportions of respondents reporting ‘no problems’ across dimensions decreased on the EQ-5D-5L, compared with the EQ-5D-3L. The mobility domain showed the largest reduction, from 65.1% in the EQ-5D-3L to 54.8% in the EQ-5D-5L (Table 4). The difference in the ceiling effect between the EQ-5D-3L and EQ-5D-5L was statistically significant in all domains, except self-care.

Table 4 Proportion of ‘no problem’ responses on EQ-5D-3L and EQ-5D-5L

Table 5 shows the informativity results of the EQ-5D-3L and EQ-5D-5L. The EQ-5D-5L consistently showed a higher Shannon entropy than the EQ-5D-3L, with an average difference of 0.48. Compared with the EQ-5D-3L, information efficiency (J′) in the EQ-5D-5L improved in the mobility and usual activities domains, whereas it declined in the self-care and anxiety/depression domains. The percentage gain in information efficiency ranged from −9.1% (self-care) to 8.5% (mobility).

Table 5 Shannon entropy (H′) and information efficiency (J′) for the EQ-5D-3L and EQ-5D-5L

Validity

Table 6 shows correlations by dimension between EQ-5D, ECOG performance status, and EORTC QLQ-C30 subscales. The correlations between EQ-VAS and EQ-5D-5L responses tended to be slightly stronger than correlations between EQ-VAS and EQ-5D-3L responses, across all dimensions. Similarly, correlations between the EQ-5D-5L and EORTC QLQ-C30 subscales were slightly stronger than those between the EQ-5D-3L and EORTC QLQ-C30 subscales. None of these differences, however, was statistically significant. Spearman’s rank correlation coefficients between the EQ-5D-3L and EQ-5D-5L ranged from 0.70 for usual activities to 0.77 for anxiety/depression. The Pearson’s correlations between VAS and EQ-5D indexes were 0.52 for the EQ-5D-3L and 0.55 for the EQ-5D-5L (data not shown). In both EQ-5D instruments, mobility was more highly correlated with physical function than with other subscales on the EORTC QLQ-C30, whereas anxiety/depression was more highly correlated with emotional function of the EORTC QLQ-C30.

Table 6 Convergent validity of the EQ-5D-5L and EQ-5D-3L by VAS score, ECOG performance status, and QLQ-C30 scale

Aspects of the known-group construct validity of EQ-5D-3L and EQ-5D-5L indexes by ECOG performance status, age-group, and VAS score quartiles are shown in Table 7. Both EQ-5D indexes had similar values and tended to decline as ECOG scores increased, age increased, and VAS score decreased. The VAS quartile variable explained 20.9 and 25.3% of the EQ-5D-3L index and EQ-5D-5L index, respectively, and the explanatory power increased approximately 5% in both indexes when continuous VAS scale was applied.

Table 7 EQ-5D indexes by ECOG performance status, age-group, and VAS score

Test–retest reliability

Agreement by kappa on both EQ-5D instruments was fair to good in four dimensions, but not for usual activities. Kappa statistics on the EQ-5D-5L varied from 0.36 to 0.64 across dimensions. Kappa statistics of the EQ-5D-5L were slightly lower than those of the EQ-5D-3L, whereas the weighted kappa of the EQ-5D-5L tended to be slightly higher than that of the EQ-5D-3L. The difference in kappa statistics between the two EQ-5D instruments was not statistically significant. ICCs of the EQ-5D-3L and EQ-5D-5L indices were 0.75 and 0.77, respectively (Table 8).

Table 8 Test–retest reliability on the EQ-5D-3L and EQ-5D-5L

Discussion

We found that the three levels on the EQ-5D-3L were substantially redistributed among the five levels on the EQ-5D-5L, with the majority of level 3 responses on the EQ-5D-3L rearranged to level 4 on the EQ-5D-5L. The proportion of respondents reporting ‘no problem’ on the EQ-5D-5L ranged from 27.3% for pain to 80.0% for self-care. Full health (11111) decreased significantly, from 16.8% in the EQ-5D-3L to 9.7% in the EQ-5D-5L. Ceiling effects on the EQ-5D-5L were still present, but were considerably decreased compared with the EQ-5D-3L, except for the self-care dimension. Although the ceiling effect of the EQ-5D-3L in our research was lower than previously reported in other studies [21–23], the ceiling effect of the EQ-5D-5L was still lower than that of the EQ-5D-3L.

Not surprisingly, the Shannon entropy was higher for the EQ-5D-5L than for the EQ-5D-3L. The increased Shannon entropy suggested that the EQ-5D-5L was able to better describe various health states and that the expanded levels were actually used by respondents. The average information efficiency also improved slightly, from 63.1% in the EQ-5D-3L to 63.8% in the EQ-5D-5L; however, the change in information efficiency varied by domain. This means that the evenness of the distribution was enhanced overall in the EQ-5D-5L, but that the effect differed depending on the domain. Our findings, showing that the EQ-5D-5L had greater absolute informativity and a lower ceiling effect than the EQ-5D-3L, are consistent with previous results [11, 13].

The proportion of inconsistencies among our respondents averaged 3.5%. This was higher than the average of 1.1% observed in hypothetical situations when the subjects were familiar with the EQ-5D instrument [11], but lower than the 4.3% of inconsistent responses reported in a previous study [13]. We found that only one person responded inadequately on the EQ-5D-5L. In comparison, a study performed in Singapore excluded 30 (3.7%) out of 803 patients because of missing values [23]. Similarly, the number of missing items from the Welsh study was 4 (0.33%) [24], which was higher than the values on both the EQ-5D-5L (0.02%) and EQ-5D-3L (0.13%) in our study. For convergent validity in Korean cancer patients, the EQ-5D-5L showed stronger correlations with cancer-specific instruments than the EQ-5D-3L; however, the difference was not statistically significant. The association between EQ-5D instruments and other measures showed similar results. For known-group construct validity, we observed decreases in the EQ-5D-5L index by performance status and age-group. The ECOG variable explained 26.1 and 23.7% of the EQ-5D-3L index and EQ-5D-5L index, respectively, when univariate regression analysis was applied. These findings supported previous favorable evidence regarding the validity of the EQ-5D-5L [1, 13]. For example, a study on cervical cancer patients in Taiwan showed similar construct validity [2].

When we assessed reliability, we observed variations in agreement across dimensions. Not surprisingly, when using kappa statistics, the reliability of EQ-5D-3L was slightly better than that of EQ-5D-5L. When using weighted kappa, however, the reliability of EQ-5D-5L was similar or better than that of EQ-5D-3L. The intraclass correlation coefficient was slightly better for the EQ-5D-5L than for the EQ-5D-3L. Compared with the former validation study of the Korean EQ-5D-3L, we found that agreement was slightly decreased, whereas ICC was almost the same [14]. Janssen et al. [11] showed that EQ-5D-5L had generally better inter-observer and test–retest reliability than the EQ-5D-3L. The Taiwan study reported that the ICC for the EQ-5D-3L was 0.83 and the kappa values for the EQ-5D dimensions ranged from 0.54 to 0.73 [2]. Our values were slightly lower, but still acceptable. Our participants were receiving chemotherapy at the initial survey point, so their condition may have changed during the second survey. In the second survey, we did not collect other information about stability in the subjects’ health. Therefore, interpretation of reliability in our study was limited.

This study had several limitations. We examined the redistribution properties of the EQ-5D-5L in cancer patients. We also analyzed additional datasets to examine redistribution properties by type of cancer. Both breast and colorectal cancer patients showed similar distributions of matching pairs. However, it may not be possible to generalize our findings to non-cancer patients, because the proportion of problems reported by cancer patients may differ from those in the general population and in patients with other chronic conditions [22, 25]. In addition, we used experimental interim value sets for the EQ-5D-5L. We intended to recalculate our results once the EQ-5D-5L valuation algorithm was formulated, but the algorithm was in the pre-final stage; therefore, minor changes could still be made by the EuroQoL group. In the supplementary analysis, when we used the crude summary score transformed to a 0–100 scale in both EQ-5D versions, the known-group validity showed similar trends in both instruments, and the ICC for the crude summary score in the EQ-5D-5L was higher than that of the EQ-5D-3L. Further research is required to determine the psychometric properties of the EQ-5D-5L, in particular the ceiling effect, in the general population, and the responsiveness and reliability of the EQ-5D-5L in stable cancer patients.

In conclusion, our findings showed that the EQ-5D-5L had greater informativity and a lower ceiling effect than the EQ-5D-3L. Furthermore, the EQ-5D-5L showed good construct validity and reasonable reliability. Therefore, considering these findings, the EQ-5D-5L may be preferable to the EQ-5D-3L.