Introduction

The EQ-5D is an established health-related quality of life (HRQoL) instrument, used frequently in both clinical trials and health services research [1]. The validity of the Chinese EQ-5D has been assessed in mainland China [2–4] and elsewhere [5–7]; its reliability, however, is not well reported. We evaluated the reliability and validity of the EQ-5D in a sample of the general population in urban China.

Methods

Sample and study design

The survey was conducted in Hangzhou, China, in 2008 using a multi-stage stratified random sampling approach. Nine “Jiedao” (sub-district neighborhoods) were randomly selected, three each from Xiacheng district (central), Gongshu district (sub-central), and Yuhang district (suburban). Two communities from each “Jiedao” and 70 households from each community were then randomly selected. The total sample size was 1,800, with 200 in each “Jiedao”. All residents aged 14 years or older who had lived in a sampled household for at least 6 months were eligible to participate until the quota for each “Jiedao” was met. Participants self-administered a questionnaire containing the Chinese EQ-5D and SF-36. Trained interviewers administered questions regarding the existence of chronic diseases. For the retest, sixty respondents were randomly sampled from those surveyed on the first survey day who were willing to self-administer the EQ-5D and SF-36 again about 2 weeks later. Written consent was obtained from all respondents, and the study was approved by the Zhejiang University School of Medicine Ethics Committee.

The EQ-5D comprises five health dimensions (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression, each with three response categories: no, some, or extreme problems) and a 0–100 point visual analogue scale (EQ VAS) [8]. Responses on the five dimensions can be converted into a utility index score by applying preference weights elicited from the UK general population [9]. The SF-36 is a validated [10, 11] 36-item instrument yielding eight scales and two summary scores. Higher scores indicate better health status.
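As an illustration of the conversion described above, the following minimal Python sketch scores a five-digit EQ-5D health state with an additive tariff. The coefficients are the commonly published UK time trade-off decrements; they are an assumption, are not reported in this study, and should be checked against the source tariff [9] before any use. The function name `uk_index` and the dictionary layout are hypothetical.

```python
# Assumed UK time trade-off decrements for the 3-level EQ-5D (levels 1, 2, 3).
# These values are not given in this paper; verify them against the tariff in [9].
DECREMENTS = {
    "mobility": (0.0, 0.069, 0.314),
    "self_care": (0.0, 0.104, 0.214),
    "usual_activities": (0.0, 0.036, 0.094),
    "pain_discomfort": (0.0, 0.123, 0.386),
    "anxiety_depression": (0.0, 0.071, 0.236),
}
CONSTANT = 0.081  # subtracted once if any dimension is above level 1
N3 = 0.269        # subtracted once if any dimension is at level 3


def uk_index(state: str) -> float:
    """Return the index score for a health state such as '11121'.

    Digits are ordered as mobility, self-care, usual activities,
    pain/discomfort, anxiety/depression.
    """
    if len(state) != 5 or any(ch not in "123" for ch in state):
        raise ValueError("state must be five digits, each 1, 2 or 3")
    levels = [int(ch) for ch in state]
    if all(level == 1 for level in levels):
        return 1.0  # no problems on any dimension
    score = 1.0 - CONSTANT
    for weights, level in zip(DECREMENTS.values(), levels):
        score -= weights[level - 1]
    if 3 in levels:
        score -= N3
    return round(score, 3)
```

Under these assumed coefficients the worst state, `33333`, scores −0.594, which is consistent with the minimum index score of −0.59 reported in the Results.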

Data analysis

Reliability and validity of the EQ-5D were assessed according to established guidelines [12]. To evaluate reliability, we assumed that health status was stable between the two measurements. The percentages of agreement and kappa coefficients for the five dimensions were calculated. Kappa values of 0.2 or below indicate slight agreement, 0.21–0.4 fair, 0.41–0.6 moderate, 0.61–0.8 substantial, and 0.81–1.0 almost perfect agreement [13]. Test–retest reliability of the EQ-5D index and EQ VAS scores was determined using the intraclass correlation coefficient (ICC; two-way mixed-effect model/absolute agreement definition ICC2,1) [14]. An ICC greater than 0.70 is considered appropriate for group comparison [15]. The standard error of measurement (SEMagreement) was used to assess variability, i.e., the absolute measurement error [16, 17]. It was also expressed as a percentage of the measurement range (SEM%) likely to be encountered in actual research [12].
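The reliability statistics named above can be sketched in plain Python as follows. This is a minimal illustration, not the analysis code used in the study: it assumes unweighted Cohen's kappa, the single-measure absolute-agreement ICC computed from two-way ANOVA mean squares, and SEMagreement taken as the square root of the occasion variance plus the residual variance. All function names are hypothetical.

```python
from math import isnan, sqrt


def cohen_kappa(first, second):
    """Unweighted Cohen's kappa for two equal-length lists of categories."""
    n = len(first)
    if n == 0 or n != len(second):
        raise ValueError("need two non-empty lists of equal length")
    categories = set(first) | set(second)
    observed = sum(a == b for a, b in zip(first, second)) / n
    expected = sum((first.count(c) / n) * (second.count(c) / n) for c in categories)
    if expected == 1.0:
        return float("nan")  # everyone gave the same answer: kappa is undefined
    return (observed - expected) / (1.0 - expected)


def agreement_label(kappa):
    """Verbal label using the cut-offs quoted in the text [13]."""
    if isnan(kappa):
        return "undefined"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


def two_way_mean_squares(scores):
    """scores: one row per subject, one column per measurement occasion."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                # between-subject mean square
    msc = ss_cols / (k - 1)                # between-occasion mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return n, k, msr, msc, mse


def icc_agreement(scores):
    """Single-measure, absolute-agreement ICC from two-way mean squares."""
    n, k, msr, msc, mse = two_way_mean_squares(scores)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def sem_agreement(scores):
    """Square root of occasion variance plus residual variance."""
    n, _, _, msc, mse = two_way_mean_squares(scores)
    occasion_var = max((msc - mse) / n, 0.0)  # negative estimates set to zero
    return sqrt(occasion_var + mse)


def sem_percent(sem, low, high):
    """SEM as a percentage of the measurement range [low, high]."""
    return 100.0 * sem / (high - low)
```

For example, `icc_agreement([[80, 82], [90, 88], [70, 74], [60, 60]])` takes four hypothetical respondents with test and retest EQ VAS scores; the data are invented for illustration only.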

To evaluate construct validity, we first assessed convergent and discriminant evidence by examining relationships with the SF-36 using Pearson’s correlation. It was expected that comparable dimensions, e.g., EQ-5D pain/discomfort and SF-36 bodily pain, would correlate more strongly than less comparable dimensions, such as EQ-5D mobility and SF-36 mental health. Pearson’s correlation coefficients of 0.50 or above were regarded as strong, 0.30–0.49 as moderate, and lower than 0.30 as weak [18].
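The thresholds quoted above can be applied as in the following minimal sketch. It assumes the cut-offs refer to the absolute value of the coefficient, because EQ-5D dimension levels increase with problem severity while SF-36 scores increase with better health, so the expected correlations are negative. The function names `pearson_r` and `strength` are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def strength(r):
    """Classify |r| with the cut-offs given in the text [18]."""
    magnitude = abs(r)
    if magnitude >= 0.50:
        return "strong"
    if magnitude >= 0.30:
        return "moderate"
    return "weak"
```

The sign is ignored only for labeling; it should still be inspected to confirm that the association runs in the expected direction.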

Second, construct validity was assessed by comparing the EQ-5D index (both the UK [9] and Japanese preference weights [19] were used) and EQ VAS scores for subgroups of respondents with differing self-reported overall health and number of chronic diseases using ANOVA. It was also expected that older people, females, people widowed or divorced, and those with a lower socioeconomic status would report poorer health [2, 3, 20–22]. The relationships between the EQ-5D and the demographic variables were examined using ANOVA, t-test, or chi-square test. Finally, mean SF-36 summary scores for respondents reporting no problems on any EQ-5D dimension were compared with those for respondents reporting problems using a t-test; higher SF-36 scores were expected in the former group [23].

Results

Among the 1,800 respondents from 1,260 selected households, complete data for all EQ-5D dimensions were available for 1,747 respondents (97%), and these were analyzed for the present study. The estimated response rate was 71.4% (assuming two eligible individuals in each household on average [11]). The mean age was 47.5 years (SD 17.5, range 14–99), and 51.6% were women. Compared with Hangzhou urban area demographic statistics for 2008 [24], our sample had a similar sex ratio, was older, and had higher educational attainment (Table 1).
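The sample counts and the estimated response rate follow from the sampling design by simple arithmetic, sketched below. All figures are taken from the text; the variable names are illustrative only.

```python
# Sampling design figures as stated in the Methods and Results.
JIEDAO = 9
COMMUNITIES_PER_JIEDAO = 2
HOUSEHOLDS_PER_COMMUNITY = 70
ELIGIBLE_PER_HOUSEHOLD = 2   # average number of eligible residents assumed in the text
RESPONDENTS = 1800
COMPLETE_EQ5D = 1747

households = JIEDAO * COMMUNITIES_PER_JIEDAO * HOUSEHOLDS_PER_COMMUNITY
eligible = households * ELIGIBLE_PER_HOUSEHOLD
response_rate = RESPONDENTS / eligible       # respondents over estimated eligible residents
completeness = COMPLETE_EQ5D / RESPONDENTS   # share with complete EQ-5D data

print(f"households sampled: {households}")
print(f"estimated response rate: {response_rate:.1%}")
print(f"complete EQ-5D data: {completeness:.0%}")
```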

Table 1 Characteristics of the study sample (n = 1,747)

EQ-5D response

The majority of respondents reported no problems (ceiling effects), ranging from 78.0% for the pain/discomfort dimension to 96.7% for the self-care dimension (Table 2). The mean EQ-5D index score was 0.92 (SD 0.17, range −0.59 to 1), and the mean EQ VAS score was 84.44 (SD 13.0, range 8.50–100).

Table 2 Distribution of responses to EQ-5D dimensions

Test–retest reliability

In the retest sample, 48 of 60 respondents returned the retest questionnaire; data were analyzed for the 31 respondents whose response to the first SF-36 question (self-reported overall health) was the same at both measurements. The median interval between test and retest was 13 days (interquartile range: 12–15 days). Kappa values for the EQ-5D items on mobility, self-care, usual activities, pain/discomfort, and anxiety/depression were 1.00, 0.65, 0.87, 0.35, and 0.63, respectively. The ICCs for test–retest reliability were 0.53 for the EQ-5D index score and 0.87 for the EQ VAS score. The SEM values (SEM%) were 0.13 (9.22%) and 4.20 (4.20%) for the EQ-5D index and EQ VAS scores, respectively (Table 3).

Table 3 Test–retest reliability of the EQ-5D (n = 31)

Validity

Pearson’s correlation coefficients between the EQ-5D and the SF-36 were stronger for comparable dimensions (e.g., −0.59 between EQ-5D pain/discomfort and SF-36 bodily pain (BP), and −0.44 between EQ-5D mobility and SF-36 physical functioning (PF)) than for less comparable dimensions (e.g., −0.26 and −0.20 between SF-36 mental health (MH) and EQ-5D mobility and self-care, respectively), with a few exceptions, providing convergent and discriminant evidence of construct validity. The EQ-5D index and EQ VAS scores had moderate or strong correlations with all SF-36 scores (all P < 0.001, Table 4).

Table 4 Correlations between the EQ-5D and SF-36

Respondents who reported poor general health and those who reported chronic diseases had significantly lower EQ-5D index and EQ VAS scores (Table 5). The discrepancy between EQ-5D index scores based on the UK and the Japanese preference weights was much smaller for better health states than for worse health states. Older people, females, people widowed or divorced, and those with a lower socioeconomic status reported poorer HRQoL, as expected (Table 6). Respondents reporting no problems on any EQ-5D dimension had better SF-36 summary scores than those reporting problems (all P < 0.001, data not shown).

Table 5 Comparison of EQ-5D index and EQ VAS scores for subgroups of respondents with differing health status
Table 6 Responses to the EQ-5D by sociodemographic variables (n = 1,747)

Discussion

The study assessed the reliability and validity of the Chinese EQ-5D in a large urban population in China. Unlike most EQ-5D studies in general populations [21, 25, 26], our sample included adolescents aged 14 to 18 years. A Chinese EQ-5D youth version is not available; applying the EQ-5D to adolescents is therefore suitable, primarily because it allows follow-up and comparisons over a wide range of ages.

Construct validity of the EQ-5D was demonstrated using convergent, discriminant, and known-groups analyses. The EQ-5D showed fair to moderate levels of test–retest reliability, with a high percentage of respondents reporting the same level of problems on each dimension and a satisfactory ICC for the EQ VAS score. However, the examination of reliability was compromised by high ceiling effects. Reliability coefficients reflect not only the degree of agreement between repeated measures but also the degree to which a measurement instrument can differentiate among individuals [17, 27]. In a homogeneous population, the within-subject variance can easily overwhelm the between-subject variance, resulting in low reliability [28]. The SEM is relatively sample-independent and useful in the interpretation of HRQoL change [29, 30]. A higher SEM value for the EQ VAS score after 1 month was reported recently [31]. When applying Japanese preference weights [19], the ICC, SEM value, and SEM% for the EQ-5D index score were 0.64, 0.09, and 8.11%, respectively.
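The dependence of reliability coefficients on sample heterogeneity can be illustrated with a minimal sketch. It assumes the simplified definition of the ICC as between-subject variance divided by between-subject plus error variance, ignoring any systematic difference between occasions. The between-subject standard deviations are hypothetical; only the error term (0.13) is taken from the SEM reported for the EQ-5D index.

```python
def icc_from_variances(between_var: float, error_var: float) -> float:
    """Simplified ICC: share of total variance that lies between subjects."""
    return between_var / (between_var + error_var)


SEM_INDEX = 0.13             # SEM reported for the EQ-5D index score
error_var = SEM_INDEX ** 2   # measurement error held constant

# Hypothetical between-subject SDs, from a homogeneous to a heterogeneous sample.
for between_sd in (0.10, 0.14, 0.30):
    icc = icc_from_variances(between_sd ** 2, error_var)
    print(f"between-subject SD {between_sd:.2f}: ICC = {icc:.2f}")
```

With the same measurement error, the ICC rises from about 0.37 to about 0.84 as the sample becomes more heterogeneous, which is why a low ICC in a sample with strong ceiling effects does not by itself imply large measurement error.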

Several studies have used the EQ-5D in the Chinese general population. Wang et al. [2] collected EQ-5D data from 2,994 respondents in one district of Beijing. Recently, Sun et al. [3] analyzed national EQ-5D data and provided norms for the Chinese general population. The reliability of the EQ-5D was not measured in these two studies. Chang et al. [32] reported validation results in a representative sample of the Taiwanese population aged 20–64 years. Similar ICCs were reported (0.51 for the EQ-5D index score and 0.70 for the EQ VAS score), even though people older than 65 years were not recruited.

This study had limitations. First, although only a small number of people did not respond because of refusal or inaccessibility after three visits, no data were available for them, and it is unclear whether the characteristics of non-respondents differed from those of respondents. Second, although the estimated response rate was high, there might be selection or response bias [33]. Third, the retest sample was small for the assessment of reliability. Fourth, our sample was more representative of an older and more educated general population.

We conclude that the Chinese EQ-5D demonstrated acceptable construct validity and fair to moderate levels of test–retest reliability in an urban general population in China.