Introduction

The EQ-5D is an instrument used to measure health outcomes and health-related quality of life (HRQoL) in cost-effectiveness analysis [1–4]. The original version of EQ-5D, EQ-5D-3L (3L), has been used successfully in many countries. Its validity and psychometric properties have been explored and demonstrated in general populations and in patients with various diseases, including hepatitis B [5–10].

Although the EQ-5D-3L benefits from a condensed format, studies have shown that it lacks descriptive richness and suffers from serious ceiling effects, especially in mild disease states [11–13]. To reduce these problems, a new version (EQ-5D-5L, 5L) was developed in 2009. According to studies utilizing the 5L version, sensitivity and discriminatory power were improved, and ceiling effects were reduced [14–18]. Until now, there has been little use of 5L in a Chinese population, nor has there been a comparison between the two systems in such a population.

We conducted a cross-sectional study in patients with hepatitis B to compare the psychometric properties of 5L and 3L in a Chinese population and evaluated acceptability, face validity, redistribution properties, convergent validity, known-group validity, discriminatory power, ceiling effect, test–retest reliability, and responsiveness. We report the findings of our study.

Methods

Subjects and settings

We chose Jinan Hospital for Infectious Diseases and Shandong Tumor Hospital, the largest hospitals for hepatitis B infections and for tumors in Shandong Province, respectively, to recruit patients in August 2013. All inpatients and outpatients with hepatitis B, including carriers and patients with active or inactive chronic hepatitis B (CHB), compensated cirrhosis (CC), decompensated cirrhosis (DC), or hepatocellular carcinoma (HCC), were invited to take part in the survey using consecutive sampling. We attempted to recruit all patients registering sequentially in either of the two hospitals who were conscious and not in severe pain. For each infectious condition, at least 100 patients were recruited. Patients who were below 18 years old, who were not capable of reading or speaking Chinese, or who had mental disorders or dementia were excluded. Infectious conditions were confirmed by case histories, and the medical history of other diseases was recorded.

Instruments

The EQ-5D is composed of a descriptive system and a visual analog scale (VAS). The descriptive system consists of five dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. For each dimension, there are three levels in the 3L version and five levels in the 5L version. The VAS is a thermometer-like scale rated from 0 (the worst imaginable health) to 100 (the best imaginable health). Subjects were asked to choose one level in each dimension and mark the scale according to their current health states. HRQoL was derived from the descriptive systems using value sets from the general population. Many countries have estimated their own 3L value sets by time trade-off or VAS methods, but no value set had been estimated for a Chinese population. Since the 5L version was relatively new, no 5L value set was available [1, 2].

In the 3L version, the three levels in each dimension were defined as ‘no problems,’ ‘some problems,’ and ‘unable to/extreme problems.’ Since a subject chooses one level in each of the five dimensions, the combination of the five levels defines a unique health state, so the 3L descriptive system can define up to 243 (3^5) unique health states, ranging from 11111 (best health) to 33333 (worst health). In the 5L version, the levels in each dimension have been expanded to five: ‘no problems,’ ‘slight problems,’ ‘moderate problems,’ ‘severe problems,’ and ‘unable to/extreme problems,’ so 3,125 (5^5) unique health states can be defined, ranging from 11111 (best health) to 55555 (worst health) [1, 2].
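The counts of health states quoted above can be checked by enumeration. The following is a minimal Python sketch for illustration only; the function name `health_states` is hypothetical and is not part of the EQ-5D materials or of the analysis reported here.

```python
from itertools import product

# The five EQ-5D dimensions, in the order used to write a health-state label.
DIMENSIONS = ("mobility", "self-care", "usual activities",
              "pain/discomfort", "anxiety/depression")


def health_states(levels_per_dimension):
    """Return every health-state label (e.g. '11111') for a given number of levels."""
    levels = range(1, levels_per_dimension + 1)
    return ["".join(str(level) for level in combo)
            for combo in product(levels, repeat=len(DIMENSIONS))]


states_3l = health_states(3)
states_5l = health_states(5)
print(len(states_3l), states_3l[0], states_3l[-1])  # 243 11111 33333
print(len(states_5l), states_5l[0], states_5l[-1])  # 3125 11111 55555
```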

We used the simplified Chinese versions of both the 3L and 5L [2]. Each consenting patient received a questionnaire composed of 5L, socio-demographic information, 3L, and the VAS, in that order. Consistent with the literature, we maintained this sequential order to reduce bias. For example, if respondents scored 3L first, there could be a tendency not to use the second and fourth levels in 5L [16]. The socio-demographic section was placed between the two systems to reduce memory effects.

Since outpatients were difficult to follow up, we conducted a retest with inpatients 1 week after the first survey to measure the test–retest reliability and responsiveness of the two systems. We asked doctors to judge the progression of each patient’s health state, based on symptoms and laboratory tests, as improved, stable, or exacerbated. Since the symptoms and laboratory tests of patients with hepatitis B were complex, doctors were given autonomy to make these judgments.

Analysis

HRQoL

To obtain HRQoL for 3L, we used the value set from a Japanese cohort that was estimated using a time trade-off method on 543 members of the general public [19]. Since no 5L value sets were available, we followed the EuroQol Group’s recommendation to obtain HRQoL through the interim scoring, which maps the 5L descriptive system to 3L [20]. Because this mapping is independent of the value set that is used, it can be applied to transform any 3L value set into 5L index values. Mean and standard deviation (SD) were used to describe the central tendency and dispersion of HRQoL.

Acceptability and face validity

These two constructs were based on direct choices made by each patient in a face-to-face interview [16]. Acceptability was evaluated by two questions: ‘In your opinion, which one of the two instruments is easier to answer?’ and ‘Did you have any problems in any of the two systems?’ Face validity was evaluated by another question: ‘In your opinion, which one of the two systems expresses your ideas better?’

Redistribution properties

Redistribution properties were defined as proportions of 3L–5L response pairs in each dimension from the same patient [14, 21]. We used 3L-1, 3L-2, and 3L-3 to represent the subgroups in which patients chose level 1, level 2, and level 3, respectively, in 3L in each dimension. Each 3L–5L response pair was described by P(x, y) according to the levels chosen in both 3L and 5L, where x indicated the 3L level and y indicated the 5L level. In each subgroup of paired responses, the mean of VAS was calculated, except for inconsistent pairs. P(1, 1), P(1, 2), P(2, 2), P(2, 3), P(2, 4), P(3, 4), P(3, 5) were defined as consistent while the others were considered inconsistent.

In each dimension, P(1, 1) represented the best health states while P(3, 5) represented the worst. We hypothesized that the VAS should increase as pairs go from P(3, 5) to P(1, 1). The proportions of P(1, 2) and P(2, 2) could provide evidence for the validity of an extra level between level 1 and level 2 in the 3L system; likewise, the proportions of P(2, 4) and P(3, 4) could support an extra level between level 2 and level 3.
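The classification of response pairs described above can be sketched in a few lines of Python. This is an illustration rather than the code used in the study: the function name `redistribution` and the example data are hypothetical, and the sketch assumes that proportions are computed among consistent pairs within each 3L subgroup.

```python
from collections import Counter

# Consistent 3L-5L pairs as listed in the text; every other pair is inconsistent.
CONSISTENT_PAIRS = {(1, 1), (1, 2), (2, 2), (2, 3), (2, 4), (3, 4), (3, 5)}


def redistribution(pairs):
    """Return ({(x, y): proportion within 3L subgroup x}, number of inconsistent pairs).

    Assumption: the denominator of each proportion is the number of
    consistent pairs in the same 3L subgroup.
    """
    consistent = [pair for pair in pairs if pair in CONSISTENT_PAIRS]
    n_inconsistent = len(pairs) - len(consistent)
    subgroup_sizes = Counter(x for x, _ in consistent)
    counts = Counter(consistent)
    proportions = {pair: n / subgroup_sizes[pair[0]] for pair, n in counts.items()}
    return proportions, n_inconsistent


# Hypothetical (3L level, 5L level) responses for one dimension.
example_pairs = ([(1, 1)] * 6 + [(1, 2)] * 2 + [(2, 2)] * 3
                 + [(2, 3)] + [(3, 5)] + [(2, 1)])
proportions, n_inconsistent = redistribution(example_pairs)
print(proportions[(1, 1)], proportions[(1, 2)], n_inconsistent)  # 0.75 0.25 1
```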

Convergent validity

We compared the correlation between 3L and the VAS with that between 5L and the VAS. The levels of each dimension were treated as ordinal variables, and Spearman’s rank correlation coefficient (r_s) was computed. Strength of correlation was interpreted as follows: absent (r_s < 0.20), weak (0.20 ≤ r_s < 0.35), moderate (0.35 ≤ r_s < 0.50), and strong (r_s ≥ 0.50) [14]. We hypothesized a stronger correlation in 5L. Cohen’s t test was used to compare r_s between 3L and 5L [22].
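As an illustration of this step, the sketch below computes Spearman’s coefficient as the Pearson correlation of average ranks and applies the strength categories quoted above. It is not the R code used for the analysis; the function names and the example data are hypothetical. Because higher levels indicate worse health, the coefficient against the VAS is expected to be negative, so the sketch assumes the categories apply to its absolute value.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def strength(r_s):
    """Map a coefficient to the categories in the text (assumption: by absolute value)."""
    size = abs(r_s)
    if size < 0.20:
        return "absent"
    if size < 0.35:
        return "weak"
    if size < 0.50:
        return "moderate"
    return "strong"


# Hypothetical data: 5L mobility level and VAS score for seven patients.
levels = [1, 1, 2, 2, 3, 4, 5]
vas = [90, 85, 70, 75, 50, 40, 20]
r_s = spearman(levels, vas)
print(round(r_s, 3), strength(r_s))
```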

Known-group validity

Known-group validity is a direct and convenient assessment of both construct validity and sensitivity [23]. HRQoL, rather than the level in a single dimension, was used to make the comparison. Tests for age and education were performed using r_s, and education was recorded as three levels (1 = primary or lower, 2 = middle school, 3 = college or higher). Gender, infectious conditions, and presence of comorbidities were assessed using the Kruskal–Wallis H statistic. According to previous studies [24, 25], we expected patients who were female, young, highly educated, or without comorbidity to have higher HRQoL. Patients with mild symptoms were also hypothesized to have higher HRQoL.

Discriminatory power

We used Shannon’s index (H’) and Shannon’s evenness index (J’) to compare the discriminatory power of 3L and 5L. Both indices are measures of the discriminatory power of a health measurement scale [26]. In a dimension where a certain level has a very low or very high endorsement, e.g., less than 0.05 or more than 0.95, little information is gained because the response can be predicted with more than 95 % certainty. Originating from the field of information theory, Shannon’s index and Shannon’s evenness index have already been used successfully to assess many health state classifications [14, 16, 27, 28]. Shannon’s index was calculated as follows:

$$ H^{'} = - \mathop \sum \limits_{i = 1}^{L} p_{i} \log_{2} p_{i} $$

where L represented the number of levels in one dimension, and p_i represented the proportion of patients who chose the ith level. The maximum of H’ was denoted H’max, which amounted to log_2 C, where C (equal to L in the formula above) represented the number of levels in each dimension. For the 3L and 5L systems, H’max amounted to 1.58 and 2.32, respectively. Shannon’s evenness index was calculated as follows:

$$ J^{'} = \frac{{H^{'} }}{{H^{'}_{\hbox{max} } }} $$

Although H’ increased when C increased, J’ would increase only if the new levels added were actually used. H’ was an absolute measure, whereas J’ combined both the number and the efficiency of the levels that were used. If answers from patients were evenly distributed across all levels, J’ would reach its maximum of 1; if there were redundant levels, J’ would decrease. We hypothesized that H’ would increase and J’ would not decrease significantly in the 5L system compared with the 3L system.
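The two indices can be computed directly from the level proportions in one dimension. The Python sketch below is illustrative only; the function names and the example proportions are hypothetical and are not taken from the study data. Levels that nobody chose are assumed to contribute zero to H’.

```python
from math import log2


def shannon_index(proportions):
    """H' = -sum(p_i * log2(p_i)); levels with p_i = 0 contribute nothing."""
    return -sum(p * log2(p) for p in proportions if p > 0)


def evenness_index(proportions):
    """J' = H' / H'max, where H'max = log2(number of levels)."""
    return shannon_index(proportions) / log2(len(proportions))


# H'max for a 3-level and a 5-level dimension, as quoted in the text.
print(round(log2(3), 2), round(log2(5), 2))  # 1.58 2.32

# Hypothetical proportions for one dimension: the 5L version splits the
# second 3L level into two used levels but leaves two levels unused.
p_3l = [0.6, 0.4, 0.0]
p_5l = [0.6, 0.2, 0.2, 0.0, 0.0]
h_3l, h_5l = shannon_index(p_3l), shannon_index(p_5l)
j_3l, j_5l = evenness_index(p_3l), evenness_index(p_5l)
print(h_5l > h_3l, j_5l < j_3l)  # True True
```

In this made-up example H’ rises in 5L because more levels are in use, while J’ falls slightly because two of the five levels remain empty, which is the pattern the hypothesis above is designed to detect.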

Ceiling effect

Ceiling effect was assessed as the proportion of subjects reporting no problem in each dimension. With the decrease in ceiling effect, the sensitivity should increase. We expected a decrease in the ceiling effect in 5L compared with 3L.
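A minimal Python sketch of this calculation is given below for illustration. The function name `ceiling_effect` and the example profiles are hypothetical; each profile is assumed to be a five-character string with one digit per dimension in the usual EQ-5D order.

```python
DIMENSIONS = ("mobility", "self-care", "usual activities",
              "pain/discomfort", "anxiety/depression")


def ceiling_effect(profiles):
    """Return (share at level 1 per dimension, share reporting 11111)."""
    n = len(profiles)
    per_dimension = {
        name: sum(profile[k] == "1" for profile in profiles) / n
        for k, name in enumerate(DIMENSIONS)
    }
    full_health = sum(profile == "11111" for profile in profiles) / n
    return per_dimension, full_health


# Hypothetical 5L profiles for four patients.
profiles_5l = ["11111", "11121", "21111", "11112"]
per_dimension, full_health = ceiling_effect(profiles_5l)
print(per_dimension["mobility"], per_dimension["self-care"], full_health)  # 0.75 1.0 0.25
```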

Test–retest reliability

For patients whose health states were judged stable by doctors 7 days after the first survey, weighted kappa was used to evaluate the test–retest reliability. We applied Fleiss’s standard for the strength of agreement for kappa values, as follows: poor (κ < 0.4), fair to good (0.4 ≤ κ < 0.75), and excellent (κ ≥ 0.75) [14]. In addition, the two-way random intraclass correlation coefficient (ICC) was used to compare HRQoL before and after the 7-day interval [23].
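The sketch below illustrates how a weighted kappa for one dimension can be computed and labelled with Fleiss’s categories. It is not the code used in the study. The weighting scheme is not stated in the text, so linear disagreement weights are an assumption here, and the function names and example responses are hypothetical.

```python
def weighted_kappa(test, retest, n_levels):
    """Linear-weighted kappa for two sets of ordinal ratings coded 1..n_levels.

    Assumption: disagreement weights are |i - j| / (n_levels - 1).
    Undefined (division by zero) if every rating falls in a single level.
    """
    n = len(test)
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for a, b in zip(test, retest):
        observed[a - 1][b - 1] += 1 / n
    row = [sum(observed[i]) for i in range(n_levels)]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    disagree_obs = 0.0
    disagree_exp = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = abs(i - j) / (n_levels - 1)
            disagree_obs += weight * observed[i][j]
            disagree_exp += weight * row[i] * col[j]
    return 1 - disagree_obs / disagree_exp


def fleiss_label(kappa):
    """Strength of agreement according to the cut-offs quoted in the text."""
    if kappa < 0.4:
        return "poor"
    if kappa < 0.75:
        return "fair to good"
    return "excellent"


# Hypothetical test and retest levels in one 5L dimension for eight stable patients.
first = [1, 1, 2, 2, 3, 3, 4, 5]
second = [1, 2, 2, 2, 3, 4, 4, 5]
kappa = weighted_kappa(first, second, n_levels=5)
print(round(kappa, 3), fleiss_label(kappa))
```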

Responsiveness

For patients whose doctors reported improved health states 7 days after the first survey, we assumed an increase in HRQoL. The Wilcoxon signed-rank test was used to compare HRQoL before and after the 7-day interval within each system and to compare the changes between the two systems. There was an insufficient number of patients with deteriorated health states for statistical analysis.

All statistical analyses were performed in R version 2.15.2, and differences were considered statistically significant for P values less than 0.05.

Results

Subjects

A total of 1,063 outpatients and 445 inpatients were invited to take part in the survey; the response rate was 34.7 and 62.0 %, respectively. Eventually 369 outpatients and 276 inpatients were included and answered the questionnaires. Seven days later, inpatient participants were invited to retake the survey, and 183 answered the same questionnaires. Judgments of their disease progressions were received from doctors (Table 1).

Table 1 Distribution of subjects by age, gender, resource, and disease progression in the first interview and review

HRQoL and the VAS

In 5L, HBV carriers had the highest mean HRQoL and DC patients the lowest (0.813 vs. 0.661). In 3L, HBV carriers and inactive CHB patients had the highest mean HRQoL (0.817), while DC patients had the lowest (0.665). For the VAS, HBV carriers likewise had the highest mean (88.0), while DC patients had the lowest (58.4) (Table 2).

Table 2 Mean and SD of HRQoL and VAS in the first interview

Acceptability and face validity

A total of 63.6 % of patients deemed 5L easier to answer than 3L; 13.5 % chose 3L; 20.3 % indicated there was no apparent difference between them; and 2.6 % did not answer the question. A total of 7.2 % of patients reported some problems in filling in 3L, while 2.5 % complained that 5L was tedious. A total of 68.2 % of patients claimed 5L expressed their true thoughts better; 8.2 % chose 3L; 18.2 % indicated there was no apparent difference; and 5.4 % did not answer the question.

Redistribution properties

A total of 77 (2.4 %) pairs were inconsistent. Usual activities had the most inconsistent pairs (n = 36, 5.6 %) while self-care had the least (n = 5, 0.8 %).

Table 3 shows the mean VAS of the 3L–5L pairs. For most pairs, the VAS decreased from P(1, 1) to P(3, 5). In self-care, the VAS of P(2, 4) equaled the VAS of P(2, 2), while that of P(3, 5) was higher than that of the whole 3L-2 subgroup. In usual activities, the VAS of P(3, 5) was higher than that of P(3, 4). In anxiety/depression, the VAS of P(3, 4) was higher than the VAS of P(2, 4).

Table 3 Redistribution properties from 3L to 5L: consistent responses

In all dimensions of the 3L-1 subgroup, the proportion of P(1, 1), which ranged from 67.0 % (anxiety/depression) to 98.2 % (self-care), was higher than that of P(1, 2). In the 3L-2 subgroup, P(2, 2), with proportions ranging from 69.0 % (mobility) to 74.2 % (anxiety/depression), dominated over P(2, 3) and P(2, 4) except in usual activities, where P(2, 3) had the largest proportion of 44.6 %. P(2, 4) had the smallest proportion in all dimensions. The most even distribution in the 3L-2 subgroup was in usual activities (67/74/25). Few pairs were observed in the 3L-3 subgroup. In mobility, pain/discomfort, and anxiety/depression, P(3, 5) had a larger proportion than P(3, 4), while for self-care and usual activities, P(3, 4) dominated the subgroup (Table 3).

Convergent validity

Table 4 shows the r_s between each system (5L and 3L) and the VAS in each dimension. The correlations of mobility and usual activities with the VAS were strong in both 3L and 5L, while the others were moderate. The r_s between 5L and the VAS was significantly higher than that between 3L and the VAS in all dimensions except self-care.

Table 4 Correlation coefficient between 3L, 5L and VAS

Known-group validity

The r_s between age and HRQoL was −0.367 in 5L and −0.320 in 3L. This indicates a negative correlation between age and HRQoL, as expected, but the difference in r_s between 3L and 5L was not statistically significant (t = −0.928, P = 0.354). The r_s between education and HRQoL was 0.223 in 5L and 0.199 in 3L, in the expected direction; however, the difference in r_s between 3L and 5L was also not statistically significant (t = 1.75, P = 0.081).

There was no significant difference in HRQoL between male and female patients in either 5L or 3L (P = 0.278 and 0.290, respectively). Patients with other diseases showed lower HRQoL in both 5L and 3L (both P < 0.001). In 5L, there were significant differences in HRQoL between infectious conditions except for two pairs: inactive CHB and HBV carriers (P = 0.546), and active CHB and CC (P = 0.355). In 3L, three pairs showed no significant difference in HRQoL: inactive CHB and HBV carriers (P = 0.673), active CHB and CC (P = 0.168), and DC and HCC (P = 0.245).

Discriminatory power

In all dimensions, H’ from 5L was larger than that of 3L. The 5L J’ value was slightly larger than 3L J’ in mobility, usual activities, and pain/discomfort, but in self-care and anxiety/depression, J’ from 3L was larger than J’ from 5L (Table 5).

Table 5 Shannon’s index and evenness index

Ceiling effect

There were relatively severe ceiling effects in both systems in all dimensions. Self-care showed the highest percentage (90.7 and 92.1 % in 5L and 3L, respectively), while anxiety/depression showed the lowest (42.6 and 62.3 % in 5L and 3L, respectively). The proportion of patients who chose the first level of a dimension was lower in 5L than in 3L: anxiety/depression showed the largest reduction (19.7 %), followed by pain/discomfort (17.7 %), mobility (7.3 %), usual activities (6.8 %), and self-care (1.4 %). The proportion of patients who chose level 1 in all dimensions decreased by 16.7 % to 21.6 % in 5L (Table 6). All reductions were statistically significant.

Table 6 Distribution of patients who chose the first level in each dimension

Test–retest reliability

The κ of 5L was higher than that of 3L in mobility, usual activities, and anxiety/depression, but the differences were not statistically significant (Table 7). In CC patients, the ICC of 3L was higher than that of 5L. In DC and HCC patients, the ICC of 3L was lower than that of 5L, but none of these differences were statistically significant. Across all patients with stable health states, the ICC of 5L was significantly higher than that of 3L (Table 8). There were only 2 active CHB patients with stable health states, so ICC was not computed for this group.

Table 7 Kappa in each dimension in patients with stable health states
Table 8 ICC in patients with stable health states

Responsiveness

In patients with improved health states, HRQoL increased in all infectious conditions for both 3L and 5L. In 5L, HRQoL increased between 0.029 (active CHB and CC) and 0.073 (HCC), while in 3L, HRQoL increased between 0.025 (CC) and 0.076 (HCC) (Table 9). There was no significant difference in the increases in HRQoL between 3L and 5L.

Table 9 Results of Wilcoxon signed-rank order tests for patients with improved health states

Discussion

The purpose of our study was to compare 5L and 3L in terms of validity, sensitivity, reliability, and responsiveness. We believe this is the first such comparison of the two systems and the first use of the new 5L version of EQ-5D in China.

Although 3L had a more condensed format, most patients found 5L easier to answer and more useful in expressing their true opinion, consistent with the findings of another study [16]. The 3L–5L redistribution results support this preference, since the extra levels were actually used. Since very few people deemed the 5L version to be tedious, the added levels did not appear to be an extra burden for subjects.

Similar to other studies [14, 15, 18], our redistribution data, especially in the 3L-1 and 3L-2 subgroups, were skewed. In the 3L-1 subgroup, most patients chose level 1 in 5L, since patients with no symptoms in a dimension reported none regardless of which system they used. In the 3L-2 subgroup, most patients chose level 2 in 5L in all dimensions except usual activities. However, in other studies [16, 21], the distribution was more even. This may be due to the characteristics of the disease: most patients with chronic HBV infection tended to perceive their condition as less severe, and as a result, very few chose P(2, 4) or fell in the 3L-3 subgroup. P(2, 4) and P(3, 4) accounted for only a small proportion of pairs, giving little evidence for an extra level between level 2 and level 3 in 3L. On the other hand, P(1, 2) and P(2, 2) accounted for 6.5 % (self-care) to 45.7 % (anxiety/depression) of all pairs in each dimension, providing strong evidence for an extra level between level 1 and level 2 in 3L.

The inconsistent pairs constituted a small proportion of the responses. This proportion is similar to that found in another study [16] and even lower than in two others [14, 15], indicating that patients could use the two systems consistently. The trend of the VAS increasing as the pairs moved from P(3, 5) to P(1, 1) supports the validity of the 3L–5L extension. There was deviation from this trend in some severe subgroups: P(2, 4) in self-care, P(3, 5) in usual activities, and P(3, 5) in anxiety/depression. This may be because in the VAS, 0 was described as ‘worst imaginable health state,’ but in each dimension, there was no word such as ‘worst.’ This may have led patients to rate in different ways.

The correlation between 5L and the VAS was stronger than that between 3L and the VAS, indicating better convergent validity. Both systems showed good known-group validity except for gender, possibly because men were much more prevalent than women in our survey and symptoms could mask the effects of gender. The 5L version was able to separate more infectious conditions than the 3L version, indicating better validity and sensitivity.

Shannon’s index increased from 3L to 5L, but no apparent improvement in Shannon’s evenness index was found in 5L, indicating that the added levels were not fully used. This might be attributed to the skewed redistribution of the 3L–5L response pairs, resulting in the inadequate use of some levels. The ceiling effect decreased in all dimensions in 5L, as shown by the proportion of P(1, 2), which ranged from 1.8 % (self-care) to 33.0 % (anxiety/depression) in the 3L-1 subgroup.

In patients with stable health states, ICC was higher in 5L than in 3L, but the improvement in κ was not significant. While ICC was assessed using HRQoL, which summarizes all five dimensions, κ was obtained from each dimension separately. As a result, κ was more sensitive to changes in a single dimension, while what ICC measured was more comprehensive. This result indicated that although the stability of each dimension was not improved in 5L, the reliability of the whole descriptive system was much improved.

For patients with improved health states, both systems could detect increases in HRQoL, indicating good responsiveness. We did not detect any significant difference between the systems with our sample size, and therefore, we could not determine which system was more responsive.

In our study, we compared the two versions of EQ-5D across a wide disease spectrum of hepatitis B infection. Since the conditions varied greatly, from no symptoms to being confined to bed, we could observe the two versions from different perspectives. Most studies have evaluated the 5L version to be better than the 3L version [14–17, 21, 28]. We did observe improvement in sensitivity and convergent validity, and we acquired evidence for the added levels from the redistribution properties; however, our evidence was not as clear as in previous studies. This may be because our limited sample size could not detect small improvements. Additionally, during the field survey, some patients reported confusion about the dimensions and levels. One example was the third dimension, usual activities, which was described as being able to work, study, do housework, and enjoy entertainment. Some patients with early HCC were forbidden to work so that they could rest but were still able to do housework, which left them unsure how to answer. Patients who needed regular infusions may have chosen level 1 in mobility, since they could walk without help when there was no infusion, but they could also have chosen level 5, since they could not move during an infusion. At least 13 patients suggested the levels should be described more concretely, rather than just by changing adjectives. These problems arose from different interpretations of the descriptive systems. In the future, studies with larger sample sizes and more types of patients may be warranted, while some revisions to EQ-5D may be needed for better applicability in a Chinese population.

There were several limitations in our study. First, there was no value set for 5L or 3L for any Chinese population. We used a Japanese value set, which may have introduced bias. Second, some patients who were highly anxious or depressed refused to participate in the survey, and patients who were seriously ill, for example, those who were in a coma or felt grave pain or discomfort at the time of the survey, were excluded. Third, judgments of patients’ progression were made by doctors, which may have involved subjective factors. Fourth, the interval between the first survey and the retest was determined by the mean length of inpatients’ hospital stay. This interval may have been short enough to allow carryover effects, and some residual memory could have influenced the retest. In addition, although the socio-demographic section was placed between the two systems to reduce memory effects, this could not eliminate carryover effects entirely.

In conclusion, 5L was more suitable than 3L for use on patients with hepatitis B in China. Its acceptability, convergent validity, and reliability are improved while its ceiling effects are reduced compared with 3L. Evidence for the levels added in 5L was obtained from redistribution properties. We recommend the 5L version for future studies.