Introduction

The EQ-5D has been used to measure the health of both patients and general populations [1,2,3,4,5,6,7,8]. Because of its simplicity, its self-completion with low cost burden, and its capacity to generate a preference-weighted index score, known as a utility score, the EQ-5D is commonly used to assess humanistic outcomes for economic appraisal, as recommended by several HTA guidelines, including the Thai guidelines [9,10,11].

The first version of the EQ-5D, the EQ-5D-3L (3L), was introduced in 1990 and has now been translated into more than 170 languages [2]. Its descriptive system has five dimensions, each with three response options: no problem, some/moderate problem, and extreme problem [4]. Nevertheless, previous evidence has revealed some drawbacks of the 3L, including a high ceiling effect, limited discriminative power, and lower sensitivity to clinical changes in both general populations and clinical areas when compared to the SF-6D, SF-12, and SF-36 [12,13,14,15,16]. To solve these problems while preserving clinical relevance to a wide range of health conditions and populations, a newer version of the EQ-5D, the EQ-5D-5L (5L), was developed and introduced by the EuroQol group in 2015 [1]. The 5L version includes two additional response options, “slight problem” and “severe problem,” for each of the five dimensions. As a result, the EQ-5D-5L has five response options: no problem, slight problem, moderate problem, severe problem, and extreme/unable to perform [1]. This version is expected to diminish the ceiling effect and improve discriminative power in both general populations and clinical areas. Moreover, this version has now been translated into more than 113 languages, including Thai.

To date, several studies examining the 5L’s psychometric properties have found it to be a valid and reliable instrument. Compared to the 3L in both clinical areas and general populations [1, 7, 17,18,19], it has a smaller ceiling effect and greater discriminative power for clinical changes. Previous evidence has also suggested that the 5L might capture more severe health problems in the patient population, and that it might differentiate mild health states, particularly in the pain/discomfort and anxiety/depression dimensions, in the general population [20, 21].

In Thailand, evidence for the 5L’s psychometric properties is limited. To our knowledge, only two studies have explored the 5L’s measurement properties, one in patients with diabetes [22] and one in patients with a wide range of chronic diseases [23]. These studies revealed that the 5L was a valid and reliable instrument, with a smaller ceiling effect than the 3L in the patient group. Nevertheless, evidence of the 5L’s psychometric properties when administered to the general Thai population has not yet been established. Therefore, this study aimed to assess the 5L’s psychometric properties in comparison with the 3L in terms of practicality (administration time and ceiling effect), discriminatory power, response redistribution, test–retest reliability, validity (known-groups and construct validities), and acceptability in the general Thai population.

Methods

Participants and settings

A cross-sectional survey study was conducted with study participants (n = 1200) randomly selected from five provinces: Nakhon-Srithammarat, Khon-Kaen, Chonburi, Chiang-Mai, and Bangkok (the capital city). Inclusion criteria were (1) age 20–70 years and (2) understanding of the Thai language and the data collection process, as evaluated by the interviewers or the researcher (KK). Exclusion criteria were (1) being diagnosed with acute or life-threatening illness, (2) having cognitive impairment, or (3) having a disability. Four-stage stratified random sampling was employed to select the provinces, districts, and villages for data collection.

Data collection

Each subject completed the self-administered questionnaire using paper and pencil in the following order: general subject information, EQ-5D-5L, the Short Form 12 health survey version 2 (SF-12v2), WHOQOL-BREF, EQ-5D-3L, EQ-VAS, and two acceptability questions: (1) ease of understanding of the EQ-5D and (2) better reflection of health status. Moreover, our interviewers were assigned to stay with all subjects to record the time taken for each part of the questionnaire. Permission to use the Thai versions of these instruments was granted by the appropriate officials. All subjects received approximately 3.20 USD (1 USD = 31.19 THB) as compensation for their time. The majority of subjects (95%) completed the questionnaire by themselves; however, for those who had an eyesight problem, our interviewers read all questions and response options aloud without elaborating on or interpreting them.

Four hundred subjects were asked to complete both EQ-5D versions at their homes 2–3 weeks after their first interview and to send them back to the researcher (KK) in a prepaid mailing envelope. Subjects were asked to assess whether their individual health status had changed since their first interview, using a five-point Likert scale: (1) much better, (2) somewhat better, (3) the same, (4) somewhat worse, and (5) much worse. The researcher (KK) made reminder phone calls 2 weeks after the initial assessments. Any questionnaire reaching the researcher (KK) after 21 days was excluded from this analysis.

Instruments

EQ-5D

The EQ-5D is a brief, self-report questionnaire measuring respondents’ general health. Respondents were required to rate their health on the day of the questionnaire’s administration. The EQ-5D comprises two parts. The first part is the EQ-5D descriptive system, consisting of five dimensions: mobility (MO), self-care (SC), usual activities (UA), pain/discomfort (PD), and anxiety/depression (AD). This part is generally used to calculate the EQ-5D index using a country-specific value set for economic analyses. At present, both the 3L and 5L Thai versions have individual value sets for calculating the EQ-5D index [24, 25], which generally ranges from 0 to 1, where 1.00 represents perfect health and 0 represents death. A negative value represents a health state worse than death. The lowest Thai index scores are −0.454 and −0.283 for the health states “33333” of the 3L and “55555” of the 5L, respectively, while the maximum Thai index score for both versions is 1.00. The second part is the EQ-VAS, a 20-cm vertical visual analog scale on which respondents rate their current health, with endpoints labeled “worst imaginable health state” at 0 and “best imaginable health state” at 100 [2]. EQ-VAS scores range from 0 to 1 and are obtained by dividing the number marked on the scale by 100.

WHOQOL-BREF

The WHOQOL-BREF is a shorter version of the WHOQOL-100, developed by the World Health Organization (WHO) using data collected from 15 countries including Thailand [26]. This instrument requires respondents to rate their HRQoL levels during the past 2 weeks. The WHOQOL-BREF contains 24 items grouped into four dimensions as follows: physical (7 items), psychological (6 items), social (3 items), and environmental (8 items). Two other items cover general health and overall quality of life. Response options are on a 5-point Likert scale: 1 = not at all, 2 = not much, 3 = moderately, 4 = a great deal, and 5 = completely. WHOQOL results are reported as raw scores for each dimension, calculated by multiplying the mean score of the items in that dimension by four, so the score can range from 4 to 20 for each of the four dimensions. It can also be converted to a transformed score ranging from 0 (the worst possible health status) to 100 (the best possible health status). The official Thai version of the WHOQOL-BREF is available [27].
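The dimension scoring described above can be illustrated with a minimal Python sketch. The function names are hypothetical, the 0–100 conversion is assumed to be a simple linear rescaling of the 4–20 range, and any reverse-scored items are assumed to have been recoded beforehand; the official scoring manual should be consulted for actual use.

```python
from statistics import mean


def whoqol_raw_score(item_scores):
    """Raw dimension score: mean of the item scores (each 1-5) times four."""
    if not item_scores or any(s < 1 or s > 5 for s in item_scores):
        raise ValueError("item scores must be non-empty and lie between 1 and 5")
    return mean(item_scores) * 4


def whoqol_transformed_score(raw_score):
    """Assumed linear rescaling of a 4-20 raw score onto a 0-100 scale."""
    return (raw_score - 4) * 100 / 16


# Hypothetical answers to the three social items (not study data).
social_items = [3, 4, 5]
raw = whoqol_raw_score(social_items)
print(raw, whoqol_transformed_score(raw))
```

Working from the mean rather than the sum keeps the 4–20 range identical across dimensions even though they contain different numbers of items.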

SF-12 version 2

The generic health profile “12-item Short Form Survey version 2” is a short version of the 36-item Short Form Survey (SF-36) for measuring health status in large surveys [28]. It has been proven to be valid and reliable in Thai patients with chronic diseases [23, 29, 30]. It consists of 12 items grouped into eight dimensions: Physical Functioning (PF: 2 items), Role limitations due to physical problems (RP: 2 items), Bodily Pain (BP: 1 item), General Health (GH: 1 item), Vitality (VT: 1 item), Social Functioning (SF: 1 item), Role limitations due to emotional problems (RE: 2 items), and Mental Health (MH: 2 items). SF-12 scores can be transformed to range from 0 (the worst possible health status) to 100 (the best possible health status) for each health dimension, and they can be converted to norm-based scores with a mean of 50 and an SD of 10. Moreover, scores of those eight dimensions can be summarized into two major scales, the Physical Component Summary (PCS) and the Mental Component Summary (MCS) [31]. In this study, we used the 4-week standard recall period of the SF-12v2.

Data analyses

Practicality

The ceiling effect was computed as the proportion of subjects reporting “no problem” (level 1) for both versions in each dimension and across all five dimensions divided by the total number of subjects. An acceptable percentage of ceiling effect was set as less than 15% [32]. We hypothesized that by adding two more levels of impairment to the 3L, the ceiling effect of the 5L would be diminished. Absolute and relative reductions of this effect from 3L to 5L were computed and reported. The average time of both EQ-5D versions’ completion was also reported and compared.
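As an illustration of the ceiling computation described above, the following Python sketch derives the overall ceiling proportion and its absolute and relative reductions. The function names and the toy health states are hypothetical and are not taken from the study data; health states are assumed to be stored as five-character strings such as "11111".

```python
def ceiling_proportion(states):
    """Share of respondents reporting level 1 on all five dimensions."""
    if not states:
        raise ValueError("at least one response is required")
    at_ceiling = sum(1 for s in states if set(s) == {"1"})
    return at_ceiling / len(states)


def dimension_ceiling(states, dim_index):
    """Share of respondents reporting level 1 on one dimension (0-based index)."""
    return sum(1 for s in states if s[dim_index] == "1") / len(states)


def ceiling_reduction(ceiling_3l, ceiling_5l):
    """Absolute and relative reduction of the ceiling from the 3L to the 5L."""
    absolute = ceiling_3l - ceiling_5l
    relative = absolute / ceiling_3l
    return absolute, relative


# Hypothetical toy data: one health-state string per respondent.
states_3l = ["11111", "11111", "11211", "11111"]
states_5l = ["11111", "11211", "11211", "11111"]

c3 = ceiling_proportion(states_3l)
c5 = ceiling_proportion(states_5l)
absolute, relative = ceiling_reduction(c3, c5)
print(f"3L ceiling {c3:.2%}, 5L ceiling {c5:.2%}")
print(f"absolute reduction {absolute:.2%}, relative reduction {relative:.2%}")
```

The relative reduction expresses the absolute reduction as a fraction of the 3L ceiling, which is why it is always at least as large as the absolute figure.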

Discriminatory power

Two indices, Shannon entropy (Shannon index, \(H^{\prime}\)) and information efficiency (Shannon evenness index, \(J^{\prime}\)), were employed to determine each dimension’s discriminatory power. The Shannon index is defined as follows:

$$H^{\prime}=-\sum_{i=1}^{C}P_{i}\log_{2}P_{i},$$

where \(H^{\prime}\) is the absolute amount of information captured, \(C\) is the number of response levels, and \(P_{i}=n_{i}/N\) is the proportion of observations at the \(i\)th level (\(i = 1, \ldots, C\)) among our study samples, where \(n_{i}\) is the number of responses at the \(i\)th level and \(N\) is the total sample size. A higher Shannon index \(H^{\prime}\) indicates that more information is captured by the instrument and that its discriminatory power is better.

The Shannon evenness index \(J^{\prime}\) expresses the evenness of the information distribution regardless of the number of response options [33]. \(J^{\prime}\) was calculated as \(H^{\prime}/H^{\prime}_{\max}\), where \(H^{\prime}_{\max}=\log_{2}C\), and its value ranges from 0 to 1, where 1 means that all response options were selected with the same frequency. We hypothesized that the 5L would have a higher \(H^{\prime}\) than the 3L, and that the \(J^{\prime}\) of the 5L would equal or be slightly lower than that of the 3L.
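The two indices can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names and toy responses; it assumes that levels are coded as integers from 1 to C and that the maximum entropy is the base-2 logarithm of C, with unused levels contributing nothing to the sum.

```python
import math
from collections import Counter


def shannon_index(responses, n_levels):
    """H' = -sum(P_i * log2(P_i)) over the response levels 1..n_levels."""
    n = len(responses)
    counts = Counter(responses)
    h = 0.0
    for level in range(1, n_levels + 1):
        p = counts.get(level, 0) / n
        if p > 0:  # levels nobody selected add nothing to H'
            h -= p * math.log2(p)
    return h


def shannon_evenness(responses, n_levels):
    """J' = H' / H'max, with H'max assumed to be log2(n_levels)."""
    return shannon_index(responses, n_levels) / math.log2(n_levels)


# Hypothetical responses on one dimension (not study data).
responses_3l = [1, 1, 1, 2, 2, 3]
responses_5l = [1, 1, 2, 3, 3, 4]
print(shannon_index(responses_3l, 3), shannon_evenness(responses_3l, 3))
print(shannon_index(responses_5l, 5), shannon_evenness(responses_5l, 5))
```

Because the evenness index divides by a maximum that grows with the number of levels, a version with more levels can show a higher Shannon index and a lower evenness index at the same time.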

Response redistribution

Response redistribution was used to determine the response consistency between the two versions. To quantify consistency, the 3L response level was recoded to the 5L response scale (the 3L5L) as follows: 1 = 1, 2 = 3, and 3 = 5 [7, 22, 34]. Inconsistency size was calculated as |3L5L − 5L| − 1, where a value of zero or less indicated consistency. The mean EQ-VAS from the 5L for each pair was also quantified to check the validity of the response redistribution. The mean VAS score of individuals remaining at the same level was hypothesized to be higher than that of those selecting more severe problems on the 5L and lower than that of those selecting milder problems on the 5L.
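The recoding and the inconsistency size described above can be sketched as follows. The function names are hypothetical; the recoding map and the formula follow the description in the text.

```python
# Recoding of 3L levels onto the 5L scale (the "3L5L" level): 1 -> 1, 2 -> 3, 3 -> 5.
RECODE_3L_TO_5L = {1: 1, 2: 3, 3: 5}


def inconsistency_size(level_3l, level_5l):
    """|3L5L - 5L| - 1 for one respondent on one dimension."""
    return abs(RECODE_3L_TO_5L[level_3l] - level_5l) - 1


def is_consistent(level_3l, level_5l):
    """A pair is consistent when its inconsistency size is zero or less."""
    return inconsistency_size(level_3l, level_5l) <= 0


# Hypothetical response pairs (3L level, 5L level), not study data.
for pair in [(1, 1), (2, 2), (2, 4), (1, 3), (3, 3)]:
    print(pair, inconsistency_size(*pair), is_consistent(*pair))
```

Under this rule a 3L response may map to its recoded 5L level or to either adjacent level without being flagged, so only shifts of two or more 5L levels count as inconsistent.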

Validity

Construct validity was evaluated in terms of convergent and discriminant validity via correlations between the five dimensions of the 3L and 5L and two other well-established HRQoL instruments, the WHOQOL-BREF and SF-12v2, using Spearman’s rank correlation coefficients (rho). Colton’s rule was used to determine the strength of correlation as follows: weak or no (r < 0.25), moderate (0.25 ≤ r < 0.50), moderate to strong (0.50 ≤ r < 0.75), and strong (r ≥ 0.75) [35]. Convergent validity represents a high correlation between dimensions of these instruments measuring similar constructs, whereas discriminant validity represents a low correlation between dimensions measuring dissimilar constructs [32]. Strong correlations were hypothesized among the following groups: MO/PF/Physical dimension, PD/BP/Physical dimension, and AD/MH/Psychological dimension. The correlation level between EQ-VAS scores was also determined and reported using Pearson’s correlation.
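For illustration, Colton’s thresholds can be written as a small Python helper. The function name is hypothetical, and applying the thresholds to the absolute value of the coefficient is an assumption made here so that negative correlations are graded by their magnitude.

```python
def colton_strength(r):
    """Classify a correlation coefficient by magnitude using Colton's thresholds."""
    magnitude = abs(r)  # assumption: the sign is ignored when grading strength
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if magnitude < 0.25:
        return "weak or no"
    if magnitude < 0.50:
        return "moderate"
    if magnitude < 0.75:
        return "moderate to strong"
    return "strong"


# Hypothetical coefficients (not study data).
for r in (-0.10, 0.30, -0.60, 0.80):
    print(r, colton_strength(r))
```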

Known-group validity was assessed by investigating differences in the utility index across participant sub-groups defined by demographic characteristics. We hypothesized that lower utility scores would be observed among women, smokers/ex-smokers, drinkers/ex-drinkers, older subjects (≥ 54 years old), and those with lower education levels (no schooling or primary school), lower incomes (< 30,000 THB or 990 USD), and higher numbers of comorbidities. Multivariable analyses were used to investigate the associations between the demographic characteristics and the 3L and 5L index scores.

Reliability

Test–retest reliability was assessed among subjects with stable health between the initial and second assessments. The reliability of the EQ-5D index and the EQ-VAS scores was examined using intraclass correlation coefficients (ICCs), while the reliability of each dimension in the EQ-5D descriptive system of both versions was assessed and compared using a weighted kappa coefficient. Rosner’s guideline was used to determine the agreement level for both ICCs and weighted kappa coefficients as follows: poor reproducibility (< 0.4), good reproducibility (0.4–0.75), and excellent reproducibility (≥ 0.75) [7, 19, 36, 37].
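The agreement bands above can be expressed as a short Python helper. The function name is hypothetical, and because the stated bands share the value 0.75, the sketch assumes that a coefficient of exactly 0.75 is graded as excellent.

```python
def rosner_reproducibility(coefficient):
    """Grade an ICC or weighted kappa using the bands stated in the text."""
    if coefficient < 0.4:
        return "poor"
    if coefficient < 0.75:  # assumption: exactly 0.75 falls in the top band
        return "good"
    return "excellent"


# Hypothetical coefficients (not study data).
for value in (0.35, 0.55, 0.80):
    print(value, rosner_reproducibility(value))
```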

Acceptability

Responses to the two acceptability questions, including ease of understanding and better reflection of health status, were summarized and reported in terms of percentages.

All statistical analyses were performed using IBM SPSS version 23, with the p value < 0.05 generally considered statistically significant.

Results

Characteristics of study subjects

Table 1 displays basic characteristics of all subjects (n = 1200). Most were female (n = 640, 53.3%), married (n = 765, 63.7%), and educated at secondary school (n = 514, 42.9%). Subjects’ mean age and household income were 42.7 years (SD = 13.7) and 12,631.50 (SD = 10,276.5) THB/month, respectively. Moreover, most of them reported as healthy (n = 844, 70.33%). There were no missing values from either EQ-5D version.

Table 1 Characteristics of study subjects (n = 1200)

Practicality

As shown in Table 2, for both the 5L and the 3L, the highest and lowest proportions of subjects rating “no problem (level 1)” were found in SC (97.3% vs 97.5%, p > 0.05) and PD (64.8% vs 57.8%, p < 0.01), respectively, and PD showed the highest relative reduction of 10.68%. Moreover, the overall ceiling effect decreased from 57.17% for the 3L to 49.08% for the 5L, a relative reduction of 14.14%. The average times for subjects to complete the 3L and the 5L were 2.08 ± 1.03 and 2.20 ± 1.04 min, respectively.

Table 2 The absolute and relative reductions of ceiling from the 3L to the 5L and descriptive statistics of both EQ-5D versions

Discriminatory power

Table 3 presents the Shannon index and the Shannon evenness index of the two EQ-5D versions. Our results revealed that the Shannon index (\(H^{\prime}\)) increased with the two additional severity levels of the 5L for all five dimensions (range 0.19–1.37). As expected, the Shannon evenness index (\(J^{\prime}\)) was lower in the 5L than in the 3L for all dimensions except MO. The relative reduction in the Shannon evenness index was smallest in PD (3.28%) and largest in SC (27.27%).

Table 3 Discriminatory power measured by Shannon index (\(H^{\prime}\)) and Shannon evenness index (\(J^{\prime}\)) for the 5L compared to the 3L (n = 1200)

Response redistribution

As shown in Table 4, most samples reporting level 1-3L remained at level 1-5L for all five dimensions (87.3–99.6%). Of the study samples answering level 2-3L, between 57.8% (MO) and 71.2% (PD) shifted their answers to level 2-5L, whereas between approximately 14.5% (AD) and 27.4% (MO) selected level 3-5L. Of the samples reporting level 3-3L, 25.0% for AD and 100% for PD redistributed their answers to level 4-5L. Moreover, only two samples (50.0%) shifted their answers from level 3-3L to level 5-5L for AD. Of 6,000 redistribution pairs, small proportions of inconsistent pairs were observed, with the highest in the AD dimension (n = 30, 0.5%) and the lowest in the SC dimension (n = 7, 0.12%).

Table 4 Response redistribution from 3L to 5L and mean of EQ-VAS

Validity

Table 5 shows convergent and discriminant validity. The physical dimension of the WHOQOL-BREF had moderate correlations with MO (r = −0.32 for the 3L, r = −0.33 for the 5L, all p < 0.01) and PD (r = −0.36 for the 3L, r = −0.35 for the 5L, all p < 0.01), while MO and PD had moderate correlations with PF (r = −0.42 for the 3L, r = −0.47 for the 5L, all p < 0.01) and BP (r = −0.33 for the 3L, r = −0.35 for the 5L, all p < 0.01) of the SF-12v2, respectively. AD had moderate correlations with the psychological dimension of the WHOQOL-BREF (r = −0.34 for the 3L, r = −0.33 for the 5L, p < 0.01), and it had weak and moderate correlations with MH of the SF-12v2 for the 3L and the 5L, respectively (r = −0.24 for the 3L, r = −0.30 for the 5L, p < 0.01). Among the WHOQOL-BREF dimensions, the EQ-VAS correlated most strongly with the physical dimension (r = 0.40, p < 0.01), while among the SF-12v2 dimensions it correlated most strongly with GH (r = 0.35, p < 0.01).

Table 5 Comparison of convergent and discriminant validity of 5L and 3L with WHOQOL-BREF and SF-12v2

As displayed in Table 6, we found that both EQ-5D versions could discriminate utility scores well with regard to gender, age, education level, household income, smoking, alcohol, and number of comorbidities. As expected, the following hypotheses were confirmed for both EQ-5D versions: females, older subjects, and those with one or more comorbidities tended to have a lower mean utility index, with all p < 0.05.

Table 6 Known-group validity of 5L and 3L index scores based on real Thai value sets, using multivariable analyses

Reliability

All 400 subjects completed the questionnaire at 2–3 weeks after the initial assessment, and all retest questionnaires were returned to researchers within 21 days. Of 400 subjects, 239 (59.75%) reported themselves with no health status change from the first measurement (Table 7). The 5L’s weighted kappa coefficients ranged from 0.48 to 0.61, while the 3L’s ranged from 0.42 to 0.63. Moreover, the MO from both versions had the highest reproducibility with weighted kappa coefficients of 0.63 (95% CI 0.51–0.76) for the 3L and 0.61 (95% CI 0.49–0.72) for the 5L, while the lowest reproducibility was observed in AD of the 3L and UA of the 5L, with weighted kappa coefficients of 0.42 (95% CI 0.29–0.55) and 0.48 (95% CI 0.35–0.60), respectively. Percentage agreements across five dimensions ranged from 0.81 to 0.97 for the 3L and from 0.75 to 0.97 for the 5L. The ICCs of 3L and 5L indexes and EQ-VAS were 0.78 (95% CI 0.71–0.83), 0.71 (95% CI 0.63–0.78), and 0.82 (95% CI 0.77–0.86), respectively.

Table 7 Comparison of test–retest reliability for the EQ-5D descriptive system and the utility index between 3L and 5L

Acceptability

Nearly half of the subjects (n = 589, 49.1%) thought that the 5L was easier to understand than the 3L, while 34.8% reported no difference. Regarding reflection of health status, 29.8% of subjects thought that the 5L could better reflect their health than the 3L; however, 31.5% indicated that the two versions were similar.

Discussion

Ours is the first study investigating the psychometric properties of the 5L compared to the 3L in the general Thai population, including practicality, discriminatory power, response redistribution, validity, reliability, and acceptability.

As in previous studies [6, 7, 17,18,19, 22, 38,39,40,41,42], adding two more levels of severity to the 3L reduced the overall ceiling effect, here by 8.09 percentage points, a relative reduction of 14.14%. Our ceiling effect reduction (3L–5L) was lower than those in previous studies, which ranged from 9.7 to 20% [38, 43,44,45]. However, it was similar to that in a previous study conducted with the general Korean population [7], which supports the validity of our results.

Not surprisingly, most samples retained their answers at level 1 for both EQ-5D versions in all five dimensions, consistent with various previous studies conducted in both general populations and patient groups [7, 18, 19, 22, 46]. This might be because most recruited samples were relatively healthy, so they rated themselves as having “no problems” on both EQ-5D versions. We also found inconsistent responses to these two versions in our samples, at an overall proportion of 1.7%, highest in AD (0.5%) and lowest in SC (0.12%). This was similar to the proportions reported in previous studies [41, 47], indicating that our samples answered the two EQ-5D versions consistently.

As expected, adding two more levels of severity to the EQ-5D’s descriptive system increased discriminatory power (\(H^{\prime}\)) in all dimensions relative to the 3L, with incremental values ranging from 0.01 to 0.41. Conversely, the Shannon evenness index (\(J^{\prime}\)) was lower in the 5L than in the 3L for all dimensions except MO. Notably, our \(H^{\prime}\) and \(J^{\prime}\) values were slightly lower than the findings from previous studies [6, 17, 19, 39, 42, 46]. This may be because those studies were conducted in clinical areas and in general populations with a large sample size (n = 7554), so samples with moderate/extreme conditions were more likely to be recruited. However, our values were similar to those reported by Pattanaphesaj et al. (0.21–1.40 for \(H^{\prime}\), 0.09–0.60 for \(J^{\prime}\)) [22]. This supports the validity of our results and suggests that the 5L improves discriminatory power across a wide range of populations in Thailand.

As for construct validity, hypothesized correlations between both EQ-5D versions and WHOQOL-BREF and SF-12v2 were confirmed because two similar dimensions from those instruments yielded a higher correlation coefficient than two dissimilar dimensions. However, the strength of correlation was not as strong as anticipated. We reasoned that both EQ-5D versions asked respondents to rate their current health status, whereas the WHOQOL-BREF and SF-12v2 asked respondents to rate their health with a 2-week and a 4-week recall period, respectively. Nevertheless, the correlation pattern was like those reported in previous studies [7, 19, 22, 23, 42], implying that our results are valid.

For known-group validity, both EQ-5D versions showed a lower utility index among females, older subjects, and those with a higher number of comorbidities. These findings were consistent with previous studies [39, 42, 48]. Moreover, the known-group analyses revealed that smokers and drinkers had higher utility scores than their counterparts for the 5L and 3L, respectively. A previous study among Thai diabetic patients revealed that smokers and drinkers reported more health problems than non-smokers and non-drinkers on two bolt-on dimensions of the EQ-5D-5L, interpersonal relationships and activities related to bending knees [49], while another previous study showed that smokers and drinkers reported higher scores on the SF-36v2 in the general Thai population [50]. This might be due to the Thai population’s specific characteristics, so these associations should be reinvestigated through further research.

Regarding reliability, the 3L and 5L indexes and the EQ-VAS showed good to excellent reproducibility, and all five dimensions produced good reproducibility for both EQ-5D versions. Compared with the studies by Pattanaphesaj et al. [22] and Sakthong et al. [23], our weighted kappa values were similar or slightly higher, while our ICCs of the EQ-5D index and EQ-VAS were slightly higher than those of Pattanaphesaj et al. but lower than those reported by Sakthong et al. Moreover, as in Pattanaphesaj et al., the weighted kappa for SC was not computed in our study because the high ceiling effect in both versions resulted in a lack of variance in the dataset. A possible explanation is that our study was conducted in general Thai samples with a limited range of health states, while Pattanaphesaj et al. conducted their study in diabetic patients without complications. This contrasted with the sample in Sakthong et al., who reported a weighted kappa coefficient for SC of 0.57 (95% CI 0.44–0.70), since they conducted their study in Thai patients with many chronic diseases and different levels of impairment.

Moreover, our study showed that the values of weighted kappa and ICCs for the 5L were lower than those for the 3L, indicating that the 5L seemed less reliable than the 3L. This resembled the findings reported by Pattanaphesaj et al. [22]. However, these findings contrasted with those reported by Kim et al. [19], Conner-Spady et al. [47], and Jia et al. [38]. A possible explanation is that the long interval (14–21 days) between the two assessments might have contributed to recall bias when respondents judged whether their health status had changed since the first assessment. Furthermore, some respondents (40%) assigned to complete the second questionnaire were unhealthy, so their health status might have changed during this long interval. Previous evidence has also suggested that approximately 2 weeks is an appropriate time interval for test–retest reliability [51, 52]. Therefore, further studies investigating reliability with a shorter time interval are warranted.

One limitation that should be addressed is that our time interval for evaluating test–retest reliability was 2–3 weeks, and the results were inconsistent with findings reported in other studies. Therefore, future research investigating the effect of various time intervals on test–retest reliability is greatly encouraged.

Conclusion

The evidence supported that the 5L had an acceptable level of validity and reliability in the general Thai population. In addition, we found that the 5L was slightly better than the 3L in ceiling effect, discriminatory power, and convergent validity, while it showed known-groups validity comparable to that of the 3L. However, evidence to establish the superiority of the 5L over the 3L for test–retest reliability was limited. Therefore, to confirm our results, the 5L should be reinvestigated with a larger number of subjects having various levels of health impairment.