Introduction

The quality-adjusted life year (QALY) is regarded as one of the most important outcomes in economic evaluations of healthcare interventions [1]. It is calculated by multiplying a quality adjustment weight (or health utility) by life duration to generate a standardized metric that can then be used in cost-utility analysis (CUA) [1]. A common approach to eliciting the health utility values is the use of generic preference-based measures, such as the EQ-5D or SF-6D [1,2,3]. A generic preference-based measure usually consists of a health state classification system and a corresponding country-specific health utility value set elicited from a representative sample of the general population [1].

The health state utility values have been widely elicited using cardinal approaches, such as standard gamble (SG) and time trade-off (TTO) [1, 4, 5]. However, these approaches are cognitively complex, and respondents might have some difficulty in understanding and completing the task, particularly those in vulnerable groups such as older adults or children [6]. One of the most recent developments in utility elicitation is the adoption of the discrete choice experiment (DCE), especially for online surveys [7,8,9,10]. The DCE with duration (DCETTO) approach, a variant of DCE, provides a novel alternative for eliciting utility values [10, 11]. Unlike traditional DCE, in which only different hypothetical health states are presented, DCETTO requires respondents to also consider the duration of living in each hypothetical health state, i.e., it includes a quantity vs. quality trade-off in each task. Consequently, it does not require a separate task to anchor the latent utility, which remains controversial in the traditional DCE approach [6, 12]. Compared with the iterative process of identifying the indifference point between two options in SG and TTO, DCETTO is usually regarded as a promising alternative since it only requires respondents to make ordinal choices [10].

Evidence on reliability is crucial when assessing the performance of an elicitation approach [1, 13]. Assessment of reliability commonly refers to two different types of validation [14]. The first, internal reliability, mainly assesses the homogeneity of multi-item scales and is not the focus of this study. The second, test–retest reliability, focuses on the repeatability and stability of measurements. Test–retest reliability is a critical property to ensure that the elicited societal preference on different health states is stable over time. So far, the evidence on test–retest reliability of different elicitation approaches, including visual analogue scale (VAS), SG, TTO, DCE, and DCETTO, is very limited [13, 15,16,17,18,19,20]. Overall, mixed results were found in the studies that compared VAS, TTO, and SG, based on the reported intraclass correlation coefficients (ICC) [16,17,18,19]. The two studies that compared traditional DCE and TTO also reported mixed results on test–retest reliability [13, 20]. To date, no studies have compared the test–retest reliability of DCETTO with that of other approaches. A summary of existing evidence on test–retest reliability of different elicitation approaches can be found in Table 1.

Table 1 A summary of the comparison of the test–retest reliability of elicitation approaches in previous studies

Given the increasing use of DCETTO in health state valuation [21], it is crucial to deepen our understanding of its test–retest reliability, particularly in comparison with traditional approaches such as TTO. This study aimed to evaluate and compare the test–retest reliability of DCETTO and TTO based on the SF-6Dv2 valuation tasks among a representative sample of the Chinese general population.

Methods

This study was part of a larger study that focused on the valuation of the SF-6Dv2 using face-to-face interviews among the Chinese general population [22]. More detailed descriptions of the design of the valuation study can be found elsewhere [22].

Instrument

The SF-6D is derived from the Short-Form 36 (SF-36) health survey [23]. The original health state classification system of the SF-6D comprises six dimensions with four to six levels in each, including physical functioning (PF), role limitation (RL), social functioning (SF), pain (PN), mental health (MH), and vitality (VT), yielding up to 18,000 health states [23]. Recently, a second version of the SF-6D, SF-6Dv2, was developed, which revisited the items selected from the SF-36 and addressed the ambiguity between dimension levels and inconsistency of wording in the original version [24]. The SF-6Dv2 has the same six dimensions with five to six levels in each dimension, resulting in 18,750 health states in total [24,25,26]. The Simplified Chinese version of the SF-6Dv2 was developed after translation and cross-cultural adaptation, and preliminary psychometric testing was conducted among the Chinese general population [26].
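The state counts quoted above are the product of the number of levels across the six dimensions. The sketch below reproduces both totals. The per-dimension level counts are not listed in the text, so the values used here are assumptions chosen to be consistent with the reported totals and with the six-level pain dimension implied by state 555655.

```python
from math import prod

# Assumed number of levels per dimension (not listed in the text);
# chosen to be consistent with the totals of 18,000 and 18,750.
SF6D_V1_LEVELS = {"PF": 6, "RL": 4, "SF": 5, "PN": 6, "MH": 5, "VT": 5}
SF6D_V2_LEVELS = {"PF": 5, "RL": 5, "SF": 5, "PN": 6, "MH": 5, "VT": 5}


def count_states(levels):
    """Number of distinct health states = product of levels per dimension."""
    return prod(levels.values())


n_v1 = count_states(SF6D_V1_LEVELS)
n_v2 = count_states(SF6D_V2_LEVELS)
print(n_v1, n_v2)
```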

Elicitation tasks design

The composite TTO approach (hereafter TTO) [10, 22, 27,28,29] and DCETTO elicitation approaches were employed in this study (Supplementary Fig. 1) [22]. A total of 295 states were selected for TTO tasks, including the six mildest imperfect states, the worst state, and 288 other states generated based on near orthogonal arrays using SAS® Studio. These 288 states were first divided into 48 blocks of six states each. The worst state (included in all 48 blocks) and the six mildest states (each randomly included in eight blocks) were then added to these blocks, giving eight states per block. Respondents were randomly assigned to 1 of the 48 blocks for valuation.
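The block assembly described above can be sketched as follows. This is an illustration of the bookkeeping only, not the authors' procedure: the 288 design states are stood in for by hypothetical placeholder labels, and the labels used for the mildest states and the worst state are assumptions (one dimension at level 2, and every dimension at its highest level, respectively).

```python
import random

# Hypothetical placeholders for the 288 states from the near orthogonal design
DESIGN_STATES = [f"D{i:03d}" for i in range(288)]
# Assumed labels: one dimension at level 2, all others at level 1
MILDEST = ["211111", "121111", "112111", "111211", "111121", "111112"]
# Assumed label for the worst state (highest level on every dimension)
WORST = "555655"


def build_tto_blocks(design_states, mildest, worst, n_blocks=48, seed=0):
    """Split design states into blocks, then add the worst state to every
    block and one mildest state to each block (each mildest state is used
    in n_blocks / len(mildest) blocks, in random order)."""
    rng = random.Random(seed)
    per_block = len(design_states) // n_blocks
    blocks = [
        list(design_states[i * per_block:(i + 1) * per_block])
        for i in range(n_blocks)
    ]
    repeats = n_blocks // len(mildest)
    mild_pool = [m for m in mildest for _ in range(repeats)]
    rng.shuffle(mild_pool)
    for block, mild in zip(blocks, mild_pool):
        block.append(worst)
        block.append(mild)
    return blocks


blocks = build_tto_blocks(DESIGN_STATES, MILDEST, WORST)
print(len(blocks), len(blocks[0]))
```

With 288 design states and 48 blocks, each block holds six design states plus the worst state and one mildest state, which matches the eight TTO responses per respondent analyzed later.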

For DCETTO tasks, four levels of life duration, i.e., 1, 4, 7, and 10 years, were chosen [22, 25]. The DCETTO tasks, which consisted of 300 state pairs distributed over 30 blocks, were generated using the balanced overlap method in Lighthouse Studio 9.6.0 (Sawtooth Software, Inc), with statistical efficiency maximized according to D-efficiency [22, 30,31,32]. The task order and the left–right position of health states within each task were both randomized. Respondents were randomly assigned to 1 of the 30 blocks for valuation.

Sample and data collection

Respondents included in this study were recruited from eight cities and their surrounding rural areas to cover a wide geographical range with varying economic development stages in China [22]. A quota sampling method was used to recruit a representative sample of the Chinese general population in terms of age, gender, and area of residence (urban/rural) [33, 34]. Face-to-face, computer-based interviews were conducted in this study. More detailed information can be found in the main valuation paper [22].

After the first interview (test), the interviewers asked for the respondents’ consent to participate in a second face-to-face interview (retest) and collected their contact information. While the target interval between test and retest was two weeks [15, 35], it could be relaxed to a range of 10–30 days so that respondents could be interviewed again at a time convenient to them. The retest interview was conducted using the same process by the same interviewers. In the retest interview, respondents were assigned to the same block of tasks for both TTO and DCETTO as in the first interview, with the same previously described randomization of the order of tasks.

Data analysis

Descriptive analyses were first conducted to present the respondents’ characteristics, as well as the distributions of both TTO and DCETTO data. The utility values of the respondents' self-reported SF-6Dv2 health states were calculated using the Chinese-specific value set [22]. Then, the test–retest data for the two approaches were analyzed at both the individual level and the aggregate level.

For the DCETTO tasks, a “pseudo-QALY” approach was employed to present the relative preference for choice A versus B in each choice pair. The pseudo-QALY was obtained by multiplying the utility value of the health state (calculated using the Chinese-specific DCETTO value set of SF-6Dv2 [22]) by the corresponding life duration. For example, the difference in pseudo-QALYs between choice A (121122 with 4 years) and B (413334 with 1 year) in a DCETTO task would be (0.971 * 4) − (0.639 * 1) = 3.245 pseudo-QALYs.
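The pseudo-QALY arithmetic in the example above can be reproduced with a few lines of code. The utility values 0.971 and 0.639 are taken from the text; the function names are illustrative only.

```python
def pseudo_qaly(utility, years):
    """Pseudo-QALY of one option: utility value times life duration."""
    return utility * years


def pseudo_qaly_difference(option_a, option_b):
    """Difference in pseudo-QALYs (A minus B); each option is (utility, years)."""
    (u_a, t_a), (u_b, t_b) = option_a, option_b
    return pseudo_qaly(u_a, t_a) - pseudo_qaly(u_b, t_b)


# Choice A: state 121122 for 4 years; choice B: state 413334 for 1 year
diff = pseudo_qaly_difference((0.971, 4), (0.639, 1))
print(round(diff, 3))  # 3.245
```

A positive difference means choice A offers more pseudo-QALYs than choice B.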

Statistical analyses at the individual level

For the TTO data, three evaluations were conducted. First, the number of respondents changing 0–8 out of their 8 responses between test and retest was investigated. Second, the proportion of responses that had different values between the two tests was evaluated. Any significant difference in values between test and retest was assessed using the Wilcoxon matched-pairs signed-rank test. The degree of consistency between the two tests was also evaluated using the intraclass correlation coefficient (ICC), with the interpretation of ICC < 0.40 = poor, 0.40–0.59 = fair, 0.60–0.74 = good, and > 0.74 = excellent [36]. Third, the degree of agreement of TTO observed values between test and retest was assessed using a Bland–Altman plot.
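The quantities behind a Bland–Altman plot, together with the ICC interpretation bands given above, can be sketched as follows. This is a minimal illustration on made-up values, not the study data. It assumes the conventional limits of agreement (mean difference ± 1.96 SD) and takes the difference as test minus retest; the text does not state either convention explicitly.

```python
from statistics import mean, stdev


def bland_altman(test, retest):
    """Return (bias, lower limit, upper limit, share of points within limits).

    Assumes difference = test - retest and limits = bias +/- 1.96 * SD.
    """
    diffs = [t - r for t, r in zip(test, retest)]
    bias = mean(diffs)
    sd = stdev(diffs)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    within = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return bias, lower, upper, within


def interpret_icc(icc):
    """Map an ICC to the bands cited in the text [36]."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc <= 0.74:
        return "good"
    return "excellent"


# Hypothetical TTO values for six tasks (illustration only)
test_values = [1.0, 0.5, 0.8, -0.2, 0.3, 0.95]
retest_values = [1.0, 0.45, 0.8, -0.3, 0.4, 0.95]
bias, lower, upper, within = bland_altman(test_values, retest_values)
print(bias, lower, upper, within)
print(interpret_icc(0.945))
```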

For the DCETTO data, first, the number of respondents changing 0–10 out of their 10 choices between test and retest was investigated. Second, the proportion of choices that were identical between the two tests was evaluated. The overall agreement irrespective of respondents and blocks was calculated, with good agreement confirmed at ≥ 70% [37]. The kappa (κ) statistic was also calculated to provide an estimate of agreement corrected for chance, with the interpretation κ < 0.40 = low, 0.41–0.60 = moderate, 0.61–0.80 = good, and > 0.80 = excellent [15, 38]. Third, the proportions of respondents who gave consistent choices between the two tests in different pseudo-QALY categories were shown using a histogram.
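The overall agreement and the chance-corrected kappa described above can be computed as in the following sketch. It uses the standard Cohen's kappa formula, κ = (p_o − p_e) / (1 − p_e), on hypothetical choices; assigning a value of exactly 0.40 to the "low" band is an assumption, since the cited bands leave that boundary open.

```python
from collections import Counter


def percent_agreement(test, retest):
    """Share of paired choices that are identical between test and retest."""
    return sum(a == b for a, b in zip(test, retest)) / len(test)


def cohen_kappa(test, retest):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(test)
    p_o = percent_agreement(test, retest)
    c_test, c_retest = Counter(test), Counter(retest)
    p_e = sum(c_test[k] * c_retest[k] for k in set(c_test) | set(c_retest)) / n**2
    if p_e == 1.0:  # both ratings constant and identical; kappa is undefined
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def interpret_kappa(kappa):
    """Map kappa to the bands cited in the text [15, 38]."""
    if kappa <= 0.40:
        return "low"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "excellent"


# Hypothetical choices of one respondent over 10 tasks (illustration only)
test_choices = list("AABBABABBA")
retest_choices = list("AABBBBABAA")
agreement = percent_agreement(test_choices, retest_choices)
kappa = cohen_kappa(test_choices, retest_choices)
print(agreement, kappa)
```

When the two options are chosen about equally often, the chance agreement p_e is close to 0.5, so an observed agreement near 76% corresponds to a kappa near 0.53.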

Test–retest reliability was also evaluated across subgroups defined by demographic characteristics, the time interval between the two tests, and the difference in self-reported utility value between the two tests. Linear regression was used for TTO data, with the dependent variable being the difference in observed TTO values between the two tests. A binary logistic regression model was used for DCETTO data, in which the dependent variable indicated whether or not identical choices were observed between the two tests. Cluster-robust standard errors were used to account for each respondent completing multiple tasks.

Statistical analyses at the aggregate level

Considering the relatively small sample size of this study, constrained main-effect only model specifications were estimated for TTO and DCETTO data, respectively. Different from the main valuation study, in which a set of dummy variables was used for each dimension [22], here each dimension was modeled as a continuous variable.

Equation (1) was used to model TTO data:

$$y_{i} = \alpha + \beta_{1} PF + \beta_{2} RL + \beta_{3} SF + \beta_{4} PN + \beta_{5} MH + \beta_{6} VT + \varepsilon$$
(1)

where \(y_{i}\) is the disutility value given by respondent \(i\); \(\alpha\) is the intercept; PF, RL, SF, PN, MH and VT are continuous variables representing the different levels in each dimension of SF-6Dv2, assuming a linear effect across levels; \(\beta_{1}\) to \(\beta_{6}\) are the estimated coefficients on each dimension; and \(\varepsilon\) is the error term.

Equation (2) was used to model DCETTO data:

$$\begin{aligned} U_{ij} = {} & \lambda_{0} t_{ij} + \lambda_{1} PF\,t_{ij} + \lambda_{2} RL\,t_{ij} + \lambda_{3} SF\,t_{ij} \\ & + \lambda_{4} PN\,t_{ij} + \lambda_{5} MH\,t_{ij} + \lambda_{6} VT\,t_{ij} + \varepsilon_{ij} \end{aligned}$$
(2)

where \(U_{ij}\) is the binary choice of respondent \(i\) for DCETTO task \(j\); \(t_{ij}\) is the life duration, which is modeled as a linear variable; \(\lambda_{0}\) is the coefficient for the life duration; PF, RL, SF, PN, MH and VT are continuous variables representing the different levels in each dimension, assuming a linear effect across levels, and are included as interaction terms with the life duration variable; \(\lambda_{1}\) to \(\lambda_{6}\) are the coefficients for these interactions; and \(\varepsilon_{ij}\) is the error term.
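To make the structure of Equation (2) concrete, the sketch below evaluates its right-hand side (without the error term) for each option in a pair and converts the difference into a choice probability with a logit link, which is what a conditional logit reduces to for two alternatives. All coefficient values are hypothetical. Coding each dimension as level minus 1 is an assumption, and so is the anchoring step (dividing the interaction coefficients by the duration coefficient); that step is one common way to place such a model on the QALY scale and is not necessarily the authors' exact procedure.

```python
import math

DIMS = ("PF", "RL", "SF", "PN", "MH", "VT")

# Hypothetical coefficients for illustration only (not estimates from the study)
LAMBDA_0 = 0.30  # duration coefficient
LAMBDAS = {"PF": -0.030, "RL": -0.010, "SF": -0.015,
           "PN": -0.040, "MH": -0.025, "VT": -0.012}


def decode(state):
    """Turn a state label such as '121122' into level-minus-1 scores (assumed coding)."""
    return {d: int(c) - 1 for d, c in zip(DIMS, state)}


def systematic_utility(state, years):
    """Right-hand side of Eq. (2) without the error term."""
    x = decode(state)
    return LAMBDA_0 * years + sum(LAMBDAS[d] * x[d] for d in DIMS) * years


def prob_choose_a(option_a, option_b):
    """Logit probability of choosing A over B; each option is (state, years)."""
    v_a = systematic_utility(*option_a)
    v_b = systematic_utility(*option_b)
    return 1.0 / (1.0 + math.exp(-(v_a - v_b)))


def anchored_value(state):
    """Assumed anchoring: 1 + sum(lambda_k * x_k) / lambda_0, so 111111 maps to 1."""
    x = decode(state)
    return 1.0 + sum(LAMBDAS[d] * x[d] for d in DIMS) / LAMBDA_0


p_a = prob_choose_a(("121122", 4), ("413334", 1))
print(p_a, anchored_value("111111"), anchored_value("555655"))
```

Because every health-state term is multiplied by duration, the model needs no separate anchoring task: the ratio of each interaction coefficient to the duration coefficient gives the utility decrement per level directly.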

Both test and retest data for TTO and DCETTO were modeled using the optimal statistical methods that were selected to generate the Chinese-specific value set of SF-6Dv2 [22]. In brief, the TTO data were analyzed using the random-effect model; the DCETTO data were analyzed using a conditional logit model, following the model specification proposed by Bansback et al. [10] and the corresponding method of anchoring on the QALY scale [10, 39,40,41,42]. More detailed information can be found in the main valuation paper [22]. Owing to the sample size of the retest data, the consistency between test and retest was mainly focused on the rank order of SF-6Dv2 dimensions from the model estimations. The scatter plot was also drawn to visually demonstrate the degree of consistency.

All statistical analyses were conducted using Stata 15.1. For the comparison of distributions of characteristics between subgroups, the t-test was used for continuous variables, while the χ² or Fisher exact test was used for categorical variables. A two-tailed p-value < 0.05 was considered statistically significant.

Results

Respondents

Of 178 respondents who consented to participate in the retest survey, 16 were excluded because they did not complete the second interview. Consequently, 162 respondents were included in this study. The mean (standard deviation, SD) interval between the first and the second interviews was 15.6 (4.4) days (range 10–33 days). As illustrated in Table 2, the mean (SD) age of the sample was 44.4 (16.5) years, ranging from 18 to 80 years, 51.9% were males, and 37.7% lived in rural areas. The distributions of characteristics of the respondents were similar to those of the Chinese general population in terms of age, gender, and proportion of urban/rural residence [33, 34]. The mean utility values of the self-reported health state using SF-6Dv2 were 0.868 in the first interview and 0.872 in the second. The mean (SD) absolute difference in utility value between the two interviews was 0.026 (0.052), with a range of 0–0.428.

Table 2 Characteristics of respondents

TTO data

A total of 1,296 TTO responses were provided by the 162 respondents for each test. Histograms of the TTO observed values showed a comparable distribution between the two tests (Fig. 1). More than half of the respondents (N = 118, 72.8%) changed four or fewer of their eight responses in the retest, with only five (3.1%) respondents changing more than seven responses (Supplementary Table 1). Of the 1,296 responses, 770 (59.4%) were identical between the two tests, 231 (17.8%) increased, and 295 (22.8%) decreased (Supplementary Table 3). While the mean absolute difference between the two tests ranged from 0 to 0.142 across health states, with an overall mean (SD) absolute difference of 0.029 (0.081), there was only one health state (555655) with a significant change (p = 0.041) in median value (Supplementary Table 3). The ICC ranged from 0.500 to 1.000, with a mean ICC of 0.945. The Bland–Altman plot (Fig. 2) showed that the mean difference of observed TTO values between test and retest was 0.01. The 95% limits of agreement ranged from −0.17 to 0.19, and 92.2% of points lay within the limits.

Fig. 1

The comparison of response distributions for test and retest data. The pseudo-QALY was calculated by multiplying the utility value of the health state by the corresponding life duration. The utility value was calculated using the Chinese DCETTO value set [21]. For example, the pseudo-QALY for choice A (121122 with 4 years) in a DCETTO task would be 0.971 * 4 = 3.884 pseudo-QALYs. TTO time trade-off, DCETTO discrete choice experiment with duration

Fig. 2

The Bland–Altman plot of TTO observed values for test and retest

DCETTO data

The DCETTO data consisted of 1,620 responses per test. Histograms of the relative preference for choice A vs. B by the difference in pseudo-QALYs for the two tests showed a similar, expected distribution (Fig. 1), in which respondents were more likely to choose the option that had more pseudo-QALYs. A total of 116 (71.6%) respondents gave three or fewer different responses among the 10 tasks (Supplementary Table 2). Only two (1.2%) respondents gave seven different responses, and no respondents gave eight or more different responses. The overall agreement was 76.4%, with 1,238 of 1,620 responses being identical between the two tests (Supplementary Table 3). The kappa (κ) statistic was 0.528, which was interpreted as moderate agreement [15, 38]. As shown in Fig. 3, the proportions of respondents who gave consistent choices between the two tests in different pseudo-QALY categories ranged from 66.7 to 86.4%. A slightly higher proportion was observed among categories with larger differences in pseudo-QALYs between the two choices.

Fig. 3

The proportions of respondents who gave consistent choices in different pseudo-QALY categories for DCETTO data. The pseudo-QALY was calculated by multiplying the utility value of the health state by the corresponding life duration. The utility value was calculated using the Chinese DCETTO value set [21]. For example, the pseudo-QALY for choice A (121122 with 4 years) in a DCETTO task would be 0.971 * 4 = 3.884 pseudo-QALYs

Subgroup analyses

As illustrated in Table 3, the differences in observed TTO values between the two tests were not statistically significant among subgroups of all characteristics, except for the difference in the self-reported SF-6Dv2 utility values between the two tests (p < 0.001). The differences in the proportion of identical choices for DCETTO were statistically significant only between subgroups with or without chronic conditions (p = 0.029). Linear regression analysis demonstrated that the difference in observed TTO value between the two tests became larger when the severity score of the health state valued in the TTO task increased (Coef. = 0.001, 95% CI: [0.000, 0.002], p-value = 0.025) (Supplementary Table 4). Logistic regression for the DCETTO data showed that respondents were more likely to give consistent choices between the two tests in the task with a larger difference of pseudo-QALYs (Coef. = 0.112, 95% CI: [0.048, 0.176], p-value = 0.001) (Supplementary Table 4).

Table 3 Subgroup analysis for TTO and DCETTO data

Comparisons on aggregated model estimates

As shown in Table 4, the constrained models for TTO data showed a consistent rank order of dimensions between test and retest, i.e., PN > PF > MH > RL > VT > SF, with all coefficients being statistically significant. Similarly, the constrained models for DCETTO data also showed a consistent rank order of PN > PF > MH > SF > VT > RL, although the coefficients of RL and VT were not statistically significant in either the test or the retest model (Table 4). The scatter plots (Supplementary Fig. 2) demonstrated that, while generally good consistency was observed for both approaches, the consistency of the estimated utility values between the two tests for TTO was slightly higher than that for DCETTO.

Table 4 Estimated coefficients of the constrained models on TTO and DCETTO data

Discussion

Compared with TTO, the most widely used approach, DCETTO is commonly regarded as a promising alternative [10]. To the best of our knowledge, this study provided the first empirical evidence that directly compared the test–retest reliability of the TTO and DCETTO approaches. The results demonstrated good test–retest reliability of both utility elicitation approaches in the context of developing the SF-6Dv2 value set in China. Nevertheless, it should be borne in mind that the implications of the observed levels of test–retest reliability for the two approaches are different, since TTO values are directly modeled as utility values, while DCETTO choices are modeled as latent values under random utility theory.

The test–retest reliability reported in this study is comparable to or better than what has been reported in the literature. The ICC for TTO was 0.945, which is higher than the values reported in previous studies, which ranged from 0.780 to 0.880 [16,17,18,19,20]; 59.4% of the responses were identical between the two tests, which is also higher than the 24.5% reported in a previous study [13]. Regarding the DCETTO, the overall agreement (76.4 vs. 70.6–80.2%) and the kappa (κ) statistic (0.528 vs. 0.411–0.605) are consistent with those from the previous study [15]. The better results for both TTO and DCETTO data may be partly due to the different interview methods employed in these studies, i.e., the face-to-face interview in this study and postal or telephone interviews in most of the previous studies [15, 17, 19]. It is also worth noting that more respondents reported extreme values of both the worst (−1) and the best TTO values (1) in the retest than in the test, with a lower mean value in the retest (0.356 vs. 0.367). This finding was not consistent with a previous study, which reported a higher mean value of 0.042 in the retest [13]. With very limited studies focusing on test–retest reliability of valuation techniques, more studies are warranted to confirm this finding.

Findings from the subgroup analyses in this study are worth highlighting. In regression analyses for both TTO and DCETTO data, characteristics of respondents, including age, gender, education level, chronic condition status, marital status, regions of residence, and the self-reported utility values had no statistically significant impact on the TTO observed values or the DCETTO choices between the two tests. The severity score of the health state valued in the TTO task had a significant negative effect on consistency: the difference in observed values between the two tests became larger as severity increased. This finding is not surprising since the cognitive burden may be heavier when the health state in TTO tasks becomes worse, especially for those health states that are considered worse than death, in which a different question design from states better than death is used [17, 28]. The finding for DCETTO that it is easier to make a consistent choice when facing a larger difference in pseudo-QALYs between two options also seems reasonable. These findings provide the first empirical exploration of how the sociodemographic characteristics of respondents and the characteristics of health states relate to the test–retest reliability of utility elicitation approaches.

The time interval between test and retest is commonly considered to be associated with the reliability of the results. In this study, we found that the time interval between the two tests had no significant impact on either the TTO or the DCETTO approach. Note that the mean time interval of 15.6 days in this study was comparable with previous studies, in which intervals mainly ranged from 3 to 59 days, with means of 5–19 days [13, 16,17,18,19,20]. Further studies with longer time intervals are warranted to evaluate the relationship between time interval and consistency, as well as the memory effect (i.e., respondents may remember the choices they made in the first test for a few days) on the test–retest reliability of health utility elicitation approaches.

Both TTO and DCETTO are relatively stable over time in the rank order of dimensions in model estimations between test and retest, which provides evidence of feasibility in eliciting utility at the aggregate level for both approaches. We mainly focused on the constrained models, since the aim of this study was to compare the reliability of these approaches over time rather than to generate a utility value set. With a relatively small sample size, constrained models with fewer parameters could generate more robust results than a full model with four or five dummies for each dimension. It is also worth noting that, according to the scatter plots, the consistency of estimated values between test and retest data was potentially better for TTO than for DCETTO. However, owing to the relatively small sample size in this study, further studies with larger sample sizes are warranted to evaluate the consistency of model estimations.

Several limitations of this study should be considered. First, given the relatively small sample size and the correspondingly small number of observations for each task (the experimental designs were the same as in the Chinese SF-6Dv2 valuation study), the statistical efficiency of the utility model estimation could be affected. The constrained model specifications (each dimension was modelled as a continuous variable) were therefore used in this study instead of a full main-effects model (a set of dummy variables was used for each dimension). Second, the distribution of education level of the sample in this study was different from that of the Chinese general population, i.e., the sample had a higher proportion of respondents with college or higher degrees and with primary school or lower education, and a lower proportion with junior high school education. However, the subgroup analysis for the education level demonstrated that this difference had a trivial impact on the study findings. Third, the 16 interviewers employed in this study received the same extensive training but came from different regions of China, had different academic backgrounds, and might have adopted different interview skills, all of which might have influenced the reported findings. However, the interviewer effect appeared negligible based on checks of the distributions of TTO and DCETTO data across interviewers and cities. Fourth, there might be a selection effect for the respondents who completed the retest interviews. However, there was no significant difference in most demographic characteristics between the respondents included in the larger valuation study and respondents included in this study, except for the education level and employment status (Supplementary Table 5) [22]. Fifth, it should be noted that the computation of the pseudo-QALY might be distorted by time preferences. Subsequent studies using this approach should pay attention to this issue.

Conclusions

Individual responses to both TTO and DCETTO approaches are relatively stable over time. The rank orders of dimensions in model estimations between test and retest for TTO and DCETTO are also consistent, which provides evidence of feasibility in eliciting utility at the aggregate level for both approaches. Subgroup analyses from this study demonstrated a potentially negligible relationship between the demographic characteristics of respondents and the test–retest reliability of both approaches. The differences in utility estimation between the two tests for DCETTO need to be further investigated based on a larger sample size.