Introduction

The status of preference-based measures of health for generating quality-adjusted life years (QALYs) was considerably enhanced by the recommendations of the U.S. Public Health Service Panel on Cost-Effectiveness in Health and Medicine to use them in economic evaluation [1]. The use of preference-based measures has grown considerably over the last decade with the increasing use of economic evaluation to inform health policy, for example through the establishment of bodies such as the National Institute for Health and Clinical Excellence in England and Wales [2].

To be a preference-based measure, it has been suggested that the health state valuation technique must be choice-based [1–3]. The two choice-based techniques most commonly used to value preference-based measures are the cardinal methods of standard gamble (SG) and time trade-off (TTO) [4–6]. There are concerns about these cardinal methods because they are likely to be affected by factors other than a respondent’s preference for the state, such as risk aversion in the case of standard gamble or time preference and aversion to losses for TTO [7]. Furthermore, these tasks are cognitively complex and respondents might have some difficulty with them, particularly those in vulnerable groups such as the very elderly or children. For these reasons, there has been increasing interest in using ordinal tasks that require the respondent to rank one or more states [8–10] and in discrete choice experiments (DCE) involving pairwise comparisons [11–13].

The ability to derive cardinal health state values from ordinal information comes from the assumption that a respondent’s ranking of a set of states will be related to a latent utility variable. Individuals may give higher ranks to states with lower mean values than other states due to variability across individuals or random error. The proportion of occasions on which such reversals are made is related to the distance between mean values of the states in terms of the latent variable. There will be more agreement in rankings when the mean values for two states are further apart. The latent utility values are estimated using probabilistic choice models on ordinal data from multiple respondents.
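The link between reversals and distance on the latent scale can be illustrated with a minimal sketch. It assumes a logistic error distribution, which is only one possible choice, and the function name and latent means are hypothetical.

```python
import math

def reversal_probability(mean_high: float, mean_low: float) -> float:
    """Probability that the state with the lower mean latent value is ranked first.

    Assumes a logistic model on the latent scale:
    P(low ranked over high) = 1 / (1 + exp(d)), with d = mean_high - mean_low >= 0.
    """
    distance = mean_high - mean_low
    return 1.0 / (1.0 + math.exp(distance))

# Hypothetical latent means: reversals become rarer as the states move apart.
for gap in (0.0, 0.5, 1.0, 2.0):
    print(gap, round(reversal_probability(gap, 0.0), 3))
```

With no gap between the means the two orderings are equally likely, and the reversal probability falls towards zero as the gap grows.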

A key problem in using ordinal methods has been how to anchor the values estimated by logistic models onto the full health-dead scale required for generating QALYs, anchoring full health at one and dead at zero. If the preference weights do not produce utility values on the full health-dead scale they cannot be used in economic evaluation using cost per QALY analysis. This paper uses existing anchoring techniques for rank and discrete choice data and presents an alternative anchoring technique for discrete choice data for two feasibility valuation studies; one for an asthma-specific measure and the other for an overactive bladder-specific measure. Preference weights obtained using rank and discrete choice data are compared to TTO results.

The paper begins by presenting an overview of the theory underlying ordinal methods. The methods and results of the valuation studies are presented, including a comparison of results using ranking, DCE and TTO on the same full health-dead scale. The implications of this study for further work are considered in the discussion.

Deriving cardinal values for health states from ordinal information

The idea of obtaining cardinal values from ordinal data first came from the work of Thurstone [14] who proposed the ‘law of comparative judgement’. This was recognised [15] as offering a method for deriving cardinal preferences for health states from rank data and later implemented using the sleep dimension of the Nottingham Health Profile [8] and more recently the EQ-5D classification [16].

Thurstone’s approach has been modified in a number of ways, including the application of a logistic function [17, 18] as a means of modelling the latent utility function from ordinal data. This uses two functions: one describes the probability of ranking one state over another given the utility of each state, and the second relates mean utility for each state to the severity levels for each dimension of the health state. Another important modification in this context is that in modelling a population-level latent utility function from individual rank data, the error is characterised in terms of the deviation of the individuals’ preferences from population preferences. To use rank data, the assumption of independence from irrelevant alternatives (IIA) is required in order to explode the rank data into a series of pairwise choices, where ranking \( A \succ B \succ C \) implies pairwise choices of \( A \succ B \), \( B \succ C \) and \( A \succ C \). This assumes that the ordering of a pair of states does not depend on the other states being considered.
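The explosion of a ranking into pairwise choices can be sketched as follows. The function name is hypothetical, and the sketch assumes a complete ranking without ties.

```python
from itertools import combinations

def explode_ranking(ranking):
    """Return every (preferred, less_preferred) pair implied by a ranking.

    `ranking` lists states from most to least preferred, without ties.
    """
    # combinations() keeps the input order, so the first element of each
    # pair is always the state ranked higher.
    return list(combinations(ranking, 2))

pairs = explode_ranking(["A", "B", "C"])
print(pairs)
```

A ranking of n states yields n(n − 1)/2 implied pairwise choices, so a ranking of ten states gives 45 pairs.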

Recently, conditional logistic regression models were applied to the rank data collected as part of the UK valuation of the EQ-5D [9], SF-6D and HUI2 [10]. The rank model of health states alone does not produce utilities on the full health-dead scale necessary for use in generating QALYs, as it does not enable the anchoring of the values to 0 for dead. For this reason, the values generated by the logit model are transformed onto the full health-dead scale needed to generate QALYs. One method involves normalising the coefficients using the mean TTO value for the worst state defined by the classification system [9]. An alternative approach is to include the state ‘dead’ in the ranking exercise and normalise the regression coefficients so that ‘dead’ achieves a predicted value of zero [10].

DCE is a widely used tool in health economics for eliciting utility values, for example for different health care programmes, but has so far had limited use for eliciting health state values for preference-based measures of health used to derive QALYs. A limited number of studies have used DCE to value health states for their own sake [11–13, 19–21], but only one study has anchored its results onto the full health-dead scale required for generating QALYs. This study used a partial solution by normalising the DCE results using the estimated TTO value for the worst possible state [12]. The studies presented in this paper are the first attempt to undertake a normalisation of DCE results around dead without the use of cardinal values obtained from external sources. Here, we include the state ‘dead’ in the DCE and use this directly estimated parameter to rescale the regression coefficients. We compare the results to those obtained using the alternative approach of normalising using the estimated TTO value for worst state [12].

Methods

The health state classifications

Asthma-specific measure

The AQL-5D is a 5-dimension health state classification system [22] developed from the Asthma Quality of Life Questionnaire, AQLQ [23]. The dimensions of AQL-5D are: concern about asthma, shortness of breath, weather and pollution stimuli, sleep impact and activity limitations (Table 1). The health state classification system has 5 dimensions each with 5 levels of severity, with level 1 denoting no problems and level 5 indicating extreme problems. By selecting one level for each dimension, it is possible to define 3,125 health states.

Table 1 Asthma quality of life classification (AQL-5D)

Overactive bladder-specific measure

The OAB-5D is a 5-dimension health state classification [24] developed from the overactive bladder instrument, OABq [25]. The dimensions of the OAB-5D are: urge, urine loss, sleep, coping and concern (Table 2). The health state classification system has the same structure as the AQL-5D, also defining a total of 3,125 health states.

Table 2 Overactive bladder quality of life classification system (OAB-5D)

The surveys

Two surveys were conducted for each classification system. These surveys were identical in design in every way, apart from using different health state classifications to define the health state descriptions. For each classification system, the surveys consisted of interviews containing ranking and TTO tasks and a follow-up postal survey using discrete choice tasks.

Interview

The interview surveys elicited values for a selection of states (AQL-5D/OAB-5D) from a representative sample of 300 members of the general public each. Adults who consented to participate were interviewed in their own home by an experienced interviewer trained by the authors of this paper. Respondents were asked to complete the health state classification questionnaire for themselves to help familiarise them with it. The first valuation task was to rank 7 intermediate states, full health (health state 11111), worst state defined by the health state classification (‘pits’ state 55555), and immediate death.

The next task was to value the 7 intermediate states and ‘pits’, with an upper anchor of full health using TTO. The survey used the TTO-prop method developed by the York Measurement and Valuation Health Group, which uses a ‘time board’ as a visual aid [26]. Respondents were then asked a series of socio-demographic questions. Finally, they were asked about their willingness to participate in a postal survey (described below).

The selection of health states for the interviews was determined by the specification of the model to be estimated. In this study, 98 health states and the worst state (to be repeated across the design) were selected out of the 3,125 possible health states described by the classification system. The selection was on the basis of a balanced design, which ensured that any dimension level (level λ of dimension δ) had an equal chance of being combined with all levels of the other dimensions. These 98 states were stratified into severity groups based on their total level score across the dimensions (simply the sum of the levels), and then randomly allocated into 14 blocks, so that each block has 7 health states. This procedure ensured that each respondent, who was allocated one of the 14 blocks, received a set of states balanced in terms of severity and that each state is valued the same number of times except the worst possible state, the ‘pits’ state, which is valued by all respondents. Each state is valued by 20 respondents on average and this is comparable with other valuation studies, for example SF-6D states were valued by 15 respondents on average [5].
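The stratified allocation of states to blocks might look like the sketch below. It rests on several assumptions: the 98 states are drawn at random here as a placeholder rather than by the balanced design used in the study, the severity strata are taken to be consecutive runs of 14 states ordered by level sum, and all names are hypothetical.

```python
import itertools
import random

def allocate_blocks(states, n_blocks=14, seed=0):
    """Stratify states by level sum, then deal one state per stratum to each block."""
    rng = random.Random(seed)
    ordered = sorted(states, key=lambda s: sum(int(ch) for ch in s))
    blocks = [[] for _ in range(n_blocks)]
    # Each consecutive run of n_blocks states forms one severity stratum.
    for start in range(0, len(ordered), n_blocks):
        stratum = ordered[start:start + n_blocks]
        rng.shuffle(stratum)
        for block, state in zip(blocks, stratum):
            block.append(state)
    return blocks

# Placeholder selection: 98 random intermediate states, NOT the study's balanced design.
all_states = ["".join(levels) for levels in itertools.product("12345", repeat=5)]
intermediate = [s for s in all_states if s not in ("11111", "55555")]
selected = random.Random(1).sample(intermediate, 98)
blocks = allocate_blocks(selected)
print([len(b) for b in blocks])
```

Under these assumptions each of the 14 blocks receives 7 states, one from each severity stratum, to which the ‘pits’ state would then be added for every respondent.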

Postal surveys

A DCE questionnaire was mailed to interviewees who had consented to the postal survey approximately 4 weeks after the interviews (the ‘warm’ sample). The size of the warm sample depended on how many interviewed respondents were willing to participate in the postal survey. The same questionnaire was mailed out to a separate sample of the general public who had not been interviewed (the ‘cold’ sample). The number of questionnaires mailed out was determined by targeted sample size and expected response rate alongside funding constraints. Respondents were asked to complete the health state classification questionnaire for themselves to help familiarise them with it. Respondents were asked to indicate which state they preferred for an example pair of states and then for 8 pairs of states (see example question in the appendix). Finally, respondents were asked a series of socio-demographic questions. Reminders were sent to all non-responders approximately 4 weeks after the initial questionnaire was sent.

The large number of states defined by the classification systems of each measure means it is infeasible to value all states. States were selected for the postal DCE using an application of a specially developed programme in the statistical package SAS [27], namely the D-efficiency approach. The programme obtains an optimal statistical design for DCE based on level balance, orthogonality, minimal overlap and utility balance. This reduces the number of pairwise comparisons to a manageable number. The programme produced 24 pairwise comparisons from the AQL-5D and OAB-5D, and these were randomly allocated to four versions of the questionnaire with 6 pairwise choices each. Two additional pairwise comparisons were included of two poor health states each compared to ‘immediate death’, and these were common across all versions of the questionnaire. No other states or pairwise comparisons were included in each version of the questionnaire. Only one pairwise comparison involves a logically consistent choice where one state has better health for every dimension.

Modelling health state values

Time trade-off

Time trade-off data was rescaled using the approach used in the UK EQ-5D value set [4] where worse than dead values are bounded at −1. The data from the TTO valuation exercise was analysed using a one-way error components random effects model which takes account of variation both within and between respondents [5]. Estimation is via generalised least squares (GLS). The standard model is defined as:

$$ U_{ij} = \beta({\mathbf{X}}_{\delta \varvec{\lambda} } \varvec{\beta} ) + \varepsilon_{ij} $$
(1)

where i = 1, 2, …, n indexes the health states valued and j = 1, 2, …, m indexes respondents. The dependent variable \( U_{ij} \) is the disvalue (1 − TTO value) for health state i valued by respondent j and \( {\mathbf{X}}_{\delta \lambda } \) is a vector of dummy explanatory variables for each level λ of dimension δ of the health state classification, where level λ = 1 is the baseline for each dimension. \( \varepsilon_{ij} \) is the error term, which is subdivided as \( \varepsilon_{ij} = u_{j} + e_{ij} \), where \( u_{j} \) is the individual random effect and \( e_{ij} \) is the usual random error term for the ith health state valuation of the jth individual. Details of other models run on the TTO data are available elsewhere for both AQL-5D [28] and OAB-5D [29]. The value of the full health state equals 1 and health state values for all other states are estimated as 1 minus the coefficient for each of the appropriate level dummies for each dimension. The ‘gap’ between full health and the next best state is interpreted as the movement away from full health from having a problem on the appropriate dimension.
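The additive scoring rule just described can be illustrated with a minimal sketch. The function name, the data layout and the coefficient values are all hypothetical; decrements are entered as positive disvalues, following the definition of the dependent variable in Eq. 1.

```python
def predict_value(state, decrements):
    """Predicted value = 1 minus the sum of the decrements for the state's levels.

    `state` is a 5-digit string such as "13321"; `decrements[d][level]` is the
    estimated disvalue coefficient for that level of dimension d. Level 1 is the
    baseline and contributes nothing.
    """
    total = 0.0
    for dimension, level_char in enumerate(state):
        level = int(level_char)
        if level > 1:
            total += decrements[dimension][level]
    return 1.0 - total

# Hypothetical coefficients for illustration only; not estimates from this study.
example = [{2: 0.02, 3: 0.04, 4: 0.08, 5: 0.12} for _ in range(5)]
print(predict_value("11111", example))
print(predict_value("55555", example))
```

Full health scores exactly 1 because every dimension sits at its baseline level, and each worse level subtracts its own decrement.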

Ranking

The rank-ordered logit model was used to analyse the ranking data (a modelling approach also referred to as the conditional logit model [30]). It states that respondent j has a latent utility function for state i, \( U_{ij} \), and given the choice of two states i and k, the respondent will choose state i over state k if \( U_{ij} > U_{ik} \).
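For readers unfamiliar with the model, the sketch below computes the log-probability of one observed ranking under the textbook rank-ordered (exploded) logit. It is offered as an illustration of the general technique, not as the exact estimation routine used in this study, and the function name is hypothetical.

```python
import math

def ranking_log_likelihood(ranked_utilities):
    """Log-probability of an observed ranking under the rank-ordered logit.

    `ranked_utilities` holds the systematic utilities of the states in the order
    the respondent ranked them (best first). The ranking is treated as a sequence
    of choices: the top state is chosen from all states, the next from those
    remaining, and so on.
    """
    log_lik = 0.0
    for position in range(len(ranked_utilities) - 1):
        remaining = ranked_utilities[position:]
        log_denominator = math.log(sum(math.exp(u) for u in remaining))
        log_lik += ranked_utilities[position] - log_denominator
    return log_lik

# Hypothetical utilities: a ranking that agrees with them is more likely than its reverse.
print(ranking_log_likelihood([2.0, 1.0, 0.0]))
print(ranking_log_likelihood([0.0, 1.0, 2.0]))
```

Summing this quantity over respondents gives the log-likelihood that would be maximised with respect to the coefficients.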

The expected value of each unobserved utility was assumed to be a linear function of the categorical levels on the dimensions of the health state classification. Following the approach taken elsewhere [9, 10], the general model specification for each individual j’s cardinal utility function for state i is \( U_{ij} = \mu_{i} + \varepsilon_{ij} \) where \( \mu_{i} \) is the systematic component that is representative of the preferences of the population and \( \varepsilon_{ij} \) represents the specific preferences of the individual. The general model specification for analysis of the ranking data is:

$$ U_{ij} = \beta({\mathbf{X}}_{\varvec{\delta} \varvec{\lambda} } \varvec{\beta} + \varphi D) + \varepsilon_{ij} $$
(2)

where U represents utility; j = 1, 2, …, n represents respondents and i = 1, 2, …, m represents health states. The functional form is assumed to be linear. The vector of dummies is as defined for Eq. 1, with the addition of a dummy variable for the state dead. For all health states other than dead D = 0. The rank-ordered logit model produces estimates on an interval scale, yet the origin and units of the interval scale are not on the full health-dead scale [9, 10]. In order to anchor onto the full health-dead scale, the coefficients relating to the levels of each dimension are normalised by dividing each level coefficient by the coefficient relating to dead; \( \beta_{r\lambda \delta } = \beta_{\lambda \delta } / \varphi \) where \( \beta_{r\lambda \delta } \) is the rescaled coefficient for level λ of dimension δ, \( \beta_{\lambda \delta } \) is the coefficient for level λ of dimension δ and \( \varphi \) is the coefficient for dead [9, 10].
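The normalisation by the dead coefficient amounts to a single division per coefficient, as in the sketch below. The function name, the dictionary keys and the latent-scale values are hypothetical and are not estimates from this study.

```python
def rescale_by_dead(level_coefficients, dead_coefficient):
    """Divide each level coefficient by the coefficient on the 'dead' dummy."""
    if dead_coefficient == 0:
        raise ValueError("dead coefficient must be non-zero")
    return {name: beta / dead_coefficient for name, beta in level_coefficients.items()}

# Hypothetical latent-scale estimates, for illustration only.
raw = {"sleep_5": -1.2, "activity_5": -0.9}
phi = -3.0
rescaled = rescale_by_dead(raw, phi)
print(rescaled)
```

In this sketch the rescaled figures are read as decrements from full health, which is an assumption about sign convention. Dead itself then carries a decrement of exactly 1, so its predicted value is 1 − 1 = 0. Because each rescaled coefficient is a ratio of two estimates, any common scale factor on the latent variable cancels.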

Discrete choice experiment

Two alternative approaches are used to obtain estimates onto the full health-dead scale. Method (1) models the DCE data using an existing approach in the literature [12]. This approach analyses the DCE data using a random effects probit model which takes account of the repeated measurement aspect of the data (whereby multiple responses are obtained from the same individual). This model excludes the data from the pairwise comparisons involving ‘dead’. The value of the full health state is constrained to equal 1 and following the approach for all models estimated in this paper health state values for other states are estimated as 1 minus the coefficient for each of the appropriate level dummies for each dimension. The estimated coefficients are normalised onto the full health-dead scale using the estimated TTO value of the worst state. This means that the value of the worst state in the DCE model is anchored at the value of the worst state in the TTO model. Method (2) analyses all data from the DCE surveys including the pairwise comparisons involving dead using a random effects probit model. Again, an additive specification is used as specified by Eq. 2 where dead is included in the model. The coefficients are normalised in the same way as the rank data by dividing each level coefficient by the coefficient relating to dead. Models are also estimated separately for the ‘warm’ sample that was previously interviewed and the ‘cold’ sample that were not.
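Method (1) can be read as applying one multiplicative factor to all latent coefficients so that the worst state lands on the TTO estimate. The sketch below follows that reading, which is an assumption about the implementation. The function name and the latent decrements are hypothetical, while 0.431 is the TTO model prediction for the worst AQL-5D state reported in the Results.

```python
def rescale_to_tto_pits(latent_decrements, tto_pits_value):
    """Rescale latent coefficients so the worst state matches the TTO estimate.

    `latent_decrements[d][level]` are latent-scale decrements (level 1 omitted).
    The worst state takes the highest level on every dimension.
    """
    worst_total = sum(dim[max(dim)] for dim in latent_decrements)
    factor = (1.0 - tto_pits_value) / worst_total
    return [{level: beta * factor for level, beta in dim.items()}
            for dim in latent_decrements]

# Hypothetical latent decrements for five dimensions, for illustration only.
latent = [{2: 0.1, 3: 0.3, 4: 0.6, 5: 1.0} for _ in range(5)]
rescaled_m1 = rescale_to_tto_pits(latent, tto_pits_value=0.431)
worst_value = 1.0 - sum(dim[5] for dim in rescaled_m1)
print(round(worst_value, 3))
```

The relative sizes of the coefficients are unchanged by this rescaling; only the overall range of predicted values is fixed by the external TTO anchor.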

Comparison of models

All models are compared. There is no reason why rank or DCE models should produce the same results as the TTO model, although it could be thought that rank and DCE may produce similar results as the use of the rank-ordered logit model means that the rank data is viewed as a series of pairwise comparisons.

Models can be compared in terms of the sign and ordering of their coefficients. The sign of the coefficients on the levels of each dimension are expected to be negative since they are all worse than the baseline (i.e. level 1). Furthermore, the levels in each dimension have a logical ordering, whereby more severe levels should have larger decrements. The number of inconsistencies between significant coefficients is compared between the models. For interest, we examine the relationship between model predictions and observed TTO values including the mean absolute difference, the root mean square of the difference, the proportions of differences greater than 0.05 and 0.1 and Pearson correlation coefficients. Finally, the pattern of the predictions is compared.
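The agreement statistics listed above can be computed as in the following sketch. The function name, the dictionary keys and the example values are hypothetical, and differences are taken as predicted minus observed.

```python
import math

def comparison_metrics(predicted, observed):
    """Summarise agreement between model predictions and observed mean TTO values."""
    n = len(observed)
    diffs = [p - o for p, o in zip(predicted, observed)]
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return {
        "mean_absolute_difference": sum(abs(d) for d in diffs) / n,
        "root_mean_square_difference": math.sqrt(sum(d * d for d in diffs) / n),
        "share_abs_diff_gt_0.05": sum(abs(d) > 0.05 for d in diffs) / n,
        "share_abs_diff_gt_0.10": sum(abs(d) > 0.10 for d in diffs) / n,
        "pearson_r": cov / math.sqrt(var_p * var_o),
    }

# Hypothetical values for four states, for illustration only.
observed_tto = [0.40, 0.60, 0.80, 0.90]
model_predictions = [0.45, 0.58, 0.86, 0.90]
print(comparison_metrics(model_predictions, observed_tto))
```

A high correlation alone does not imply close agreement, since predictions can be systematically shifted or stretched relative to observed values, which is why the difference-based measures are reported alongside it.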

Results

Respondents

Three hundred and seven members of the public (response rate of 40%) in South Yorkshire (UK) were interviewed in the AQL-5D survey and 311 people were interviewed in the OAB-5D survey (response rate of 26.7%). Table 3 shows that the two samples were very similar in terms of their socio-demographic composition. Amongst the respondents to the AQL-5D survey, 53 (17.3%) had asthma and in the OAB-5D survey 27 (8.7%) reported experiencing symptoms of urge and 18 (5.8%) reported urine loss for at least some of the time. Overall self-reported health status using EQ-5D [4] was very close to the UK EQ-5D norms of 0.85 for females and 0.86 for males [31]. Two hundred and sixty three people responded to the AQL-5D postal survey and 402 people responded to the OAB-5D postal survey. Table 3 shows that the socio-demographic composition of the postal samples is similar to that of the interview samples, but the OAB-5D postal survey has a larger proportion of respondents over 65 years of age and a higher proportion of females. Overall, the AQL-5D samples have lower mean EQ-5D scores.

Table 3 Characteristics of respondents in valuation surveys

The data set

AQL-5D

There were 2,455 TTO health state valuations generated by the 307 respondents from the interviews and 3,041 rankings from the respondents at their interview. The average number of TTO valuations per intermediate health state was 22 (range from 19 to 22) and the worst possible state (AQL-5D state 55555) was valued by every respondent (n = 307). Mean TTO health state values ranged from 0.39 to 0.94 and generally have fairly large standard deviations (around 0.2–0.4). The distribution of the values was negatively skewed.

There were 168 DCE questionnaires returned out of the 308 who had been interviewed (55%) generating 1,336 observed pairwise comparisons. In total, 95 DCE questionnaires were returned in the cold survey (a 23% return rate) generating 741 pairwise comparisons.

OAB-5D

There were 2,487 health state values generated by the 311 respondents and 3,040 rankings. Each intermediate health state was valued 22 times on average using TTO (range from 17 to 29) and the worst possible state (OAB-5D 55555) was valued 310 times using TTO (one missing value). Mean TTO health state values ranged from 0.56 for the worst possible state, to 0.91 for state 13321, with an average standard deviation of 0.28.

The warm survey had 133 returned DCE questionnaires (response rate 44%) generating 1,050 pairwise comparisons. The cold survey resulted in 268 being returned (response rate 27%) generating 2,059 comparisons.

Modelling

AQL-5D

The TTO model and transformed rank and DCE models are presented in Table 4. The TTO model produced the expected negative coefficients for all statistically significant coefficients and the ordering of coefficients was consistent with the dimension levels of the AQL-5D. Three coefficients were positive but statistically insignificant. The rank model produced all negative coefficients and no inconsistencies for all significant coefficients. In comparison to the TTO and rank models, the DCE models have a higher number of positive coefficients and inconsistencies. The DCE results using method (1) that normalises coefficients using the estimated TTO value for the worst state has four positive coefficients, one of which is statistically significant, and one inconsistency between significant coefficients. The DCE results using method (2) for the pooled data (i.e. warm plus cold) produced three positive coefficients, one of which is statistically significant, and one inconsistency between significant coefficients. The warm DCE model produced five positive coefficients, none of which were statistically significant, and one inconsistency amongst statistically significant coefficients. The cold model had one positive coefficient that was not statistically significant and no inconsistencies between significant coefficients. The weather dimension seemed to cause most difficulty for the DCE models, with a suggestion that the levels of this dimension do not conform to the suggested ordering.

Table 4 TTO and normalised rank and DCE model estimates for AQL-5D

The size of the dimension level coefficients of the rank and TTO models are quite similar and follow an orderly pattern against the levels of the AQL-5D. The DCE model for the pooled data set reveals some marked differences. The most noticeable differences lie at the lower end of concern, short of breath, pollution and the upper ends of sleep and activity. Level 2 for the dimensions of concern, breath and pollution are all positive and in the wrong direction, quite markedly so for pollution. Sleep and activity have coefficients with the right sign, but they are much larger for levels 4 and 5.

The similarity of the TTO, rank and DCE models can be seen in the plot of predicted health state values against observed mean TTO values in Fig. 1. Mean absolute differences from observed TTO are 0.056 and 0.061 for the TTO and rank models, respectively, with mean differences of around zero. By contrast, the DCE predictions follow different paths depending on the normalisation method used. The DCE model that normalises coefficients using method (1) tended to have health state predicted values that were higher than observed TTO whereas the DCE model that rescaled coefficients using method (2) tended to have health state values lower than observed TTO values. The results from the DCE model using method (1) where coefficients are normalised using the estimated TTO value for worst state are more similar to the TTO model estimates, as expected due to the method of normalisation. Pearson correlation coefficients between observed TTO values and predicted values for each model are consistently high and similar across models. Differences are observed between the mean values for the worst AQL-5D health state of 0.390 for observed TTO and predictions of 0.431 for TTO and DCE method (1), 0.434 for rank data and 0.154 for predictions from pooled DCE data normalised using method (2).

Fig. 1 Predictions of TTO, Rank and DCE models for AQL-5D in comparison to observed mean TTO

OAB-5D

The OAB-5D results are presented in Table 5. Overall, the models were broadly consistent with the ordinality of the OAB-5D. All the coefficients in the TTO model were negative and most were significant. There were inconsistencies between significant coefficients in 3 cases, but their magnitudes were 0.02 or less. The ranking data produced negative coefficients and all but one were statistically significant with no inconsistencies between significant coefficients. The DCE results for method (1) had no inconsistencies between significant coefficients and had four positive coefficients, none of which were significant. All DCE models normalised using method (2) have five positive coefficients, one of which is statistically significant (coping level 2) and two inconsistencies amongst the significant coefficients.

Table 5 TTO and normalised rank and DCE model estimates for OAB-5D

The OAB-5D TTO model does not predict observed TTO as well as for the AQL-5D as indicated by mean absolute deviation (MAD) and mean error in Tables 4 and 5. Ranking predictions also do not agree with TTO as closely as for the AQL-5D survey and tended to have predicted health state values that are higher than observed TTO values. As for the AQL-5D survey, the DCE predictions normalised using method (2) have a larger scale range (0.249–1.00 compared to 0.623–1.0 for TTO and DCE method (1) and 0.436–1.0 for ranking). Again, the DCE models have different results depending on the method of normalisation. Again, the model estimated using method (2) to rescale coefficients tended to have predicted health state values lower than observed TTO, whereas the model estimated using method (1) tended to have predicted health state values higher than observed TTO, as shown in Fig. 2. Pearson correlation coefficients between observed TTO values and predicted values for each model are high and similar across models, but are all lower than the equivalent values for the AQL-5D survey.

Fig. 2 Predictions of TTO, Rank and DCE models for OAB-5D in comparison to observed mean TTO

Discussion

This study has compared health state utility values on the full health-dead scale required to generate QALYs that have been derived using DCE, rank and TTO data. As would be expected, the TTO model best predicted TTO observed values, but then there is no reason to expect rank and DCE data to produce the same values. Perhaps more surprising is the way the rank model coefficients were actually very similar to TTO coefficients in the AQL-5D survey, but less so in the OAB-5D survey. In both surveys, the DCE model was the most different from the other methods, and the model normalising coefficients using the dead coefficient (method (2)) produced a larger range of values.

In modelling, rank data are essentially treated as a series of pairwise comparisons, and aside from the IIA assumptions, are otherwise the same. It is therefore interesting to find that they do not produce the same values. This may suggest that the rank and DCE tasks generate different data, which would have implications for the IIA assumption used in rank data. However, this may be simply due to differences in the study design and number and composition of health states valued. It may also reflect the fact that the ranking task preceded the TTO in the same interview, whereas the DCE data were collected via a postal survey on a different sample (although the warm DCE sample is composed of willing respondents from the interview).

The DCE models based on the warm and cold samples seem to have similar coefficients and so were pooled to focus on the main comparisons with TTO and rank results and the existing approach (method (1)) used to anchor values onto the full health to dead scale [12]. Yet, the pooled data should be treated with some caution as further analysis did find some difference between the samples. However, the sample sizes are small for the ‘cold’ and ‘warm’ samples, particularly for the cold AQL-5D sample. These results suggest the cold sample gave slightly lower values than the sample that had previously been interviewed, though this difference is not sufficiently large to alter the main findings comparing the different valuation methods. The similarity of warm and cold results suggests that it may be possible to obtain DCE data to value health states without prior interview. This would be considerably cheaper, but postal surveys are usually associated with lower response rates and this was true for the AQL-5D survey. For researchers seeking to use DCE without other methods, it may still be preferable to approach respondents directly in their own home to ensure a more representative sample.

The pooled DCE models using different methods to rescale onto the full health-dead scale produce noticeably different coefficients and different ranges of predicted values. As expected the model normalising coefficients using the estimated TTO value of worst state (method (1)) is more similar to observed TTO values and the TTO model. Overall, the results suggest that DCE and TTO produce different results, and the use of TTO data to rescale DCE coefficients rather than using data collected using a DCE alone produces different results. This should be recognised in the future design of DCE surveys to obtain health state values.

The method used here to rescale worse-than-dead TTO values has been raised as a concern in the literature (see for example [32]) as negative values are bounded at −1 and may be interpreted as being measured on a different scale to TTO values that are better-than-dead. This may be considered as a limitation to the TTO model and to the model normalising DCE coefficients using the estimated TTO value of worst state (method (1)). However, this concern is likely to be of lower importance for these measures where only a small proportion of TTO responses are worse-than-dead, as 4% of TTO observations are worse-than-dead for the AQL-5D and only 2% of TTO observations are worse-than-dead for the OAB-5D.

The DCEs were feasibility studies added to valuation studies designed to provide TTO valuations of the AQL-5D and OAB-5D that are recommended by agencies such as NICE [2]. Using a postal method for DCE, for example, may have compromised the quality of the data and it certainly resulted in a lower response rate. The state selection used here did not ensure that implausible states were not chosen, but selected health states were checked to ensure they were plausible as implausible states may lead to an increase in the random variability in responses. Perhaps more importantly, the recommended approach for state selection and design for DCE experiments continually evolves [33], and our study may have benefited from recent improvements in DCE design.

There are concerns with the types of models estimated here since they make restrictive distributional assumptions about the coefficients. Of particular concern is that some orderings are logically determined. For example, suppose there is a health state pair, j and k, with \( \mu_{j} - \mu_{k} = X \), say 0.2, on the latent variable scale standardised to 1 for full health and 0 for dead. The current approach to modelling ordinal data assumes that any two states that are apart from each other by X will have the same proportion of respondents incorrectly ranking j over k. However, it is reasonable to assume that the probability of error will not only be a function of how far apart the two states are, but also whether or not the two states have a logically determined ordering. Suppose there are two sets of health state pairs that are apart by X, where pair 1 has no logically determined ordering (e.g. 11122 and 33111) whereas pair 2 has a logically determined ordering (e.g. 11122 and 11133). It is reasonable to expect that the proportion of responses that rank j over k will be different across pair 1 and pair 2. This becomes particularly problematic when one of the states is full health or the worst state. This means that the structure of the error term in Eq. 2 needs to be more sophisticated than it currently is. There is also a concern that the estimated parameters in the DCE model are confounded with an unknown scaling factor which is inversely related to the variance of the error term [34, 35]. However, this should not lead to biased coefficients. There are now more advanced econometric modelling techniques known as mixed logit models [36] that should be explored in future research. This would also overcome the IIA assumption underlying the way rank data are being analysed.

This paper presents a new way of anchoring health state values derived from discrete choice data on the full health-dead scale required for QALY estimation. Dead is included as a state in the pairwise choices and subsequently used to anchor the values generated by the logistic models. Another way to achieve this anchoring would be to include survival as a separate attribute [37]. However, this requires a far larger and more complex design, since survival has a multiplicative relationship to health related quality of life in the QALY model. This has been achieved through an online panel sample for EQ-5D [37]. One disadvantage with including dead as a state and using this to normalise coefficients arises from the fact that many respondents may not regard any state defined by the classification as worse than being dead and so effectively not be willing to trade [38]. This is likely to be more of a problem for milder descriptive systems. For these studies, a sufficient proportion of respondents were willing to make a trade, so that at the aggregate level it has been possible to estimate a societal value for the state of being dead compared to the health states defined by the health state classification.

This study proposes a new method for generating health state utility values on the QALY scale using discrete choice data. The results of feasibility studies on two condition-specific measures suggest that the relationship between health state utility values derived using TTO, rank and DCE data differs across different health state classification systems and potentially different medical conditions. Whilst ordinal methods may offer a promising alternative to conventional cardinal methods of SG and TTO, there is a large and important research agenda to address.