Introduction

Quality-adjusted life-years (QALYs) incorporate information on both morbidity and mortality, and are the preferred outcome in economic evaluations for both the National Institute for Health and Care Excellence (NICE) in the UK and the Canadian Agency for Drugs and Technologies in Health [1, 2]. Health utilities are the quality weights used to calculate QALYs, and are typically measured using standardised questionnaires, of which the EQ-5D is among the most widely used [3]. The EQ-5D-3L questionnaire contains questions on five dimensions, namely mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. For each of these dimensions, the respondent is asked to indicate their level of difficulty, with response options of “no problems”, “some problems”, or “extreme problems”. EQ-5D-3L responses must be converted to health utilities via a scoring algorithm. The scoring algorithm is developed in a valuation study, in which respondents from the general population provide preferences for a subset of health states and regression models are used to describe the mean utilities as a function of the health states.

The first scoring algorithm was developed in the UK by Dolan [4] in the Measurement and Valuation of Health (MVH) study. Since then, different scoring algorithms have been developed for different countries. Many of these studies have reported significant differences between their own algorithm and the original UK algorithm [5–7]. They differ not just in the values of the regression coefficients, but also in the independent variables included in the regression model (hereafter referred to as the model specification). Given that health preferences may depend on respondent age, gender, income and self-reported health status [8], and that these population characteristics vary by country, it is not surprising that regression coefficients vary by country. It is less clear that this would also lead to variation in the model specification. The MVH study used an N3 term [4], whilst the USA study used a D1 model [9], and the South Korean algorithm used a main effects model with a log transformation [10]. It is not known how much of this variation in model specification is due to genuine differences in health preferences amongst countries, as opposed to differences in health state selection, the relative frequencies with which states were valued, or the model diagnostics used to select the preferred model.
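
For concreteness, the interaction terms mentioned above can be written as simple functions of a health state. The sketch below (Python, in our own notation rather than any study's code) encodes N3 as an indicator that at least one dimension is at its worst level, and D1 as the number of dimensions beyond the first that depart from full health; both definitions follow the published descriptions as we understand them.

```python
# Hypothetical sketch of the N3 and D1 terms, as we understand the
# published definitions; states are five-character strings such as "21131".
def n3(state: str) -> int:
    """1 if any of the five dimensions is at level 3, else 0."""
    return int("3" in state)

def d1(state: str) -> int:
    """Number of dimensions at level 2 or 3 beyond the first such dimension."""
    moves = sum(level != "1" for level in state)
    return max(moves - 1, 0)

assert n3("11111") == 0 and n3("11131") == 1
assert d1("11111") == 0 and d1("21231") == 2
```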

Whilst some studies used the MVH protocol [4], many used variants. Notably, the subset of states included and the relative frequencies with which states were valued vary among studies. For example, studies in the UK, Spain, the USA, South Korea and Chile all used 42 health states [4, 5, 9–11], whilst studies in Japan and the Netherlands used a modified protocol [12] assessing 17 health states. Simulation evidence from Lamers [13] was used to justify the reduction in the total number of health states valued in the Dutch valuation study; the simulation was based on MVH data and thus assumed that the model specification in the Netherlands would be similar to that in the UK. A further simulation study by Chuang and Kind [14], based on a main effects model, suggests that when there are restrictions on study size it is preferable to include fewer health states, down to a minimum of 31. A more recent simulation study by Viney [15] suggests that studies incorporating more health states are more likely to detect interactions, and this was used to justify the use of 198 health states in the Australian valuation study. There are thus conflicting recommendations in the literature on the number of health states that should be valued.

Furthermore, the relative frequencies with which states were valued vary; in any given valuation study it is common to find some states that were valued more often than others. For example, the state 33333 was valued roughly four times as often as the state 12111 in Denmark [16], compared to ten times as often in Canada [17], three times as often in the USA [9], twice as often in Poland [18] and roughly the same number of times in Japan [7]. Further, studies have differed in their model diagnostics. For example, some use the mean absolute error (MAE), whilst others use an R² or an adjusted R². Recent comparisons between countries have not been able to disentangle genuine differences in cultural preferences from differences in methodology [19].

The aim of this study was to assess the extent to which health state selection, frequency of health states valued, and model diagnostics have contributed to between-country heterogeneity in model specification among EQ-5D-3L scoring algorithms.

Methods

Systematic review

This study is based on a recent systematic review by Xie et al. [20]. The inclusion criteria for the original review were that studies should have (1) used elicitation techniques to obtain preferences for at least a subset of the EQ-5D-3L health states, and (2) explicitly indicated the preferred scoring algorithm to predict utilities for all EQ-5D-3L health states. For the present review, we included only those studies that used time trade-off (TTO) elicitation techniques and that reported mean observed utilities for each state that was valued. For each state included in each valuation study, data were extracted on the mean utility assigned to that state, the corresponding standard deviation, and the number of respondents valuing that state (see Supplementary Table 1). The final scoring algorithm was recorded.

Statistical analysis

Each of the final country-specific model specifications was re-fitted, both for the country from whose data it was originally derived and for all other countries. We had access to aggregate data only; that is, for each country we had only the mean observed utility for each valued health state. This is, however, sufficient to obtain unbiased estimates of regression coefficients using ordinary least squares (OLS). We regressed the observed mean utilities for the health states onto characteristics of the health states. We began by restricting attention to those studies that included the set of 17 health states indicated by the modified MVH protocol [12]. Models were fitted using data from just these 17 health states. Thus, in this analysis all included countries used the same set of health states and each state received equal weight (using OLS on the aggregate data achieves equal weighting for each health state). The mean absolute error (MAE, i.e. the mean, across health states, of the absolute difference between observed and predicted mean utilities), mean squared error (MSE) and rho (i.e. the correlation between observed and predicted utilities) for the country-specific model were compared to those for the other countries’ model specifications. This was done both with and without cross-validation. To implement cross-validation, each of the 17 health states was omitted from the model in turn, and the fit from the remaining 16 health states was used to calculate the diagnostics for the omitted health state. The diagnostics thus represent out-of-sample prediction errors; we refer to this procedure as “leave-a-state-out cross-validation”.
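
As an illustration of this procedure, the sketch below (hypothetical code, not the authors') fits a main-effects specification to aggregate mean utilities by OLS and computes the leave-a-state-out cross-validated MAE, MSE and rho; other specifications would simply add columns (for example the n3 term sketched earlier) to the design matrix.

```python
# Illustrative sketch only: OLS on aggregate mean utilities with
# leave-a-state-out cross-validation. State strings and mean utilities
# would come from the extracted valuation-study data.
import numpy as np

def design_matrix(states):
    """Main-effects design: an intercept plus, for each of the five
    EQ-5D-3L dimensions, indicators for levels 2 and 3."""
    cols = [np.ones(len(states))]
    for dim in range(5):
        levels = np.array([int(s[dim]) for s in states])
        cols.append((levels == 2).astype(float))
        cols.append((levels == 3).astype(float))
    return np.column_stack(cols)

def loso_diagnostics(states, mean_utils):
    """Omit each state in turn, fit OLS to the rest, predict the omitted
    state, and return the out-of-sample MAE, MSE and rho."""
    X = design_matrix(states)
    y = np.asarray(mean_utils, dtype=float)
    preds = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        preds[i] = X[i] @ beta
    resid = y - preds
    return {"MAE": float(np.mean(np.abs(resid))),
            "MSE": float(np.mean(resid ** 2)),
            "rho": float(np.corrcoef(y, preds)[0, 1])}
```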

This analysis was repeated using all included countries and all valued health states, again using each reported model specification for each country. This allowed us to assess whether preference for the country’s own model specification changes when states beyond the set of 17 are included. Finally, to assess whether preference for the country’s own model changes when some states are represented more often than others, the analysis using all health states was repeated using weighted least squares (WLS) rather than OLS, with weights proportional to the number of times each state was valued. This reflects the weighting that would be given to each state in the original respondent-level analysis. Indeed, WLS on the aggregate data with weights equal to the number of respondents valuing each state is identical to OLS on the individual-level data.
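
The weighted analysis can be sketched in the same hypothetical framework: scaling each state's row by the square root of its weight (the number of valuations of that state) turns the WLS problem into an ordinary least squares problem.

```python
# Sketch of the weighted fit: WLS on state means, with weights equal to
# the number of valuations per state, implemented via row-rescaled OLS.
import numpy as np

def wls_fit(X, y, n_valuations):
    """Return WLS coefficients for aggregate means y with design X."""
    w = np.sqrt(np.asarray(n_valuations, dtype=float))
    beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return beta
```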

We also computed adjusted R² statistics for models fitted with WLS, using both the restricted set of 17 health states and all valued health states. Since the conceptual basis for the adjusted R² assumes that the model has an intercept, this was only done for those models that included an intercept.
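
For reference, we use the standard definition of the adjusted R² (stated here for completeness; it is not reproduced from the source studies):

$$\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1},$$

where n is the number of health states used in the fit and p is the number of predictors excluding the intercept. The derivation replaces the total and residual sums of squares by their degrees-of-freedom-adjusted counterparts, which is why the statistic presupposes a model with an intercept.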

Finally, we checked all fitted models for logical consistency. We identified all pairs of health states in which one state dominated the other, and compared the model’s predicted mean utility for the dominant health state to that for the dominated health state. A model was deemed to yield a logically inconsistent value set if there was at least one pair in which the predicted utility for the dominant health state was lower than that for the dominated health state.
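
The consistency check is straightforward to state in code. In the hypothetical sketch below, a state dominates another if it is at least as good on every dimension and strictly better on at least one (lower levels being better); a value set is flagged as inconsistent if any dominant state is predicted a lower utility than a state it dominates.

```python
# Sketch of the logical-consistency check over all 243 EQ-5D-3L states.
from itertools import product

def all_states():
    """All 243 EQ-5D-3L states as five-character strings."""
    return ["".join(map(str, lv)) for lv in product([1, 2, 3], repeat=5)]

def dominates(a, b):
    """a dominates b if a is at least as good (level <=) on every
    dimension and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def is_logically_consistent(predict):
    """predict maps a state string such as '21131' to a predicted utility."""
    states = all_states()
    u = {s: predict(s) for s in states}
    return not any(dominates(a, b) and u[a] < u[b]
                   for a in states for b in states)
```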

Results

This review contains data from 13 countries [4, 5, 7, 9, 11, 13, 16–18, 21–24], see Table 1. Data from Kind [25] were excluded because they came from a subgroup analysis (England and Wales vs Scotland) of the data from Dolan [4]. Similarly, results from Shaw et al. [26] were excluded as these used the same data as the original study [9] but with median regression in place of mean regression, and results from Zarate [27] were excluded as these split the Shaw [9] results into subgroups (Hispanics vs others). Since the Taiwanese study of Chang [28] used a minimal set of 13 health states and was designed to fit the main effects model only, it was also omitted from the analysis. The South Korean study [10] had to be excluded as it used a log transformation on the individual-level utilities, which we were not able to replicate using aggregate data.

Table 1 Studies included in the analysis, with their own functional forms

The 13 countries used a total of seven model specifications (Table 1). Five countries used main effects models [7, 16–18, 24], and five countries included an N3 term, with Germany using a reduced N3 model that omitted some of the main effects, and France omitting the intercept. Three countries used models that included interaction terms other than the N3 term [9, 11, 21].

Model preference using cross-validation

The choice of preferred model was similar across the MAE, MSE and rho, with model preference being the same for all three diagnostics in nine of the 13 countries (see Table 2). The MAE and MSE had identical preferred models in all but four cases, the exceptions being Canada and the USA when using all health states and fitting with OLS, and Poland and the USA when using all health states and fitting with WLS. Notable differences in preference between rho and the MAE or MSE diagnostics were a shift from the German model in the Netherlands when using the MAE or MSE to the French model when using rho, and a shift in the USA away from the German and Chilean models when using the MAE or MSE to the Argentinian model when using rho. Given the similarities in model preference between diagnostics, in what follows we use the MAE in assessing the impact of differing health states and weightings on model preference.

Table 2 Preferred model using mean absolute error (MAE) and leave-a-state-out cross-validation, mean squared error (MSE) and leave-a-state-out cross-validation, and adjusted R² with no cross-validation

There were ten studies that reported mean utilities for the set of 17 health states [4, 5, 7, 9, 13, 17, 18, 21–23]. Table 2 gives the preferred models, whilst Table 3 gives the MAE for each model considered. On re-fitting both the country’s own model specification and the other specifications using the 17 states, the country’s own model was preferred in four countries (Chile, France, Japan and Poland). Five countries (Argentina, Canada, the Netherlands, the UK and the USA) showed a preference for the German reduced N3 model, whilst Spain preferred the Chilean model. Thus, among the ten countries including the set of 17 health states, heterogeneity in model specification was reduced from six specifications to four when a common set of health states was used.

Table 3 Mean absolute errors, calculated for each functional form by computing the absolute difference between the observed mean and the model prediction for each health state, then taking the mean across health states

When all valued health states were used, we were able to use a further three studies [16, 23, 24]. On re-fitting the country’s own model and the other models, there was a preference for the country’s own model in four of the 13 countries (Chile, Germany, Japan, and Poland), see Tables 2 and 3. The Chilean model was preferred in three countries besides Chile (Denmark, the UK and the USA), the Argentinian model was preferred in three countries (France, Spain and Zimbabwe), the N3 model was preferred in one country (Canada), and the German reduced N3 model was preferred in two countries besides Germany (Argentina and the Netherlands). There were five preferred model specifications across the 13 studies.

When all health states were used and the models were fitted using weighted least squares, we had to exclude four studies [5, 16, 21, 24] as they did not report the number of valuations per health state (the weight used in the weighted least squares models). Three of the nine remaining countries favoured their own models (Chile, Germany and Japan). The Netherlands continued to favour the German reduced N3 specification, whilst France favoured the Argentinian specification. There was a very marginal preference for the D1 model in Poland (see Tables 2, 3). Canada, the UK and the USA favoured the Chilean model. The preferred model specifications were similar to those using all health states and ordinary least squares, indicating that the number of valuations per health state had little impact on heterogeneity in model specification.

In several countries there were substantial improvements in fit when an alternative specification was used in place of the country’s own model. For example, in Argentina the proportion of health states with predictions differing from observed values by more than 0.05 decreased from 77 % using the Argentinian specification to 68 % using the German specification (with all health states and OLS), see Table 4. For the Netherlands, 76 % of health states had predicted values differing from the observed means by more than 0.05 using the N3 specification adopted for the Dutch value set, compared to 65 % using the German specification.

Table 4 Percentages of health states with out-of-sample predictions differing from observed means by more than 0.1 and 0.05

Model preference without cross-validation

In the absence of any cross-validation, there was an overwhelming preference for the Argentinian, Chilean and N3 models. When there was no penalty for the number of parameters in the model, the Argentinian model was preferred regardless of whether the MAE, MSE or rho was used, regardless of whether all health states or just 17 health states were used, and regardless of whether OLS or WLS was used; the only exception was that the Chilean model was preferred in Chile when the MAE was used on all health states with OLS.

When the number of parameters in the model was accounted for using the adjusted R², countries that did not report the number of valuations per health state had to be excluded [5, 16, 21, 24] and, moreover, models that omitted an intercept (the French, D1 and Argentinian specifications) could not be considered. Here the preference was for the Chilean and N3 models, with the N3 model more often preferred when using just 17 health states.

Logical consistency

As can be seen from Table 3, all model specifications led to logically inconsistent results for at least one country. The Argentinian model yielded logically inconsistent value sets in the majority of cases, the exceptions being Chile and Poland when all health states were used. The Chilean model also yielded logically inconsistent value sets in a number of cases, particularly when just 17 health states were used to fit the model. It was common for the Argentinian or Chilean specification to be preferred on the MAE, MSE or rho yet to yield a logically inconsistent value set. As can be seen from Table 2, restricting attention to those specifications that yield logically consistent value sets does not change the finding that the choice of model diagnostic (MAE, MSE, rho) and the weighting have little impact on model choice. This also held when the model diagnostics were calculated without cross-validation.

Discussion

This analysis has investigated the impact of health state selection, frequency of health states valued, and model diagnostics on heterogeneity in model specification amongst the existing EQ-5D-3L algorithms. In terms of model diagnostics, there was little difference in model preference between the MAE and MSE. However, use of the adjusted R² altered model preference. When cross-validation was adopted, the preferred model specification changed when moving from a common set of 17 health states to using all valued health states. Thus, health state selection has an impact on the preferred model specification. The relative frequencies with which states were valued contributed little to the heterogeneity in model specification.

Use of leave-a-state-out cross-validation resulted in heterogeneity in model specification, whereas omitting cross-validation led to homogeneity in specification. The reduction in heterogeneity on omitting cross-validation should not be taken as an endorsement of the practice, however. There is a strong conceptual argument in favour of cross-validation: whilst the aim of a valuation study is to estimate utilities for each of the 243 health states described by the EQ-5D-3L, valuation studies have typically included at most 42 health states (fewer than 20 % of the total). Thus, the model is used predominantly to make out-of-state predictions, and the accuracy of these predictions should be assessed. Furthermore, in the absence of cross-validation, one should expect a model with more parameters to yield better MAEs, MSEs and rhos, even if the additional parameters do not reflect any genuine patterns in the data. This is a plausible explanation for the preference for the Argentinian model (consisting of 17 parameters) over all other models when using MAEs, MSEs and rhos without cross-validation. Use of the adjusted R², which does include a penalty for the number of parameters, resulted in preferences for either the N3 or the Chilean model. Part of the relative homogeneity on using the adjusted R² may be due to the fact that the adjusted R² could not be computed for four of the seven models.
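
The point about parameter counts can be illustrated with a toy simulation (all data and numbers below are simulated, not drawn from the valuation studies): adding pure-noise predictors can only improve the in-sample MAE, whereas the leave-one-out cross-validated MAE typically deteriorates.

```python
# Toy demonstration with simulated data: extra noise predictors improve
# the in-sample MAE but worsen the leave-one-out cross-validated MAE.
import numpy as np

def fit_predict(X, y, omit=None):
    """OLS fit on all rows except `omit`; return predictions for all rows."""
    keep = np.ones(len(y), dtype=bool)
    if omit is not None:
        keep[omit] = False
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return X @ beta

rng = np.random.default_rng(1)
n = 42                                   # states valued, as in the MVH design
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X0 @ np.array([0.8, -0.2, -0.1, -0.05]) + rng.normal(0, 0.05, n)

for extra in (0, 10):                    # 0 vs 10 spurious predictors
    X = np.column_stack([X0, rng.normal(size=(n, extra))]) if extra else X0
    mae_in = np.mean(np.abs(y - fit_predict(X, y)))
    mae_cv = np.mean([abs(y[i] - fit_predict(X, y, omit=i)[i])
                      for i in range(n)])
    print(f"{extra:2d} extra predictors: in-sample MAE {mae_in:.4f}, "
          f"CV MAE {mae_cv:.4f}")
```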

When just 17 health states were used, the preferred model was most often the German reduced N3 or a main effects model, whilst when more health states were used, preference often switched to the more complex Argentinian or Chilean models. This is to be expected theoretically, and was demonstrated empirically by Viney [15]; the basic intuition is that interaction terms are easier to detect when more health states are represented. The 17-state protocol was introduced because of perceived redundancies in the original 42 states, coupled with a desire to simplify the conduct of a valuation study [12]. This decision was, however, based on UK data. Subsequent simulation evidence from Lamers et al. [13] also used UK data when selecting health states, and in particular calculated mean absolute errors averaged over a mixture of within-sample and out-of-sample health states, considering only the N3 model.

An important qualification on the preference for the Argentinian and Chilean specifications is that in many cases they yielded logically inconsistent value sets despite having the lowest MAE or MSE. Indeed, the published value set for Argentina is itself logically inconsistent for a number of pairs of health states (see, for example, the pairs (33111, 33112) and (33321, 33322) in [21]). If a logically consistent value set is desired, we suggest that logical consistency be checked when using these specifications, especially in a valuation study that includes a limited number of health states.

The main limitation of this analysis was the use of aggregated data rather than respondent-level data. This meant that we were not able to test differences in regression diagnostics between algorithms for statistical significance within countries, as the aggregated data meant that it was not possible to account for the correlation that arises from each respondent valuing multiple health states. Furthermore, the use of aggregated data meant that it was not possible to consider fitting models using random effects. While we were able to consider cross-validation omitting health states, we were not able to consider cross-validation omitting respondents. For large valuation studies, however, it is more important to consider omitting states than it is to omit respondents. Cross-validation omitting respondents is typically implemented in such a way as to estimate the prediction errors at the individual level; cross-validation omitting health states, as done here, estimates the error in mean utility at the population level.

The use of aggregate data is not as severe a limitation as one might think. It is straightforward to show that OLS and WLS models provide unbiased estimates of regression coefficients even in the presence of within-subject correlation. Moreover, WLS on the aggregate data with weights equal to the number of respondents valuing each state is identical to OLS on the individual-level data. We used both OLS and WLS in our analysis as doing so enabled us to isolate the effects of health state selection and frequency of valuation on heterogeneity in model specification. For example, since OLS on the aggregate data weights all health states equally, comparing OLS using all health states to WLS using all health states allowed us to examine the effect of varying frequency of health state valuation.
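
The equivalence invoked here is easy to verify numerically. The simulation below (hypothetical data) checks that WLS on state-level means, weighted by the number of respondents valuing each state, recovers exactly the OLS coefficients obtained from the corresponding respondent-level data.

```python
# Numerical check with simulated data: WLS on aggregate state means, with
# weights equal to the number of valuations per state, reproduces OLS on
# the respondent-level data exactly.
import numpy as np

rng = np.random.default_rng(2)
n_states, p = 17, 4
counts = rng.integers(20, 300, size=n_states)       # valuations per state
X_state = np.column_stack([np.ones(n_states),
                           rng.normal(size=(n_states, p - 1))])

# Respondent-level design: each state's row repeated once per valuation
groups = np.repeat(np.arange(n_states), counts)
X_ind = X_state[groups]
y_ind = X_ind @ rng.normal(size=p) + rng.normal(0, 0.3, size=counts.sum())

# OLS on the individual-level data vs WLS on the state-level means
beta_ind, *_ = np.linalg.lstsq(X_ind, y_ind, rcond=None)
y_mean = np.array([y_ind[groups == s].mean() for s in range(n_states)])
w = np.sqrt(counts.astype(float))
beta_agg, *_ = np.linalg.lstsq(X_state * w[:, None], y_mean * w, rcond=None)
assert np.allclose(beta_ind, beta_agg)
```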

A further limitation was the exclusion of several studies such as the Australian valuation study [15] due to unavailability of key data (for example, mean and standard error of the TTO utilities attached to each valued health state). Whilst this limits the number of studies available for analysis, it is unlikely to have introduced bias. Finally, there are other differences in protocol, for example the population sampled, the strictness of the exclusion criteria and the method of transformation for states considered worse than dead. Only Shaw [9] used a transformation other than the original Dolan transformation for states valued as worse than dead, and so it is unlikely that the observed heterogeneity in model specification is due to differences in transformation. Most studies sampled the general population, exceptions being the Argentinian study, which used patients and family members, and the Polish study, which used visitors to inpatients. The exclusion criteria varied considerably. These and other differences may be responsible for some of the observed heterogeneity in model specification.

We found that the choice of health states to value is responsible for some, but not all, of the observed heterogeneity in model specification. Specifically, our results showed that even when the same states and weightings are used in each country, there is heterogeneity in model specification. This finding underscores the importance of a common valuation protocol for EQ-5D valuation studies as a means of reducing model heterogeneity amongst countries, but also suggests that heterogeneity in specification should still be expected despite the common protocol.

The finding that the country’s own specification was often out-performed by alternative specifications has potentially important implications. There are at least two possible explanations. Firstly, some alternative specifications may not have been considered; this is not a criticism, as no study claims to have found the best possible specification. Secondly, although several studies used cross-validation to assess model fit, in most cases this was done by omitting respondents; only Dolan [4] used leave-a-state-out cross-validation. This is striking given that the ability to predict mean utilities for health states not included in the analysis is much more critical than the ability to predict utilities for respondents not included in the analysis. Furthermore, in cases where the country’s own specification was out-performed by an alternative, the differences in predictions between the two models were large enough to matter for a substantial proportion of the health states.

In conclusion, this analysis underlines the importance of health state selection when designing a valuation study, and suggests that cross-validation through the omission of states, rather than respondents, should be considered when assessing model fit.