Introduction

Economic evaluation studies have increasingly employed multi-attribute utility (MAU) instruments for measuring health state utilities. The instruments consist of a descriptive system (or classification), that is, a set of health-related questions and response categories, and a scoring system or algorithm that converts responses into a utility score. This score measures the preference for the health state on a scale where 0.00 represents death and 1.00 represents the best health state described by the instrument (for descriptions, see Brazier et al. [1]).

A review of published articles listed on the Web of Science between 2005 and 2010 [2] identified 392 cases where a head-to-head comparison had been made between the results from different instruments. The outcome of these comparisons indicated a relatively low level of agreement between utilities predicted from different instruments. Authors generally concluded that instruments are ‘not equivalent’ [3] and are ‘imprecisely related’ [4]. These conclusions are unsurprising as the evaluation methods used to quantify utility, the population groups from which values were obtained and the descriptive systems of the instruments all vary significantly. In using what Dolan et al. [5] describe as a ‘deliberately simplified descriptive system’ (p. 1036), utility instruments have omitted dimensions nominated by individuals as most important to them [6]. Of the three most widely used instruments (EQ-5D, SF-6D and HUI 3), only SF-6D has a dimension relating to social function. Each of the three instruments has only one item to describe the mental health dimension, which Helliwell et al. [7] note ‘is the single most important determinant of individual happiness (in every case where this has been studied)’ (p. 4). While health-related quality of life is a wider concept than mental health and happiness, their limited description in these instruments suggests the need for a broader descriptive system when these dimensions are important.

The AQoL-8D was constructed in response to this challenge. Its objective was to achieve a high level of sensitivity to psycho-social health states by supplementing the descriptive system of an earlier instrument, the AQoL-6D [8], and to remodel the resulting instrument utilities. Revision of the descriptive system is described in Richardson et al. [9]. A review of ten instruments measuring mental health-related QoL identified 18 relevant health domains (sub-dimensions). Initially, 250 items describing these domains were obtained from the instruments, from four focus groups with mental health patients and (for a small number of domains) from the research team which included, inter alia, two psychiatrists, a psychologist and a personal counsellor. Qualitative and logical analyses reduced these to an item bank of 133. Interviews were conducted which administered the items to a representative sample of 197 members of the public and to 514 mental health patients. As recommended by McDonald [10], a combination of exploratory and confirmatory factor analyses was used to construct the final instrument which is shown in Fig. 1 along with the goodness-of-fit statistics. These indicate that the descriptive system has good psychometric properties and that each of the three physical and the five psycho-social dimensions derived from the psychometric analyses is associated with an overarching physical or psychological ‘super’ dimension.

Fig. 1 AQoL-8D structure. Numbers on joined arrows indicate correlation. Numbers on unjoined arrows are the residual (unexplained) variances of the latent variables. Fit statistics CFI = 0.974, TLI = 0.972, RMSEA = 0.073, WRMR = 1.64. Coefficients were estimated from the combined public patient database (n = 711). Source: Richardson et al. [9]

The purpose of the present article is to describe the modelling of the AQoL-8D utilities. This presented a unique challenge because of the size and complexity of the instrument. To date, utility has been modelled using one of two broad approaches. First, decision analytic (DA) procedures have been used by the 15D [11], the three Health Utility Index (HUI) instruments [12–14] and the AQoL-4D [15]. DA theory, however, places strict orthogonality requirements upon items to avoid ‘redundancy’: the double counting of content which is common to correlated items [16]. This limits the extent to which sensitivity may be increased through the use of multiple, overlapping (but separate) items and dimensions. Secondly, econometric techniques have been used to calibrate the EQ-5D [17] and the SF-6D [18]. The approach avoids the need for strict orthogonality of items but is also subject to a constraint imposed by multi-collinearity as the number of correlated items and dimensions increases.

These constraints highlight a dilemma encountered in MAU modelling in general and a dilemma which the AQoL-8D sought to overcome. Content validity is usually defined by having items (or tasks in a test) which are ‘representative of the specified universe’ [19, p. 451], and content validity is increased when a measure includes a ‘more representative sample of the target behaviour’ [20, p. 175]. This commonly implies the need for 3–4 items to achieve a high level of sensitivity within a dimension. However, items describing a single dimension necessarily correlate, violating the strict orthogonality requirement of decision analysis, and with the inclusion of multiple dimensions for psycho-social health, the total number of items is too large for simple econometric modelling of health state utilities.

The approach adopted by the AQoL-8D to overcome this dilemma is new and is outlined in the section below which also describes the survey used to calibrate the model. Survey results and their use to model AQoL-8D utilities are outlined in the ‘Results’ section and discussed in the ‘Discussion’ section.

Modelling and data

Modelling

Scores for each of the 8 dimensions of the AQoL-8D were separately modelled and subsequently combined to obtain AQoL-8D utilities. Both the modelling of the dimensions and their combination were carried out in two stages which therefore resulted in a four-stage procedure. Stage 1 used the multiplicative model recommended by DA theory to combine the 35 items into 8 dimension scores. However, each dimension contained multiple, correlated, items which, as discussed above, would result in the double counting of the content which is common to the correlated items. Consequently, in Stage 2, an ‘econometric correction’ was carried out as described below, to eliminate the effects of double counting. Stages 3 and 4 replicated the methods of Stages 1 and 2 to combine dimension scores into a single utility. Stage 3 therefore consisted of the multiplicative combination of Stage 2 dimensions, and Stage 4 was a final econometric correction. Scores obtained for the dimensions are referred to here as ‘values’ to indicate that results for different dimensions are not comparable and to distinguish them from ‘utilities’ which measure preferences on a 0.00–1.00 (death-best health) scale.

Multiplicative modelling is similar to the procedure described in Eq. 1.

$$V = \mathop \prod \limits_{i = 1}^{n} V_{i}$$
(1)

where \(V_{i}\) are the values of the items (dimensions) to be combined and \(V\) is the multiplicative score. The actual model is more flexible. It is calculated using disvalues, DV, rather than values, and these are adjusted for the relative importance of each of the items (dimensions) using an importance weight. This results in Eq. 2 in which \({\text{DV}}(x_{i})\) is the disvalue of item (dimension) \(i\), \(w_{i}\) is the corresponding item (dimension) importance weight and \(k\) is the overall scaling constant which is similar to the requirement in an additive model that the dimension weights sum to unity. It is obtained by solving Eq. 3.

$${\text{DV}} = \frac{1}{k}\left[ {\prod\limits_{i = 1}^{n} \left[ {1 + kw_{i} {\text{DV}}(x_{i} )} \right] - 1} \right]$$
(2)
$$k = \mathop \prod \limits_{i = 1}^{n} (1 + kw_{i} ) - 1$$
(3)

The relationship between value and disvalue is given by Eq. 4.

$$V = 1 - {\text{DV}}$$
(4)
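Equations 2 to 4 can be illustrated with a minimal Python sketch. It is not the published AQoL-8D algorithm: the function names, the bisection search for \(k\) and the example weights are assumptions made for illustration only. The sketch assumes at least two importance weights, each strictly between 0 and 1, and treats weights that sum to one as the additive special case (\(k = 0\)).

```python
from math import prod


def solve_k(weights, iterations=200):
    """Return the non-zero root k of Eq. 3, k = prod(1 + k*w_i) - 1, by bisection.

    Assumption: at least two weights, each strictly between 0 and 1.
    """
    if len(weights) < 2 or not all(0.0 < w < 1.0 for w in weights):
        raise ValueError("need at least two weights, each in (0, 1)")
    total = sum(weights)
    if abs(total - 1.0) < 1e-6:
        return 0.0  # weights sum to one: the model reduces to the additive form

    def f(k):
        return prod(1.0 + k * w for w in weights) - 1.0 - k

    if total > 1.0:
        lo, hi = -1.0, -1e-6  # the root lies between -1 and 0
    else:
        lo, hi = 1e-6, 1.0  # the root is positive; widen until it is bracketed
        while f(hi) < 0.0:
            hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if (f(lo) < 0.0) == (f(mid) < 0.0):
            lo = mid  # same sign as the lower end: root is in the upper half
        else:
            hi = mid
    return (lo + hi) / 2.0


def multiplicative_disvalue(weights, disvalues, k):
    """Combine item (dimension) disvalues with Eq. 2."""
    if k == 0.0:
        return sum(w * d for w, d in zip(weights, disvalues))
    return (prod(1.0 + k * w * d for w, d in zip(weights, disvalues)) - 1.0) / k


def value_from_disvalue(dv):
    """Eq. 4: V = 1 - DV."""
    return 1.0 - dv


# Hypothetical importance weights for a three-item dimension.
example_weights = [0.6, 0.5, 0.4]
example_k = solve_k(example_weights)
```

With this scaling, a state in which every item is at its worst level has a disvalue of 1, and a state in which only item \(i\) is at its worst level has a disvalue of \(w_{i}\), which is the property that Eq. 3 is designed to guarantee.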

For both the Stage 2 and Stage 4 econometric correction, a number of multi-attribute (MA) health states were constructed, independently evaluated and regressed upon the multiplicative score and its component items (dimensions). Experimentation with different functional forms resulted in the adoption of linear models in Stage 2 and the exponential model, Eq. 5, in Stage 4. The procedures resulted in 8 dimension specific algorithms and an overall econometric algorithm for the AQoL-8D.

$$\begin{aligned} U & = V^{x} \\ x & = a_{0} + \sum\limits_{i} {b_{i} } I_{i} + \sum\limits_{i} {\sum\limits_{j} {c_{{ij}} } } I_{i} I_{j} \\ \end{aligned}$$
(5)

where \(U\) is the independently assessed utility of an MA health state; \(V\) is the multiplicative value; \(x\) is the exponent; \(a_{0}\), \(b_{i}\) and \(c_{ij}\) are parameters; \(I_{i}\) is the score for dimension \(i\); and \(I_{i} I_{j}\) is the product of the scores for dimensions \(i\) and \(j\).
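The exponential form of Eq. 5 can be sketched as follows. The function name, the dictionary layout and all coefficient values are hypothetical and chosen only to show the arithmetic; the estimated coefficients are those reported in Table 3, not the ones used here.

```python
def corrected_utility(v, scores, a0, b, c):
    """Eq. 5: U = V**x, with x = a0 + sum_i b_i*I_i + sum_ij c_ij*I_i*I_j.

    v      -- multiplicative value V (between 0 and 1)
    scores -- dict mapping dimension name to dimension score I_i
    a0     -- intercept of the exponent
    b      -- dict mapping dimension name to coefficient b_i
    c      -- dict mapping (name_i, name_j) to interaction coefficient c_ij
    """
    x = a0
    x += sum(b_i * scores[name] for name, b_i in b.items())
    x += sum(c_ij * scores[i] * scores[j] for (i, j), c_ij in c.items())
    return v ** x


# Hypothetical inputs: two dimensions, one main effect, one interaction.
example_scores = {"pain": 0.8, "coping": 0.5}
example_u = corrected_utility(
    v=0.5,
    scores=example_scores,
    a0=1.0,
    b={"pain": 0.5},
    c={("pain", "coping"): 0.25},
)
```

Because \(V\) lies between 0 and 1, an exponent above 1 lowers the utility relative to the multiplicative value and an exponent below 1 raises it, while \(V = 1\) always maps to \(U = 1\).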

Data and analysis

The relationship between the data collected and the stages of analysis is shown in Fig. 2. Commencing on the left of this figure, item responses for each of the 35 items were assessed on a 0–1 scale (0 = item worst; 1.00 = item best). Item response category weights were set equal to the average assessment (‘A data’). Following the recommendation of DA theory, item importance weights (‘B data’) were obtained from item worst scores measured on a dimension best–worst (1.00–0.00) scale. Multi-attribute health states were created from the items of each dimension (‘ED data’). These were used as the dependent variable in the econometric adjustment of dimension scores. Dimension weights for use in the multiplicative modelling were obtained from dimension worst health states (‘C data’) measured on an AQoL-8D best (1.00)−death (0.00)−worse than death (negative) scale. Multi-attribute health states spanning all combinations of dimensions were created for the Stage 4 adjustment (‘E data’).

Fig. 2 Summary of data and analysis for the scaling of AQoL-8D. (1) Within item response on an item best-item worst scale (questionnaire). (2) Item worst response on a dimension best-dimension worst scale (questionnaire). (3) Within dimension MA health states (interview). (4) Dimension worst response on AQoL-8D best/death/worse than death scales (interview). (5) Multi-dimensional health state (interview) as described in Iezzi et al. [21]

The first four sets of data—A, B, ED and C data—were assessed using a VAS. E data were obtained from TTO valuations. This implies that Stage 4 adjusted for both the effects of the non-orthogonality of dimensions (and the possible double counting of disutility) and also converted the units of analysis from VAS to TTO. VAS valuations were obtained from mail questionnaires administered prior to a subsequent interview which edited data and separately administered the TTO questions. Protocols, props and a model interview are reproduced in Iezzi and Richardson [21].

To obtain ED and E data for the econometric corrections in Stage 2 and Stage 4, MA health states were constructed manually to ensure simplicity and plausibility. (Those with unbearable pain cannot, plausibly, be independent and happy.) Additional criteria were that MA states include all combinations of items and that each dimension included mild, moderate and severe health states. All of the Stage 2 and 4 MA states are reported in Richardson et al. [22], along with the frequency distributions of dimension scores for the MA health states. These indicate that the full range of health states was employed in the analysis.

Multiplicative dimension scores were obtained from Eq. 2 using ‘A data’ to measure \({\text{DV}}(x_{i})\). Parameters \(w_{i}\) and \(k\) were solved by inserting ‘B data’ in Eq. 3 (Stage 1). Final dimension scores were obtained from OLS linear regression of ED data upon each dimension multiplicative score, \(V_{i}\), and the dimension items (Stage 2). A multiplicative AQoL score, \(V\), was obtained by inserting these corrected dimension scores and dimension weights (C data) in Eqs. 2 and 3 (Stage 3). The final AQoL-8D algorithm was estimated in Stage 4 from Eq. 5 using the latter, multiplicative score, and each of the dimension scores as independent variables, and the multi-attribute health state valuations (E data) as the dependent variable.

Various combinations of independent variables were tested in Stage 4. The final algorithm was selected on the basis of the usual goodness-of-fit statistics (\(R^{2}\), F, RMSE) but also using the ‘internal validity’ of the models; that is, their ability to provide an unbiased estimate of the TTO scores from which the model was derived. This was tested using the regression: \({\text{TTO}} = a + b\,{\text{AQoL}}_{i}\), where \({\text{AQoL}}_{i}\) is the predicted utility, \(U\), from Eq. 5. With no bias, \(a = 0\) and \(b = 1.00\).
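The internal validity test is a simple regression of observed on predicted scores, and can be sketched in a few lines of Python. The function name and the example numbers are assumptions for illustration; no survey data are reproduced here.

```python
def ols_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical predicted utilities and mean TTO scores for four health states.
predicted = [0.20, 0.45, 0.70, 0.95]
observed_tto = [0.26, 0.46, 0.66, 0.86]  # generated as 0.10 + 0.80 * predicted
a, b = ols_line(predicted, observed_tto)
```

In this made-up example the fit returns an intercept near 0.10 and a slope near 0.80, so the predictions would overstate good health states and understate poor ones. An unbiased algorithm would instead give an intercept close to 0 and a slope close to 1.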

Main survey

The survey and data collection are outlined in Richardson et al. [22]. It had two components: (a) a sample of the Australian population and (b) a sample of patients with a clinically diagnosed mental health disorder. The target sample was based on the minimum sampling error of 5 %, which results in a target of 400 cases in each stratum [23]. The population sample was drawn from a computer readable phone directory, using a stratified, clustered two-stage design, similar to Hawthorne et al.’s [15] procedures in the AQoL-4D validation study. Postcodes were used as the primary sampling units with the selection based on the Australian Bureau of Statistics Socio-Economic Index for Areas (SEIFA) and the probability of selection proportionate to the population size (to reduce the effect of socio-economic confounding). Telephone subscribers (above 18) were sampled and contacted by letter and subsequently by telephone.
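The formula behind the target of 400 cases is not given here. One common calculation that reproduces the figure, offered only as an assumption about how such a target can arise, uses the sample size for estimating a proportion with margin of error \(e = 0.05\), maximum variance (\(p = 0.5\)) and a 95 % confidence level approximated by \(z \approx 2\):

$$n = \frac{z^{2} \, p(1 - p)}{e^{2}} = \frac{2^{2} \times 0.5 \times 0.5}{0.05^{2}} = \frac{1}{0.0025} = 400$$

Using the exact value \(z = 1.96\) would give approximately 384, so the target of 400 is consistent with the rounded rule \(n \approx 1/e^{2}\).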

People with mental disorders (‘neuroses’, depression and psychosis) were accessed via Melbourne’s St Vincent’s Hospital Mental Health Service, The Melbourne Clinic (a private hospital) and centres for post-traumatic stress disorder. Treatment providers were approached to assist with recruitment, to obtain informed consent and to ensure that people were well enough to participate in the study. Interviews were conducted with the public respondents at the Centre for Health Economics at Monash University. Patients were interviewed at their treatment service site. The protocol was approved by the Monash University Human Research Ethics Committee (MUHREC Approval No: 2004000750).

Results

Survey respondents are described in Table 1. In total, 670 individuals participated: 323 patients and 347 members of the public. The survey sought to obtain a representative sample of the public between the ages of 18 and 65. Comparison of the final columns indicates that this was achieved not only in the public sample but also amongst patients, despite less scope for targeting patients according to their demographic characteristics. In contrast to the 60 % of the Australian population living in the lowest three SEIFA (SES) groups, only 10.8 % of the public sample and 26.1 % of the patient sample were from these postcodes. However, subsequent analyses failed to detect differences in responses between individuals from the top two and bottom three SEIFA groups.

Table 1 Public and patient respondents to main survey by age and gender

Table 2 summarises the types and number of data collected. From the final column, average item category weights were obtained for the 176 response categories of the 35 items for both the public and patients; average importance weights were obtained for each of the 35 items and 8 dimensions for both groups. A total of 174 dimension health states and 370 AQoL-8D health states were constructed and evaluated for the econometric corrections.

Data analysis

Initial data analysis and editing are described in Richardson et al. [22]. With few exceptions, all item response categories were used by respondents and resulting frequency distributions indicated the absence of content ‘gaps’. Public and patient VAS ratings of item responses (A data) were very similar. There were no significant differences in mean scores for 14 of the 35 items and, for another 10, only one response level differed significantly. Where differences occurred, they were small and, contrary to the conventional wisdom (that adaptation results in higher QoL assessments by patients), it was the public, not patients, who gave the higher rating. These differences occurred in the mental health, self-worth and happiness dimensions. There was, similarly, little difference in the public and patient valuation of item worst health states (B data). These differed with 5 % significance for only 4 of the 35 items. Consequently, subsequent modelling used total (public plus patient) data.

Dimension models

Item worst scores (B data) and their use in the calculation of item weights are reported in Supplementary Table S.1. Inserting the weights for each dimension in Eq. 2, in conjunction with the dimension scaling constant calculated from Eq. 3, creates the (Stage 1) multiplicative dimension formulae reported in Box 1.

Box 1 Multiplicative equations for dimensions (Stage 1)

The 174 ‘within dimension’ multi-attribute health states created for the econometric correction of dimensions resulted in 2,787 individual observations (ED data), an average of 348 per dimension. The frequency distributions of these data spanned the full range of scores for each dimension, i.e. from 1.00 to 0.00 (dimension best to dimension worst). In Stage 2, these data were regressed upon the predicted multiplicative score and the average item scores for each dimension. Various models were tested, but linear regressions proved to be the most satisfactory. They are reported in Box 2.

Box 2 Econometric correction to multiplicative scores (Stage 2)

AQoL-8D model

Dimension worst health states (C data) are reported in Supplementary Table S.2 for both the public and patients. Average scores differ significantly only for independent living (5 % significance) and coping (1 % significance). In both cases, the public score is higher, not lower, than the patient score. In Stage 3, the average of both scores was used as the dimension weight, \(w_{i}\), in Eqs. 2 and 3. The resulting multiplicative models for the two super-dimensions and for the AQoL-8D are reported in Box 3.

Box 3 AQoL-8D and super dimension (SD) multiplicative scores (Stage 3)

For the final econometric correction (Stage 4), 3,178 TTO observations were obtained from the 370 MA health states. These successfully spanned all dimensions and combinations of dimensions (E data). Mean TTO scores were used in the final econometric correction, Eq. 5. As noted, since previous valuations employed a VAS, the Stage 4 ‘correction’ took account of both double counting of disutilities and differences between VAS and TTO. Results are presented in Table 3.

The test of internal validity for the five models is reported in the bottom block of Table 3. The 370 directly elicited mean TTO scores were regressed upon the predicted scores using linear OLS. For an unbiased estimate of the TTO, the resulting equation should be of the form: TTO = 0.00 + 1.00 AQoL-8D. Using this criterion, model 4 marginally outperforms model 5. However, the quantitatively large but marginally significant negative coefficients in model 4 result in a number of anomalous scores (in excess of 1.00). Consequently, model 5 was adopted for the final algorithm. The OLS test regression was separately estimated for the public and patient samples. Results reported as a footnote to Table 2 are very similar and suggest that the preferences of the two groups do not diverge substantially.

Table 2 Summary of data collected by stage of analyses

Predicted AQoL-8D utilities from the final algorithm are plotted against average TTO scores in Fig. 3. To simplify the figure, data were aggregated to create a single observation for each percentile of the AQoL-8D utilities. The vertical axis therefore measures the average TTO for AQoL-8D utilities in the range (0.995 to 1.00), (0.985 to <0.995), (0.975 to <0.985), etc. The increased aggregation did not alter the OLS regression reported for model 5 in Table 3, but increased the \(R^{2}\) from 0.73 to 0.92.
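The aggregation used for Fig. 3 can be sketched as a binning step. The function name and the example numbers are hypothetical; the bin boundaries follow the ranges quoted in the text, so that each predicted utility is assigned to the nearest hundredth and the TTO scores in each bin are averaged.

```python
from collections import defaultdict
from math import floor


def aggregate_by_percentile(predicted, observed_tto):
    """Average observed TTO within bins of predicted utility centred on 0.00, 0.01, ..., 1.00.

    A predicted utility u falls in the bin [c - 0.005, c + 0.005) around centre c.
    Returns a dict mapping bin centre to mean TTO, in ascending order of centre.
    """
    bins = defaultdict(list)
    for u, tto in zip(predicted, observed_tto):
        bins[floor(u * 100 + 0.5)].append(tto)
    return {index / 100: sum(values) / len(values) for index, values in sorted(bins.items())}


# Hypothetical data: two states fall in the 0.99 bin, one in the 0.52 bin.
example_bins = aggregate_by_percentile([0.991, 0.989, 0.52], [1.0, 0.5, 0.25])
```

Averaging within bins removes much of the scatter between individual health states with similar predicted utilities. This is why a regression on the aggregated points can leave the fitted line essentially unchanged while reporting a higher \(R^{2}\).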

Fig. 3 Average TTO versus (predicted) AQoL-8D

Table 3 Regression results for Eq. 5 (Stage 4)

Discussion

AQoL-8D methods were innovative in several respects. Its descriptive system was derived using psychometric methods [9]. As described here, its modelling employed a four-stage methodology incorporating econometric corrections. This was necessary as the descriptive system was too large to model with the single-stage econometric analyses used in some other MAU instruments [1]. But the use of DA methods to combine items necessarily violated orthogonality requirements [12]. The econometric correction reconciled the need for a larger descriptive system with the problems this created for modelling utility.

The methods were necessarily based upon a number of limiting assumptions. Utility was measured using the TTO technique. Coefficients at all stages of the analysis were derived from average responses. This implies that the instrument will not properly represent the preferences of particular individuals. The average utilities reflected the preferences of an Australian sample which may differ from preferences in other countries. The sample did not include many respondents aged above 65 whose valuation of ill health may differ from other age groups. However, these problems are general. Even a perfectly representative cross section of a national population may not capture preferences of a particular socio-demographic or disease subgroup in the same country. It would, nevertheless, be desirable for future research to test the validity of AQoL-8D for use with other populations and particularly the elderly who represent an increasingly important section of the population.

Despite these limitations, AQoL-8D is unique in having employed both a representative cross section of the general public and patients with clinically defined mental health problems for constructing both the descriptive system and the utility algorithm. The objective was to achieve a ‘mixed perspective’ as a compromise between the perceived differences between public and patient perspectives. Its success is qualified by the non-random selection of patients: they were recruited from the patients of participating psychiatrists, and sampling according to predetermined criteria was, therefore, not feasible. However, there were no grounds for believing that these patients were otherwise atypical. In the event, very few differences emerged between the valuations from the two groups. Contrary to the common belief that adaptation results in higher patient scores, when differences were observed, it was the public, not the patient group, which had the higher score. This suggests that the common view of adaptation may apply chiefly to physical disability where the public can envisage the consequences of the disability. A lesser ability to envisage and empathise with psychological health states would also explain the higher valuations placed upon them by the public when differences occurred.

A consequence of the enlarged psycho-social component of the descriptive system is that the AQoL-8D is sensitive to changes in the dimensions which drive adaptation and which, when absent, differentiate patient and public valuations. As information relating to self-esteem, coping, happiness, etc. is included in the health state description of the AQoL-8D, it is less likely that patient and public valuations will diverge because of a differing assessment of these elements. The similar evaluations by the two groups in the AQoL-8D surveys suggest that differences between public and patient assessments are, in part, attributable to the absence of relevant information in the health state description and that when appropriate psycho-social information is provided, the valuations of items, dimensions and health states will be relatively similar. However, by increasing the relative importance of psycho-social dimensions, the AQoL-8D decreased the relative importance of physical dimensions. This is possibly an inescapable consequence of expanding the scope of measurement.

By September 2013, the AQoL register (www.aqol.com.au) had recorded 99 research teams which had indicated an intention to use AQoL-8D, 59 of which provided details as published on the AQoL website [24]. The instrument has been included in an 8,022 person multi-instrument online comparative survey [25]. This found that, despite its greater length, there was an average completion time of 5.45 min. The instrument demonstrated a high degree of sensitivity to all the health states but particularly to health states with poor psycho-social health. The survey was used, inter alia, to test AQoL-8D reliability and validity [26]. Reliability (Cronbach alpha) coefficients from the analysis are reported in Supplementary Table S.3, and the frequency distribution of AQoL scores for the healthy public shown as Supplementary Figure S.1. Population norms for the instrument have been published [27].

Conclusions

The objective of the AQoL-8D project was the creation of an MAU instrument with increased sensitivity to psycho-social health states. The construction of a descriptive system with content validity in these dimensions resulted in an instrument which is large by the standards of extant MAU instruments and too large for the use of past methods to create a flexible utility algorithm. The present article has described the methods adopted to meet this challenge.

The resulting algorithm predicts utilities which are in the same range as extant instruments. The predicted utilities are an unbiased estimate of the TTO valuations of the health states. The instrument has high internal and test–retest reliability and good external validity. It exhibits high sensitivity to the health dimensions which it was designed to measure. These results imply that the AQoL-8D is a useful addition to the extant instruments and particularly for analyses requiring measurement where psycho-social health is affected.