Introduction

Evaluating the health of the general population is one of the specific applications proposed for multi-attribute measures of health status and well-being [1]. Such instruments are increasingly used in national health surveys alongside the well-established single question of perceived general health [2]. Generic multi-attribute instruments are suitable for this type of application as they can be used in healthy individuals as well as those with different conditions. These generic instruments can in turn be classified as psychometric profile measures or econometric index measures [3]. Psychometric measures generate scores on different health dimensions (profile). Econometric measures provide a single global score (index) which incorporates societal preferences for health states (utilities) that can be used to calculate quality-adjusted life years for use in economic evaluations.

Given its low respondent burden, the EQ-5D has been widely used in population surveys [4–10]. Its descriptive system, composed of only five items with three response options each, is used to generate a preference-based index [11]. On the other hand, the length of the most commonly used profile measure, the Short Form-36 Health Survey (SF-36) [12–15], may have limited its use in population health surveys. However, the development of the SF-12 [16], an abbreviated version with only 12 items, considerably reduced respondent burden, making it more suitable for use in large health surveys while still providing the two component summary scores (Physical and Mental) of the original SF-36 [16].

The Short Form-6 Dimension (SF-6D) is an econometric index with an empirically derived system of health state preference values developed first from the SF-36 [17] and then from the SF-12 [18]. Given that the SF-12 questionnaire can generate both an index (the SF-6D) and profile (Physical and Mental Component Summaries), it may provide a useful alternative to the EQ-5D for use in health surveys. However, the advantages of each measure in this context remain unclear.

Several studies have compared the EQ-5D with the SF-6D index and/or SF-12 summaries in different patient groups [19–29], though evidence on the comparative performance of this type of questionnaire requires cumulative results from different settings and types of study. However, there have been far fewer head-to-head comparisons in health surveys [2, 6, 12, 30–32]. Five studies in the general population compared the EQ-5D and the SF-12 [2, 6, 30–32], but only one also considered the SF-6D, and that comparison was restricted to individuals in good health [31]. All of these studies found a high proportion of respondents with the best possible scores (ceiling effect) on the EQ-5D, suggesting limited discriminative capacity at mild levels of morbidity, while the SF-6D has been shown to suffer from floor effects in patients with severe conditions [29]. To the best of our knowledge, no studies are available which compare the EQ-5D with the SF-6D index and the SF-12 profile in terms of their power to differentiate between groups defined by their socio-economic characteristics or illness profile in the general population as a whole.

The aim of this study was to evaluate the adequacy of the EQ-5D, SF-6D, and SF-12 version 2 for use in population health surveys by comparing their ceiling effects and capacity to discriminate between groups defined by socio-demographic and health characteristics in a general population sample.

Methods

Study population

Data used in this study came from the 2006 Catalan Health Interview Survey (CHIS), a cross-sectional health survey carried out in Catalonia [33, 34], an Autonomous Community in the north-east of Spain with about 7.2 million inhabitants. A representative sample of individuals from the non-institutionalized general population was surveyed [35]. Participants were selected using a multi-stage, random sampling strategy. In the initial sampling stage, the number of interviews required in each of Catalonia’s 37 health care governing districts (HGDs) was established based on population size. The number of interviews estimated per HGD ranged from 400 to 650 for a sampling error at this level of ±5% (p = q = 0.5). In the case of Barcelona, 305 interviews were planned for each of the city’s 10 municipal areas. In a second stage, representative municipalities were chosen at random for each HGD after stratifying by number of inhabitants. In a third stage, participants from each municipality were selected by simple random sampling from the Catalan census register [36]. The study was approved by the Consultants’ Committee of Confidential Information Management (CATIC) from the Catalan Health Department, according to the 2000 revision of the Helsinki Declaration.
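
As a rough check on the sampling error quoted above, the following sketch computes the margin of error for a proportion at p = q = 0.5. The 95% confidence level (z = 1.96) and the use of simple random sampling without a design effect or finite population correction are assumptions made here for illustration; the source does not state them.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion.

    Assumes simple random sampling and a 95% confidence level (z = 1.96);
    both are illustrative assumptions, not taken from the survey design.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Smallest and largest number of interviews planned per district
for n in (400, 650):
    print(n, round(margin_of_error(n), 3))
```

Under these assumptions, 400 interviews give a margin of about ±4.9%, and larger district samples give a narrower one.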

To minimize non-response, when selected sampling units could not be located or declined to participate, they were replaced by units with the same characteristics in terms of age group, sex, and neighbourhood. Respondents were replaced as a result of change of address (15.1%), erroneous data (11.2%), refusal to participate (10.8%), absence (5.1%), or death (1.3%). Computer-assisted personal interviews were administered by trained interviewers in the respondents’ homes, from December 2005 to July 2006 (n = 18,126).

The EQ-5D has been included in all editions of the CHIS performed to date. The inclusion of the SF-12 in a random sub-sample of the 2006 CHIS consisting of 4,319 participants aged 15 or over provided the necessary data for the present head to head comparison.

Both questionnaires were interviewer administered. The EQ-5D was administered towards the beginning of the survey and the SF-12 at the end. In the sub-sample used for the present study, the same sampling methods were applied as in the overall sample to ensure representativeness of the Catalan general population. Within this sub-sample, statistical power to detect small differences (effect size = 0.3) between groups with and without selected chronic conditions was calculated for the least prevalent condition (stroke). Given a standard deviation of 0.27 for the EQ-5D Index and 0.17 for SF-6D in the overall sample, and a significance level of 0.05, statistical power was calculated to be 0.80 for both measures.

Measurement instruments

The EQ-5D is a brief, multi-attribute, generic, preference-based health status measure [11, 37]. Its descriptive system covers five dimensions of health (mobility, self-care, usual activities, pain or discomfort, and anxiety or depression) with three levels of severity in each dimension (no problems, some problems, extreme problems). The instrument therefore defines 243 distinct health states from all the possible combinations of dimensions and levels of severity (i.e. 3^5). In this study, the Spanish version of the EQ-5D and Time Trade Off preference values from the Catalan general population were used [38], thereby producing a single preference-based index ranging from 1 (best health status) to negative values (health states valued as worse than death, with a minimum of −0.6533), where 0 is equal to death.
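
The count of 243 health states follows directly from combining three severity levels across five dimensions. The sketch below enumerates them; labelling each state with a five-digit code (one digit per dimension) follows the usual EQ-5D convention, and the variable names are illustrative.

```python
from itertools import product

DIMENSIONS = [
    "mobility",
    "self-care",
    "usual activities",
    "pain/discomfort",
    "anxiety/depression",
]
LEVELS = (1, 2, 3)  # 1 = no problems, 2 = some problems, 3 = extreme problems

# One five-digit code per health state, e.g. "11111" for no problems on any dimension
states = ["".join(str(level) for level in combo)
          for combo in product(LEVELS, repeat=len(DIMENSIONS))]

print(len(states))            # 243
print(states[0], states[-1])  # 11111 33333
```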

The SF-12 [39] is composed of 12 questions covering eight dimensions of health: Physical Functioning, Role Physical, Bodily Pain, General Health, Vitality, Social Functioning, Role Emotional and Mental Health. In this study, version 2 of the SF-12, which has previously been adapted into Spanish [40], was used. Version 2 of the SF-12 has greater comparability among linguistic adaptations and uses a five-level response scale for all items, except those in the Physical Functioning dimension which retained the original 3-point Likert scale. Scores for the two component summaries (physical and mental component summaries, PCS-12 and MCS-12) were calculated using a standardized procedure to obtain a mean of 50 and a standard deviation of 10 based on USA general population data [39].

The SF-6D [17] is a utility index based on a descriptive system composed of items from six of the eight SF-36 dimensions (role physical and role emotional were combined and general health was omitted). It can be derived from the SF-36 or SF-12, though the number of items included differs (11 and 7 items, respectively). The SF-6D derived from the SF-12 [18] was used in this study. It classifies individuals into 7,500 health states, of which 241 identified by a fractional factorial design were valued by a representative sample of 611 members of the UK general population. The utility algorithm and the standard gamble preference utilities applied in this study were developed by Brazier et al. [18]. Health state values on the SF-6D are anchored to 1 (perfect health) and 0 (death), and theoretical values for health states generated by the instrument range from 1 to 0.345.

Gender, age, income, and social class were chosen to test the instruments’ capacity to discriminate by socio-demographic groups, as it has frequently been shown that there are differences in health on these variables. Women have consistently been shown to have worse health status than men in many different populations [41–43] despite, paradoxically, having a higher life expectancy. A deterioration in health status with age has also been shown on the physical domains of health, though older adults often report better psychological health than younger adults [44]. Those with low income or in less-privileged social classes also tend to report poorer health status [45].

Social class was assigned according to the respondent’s most recent occupation (or the occupation of the head of the household in the case of those who were looking after the home) using an adapted version of the British Registrar General’s Social Classes [46]: classes I and II (managerial and free-lance professionals), class III (skilled non-manual occupations), class IV (skilled manual workers), and class V (non-skilled manual workers).

Recent health problems were assessed by asking respondents about visits to a health care professional and limitation of activities during the previous two weeks. Limitation of activities was computed as a dichotomous variable which assumed limitation in individuals responding affirmatively to any of three questions: having to spend at least half a day in bed; being unable to perform household tasks or go to work or school; or having to reduce daily activities such as walking, sports, play, and/or shopping as a result of health problems.

Perceived health was assessed by the SF-12 question “In general, how would you rate your health: Excellent, Very good, Good, Fair, or Poor?” Psychological distress was used as the indicator of mental health. It was measured using the 12-item version of the General Health Questionnaire-GHQ [47], a screening instrument designed to detect diagnosable psychiatric disorders, which has shown adequate psychometric properties [48]. The global GHQ-12 score was computed following the method proposed by Goldberg with all responses on the 4 point Likert scale being dichotomized to (0-0-1-1). A total GHQ score ≤2 was considered to indicate absence of psychological distress and ≥3 points to indicate risk of psychological distress [47, 48].
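
The GHQ-12 scoring rule described above can be sketched as follows. Coding each item from 0 to 3 by the position of the chosen option on the 4-point scale is an assumption made for illustration, as are the function names; the (0-0-1-1) dichotomization and the cut-off of 3 points come from the text.

```python
def ghq12_score(responses):
    """Total GHQ-12 score with (0-0-1-1) scoring.

    `responses` holds 12 answers coded 0-3 by option position
    (hypothetical coding); the two highest options score 1 point each.
    """
    if len(responses) != 12:
        raise ValueError("GHQ-12 requires exactly 12 item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be coded 0, 1, 2 or 3")
    return sum(1 for r in responses if r >= 2)

def at_risk_of_distress(responses):
    """True when the total score is 3 or more, as defined in the text."""
    return ghq12_score(responses) >= 3

example = [0, 1, 2, 3, 0, 0, 1, 2, 0, 0, 1, 1]
print(ghq12_score(example), at_risk_of_distress(example))
```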

The CHIS also included a checklist of 27 common chronic conditions. Respondents were asked ‘Do you suffer from or have you suffered from any of the following chronic conditions?’ and had to answer Yes or No for each condition. A summary indicator of physical health was derived from the checklist based on the number of reported chronic physical conditions and after excluding mental disorders. Respondents were classified into one of four categories by quartiles (none, 1 or 2, 3 or 4, and 5 or more physical conditions), as well as by presence or absence of fourteen specific chronic conditions considered the most clinically relevant in each medical specialty.
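
A minimal sketch of the summary indicator is given below. The condition names and the set of conditions treated as mental disorders are hypothetical placeholders, since the checklist items are not listed here; only the four categories are taken from the text.

```python
def count_physical_conditions(checklist, mental_conditions):
    """Count 'Yes' answers, skipping conditions flagged as mental disorders."""
    return sum(1 for name, present in checklist.items()
               if present and name not in mental_conditions)

def condition_category(n_conditions):
    """Map a count of chronic physical conditions to the four study categories."""
    if n_conditions < 0:
        raise ValueError("count cannot be negative")
    if n_conditions == 0:
        return "none"
    if n_conditions <= 2:
        return "1 or 2"
    if n_conditions <= 4:
        return "3 or 4"
    return "5 or more"

# Hypothetical respondent; condition names are illustrative only
answers = {"hypertension": True, "low back pain": True,
           "depression": True, "asthma": False}
n = count_physical_conditions(answers, mental_conditions={"depression"})
print(n, condition_category(n))
```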

Data analysis

To obtain corrected standard errors from the complex sample survey, analyses were performed using the SUDAAN version 8.0 statistical package [49]. Standard errors (SE) and test significance were estimated using the Taylor series method and Without Replacement design, correcting by finite units in each stage. A weighting factor which took into account age, gender, and municipality was applied to restore the representativeness of the population in Catalonia.

Sample characteristics were described by calculating the raw number of individuals in each group and the weighted percentage. Ceiling and floor effects (proportion of respondents with the best and worst possible theoretical scores, respectively) were calculated for the EQ-5D, SF-6D, and SF-12 component summaries.
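
Ceiling and floor effects as defined above can be computed as weighted proportions. The following sketch assumes one survey weight per respondent and uses made-up scores; it does not reproduce the SUDAAN variance estimation.

```python
def ceiling_floor(scores, weights, best, worst, tol=1e-9):
    """Weighted proportion of respondents at the best and worst possible scores."""
    total = sum(weights)
    ceiling = sum(w for s, w in zip(scores, weights) if abs(s - best) <= tol) / total
    floor = sum(w for s, w in zip(scores, weights) if abs(s - worst) <= tol) / total
    return ceiling, floor

# Made-up EQ-5D index values; 1 and -0.6533 are the theoretical extremes cited in the text
scores = [1.0, 1.0, 0.8, -0.6533]
weights = [1, 1, 1, 1]
print(ceiling_floor(scores, weights, best=1.0, worst=-0.6533))
```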

To evaluate the capacity to discriminate between known groups, group mean scores were compared using one-way analysis of variance. Effect sizes were computed as the difference between the group means divided by the pooled standard deviation [50]. Due to the complex sampling, the pooled standard deviation was estimated from the corrected standard errors and the weighted number of individuals in the groups. The extreme-split formula for the variance [50] was used to calculate 95% Confidence Intervals (95% CI) for the effect sizes. For variables with more than two categories (i.e. age and social class), the effect size between extreme groups was calculated. General guidelines define an effect size of 0.2 as small, 0.5 as moderate, and 0.8 as large [51, 52]. This classification was used to interpret differences in the discriminative capacity of the instruments studied. A relevant difference was defined as one in which different instruments showed different categories of effect size for the same between-group comparison, as this would imply disagreement on the amount of health burden. For example, we would consider a relevant difference to exist if the SF-6D showed a large effect size and the EQ-5D a moderate effect size for the same between-group comparison.
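
The effect size calculation can be sketched as follows. Recovering each group's standard deviation as SE × √n and pooling with the usual degrees-of-freedom weighting are assumptions about how the corrected standard errors were used. How values falling between the benchmarks were assigned is not stated in the source, so treating each benchmark as a lower bound is also an assumption.

```python
import math

def sd_from_se(se, n):
    """Recover a standard deviation from a standard error (assumes SE = SD / sqrt(n))."""
    return se * math.sqrt(n)

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))

def effect_size(mean1, se1, n1, mean2, se2, n2):
    """Difference between group means divided by the pooled standard deviation."""
    sd1, sd2 = sd_from_se(se1, n1), sd_from_se(se2, n2)
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)

def classify(es):
    """Label an effect size using the 0.2 / 0.5 / 0.8 benchmarks as lower bounds."""
    es = abs(es)
    if es >= 0.8:
        return "large"
    if es >= 0.5:
        return "moderate"
    if es >= 0.2:
        return "small"
    return "below small"

def relevant_difference(es_a, es_b):
    """True when two instruments fall into different effect size categories."""
    return classify(es_a) != classify(es_b)

# Made-up group summaries: means, corrected standard errors, weighted group sizes
es = effect_size(0.90, 0.02, 100, 0.80, 0.02, 100)
print(round(es, 2), classify(es))
```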

The discriminative properties of the econometric indexes were also compared using receiver operating characteristic (ROC) curves [12, 19, 53]. In this analysis, the performance of the EQ-5D index and the SF-6D index was evaluated against two external indicators of health status: “Perceived Health” and “Reported chronic physical conditions”. These two indicators were dichotomized using all possible cut-off points. The utility measure generating the largest area under the ROC curve (AUC) was regarded as the most sensitive at detecting differences in the external indicator. F-ratios of the significance test for the AUC were referenced to 1.0 for the SF-6D Index. A value over 1.0 would therefore indicate that the EQ-5D Index was more efficient than the SF-6D at detecting differences between groups.
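
The area under the ROC curve can be read as the probability that a randomly chosen respondent from the healthier group has a higher index score than one from the less healthy group. The sketch below computes it that way for every possible cut-off of an ordinal indicator. It is unweighted and uses made-up data, so it illustrates the idea rather than the survey-weighted analysis; coding the indicator so that higher values mean better health is an assumption.

```python
def auc(scores_better, scores_worse):
    """AUC as the proportion of cross-group pairs ranked correctly; ties count one half."""
    wins = 0.0
    for b in scores_better:
        for w in scores_worse:
            if b > w:
                wins += 1.0
            elif b == w:
                wins += 0.5
    return wins / (len(scores_better) * len(scores_worse))

def auc_by_cutoff(index_scores, indicator_levels):
    """Dichotomize an ordinal indicator at every cut-off and return the AUC for each.

    `indicator_levels` are ordinal codes where higher means better health
    (hypothetical coding).
    """
    results = {}
    for cut in sorted(set(indicator_levels))[1:]:
        better = [s for s, lvl in zip(index_scores, indicator_levels) if lvl >= cut]
        worse = [s for s, lvl in zip(index_scores, indicator_levels) if lvl < cut]
        results[cut] = auc(better, worse)
    return results

# Made-up index scores and perceived health codes (1 = Poor ... 3 = Good)
index_scores = [0.2, 0.5, 0.9]
perceived_health = [1, 2, 3]
print(auc_by_cutoff(index_scores, perceived_health))
```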

Results

Table 1 shows the sample characteristics for the 4,319 participants in the CHIS SF-12 subsample. Half of the respondents were women, almost 40% were in social class IV, and 41.4% reported having 3 or more chronic physical conditions. However, only 13.9% of the sample reported limitation of activities in the previous two weeks and 88.2% presented no risk for psychological distress according to their GHQ scores. The most frequent chronic condition was low back pain and the least frequent was stroke, with weighted percentages of 28.2% and 1.3%, respectively.

Table 1 Sample characteristics of the Catalan Health Interview Survey (CHIS) SF-12 subsample (2006)

The mean scores of the EQ-5D and SF-6D indexes (Table 2) were 0.88 and 0.82, respectively. Median scores were higher than means, indicating a skewed distribution for both indexes which was particularly noticeable on the EQ-5D. Floor effects were negligible on all of the measures. The EQ-5D Index showed a considerable ceiling effect, with 59.7% of individuals having the highest possible score. This compares to a ceiling effect of 18.3% and 0% on the SF-6D and SF-12 summaries, respectively. The ceiling effect of the EQ-5D and SF-6D indices varied substantially across different groups (Table 3). For example, ceiling effects on the EQ-5D Index ranged from 89% in individuals who reported no chronic conditions to 20% in those with ≥ 5 chronic physical conditions. The corresponding figures for the SF-6D were 27% and 7%, respectively.

Table 2 Distribution of EQ-5D and SF-6D indexes and SF-12 summary scores on the 2006 CHIS SF-12 subsample
Table 3 Ceiling effect for the EQ-5D and SF-6D by socio-demographic, physical and mental indicators, in the 2006 CHIS SF-12 subsample

Table 4 shows mean scores and effect sizes for groups based on socio-demographic variables and health indicators. Statistically significant differences (P < 0.01) by age and social class were found for all between-group comparisons on both indexes and the SF-12 Physical Component Summary. Small differences were found by gender, with effect sizes (ES) ranging from 0.23 to 0.35. Differences between the upper and lower age groups were large for both indexes and the SF-12 Physical component summary (ES = 0.75–1.60). Small to moderate differences were observed on the three instruments between groups defined by social class (ES = 0.18–0.54), income (ES = 0.27–0.53), and medical visits (ES = 0.31–0.55). Effect sizes between groups differing on activity limitations, perceived health, number of chronic physical conditions, and psychological distress were large on all three instruments.

Table 4 Discriminative capacity for the EQ-5D and SF-6D indexes and the SF-12 version 2 summary scores in the 2006 CHIS SF-12 subsample

Mean scores and effect sizes for groups with and without specific chronic conditions are shown in Table 5. The highest effect sizes were found for arthritis (1.09 on SF-12 PCS and 1.05 on EQ-5D), while chronic allergies presented the lowest effect sizes. Effect sizes on the EQ-5D index were large for 4 conditions, moderate for 8 conditions, and small for asthma and chronic allergies. Most effect sizes on the SF-6D were moderate or small (7 and 6 conditions, respectively); a large effect size was only observed when comparing groups based on presence of mental disorders. Differences between SF-6D and EQ-5D effect sizes were statistically significant in five of the six conditions for which the SF-6D was less sensitive than the EQ-5D: the 95% CI of the difference did not include zero for hypertension (0.06–0.27), myocardial infarction (0.07–0.65), arthritis (0.27–0.48), low back pain (0.17–0.36), and skin problems (0.01–0.33).

Table 5 Discriminative capacity for the EQ-5D and SF-6D Indexes and the SF-12 version 2 summary scores by specific chronic conditions in the 2006 CHIS SF-12 subsample

The pattern of effect sizes across physical chronic conditions shown by the SF-12 Physical component summary was similar to that of the EQ-5D, with differences being observed on only four conditions; two in favour of the SF-12 (diabetes ES = 0.78 vs. 0.72 and cancer ES = 0.84 vs. 0.74, large vs. moderate), and two in favour of the EQ-5D (skin problems ES = 0.55 vs. 0.39 and migraine ES = 0.57 vs. 0.43, moderate vs. small). However, the differences were not statistically significant. Finally, effect sizes for the SF-12 Mental component summary were small for all conditions except mental disorders, which showed large effect sizes.

Table 6 shows the areas under the receiver operating characteristic curves (AUC) for the EQ-5D and SF-6D indexes when detecting differences in Perceived Health and Reported chronic physical conditions. F-ratios ranging from 1.03 to 3.57 indicated that EQ-5D was more efficient in detecting differences between groups in almost all cases.

Table 6 Area under receiver operating characteristic curves (AUC) with 95% confidence intervals (95% CI) and F-ratios of EQ-5D and SF-6D indexes to detect differences in perceived health and number of reported chronic physical conditions

Discussion

This comparative study of the EQ-5D, SF-6D, and SF-12, all short instruments which are likely to be increasingly used as indicators of population health status, confirmed the considerable ceiling effect of the EQ-5D index observed previously in general population health surveys [2, 6, 30, 31]. However, both the EQ-5D and the SF-12 showed a good capacity to distinguish between groups defined by socio-demographic and health variables, and neither was found to be clearly superior, with the results depending on which groups were compared. The SF-6D did, however, generally show poorer discriminative capacity for chronic physical conditions.

The EQ-5D index presented an extremely large ceiling effect compared with the SF-6D index (59.7% and 18.3%, respectively, in this sample). Previous studies in general population samples have also reported ceiling effects of approximately 50% for the EQ-5D index [2, 6, 12, 30, 31]. Furthermore, the ceiling effect observed among individuals reporting specific chronic conditions was also substantial (over 20% in the majority of cases). In contrast, the ceiling effect on the SF-6D was lower than the frequently quoted threshold of 15% [54] in all conditions except chronic allergies (16%). The difference in the ceiling effects may be due, at least in part, to the different recall periods of the EQ-5D (‘today’) and the SF-6D (“last 4 weeks”), as a longer recall period may give more scope for a respondent to include small transient effects on health which might not be picked up by the EQ-5D [55]. The SF-12 component summaries showed no ceiling or floor effects because they were obtained by summing scores of health dimensions after applying positive and negative weights to each dimension [39].

Only two recent studies have directly compared the EQ-5D and the SF-6D indexes in the general population [12, 31]. Most previous head to head studies only compared the EQ-5D index with the SF-12 profile [2, 6, 30]. In our study, the SF-6D index showed poorer discriminative capacity than the EQ-5D and the SF-12. These findings are consistent with those reported by Bharmal and Thomas [31], who compared the discriminative capacity of the SF-6D with the SF-12 in individuals reporting full health on the EQ-5D. However, findings from the Health Survey for England [12] showed that the SF-6D derived from the SF-36 was better than EQ-5D at discriminating between groups. The SF-6D had greater areas under the curve (AUC) than EQ-5D, and it was 30.9%-100.4% and 10.4%-45.6% more efficient than the EQ-5D at detecting differences in self-reported health status and illness, disability or infirmity, respectively. In contrast, in our study, the EQ-5D presented a significantly higher AUC than the SF-6D. Although differences in effect sizes between the SF-6D and EQ-5D for these or similar indicators were not relevant according to our criteria, effect sizes on the EQ-5D were nevertheless higher, supporting the AUC comparison. These contrasting findings may be explained by the fact that the descriptive system of the SF-6D derived from the SF-12 is composed of four fewer items than that derived from the SF-36 (2 physical functioning items, 1 pain item, and 1 mental health item) [17, 18], which could make it less discriminative. Further research is needed to compare these two versions of the SF-6D.

Other differences with previous studies include the fact that the EQ-5D was compared with version 1 of the SF-12 [2, 6, 30, 31]. Although both versions of the SF-12 are similar, version 2 incorporated a five-level response scale instead of the dichotomous response format used for the role dimensions in version 1 [39] and there were minor changes in the mental health and vitality dimensions (6 to 5 options). Our results are therefore not directly comparable with previous studies which used version 1 of the SF-12 [2, 6, 30, 31]. Regarding the discriminative power of the SF-12 version 2, the mental component showed similar discriminative capacity to the EQ-5D on psychological distress and chronic mental conditions, while the physical component showed similar discriminative capacity across socio-demographic categories, and by recent health problems, and number of chronic physical conditions. On the other hand, the SF-12 physical component summary was more sensitive than the SF-6D in several specific conditions (hypertension, myocardial infarction, arthritis, low back pain, diabetes, stroke, and cancer). Furthermore, for six of these seven conditions, differences between SF-6D and SF-12 effect sizes were statistically significant; this can be explained by the fact that the SF-12 component summaries measure more specific aspects of health than the SF-6D, which provides a global health measure, and might not be as sensitive to differences in physical and mental health.

Our study had a number of limitations. The most important was the lack of a gold standard for determining the size of the true difference between the groups studied. However, there was considerable agreement between the EQ-5D and the corresponding SF-12 summary score in terms of the magnitude of between-group effect sizes, whilst the SF-6D generally showed lower effect sizes. This may be evidence for a higher degree of validity for the magnitude estimated by the EQ-5D and the SF-12 than the SF-6D. The estimation of effect size also had some advantages for comparing the discriminative capacity of different instruments. Although the effect size, the F-statistic and the area under the ROC curve all provide a means of testing for statistically significant between-group differences in score, only the effect size provides information which facilitates a direct interpretation of the differences. For example, all instruments showed that stroke has an impact on health, but whereas the SF-6D indicated a moderate impact (ES = 0.55), the EQ-5D and the SF-12 Physical component summary suggested that the impact was large (ES = 0.83 and 0.89, respectively).

Secondly, preference weights applied to the SF-6D were obtained from a representative sample of the UK general population because no Spanish preferences for this instrument were available. However, only minor differences were found in terms of preferences for EQ-5D health states between Spanish and English respondents [56]. Thirdly, in addition to the preference-based index, the EQ-5D also includes a visual analogue scale on overall health status. This measure has been shown to be more sensitive to between-group differences than the index [30, 31], but it was not included here as our intention was to focus on multi-attribute measures. Fourthly, the perceived health indicator used was an item from the SF-12, and analysis of discriminant capacity based on this variable could not be applied for the SF-12 Component Summaries. However, this analysis was valid for the SF-6D because the item is not used in the calculation of this index. Finally, the sample surveyed is only representative of Catalonia. However, given similarities on national indicators such as life expectancy or Healthy Life Years in the general population between Catalonia, Spain, and other developed countries [57], it is likely that the results found here in terms of relative discriminant capacity will be generalizable to other European countries, USA, or Canada.

One of the original contributions of this study is that, as far as we are aware, this is the first time the discriminative properties of the SF-6D derived from the SF-12 have been compared to the EQ-5D index in a general population health survey. The previous study by Bharmal and Thomas [31] focused on respondents who reported being in full health on these indexes, which might not be the group of primary interest from a public health perspective. Rather, those with a need for care are of particular importance as regards absolute and comparative access to health care. In conclusion, this study indicates that the EQ-5D is more sensitive to between-group differences than the SF-6D index and may be more appropriate for determining the burden of chronic conditions or for use in economic studies requiring utilities. However, the SF-12 may be a suitable alternative to the EQ-5D in health surveys, because it is generally comparable in estimating burden of disease and provides further complementary information on Physical and Mental components of health.