Introduction

There is a proliferation of patient-reported outcome measures (PROMs) designed for different purposes. One such set of measures is accompanied by value sets (also referred to as utilities); these measures cover core dimensions of health, and their value sets provide preference weights, typically obtained from the general population or patients, for calculating quality-adjusted life years (QALYs) in cost-effectiveness analysis (CEA). There are different types of generic preference-accompanied PROMs. Traditional PROMs include those which comprise a concise descriptive system using a fixed number of question items, accompanied by a set of values (e.g. EQ-5D [1], HUI [2, 3]), and those which obtain patients’ health states on a longer generic questionnaire (e.g. SF-36 [4]) and then reduce it from multiple items to one attribute per dimension to generate utilities (e.g. SF-6D [5]). More recently, PROMs based on item response theory (IRT), most notably the Patient-Reported Outcomes Measurement Information System (PROMIS), have been developed [6]. PROMIS currently has over 100 health domains with a calibrated “item bank” of questions that can be administered by standard short forms, custom short forms, or computer adaptive testing [7]. The PROMIS Profile measures (i.e. PROMIS-29, PROMIS-43 and PROMIS-57) have been developed to meet the demand for a fixed short-form version of an instrument that is relevant for most health contexts. Utilities for PROMIS-29 are available, but (to date) only for the US [8, 9].

Before introducing the key instruments, it is important to define terms to allow comparability. We define both PROMIS-29 and EQ-5D to be preference-accompanied PROMs, i.e. instruments which seek to describe a set of health states that are mutually exclusive and exhaustive, and which are accompanied by preference weights. We use the term preference-accompanied rather than preference-based because both these PROMs can be considered as measurement systems that are independent of the preferences that accompany them, although a central part of their use is in the application of a preference algorithm to them, for example, in the estimation of QALYs. Elsewhere, preference-accompanied PROMs are known as utility instruments, or multi-attribute utility instruments (MAUIs) [3, 10]. Each preference-accompanied PROM consists of a set of dimensions that are used to measure respondents’ self-perceived health. Elsewhere, these dimensions are also known as attributes or domains. In this paper, we follow the terminology and descriptions used by the instrument developers in each case (Footnote 1).

PROMIS-29 is the brief generic instrument within the PROMIS system that consists of 29 fixed items (questions) covering seven health domains. It has been quickly taken up in many research and clinical settings [9, 14]. Of the three PROMIS Profile measures reported in Cella et al. (2019) [14], most users have opted to use PROMIS-29 (personal correspondence, Health Measures, 23rd July 2020). The EQ-5D is a very concise generic instrument which has five dimensions with one item in each dimension and has been widely used in many countries. Both PROMIS-29 and EQ-5D are self-report measures of core health domains and are accompanied by preference weights. Therefore, both are potentially important candidates for use in economic evaluation. However, they differ in the way they measure and value health, which has the potential to yield differences in the health states described and in the estimates of QALYs used in CEA in the decision-making process [15].

The EQ-5D and PROMIS-29 differ in the health domains covered. The EQ-5D includes five dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression; the five dimensions used in seeking self-reported health from patients are fixed. Each dimension has one question. The initial version of the EQ-5D has three levels of problems within a particular EQ-5D dimension (EQ-5D-3L); a more recent version retains the same five dimensions but increases the levels of problems on each to five (EQ-5D-5L) [16]. For EQ-5D-5L, there are five response levels for each dimension: no problems = 1; slight problems = 2; moderate problems = 3; severe problems = 4; and unable/extreme problems = 5. EQ-5D-5L health states can be described as a five-number string, representing the level of each dimension in the order in which they appear in the questionnaire (as described above) [17].
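
As an illustration of the five-number string, the following Python sketch encodes one set of EQ-5D-5L responses. The function name, the dictionary keys and the example responses are our own labels chosen for illustration; they are not part of any official EQ-5D scoring software.

```python
# Dimension order as it appears in the EQ-5D questionnaire.
DIMENSIONS = (
    "mobility",
    "self_care",
    "usual_activities",
    "pain_discomfort",
    "anxiety_depression",
)


def health_state(responses):
    """Return the five-digit EQ-5D-5L health state string.

    `responses` maps each dimension name to a level from
    1 (no problems) to 5 (unable/extreme problems).
    """
    digits = []
    for dim in DIMENSIONS:
        level = responses[dim]
        if type(level) is not int or not 1 <= level <= 5:
            raise ValueError(f"{dim}: level must be an integer 1-5, got {level!r}")
        digits.append(str(level))
    return "".join(digits)


# Hypothetical respondent: slight problems with usual activities,
# moderate pain/discomfort, no problems on the other dimensions.
example = {
    "mobility": 1,
    "self_care": 1,
    "usual_activities": 2,
    "pain_discomfort": 3,
    "anxiety_depression": 1,
}
print(health_state(example))  # 11231
```

Under this convention, "11111" denotes no problems on any dimension.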

The PROMIS-29 comprises 4-item short forms of seven domains: Physical Function, Depression, Anxiety, Fatigue, Sleep Disturbance, Ability to Participate in Social Roles and Activities (Social Roles) and Pain Interference, plus a 0–10 Pain Intensity numeric rating item. As one of the PROMIS Profile measures, PROMIS-29 is derived from the PROMIS item banks and has shown evidence of strong psychometric properties such as reliability and validity [14]. In PROMIS-29, each item (or question) has five response levels ranging in value from one to five (raw score), and the domain sum raw score ranges from 4 (1 × 4) to 20 (5 × 4). The health state for each domain in PROMIS-29 (and in other PROMIS measures) is expressed in T-scores, which have a mean of 50 and a standard deviation of 10 relative to the general US population.
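
The raw-score range and the T-score metric can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names. Converting a raw sum score to a T-score requires the published PROMIS scoring tables, which are not reproduced here; the second function only re-expresses a T-score that has already been obtained.

```python
def domain_raw_sum(item_scores):
    """Sum the four item raw scores (each 1-5) of one PROMIS-29 domain."""
    if len(item_scores) != 4:
        raise ValueError("a PROMIS-29 domain short form has four items")
    if any(type(s) is not int or not 1 <= s <= 5 for s in item_scores):
        raise ValueError("each item raw score must be an integer from 1 to 5")
    return sum(item_scores)


def t_score_in_sd_units(t_score):
    """Express a T-score as standard deviations from the reference mean.

    T-scores are scaled to a mean of 50 and a standard deviation of 10
    in the US general population.
    """
    return (t_score - 50.0) / 10.0


print(domain_raw_sum([1, 1, 1, 1]))  # 4
print(domain_raw_sum([5, 5, 5, 5]))  # 20
print(t_score_in_sd_units(60))       # 1.0
```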

The stated preference methods used to obtain utilities for each instrument are also very different. The development of the EQ-5D was motivated in part by the aim of facilitating QALY estimates for use in economic evaluation [1]; the five dimensions in the descriptive system are identical to the five domains used in the stated preference exercises that generate the utilities for it. Utilities for the EQ-5D-3L are available for many countries and were typically developed using either Time Trade-Off (TTO) or, for some older value sets, Visual Analogue Scale (VAS) methods [18]. Utilities for the EQ-5D-5L have been produced in many countries and are typically obtained using a protocol which includes both Discrete Choice Experiments (DCE) and the composite Time Trade-Off (cTTO), with resulting value sets based either on cTTO or a hybrid of cTTO and DCE [19]. Alongside this, DCE methods including duration have also been used as a standalone approach for the estimation of value sets that are used in decision-making [20].

Two valuation studies have produced utilities that can be applied to PROMIS-29. An ‘official’ preference-based scoring system (PROPr) has been developed to attach utilities to health states described by PROMIS domains based on multi-attribute utility theory using Standard Gamble [8]. The PROPr system was not designed for PROMIS-29 specifically but for any PROMIS measure that collects measurements on seven PROMIS domains: Cognitive Function, Depression, Fatigue, Pain Interference, Physical Function, Sleep Disturbance and Social Roles. Six of these domains are included in PROMIS-29 (the exception being Cognitive Function). To facilitate the calculation of the PROPr summary score for PROMIS-29, the PROMIS team developed a function to predict the missing Cognitive Function score based on Profile measures [21]. A further, ‘unofficial’ valuation study, which focussed on PROMIS-29 domains and used a DCE with a 10-year duration (i.e. respondents were asked to consider decrements in health lasting 10 years), has produced preference weights for the PROMIS-29 [9].

The choice of a particular instrument for measuring health outcomes can have a non-trivial effect on estimates of QALYs and QALY gains. The use of EQ-5D-3L, EQ-5D-5L or PROMIS-29 will potentially yield different results and evidence for decision-making, deriving both from differences in what is measured in each case and from how it is valued. Previous work has used literature reviews to compare the psychometric properties of different preference-accompanied measures and has shown this to be a valid and useful method [22, 23]. This literature review aims to synthesise existing evidence on the relationship between EQ-5D and PROMIS-29-based instruments (Footnote 2) to establish what is currently known about the comparative properties of these measures and their utilities and to identify where there are gaps in knowledge.

Methods

We conducted a literature search in PubMed and Web of Science, from inception to 28 Feb 2021, to identify studies reporting on both instruments and examining their relationships. We used combinations of the following keywords: “PROMIS-29”, “PROMIS Profile”, “PROMIS”, “Patient-Reported Outcomes Measurement Information System” and “EQ-5D”, “EQ-5D-5L”, “EQ-5D-3L”, “EuroQol” and “Euro-QoL”. The search yielded 77 records from PubMed and 101 records from Web of Science. After removal of duplicates, a total of 95 studies remained.

A two-stage study selection process was employed. The first stage involved the identification of papers that reported original research and whose title or abstract indicated that they included both instruments, i.e. PROMIS-29 and EQ-5D (EQ-5D-3L or EQ-5D-5L). In the second stage, full text articles were assessed. Studies that included only one or two domains of PROMIS-29 were excluded because we were interested in the two instruments as generic, multidimensional measures of self-perceived health. Papers were included if they investigated the following:

  • Comparing PROMIS-29 and EQ-5D as descriptive systems,

  • Response mapping between PROMIS-29 domains and EQ-5D dimensions,

  • Mapping of PROMIS-29 domain scores to EQ-5D utility scores (refers to functions for which an instrument without utility weights is “mapped” onto one with weights),

  • Comparing the characteristics of the value sets available for PROMIS-29 and EQ-5D,

  • Comparing PROMIS-29 utility scores and EQ-5D utility scores using population or patient data,

  • Scale transformation (refers to functions which seek to establish the link between instruments which both have existing utility weights), and

  • Validating an existing mapping or scale transformation algorithm.

Data extracted included the instruments, patient or population group, country of origin, sample size, types of relationship evaluated, statistical methods, psychometric properties (reliability, validity and responsiveness), model specification, performance (within and external validation) and authors’ comments on the study.

Results

Search results

A total of 95 unique studies were identified from the literature search, resulting in 73 initial inclusions after title and abstract screening. Among the remaining studies, only nine studies investigated the relationship between PROMIS-29 and EQ-5D, including six using the EQ-5D-5L measure (Footnote 3). Figure 1 presents the flow chart of the inclusion process of the literature review.

Fig. 1 Flow chart of the literature review

An overview of the evidence

Of the nine included studies, seven were from the United States and two were from Europe and the UK. Three of the nine sampled from the general population only, five used patient data and one used samples from both the general population and patients. The sample size ranged from 20 to 2623. Table 1 provides a summary of the included studies.

Table 1 A summary of existing evidence

Among the six categories of comparison or association we wished to explore (as listed in our inclusion criteria), we found that four have been addressed in the literature on PROMIS-29 and EQ-5D-3L, and three on PROMIS-29 and EQ-5D-5L (as shown in Table 2). Specifically, five studies collecting patient data using both instruments provided information on known-group validity and/or responsiveness [33, 34, 36, 37, 40]. Two studies predicted EQ-5D-3L and EQ-5D-5L utilities from PROMIS-29 domain scores, respectively [28, 38]. There were two studies focussing on utility scores. One compared and transformed the PROMIS-29 utilities (obtained using DCE) to EQ-5D-3L utilities [30]. The other examined the convergent and known-group validity between PROPr utilities and EQ-5D-5L values (using the EQ-5D-3L crosswalk link function that maps the descriptive system of EQ-5D-5L to that of 3L and then using the US TTO value set) [31].

Table 2 Types of relationships that have been investigated in the literature

Key findings

Comparing PROMIS-29 and EQ-5D as descriptive systems

Six studies compared the health states described by PROMIS-29 and EQ-5D. One US study provided a descriptive analysis of PROMIS-29 and EQ-5D-3L responses from the general population (2623 adults). Hartman and Craig [30] found that, in their sample, 74 out of 243 possible health states for EQ-5D-3L were observed, and the two most prevalent health states accounted for more than 55% of the sample (43% of the sample reported no problems on five dimensions); for PROMIS-29, 1670 health states were observed out of the nearly 0.4 sextillion possible health states (5^28 × 11), and the ten most prevalent ones accounted for less than 10% (4% of the sample reported no problems on the 29 items). This indicates that EQ-5D-3L has a large ceiling effect and may not be sensitive to mild health problems, particularly among those in good health.
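
The sizes of the two descriptive systems can be checked with simple arithmetic. The sketch below assumes the count for PROMIS-29 is read as 5^28 × 11, i.e. 28 five-level items (seven domains with four items each) and one pain intensity item with 11 levels (0–10); the variable names are ours.

```python
# EQ-5D-3L: five dimensions, three levels each.
eq5d_3l_states = 3 ** 5

# PROMIS-29: 28 five-level items plus a 0-10 pain intensity item (11 levels).
promis29_states = 5 ** 28 * 11

print(eq5d_3l_states)                  # 243
print(f"{promis29_states:.2e}")        # 4.10e+20, i.e. about 0.41 sextillion

# Share of possible states actually observed in the sample of 2623 adults.
print(f"{74 / eq5d_3l_states:.1%}")    # 30.5%
print(f"{1670 / promis29_states:.1e}") # 4.1e-18
```

About 30% of all possible EQ-5D-3L states were observed, whereas the observed PROMIS-29 states are a vanishingly small fraction of those the instrument can describe.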

We found five published studies that did not directly examine the relationship between the two instruments but reported PROMIS-29 domain scores, EQ-5D level of problems and/or EQ-5D utilities among different patient groups. Wyngaer et al. examined the association between measures of physical function, risk of falls and quality of life using PROMIS-29 (T-scores) and EQ-5D-3L (index score and level of problems) among patients on maintenance haemodialysis in Belgium. There was a statistically significant association between physical health measures and EQ-5D-3L index score. Patients with severely impaired functional performance showed statistically significantly lower scores on all dimensions of EQ-5D-3L except for anxiety/depression, but on only one PROMIS-29 domain (physical function) [36].

The remaining four studies used EQ-5D-5L. Khanna and colleagues reported PROMIS-29 domain scores and EQ-5D-5L health states and utilities for treatment and control groups before and after an intervention using patient data [33]. Among patients with longer disease duration, there was a statistically significant difference between the treatment and control groups in the physical function domain scores for PROMIS-29 but not in the levels of mobility on the EQ-5D-5L; conversely, there was a statistically significant difference in the levels of the pain/discomfort dimension for EQ-5D-5L but not in the pain-related domains for PROMIS-29. In a more recent study using the same data, Khanna and colleagues examined the association between baseline PROMIS-29 fatigue and follow-up social participation, physical function and overall quality of life measured by EQ-5D-5L index scores. They found that baseline fatigue predicted only social participation scores 16 weeks later, and that the association was negative [34]. Dunn et al. tested the reliability and validity of the State-Trait Hopelessness Scale in patients with heart disease and moderate to severe hopelessness using PROMIS-29 and EQ-5D-5L. The authors reported the EQ-5D-5L level sum scores and found that the State Hopelessness Subscale was positively associated with the summary score and with some of the PROMIS-29 domains (depression, fatigue and social roles) [37]. Kempton et al. developed a new instrument (HRDq) for measuring disease-related distress among haemophilia patients, and EQ-5D-5L and PROMIS-29 were used to assess the validity of the HRDq. The authors reported correlations between the EQ-5D-5L index score and HRDq total scores and between PROMIS-29 domains and the HRDq. There were moderate correlations in the expected directions [40].

Mapping PROMIS-29 domains to EQ-5D utility score

Revicki et al. [28] selected five PROMIS domains (physical function, fatigue, pain impact, anxiety, and depression) and mapped the T-scores for the five domains to EQ-5D-3L utility based on a sample of 1658 respondents in the US, using a linear ordinary least squares regression model. The model explained 57% of the variance in the EQ-5D utility score. Regression coefficients for physical function (b = 0.0077), fatigue (b = −0.0021), pain impact (b = −0.0040), anxiety (b = −0.0023) and depression (b = −0.0022) were all statistically significant at the 1% level in the model (Footnote 4). These results indicate that there are conceptual overlaps in the health domains covered in the two instruments. Fatigue is important in EQ-5D utility scores but is not explicitly measured in the descriptive system. It was unclear what explained the remaining 43% of the variance in EQ-5D utility scores. The authors examined the performance of the model by testing the agreement between predicted and actual EQ-5D utility scores and assessing known-group validity. The results suggested good fit of the model but there was some overestimation for EQ-5D scores under 0.40.
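
To give a sense of the size of these coefficients, the sketch below computes the change in predicted EQ-5D-3L utility implied by a change in T-scores. It uses only the coefficients quoted above; the model intercept is not quoted in the text, so the sketch works with differences, in which the intercept cancels. The function and key names are hypothetical.

```python
# Coefficients per one-point change in T-score, as quoted above for the
# linear model of Revicki et al. [28].
COEFFICIENTS = {
    "physical_function": 0.0077,
    "fatigue": -0.0021,
    "pain_impact": -0.0040,
    "anxiety": -0.0023,
    "depression": -0.0022,
}


def predicted_utility_change(t_score_changes):
    """Change in predicted EQ-5D-3L utility for given changes in T-scores.

    `t_score_changes` maps a domain name to the change in its T-score.
    Domains that are not listed are treated as unchanged. Because the
    model is linear, the intercept cancels in the difference.
    """
    return sum(COEFFICIENTS[d] * delta for d, delta in t_score_changes.items())


# A 10-point (one standard deviation) improvement in physical function.
print(round(predicted_utility_change({"physical_function": 10}), 4))  # 0.077

# A 10-point worsening in both pain impact and fatigue.
print(round(predicted_utility_change({"pain_impact": 10, "fatigue": 10}), 4))  # -0.061
```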

More recently, Klapproth et al. collected EQ-5D-5L and PROMIS-29 profiles from general populations in the UK, France and Germany to find the best models to predict the EQ-5D-5L crosswalk from all seven PROMIS-29 domains [38]. The authors applied stepwise regression with backward selection, starting with full models that incorporated linear, quadratic and cubic effects for all seven PROMIS-29 domains and age and sex, for each included country separately. In addition, the authors compared Revicki’s model and their optimal models by applying all models to the European data. Besides the country-specific coefficients, the main differences between Revicki’s model and Klapproth’s model are that the latter incorporated two additional PROMIS-29 domains (social roles and sleep disturbance) and higher-order effects for the prediction of health state utilities. The authors reported that applying Revicki’s model to the European data would systematically underestimate the EQ-5D-5L crosswalk for the UK and France but not for Germany, while Revicki’s model performs well in the upper ranges of health. The authors highlighted that, due to the country-specific nature of health state utility, mapping algorithms for health state utility should not be generalised across countries.

Comparing PROMIS-29 utility score to EQ-5D utility score

Characteristics and correlation of the utility scores

Two studies from the US provided descriptive statistics on the range and distribution of the utility scores using data from the general population; both examined convergent validity [30, 31] and one also assessed known-group validity [31]. Although the two studies used different value sets for PROMIS-29 and different EQ-5D measures (Footnote 5), both showed that the PROMIS-29 utility score had a wider range and was more granular than the EQ-5D utility score. There were discontinuities in distribution and ceiling effects for the EQ-5D utility score, while the PROMIS-29 utility score did not have a ceiling effect. This indicates that respondents are more likely to report mild health problems, rather than full health, on PROMIS-29 than on EQ-5D. Although the PROMIS-29 utility score was much lower than that of EQ-5D, PROMIS-29 and its value sets might be more sensitive in capturing changes in utilities, particularly for mild health problems. The correlation coefficients between the actual utility scores ranged from 0.70 to 0.77 [30, 31]. Hanmer et al. (2018) examined known-group validity and showed that both EQ-5D-5L value sets and PROPr can distinguish between socio-demographic groups and the presence of self-reported chronic conditions [31]. However, EQ-5D-5L and PROPr provided different utility estimates across different groups.

Scale transformation

Hartman and Craig [30] developed a utility scale transformation of PROMIS-29 utility to EQ-5D-3L utility, but we have not found any study validating this model. Using a sample of 2623 adults from the US, they transformed PROMIS-29 utility loss (i.e. 1 minus the utility value) to EQ-5D utility loss. The best fit model suggested that EQ-5D-3L utility loss equals 0.1784 × (PROMIS-29 loss)^0.7286 (Footnote 6). The authors also assessed external validity between EQ-5D-3L and rescaled PROMIS-29 utility by comparing them to general health responses (EQ VAS scores and 5-level self-rated health), and they found that, as self-reported health worsened, the mean utility loss of both instruments increased more rapidly (i.e. the utility score decreased more rapidly), as might be expected.
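
The transformation can be sketched in Python as follows. This is a minimal illustration, not the authors’ code: it assumes the best fit model is a power function, with 0.7286 read as the exponent applied to the PROMIS-29 loss, and the function names are hypothetical.

```python
def eq5d3l_loss_from_promis29_loss(promis29_loss, scale=0.1784, power=0.7286):
    """Map a PROMIS-29 utility loss to an EQ-5D-3L utility loss.

    Assumes the power-function form: loss_EQ = scale * loss_PROMIS ** power.
    A loss is defined as 1 minus the utility value, so it cannot be negative.
    """
    if promis29_loss < 0:
        raise ValueError("utility loss must be non-negative")
    return scale * promis29_loss ** power


def rescaled_eq5d3l_utility(promis29_utility):
    """Rescale a PROMIS-29 utility to the EQ-5D-3L utility scale."""
    loss = 1.0 - promis29_utility
    return 1.0 - eq5d3l_loss_from_promis29_loss(loss)


print(rescaled_eq5d3l_utility(1.0))            # 1.0 (no loss on either scale)
print(round(rescaled_eq5d3l_utility(0.0), 4))  # 0.8216
```

Under this reading, a PROMIS-29 utility loss maps to a smaller EQ-5D-3L loss whenever the PROMIS-29 loss is 0.002 or larger, which is consistent with PROMIS-29 utilities being lower than EQ-5D utilities.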

Discussion

This review found only nine studies that investigated the relationship between PROMIS-29 and EQ-5D, five of which have been published very recently (in 2020 or 2021). There are three main findings of note from the literature. First, PROMIS-29 and EQ-5D instruments have clearly different descriptive systems. There are overlaps but also divergence in the health domains measured (Footnote 7). PROMIS-29 captures a broader range of health states than EQ-5D-3L and, unlike EQ-5D-3L, does not have ceiling effects, indicating that PROMIS-29 may have more descriptive power. However, there is little evidence on the relative merit of PROMIS-29 over the EQ-5D-5L, given the reduced ceiling effects in the latter compared with its EQ-5D-3L predecessor. Second, mapping from PROMIS-29 domains to predict EQ-5D-3L and EQ-5D-5L utilities is feasible. However, the evidence on the performance of the prediction models is mixed, and this has implications for using the predicted utility scores. Third, PROMIS-29 utilities and EQ-5D (3L and 5L) utilities are highly correlated, but PROMIS-29 utilities are much lower than EQ-5D utilities. The implication of this for their use in cost-effectiveness analysis is not obvious, as the focus is on the incremental changes in QALYs, so some of the differences in utilities between the instruments may cancel out. It is possible to rescale PROMIS-29 utilities to EQ-5D-3L utilities, but we have not found external studies assessing the performance of the model.

The body of evidence is surprisingly small, given the widespread use of EQ-5D in the health utility field and beyond, and the growing importance of PROMIS-29 as a concise multidimensional preference-accompanied measure [14, 41]. The number of studies is also small compared to the number of studies exploring the relationships among alternative widely used preference-accompanied PROMs, such as SF-6D, the Quality of Well-Being Scale and the Health Utilities Index (HUI) [e.g. 42,43,44,45,46]. For the comparison between PROMIS-29 and EQ-5D-3L, we did not find any study examining the correlations between the domain responses of the two measures, and no study conducted response mapping. There is an increasing number of studies including both PROMIS-29 and EQ-5D-5L; however, there is currently a dearth of literature regarding the value set characteristics and scale transformation. This is particularly important, given the growing evidence on the superiority of the EQ-5D-5L over EQ-5D-3L as a descriptive system, reflected in increasing numbers of clinical trials and other studies opting to use the EQ-5D-5L rather than the three-level version [47,48,49,50,51,52].

There are several methodological considerations arising from the existing literature. First, evidence on the performance of the prediction model of EQ-5D-3L utilities from PROMIS-29 domain scores is mixed. Recent studies pointed out that the US model developed by Revicki et al. [28] was regression-based, which was not suitable for mapping due to regression to the mean (which is a common problem across all mapping studies), and that the predicted EQ-5D utilities were inaccurate, particularly when applying the algorithm to populations from another country [53, 54]. There is no external validation for the recently developed European prediction model of EQ-5D-5L utilities from the PROMIS-29 profile. The validation and the choice of model may have an impact on the calculation of QALYs. Second, methodological considerations regarding the choice of value sets for instruments have not been fully addressed in scale transformation. Scale transformation of preference-accompanied PROMs can help to bridge the gap between instruments in economic evaluations. However, the discordance in scale results not only from the difference in instruments used, but also from the value sets chosen for the instruments. Hanmer et al. used the US EQ-5D-3L crosswalk value set for their EQ-5D-5L measure [31], but we know that there are important differences between the values produced by crosswalking between 5L and 3L and those from EQ-5D-5L value sets [35, 55]. The two available sets of preference weights for PROMIS-29 also differ in several ways, including the domains included and the elicitation method. It is important to understand the characteristics of the value set before comparing utility scores using data from the general population or patient groups. In particular, with the availability of US preference weights for the EQ-5D-5L [35], an updated analysis of scale transformation is possible and needed. Third, the recruited participants in the existing literature may not be representative, either of the general population or of specific disease groups. The respondents in one of the US studies using general population samples were more likely to be older and white [31]. Seven of the nine studies were from the US, and these results may not be generalisable to other settings where PROMIS-29 (and other PROMIS measures) have been translated and used [14, 38, 56].

Due to the lack of evidence on the relationship between the two instruments, the implications of instrument choice for the evidence to be considered in the decision-making process are not clear. PROMIS-29 does not have ceiling and floor effects when describing health states and is more sensitive to differences in health states compared to EQ-5D. However, it takes more time to complete the former instrument. PROMIS-29 and EQ-5D cover some similar but also some different domains, which may influence the choice of which to use, e.g. in clinical settings. The fact that PROMIS-29 generates lower utility values compared to EQ-5D might make PROMIS-29 a less attractive instrument to use (for more detail on the differences in utilities see Pan et al. (2020) [11]). However, as noted earlier, the effect of this on estimates of QALY gains is ambiguous. Perhaps the more important limiting factor in the use of PROMIS-29 for QALY estimation to inform health technology appraisal is the fact that utilities for it are thus far only available in the US. Researchers collecting PROMIS-29 can estimate utility in multiple ways: utility generated directly from PROMIS-29, EQ-5D utility rescaled from PROMIS-29 utility, or EQ-5D utility or HUI utility [57] predicted from PROMIS-29 domains with different models; similarly, EQ-5D-5L utilities can be obtained either by crosswalking or by use of EQ-5D-5L value sets. This complicates the assessment of how instruments and utilities compare; it also risks gaming, with the choice of utilities being made on the basis of what will yield the most favourable cost-effectiveness results.

Given the gaps in research evidence and limitations of the existing literature, this review has identified several future research areas. First, a direct comparison of PROMIS-29 and EQ-5D (3L and 5L) as descriptive systems should be the starting point to understand the associations and differences between the two measures. The descriptive systems provide information on the aspects of health measured by the instruments and on the feasibility of the mapping approach, which depends on the assumption that there is conceptual overlap between the descriptive systems [58]. Second, more research is needed to explore the relationships between PROMIS-29 and EQ-5D-5L, given that the latter is superseding EQ-5D-3L. Such research includes a comparison of the two instruments as descriptive systems, a comparison of the characteristics of the value sets, validation of the EQ-5D-5L mapping algorithm from PROMIS-29 domains, particularly in patient groups, and scale transformation of utility scores using data from the general population and patient populations. In ongoing work, we compare the characteristics of all the theoretical values from both value sets [11], and assess the impact of differences in value sets using patient data [59]. Third, evidence from countries outside the US is also needed. International networks and collaborations such as the PROMIS Health Organization International Committee could help coordinate and promote the development, validation and use of PROMIS instruments across countries. Fourth, with the availability of preference weights for PROMIS-29, additional analysis comparing the utility scores generated from the two instruments can be conducted among studies that have already collected data from general populations and patient groups. Lastly, further work is needed to collect and report patient outcomes from both instruments in a variety of clinical contexts in order to improve understanding of the differences between them and the impact of these differences in longitudinal settings.

Conclusion

This review has highlighted the dearth of studies examining the relationship between PROMIS-29 and EQ-5D. Given the widespread use of EQ-5D in the health utility field and beyond, and the growing use of PROMIS-29, this is an important gap in evidence. Future work is needed to better understand the relationship between the two descriptive systems and their utility scores, so that the implications for the use of evidence from these instruments in reimbursement decisions are clear.