Abstract
The PROsetta Stone Project, summarized in this issue by Schalet et al. (Psychometrika 86, 2021), is a major step forward in enabling comparability between different patient-reported outcome measures. Schalet et al. clearly describe the psychometric methods used in the PROsetta Stone project and other projects from the Patient-Reported Outcomes Measurement Information System (PROMIS): linking based on unidimensional item response theory (IRT), equipercentile linking, and calibrated projection based on multidimensional IRT. Analyses in a validation data set and simulation studies provide strong support that the linking methods are robust when basic assumptions are fulfilled. The links already established will be of great value to the field, and the methodology described by Schalet et al. will hopefully inspire the next series of linking studies. Among potential improvements that should be considered by new studies are: (1) a thorough evaluation of the content of the measures to be linked, to better guide the evaluation of measurement assumptions, and (2) improvements in the design of linking studies, such as selecting the optimal sample to provide data in the score ranges where linking precision is most critical and using counterbalanced designs to control for order effects. Finally, it may be useful to consider how the linking algorithms are used in subsequent data analyses. Analytic strategies based on plausible values or latent regression IRT models may be preferable to the simple transformation of scores from one patient at a time.
1 Introduction
The “Tower of Babel” problem for patient-reported outcome measures (PROMs) was explicitly stated more than 30 years ago (Mor and Guadagnoli 1988; van Knippenberg and de Haes 1988). In 1988, according to these authors, PROMs had proliferated without a uniform approach, without a clear conceptual framework, and with only limited agreement on the definition of core concepts. All these factors hindered the comparison of scores. Since then, the proliferation of PROMs has only increased. Luckily, the conditions for overcoming the Tower of Babel problem have improved as well. While the field has not agreed on one overall conceptual framework, there is practical agreement on a number of core concepts, the field has established standardized methods for achieving content validity, and there is increasing alignment on ways to phrase questions and response choices. Last, but not least, methods to link scores from different PROMs have been imported and adapted from educational testing. The excellent paper by Schalet et al. (2021) describes the steps taken when using the current standard: linking based on unidimensional item response theory (IRT) models. Useful comparisons are also made with other approaches: equipercentile linking and calibrated projection. The authors’ careful empirical analysis convincingly demonstrates that when basic assumptions are fulfilled, the three approaches concur, and these results are reasonably robust in a new data set. Further, Schalet et al. use simulation studies to identify situations where equipercentile linking or calibrated projection may be better choices for linking. These excellent analyses and clear results leave little to add regarding the psychometric work. Instead, I will comment on some issues around this solid psychometric foundation: that a content analysis may be a helpful supplement to correlations and factor analyses in evaluating unidimensionality (Sect. 2), that an optimal design of the linking study may help make the results more robust (Sect. 3), and that if the ultimate aim of the linking study is to enable better group comparisons, there may be alternative approaches to linking the score from each individual participant (Sect. 4).
2 Content Analysis
In their analysis of sufficient unidimensionality of the measures to be linked, Schalet et al. rely on score correlations supplemented by confirmatory factor analyses. They cite Dorans (2004) for using 0.866 as a lower bound for an acceptable correlation, but note that correlations in the range of 0.70 to 0.85 may be acceptable if the purpose is to enable group comparisons. I would argue that content analysis may add insight into whether two measures can be linked and in what situations linking may be problematic. Table 1 presents results of a content analysis of the two measures linked by Schalet et al.: the PHQ-9 (Kroenke et al. 2001) and the PROMIS depression item bank (Pilkonis et al. 2011). I used the DSM-IV (2000) depression criteria as the organizing framework, since the PHQ-9 was built to reflect these criteria (Kroenke et al. 2001). However, while the developers of the PROMIS item bank were well aware of the DSM-IV depression criteria, they also relied on other conceptual frameworks and patient interviews in the item bank development. Also, some subdomains, like fatigue and sleep, are covered by other item banks and therefore only covered sparsely in the depression item bank. Finally, the PROMIS depression item bank avoids items concerning somatic symptoms such as weight gain or loss, fatigue, and psychomotor speed, since these symptoms may cause problems for evaluating psychiatric morbidity in patients with somatic disease (see, e.g., Holzapfel et al. 2008). Thus, despite considerable content overlap between PROMIS depression item bank and the PHQ-9, there are also distinct differences, suggesting that the link between the two tools may be different in patients with somatic disease. Confounding by somatic symptoms may explain the discrepancy between cross-walked and actual PROMIS depression scores found in some studies of somatic patients (Katzan et al. 2017; Kim et al. 2017). While Schalet et al. 
should be applauded for testing the robustness of their linking across gender and age group, testing whether the linking is also valid for patients with somatic disease would be advisable. A content analysis may be useful in identifying such potential problems.
3 Design Considerations
Schalet et al. made excellent use of archival data from general population samples. However, it may be useful to consider the optimal design for a linking study. The PROMIS depression item bank and the PHQ-9 were developed for use in clinical research and diagnosis of depression. The two measures have optimal precision in the T-score range of 45 to 80—the range relevant for assessing clinical depression severity. However, scores between 70 and 80 are rare in the general population. In this range, the linking methods show discrepant results. As Schalet et al. note for the equipercentile scoring method, the discrepancy in this score range is likely to be partly caused by sparse data. A patient sample including participants with high depression scores would provide a more robust link in the severe score range.
Rather than the random-groups design often used in educational linking studies (Kolen and Brennan 2014), the PROsetta Stone project chose a single-group design. This design has advantages in the ability to check the unidimensionality assumption and check for agreement between linked and observed scores. Also, the single-group design has greater statistical power for a given sample size. The main potential problem of the single-group design is the possibility of order effects, e.g., due to respondent fatigue. Schalet et al. note that the order of questionnaire administration should be counterbalanced but suggest that this is less critical in patient-reported outcome (PRO) research since test-taking fatigue is unlikely to be a major factor. However, available evidence shows otherwise. In a PROMIS study of methods of administration, two parallel depression short forms were developed from the PROMIS depression item bank (Bjorner et al. 2014). The forms were administered in counterbalanced order, allowing estimation of the order effect. Results showed highly significant order effects: scores for whichever form was administered last were 1.94 to 4.68 T-score points lower, indicating less depression. These results suggest that counterbalancing may also be important for linking studies in the PRO field.
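To make the logic of counterbalancing concrete, the following is a minimal simulation sketch of a two-form counterbalanced design. The numbers are assumptions chosen for illustration (an order effect of −3 T-score points, within the range reported above, and a form difference of 1 point), not the published estimates; the point is only that within-person score differences, averaged across the two administration orders, separate the order effect from the form effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000                      # respondents per order group (illustrative)
d_form, d_order = 1.0, -3.0   # assumed form and order effects, T-score points

theta1 = rng.normal(50, 10, n)   # group 1: form A first, then form B
theta2 = rng.normal(50, 10, n)   # group 2: form B first, then form A
noise = lambda: rng.normal(0, 5, n)

# Observed scores: whichever form comes second picks up the order effect.
a_first  = theta1 + noise()
b_second = theta1 + d_form + d_order + noise()
b_first  = theta2 + d_form + noise()
a_second = theta2 + d_order + noise()

# Within-person differences confound form and order within each group,
# but counterbalancing lets their sum and difference separate the two.
g1_diff = (b_second - a_first).mean()   # estimates d_form + d_order
g2_diff = (a_second - b_first).mean()   # estimates d_order - d_form
order_hat = (g1_diff + g2_diff) / 2
form_hat  = (g1_diff - g2_diff) / 2
print(f"order effect: {order_hat:.2f}, form effect: {form_hat:.2f}")
```

In a single-group design without counterbalancing, only one of the two group differences is observed, so the order effect is absorbed into the estimated form difference and biases the link.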
4 Linking Procedures for Group Comparisons
Schalet et al. provide a very useful discussion of the differences between the purposes of linking in educational testing and in PRO research. One difference is that educational testing often involves decisions made on the basis of an individual equated score or cut-off value. In contrast, PROMs are mostly used for comparisons of groups. Given this emphasis on group-based analyses, it may not be wise to apply the linking procedure of Schalet et al. in the most simplistic way: for each person who has answered the PHQ-9, simply estimate a score on the PROMIS metric. To illustrate, I simulated a data set using the same item parameters and sum-score linking procedures (Choi et al. 2014) that were evaluated by Schalet et al. The results from these simulations are presented in Fig. 1. The left column shows the distribution of score estimates by applying expected a posteriori (EAP) estimation based on the PROMIS depression item bank, EAP estimation based on the PHQ-9 items, and applying the sum score transformation algorithm. While all procedures were effective in capturing the correct mean of 50, the standard deviation is underestimated when using the PHQ-9 items and the score distribution is far from normal, due to floor effects that still exist after linking. The right column in Fig. 1 shows “score distributions” using a plausible values approach (Mislevy 1991)—well known in psychometric research. The plots illustrate that this approach was very effective in estimating the correct mean and standard deviation and achieving a score distribution more similar to the generating latent distribution. Thus, the plausible values approach or latent regression IRT models may be useful additions to the linking procedures discussed by Schalet et al. (also, see Fischer and Rose 2019).
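As a rough sketch of why plausible values recover the score distribution while person-level EAP estimates do not, the following simulation uses a deliberately simplified setup: a short dichotomous 2PL test with invented item parameters standing in for the PHQ-9 (the actual analyses used graded-response models and the published PROMIS and PHQ-9 calibrations). EAP scores are posterior means and are shrunken toward the population mean, so their standard deviation understates the latent standard deviation; a plausible value is a random draw from each person's posterior, which restores the marginal variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2PL item parameters, invented for this sketch.
a = np.array([1.5, 1.2, 1.8, 1.0, 1.4, 1.6, 1.1, 1.3, 1.7])    # discrimination
b = np.array([-1.0, -0.5, 0.0, 0.3, 0.6, 1.0, 1.3, 1.6, 2.0])  # difficulty

n_persons = 5000
theta = rng.normal(0.0, 1.0, n_persons)          # generating latent scores
p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
x = (rng.random((n_persons, len(a))) < p).astype(int)

# Posterior over theta on a quadrature grid, standard normal prior.
grid = np.linspace(-4, 4, 81)
prior = np.exp(-0.5 * grid**2)
pg = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))    # (grid, items)
loglik = x @ np.log(pg.T) + (1 - x) @ np.log(1 - pg.T)  # (persons, grid)
post = np.exp(loglik) * prior
post /= post.sum(axis=1, keepdims=True)

# EAP: posterior mean per person -> shrunken, SD underestimated.
eap = post @ grid
# Plausible value: one draw from each person's posterior -> SD recovered.
cdf = np.cumsum(post, axis=1)
u = rng.random((n_persons, 1))
pv = grid[(cdf < u).sum(axis=1)]

print(f"generating SD: {theta.std():.2f}")
print(f"EAP SD:        {eap.std():.2f}")   # below the generating SD
print(f"PV SD:         {pv.std():.2f}")    # close to the generating SD
```

In practice, several plausible values are drawn per person and analyses are combined across draws using multiple-imputation rules (Mislevy 1991); a single draw is shown here only to illustrate the variance argument.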
5 Conclusion
While the PROsetta Stone project is not the first project to link PROMs (see, e.g., Orlando et al. 2000; Bjorner et al. 2003), it is the largest and most ambitious of such efforts within the field of PROMs. While some details might be improved, the links already established will be of great value to the field. Similarly, the excellent summary by Schalet et al. will be a great help to the next generation of researchers seeking to overcome the Tower of Babel problem.
References
Bjorner, J. B., Kosinski, M., & Ware, J. E., Jr. (2003). Using item response theory to calibrate the Headache Impact Test (HIT) to the metric of traditional headache scales. Quality of Life Research, 12(8), 981–1002.
Bjorner, J. B., Rose, M., Gandek, B., Stone, A. A., Junghaenel, D. U., & Ware, J. E., Jr. (2014). Method of administration of PROMIS scales did not significantly impact score level, reliability, or validity. Journal of Clinical Epidemiology, 67(1), 108–113. https://doi.org/10.1016/j.jclinepi.2013.07.016.
Choi, S. W., Schalet, B., Cook, K. F., & Cella, D. (2014). Establishing a common metric for depressive symptoms: Linking the BDI-II, CES-D, and PHQ-9 to PROMIS depression. Psychological Assessment, 26(2), 513–527. https://doi.org/10.1037/a0035768.
American Psychiatric Association. (2000). Diagnostic and statistical manual of mental disorders: DSM-IV-TR (4th ed., text rev.). Washington, DC: American Psychiatric Association.
Dorans, N. J. (2004). Equating, concordance, and expectation. Applied Psychological Measurement, 28(4), 227–246.
Fischer, H. F., & Rose, M. (2019). Scoring depression on a common metric: A comparison of EAP estimation, plausible value imputation, and full Bayesian IRT modeling. Multivariate Behavioral Research, 54(1), 85–99.
Holzapfel, N., Müller-Tasch, T., Wild, B., Jünger, J., Zugck, C., Remppis, A., et al. (2008). Depression profile in patients with and without chronic heart failure. Journal of Affective Disorders, 105(1–3), 53–62.
Katzan, I. L., Fan, Y., Griffith, S. D., Crane, P. K., Thompson, N. R., & Cella, D. (2017). Scale linking to enable patient-reported outcome performance measures assessed with different patient-reported outcome measures. Value in Health, 20(8), 1143–1149.
Kim, J., Chung, H., Askew, R. L., Park, R., Jones, S. M. W., Cook, K. F., et al. (2017). Translating CESD-20 and PHQ-9 scores to PROMIS depression. Assessment, 24(3), 300–307.
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). New York, NY: Springer.
Kroenke, K., Spitzer, R. L., & Williams, J. B. W. (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177–196.
Mor, V., & Guadagnoli, E. (1988). Quality of life measurement: A psychometric tower of Babel. Journal of Clinical Epidemiology, 41(11), 1055–1058.
Orlando, M., Sherbourne, C. D., & Thissen, D. (2000). Summed-score linking using item response theory: Application to depression measurement. Psychological Assessment, 12(3), 354–359.
Pilkonis, P. A., Choi, S. W., Reise, S. P., Stover, A. M., Riley, W. T., Cella, D., & PROMIS Cooperative Group. (2011). Item banks for measuring emotional distress from the Patient-Reported Outcomes Measurement Information System (PROMIS®): Depression, anxiety, and anger. Assessment, 18(3), 263–283.
Schalet, B. D., Lim, S., Cella, D., & Choi, S. W. (2021). Linking scores with patient-reported health outcome instruments: A validation study and comparison of three linking methods. Psychometrika, 86. https://doi.org/10.1007/s11336-021-09776-z.
van Knippenberg, F. C., & de Haes, J. C. (1988). Measuring the quality of life of cancer patients: Psychometric properties of instruments. Journal of Clinical Epidemiology, 41(11), 1043–1053.
Bjorner, J.B. Solving the Tower of Babel Problem for Patient-Reported Outcome Measures. Psychometrika 86, 747–753 (2021). https://doi.org/10.1007/s11336-021-09778-x