Introduction

The Patient-Reported Outcomes Measurement Information System (PROMIS) is a set of instruments measuring patient-reported health [1, 2]. PROMIS instruments consist of item banks, each of which is a set of items (questions) that measure one health domain. These banks can be administered as short forms (fixed-length subsets of items from the item bank) or as highly efficient computerized adaptive tests (CATs). A CAT is a computer-administered measure in which successive items are selected by a computer algorithm informed by the responses to previous items. Persons generally have to complete only a small number (3–7) of highly informative and relevant items to obtain a reliable (r = 0.90) score. Overall, compared with traditional Patient-Reported Outcome Measures, PROMIS instruments are less burdensome, have less measurement error and better content validity, and are easy to interpret [3,4,5,6].

One of the goals when developing PROMIS was to create measures that would be universally applicable. A universal measure should be applicable within multiple (patient) populations and should also be valid for comparisons across (patient) populations. PROMIS item banks have been developed using item response theory (IRT). Validity of comparisons between populations, in the context of IRT, is plausible if the item parameters are equivalent between the comparison populations at issue. Equivalence of item parameters implies the absence of the so-called differential item functioning (DIF) [7,8,9].

Three of the most commonly used PROMIS item banks are the PROMIS physical function (PROMIS-PF), pain interference (PROMIS-PI), and pain behavior (PROMIS-PB) banks. These banks showed good psychometric properties for cross-sectional use within different (patient) populations [10,11,12,13,14,15,16,17]. Furthermore, some studies on the PROMIS-PF, PROMIS-PI, and PROMIS-PB banks evaluated DIF with respect to language (Dutch-Flemish vs. English, Spanish vs. English, German vs. English) and demographic variables, such as age and gender. In these studies, either no DIF was found or the observed DIF had a negligible impact on the T-scores [10,11,12,13,14,15,16,17,18,19,20,21,22,23]. To our knowledge, no studies so far have examined DIF across patient populations for the PROMIS-PF, PROMIS-PI, and PROMIS-PB banks.

In patients with musculoskeletal disorders, physical functioning and pain are core outcomes. Health care providers, including rehabilitation physicians, rheumatologists, orthopedic surgeons, and physiotherapists, provide care to patients with different musculoskeletal disorders [e.g., patients with chronic pain, rheumatoid arthritis (RA), or osteoarthritis (OA)]. It would be beneficial to all these providers if one measure could be used in all these patient populations, in patients who have more than one of these disorders, and also to compare these populations with each other and with healthy persons. Therefore, the aim of the study is to investigate the validity of comparisons across patients with different musculoskeletal disorders and persons from the general population by evaluating DIF for the Dutch-Flemish PROMIS-PF, PROMIS-PI, and PROMIS-PB banks.

Methods

Samples

We used five datasets to study DIF across patient populations for the PROMIS-PF (V1.2), PROMIS-PI (V1.1), and PROMIS-PB (V1.1) banks. All datasets contained cross-sectional data including multiple item banks and most datasets combined response data of more than one sample.

The first dataset consisted of Dutch patients with chronic musculoskeletal pain (PAIN dataset). We used, firstly, data of a sample of patients with chronic pain from the Amsterdam Pain (AMS-PAIN) cohort. These data were collected at the rehabilitation outpatient department of Reade, a care center for rehabilitation and rheumatology, in the Netherlands (PROMIS-PF, n = 1247 [16]; PROMIS-PI, n = 1085 [14]; and, PROMIS-PB, n = 1042 [13]). We used, secondly, data of Dutch patients with chronic pain registered at practices of 31 participating physicians specialized in musculoskeletal medicine in the Netherlands (PROMIS-PI, n = 1677 [21]; PROMIS-PB, n = 1602 [20]). So, with respect to the PROMIS-PF bank, the dataset consisted of patients from the AMS-PAIN cohort only (AMS-PAIN dataset), whereas, with respect to the PROMIS-PI and PROMIS-PB banks, the dataset consisted of the two combined chronic pain samples (PAIN dataset). A preliminary analysis indicated no DIF between these two chronic pain samples for the PROMIS-PI and PROMIS-PB banks, supporting our decision to combine these two samples.

The second dataset comprised Dutch and Flemish patients with RA (RA dataset). The Dutch sample consisted of patients with RA from the Amsterdam Rheumatoid Arthritis cohort and the data were collected at the rheumatology outpatient department of Reade (PROMIS-PI, n = 1370; PROMIS-PB, n = 1005) [19]. The Flemish sample consisted of patients with RA from an arthritis cohort from University Hospitals Leuven, Flanders, the Dutch speaking part of Belgium (PROMIS-PI, n = 682; PROMIS-PB, n = 549) [19]. In a previous study, we found no DIF for language (Dutch vs. Flemish) for these item banks [19], which legitimizes the merging of the data from the two samples.

The third dataset consisted of Dutch patients with hip or knee OA (OA dataset). We used, firstly, response data of a sample of patients with hip or knee OA from the Amsterdam Osteoarthritis (AMS-OA) cohort. These data were collected at the rehabilitation outpatient department of Reade (PROMIS-PF, n = 425; PROMIS-PI, n = 425 [unpublished]). We used, secondly, response data of patients with early hip or knee OA from the Cohort Hip and Cohort Knee (CHECK) cohort [24]. These data were collected during a 10-year follow-up measurement at Erasmus Medical Center Rotterdam, Kennemer Gasthuis Haarlem, Leiden University Medical Center, Maastricht University Medical Center, Martini Hospital Groningen/Allied Health Care Center for Rheumatology and Rehabilitation Groningen, Medical Spectrum Twente Enschede/Ziekenhuisgroep Twente, Reade, Center for Rehabilitation and Rheumatology, St Maartenskliniek Nijmegen, University Medical Center Utrecht, and Wilhelmina Hospital Assen (PROMIS-PF, n = 822 [25]). So, with respect to the PROMIS-PF bank, the dataset consisted of the two combined datasets (OA dataset). A preliminary analysis indicated one item with DIF between these two OA samples for the PROMIS-PF bank, but the impact of this DIF was negligible, supporting our decision to combine these two samples. With respect to the PROMIS-PI bank, the dataset consisted of patients from the AMS-OA cohort only (AMS-OA dataset).

The fourth dataset consisted of Dutch patients who received any kind of physiotherapy (PT) in primary care in the year prior to completing the questionnaire (PT dataset, PROMIS-PF, n = 805 [17]). Forty-nine percent of the patients consulted PT because of disorders of muscles, bones or joints, and twelve percent as part of recovery after a surgery [17].

The fifth dataset represented a Dutch general population sample (GEN dataset). Participants were recruited from an existing internet panel of the general Dutch population, polled by a certified company (Desan Research Solutions) (PROMIS-PF, n = 1310; PROMIS-PI, n = 1052; and PROMIS-PB, n = 745 [unpublished]). The sample was representative of the Dutch general population (maximum of 2.5% deviation) with respect to the distribution of age, gender, education, region, and ethnicity, according to data from Statistics Netherlands in 2016.

Measures

The participants completed a paper-and-pencil or web-based survey which included, among others, demographic and clinical characteristics, and the Dutch-Flemish versions of the full PROMIS-PF, PROMIS-PI, or PROMIS-PB [26] banks.

The PROMIS-PF bank assesses a wide range of activities, from self-care (activities of daily living) to more complex activities that require a combination of skills (i.e., strenuous activities such as playing tennis, bicycling or jogging). The PROMIS-PF bank (V1.2) consists of 121 items, including items about functioning of the axial regions (neck and back), the upper and lower extremities, and ability to carry out instrumental activities of daily living (i.e., housework, shopping) [10]. There is no time frame set for the items, but current status is inferred. There are three different 5-point Likert response scales. For the PROMIS-PF bank, higher T-scores indicate higher (i.e., better) levels of physical function. The PROMIS-PF bank showed good psychometric properties for cross-sectional use within different populations [10, 15,16,17].

The PROMIS-PI bank assesses self-reported consequences of pain on relevant aspects of one’s life. This includes the extent to which pain hinders engagement with social, cognitive, emotional, physical, and recreational activities [27]. The PROMIS-PI bank (V1.1) consists of 40 items. The time frame is the past 7 days, and the bank uses three different 5-point Likert response scales [11, 27]. For the PROMIS-PI bank, higher T-scores indicate higher (i.e., worse) levels of pain interference. The PROMIS-PI bank showed good psychometric properties for cross-sectional use within different populations [11, 14, 19, 21].

The PROMIS-PB bank measures self-reported external manifestations of pain: behaviors that typically indicate to others that an individual is experiencing pain [28]. The PROMIS-PB bank (V1.1) contains 39 items. Patients rate how frequently they engaged in the pain behaviors in the past 7 days on a 6-point Likert response scale [12]. We excluded patients who endorsed the ‘had no pain’ response category on any of the items, resulting in IRT analyses with five response options [13, 29]. This is in line with later analyses of the PROMIS pain behavior item bank (resulting in version 2.0), in which the researchers decided to develop version 2.0 only for patients with pain and the response option ‘had no pain’ is no longer used [29]. For the PROMIS-PB bank, higher T-scores indicate higher (i.e., worse) levels of pain behavior. The PROMIS-PB bank also showed good psychometric properties for cross-sectional use within different populations [12, 13, 19, 20].

PROMIS scores are expressed as T-scores: a score of 50 represents the mean of the general population, and the standard deviation is 10.
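The T-score metric can be illustrated with a minimal sketch. It assumes that theta is expressed on a standardized metric (mean 0, standard deviation 1 in the reference population), so that the conversion is a linear rescaling; the function names are hypothetical and chosen for illustration only.

```python
def theta_to_t_score(theta: float) -> float:
    """Rescale a standardized theta (mean 0, SD 1) to the T-score metric (mean 50, SD 10)."""
    return 50.0 + 10.0 * theta


def t_score_to_theta(t_score: float) -> float:
    """Inverse rescaling: T-score back to the standardized theta metric."""
    return (t_score - 50.0) / 10.0


# One standard deviation below the reference mean corresponds to a T-score of 40.
print(theta_to_t_score(-1.0))
```

For the PROMIS-PF bank, a T-score below 50 would thus indicate worse physical function than the reference mean, whereas for the PROMIS-PI and PROMIS-PB banks a T-score above 50 would indicate more pain interference or pain behavior.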

Statistical analysis

In order to study the validity of comparisons across (patient) populations, we evaluated DIF across (patient) populations. For the PROMIS-PF bank, we made six comparisons: AMS-PAIN vs. OA, AMS-PAIN vs. PT, AMS-PAIN vs. GEN, OA vs. PT, OA vs. GEN, and PT vs. GEN (Table 1). With respect to the PROMIS-PI bank, we also made six comparisons: PAIN vs. RA, PAIN vs. AMS-OA, PAIN vs. GEN, RA vs. AMS-OA, RA vs. GEN, and AMS-OA vs. GEN (Table 1). For the PROMIS-PB bank, we made three comparisons: PAIN vs. RA, PAIN vs. GEN, and RA vs. GEN (Table 1).
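The number of comparisons per bank follows from taking all pairs of the datasets available for that bank. A minimal sketch, using the dataset labels from the text (the variable and function names are illustrative assumptions):

```python
from itertools import combinations

# Datasets available per item bank, in the order used in the text.
DATASETS_PER_BANK = {
    "PROMIS-PF": ["AMS-PAIN", "OA", "PT", "GEN"],
    "PROMIS-PI": ["PAIN", "RA", "AMS-OA", "GEN"],
    "PROMIS-PB": ["PAIN", "RA", "GEN"],
}


def pairwise_comparisons(datasets):
    """Return every unordered pair of datasets as a list of tuples."""
    return list(combinations(datasets, 2))


for bank, datasets in DATASETS_PER_BANK.items():
    pairs = pairwise_comparisons(datasets)
    print(bank, len(pairs), [" vs. ".join(pair) for pair in pairs])
```

Four datasets yield six pairs and three datasets yield three pairs, which matches the six, six, and three comparisons listed above.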

Table 1 Demographic and clinical characteristics of the patient populations with different musculoskeletal disorders and the general population sample per PROMIS item bank

DIF analyses evaluate if persons from different populations (e.g., OA vs. GEN) with similar levels of the domain (e.g., physical function) respond similarly to the items [7,8,9]. The absence of DIF implies valid comparisons of T-scores between the populations at issue. There are two kinds of DIF: uniform and non-uniform [7,8,9]. Uniform DIF exists if the magnitude of DIF is consistent across the entire range of the domain. Non-uniform DIF exists if the magnitude of DIF varies across the domain.

DIF was evaluated with the R package lordif (version 0.3-3), which uses an ordinal logistic regression framework [7, 30,31,32]. Three nested models were fitted per item; Fig. 1 shows a simplified version of the models originally published by Choi et al. [31]. These models are explained here using the physical function domain as an example. Model 1, the base model, assumes that a person’s item response is predicted only by the person’s level of physical function (theta or, in the context of PROMIS, the T-score). Model 2 posits that, in addition to the level of physical function, the item response is predicted by population membership (e.g., OA vs. GEN). Uniform DIF is identified if model 2 predicts the item response better than model 1. Model 3 adds an interaction term between the level of physical function and population membership and posits that the relation between the level of physical function and the item response differs across the populations being compared. Non-uniform DIF is present if model 3 predicts the item response better than model 2.
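The three models can be written out as nested cumulative logit models. The notation below is an assumption introduced for illustration (it does not appear in the text): u_i is the response to item i, k a response category, θ the level of physical function, g an indicator for population membership, α_k the category intercepts, and β_1 to β_3 the regression coefficients.

```latex
\begin{align*}
\text{Model 1:}\quad & \operatorname{logit} P(u_i \ge k) = \alpha_k + \beta_1 \theta \\
\text{Model 2:}\quad & \operatorname{logit} P(u_i \ge k) = \alpha_k + \beta_1 \theta + \beta_2 g \\
\text{Model 3:}\quad & \operatorname{logit} P(u_i \ge k) = \alpha_k + \beta_1 \theta + \beta_2 g + \beta_3 (\theta \times g)
\end{align*}
```

In this formulation, uniform DIF corresponds to a non-negligible contribution of the term β_2 g (model 2 vs. model 1), and non-uniform DIF to a non-negligible contribution of the interaction term β_3 (θ × g) (model 3 vs. model 2).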

Fig. 1 Models used in the ordinal logistic regression

There are several criteria for identifying DIF, and to date PROMIS researchers have mostly used the criterion of an R²-change of ≥ 0.02 [13, 15, 18, 23, 30, 33, 34]. In this study, we used a McFadden’s pseudo R²-change of 0.02 between two models as the critical value to flag possible DIF [35].
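The flagging rule can be sketched as follows. McFadden’s pseudo R² of a model is one minus the ratio of its log-likelihood to that of an intercept-only (null) model; the change between two nested models is then compared with the criterion of 0.02. The function names and the log-likelihood values below are hypothetical and serve only as an illustration; this is not the lordif implementation.

```python
def mcfadden_pseudo_r2(loglik_model: float, loglik_null: float) -> float:
    """McFadden's pseudo R-squared: 1 - lnL(model) / lnL(null model)."""
    return 1.0 - loglik_model / loglik_null


def flag_dif(loglik_null, loglik_m1, loglik_m2, loglik_m3, criterion=0.02):
    """Flag uniform and non-uniform DIF from the log-likelihoods of models 1 to 3."""
    r2_m1, r2_m2, r2_m3 = (
        mcfadden_pseudo_r2(ll, loglik_null) for ll in (loglik_m1, loglik_m2, loglik_m3)
    )
    change_12 = r2_m2 - r2_m1  # model 2 vs. model 1: uniform DIF
    change_23 = r2_m3 - r2_m2  # model 3 vs. model 2: non-uniform DIF
    return {
        "r2_change_12": change_12,
        "r2_change_23": change_23,
        "uniform_dif": change_12 >= criterion,
        "non_uniform_dif": change_23 >= criterion,
    }


# Hypothetical log-likelihoods for one item in one population comparison.
result = flag_dif(loglik_null=-1000.0, loglik_m1=-700.0, loglik_m2=-670.0, loglik_m3=-669.5)
print(result)
```

With these made-up values, the change between models 1 and 2 is about 0.03 (flagged as uniform DIF), and the change between models 2 and 3 is about 0.0005 (not flagged).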

If items were flagged for DIF, the impact of DIF on the item score and the T-score was examined by plotting item characteristic plots and test characteristic curves (TCCs), respectively. The item characteristic plots include four plots:

  1. The item characteristic curves (ICCs) or item true score functions per population. This plot illustrates which population has higher item scores across levels of theta.
  2. The absolute difference between the ICCs or differences in item true score functions. This plot shows the difference in item scores between the populations across levels of theta.
  3. The item response functions, including the item slope and threshold parameters, per population. This plot visualizes which population has higher probabilities of endorsing the response categories at issue across levels of theta. The thresholds indicate the level of theta necessary to respond above this threshold with 0.50 probability.
  4. The impact weighted by density. This plot shows the absolute difference in item scores weighted by the theta distribution of the samples [30,31,32].
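The quantities shown in these four plots can be sketched for a single item under a graded response model. All item parameters below are hypothetical, the response categories are scored 0 to 4 for convenience, and a normal density is assumed as a stand-in for the theta distribution of the sample; none of these choices is taken from the study itself.

```python
import math


def prob_above(theta, slope, threshold):
    """Probability of responding above a threshold (graded response model)."""
    return 1.0 / (1.0 + math.exp(-slope * (theta - threshold)))


def category_probs(theta, slope, thresholds):
    """Item response functions: probability of each response category at theta."""
    cum = [1.0] + [prob_above(theta, slope, b) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


def true_score(theta, slope, thresholds):
    """Item true score function: expected item score (categories scored 0..K)."""
    return sum(k * p for k, p in enumerate(category_probs(theta, slope, thresholds)))


def weighted_impact(item_a, item_b, grid, mean=0.0, sd=1.0):
    """Absolute true score difference, weighted by an assumed normal theta density."""
    weights = [math.exp(-0.5 * ((t - mean) / sd) ** 2) for t in grid]
    diffs = [abs(true_score(t, *item_a) - true_score(t, *item_b)) for t in grid]
    return sum(w * d for w, d in zip(weights, diffs)) / sum(weights)


# Hypothetical (slope, thresholds) per population; group A has higher thresholds.
GROUP_A = (2.0, [-2.0, -1.0, -0.5, 0.5])
GROUP_B = (2.0, [-2.4, -1.4, -0.9, 0.1])
GRID = [i / 10 for i in range(-40, 41)]  # theta from -4 to 4

print(round(true_score(0.0, *GROUP_A), 2), round(true_score(0.0, *GROUP_B), 2))
print(round(weighted_impact(GROUP_A, GROUP_B, GRID), 3))
```

Because the two hypothetical populations share the same slope and differ only in thresholds, the difference between their true score functions has the same sign across the whole theta range, which is the pattern expected for uniform DIF.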

The TCCs show, per item bank and per population comparison, the test scores (raw summary scores) for all items (ignoring DIF) in the left plot and the test scores for only the items with DIF in the right plot [30,31,32]. The area between the two curves within each plot provides an indication of the impact of DIF on the test score.

Results

Sample descriptives

Table 1 summarizes the demographic and clinical characteristics and PROMIS-PF, PROMIS-PI, and PROMIS-PB T-scores per dataset. The average age in the different samples, the proportion male vs. female, and the distribution of the duration of the conditions, match with the demographic and clinical characteristics in comparable populations [13, 14, 16, 17, 19,20,21]. Most clinical samples showed reduced physical function levels and elevated pain interference and pain behavior levels compared to the general population.

Differential item functioning

Table 2 summarizes the results. For the PROMIS-PF bank, 25 out of 121 items were flagged for DIF, of which 10 items were flagged in multiple comparisons and 3 items are present in the PROMIS-PF 20a short form. For the PROMIS-PI bank, only 2 out of 40 items were flagged for DIF; neither item is present in any PROMIS-PI short form. For the PROMIS-PB bank, only 3 out of 39 items were flagged for DIF, all of which are present in the PROMIS-PB 7a short form. All DIF items showed uniform DIF. Appendices 1–3 show the item characteristic plots and TCCs for the DIF items found in this study per bank.

Table 2 Results of the DIF analysis of the PROMIS-PF, PROMIS-PI, and PROMIS-PB banks

The interpretation of DIF is illustrated for the comparison of the OA population with the GEN population for item PFC41 (item 88) of the PROMIS-PF bank: “Are you able to sit down in and stand up from a low, soft couch?” The McFadden’s pseudo R²-change for the difference between models 2 and 3 was below the criterion of 0.02 (R²₂₃ = 0.0004), indicating no non-uniform DIF, but the McFadden’s pseudo R²-change for the difference between models 1 and 2 was above the criterion of 0.02 (R²₁₂ = 0.0264), indicating uniform DIF (Fig. 2, left upper plot). The threshold parameters for the OA population (−2.07, −1.14, −0.62, 0.13) were slightly higher than those for the GEN population (−2.48, −1.64, −1.12, −0.28), indicating that the OA population will endorse lower response categories at the same level of physical function (Fig. 2, left lower plot). For the interpretation of item PFC41, this means that, at the same level of physical function, the OA population is less likely to be able to sit down in and stand up from a low, soft couch than the GEN population.
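The effect of the higher thresholds in the OA population can be made concrete with a small sketch. It uses the threshold parameters reported above, but the common slope is a hypothetical value (the slope of item PFC41 is not reported in the text), so the printed probabilities are illustrative only.

```python
import math

ASSUMED_SLOPE = 3.0  # hypothetical: the item slope is not given in the text

# Threshold parameters of item PFC41 as reported for each population.
THRESHOLDS = {
    "OA": [-2.07, -1.14, -0.62, 0.13],
    "GEN": [-2.48, -1.64, -1.12, -0.28],
}


def prob_above(theta, threshold, slope=ASSUMED_SLOPE):
    """Probability of responding above a threshold at a given theta."""
    return 1.0 / (1.0 + math.exp(-slope * (theta - threshold)))


def probs_above_all(theta, group):
    """Probabilities of responding above each of the four thresholds."""
    return [prob_above(theta, b) for b in THRESHOLDS[group]]


theta = -1.0  # the same level of physical function in both populations
for group in ("OA", "GEN"):
    print(group, [round(p, 2) for p in probs_above_all(theta, group)])
```

At any given theta, each probability is lower for the OA population than for the GEN population, and the probability equals 0.50 exactly when theta equals the threshold, in line with the interpretation of the thresholds given in the Methods.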

Fig. 2 The item characteristic plots of item PFC41 (item 88), “Are you able to sit down in and stand up from a low, soft couch?”, from the PROMIS-PF bank for the comparison OA vs. GEN. The figure includes four plots: (1) the ICCs or item true score functions per population (OA vs. GEN), illustrating which population has higher item scores given the levels of theta; (2) the absolute difference between the ICCs or differences in item true score functions, showing the difference in item scores given the levels of theta; (3) the item response functions, including the item slope and threshold parameters, per population, visualizing which population has higher probabilities of endorsing the response categories at issue given the levels of theta; and (4) the impact weighted by density, showing the absolute difference in item scores weighted by the theta distribution of the samples. In this example, all four plots show negligible impact of DIF.

The overall impact of DIF on the item scores and T-scores for all item banks was negligible. For example, the item characteristic plots for item PFC41 showed a small difference between the item true score functions (Fig. 2, left upper plot), indicating that the difference in item score given the levels of theta was minimal. In addition, the TCCs of the PROMIS-PF bank, comparing the OA and GEN populations, showed that the area between the curves was negligible in both the left and the right plot, indicating minimal impact of DIF by population on test scores (see Fig. 3). Similar results were found for the other comparisons and banks in this study.

Fig. 3 The test characteristic curves (TCCs) of the PROMIS-PF bank for the comparison OA vs. GEN show the test scores (raw summary scores) for all 121 PROMIS-PF items (ignoring DIF) per population in the left plot, and the scores per population for only the 14 items with DIF in the right plot. The area between the two curves within one plot provides an indication of the impact of DIF on the test score, which in this example is negligible.

Only three items of the PROMIS-PF bank, items PFA51 (OA vs. PT), PFC33r1, and PFC40 (OA vs. AMS-PAIN), showed relatively high (> 0.05) R²-change values (R²₁₂ = 0.220, 0.065, and 0.052, respectively). The item characteristic plots of PFA51 and PFC40 showed relatively large differences in item true score functions and item response functions, and the item characteristic plots of PFC33r1 showed clustered item response functions (see Online Appendix 1). However, the corresponding TCCs indicate that the impact of DIF of these items on the test scores was minimal (see Online Appendix 1).

Discussion

Our aim was to investigate the validity of comparisons across populations of patients with different musculoskeletal disorders and persons from the general population when applying the PROMIS-PF, PROMIS-PI, and PROMIS-PB banks, by evaluating DIF across (patient) populations. We found some items with DIF, but the magnitude and impact of DIF on the T-scores were negligible, supporting the universal applicability of the item banks. The item banks can be used by health providers and clinical researchers to compare patients with different musculoskeletal disorders and healthy persons.

Although the impact of the DIF items on the T-scores was negligible in the current study, the DIF found for some items has a plausible explanation. For the PROMIS-PF bank, the two comparisons with the largest number of DIF items were those between patients with hip or knee OA and patients with chronic pain, and between patients with hip or knee OA and persons from the general population. In both comparisons, 14 items with DIF were found. For 12 out of 14 items with DIF, patients with hip or knee OA were less likely to endorse the items than patients with chronic pain, given the same level of theta. For instance, at the same level of function, patients with hip or knee OA were slightly less likely to be able to run 100 yards, to run two, five, or ten miles, to get up from or kneel on the floor, to squat and get up, to take a tub bath, and to sit on and get up from a low couch or toilet. At the same level of function, patients with hip or knee OA were less likely than persons from the general population to endorse that they were able to squat and get up, to get in and out of a car, to run ten miles, to sit down in and stand up from a couch or toilet, and to be out of bed most of the day. Moreover, given the same level of theta, patients with hip or knee OA were more likely than persons from the general population to report more difficulty in doing daily physical activities and more limitations in walking around the house, taking a shower, going for a short walk, and going outside the home. All these DIF results may be explained by the fact that the activities addressed in these items are specifically affected by knee and hip problems.

With respect to the PROMIS-PB bank, patients with RA were more likely than persons from the general population to endorse that they moved stiffly when they were in pain, given the same level of theta. This may be because stiffness is one of the typical clinical characteristics of RA.

For the Dutch-Flemish PROMIS-PF, PROMIS-PI, and PROMIS-PB banks, the validity of comparisons across populations has also been shown for other comparison populations. Previous studies showed no DIF, or DIF with negligible impact on the T-score, for sub-populations differing in age, gender, education level, administration mode (paper–pencil vs. web-based), disease activity, or language. Results on DIF in those studies are summarized in Table 3. The results of the current study, addressing the Dutch-Flemish PROMIS banks, can most likely be generalized to the original American-English PROMIS banks, as previous studies of the Dutch-Flemish PROMIS-PF, PROMIS-PI, and PROMIS-PB banks showed no DIF, or DIF with negligible impact, between the Dutch and English language versions [13, 14, 16]. The current results, combined with the results of previous studies on DIF for other variables, indicate that the item parameters seem to be quite stable across different (sub)populations. Only the PROMIS-PF bank may need more caution. For instance, the DIF found for items PFA51, PFC33r1, and PFC40 of the PROMIS-PF bank might be of slight concern.

Table 3 Number of items with DIF in previous studies of the PROMIS-PF, PROMIS-PI, and PROMIS-PB banks

Although the items with DIF do not seem to have a high impact on the item bank T-scores, the impact on short-form T-scores or CAT T-scores might be larger, since only a small number of items are administered. If the few items administered in a short form or CAT happen to include items with DIF, the effects of DIF could accumulate. Of the items with DIF found in the current study, three items (PFA51, PFA56, and PFC45r1) from the PROMIS-PF bank and three items (PAINBE24, PAINBE25, and PAINBE45) from the PROMIS-PB bank are present in the PROMIS-PF 20a short form and the PROMIS-PB 7a short form, respectively. The impact of DIF on short-form T-scores and CAT T-scores could be examined in a future study.

A strength of the study is that we were able to use large and diverse datasets. However, for future research it might be important to include patients with disorders that differ considerably from musculoskeletal disorders (e.g., patients with a heart condition, cancer, or stroke). A study limitation is that we used only the logistic regression method to detect DIF, with only a McFadden’s pseudo R²-change of 0.02 as the critical value, while multiple methods for detecting DIF and multiple criteria are available [33]. We chose our method and critical value because these are commonly applied in PROMIS studies [7, 13, 15, 18, 23, 30, 33]. For future studies, we recommend studying and comparing other methods and cut-off values as well, for instance the Monte Carlo simulation approach, which facilitates empirical identification of the critical R²-change value [31]. Future studies could also consider, as an alternative approach, multiple-group DIF analysis, which makes it possible to compare multiple clinical groups and a reference group simultaneously. A disadvantage of this alternative approach, however, is that it provides only an overall test for DIF between any of the groups and therefore offers less insight into the differences between individual groups.

In conclusion, this study contributes to the evidence for the universal applicability of PROMIS across (patient) populations. Moreover, our results provide evidence that comparisons across patients with different musculoskeletal disorders and persons from the general population are valid, when applying the PROMIS-PF, PROMIS-PI, and PROMIS-PB banks.