Introduction

Upper extremity (UE) musculoskeletal disorders are a common health problem, with estimated point prevalence rates ranging from 2 to 53%, and they impose a high burden on patients, health care, and society [1]. With the aging of the population, the burden of these disorders is expected to increase further [2]. Patients with UE musculoskeletal disorders suffer from symptoms such as pain and functional decline [3].

Numerous Patient-Reported Outcome Measures (PROMs) for measuring functional status in patients with UE musculoskeletal disorders are used in daily clinical care and in research, but these measures are not without problems [4,5,6,7,8,9]. There is a lack of convincing evidence regarding their measurement properties [9]. The variety and availability of multiple PROMs hamper the comparability of scores across conditions and settings. Traditional PROMs sometimes contain irrelevant questions, which can lead to incomplete questionnaires and place a high burden on respondents [10, 11]. Thus, several PROMs that are currently used do not meet the recommended minimum standards [12].

The Patient-Reported Outcomes Measurement Information System (PROMIS®) was initiated by six US research institutions and the National Institutes of Health (NIH), with the aim to improve the quality and comparability of health outcome measures, and to reduce the burden for respondents. To achieve this aim, item banks for measuring specified health domains have been developed and validated [13, 14]. An item bank is a set of items (questions), all measuring the same domain, e.g., physical function [15]. The items of an item bank are calibrated on a scale, using Item Response Theory (IRT) modeling, which enables the calculation of precise (reliable) and valid test (total) scores. Moreover, IRT-based item banks enable the use of short forms, i.e., fixed subsets of items from the item bank, and Computerized Adaptive Testing (CAT). CAT uses an algorithm that selects the most informative items from the item bank, based on the individual’s responses (answers) to previously administered items. In this way, high precision is combined with low patient burden [16, 17].
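The item selection step of a CAT can be illustrated with a minimal sketch. It assumes a simplified two-parameter logistic (2PL) model for dichotomous items rather than the polytomous IRT model used for PROMIS item banks, and the item names and parameters are hypothetical, not PROMIS calibrations. The function picks the not-yet-administered item with the highest Fisher information at the current trait estimate.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a 2PL item (discrimination a, difficulty b) at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_next_item(theta, items, administered):
    """Return the id of the unadministered item that is most informative at theta."""
    candidates = {k: v for k, v in items.items() if k not in administered}
    return max(candidates, key=lambda k: item_information(theta, *candidates[k]))


# Hypothetical item parameters (a, b); these are NOT PROMIS item parameters.
ITEMS = {"easy": (1.5, -2.0), "medium": (1.5, 0.0), "hard": (1.5, 2.0)}

# A respondent currently estimated at the population mean gets the "medium" item first.
first_item = select_next_item(0.0, ITEMS, administered=set())
```

In a full CAT this step would alternate with re-estimating theta after each response until a stopping rule (for example a target standard error) is met; those parts are omitted here.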

PROMIS includes a Physical Function (PROMIS-PF) item bank v2.0, consisting of 165 items, covering central (i.e., spinal), upper, and lower extremity functions, and activities of daily living [18, 19]. Subsets of items from the PROMIS-PF item bank form item banks of their own and can be used for measuring lower extremity-related (PROMIS Mobility) and UE-related physical function (PROMIS upper extremity [PROMIS-UE]), respectively [20]. Several studies have shown that the precursor of the current PROMIS-UE item bank, v1.2, which included only 15 items, exhibited a ceiling effect [21,22,23,24,25]. The newly developed and extended PROMIS-UE item bank v2.0, which includes 46 items, assesses a wider range of UE functioning, which might preclude this ceiling effect [26].

In 2010 the Dutch–Flemish PROMIS group was established, with the aim of translating the PROMIS item banks into Dutch–Flemish and implementing these item banks in the Netherlands and Flanders. Four of the 46 PROMIS-UE v2.0 items have not yet been translated into Dutch–Flemish. After translation of the new items, the psychometric properties of the entire Dutch–Flemish PROMIS-UE (DF-PROMIS-UE) item bank v2.0 should be established. Evaluating cross-cultural validity is important in order to determine whether the algorithm that calculates the IRT-based test scores for American patients is also applicable to Dutch and Flemish patients. Moreover, it is important for establishing the comparability of the scores of US patients and those of Dutch and Flemish patients, e.g., for benchmarking purposes. Evaluating construct validity is vital to determine whether the bank really measures the intended construct. Absence of floor and ceiling effects is important for the discriminative and evaluative properties of an instrument.

The aim of the current study was to develop the DF-PROMIS-UE item bank v2.0, to investigate its cross-cultural and construct validity, as well as its floor and ceiling effects in Dutch patients with musculoskeletal UE disorders.

Methods

This study consisted of two parts: (1) the development of the DF-PROMIS-UE item bank v2.0 and (2) the evaluation of some of its psychometric properties. The development of the DF-PROMIS-UE item bank v2.0 consisted of a translation project that included cognitive debriefing interviews in order to check the comprehensibility and relevance of the preliminary item translations. The evaluation of some of its measurement properties covered its cross-cultural and construct validity, and floor and ceiling effects.

Part 1: development

Translation

The translation of the PROMIS-UE items was integrated into a larger project to update the Dutch–Flemish PROMIS-PF (DF-PROMIS-PF) item bank from v1.2 (121 items) to v2.0 (165 items). All 45 newly developed PROMIS-PF items were translated into Dutch–Flemish, including the four new items of the DF-PROMIS-UE item bank v2.0. The translation process was performed in the same way as the previous translations of Dutch–Flemish PROMIS item banks, using state-of-the-art methodology [27,28,29]. In short, the process involved 2 forward translations (by 1 Dutch and 1 Flemish native speaker), 1 reconciled version, 1 back translation by a native English speaker, comparison of the original with the back translation, and reviews by 3 bilingual experts (2 Dutch and 1 Flemish). Cognitive debriefing interviews were conducted for all 45 newly developed PROMIS-PF items.

Participants

Debriefing sample

Consecutive eligible persons with ample knowledge of Dutch or Flemish were invited to participate in the cognitive debriefing interviews. A minimum of five native Dutch and five native Flemish patients, and five native Dutch and five native Flemish people from the general population, were invited to participate.

Part 2: evaluation measurement properties

Study design

A cross-sectional study design was used.

Participants

Dutch sample

Patients who visited the outpatient clinic of the orthopedic department of the OLVG, a large teaching hospital in Amsterdam, the Netherlands, were invited to participate in order to evaluate the measurement properties of the DF-PROMIS-UE item bank. Eligible patients were 18 years or older, had a musculoskeletal disorder of the UE, were able to read and write in Dutch, and were able to provide informed consent.

US sample

Existing response data from persons from a US online panel, who were 18 years or older and had some difficulty due to UE pain or function, were also used to evaluate the cross-cultural validity of the Dutch–Flemish and US PROMIS-UE item banks [26]. More information about these persons is provided elsewhere [30].

Procedures

This part of the study was approved by the local institutional review boards of Slotervaart/Reade (Reference Number P1749) and the OLVG. Patients visiting the outpatient clinic of the orthopedic department between February and May 2018 were invited to fill in a web-based (digital) or paper-and-pencil (paper) questionnaire that included, among others, the DF-PROMIS-UE item bank.

Measures

First, the questionnaire included questions addressing demographic data, i.e., age, gender, country of birth, educational level, and clinical characteristics, i.e., location of pain, disease duration, and type of disorder.

Second, the questionnaire included the full DF-PROMIS-UE item bank v2.0. This bank measures the construct (domain) UE functioning, which is defined as activities that require use of the UE including the shoulder, arm and hand [31]. The bank contains 46 items. There are two different 5-point Likert scale response scales: (1) Unable to do/With much difficulty/With some difficulty/With a little difficulty/Without any difficulty; (2) Cannot do/Quite a lot/Somewhat/Very little/Not at all. No timeframe is specified, but current status is assumed. Higher scores indicate better function. The total score of the DF-PROMIS-UE item bank is expressed as a T-score, which is a standardized score, with 50 representing the average score of the US general population and 10 being its standard deviation (SD).
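The T-score metric described above is a linear transformation of a standardized score. A minimal sketch, assuming the trait estimate (theta) is expressed in SD units relative to the reference population mean; the function names are illustrative and not part of any PROMIS software:

```python
def theta_to_t_score(theta):
    """Convert a standardized trait estimate (mean 0, SD 1) to the T-score metric."""
    return 50.0 + 10.0 * theta


def t_score_to_theta(t_score):
    """Convert a T-score back to SD units relative to the reference mean of 50."""
    return (t_score - 50.0) / 10.0


# A T-score of 35 lies 1.5 SD below the reference population mean.
sd_units = t_score_to_theta(35.0)
```

On this metric, a difference of 10 T-score points always corresponds to one SD of the reference population, which makes scores from different PROMIS instruments directly interpretable against the same reference.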

Third, the questionnaire included the Dutch–Flemish PROMIS Global Health Questionnaire v1.2. This questionnaire measures the overall evaluation of one’s physical and mental health. It contains 10 items. There are two subscales: global physical health (GPH; 4 items) and global mental health (GMH; 4 items) [32]. The scores of the Dutch–Flemish PROMIS Global Health subscales are also expressed as T-scores. We used the Dutch–Flemish PROMIS pain intensity item (Global07r) from this questionnaire as a legacy instrument for evaluating construct validity [32, 33]. It assesses pain intensity and consists of an 11-point numeric rating scale (NRS) with anchors 0 = “no pain” and 10 = “worst pain imaginable”.

Fourth, the questionnaire contained three disease-specific legacy instruments:

  1. The Disabilities of the Arm, Shoulder and Hand (DASH) questionnaire, subscale Disability/Symptoms, which measures physical function and symptoms in patients with musculoskeletal disorders of the upper limbs [3]. The subscale consists of 30 items. The time frame for the items is the past week. The total score ranges from 0 to 100, with higher scores indicating more disability. The DASH has satisfactory psychometric properties [4,5,6, 34, 35]. An official Dutch translation showed good psychometric properties [36, 37].

  2. The Functional Index for Hand Osteoarthritis (FIHOA), which assesses functional impairment in patients with hand osteoarthritis. It consists of 10 items. No time frame is specified, but current status is assumed. Total scores range from 0 to 30, with higher scores indicating more functional impairment. The psychometric properties of the FIHOA are good [38,39,40]. An official Dutch translation showed good psychometric properties as well [41].

  3. The Michigan Hand Outcomes Questionnaire (MHQ), subscale activities of daily living (MHQ-ADL), which assesses difficulty in performing daily activities for the right hand (5 items), the left hand (5 items), and both hands (7 items) in patients with conditions of, or injury to, the hand or wrist [42]. The time frame for the items is the past week. The MHQ-ADL total score is converted to a score from 0 to 100, with higher scores indicating less disability. The psychometric properties of the MHQ scale are good [42,43,44,45,46,47,48,49,50,51]. A Dutch translation of the MHQ showed good responsiveness [52].

Analysis

Demographic and clinical characteristics of the Dutch and US samples were summarized with descriptive statistics. Differences between the Dutch sample and the US sample were evaluated with χ² tests for categorical variables and independent-samples t-tests for continuous variables.

Cross-cultural validity of the DF-PROMIS-UE item bank was evaluated with differential item functioning (DIF) analyses. DIF analyses examine whether people from different groups (in this study: English- and Dutch-speaking patients) with the same level of the construct or trait (theta [θ], in this study: the UE function T-score) have different probabilities of giving a certain response to an item [16]. There are two types of DIF: uniform and non-uniform. Uniform DIF exists when the magnitude of DIF is constant across the trait. Non-uniform DIF exists when the magnitude of DIF varies across the trait, i.e., the item has a different discriminative ability in the groups. DIF for language was evaluated by ordinal logistic regression models with the item score as the dependent variable. An intercept-only model (Model 0) and three nested models were formed: Model 1 with theta as the explanatory variable, Model 2 with both theta and language as explanatory variables, and Model 3 with theta, language, and an interaction term for language and theta as explanatory variables. A change in McFadden’s pseudo R² of 2% (0.02) was used as the critical value to flag items with possible DIF [16, 53,54,55]. Items were flagged as possibly having non-uniform DIF if the R² values of Models 2 and 3 differed by more than 2%, and as possibly having uniform DIF if non-uniform DIF was absent and the R² values of Models 1 and 2 differed by more than 2%. If any items were flagged for DIF for language, the impact of DIF on the item scores was examined by plotting item characteristic curves (ICCs), and the impact of the DIF items on the test (total) score was examined by plotting test characteristic curves (TCCs). The TCC plots show the test score for all 46 PROMIS-UE items and the test score for the items flagged for DIF only [54, 55].
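The flagging rule can be sketched as follows. This is an illustration of the decision logic only, not the lordif implementation: it assumes the log-likelihoods of Models 0 to 3 have already been obtained from fitted ordinal logistic regressions, and the log-likelihood values in the example are hypothetical.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept-only model)."""
    return 1.0 - ll_model / ll_null


def flag_dif(ll0, ll1, ll2, ll3, threshold=0.02):
    """Classify an item as 'non-uniform', 'uniform', or 'none' DIF.

    ll0..ll3 are the log-likelihoods of Models 0..3 as defined in the text.
    """
    r2_1, r2_2, r2_3 = (mcfadden_r2(ll, ll0) for ll in (ll1, ll2, ll3))
    if r2_3 - r2_2 > threshold:   # interaction term adds explanatory power
        return "non-uniform"
    if r2_2 - r2_1 > threshold:   # language main effect adds explanatory power
        return "uniform"
    return "none"


# Hypothetical log-likelihoods for one item (not study data).
result = flag_dif(ll0=-1000.0, ll1=-700.0, ll2=-670.0, ll3=-668.0)
```

In this hypothetical case the pseudo R² rises from 0.30 (Model 1) to 0.33 (Model 2), a change above 0.02, while adding the interaction term changes it by only 0.002, so the item would be flagged for uniform DIF.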

Construct validity was evaluated by calculating the correlations of the DF-PROMIS-UE item bank v2.0 T-scores with the total scores of the legacy instruments. Pearson’s correlation coefficient r was used for normally distributed data and Spearman’s correlation coefficient ρ for non-normally distributed data. Hypotheses were formulated a priori regarding the expected correlations, according to the COSMIN guidelines [56, 57]. It was hypothesized that the DF-PROMIS-UE item bank would have a moderate negative correlation (−0.50 < r ≤ −0.30) with the Dutch–Flemish PROMIS pain intensity item [32, 33], given that these instruments are intended to measure related, but not identical, constructs (UE physical function and pain, respectively). Moreover, we hypothesized that the DF-PROMIS-UE item bank would have strong negative correlations (r ≤ −0.50) with the DASH subscale Disability/Symptoms [3] and the FIHOA scores [38,39,40] and a strong positive correlation (r ≥ 0.50) with the MHQ-ADL score [42], given that these instruments are intended to measure the same construct (UE physical function).
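The four a priori hypotheses can be written down as explicit checks. The sketch below encodes the correlation ranges stated above; the dictionary keys and the correlation values used in the example are hypothetical placeholders, not results from this study.

```python
# Expected correlation ranges with the DF-PROMIS-UE T-score, as hypothesized a priori.
HYPOTHESES = {
    "pain_intensity": lambda r: -0.50 < r <= -0.30,  # related construct: moderate, negative
    "DASH": lambda r: r <= -0.50,                    # same construct, higher = more disability
    "FIHOA": lambda r: r <= -0.50,                   # same construct, higher = more impairment
    "MHQ-ADL": lambda r: r >= 0.50,                  # same construct, higher = less disability
}


def hypothesis_confirmed(instrument, r):
    """Return True if the observed correlation r falls in the hypothesized range."""
    return HYPOTHESES[instrument](r)


# Hypothetical observed correlations (illustration only).
observed = {"pain_intensity": -0.42, "DASH": -0.80, "FIHOA": -0.45, "MHQ-ADL": 0.75}
outcome = {name: hypothesis_confirmed(name, r) for name, r in observed.items()}
```

With these placeholder values, three hypotheses would be confirmed and the FIHOA hypothesis would be rejected, because −0.45 is weaker than the hypothesized r ≤ −0.50.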

To evaluate floor and ceiling effects, the proportions of patients who achieved the lowest or highest possible raw scores were calculated for each measure. These proportions were calculated for the full DF-PROMIS-UE item bank (raw scores 46 and 230, respectively) and the Short Form 7a (raw scores 7 and 35, respectively) in the 212 participants who completed all items. For all measures, a floor effect referred to the proportion of patients at the score extreme indicating poor health status, whereas a ceiling effect referred to the proportion of patients at the score extreme indicating good health status. A proportion of 15% or more was considered a floor or ceiling effect [58].

We followed the international PROMIS standards with respect to the sample sizes for this study [59, 60]. These standards prescribe a minimum sample size of 200 participants for evaluating DIF between language groups and a sample size of 50–100 participants for evaluating construct validity.

DIF analyses were done in R using the package lordif (version 0.3-3), whereas all other analyses were done with IBM SPSS Statistics 25 (Armonk, New York, USA).
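The floor and ceiling criterion described in this section can be sketched as a small function. It assumes complete raw scores and uses the definition given above, in which "floor" always refers to the poor-health end of a scale, so the direction of scoring has to be passed in. The scores in the example are hypothetical.

```python
def floor_ceiling(raw_scores, min_score, max_score, higher_is_better=True, criterion=0.15):
    """Proportions of respondents at the poor-health (floor) and good-health
    (ceiling) score extremes, and whether each reaches the criterion."""
    n = len(raw_scores)
    at_min = sum(1 for s in raw_scores if s == min_score) / n
    at_max = sum(1 for s in raw_scores if s == max_score) / n
    # For scales where higher scores mean more disability, the maximum is the floor.
    floor, ceiling = (at_min, at_max) if higher_is_better else (at_max, at_min)
    return {
        "floor": floor,
        "ceiling": ceiling,
        "floor_effect": floor >= criterion,
        "ceiling_effect": ceiling >= criterion,
    }


# Hypothetical raw scores on a 7-item form scored 7..35, higher = better function.
scores = [7] * 1 + [35] * 4 + [20] * 15
summary = floor_ceiling(scores, min_score=7, max_score=35)
```

For an instrument such as the DASH, where higher scores indicate more disability, the same function would be called with `higher_is_better=False` so that respondents at the maximum score count toward the floor.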

Results

Part 1: translation

Table 1 provides an overview of the translated PROMIS-UE items. A sufficient Dutch–Flemish translation was obtained for the four new items from the DF-PROMIS-UE item bank v2.0, and no separate translations for Dutch and Flemish were required.

Table 1 Description of the four new items of the Dutch–Flemish PROMIS upper extremity item bank v2.0

In total, 28 native-speaking (18 Dutch and 10 Flemish) persons participated in cognitive debriefing interviews. Their mean age (standard deviation [SD]) was 46 (19) years, and 68% were female. Most participants were patients with UE disorders (68%) whereas the remaining participants were healthy persons without complaints (32%).

During cognitive debriefing, three of the four items (PFM2, PFM16, and PFM18) were considered less relevant, or were regarded as describing unusual activities, by some participants (both patients and people from the general population). Despite these comments, we decided to retain the items without adapting the translation in the preliminary DF-PROMIS-UE item bank, which enabled us to investigate whether DIF for language would occur for these items.

Part 2: evaluation measurement properties

Participants

In the Dutch sample, 371 patients were screened for eligibility, of whom 67 did not meet the selection criteria. Of the 304 patients fulfilling the selection criteria, 218 (72%) were willing to participate, provided informed consent, and completed the DF-PROMIS-UE item bank fully (n = 212) or partly (n = 6). Their data were used to study cross-cultural validity. Of the 304 patients fulfilling the selection criteria, 205 (67%) completed all measures, digitally (n = 199) or on paper (n = 6). Their data were used to study construct validity.

Table 2 summarizes the demographic and clinical characteristics of the Dutch and US samples. In the Dutch sample, the mean age was 53 years, half of the participants were female (50%), most were born in the Netherlands (73%), and most had at least a high school degree (92%). Most patients reported having pain in one or both shoulder(s) (76%) or arm(s) (56%). The most frequently reported types of disorder were trauma (33%) and physical (e.g., muscle) injury (19%). The results of the t-test and χ² tests showed that the Dutch participants, as compared to the US participants, were on average older, were more often male, and differed in level of education.

Table 2 Demographic and clinical characteristics of the Dutch and US samples

Measures

Table 3 summarizes the scores on the DF-PROMIS-UE item bank, the PROMIS Global Health Questionnaire, and the legacy instruments. The mean PROMIS-UE item bank T-scores of the Dutch sample (34.7 [SD = 8.6]) and the US sample (36.5 [SD = 7.0]) differed slightly, although the difference was statistically significant (p < 0.05, Hedges’ g = 0.24 [small]).
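For readers unfamiliar with the effect size reported above, a minimal sketch of Hedges’ g computed from summary statistics follows. It uses the common pooled-SD formula with the approximate small-sample correction 1 − 3/(4(n1 + n2) − 9). The means and SDs are the T-scores reported above, but the group sizes of 100 are placeholders and not the study’s sample sizes, so the result is not expected to reproduce the reported value exactly.

```python
import math


def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference (group 1 minus group 2) with Hedges' correction."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    d = (mean1 - mean2) / math.sqrt(pooled_var)
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return d * correction


# Means and SDs as reported in the text; n = 100 per group is a hypothetical placeholder.
g = hedges_g(36.5, 7.0, 100, 34.7, 8.6, 100)
```

Because the pooled SD weights each group by its size, the value of g shifts toward the SD of the larger group when the samples are unequal.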

Table 3 PROMIS and legacy instruments scores

Cross-cultural validity

Table 4 summarizes the eight items that were flagged for DIF for language. Six items showed uniform DIF (PFA36, PFB13, PFB21r1, PFB28r1, PFB56r1, and PFC43). Two items showed non-uniform DIF (PFM2 and PFM16) and the discrimination parameters were higher in Dutch patients than in US participants.

Table 4 Dutch–Flemish PROMIS upper extremity items with differential item functioning (DIF) for language

Figure 1 shows the impact of DIF for language on the TCC. The left graph shows the TCC for all 46 UE items, and the right graph shows the TCC for only the eight items flagged as possibly having DIF. The solid and dashed curves in the left graph (all 46 UE items) almost overlap, which indicates a minimal impact of DIF for language on the full item bank.

Fig. 1 The total impact of differential item functioning for language on the test characteristic curves (TCC). Both graphs show the relation between the level of upper extremity function (theta as estimated in the DIF analysis) on the x-axis and the total (raw) test scores on the y-axis. The left graph shows the TCC for all 46 Dutch–Flemish (DF [solid line]) and United States (US [dashed line]) PROMIS upper extremity items. The right graph shows the TCC for the eight items having DIF

Construct validity

Table 5 summarizes the correlations between the DF-PROMIS-UE T-scores and the legacy instrument scores. All correlations were as hypothesized.

Table 5 Correlations between the Dutch–Flemish PROMIS upper extremity and legacy instruments

Floor and ceiling effects

Table 6 provides an overview of the proportion of participants that achieved the lowest or highest possible raw scores on the measures. No floor or ceiling effects were found for the DF-PROMIS-UE item bank, and no floor and a small (2.4%) ceiling effect for the Short Form 7a. No floor and a minimal (0.5%) ceiling effect were found for the DASH Subscale Disability/Symptoms, a minimal floor (0.5%) and a ceiling (17.6%) effect for the FIHOA, and no floor and some ceiling effect (11.7%) for the MHQ-ADL.

Table 6 Floor and ceiling effects of the full Dutch–Flemish PROMIS upper extremity item bank v2.0, its short form 7a and legacy instruments

Discussion

The aim of this study was to develop the DF-PROMIS-UE item bank v2.0, to investigate its cross-cultural and construct validity, as well as its floor and ceiling effects in Dutch patients with musculoskeletal UE disorders. DIF analyses flagged eight items as possibly having DIF for language, but the impact of DIF on the test score was negligible, indicating sufficient cross-cultural validity. The construct validity for the item bank was sufficient, because none of the four predefined hypotheses about the correlations with legacy instruments had to be rejected. The full item bank and the short form had no floor or ceiling effects.

A limitation of our study is that the Dutch and US samples differed with respect to age, gender, educational level, and administration mode, and that the US sample was a non-clinical sample. These differences between the two samples might have contributed to the DIF that we found for the eight items. However, previous studies of the DF-PROMIS-PF item bank v1.2, which included 42 of the current 46 items, found no DIF between groups differing with respect to age, gender, educational level, and administration mode, or between several clinical samples and a non-clinical general population sample [61,62,63,64]. Therefore, it seems unlikely that the demographic and clinical differences between the Dutch and US samples in this study explain the DIF of the eight items. Nevertheless, we recommend that future research study the DF-PROMIS-UE item bank v2.0 with respect to DIF for age, gender, educational level, and administration mode, and between clinical and non-clinical samples. Moreover, the US sample used in this study was a subsample of the US calibration sample (and not the centering sample) [26]. If any bias exists between the US sample used in this study and the US centering sample, the results of our DIF analyses may be similarly biased.

This is the first study investigating the cross-cultural validity of the 46-item PROMIS-UE item bank v2.0 outside the US. Comparable to a study addressing the DF-PROMIS-PF item bank v1.2, we found sufficient cross-cultural validity, although several items in both studies showed some DIF for language [61]. In the cognitive debriefing interviews, three items (PFM2, PFM16, and PFM18) were regarded as less relevant or as describing unusual activities by some participants. Two of these three items, which reflect higher levels of UE function (PFM2 and PFM16), also showed non-uniform DIF, and responses showed that the activities described in these items were more difficult for Dutch participants with lower levels of UE function. Four DIF items (PFA36, PFB13, PFB28r1, and again PFM16) are part of the standard 7a short form, and PFM16 is the current starting item of the CAT algorithm. This might indicate that some items will be less suitable to maintain in the final DF-PROMIS-UE item bank or short form and that another starting item might be more appropriate for the Dutch–Flemish CAT. This will have to be investigated in the final item bank calibration. Nevertheless, the right graph in Fig. 1 shows that, even if all eight items with DIF were administered in a short form or CAT, the impact of DIF on the test score would be minimal. We therefore decided to keep these items in this preliminary version of the item bank.

To examine construct validity, we formulated four a priori hypotheses about the correlations with legacy instruments, as is proposed for studies on measurement instruments [56]. The constructs that are measured by the legacy instruments should be clear and these instruments should have sufficient measurement properties in a comparable population, which was the case in our study. None of the a priori formulated hypotheses were rejected, thereby indicating sufficient construct validity for the DF-PROMIS-UE item bank [57]. Three other studies examined correlations between the PROMIS-UE item bank v2.0 and legacy instruments. Minoughan and coworkers studied the bank, administered as a CAT, in patients with shoulder arthritis and found a moderate correlation (r = 0.57) with the American Shoulder and Elbow Surgeons (ASES) shoulder assessment form and a moderately strong correlation (r = 0.64) with the Simple Shoulder Test (SST) [65]. Kaat and colleagues reported, in a sample of participants with UE limitations, a correlation of 0.72 with the PROMIS-PF short form (SF8b), which is a generic physical function PROM, and a correlation of 0.69 with the Flexilevel Scale of Shoulder Function (FLEX-SF), which is a shoulder-specific PROM [26]. Van Bruggen and colleagues reported, in 303 patients from an outpatient department of a level 1 (academic) trauma center, correlations of the DF-PROMIS-UE item bank with the DASH, Patient-Reported Wrist Evaluation (PRWE) function, and MHQ-ADL of −0.84, −0.75, and −0.73, respectively. This study also showed sufficient structural validity and internal consistency of the Dutch–Flemish PROMIS-UE item bank [66].

Previous studies that examined the PROMIS-UE item bank (v1.2) in clinical populations with UE conditions found ceiling effects in the item bank [21,22,23,24,25]. In the current study, no floor or ceiling effects were found for the full DF-PROMIS-UE item bank, and for the Short Form 7a there was no floor effect and only a small ceiling effect, well below the 15% criterion. These findings are comparable to those in the study of Kaat et al. addressing the expansion and validation of the PROMIS-UE item bank v2.0 [26]. In our study, the FIHOA had a ceiling effect and the MHQ-ADL had some ceiling effect, although below the 15% criterion. Such effects reduce the discriminative and evaluative properties of a measure. Moreover, floor and ceiling effects may preclude the application of some statistical analyses, as many of them assume a normal distribution. Thus, the DF-PROMIS-UE item bank v2.0 has an improved measurement range compared to the initial PROMIS-UE item bank (v1.2), and the measurement range seems comparable to that of the US PROMIS-UE item bank v2.0.

In line with previous work of the Dutch–Flemish PROMIS group, the results of our study add to the evidence about the psychometric properties of the Dutch–Flemish PROMIS banks. Following the PROMIS guidelines, cross-cultural validation is the first recommended step after translation of PROMIS item banks [60]. Once cross-cultural validity has been established, further development of the item bank is warranted. We recommend expanding the current study to a larger sample of at least 500 participants for a so-called full item bank calibration. Afterwards, PROMIS CATs can be applied in clinical practice and research.

In conclusion, in this study we found sufficient cross-cultural and construct validity of the newly developed DF-PROMIS-UE item bank v2.0, and absence of floor and ceiling effects. Further validation of the item bank is now warranted and the item bank has the potential of improved measurement of UE functioning in the Dutch–Flemish population.