Introduction

Pain behaviors (PBs) are behaviors that communicate to others that a person is experiencing pain [1–3]. PB is an important outcome in studies of persons living with chronic pain [4, 5] because PBs may predict the development of disability [6].

The National Institutes of Health's (NIH) Patient-Reported Outcomes Measurement Information System (PROMIS) included PBs among its targeted outcomes [7]. All PROMIS measures were developed as item banks, and candidate items were administered to a large sample of predominantly healthy community participants (Wave 1). Because few Wave 1 participants reported higher levels of pain, additional data collection was required: participants with higher levels of pain were recruited through the American Chronic Pain Association (ACPA) and completed an online survey that included the PROMIS–PB items. The data from Wave 1 and from the ACPA were combined for the purpose of calibrating the items. However, calibration of the PB items was conducted in the combined sample without an investigation of measurement invariance.

Measurement invariance means that the same construct is measured in the same way across groups. For instance, cancer pain typically has unique emotional components not necessarily found in other types of chronic pain, and this emotional component of the pain might influence several dimensions of PBs. Researchers may therefore be concerned that test score differences observed across subgroups reflect problems with the measurement instrument rather than true differences in the trait being measured. Lack of measurement invariance has mainly been investigated using two methods: multi-group confirmatory factor analysis (MG-CFA) and item response theory (IRT).

MG-CFA procedures are commonly employed to test for measurement equivalence [8–12]. The main question underlying tests of measurement equivalence across groups is whether certain factor analytic parameters, such as loadings, intercepts, error variances, factor variances, factor covariances, and factor means, can be assumed equivalent across groups [10, 12, 13].

In the IRT framework, a lack of measurement invariance at the item level is referred to as differential item functioning (DIF). DIF is defined as “a difference in the probability of endorsing an item across comparison groups when the scores are on a common metric” [14]. Several researchers have investigated similarities and differences between the two models in detecting a lack of measurement invariance [14–16]. Stark et al. [14] reported that the CFA and IRT methods produced similar results in detecting DIF across a majority of simulated conditions. The authors found that the CFA approach performed slightly worse than the IRT approach with dichotomous data; however, it performed better with polytomous data and a small sample size. The authors also pointed out that testing measurement invariance via the IRT approach seemed more complicated than via the CFA approach. In the current study, we explored measurement invariance across the Wave 1 and ACPA samples with both MG-CFA and IRT-based DIF approaches. Evidence of measurement invariance provides support for using the PROMIS–PB score to compare observed differences in group means between healthy and clinical samples. The data for the study were collected in the process of instrument development, and the study design is described in detail in Cella et al. [7]. The purpose of the current study was to investigate the level of measurement invariance of the PROMIS–PB across a sample of generally healthy individuals from the general population and a sample of individuals with chronic pain.

Methods

Participants

The PROMIS Wave 1 data included 21,133 research participants. Of these, 19,601 were recruited from an internet panel (YouGovPolimetrix; www.polimetrix.com), and 1,532 were recruited from primary research sites associated with the PROMIS network. A detailed description of Wave 1 data collection (calibration testing) is available at http://www.nihpromis.org/science/. For the purposes of this study, only data from participants who responded to the full bank and had no missing data were used.

As described above, the sample size for Wave 1 was quite large; however, few individuals reported higher levels of pain. With IRT models, a sufficient number of responses in every response category is essential for precise estimation of item parameters [17]. Thus, research participants with chronic pain were recruited through the ACPA. Eligibility requirements included being 21 years of age or older and having one or more chronic pain conditions for at least 3 months prior to the survey.

Analyses

Three levels of measurement invariance were tested using the MG-CFA approach. The first and weakest level, configural invariance [18], assumes that the same pattern of item-factor loadings exists across the groups being compared; the same items must have nonzero loadings on the same factors. Metric invariance [19] additionally requires that unstandardized factor loadings be invariant across the comparison groups. Scalar invariance is the strongest level of invariance [18, 20]; it requires that all the assumptions of configural and metric invariance be met and, in addition, that the scale's item intercepts be invariant across groups.
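These increasingly constrained models can be made concrete. The following is a minimal sketch only, not the study's actual syntax: the original analyses used Mplus, whereas the R package lavaan stands in here, and the data frame pb, its grouping variable sample, and the item selection are hypothetical.

```r
library(lavaan)

# Hypothetical setup: 'pb' holds the retained PB item responses plus a
# 'sample' factor distinguishing the Wave 1 and ACPA groups
items <- grep("^PB", names(pb), value = TRUE)
model <- paste("PainBehavior =~", paste(items, collapse = " + "))

# Configural model: same loading pattern, all parameters free per group
fit.configural <- cfa(model, data = pb, group = "sample",
                      ordered = items, estimator = "WLSMV")

# Metric model: unstandardized loadings constrained equal across groups
fit.metric <- cfa(model, data = pb, group = "sample",
                  ordered = items, estimator = "WLSMV",
                  group.equal = "loadings")

# Scalar model: loadings and thresholds (the intercept analogues for
# ordinal items) constrained equal across groups
fit.scalar <- cfa(model, data = pb, group = "sample",
                  ordered = items, estimator = "WLSMV",
                  group.equal = c("loadings", "thresholds"))
```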

Mplus version 6.1 [21] was used with weighted least-squares mean and variance adjusted (WLSMV) estimation. Several fit indices were used in the current study: χ², the comparative fit index (CFI) [22], the Tucker–Lewis index (TLI) [23], and the root mean square error of approximation (RMSEA) [24, 25]. CFI and TLI values above 0.90 are considered acceptable [13, 26], and RMSEA values of <0.08 are considered to indicate adequate fit [27].

In the MG-CFA approach, the fit of a baseline model is compared to the fit of increasingly constrained models. The χ² difference test is utilized to compare the fit of two nested models [28–30]. A nonsignificant χ² difference supports the less parameterized model (i.e., the addition of the extra parameters does not significantly improve model fit). Because the χ² difference test is sensitive to sample size, an α level of 0.05 was used for the test and, additionally, a difference of <0.01 in the ΔCFI index was used to support the less parameterized model [9, 10]. Note that model fits were compared only when both models of interest individually fit the data.
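Under the same illustrative assumptions as above, the nested model comparisons could be run as follows (lavTestLRT applies the appropriate scaled difference test under WLSMV estimation):

```r
# Scaled chi-square difference tests between successive nested models
lavTestLRT(fit.configural, fit.metric)
lavTestLRT(fit.metric, fit.scalar)

# Delta-CFI criterion: a drop of less than 0.01 in CFI supports the
# more constrained (less parameterized) model
cfi <- sapply(list(configural = fit.configural,
                   metric     = fit.metric,
                   scalar     = fit.scalar),
              fitMeasures, fit.measures = "cfi")
diff(cfi)  # changes in CFI between successive models
```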

Additionally, DIF was analyzed with the R package lordif [31]. lordif uses an ordinal logistic regression framework, with the graded response (GR) model used for IRT trait estimation [32]. Two criteria were considered to detect meaningful DIF in the current study: (1) a pseudo-R² statistic of at least 0.13 [33] (values below 0.13 were classified as negligible DIF) and (2) a 10 % change in beta [31, 34, 35].
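A minimal sketch of this analysis, assuming an item-response data frame resp and a group vector grp coding Wave 1 versus ACPA membership (both names hypothetical; the specific pseudo-R² variant is also an assumption, as lordif offers several):

```r
library(lordif)

# Criterion 1: flag DIF when the pseudo-R-squared change reaches 0.13
dif.r2 <- lordif(resp, grp, criterion = "R2",
                 pseudo.R2 = "McFadden", R2.change = 0.13)

# Criterion 2: flag DIF when the beta coefficient changes by 10 % or more
dif.beta <- lordif(resp, grp, criterion = "Beta", beta.change = 0.10)

summary(dif.beta)  # reports which items are flagged for DIF
```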

Following the approach of Cook et al. [36], the impact of DIF on the scores was assessed: a Pearson correlation between DIF-adjusted person scores and the original person scores was calculated to examine whether DIF had a meaningful impact on the scores. A strong correlation would suggest that adjusting for DIF makes a negligible difference in the person scores, indicating that item parameters calculated from all groups combined could be used without concern for a substantial impact of DIF on person scores.
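Continuing the sketch, the comparison might look as follows; the component names on the lordif fit object are assumptions and may differ by package version:

```r
# Person scores before and after accounting for DIF; the '$calib' and
# '$calib.sparse' component names are assumptions, not verified API
theta.orig <- dif.beta$calib$theta         # assumed: unadjusted estimates
theta.adj  <- dif.beta$calib.sparse$theta  # assumed: DIF-adjusted estimates

cor(theta.orig, theta.adj)  # a value near 1 implies negligible DIF impact
```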

Items

The PROMIS–PB item bank provided good coverage of the PB construct [37]. A census-weighted subsample of the PROMIS Wave 1 data was used to anchor the PROMIS scores on a T-score metric (M = 50; SD = 10) [38]. The PROMIS–PB items have a seven-day time frame and are rated on a six-point scale ranging from 1 = had no pain to 6 = always. Because of low response frequencies, categories 1 and 2 (never) were subsequently combined.
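Collapsing the bottom two categories is a simple recode; a sketch, assuming the responses are stored as integers 1 through 6 in the hypothetical data frame resp used above:

```r
# Merge category 1 ("had no pain") into category 2 ("never"), then shift
# so the collapsed scale runs 1 (never) through 5 (always)
resp[resp == 1] <- 2
resp <- resp - 1
```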

Results

Initial analyses

Initial analyses were conducted using data from all 36 items administered to the combined PROMIS and ACPA samples. The initial model, however, had poor fit: χ²(594, N = 1,176) = 8,397.010, p < .01, CFI = 0.894, TLI = 0.888, RMSEA = 0.106 (0.104–0.108). We investigated potential local dependency among items because it can cause biased parameter estimates. To identify potential local dependency and to modify the model specification, residual correlations and modification indices were inspected; item pairs with absolute residual correlations >0.20 were taken to indicate local dependency [39]. Based on the results, nine items were eliminated due to potential local dependency: PB2, “When I was in pain I became irritable”; PB9, “When I was in pain I became angry”; PB16, “When I was in pain I appeared upset or sad”; PB23, “When I was in pain I asked one or more people to leave me alone”; PB24, “When I was in pain I moved stiffly”; PB29, “When I was in pain I used a cane or something else for support”; PB31, “I limped because of pain”; PB43, “When I was in pain I walked carefully”; and PB53, “When I was in pain I moved my arms or legs stiffly.” A schematic flow of the item analysis used in the present study is illustrated in Fig. 1.
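This residual-correlation screen can be reproduced in outline; a sketch under the same lavaan assumptions as above, where fit.initial is a hypothetical single-group cfa() fit of all 36 items to the combined sample:

```r
# Residual correlations: observed minus model-implied correlations
res <- resid(fit.initial, type = "cor")$cov

# Flag item pairs whose absolute residual correlation exceeds 0.20
idx <- which(abs(res) > 0.20 & upper.tri(res), arr.ind = TRUE)
data.frame(item1    = rownames(res)[idx[, 1]],
           item2    = colnames(res)[idx[, 2]],
           residual = round(res[idx], 2))
```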

Fig. 1 A schematic flow of item analysis

Descriptive analysis

A total of 426 PROMIS Wave 1 participants (192 male, 234 female) and 750 ACPA participants (136 male, 610 female, 4 missing) were included in the current study. Table 1 presents demographic and clinical details of the samples. The PROMIS Wave 1 and ACPA samples differed significantly on age, t(1,172) = 4.990, p < .001; gender, χ²(1, N = 1,172) = 96.922, p < .001; ethnicity, χ²(1, N = 1,170) = 50.485, p < .001; marital status, χ²(2, N = 1,119) = 7.137, p < .001; and education, χ²(4, N = 1,174) = 30.957, p < .001.

Table 1 Demographics between the PROMIS Wave 1 sample and the ACPA sample for pain behavior

MG-CFA approach

Configural invariance

A configural invariance model (i.e., the same pattern of item-factor loadings across groups) was tested across the comparison groups. The findings supported configural invariance between the PROMIS and ACPA samples: χ²(648, N = 1,176) = 3,453.968, p < .01, CFI = 0.904, TLI = 0.896, RMSEA = 0.086 (0.083–0.089) (Table 2).

Table 2 Results of testing measurement invariance of the pain behavior items across PROMIS Wave 1 and ACPA using MG-CFA

Metric invariance

A metric invariance model (i.e., equality constraints on the unstandardized item-factor loadings across groups) also showed good fit: χ²(675, N = 1,176) = 3,486.512, p < .01, CFI = 0.904, TLI = 0.900, RMSEA = 0.084 (0.081–0.087). Next, the fit of the configural and metric invariance models was compared. The χ² difference test was statistically significant, Δχ²(Δdf = 27) = 428.170, p < .01, indicating that some unstandardized factor loading values differed statistically between the PROMIS Wave 1 and ACPA samples. Because the χ² difference test is sensitive to relatively large sample sizes, the CFI difference test (ΔCFI) is frequently used in testing measurement invariance [9, 10]. In contrast to the χ² difference test, the nested model comparison showed a decrease of <0.01 in the CFI value (ΔCFI = 0.00), supporting equal unstandardized factor loadings across the PROMIS and ACPA samples.

Scalar invariance

After finding support for both configural and metric invariance, we examined the PROMIS–PB for scalar invariance (i.e., invariance of the unstandardized item thresholds across groups). The results did not support scalar invariance: χ²(771, N = 1,176) = 9,085.440, p < .01, CFI = 0.716, TLI = 0.742, RMSEA = 0.135 (0.133–0.138).

IRT-based DIF approach

The pseudo-R² criterion (i.e., classifying pseudo-R² < 0.13 as negligible DIF) flagged no items for DIF. Using the DIF criterion of 10 % beta change, seven items were identified as having meaningful DIF. The correlation between the original and DIF-adjusted scores was 0.98, indicating no substantial impact of DIF on person scores when all groups were combined.

Discussion

The current study examined the measurement invariance of the PB items using MG-CFA across two samples to evaluate whether the construct of PBs is the same in healthy people and in those with chronic pain. The PROMIS Wave 1 community sample consisted predominantly of healthy participants, and the ACPA sample consisted exclusively of individuals living with chronic pain. There is still little consensus in the literature regarding the level of equivalence necessary for inferring measurement invariance across groups. Horn and McArdle required metric invariance to ensure that the same constructs are measured across groups [19]. Chen, Sousa, and West argued that comparing means across groups can be meaningful only after confirming the existence of scalar invariance [40]. Reise, Widaman, and Pugh, however, claimed that a form of partial loading invariance is all that is required to permit across-group comparisons [16]. The findings of the current study supported measurement invariance at the level of metric invariance, but not at the level of scalar invariance.

Conclusions and recommendations

Had the PROMIS–PB failed to support either configural or metric invariance, we might have needed to consider a remedy such as re-calibrating the item bank or removing items that function differently in the two compared groups. The results from this study showed that a subset of 27 PROMIS–PB items met all but the strictest form of measurement invariance. Based on the IRT-based DIF analysis results, it was concluded that although statistically significant DIF was identified using the 10 % beta change criterion, adjusting for DIF would result in negligible changes in person scores, since the correlation between adjusted and nonadjusted scores was approximately 0.98. For this reason, it was concluded that any DIF in this item set between the PROMIS Wave 1 and ACPA groups could be disregarded. This implies that the instrument measures the same construct in both healthy and clinical samples, including those with chronic pain. Based on the findings of the current study, we conclude that using the originally obtained parameter estimates from the combined sample of PROMIS Wave 1 and ACPA participants is acceptable, and the instrument can be scored and used as originally published.

The current study could use only 27 of the 36 items in the PROMIS–PB item bank, mainly due to local dependence. Local dependence may cause biased parameter estimates [41, 42]; we therefore recommend that the PROMIS–PB developers address the local dependence in the item bank or utilize testlets to handle local dependence among the items [42]. In summary, the results of the current study support the use of PROMIS–PB item parameters obtained from the combined general population and chronic pain sample. The construct of PBs appears to function in the same way in a community sample as in people living with chronic pain. As a result, the PROMIS–PB score can be used to compare mean differences between groups.