Introduction

Disease-specific patient-reported outcome (PRO) measures are widely used to assess the impact of specific diseases on health-related quality of life (QOL) [1, 2]. These measures, however, have largely been developed independently of one another, greatly limiting comparisons of how different diseases affect QOL. For example, two heart disease-specific PRO measures, the Seattle Angina Questionnaire (SAQ) [3] and the Minnesota Living with Heart Failure Questionnaire (MLHFQ) [4], share only four common item content areas and differ in scoring, limiting comparisons of the QOL impact attributed to angina versus congestive heart failure. Comparing disease-specific QOL impact across more distinct conditions (e.g., heart disease versus arthritis) is even more challenging. Comparisons of general profiles show that some diseases affect the same areas of life but may have distinct “signature” profiles of QOL impact, depending on their symptoms, stage of treatment, progression, and overall severity [5, 42]. Disease-specific measures of QOL impact are generally believed to be more responsive than general QOL measures [43]. In theory, greater standardization of the content and scoring of QOL impact attributed to specific diseases would add, to the advantages of specificity, the advantage of a better understanding of diseases' relative overall impact on QOL.

The Quality-of-life Disease Impact Scale (QDIS®) was developed to standardize the content and scoring of QOL impact attributed to specific diseases using item response theory (IRT), in order to enable more meaningful comparisons of QOL impact across diseases [6–8]. QDIS items are standardized to have identical content (questions and response categories), differing only in their attribution to a specific disease. The 49 items form an item bank that can be administered in different ways: as the full bank, as a fixed short form such as the QDIS-7, or as a computerized adaptive test. QDIS items were self-administered and IRT-calibrated in five independent US household samples differing in chronic disease. Tests in these development samples provided the basis for the cross-disease standardized QDIS IRT item parameters. QDIS scores based on the standardized IRT item parameters were highly consistent with those based on disease-specific IRT item parameters [44].

In this study, we examined the invariance (equivalence) of the standardized IRT item parameters provided by the QDIS developer in comparison with parameters estimated in an independent clinical sample hospitalized with acute coronary syndrome (ACS). ACS is a major cause of morbidity and has been shown to have a substantial impact on patients’ QOL [12, 13]. These tests of IRT invariance are stringent because results for patients interviewed after an acute coronary event were compared with results from Internet-based, self-administered surveys of household samples.

Methods

Participants

Standardization sample

The QDIS was developed during the computerized adaptive assessment of disease impact (DICAT) project, a three-year NIH-sponsored effort to develop and improve disease-specific PRO measures [6]. DICAT participants were recruited from an online research panel (www.knowledgepanel.com) to achieve a probability sample representing the US adult population and to fill quotas for pre-identified samples with each of five chronic conditions: arthritis (including osteoarthritis and rheumatoid arthritis), cardiovascular disease (including angina, congestive heart failure, and myocardial infarction), chronic kidney disease, diabetes, and respiratory disease (including asthma and chronic obstructive pulmonary disease). Notably, ACS was among the cardiovascular diseases sampled in the DICAT project. The pre-identified chronically ill samples (N = 5,451) were used to calibrate the QDIS-standardized IRT item parameters (the QDIS standardization sample). Responses were collected online because of the practical advantages demonstrated in other recent studies (e.g., PROMIS), which have also added to the substantial evidence that Internet-based surveys produce IRT parameters and scale scores equivalent to those from other self-administered modes [50]. The DICAT study was approved by the New England Institutional Review Board (NEIRB 09-062).

ACS clinical sample

An independent sample was analyzed from the Transitions, Risks, and Actions in Coronary Events—Center for Outcomes Research and Education (TRACE-CORE) project, a prospective cohort study of patients hospitalized for ACS at six hospitals in Massachusetts and Georgia. The design and methods of TRACE-CORE are described elsewhere [14]. Briefly, one month following hospital discharge, ACS patients (N = 1,544) completed a 7-item short form of the QDIS (the QDIS-7) during a telephone interview. The Committee for the Protection of Human Subjects at the University of Massachusetts Medical School and Institutional Review Boards at each participating site approved this study.

Measures

The full-length QDIS consists of 49 items assessing disease-specific impact on patients’ physical, emotional, and social functioning and well-being. Item content is identical across diseases except for the disease attribution. For example, in DICAT, pre-identified cardiovascular patients were asked “how often did your angina/CHF/MI limit your everyday activities during the past 4 weeks,” and pre-identified arthritis patients were asked “how often did your OA/RA limit your everyday activities during the past 4 weeks.” In TRACE-CORE, ACS patients were interviewed using a more general attribution, such as “how often did your heart problem limit your everyday activities during the past 4 weeks.” Each item uses the same five-choice categorical rating scale, with higher scores indicating worse QOL. The QDIS-standardized IRT item parameters were calibrated by fitting the generalized partial credit model (GPCM) to all 49 items across the samples of the five chronic conditions pre-identified in the QDIS standardization sample [44]. IRT was used to enable administration and comparison across different (adaptive and static) QDIS forms. The QDIS-7, a static short form constructed and recommended by the QDIS developers using various criteria [7], was administered in TRACE-CORE as required by its study protocol, which included paper–pencil interviewer-administered surveys (as opposed to computerized adaptive testing). The QDIS-7 covers seven distinct content areas (see Table 2) and performed very well for cardiovascular and other conditions in the DICAT study [7]. Standardized IRT parameters for the seven items had already been calibrated along with the rest of the 49 items.
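For reference, under one standard parameterization (our notation, not reproduced from the QDIS documentation), the GPCM gives the probability that a respondent with latent impact level θ selects category k of item i as

```latex
P_{ik}(\theta) = \frac{\exp\!\left[\sum_{j=1}^{k} a_i(\theta - b_{ij})\right]}
                      {\sum_{m=0}^{K-1} \exp\!\left[\sum_{j=1}^{m} a_i(\theta - b_{ij})\right]},
\qquad k = 0, 1, \ldots, K-1,
```

where the empty sum for k = 0 is defined as zero, a_i is the item discrimination (slope) parameter, and b_ij are the category threshold parameters. With K = 5 response categories, each QDIS item has one a-parameter and four b-parameters.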

Analyses

Descriptive statistics

Descriptive statistics for the QDIS-7 were compared across the samples, including item-level means, standard deviations (SD), and item-total correlations (corrected for overlap, as recommended for short surveys [15]). Scale internal consistency (Cronbach’s alpha [16]) and floor and ceiling percentages were also compared.
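As an illustration only (not the authors’ code), these descriptives can be reproduced in R with the psych package, assuming a hypothetical data frame qdis7 holding the seven item responses scored 1–5:

```r
# Illustrative sketch, assuming a data frame `qdis7` (7 columns, items scored 1-5)
library(psych)

item_stats <- data.frame(
  mean = sapply(qdis7, mean, na.rm = TRUE),
  sd   = sapply(qdis7, sd,   na.rm = TRUE)
)
rel <- psych::alpha(qdis7)
item_stats$item_total <- rel$item.stats$r.drop  # item-total r corrected for overlap
rel$total$raw_alpha                             # Cronbach's alpha

# Floor/ceiling: % of respondents at the minimum (7) and maximum (35) summed scores
total <- rowSums(qdis7)
c(floor = 100 * mean(total == 7), ceiling = 100 * mean(total == 35))
```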

Differential functioning of items and tests (DFIT)

To test the IRT invariance of QDIS-7 across samples, differential functioning was compared for the items and the scale. For this purpose, IRT item parameters were calibrated independently within the QDIS standardization sample and the ACS sample. Results were compared in terms of item-, scale-, and person-level score differences. This method, referred to as differential functioning of items and tests (DFIT) [17, 18], involves three general steps as illustrated in Fig. 1.

Fig. 1

Illustration of steps in IRT-based differential functioning of items and tests (DFIT) analyses

Step 1: calibrating sample-specific IRT item parameters

Since the standardized IRT item parameters were provided by the QDIS developer, we only needed to calibrate IRT parameters for the new ACS sample. The same IRT model (GPCM) used in the standardization sample was applied. Items were freely calibrated without imposing any prior knowledge of the standardized IRT item parameters; the resulting estimates for the independent clinical sample are referred to as the “ACS-specific” IRT item parameters. As a validity check, we evaluated the goodness of fit of the GPCM in the new ACS sample based on Chi-square item fit statistics [19] and IRT graphical residual plots [20, 21]. The IRT calibration was completed using Parscale [22], and the graphical residual analyses were completed using ResidPlots [23].
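The study itself used Parscale for this step; as a hypothetical illustration, an equivalent free GPCM calibration can be run in R with the mirt package (variable names are ours):

```r
# Sketch of a free GPCM calibration in R (the study used Parscale, not mirt)
library(mirt)

# `acs_items`: assumed data frame of QDIS-7 responses in the ACS sample
fit <- mirt(acs_items, model = 1, itemtype = "gpcm")
coef(fit, IRTpars = TRUE, simplify = TRUE)$items  # a and b1-b4 for each item
itemfit(fit)                                      # item-level fit statistics
```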

Step 2: linking IRT item parameters across samples

Although IRT item parameters are, in principle, invariant across samples, the ACS-specific and standardized IRT item parameters were not directly comparable because of IRT scale indeterminacy [9, 24]: the latent metric of an IRT calibration is identified only up to a linear transformation (illustration in ‘Item response theory indeterminacy’ in Appendix 1). Consequently, item parameters from separate IRT calibrations are invariant (consistent) only up to a set of linear transformations rather than being numerically equal.
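In brief (our notation; Appendix 1 gives the full illustration), rescaling the theta metric leaves every GPCM category probability unchanged provided the item parameters are counter-transformed, so two valid calibrations may differ by exactly such a transformation:

```latex
\theta^{*} = A\theta + B, \qquad a_i^{*} = a_i / A, \qquad b_{ij}^{*} = A\,b_{ij} + B
\;\;\Longrightarrow\;\;
a_i^{*}\,(\theta^{*} - b_{ij}^{*}) = \frac{a_i}{A}\,(A\theta + B - A\,b_{ij} - B) = a_i\,(\theta - b_{ij}).
```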

The linking (equating) process computes the slope and intercept of this linear transformation in order to place IRT item parameters from separate calibrations onto a common metric for meaningful comparison [25]. Various linking methods are available, based on item parameters, test characteristic functions, or theta scores, and the choice often depends on study design, item quality, and software availability [26]. Because only seven items were used, any substantial DIF could seriously contaminate a linking process based on item parameters; we therefore linked based on theta scores. First, we scored the QDIS-7 for ACS patients twice, using the ACS-specific and the standardized IRT parameters, respectively. We then computed the slope (A) and intercept (B) of the linear transformation from the two sets of theta scores (calculations in ‘Slope (A) and intercept (B) of linear transformation using IRT theta scores’ in Appendix 1). Finally, we applied the slope and intercept to link the ACS-specific parameters with the standardized IRT item parameters. The ACS-specific item parameters after the linear transformation are labeled “ACS-LT.” After linking, the ACS-LT and standardized IRT item parameters were meaningfully comparable.
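The following sketch shows a mean/sigma form of this computation that is consistent with the A and B values reported in Step 2 of the Results (the exact calculations are in Appendix 1; theta_std and theta_acs are the two theta-score vectors for the same ACS patients, and the other variable names are ours):

```r
# Slope and intercept of the linear transformation from the two theta distributions
A <- sd(theta_std) / sd(theta_acs)          # e.g., 0.917 / 0.937 = 0.979
B <- mean(theta_std) - A * mean(theta_acs)  # e.g., 0.476 - 0.979 * (-0.010) = 0.485

# Put the ACS-specific parameters on the standardized metric (ACS-LT)
a_lt     <- a_acs / A         # slopes are divided by A
b_lt     <- A * b_acs + B     # thresholds are rescaled and shifted
theta_lt <- A * theta_acs + B # theta scores transform the same way
```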

Step 3: comparing IRT-based item, scale, and theta scores

First, we evaluated scatter plots of the ACS-LT versus standardized item parameters, separately for the IRT a- and b-parameters, in terms of the magnitude of their correlation and their differences (distance off the diagonal line). Any data point with a difference larger than 2 times the SD of the differences across items, roughly equivalent to a significance level of 0.05, was flagged [27]. Next, the IRT item characteristic curve (ICC), which reflects both a- and b-parameters, was used to evaluate their joint impact on item scores. Two ICCs were plotted for each item, one using the ACS-LT and the other using the standardized IRT item parameters; a smaller area between the two ICCs indicates a smaller impact on item scores. The non-compensatory differential item functioning (NCDIF) index was computed to compare this area objectively across items (calculation in ‘Non-compensatory differential item functioning (NCDIF)’ in Appendix 1) [11, 28]. The NCDIF index was compared to a cutoff calculated as 0.006 × (k − 1)², which equals 0.096 for k = 5, the number of categories in a five-choice rating item [29–31].
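As a sketch of the NCDIF computation under stated assumptions (the exact formulae are in Appendix 1; gpcm_escore is a hypothetical helper, and a_std, b_std, a_lt, b_lt hold one item’s linked parameters from the previous sketch):

```r
# Expected item score under the GPCM at given theta values
gpcm_escore <- function(theta, a, b, scores = 1:(length(b) + 1)) {
  sapply(theta, function(t) {
    z <- cumsum(c(0, a * (t - b)))  # cumulative logits; first category fixed at 0
    p <- exp(z) / sum(exp(z))       # GPCM category probabilities
    sum(scores * p)                 # expected item score at this theta
  })
}

# NCDIF for one item: mean squared expected-score difference over the ACS
# patients' linked theta values, compared to the 0.006 * (k - 1)^2 cutoff
d     <- gpcm_escore(theta_lt, a_std, b_std) - gpcm_escore(theta_lt, a_lt, b_lt)
ncdif <- mean(d^2)
ncdif < 0.006 * (5 - 1)^2  # cutoff = 0.096 for five-category items
```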

The IRT test characteristic curve (TCC), which aggregates the ICCs across items, was used to evaluate the impact on scale scores. Likewise, two TCCs were plotted, one using the ACS-LT and the other using the standardized item parameters; the more the two curves overlap, the smaller the impact on scale scores. The differential test functioning (DTF) index was computed to quantify the area between them (calculation in ‘Test differential functioning (DTF)’ in Appendix 1). Additionally, a third index, compensatory differential item functioning (CDIF), was computed to evaluate the relative contribution of each item’s score difference to the scale score difference [32], thereby relating NCDIF to DTF (calculation in ‘Compensatory differential item functioning’ in Appendix 1). Specifically, an item with a large positive CDIF contributes substantially to the DTF, whereas an item with a small or negative CDIF contributes little or helps to offset it. The statistical language R [45] was used to compute these DFIT indices based on the formulae illustrated in the appendix.
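Continuing the sketch above (again, Appendix 1 gives the authors’ formulae), DTF and CDIF follow from the item-level expected-score differences, and the CDIF values sum to the DTF:

```r
# Here a_std and a_lt are vectors of 7 slopes, and b_std and b_lt are lists of
# 7 threshold vectors; gpcm_escore() is the helper from the previous sketch.
d_mat <- sapply(1:7, function(i) {
  gpcm_escore(theta_lt, a_std[i], b_std[[i]]) -
    gpcm_escore(theta_lt, a_lt[i], b_lt[[i]])
})
D    <- rowSums(d_mat)       # test-level score difference at each theta
dtf  <- mean(D^2)            # differential test functioning
cdif <- colMeans(d_mat * D)  # CDIF_i = E[d_i * D]; note sum(cdif) == dtf
```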

Finally, we examined the difference between the ACS-specific and the standardized IRT-based theta scores. To make them comparable, we performed the linear transformation using the slope (A) and intercept (B) derived in Step 2. After the linear transformation, we constructed a scatter plot of the standardized and the linearly transformed ACS-specific (ACS-LT) theta scores and evaluated their correlation and differences.

Logistic regression DIF

The widely used logistic regression DIF analysis was performed for validation and to enable comparison with the DFIT analyses [10]. In this regression, the item response was the dependent variable and the scale score the matching (independent) variable; we ran the analyses twice, using the summed scale score and the IRT-based theta score as the matching variable, respectively. A binary grouping variable (dummy coded for the ACS and standardization samples) and a group-by-scale-score interaction term were added to the model, and an item was considered to exhibit DIF if adding these terms significantly improved model fit. We flagged DIF items using the criteria of change in the proportion of explained variance (ΔR² > 0.02) and change in the regression beta coefficient (Δβ > 10 %) [33, 34]. The Chi-square statistic was also computed but was treated only as a secondary criterion because of its over-sensitivity to large sample sizes. The R package lordif [51] was used to perform the logistic regression analyses.
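A minimal sketch with lordif follows (arguments as we understand the package documentation, not the authors’ exact call; items stacks both samples’ QDIS-7 responses and grp codes sample membership):

```r
# Sketch of the logistic regression DIF analysis with the lordif package [51]
library(lordif)

dif <- lordif(items, grp,
              criterion = "R2", R2.change = 0.02,  # flag if pseudo-R2 gain > 0.02
              pseudo.R2 = "McFadden")
summary(dif)  # per-item DIF statistics (chi-square, R2 change, beta change)
```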

Results

Participant demographics

The two samples differed significantly in age, gender, race/ethnicity, educational level, and marital status: the ACS sample was older and more likely to be male, non-Hispanic black, less educated, and divorced or separated (Table 1).

Table 1 Demographic characteristics of the QDIS standardization sample in the DICAT and the ACS sample in the TRACE-CORE, N (%) or M ± SD

Descriptive statistics

Table 2 summarizes item content and item-level descriptive statistics. Compared to the standardization sample, ACS patients had higher item means, indicating greater disease burden, and larger SDs, indicating greater variability (a more heterogeneous group); lower item-total correlations, indicating weaker item discrimination, were also observed. The floor and ceiling percentages were 12.6 and 0.86 % for the ACS patients (summed score mean = 15.6, SD = 7.1) versus 24.8 and 0.35 % for the standardization sample (summed score mean = 12.8, SD = 6.5), indicating that a larger proportion of the ACS sample reported at least some degree of disease burden. The scale internal consistency estimates were high and similar (Cronbach’s α = 0.91 and 0.94 for the ACS and standardization samples, respectively).

Table 2 Item description, statistics, and IRT invariance analyses across the ACS and QDIS standardization samples for the QDIS-7

Differential functioning of items and tests (DFIT)

Step 1. ACS-specific IRT calibration

While the Chi-square item fit statistics were all large and significant, the IRT graphical residual analyses (‘Appendix 2’) suggest a reasonable fit between the GPCM and the new ACS sample: the observed values fell within the 95 % confidence bands of the IRT model-expected values. Item 6 showed a poorer fit, with a few values falling outside the confidence bands. The standardized residuals across all items were approximately normally distributed (mean = −0.05, SD = 1.63), suggesting an acceptable model fit at the scale level. The ACS-specific IRT a-parameters ranged from 0.86 to 2.10, and the b-parameters ranged from −0.24 to 1.83 (Table 2).

Step 2. IRT parameter linking

The slope and the intercept of the linear transformation were computed: A = 0.979 and B = 0.485, based on the theta score distribution of ACS patients using the ACS-specific (mean = −0.010, SD = 0.937) and the standardized (mean = 0.476, SD = 0.917) IRT item parameters. After linking, there were three sets of IRT item parameters for each item: the standardized (derived from the standardization sample provided by the QDIS developer), ACS-specific (before linear transformation), and ACS-LT (after linear transformation) item parameters (Table 2).

Step 3. Differences in item, scale, and theta scores

The standardized and ACS-LT IRT item parameters were highly correlated (r = 0.89 and 0.90 for the a- and b-parameters, respectively; Fig. 2). For the a-parameters, every item had a significantly smaller value in the ACS-LT than in the standardized version; Item 2 (“Everyday work school chores”) and Item 3 (“Social family friends”) showed the largest drift (standardized distances of 3.24 and 2.97). The b-parameters were far more stable across the standardized and ACS-LT versions, although Item 6 (“General health, worry”) showed greater drift in one threshold (b3, the threshold between the response categories “Sometimes” and “Often”; standardized distance of 2.36).

Fig. 2

Scatter plots of standardized and linearly transformed ACS-specific (ACS-LT) IRT item slope (top) and item threshold (bottom) parameters for the QDIS-7. The numbers in the plots (1–7) index items 1–7

The areas between the ICCs based on the standardized and the ACS-LT IRT item parameters were nevertheless small across items (Fig. 3). The area index NCDIF (Table 2) was below the suggested cutoff of 0.096 for all items; Items 2 and 6 again showed the largest values (NCDIF = 0.074 and 0.063, respectively). The TCCs (bottom right corner, Fig. 3) showed only small discrepancies, and the area index DTF was 0.37. The CDIF index (Table 2) showed that Item 6 (CDIF = 0.108) contributed the most to the scale score difference, whereas Item 2 (CDIF = −0.032) actually helped to offset it. This is consistent with Fig. 3, where Item 2, despite a relatively large area between its two curves, differed in the opposite direction from the other items (its ACS-LT curve lying above the standardized curve).

Fig. 3

Item and test characteristic curves based on standardized and linearly transformed ACS-specific (ACS-LT) IRT item parameters for the QDIS-7. Seven plots show the item characteristic curves for items 1–7, and one plot (bottom right corner) shows the test characteristic curve for the total scale

The theta scores of ACS patients based on the standardized and the ACS-LT IRT item parameters were very highly correlated (r = 0.995) and showed a strong linear relationship (Fig. 4, top). The mean difference between the standardized and ACS-LT theta scores was plotted as a function of the ACS-LT theta score (Fig. 4, bottom): the standardized theta estimates were greater than the ACS-LT estimates at the lower end of the theta scale and lower at the higher end, with larger residuals (up to 0.23 SD) at the two extremes. The root-mean-squared difference (RMSD) was 0.09. When scored using the standardized item parameters, the ACS patients’ theta scores (mean = 0.476, SD = 0.917) were about one-half SD higher (greater disease burden) than the standardization sample (mean = 0, SD = 1).

Fig. 4

Scatter plot of IRT-based theta scores of ACS patients using standardized and linearly transformed ACS-specific (ACS-LT) IRT item parameters for the QDIS-7 (top). Mean difference between theta scores using standardized versus ACS-LT IRT item parameters as a function of ACS-LT theta scores (bottom)

Logistic regression DIF

While the Chi-square statistics were all large and significant, as expected for a large sample, no items were flagged with meaningful DIF using the criteria of ΔR² > 0.02 or Δβ > 10 %. Results based on the summed scores and the IRT theta scores were highly similar (differences in ΔR² or Δβ only at the third decimal place or smaller). Table 2 displays the latter, where Item 2 had the largest ΔR² (0.009) and Item 4 (“Physical function walking”) had the largest Δβ (0.024).

Discussion and conclusions

Our study evaluated a fundamental psychometric assumption, IRT invariance, for a short form of the IRT-based Quality-of-life Disease Impact Scale, the QDIS-7, across the original standardization sample, which completed surveys via the Internet, and an independent clinical sample interviewed one month after hospitalization for ACS. We compared the consistency of the standardized (pre-calibrated in household samples) and the ACS-specific (freely calibrated in the independent sample) IRT item parameters and evaluated the consequences of any differential functioning at the item, scale, and person levels. While the ACS-specific item discrimination parameters were systematically lower than the standardized parameters, the item threshold parameters and theta scores were consistent, supporting the future use of the standardized IRT item parameters in scoring the QDIS-7 for ACS patients.

PRO measures of QOL impact are increasingly relied upon in population health and clinical research. Current disease-specific QOL measures differ widely in item content and scoring and often do not represent content areas shown to differ between those with and without specific conditions. For example, the SAQ does not include items measuring limitations in role and social activities, which general measures administered in studies of heart disease have commonly shown to be affected. Developed using advanced psychometric models, the QDIS enables more standardized comparisons across conditions of the QOL impact attributed to a specific condition. If this standardization is justified, it could be a noteworthy advance with implications for disease-specific PRO measures in QOL research. For example, using the QDIS norming sample, a representative sample of the US chronically ill population (N = 4,120) in their naturally occurring proportions, it is possible to compare the QOL impact that ACS patients attributed to their condition with the impact reported by chronically ill adults in general [7]. Compared to the QDIS norms, ACS patients 1 month after hospital discharge reported, on average, worse QOL than about 80 % of chronically ill US adults.

Our findings supported the invariance property of the pre-calibrated standardized IRT parameters in the independent ACS sample. Generally, we observed very small differences between item and scale scores based on the standardized versus ACS-specific IRT item parameters. We did find consistently larger item slope (a) parameters in the standardization sample (Table 2), indicating that the items were more discriminating there than in the ACS sample (see also the steeper curves in Fig. 3). This may be due to one or more characteristics that differed across the samples, for example, the wording of the disease attribution, the number of items administered, and interviewer- versus self-administration. One implication of the slope parameter differences appears at very high and low theta estimates: for an ACS patient at the low end of the theta scale, the theta estimate using the standardized item parameters was likely to be greater than that using the ACS-LT item parameters, and vice versa at the high end of the scale (Fig. 4). The larger differences at the two ends are likely due to larger measurement errors for extreme theta values. Additionally, the standard errors of ACS patients’ theta estimates calculated from the test information function are likely to be underestimated when the standardized item parameters are used. In contrast, the item threshold (b) parameters were far more consistent across the samples, which explains why the two sets of theta scores were very highly correlated with a strong linear relationship.

We tested IRT parameter invariance via DFIT and compared it with logistic regression analyses. IRT invariance is increasingly important given the growing application of IRT and computerized adaptive testing (CAT) to PRO measures. One major advantage of IRT and CAT is allowing items to be administered in different versions or lengths (e.g., long, short, adaptive, and static forms) while maintaining score comparability. Administering different survey versions across samples presents an additional DIF threat, making IRT invariance an even more prominent question to investigate. In our study, the logistic regression method analyzed the same version of the QDIS-7 across samples, whereas DFIT evaluated IRT item parameters from two different versions (QDIS-49 vs. QDIS-7) across samples. The latter therefore provided more stringent evidence for score comparability, not only across samples but also across survey versions.

We found that DFIT provided additional information. First, it evaluates the practical significance of IRT item parameter differences for score differences. DTF was computed as the mean squared difference (MSD) between the summed scores based on the two sets of IRT item parameters and is therefore expressed in squared units of the QDIS-7 summed score (mean = 15.6, SD = 7.1). In this study, the DTF was 0.37 and its square root 0.6, less than one score point and less than one-tenth of an SD; thus both the practical and the statistical differences tended to be very small. We observed the same small differences for the theta scores. Second, the compensatory and non-compensatory DIF indices provided by DFIT (CDIF and NCDIF) enable evaluation of each item’s relative contribution to the scale score difference. Of note, we found an example of DIF offset: Item 2 actually reduced the scale-level score difference despite its relatively large item-level difference. Developers can utilize such information to cancel out DIF effects when it is not practical to develop a DIF-free scale [17]. Finally, because DFIT does not require item-level response data from both samples, it offers flexibility for researchers who lack item-level data for the development sample but can obtain its IRT item parameters from the published literature (e.g., the PROMIS measures [35, 36]). DFIT also has limitations. One is that it requires item parameters from separate IRT calibrations, which are not readily comparable and must first be linked; if an initial DFIT analysis identifies any item with DIF, a two-stage linking is recommended, in which the scale is purified by a second-stage linking based on the remaining non-DIF items [46, 47]. Another is the limited generalizability of significance tests based on pre-determined cutoffs [48]. Newer methods have recently been developed (e.g., the item parameter replication [IPR] method [49]) and deserve further study.

Our study utilized diverse samples recruited from two large NIH-funded projects. Given that other methods are widely used for examining measurement invariance (e.g., multi-group confirmatory factor analysis [37–39]), comparing the different methods would be desirable. Nevertheless, we evaluated the IRT item slope and threshold parameters, which are analogous to factor loadings and intercepts in the CFA framework, and we examined their joint impact on item, scale, and person scores within the IRT framework upon which the QDIS was developed and scored, which should provide a more direct evaluation than methods in other frameworks. Because the QDIS items were administered using different modes across the samples, it is noteworthy that, although the literature reports little or no impact of different self-administered modes on item responses [40, 41], the interview mode has tended to yield more favorable responses; this may have important implications for interpreting score differences across samples and deserves future investigation.

In conclusion, the seven QDIS items administered by telephone interview in an independent clinical sample showed lower IRT item discrimination parameters than in the Internet-based, self-administered standardization sample. However, differences in their IRT threshold parameters and patient scores were small. Our findings support the use of the QDIS-7 with the developer-provided standardized IRT item parameters in future studies comparing the QOL impact attributed to coronary heart disease with the impact attributed to other chronic conditions.