Introduction

End-stage renal disease (ESRD) is a major public health problem worldwide [1], with its prevalence increasing by 8 % from 2007 to 2012 [2]. Although maintenance hemodialysis (HD) is an effective life-preserving treatment for patients with ESRD [3], it places a substantial burden on patients and their families [4] and adversely affects patients’ health-related quality of life (HRQoL) [5–8]. While previous studies have mainly focused on improving survival, optimizing HRQoL is now increasingly investigated [4, 8]. Measuring HRQoL in patients on HD is important because patients with lower HRQoL scores have higher rates of mortality and hospitalization [9–13]. In addition, measuring HRQoL provides valuable information about the effect of disease and its treatment on the physical, psychological and social well-being of patients with chronic conditions [14].

Several biological and psychological factors, such as malnutrition, concomitant diseases, sleep complaints, anemia, anxiety, depression, dependency on caregivers, uncertainty about the future, and difficulties in obtaining and maintaining employment, contribute to impaired HRQoL in patients on HD [4, 8, 9, 15–19]. Accordingly, previous studies have shown that patients on HD rate their HRQoL lower not only than healthy people [17, 20–25] but also than patients with renal transplant, breast cancer, colon cancer, and leukemia [26]. However, these cross-group comparisons are meaningful and valid only if a prerequisite assumption, termed measurement equivalence, holds [27]. Measurement equivalence of an instrument, which is examined by differential item functioning (DIF) analysis, means that respondents from different groups interpret the items of an HRQoL questionnaire similarly, given the same level of underlying HRQoL [27, 28]. If measurement equivalence does not hold, it is unclear whether an observed disparity in the HRQoL scores of the studied groups reflects a real difference in the underlying construct of interest or different interpretations of the questionnaire items [27, 29].

The SF-36 is one of the most frequently used questionnaires to measure HRQoL of patients on HD, and its reliability and validity have been confirmed in several studies [30–35]. Although previous research has evaluated the measurement equivalence of the SF-36 across people with different cultural and socio-demographic backgrounds and with different chronic conditions [36–42], it has never been examined across patients on HD and healthy people. Hence, it is not clear whether the disparity in their HRQoL scores reported in previous studies is a real difference. This study aimed to assess whether patients on HD and healthy people perceive the meaning of the SF-36 items differently, using the multiple-group multiple-indicator multiple-causes (MG-MIMIC) model.

Methods

Patients and measurements

A total of 150 patients on HD from the Sadra dialysis center and two affiliated hospitals of Shiraz University of Medical Sciences in Shiraz (southern Iran) were recruited from July 2011 to December 2011. Patients aged over 18 years who had been receiving HD three times per week for at least 6 months were included in our study [43]. A simulation study showed that when the sample size in the focal group (generally the patient group) is around 100, the sample size in the reference group (generally the healthy group) should be greater than 500 for MIMIC results to have an adequate type I error rate and power [44]. Therefore, we randomly selected 642 healthy individuals aged over 18 from a general healthy population; those who had any chronic disease or health condition were excluded from the study. Written informed consent was obtained from the participants prior to enrollment in the study. The study was approved by the ethical committee of our institution, Shiraz University of Medical Sciences. The Persian version of the SF-36, which was previously translated and validated in Iran [45, 46], was filled out by both healthy individuals and patients on HD after they had completed the consent form.

The SF-36 is a well-known generic questionnaire that consists of eight subscales: physical functioning (ten items), role limitations due to physical problems (four items), bodily pain (two items), general health perceptions (five items), vitality (four items), social functioning (two items), role limitations due to emotional problems (three items), and perceived mental health (five items). Each subscale was transformed to a 0–100 scale, assuming equal weight for each item, with 0 indicating the worst HRQoL and 100 the best possible score. T-scores, a linear transformation of the 0–100 range with a mean of 50 and a standard deviation of 10 for every subscale, were then used in subsequent analyses.
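The two-step scoring described above can be sketched as follows. This is a minimal illustration, not the scoring routine used in the study: it assumes every item has already been recoded so that a higher response means better health, and the reference mean and standard deviation passed to `t_score` are hypothetical values, since the reference distribution is not stated here.

```python
def subscale_score(items):
    """Equal-weight 0-100 subscale score.

    items: list of (response, lowest, highest) tuples, one per item,
    already recoded so that a higher response means better health
    (an assumption of this sketch).
    """
    rescaled = [100.0 * (resp - lo) / (hi - lo) for resp, lo, hi in items]
    return sum(rescaled) / len(rescaled)


def t_score(score_0_100, ref_mean, ref_sd):
    """Linear transformation to a scale with mean 50 and SD 10.

    ref_mean and ref_sd describe a reference distribution of 0-100
    scores; the values used below are hypothetical.
    """
    return 50.0 + 10.0 * (score_0_100 - ref_mean) / ref_sd


# Two hypothetical three-level items: best answer on one, worst on the other.
raw = subscale_score([(3, 1, 3), (1, 1, 3)])
standardized = t_score(raw, ref_mean=50.0, ref_sd=20.0)
```

Because both steps are linear, the ordering of respondents within a subscale is unchanged; only the unit of measurement differs.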

Statistical analysis

In this study, the MG-MIMIC model was used to assess DIF. From a technical point of view, identifying DIF with the MG-MIMIC model is equivalent to a multiple-group confirmatory factor analysis (MG-CFA) with covariates [47]. Two types of DIF can occur, namely uniform and non-uniform DIF. Uniform DIF means that the difference in item response probabilities between the two groups is constant across the scale. Non-uniform DIF occurs when the direction of DIF differs in different parts of the construct scale [48, 49]. Uniform DIF was detected when the thresholds of a categorical item differed significantly between the two groups, and non-uniform DIF was identified when the loadings of an item differed significantly across the two groups [50]. DIF detection was an iterative procedure that compared a series of nested models. In the first step, the most constrained model, with the factor loadings, thresholds and residual variances of all items, the variance of the latent trait, and the scaling factor held equal across groups, was fitted as the baseline model. In this model, the only parameter estimated freely in both groups was the mean of the latent variables. Modification indices associated with threshold parameters and factor loadings were used to detect items with uniform and non-uniform DIF. In the next step, a model that relaxed the constraint on the threshold of the item with the largest threshold-related modification index was estimated and compared to the baseline model. Then, a model that relaxed the constraint on the factor loading of the item with the largest loading-related modification index was fitted and compared to the baseline model. Uniform or non-uniform DIF was determined by comparing the reduction in the Chi-square statistic achieved by these two refinements. If relaxing the threshold parameters produced the larger improvement and the improvement was significant, uniform DIF was detected; if relaxing the factor loading produced the larger improvement, non-uniform DIF was identified. The resulting model was taken as the new baseline model and fitted, its modification indices were examined, and the above-mentioned steps were repeated until no significant model modification remained.
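The control flow of this iterative search can be sketched as below. The sketch is a simplification under stated assumptions: `fit_model` is a hypothetical callback standing in for refitting the model in the SEM software, the modification index is used as a proxy for the Chi-square reduction rather than refitting both candidate models, and the 3.84 cut-off (the 5 % critical value of a Chi-square distribution with one degree of freedom) is an assumed significance criterion that is not specified above.

```python
CHI2_CRIT_1DF = 3.84  # assumed cut-off: 5% critical value, chi-square with 1 df


def detect_dif(fit_model, max_steps=50):
    """Iteratively release the constraint with the largest modification index.

    fit_model(freed) is a hypothetical callback: given the set of
    (kind, item) constraints already released, it returns a dict that maps
    each (kind, item) to its modification index, with kind being
    "threshold" or "loading".
    """
    freed = set()
    flagged = {}
    for _ in range(max_steps):
        indices = {k: v for k, v in fit_model(freed).items() if k not in freed}
        if not indices:
            break
        key, value = max(indices.items(), key=lambda kv: kv[1])
        if value < CHI2_CRIT_1DF:
            break  # no remaining modification is significant
        freed.add(key)
        kind, item = key
        label = "uniform" if kind == "threshold" else "non-uniform"
        flagged.setdefault(item, set()).add(label)
    return flagged


def toy_fit(freed):
    # Fixed, made-up modification indices; a real fit would change after
    # each constraint is released.
    return {("threshold", 7): 12.0, ("loading", 3): 9.5, ("threshold", 2): 1.1}


flagged = detect_dif(toy_fit)
```

With the made-up indices, the search releases the threshold of item 7 first, then the loading of item 3, and stops when the largest remaining index falls below the cut-off.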

In our study, age and gender were treated as confounding variables since they differed significantly between patients and healthy individuals. We fitted the MG-MIMIC models in Mplus 6.1 using the mean- and variance-adjusted weighted least squares (WLSMV) estimation procedure, which is proposed for ordinal indicators.

To compare the SF-36 subscale scores between the healthy population and patients on HD, Student’s t test was used and Cohen’s d effect size was calculated. Effect sizes of <0.2, 0.2–0.49, 0.5–0.79, and >0.8 were defined as negligible, small, moderate, and large, respectively [51]. To assess the impact of DIF, the items with uniform DIF were removed from the corresponding subscale and the subscale score was recomputed.
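The effect size calculation and its interpretation can be sketched as follows. The pooled standard deviation is an assumption, since the exact form of Cohen's d is not stated above, and treating the band limits as half-open intervals (so that exactly 0.8 counts as large) is likewise a choice made for this illustration.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def label_effect(d):
    """Map |d| to the bands used in the text (boundary handling assumed)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "moderate"
    return "large"


# Made-up scores, for illustration only.
d = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

For example, `label_effect(0.51)` returns "moderate" and `label_effect(0.15)` returns "negligible".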

Results

Descriptive statistics

The mean (±SD) age of the healthy population was 41.51 ± 13.10 years, and 248 of the 642 healthy individuals (38.6 %) were male. Table 1 describes the demographic characteristics of the patients on HD in our study. As shown, the patients on HD were significantly older than the healthy individuals (p value < 0.001), and the proportion of males was significantly higher among patients than among healthy individuals (p value = 0.048).

Table 1 Demographic characteristics of patients on HD

DIF assessment

Table 2 shows the item parameters, including factor loadings and thresholds, estimated by the MG-MIMIC model for items identified with DIF across patients on HD and the healthy general population. As mentioned earlier, uniform DIF is indicated by threshold parameters that differ between the groups, and non-uniform DIF by factor loadings that differ. Overall, 16 out of 36 (44.4 %) items were identified with DIF. Six of the 16 DIF items (37.5 %) were flagged with uniform DIF, nine (56.2 %) with non-uniform DIF, and one (6.2 %) with both uniform and non-uniform DIF. In the physical functioning subscale, four items (#3, #5, #9, and #12) showed non-uniform DIF and two items (#7 and #11) showed uniform DIF. One non-uniform DIF item (#19) was associated with the role limitation due to emotional problems subscale, and two items, one with uniform DIF (#23) and one with non-uniform DIF (#29), were associated with the vitality subscale. In addition, in the perceived mental health subscale, one item (#28) showed uniform DIF, one (#30) non-uniform DIF, and one (#26) both uniform and non-uniform DIF. The MG-MIMIC model identified one item (#20) with non-uniform DIF in the social functioning subscale, and two items (#1 and #35) with uniform DIF and one item (#33) with non-uniform DIF in the general health subscale. No DIF items were detected in the role limitation due to physical problems and bodily pain subscales.

Table 2 Item parameters and standard errors for uniform and non-uniform DIF items in the SF-36 questionnaire across patients on HD and healthy general population

For the physical functioning subscale, the factor loadings of items 3 (Vigorous activities, such as running, lifting…), 5 (Lifting or carrying groceries), and 12 (Bathing or dressing yourself) were greater for healthy individuals than for patients on HD, indicating that these three items discriminate better between respondents below and above a certain level of underlying physical functioning among healthy people than among patients on HD. In contrast, the factor loading of item 9 (Walking more than a kilometer) was greater for patients than for healthy people, meaning that this item discriminates better between respondents below and above a certain level of underlying physical functioning among patients on HD than among healthy people. In addition, in this subscale the threshold parameters of items 7 (Climbing one flight of stairs) and 11 (Walking 100 m) were smaller for patients on HD than for healthy people, implying that given a higher level of underlying physical functioning, patients on HD probably reported fewer problems than healthy individuals. The factor loadings of item 19 (Didn’t do work or other activity as carefully as usual) in the role limitation due to emotional problems subscale, items 26 (Felt calm and peaceful) and 30 (Been a happy person) in the perceived mental health subscale, and item 20 (Extent health problems interfered with normal social activities) in the social functioning subscale were greater for patients on HD than for the healthy population. However, the reverse pattern was observed for item 29 (Feel worn out) in the vitality subscale, for which the factor loading was greater for healthy individuals.

Furthermore, the threshold parameters of item 23 (Feel full of life) in the vitality subscale, item 26 (Felt calm and peaceful) in the perceived mental health subscale, and item 1 (In general, is your health…?) in the general health subscale were smaller for healthy people than for patients on HD, meaning that given a higher level of the underlying construct, healthy individuals probably reported fewer problems than patients on HD. This pattern was reversed for item 28 (Felt down) in the perceived mental health subscale and item 35 (I expect my health to get worse) in the general health subscale.

Impact of DIF

Table 3 presents the HRQoL scores (mean ± SD) of the general healthy population and patients on HD in each subscale of the SF-36 before and after removing the items with uniform DIF. Before removing DIF items, the HRQoL score of patients was significantly lower than that of healthy individuals in all subscales except role limitation due to emotional problems, which did not differ significantly between the groups. The effect sizes showed that the discrepancy was negligible for role limitation due to emotional problems and perceived mental health (0.02 and 0.15, respectively), small for role limitation due to physical problems, bodily pain, general health perception, vitality, and social functioning (effect sizes ranged from 0.23 to 0.39), and moderate for physical functioning (0.51).

Table 3 Mean comparison of the HRQoL scores in the SF-36 questionnaire’s subscale between patients on HD and healthy general population before and after DIF calibration

After removing the items with uniform DIF, the HRQoL scores of the patients on HD remained significantly lower for the physical functioning and vitality subscales, with moderate (0.55) and negligible (0.18) effect sizes, respectively. We did not remove the items with uniform DIF in the social functioning subscale because this subscale has only two items. In addition, in the general health and perceived mental health subscales, the effect of one item with uniform DIF can be canceled out by another uniform DIF item in the opposite direction, as indicated by the threshold coefficients reported in Table 2, so the scores of these subscales were not recomputed either.

Discussion

The results of this study help researchers determine whether the SF-36 is an invariant measure for cross-group comparison across patients on HD and healthy people, an issue that has never been investigated in previous studies. Our results revealed that 16 out of 36 items were flagged with DIF, implying that patients on HD and healthy individuals perceived the meaning of these items differently.

For uniform DIF items, the threshold parameters of the items were greater or smaller for patients on HD than for healthy people, suggesting that given a certain level of underlying HRQoL, patients on HD reported more or fewer problems than healthy people. For non-uniform DIF items, the factor loadings of the items were greater or smaller for healthy people than for patients on HD, implying that these items discriminated more or less well between respondents below and above a certain level of underlying HRQoL among healthy people than among patients on HD.
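The distinction between the two DIF types can be illustrated numerically with a simplified two-category item under a probit link, in which the probability of a response above the threshold is Φ(λη − τ) for loading λ, threshold τ and latent level η. This is only a sketch: the SF-36 items are ordinal with several thresholds, and all parameter values below are hypothetical rather than taken from Table 2.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def response_prob(eta, loading, threshold):
    """P(response above the threshold | latent level eta), probit link."""
    return normal_cdf(loading * eta - threshold)


etas = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Uniform DIF: same loading, different thresholds in the two groups.
uniform_gap = [
    response_prob(e, 1.0, -0.5) - response_prob(e, 1.0, 0.5) for e in etas
]

# Non-uniform DIF: same threshold, different loadings in the two groups.
nonuniform_gap = [
    response_prob(e, 1.5, 0.0) - response_prob(e, 0.7, 0.0) for e in etas
]
```

In this toy setting the gap between groups keeps the same sign at every latent level when only the thresholds differ, whereas it changes sign across the latent scale when the loadings differ.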

The largest share of DIF items (six out of 16) was associated with the physical functioning subscale. The different perception of these items by patients on HD as opposed to the healthy population may partly be explained by fatigue, reduced energy, limitation in the use of an arm or leg, and restriction of daily activities, which are frequent among patients on HD [8, 52, 53]. Furthermore, in the vitality subscale, patients on HD and healthy individuals responded differently to two out of four items. This discrepancy could be attributed to fatigue and exhaustion, which prevail in patients on HD. Interestingly, previous research showed that vitality grows in proportion to physical health more than to mental health [54]. In both the perceived mental health and general health subscales, three out of five items were identified with DIF. A possible explanation is that patients on HD often suffer from psychological problems; a high prevalence of depression and anxiety among patients on HD was reported in previous research [16, 17, 55]. Hence, these problems can adversely affect their perception of the HRQoL concept.

In contrast, surprisingly, patients on HD and the healthy population perceived the meaning of the items in the role limitations due to physical problems and bodily pain subscales similarly. From a cognitive and psychological viewpoint, this may be because patients on HD change their personal standards or expectations when assessing their daily function. Moreover, the adverse effect of their chronic disease may become blunted because of psychological adaptation [8, 50, 52]. With regard to bodily pain, it is worth mentioning that patients on HD can manage their pain and have a higher pain threshold [52]. Since this is the first study appraising DIF across patients on HD and healthy people, there were no comparable studies in this context. However, our findings are in accordance with those of previous studies indicating that people's health status can affect the way they respond to the items of HRQoL questionnaires [36–42]. For instance, in previous research assessing DIF of the SF-36 across healthy adults with and without functional limitations (e.g., deafness and blindness), almost all of the items in the physical functioning subscale exhibited DIF [37]. Dallmeijer et al. [38] also reported that certain items of the SF-36 showed DIF across patients with stroke, multiple sclerosis and amyotrophic lateral sclerosis. Moreover, in another study that examined DIF in the SF-36 across patients with different chronic conditions (namely, hypertension, rheumatic conditions, diabetes, respiratory diseases, and depression), a few items with DIF were identified, even though the effect of DIF on cross-group comparison was minimal [42].

It should be noted that detecting DIF is one issue and its effect on the subscale scores is another. In the present study, we used a removing and retaining strategy to assess whether mean group differences in subscale scores change statistically with and without the DIF items. Since removing a large number of items with DIF could affect the content validity of the measure, we only removed those with uniform DIF [56–58]. Our findings revealed that removing or retaining items with DIF in the physical functioning and vitality subscales did not substantially change the results. For the other subscales, DIF cancelation occurred at the subscale level, so there was no need to apply the removing and retaining strategy. However, a recent study reported that full DIF cancelation rarely arises in practice [59]. Violation of the measurement equivalence assumption in the present study, which indicates that the SF-36 functioned differently across the healthy people and the patients on HD, could be a threat to the generalizability of the measure to heterogeneous groups [60]. Hence, considerable caution is warranted when using the SF-36 to compare HRQoL scores across the two groups. However, it should be mentioned that the SF-36 is a well-known questionnaire with sound psychometric properties in different samples and languages [30–35, 45, 46]; consequently, detecting DIF does not necessarily undermine the validity of the SF-36 when the aim is to measure HRQoL in a specific group [60].

An important strength of our study is that the impact of age and gender as confounding variables on the DIF analysis was controlled by the MG-MIMIC method, which is a distinct advantage of MG-MIMIC over other methods, such as item response theory (IRT) or MG-CFA [61]. This matters because, without controlling for the effect of confounding variables, the DIF detection procedure and the subsequent group comparison of HRQoL scores may be distorted. Although ordinal logistic regression (OLR) and MIMIC models also possess this advantage, each has its own limitations. In the MIMIC model, only uniform DIF can be detected [61], and in OLR, the observed score, not the latent one, is used in the DIF detection procedure [62].

Another advantage of the MG-MIMIC model is that the purification of anchor items is built into the DIF detection process. Purification is an iterative process of removing items currently identified as DIF with the aim of obtaining a purified set of items, unaffected by DIF. This issue is important since overlooking purification results in over- or under-estimation of the number of DIF items [47, 50]. In the MG-MIMIC approach, the parameters that detect and control for DIF are estimated in an iterative process, so the underlying HRQoL scores were purified effectively in this study [47].

Our study had some limitations that merit attention when interpreting our findings. First, the effects of other demographic variables such as income and education on the DIF analysis were not taken into account in this study owing to a lack of information in this regard. Previous research showed that higher levels of education and income can lead to improvement in the HRQoL scores of patients on HD [52, 63]. Furthermore, it is generally accepted that anxiety and depression are the most common psychological disorders that patients on HD suffer from [17, 18, 53, 63–65]; this may influence their perception of the items of HRQoL questionnaires. Therefore, further in-depth studies should be undertaken to evaluate the effects of these factors on the DIF analysis of the SF-36 questionnaire in these two groups.

In conclusion, our findings revealed that the SF-36 is not an invariant measure across patients on HD and healthy people; hence, caution is warranted when comparing HRQoL scores between healthy people and patients on HD. As researchers have recently aimed for measures that function similarly across heterogeneous populations, we recommend revising the items that exhibit DIF so that the questionnaire can be used across different groups. In addition, identifying the underlying causes of DIF using qualitative methodology, such as cognitive interviewing techniques, can provide valuable information regarding how the DIF items are interpreted differently by patients on HD and healthy people. Finally, DIF detection may vary from measure to measure and from one sample to another. Hence, future studies should replicate our study of the SF-36 across people with other health conditions and across different cultures, or extend it to other generic HRQoL instruments such as GHQ and WHOQOL-BREF.