Introduction

There are over 670,000 individuals with end-stage renal disease (ESRD), or kidney failure, in the United States (U.S.) [1]. ESRD patients must have medical therapy to replace their kidney function to remain living. Over 70% of these patients are on dialysis, where a machine filters toxins from the blood [1]. There is a large disparity in the likelihood of developing ESRD between Black and White individuals. Blacks are 3.9 times more likely to develop ESRD than Whites [1]. Potentially driving this disparity are differences between Black and White patients in prevalence of underlying causes of ESRD, like diabetes and hypertension [1,2,3], and this problem is growing. The incidence of ESRD due to diabetes increased by 33% between 2000 and 2012 for Blacks, but only 2% for Whites during the same period [1].

Though lifesaving, dialysis has been associated with poor health-related quality of life (HRQOL). For example, in a study comparing SF-36 Physical Component Summary (PCS) scores across multiple chronic and infectious conditions including AIDS, epilepsy, diabetes, and others, ESRD patients on dialysis had lower (worse) scores than patients with any other condition except multiple sclerosis [4]. For this reason, significant efforts are made to improve HRQOL in dialysis patients. In their Conditions for Coverage (42 CFR § 494.90), the Centers for Medicare and Medicaid Services (CMS) mandated that each U.S. dialysis patient’s physical and mental health be monitored, and this often occurs with the use of a standardized HRQOL measure [5].

The most common instrument for implementing this requirement is the Kidney Disease Quality of Life 36-item survey (KDQOL-36™) [6]. The KDQOL-36 is a subset of items and scales from a larger, parent instrument, the KDQOL-SF [7]. The KDQOL-SF features 12 scales covering multiple dimensions of HRQOL, including generic HRQOL (SF-36), symptoms associated with kidney disease, effects of kidney disease, ability to work, cognitive function, quality of social interactions, sexual function, sleep quality, social support, encouragement from dialysis staff, and satisfaction with care. The KDQOL-36 short form survey reduces the SF-36 to the SF-12 and retains the Burden of Kidney Disease, Symptoms and Problems of Kidney Disease, and Effects of Kidney Disease scales [6].

Over the past few decades, evidence has emerged suggesting that Black dialysis patients may have better HRQOL than Whites [8, 9]. For example, in the Dialysis Outcomes and Practice Patterns Study (DOPPS), a nationally representative surveillance study of dialysis patients, Black patients had significantly better HRQOL than Whites on SF-36 PCS, Mental Component Summary (MCS), and the Kidney Disease Component Summary (KDCS), a composite of the KDQOL-SF scales [10]. If accurate, these findings could have considerable clinical importance, leading to potentially different treatment decisions for Black and White patients. Low absolute levels of HRQOL on dialysis are often the justification for seeking alternative treatments associated with superior outcomes, like kidney transplantation, or for adjusting the care plan of a patient who remains on dialysis.

Before making clinical decisions based on evidence of differences in HRQOL between Black and White patients, the measurement invariance of HRQOL measures between Blacks and Whites needs to be established. Measurement invariance refers to whether the items within a measure tap the same construct equally well, and ultimately have the same meaning for different subgroups of patients. Violation of measurement invariance occurs when some items are not performing equally across groups, i.e., exhibit differential item functioning (DIF). In order to examine whether reported HRQOL differences between Black and White dialysis patients are genuine, this paper evaluates whether there is measurement invariance between Black and White respondents for the KDQOL-36 scales.

Methods

Patient sample

The data used in this study were collected between 6/1/2015 and 5/31/2016 as part of the Medical Education Institute’s KDQOL-Complete program, a program that supports dialysis providers in assessing their patients with the KDQOL-36 to meet the aforementioned CMS requirement. During this period, 77,072 KDQOL-36 assessments were entered. After omitting duplicate patient records and excluding records with missing assessment dates, incomplete KDQOL-36 assessments, patients who were not adults, or patients who were not on dialysis, 70,786 remained. The details of this patient selection procedure are described elsewhere [6]. Of these, we retained 39,843 patients for analysis: Black = 18,404 and White = 21,439. The UCLA Human Subjects Protection Committee (UCLA IRB #17–000313) granted an IRB exemption.

Measures

The KDQOL-36 items are categorical and are combined to create multi-item scales. A generic scale, the SF-12, is embedded within the KDQOL-36, and its Physical and Mental Health Component Summary measures (PCS and MCS) are scored on a T-score metric (mean = 50, SD = 10 in the U.S. general population). The KDQOL-36 also includes 3 kidney-targeted scales: Burden of Kidney Disease, Symptoms and Problems of Kidney Disease, and Effects of Kidney Disease. Each of these is scored by linearly transforming all items to a 0-100 possible range and averaging the items. KDQOL-36 items are all scored so that higher scores indicate better health.
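The scoring rule for the kidney-targeted scales (linear transformation to 0-100, then averaging) can be illustrated with a short Python sketch. The function names, the 1-5 response coding, the `reverse` flag, and the example responses are assumptions made for illustration; they are not taken from the KDQOL-36 scoring materials, and SF-12 PCS/MCS scoring is not covered here.

```python
def rescale_item(raw, min_code, max_code, reverse=False):
    """Linearly map a raw response code onto the 0-100 range.

    Set reverse=True for items whose highest raw code means worse health,
    so that higher transformed scores always indicate better health.
    """
    if not min_code <= raw <= max_code:
        raise ValueError(f"response {raw} outside {min_code}-{max_code}")
    score = 100.0 * (raw - min_code) / (max_code - min_code)
    return 100.0 - score if reverse else score


def scale_score(item_scores):
    """Average the transformed item scores to obtain the scale score."""
    return sum(item_scores) / len(item_scores)


# Hypothetical responses to a four-item scale with response codes 1-5.
example_responses = [1, 3, 5, 2]
example_items = [rescale_item(r, 1, 5) for r in example_responses]
example_scale = scale_score(example_items)
```

In this made-up example the transformed items are 0, 50, 100, and 25, so the scale score is 43.75.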

In addition to the KDQOL-36, we obtained demographic and clinical data from CMS Form-2728 or from the dialysis centers. These variables included race (Black vs. White), age, whether etiology of ESRD was diabetes, dialysis type [in-center hemodialysis, peritoneal dialysis (PD), conventional home hemodialysis], dialysis access site [arteriovenous (AV) fistula, AV graft, venous catheter, PD catheter], and employment status.

Tests of measurement invariance

Tests of measurement invariance for the KDQOL-36 Burden of Kidney Disease, Symptoms and Problems of Kidney Disease, and Effects of Kidney Disease scales were based on correlated factors confirmatory factor analysis (CFA) models (KDQOL-36 Burden of Kidney Disease, Symptoms and Problems of Kidney Disease, and Effects of Kidney Disease modeled as 3 correlated latent factors). In the multiple CFA models described below, several model parameters are constrained or freed, including factor loadings (correlation between items and latent factors), item category thresholds (standardized score determining which category each individual is in based on item response), item intercepts (when estimated, expected value of each item when the latent factor is equal to 0), and residual variances (variance of observed item responses not explained by the latent factor).
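For orientation, the parameters named above can be placed in the usual latent-response formulation of a CFA for ordered categorical items. The notation below is a sketch chosen for illustration (person i, item j, group g, response category c); it is assumed rather than taken from the Mplus specification used in this study.

```latex
\[
y^{*}_{ijg} = \nu_{jg} + \lambda_{jg}\,\eta_{ig} + \varepsilon_{ijg},
\qquad \varepsilon_{ijg} \sim N\!\left(0,\ \theta_{jg}\right)
\]
\[
y_{ijg} = c \quad \text{if} \quad \tau_{jg,c} < y^{*}_{ijg} \le \tau_{jg,c+1},
\qquad c = 0, \dots, C_{j}-1,
\qquad \tau_{jg,0} = -\infty,\ \ \tau_{jg,C_{j}} = +\infty
\]
```

Here \(\lambda_{jg}\) is the factor loading, \(\nu_{jg}\) the item intercept, \(\tau_{jg,c}\) the category thresholds, and \(\theta_{jg}\) the residual variance. Invariance testing asks which of these can be held equal across g (Black, White) without appreciable loss of fit.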

Tests of sequential, nested, multi-group CFA models for categorical items were conducted using Mplus Version 8 [11]. Based on recent recommendations for testing measurement invariance with categorical data [12], we examined results from multiple model estimators, including robust ML estimation (MLR) and weighted least squares with mean and variance adjusted estimation (WLSMV). For MLR estimation, we used standard approaches to factorial invariance testing. In the results, these models are referred to as MLR models. First, we estimated Configural models wherein all model parameters were freely estimated across groups. This step shows whether the factor structure is the same for each measure across each group. Second, if Configural invariance was found, then Metric invariance was examined by constraining the factor loadings to be equal across groups. Third, if Metric invariance was found, then Scalar invariance was examined by constraining the intercepts as well as the factor loadings to be equal across groups. This step in the procedure determines whether patients from the different groups provide similar ratings on each item.

The approach to model identification and metric setting varies among these models. First, the Configural model sets 1 loading per factor to 1 and sets factor means to 0. The Metric model is similar, except all loadings are constrained to be equal across groups. For the Scalar model, where all factor loadings and intercepts are constrained across groups, factor means are estimated in 1 group (Black). If measurement non-invariance was indicated for any model, the relevant item parameters were sequentially freed until no further non-invariance was found.

For the MLR models, model fit was examined with several fit indices. Good fit was indicated by comparative fit index (CFI) values of 0.95 or above [13], root mean square error of approximation (RMSEA) values of 0.06 or less [13, 14], and standardized root mean square residual (SRMR) values of 0.08 or less [13]. Due to the large sample size used in these analyses, we expect χ2 tests of model fit to be significant even in the presence of good model fit. In each of these series of nested models, the corrected χ2 difference test, change in CFI (ΔCFI), change in RMSEA (ΔRMSEA), and change in SRMR (ΔSRMR) were examined. Each change in goodness of fit index has its own cut-off criterion suggesting non-invariance. Meade et al.’s criteria were used for ΔCFI (≥ − 0.002), which is calculated based on the number of factors and items in the model [15]. We used Chen’s criteria for ΔRMSEA (≥ 0.015 for Metric and Scalar invariance) and ΔSRMR (≥ 0.015 for Metric invariance and ≥ 0.010 for Scalar invariance) [16]. Chen’s criteria for ΔRMSEA and ΔSRMR should be interpreted when there is also a ΔCFI ≥ − 0.01. In examining differences in standardized factor loadings across groups, the standards set by Yoon and Millsap were adopted: ≥ 0.1 = small; ≥ 0.2 = medium; ≥ 0.3 = large [17].

In addition to these approaches, we employed Wu and Estabrook’s method for categorical items with ordered responses [18]. This method holds that the standard measurement invariance approach of adding increasingly strict invariance constraints to the Configural model is not appropriate because the conditions established to identify latent responses’ scales for the Configural Model may not be appropriate for subsequent models (i.e., Metric, Scalar), leading to incorrect conclusions about invariance of models. Instead, Wu and Estabrook recommend selecting a series of nested models wherein the identification conditions and invariance constraints are explicitly identified so that when model parameters are increasingly constrained in invariance testing, the scale of the underlying factors is comparable.

We implemented the Wu and Estabrook method using the WLSMV estimator in Mplus with theta parameterization. In the results, the models are referred to as WLSMV models. We selected a baseline model (WLSMV1) with non-invariant thresholds, loadings, and intercepts, but with invariant residual variances. In addition, some parameters were not compared, including factor covariances (estimated), factor means (invariant), and factor variances (invariant). The metric was set by fixing the first factor loading in the Black group to 1 and fixing factor variances to 1 in the White group, and the model was identified by setting factor means to 0. The second model (WLSMV2) constrains loadings to be invariant. This model sets the first threshold of each item to 1, maintains the same constraints on factor variances as WLSMV1, and also sets the factor means to 0. The third model (WLSMV3) constrains thresholds to be invariant. This model maintains the factor variance constraints and sets factor means to 0. Two models, WLSMV4a and WLSMV4b, hold invariant thresholds, loadings, intercepts, and residual variances. WLSMV4b alters the identification conditions so that factor variances are set to 1 and factor means are estimated in the Black group.

For WLSMV1-WLSMV4b, each model was compared with two different χ2 statistics. The first of these, \({\chi }_{T}^{2}\), is a standard model χ2 calculated as the fit function minimum*(2n), where n is the total sample size [19]. \({\chi }_{T}^{2}\) is known not to be χ2 distributed, and therefore does not lend itself to calculation of p values. The second, \({\chi }_{T3}^{2}\), theoretically should allow for improved p value calculation but is known to misbehave by not increasing monotonically with nested, more restricted, models [20]. The WLSMV1-4b models were compared with a corrected χ2 difference test based on \({\chi }_{T3}^{2}\) in Mplus (DIFFTEST command) [21], but \({\chi }_{T}^{2}\) is also reported to verify increasing lack of fit with increasingly restricted models. For each of these models, RMSEA and CFI based on \({\chi }_{T3}^{2}\) were calculated to indicate model fit. However, comparing these RMSEAs and CFIs for nested models is inappropriate.

Then, we compared model WLSMV4b to a series of models wherein each item’s loadings and thresholds were released item-by-item. In these models, only the loadings and thresholds for the respective model are released; loadings and thresholds were not released cumulatively across the sequence of models. Each of these models was compared statistically to WLSMV4b with the corrected χ2 difference test. In addition, estimated factor means for the Black group from each model were compared to WLSMV4b.
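The non-cumulative, item-by-item release described above can be sketched as a small Python generator that produces one comparison model specification per item. The item labels and their assignment to scales (i13-i16 Burden, i17-i28 Symptoms and Problems, i29-i36 Effects) are assumptions for illustration, as are the dictionary layout and function name; the sketch only enumerates specifications and does not fit any model.

```python
# Assumed item labels and scale membership, for illustration only.
SCALES = {
    "Burden": [f"i{k}" for k in range(13, 17)],
    "Symptoms and Problems": [f"i{k}" for k in range(17, 29)],
    "Effects": [f"i{k}" for k in range(29, 37)],
}


def release_specs(scales):
    """Yield one specification per item.

    In each specification only the named item's loadings and thresholds
    are free across groups; every other item stays invariant, so releases
    do not accumulate from one model to the next.
    """
    all_items = [item for items in scales.values() for item in items]
    for item in all_items:
        yield {
            "released": item,
            "invariant": [other for other in all_items if other != item],
            "compare_to": "WLSMV4b",
        }


specs = list(release_specs(SCALES))
```

Each specification would then be estimated and compared with the fully invariant model, both by the corrected χ2 difference test and by the change in estimated factor means.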

Results

Participants

There were some differences in background characteristics between Black and White patients (Table 1). White patients tended to be older, more likely to use peritoneal dialysis, and more likely to be retired. A higher proportion of Black patients were unemployed.

Table 1 Patient characteristics by race

Measurement invariance with MLR models

Table 2 shows the loadings for White versus Black patients for KDQOL-36 Burden, Symptoms and Problems, and Effects of Kidney Disease scales from the Configural MLR model. Many loadings were very similar across the racial groups, and all differences fell below the threshold for a small difference (< 0.1) [17]. Table 3 shows formal tests of measurement invariance from the MLR models. The Configural model (MLR1) showed good fit on RMSEA and SRMR but fell below the standard for CFI: χ2 = 37,293.22, df = 498, p < 0.001; CFI = 0.886; RMSEA = 0.054; and SRMR = 0.041. In comparison to the Configural model, the Metric model (MLR2) constrained 21 factor loadings to be equal between the Black and White groups to test the factor loading invariance hypothesis. This resulted in a model with similar fit to the Configural model: χ2 = 37,620.77, df = 519, p < 0.001; CFI = 0.885; RMSEA = 0.053; and SRMR = 0.042. Finally, in comparison to the Metric model, the Scalar model (MLR3) estimated 3 additional factor means and 24 fewer intercepts: χ2 = 39,681.72, df = 540, p < 0.001; CFI = 0.879; RMSEA = 0.053; and SRMR = 0.044. The change in CFI from the Metric to the Scalar model (− 0.006) exceeded the ΔCFI cut-off (− 0.002), suggesting non-invariance. Therefore, to test partial Scalar invariance, the intercepts were freed for the KDQOL-36 Burden scale (MLR4), and for the KDQOL-36 Burden and Symptoms and Problems scales simultaneously (MLR5). The second of these models demonstrated partial Scalar invariance.
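The fit decisions for the MLR sequence can be retraced with a few lines of Python using the fit indices reported above. The function names are hypothetical, and treating a CFI drop of 0.002 or more as the flag for non-invariance is one reading of the ΔCFI criterion; the handling of the exact boundary value is an assumption.

```python
# (model, CFI, RMSEA, SRMR) as reported for the MLR sequence.
MLR_FITS = [
    ("MLR1 Configural", 0.886, 0.054, 0.041),
    ("MLR2 Metric", 0.885, 0.053, 0.042),
    ("MLR3 Scalar", 0.879, 0.053, 0.044),
]


def absolute_fit(cfi, rmsea, srmr):
    """Check each index against the good-fit cut-offs used in this paper."""
    return {"CFI": cfi >= 0.95, "RMSEA": rmsea <= 0.06, "SRMR": srmr <= 0.08}


def delta_cfi_flags(fits, cutoff=-0.002):
    """Return (delta CFI, flagged?) for each model versus the previous one."""
    flags = {}
    for (_, prev_cfi, _, _), (name, cfi, _, _) in zip(fits, fits[1:]):
        delta = round(cfi - prev_cfi, 3)  # rounded to the reported precision
        flags[name] = (delta, delta <= cutoff)
    return flags


configural_fit = absolute_fit(*MLR_FITS[0][1:])
cfi_flags = delta_cfi_flags(MLR_FITS)
```

With these inputs, the Configural model passes on RMSEA and SRMR but not on CFI, the Metric step gives ΔCFI = −0.001 (not flagged), and the Scalar step gives ΔCFI = −0.006 (flagged).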

Table 2 Configural model standardized factor loadings for White versus Black patients on KDQOL-36 Burden, Symptoms and Problems, and Effects of Kidney Disease scales—robust maximum likelihood estimation
Table 3 Measurement invariance tests of White versus Black patients for KDQOL-36 Burden, Symptoms and Problems, and Effects of Kidney Disease scales—robust maximum likelihood estimation

Measurement invariance with WLSMV models

For the Wu and Estabrook method, model WLSMV1 estimated 45 factor loadings, 24 item intercepts, 168 thresholds, 3 factor variances, and 6 factor covariances with \({\chi }_{T3}^{2}\) = 70,723.93, df = 498, p < 0.001 (Table 4). Model WLSMV2 estimated 21 fewer factor loadings than WLSMV1, but was otherwise the same. Model WLSMV2’s \({\chi }_{T3}^{2}\) was smaller than WLSMV1’s: \({\chi }_{T3}^{2}\) = 49,577.68, df = 519, p < 0.001. In comparison to WLSMV2, WLSMV3 estimated 72 fewer thresholds. WLSMV3’s \({\chi }_{T3}^{2}\) was larger than WLSMV2’s: \({\chi }_{T3}^{2}\) = 51,652.15, df = 591, p < 0.001. Model WLSMV4a estimated 24 fewer intercepts than WLSMV3, and its \({\chi }_{T3}^{2}\) was smaller: \({\chi }_{T3}^{2}\) = 49,519.66, df = 615, p < 0.001. Model WLSMV4b estimated the same number of parameters as WLSMV4a, but freed 3 factor means in the Black group and constrained 3 factor variances. This model had a smaller \({\chi }_{T3}^{2}\) than WLSMV3: \({\chi }_{T3}^{2}\) = 44,970.01, df = 615, p < 0.001. As expected, \({\chi }_{T}^{2}\) values were ordered and ascending from WLSMV1-WLSMV4b, indicating worse fit with increasingly constrained models. All corrected χ2 difference tests were statistically significant, RMSEA ranged between 0.051 and 0.074, and CFI ranged between 0.925 and 0.953, indicating borderline or good fit (Table 4).
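As a consistency check on the parameter counts above, the degrees of freedom of each WLSMV model can be rebuilt from the baseline by adding the number of parameters no longer estimated at each step. The Python sketch below uses only the counts and statistics reported in this section; the variable and function names are hypothetical.

```python
BASELINE_DF = 498  # WLSMV1

# Net number of parameters no longer estimated, relative to the previous model.
CHANGES = [
    ("WLSMV2", 21),      # 21 fewer factor loadings
    ("WLSMV3", 72),      # 72 fewer thresholds
    ("WLSMV4a", 24),     # 24 fewer intercepts
    ("WLSMV4b", 3 - 3),  # 3 variances constrained, 3 factor means freed
]

REPORTED_DF = {"WLSMV1": 498, "WLSMV2": 519, "WLSMV3": 591,
               "WLSMV4a": 615, "WLSMV4b": 615}

# Reported chi-square (T3) values for WLSMV1 through WLSMV4a.
CHI2_T3 = [70723.93, 49577.68, 51652.15, 49519.66]


def implied_df(baseline, changes):
    """Accumulate degrees of freedom across the nested sequence."""
    out = {"WLSMV1": baseline}
    df = baseline
    for name, fewer_parameters in changes:
        df += fewer_parameters
        out[name] = df
    return out


def is_nondecreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


df_by_model = implied_df(BASELINE_DF, CHANGES)
t3_is_monotone = is_nondecreasing(CHI2_T3)
```

The implied degrees of freedom match the reported values, and the check on \({\chi }_{T3}^{2}\) confirms that this statistic does not increase monotonically across the increasingly restricted models.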

Table 4 Measurement invariance tests of White versus Black patients for KDQOL-36 Burden, Symptoms and Problems, and Effects of Kidney Disease scales—Wu and Estabrook method using weighted least squares with mean and variance adjustment estimation

Table 5 shows sequential item-by-item released models (loadings and thresholds) in comparison to model WLSMV4b. With the exception of the i27 released model, all corrected χ2 difference tests were significant at p < 0.001. However, differences in factor means were small.

Table 5 Difference in factor means between fully invariant and released loadings and thresholds item-by-item models using weighted least squares with mean and variance adjustment estimation

Discussion

A recent report provided support for the dimensional structure of the KDQOL-36 among U.S. dialysis patients [6]. Using the same dataset, this study found minimal impact of DIF between Black and White patients on KDQOL-36 score estimation. Though some DIF was observed, the KDQOL-36 scales have largely similar measurement properties for U.S. Black and White dialysis patients. These results indicate that substantive comparisons between Black and White patients using the KDQOL are appropriate.

In the MLR analyses, Configural and Metric invariance were observed, indicating that each KDQOL-36 item has similar importance in its contribution to measuring HRQOL among Black and White patients. This level of invariance is most important for allowing valid substantive comparisons of HRQOL between Black and White patients. However, only partial Scalar invariance was evidenced, indicating that Black and White patients may start from significantly different levels of HRQOL when responding to the KDQOL-36 items. This concern is not to be overlooked, though it is somewhat allayed when the WLSMV models are considered.

The WLSMV models showed mixed results regarding model fit with increasing invariance constraints. The \({\chi }_{T}^{2}\) values increased across these models (indicating worse fit), while the values of the fit indexes decreased (indicating better fit), though these fit indexes may not be valid for comparison of the fit of nested models. However, when any one item’s loadings and thresholds were released, changes in factor means were trivial. A similar set of results was found when the loadings and thresholds of each kidney-targeted KDQOL scale were released simultaneously. These results indicate that whatever DIF was present does not seem to have an appreciable impact on average levels of HRQOL as estimated by the KDQOL-36. Nonetheless, additional studies examining measurement invariance between Black and White dialysis patients on the KDQOL-36 scales would be useful to provide further evidence.

A previous study examined measurement invariance of the KDQOL-SF scales between Veterans and non-Veterans [22]. That study found that there may be DIF on some of the kidney-targeted scales in the KDQOL survey. As in the current study, these scales evidenced Metric invariance (invariant factor loadings), but not Scalar invariance, indicating differences in intercepts between groups. The authors concluded that additional study is needed before DIF between Veterans and non-Veterans could be confirmed.

Black patients had higher estimated factor means in comparison to Whites, indicating better HRQOL. These results are consistent with the results of the DOPPS study, which found that Black dialysis patients had higher scores on the KDQOL Burden and Effects of Kidney Disease scales [9]. Additionally, another large study, the Hemodialysis (HEMO) Study, which included 1,156 Black and 657 non-Black dialysis patients, found that Black patients had significantly higher scores on the KDQOL Symptoms and Problems of Kidney Disease, Effects of Kidney Disease, and Burden of Kidney Disease scales in comparison to non-Black patients [23]. One smaller study of Black and White dialysis patients found that Whites perceived dialysis to be more burdensome than Black patients and were less satisfied with their lives on dialysis [24]. This evidence indicates that Black dialysis patients have better HRQOL than White dialysis patients.

Like all studies, this study has limitations to consider when interpreting its results. First, at 0.886, the CFI value for model MLR1 (Configural model) is somewhat lower than the threshold indicating good fit (≥ 0.95). In tests of measurement invariance, severe misfit of the Configural model can lead to non-meaningful results from subsequent nested models. We argue that though the value is somewhat low, it does not indicate severe misfit, especially considering RMSEA = 0.054 and SRMR = 0.041. Therefore, despite this lower than ideal CFI value, estimation of additional models in this paper is valid. Next, though sample size was very large, the patients participating in this study are a convenience sample and may be subject to bias. Nationally representative studies of dialysis patients’ HRQOL could be used to confirm our findings from the current study. Finally, as we have pointed out elsewhere, the data used in this study were not collected for the study’s objectives [6].

Conclusion

This paper examined whether the KDQOL-36, a widely used HRQOL measure among dialysis patients, had equivalent measurement properties between Black and White patients. Since we found negligible impact of measurement non-invariance on KDQOL-36 factor mean estimation between these groups, substantive comparisons of HRQOL between Black and White dialysis patients can be made using the KDQOL-36. The ability to make such comparisons is critical for epidemiological surveillance of dialysis patients’ health, and to use in interventions to improve HRQOL among dialysis patients.