Purpose

Latinos are individuals born into or descended from a Spanish-speaking community [1, 2]. Latinos as a racial/ethnic group stand apart from other racial/ethnic groups in terms of their population growth, socioeconomic characteristics, and health-related issues [3–5]. The US Census Bureau projects that the proportion of Latinos in the general United States population will increase dramatically because of immigration, greater longevity, higher birth rates, and lower infant mortality rates. In fact, the Latino population in the United States was 5 % of the total population in 1970, 12 % in 2000, and 16 % in 2010 [6–8]. The Census Bureau estimates that by 2050, 30 % of the United States population will be Latino [8–10]. A larger percentage of the elder population will consist of minorities, especially Latinos, the fastest-growing minority group among older adults [11].

Physical functioning is an especially important indicator of health for older individuals and one of the strongest predictors of health care utilization and mortality [12–14]. A physical functioning item bank was created for the Patient-Reported Outcomes Measurement Information System (PROMIS®) project [14]. Because interpretation of results requires equivalent responses for different subgroups, differential item functioning (DIF) was previously evaluated for age, gender, and educational attainment for English-language respondents [15]. This paper extends that work by evaluating DIF for the PROMIS® Spanish- and English-language physical functioning items.

Methods

Data sources and measures

English language

The PROMIS® wave 1 English-language physical functioning bank consists of 124 items that assess mobility (lower extremity), dexterity (upper extremity), axial or central (neck and back function), and complex activities that overlap more than one domain (daily living activities) [12–14]. Because of the large number of physical functioning items, two sets of 56 items (112 items) were administered to a randomly selected subgroup of English-language individuals and one set of 56 items to another subgroup (another PROMIS® item bank of 56 items was also administered to this subgroup). The analysis sample consisted of the 728 and 776 individuals in the first and second subgroups that completed at least half of the physical functioning items they were administered (n = 1,504 overall).

Respondents were recruited by YouGovPolimetrix, a polling firm based in Palo Alto, CA [16]. This firm uses a sample-matching procedure to select a representative sample of the population [17]. The demographic characteristics of the PROMIS® online panel were found to be similar to those of the United States census, except that the online panel tended to include more educated individuals [18].

Spanish language

The PROMIS® physical functioning item bank was translated into Spanish using a universal approach for translations and cultural adaptation of instruments [19–22]. The process involved 2 initial forward translations, 1 reconciled version, 1 back translation by a native English speaker, comparison of original with back translation, and reviews by 3 bilingual experts from different Spanish-speaking countries. Fifteen cognitive interviews with native Spanish speakers followed to evaluate the comprehension of the items. The items were divided into three groups, and each group of items was evaluated by five subjects.

Only 114 of the 124 items were administered to the Spanish-language sample. The 10 items that were excluded were those for which the spread of responses was restricted in the English sample, requiring collapsing response categories for analyses. The Spanish-translated items were administered to 640 adult Spanish-speaking Latinos in the Toluna online panel, an independent survey technology provider [23]. All 640 respondents answered all of the 114 physical functioning items administered to the Spanish sample.

Statistical analysis

We estimated item frequencies, means and standard deviations for the Spanish-language physical functioning items. Unidimensionality of the items was examined by fitting a one-factor categorical confirmatory factor analysis model using Mplus [24]. Model fit was assessed by the Comparative Fit Index (CFI) and Tucker-Lewis Index (TLI), as well as the root mean square error of approximation (RMSEA). Good model fit is defined by the following cutoffs: CFI >0.95, TLI >0.95, and RMSEA <0.06 [25, 26]. Next, IRT assumptions of local independence and monotonicity were evaluated. Local independence was assessed based on residual correlations among items.
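The fit criteria above can be expressed as a simple check. The following is a minimal sketch in Python, not part of the original Mplus analysis; the function name `good_fit` is hypothetical, and the example values are the fit indices reported in the Results of this paper.

```python
def good_fit(cfi: float, tli: float, rmsea: float) -> bool:
    """Return True when all three indices meet the cutoffs stated in the text:
    CFI > 0.95, TLI > 0.95, and RMSEA < 0.06."""
    return cfi > 0.95 and tli > 0.95 and rmsea < 0.06


# Fit indices reported in this paper for the Spanish sample.
full_bank = good_fit(cfi=0.971, tli=0.970, rmsea=0.052)      # all 114 items
reduced_bank = good_fit(cfi=0.985, tli=0.984, rmsea=0.041)   # 103 items
print(full_bank, reduced_bank)
```

All three conditions must hold at once, so a model with a high CFI but an RMSEA at or above 0.06 would not be classified as fitting well under this rule.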

Multilog was used to estimate IRT item parameters (slope and threshold) for Samejima’s Graded Response Model [27]. This model yields one slope or discrimination parameter and (n − 1) threshold parameters for polytomous items with n response options. The slope parameter gives information regarding the discrimination of the item between adjoining trait levels. Higher values indicate that items are better able to discriminate between adjacent categories of trait level. The threshold parameter represents the point along the latent trait at which a subject has a 50 % chance of responding in that category or higher.
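To make the roles of the slope and threshold parameters concrete, the sketch below computes category probabilities under a graded response model. It is an illustration only, not the Multilog implementation: it assumes the logistic metric without a scaling constant, the function name `grm_category_probs` is hypothetical, and the parameter values in the example are invented rather than taken from Table 3.

```python
import math


def grm_category_probs(theta, slope, thresholds):
    """Category probabilities for one polytomous item under a graded
    response model with n = len(thresholds) + 1 ordered categories.

    thresholds must be in increasing order.
    """
    # Cumulative probability of responding in category k or higher,
    # bracketed by 1 (lowest category or higher) and 0 (above the highest).
    cumulative = [1.0]
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-slope * (theta - b))))
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulatives.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical 5-category item: one slope and (5 - 1) = 4 thresholds.
probs = grm_category_probs(theta=-1.0, slope=2.0, thresholds=[-2.0, -1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

In this example the latent trait equals the second threshold, so the probability of responding in the third category or higher, `sum(probs[2:])`, is 0.5, which matches the interpretation of the threshold given above.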

Differential item functioning (DIF) was assessed by comparing the Spanish language (n = 640) with the PROMIS® wave 1 English-language data (n = 1,504). DIF is present if the probability of selecting a particular response varies by language group when controlling for the underlying level of physical functioning. We evaluated DIF using ordinal logistic regression with IRT-based trait scores estimated from DIF-free “anchor” items (iterative purification) as the conditioning variable. A pseudo R² difference of <0.02 between nested models was used to identify potential anchor items. We examined the magnitude of DIF for English versus Spanish language using test characteristic curves separately for all physical functioning items and for the items identified as having DIF. For those items having DIF, we evaluated whether they had uniform DIF, in which DIF is in the same direction across the entire ability continuum of physical function (response curves for both groups do not cross); or non-uniform DIF, in which the probability of endorsing an item is higher for one group at lower physical functioning, and higher for the other group at higher physical functioning (response curves for both groups cross at a certain point along the continuum). We assessed DIF at the individual level by plotting theta estimates ignoring DIF versus theta estimates accounting for DIF. These analyses were run using Lordif software [28].
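The pseudo R² criterion can be illustrated with a short sketch. It assumes three nested ordinal logistic models per item (trait only; trait plus language group; trait, group, and their interaction) and uses McFadden's pseudo R², which is an assumption because the text does not state which pseudo R² measure was used. The function names and the log-likelihood values are hypothetical, and the sketch does not reproduce the Lordif software.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo R-squared from model and intercept-only log-likelihoods."""
    return 1.0 - ll_model / ll_null


def pseudo_r2_changes(ll_null, ll_trait, ll_trait_group, ll_interaction, cutoff=0.02):
    """Pseudo R-squared changes between nested models for one item.

    ll_trait:        trait score only
    ll_trait_group:  trait score + language group
    ll_interaction:  trait score + language group + trait-by-group interaction
    """
    r2_1 = mcfadden_r2(ll_trait, ll_null)
    r2_2 = mcfadden_r2(ll_trait_group, ll_null)
    r2_3 = mcfadden_r2(ll_interaction, ll_null)
    total = r2_3 - r2_1
    return {
        "group_term_change": r2_2 - r2_1,        # added by the group main effect
        "interaction_term_change": r2_3 - r2_2,  # added by the interaction
        "total_change": total,
        "flagged": total >= cutoff,              # 0.02 cutoff from the text
    }


# Hypothetical log-likelihoods for a single item.
result = pseudo_r2_changes(-1000.0, -700.0, -690.0, -675.0)
print(result["flagged"], round(result["total_change"], 3))
```

Items whose total change stays below the cutoff would be candidates for the anchor set, and the trait scores would then be re-estimated from those anchors before the models are refit.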

To evaluate the impact of using English-language calibrations versus Spanish-language calibrations on theta estimates, we put the Spanish-language item parameters (slopes and thresholds) on the same metric as the English-language parameters using STUIRT [29]. This program produces the Stocking-Lord linking constants used to linearly transform the Spanish item parameter estimates to the English metric. Then, we used Firestar to simulate CAT-based theta estimates from the two sets of item parameters and compared the estimates [30]. Options used for these simulations included imposing a minimum number of items of 5, a maximum number of items of 20, and stopping rule based on a standard error of 0.30.
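The simulation options described above amount to a stopping rule that can be sketched as follows. This is an illustration rather than Firestar code: the function name `should_stop` is hypothetical, and treating a standard error exactly equal to 0.30 as sufficient is an assumption.

```python
def should_stop(n_administered, standard_error,
                min_items=5, max_items=20, se_target=0.30):
    """Decide whether a simulated CAT should stop after the current item."""
    if n_administered >= max_items:
        return True   # item ceiling reached, stop regardless of precision
    if n_administered < min_items:
        return False  # always administer the minimum number of items
    return standard_error <= se_target


print(should_stop(4, 0.25))   # below the minimum number of items
print(should_stop(5, 0.29))   # minimum met and precision target reached
print(should_stop(12, 0.45))  # precision target not yet reached
print(should_stop(20, 0.45))  # maximum number of items reached
```

Under this rule every simulated respondent receives between 5 and 20 items, and respondents whose trait level is poorly covered by the bank tend to reach the 20-item ceiling.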

Results

Descriptive statistics

The average age of the English-speaking sample that completed the physical functioning item bank was 51 years (range 18 to 93). The average age of the Spanish-speaking sample was 38 years (range 18 to 77). The English sample was older: fifty-three percent (n = 802) were over 50 years old, whereas only fourteen percent of the Spanish sample (n = 87) were 50 years or older. Fifty-two percent of the English-speaking sample and fifty-eight percent of the Spanish-speaking sample were female (see Table 1).

Table 1 Sociodemographic and clinical characteristics of Spanish (n = 640) and English (n = 1,504) physical function sample

Thirteen percent of the Spanish-speaking sample reported reading and speaking only Spanish, 48 % speaking Spanish better than English, 39 % reading and speaking both languages equally, and only 1 person reported reading and speaking English better than Spanish. Thirty-three percent reported speaking only Spanish at home, 51 % speaking more Spanish than English at home, and 15 % speaking both equally at home.

The Spanish-speaking sample was less educated with thirty-six percent (n = 233) reporting High school education/GED or less, while the English-speaking sample reported twenty percent (n = 293) with High School education/GED or less. Another noteworthy difference was the prevalence of some comorbidities being significantly higher in the English-speaking sample: high blood pressure (36 vs. 16 %) and depression (21 vs. 14 %). However, when looking only at those subjects with those comorbidities and who reported being limited in activities because of them, the prevalence was significantly higher in the Spanish sample: high blood pressure (8 % in the English sample versus 37 % in the Spanish sample) and depression (29 % in the English sample versus 55 % in the Spanish sample) (see Table 1).

The 114 physical function items (see Online resource 1) administered to the Spanish sample had 5 response options each: 1 = worst physical function (cannot do or unable to do activity) and 5 = best physical function (health does not limit at all in doing activity). The Spanish sample had no missing data and the range for item means (standard deviation) was 2.53 (1.36) to 4.62 (0.82). The frequencies for the items are provided in Online resource 2. Twenty-nine items had categories with fewer than 5 responses (unable to do or cannot do; see Online resource 3). These sparse categories were collapsed with neighboring categories for analysis. For the 114 items, coefficient alpha was 0.99 and item-rest correlations ranged from 0.41 to 0.89.

Dimensionality (Spanish sample)

A one-factor categorical model was found to fit the data well (CFI = 0.971, TLI = 0.970, and RMSEA = 0.052). Factor loadings (see Table 2) ranged from 0.329 for item PFC33 (Are you able to run 10 miles (16 km)?) to 0.939 for item PFB12 (Are you able to make a bed, including spreading and tucking in bed sheets?).

Table 2 Loadings for 114 physical function items in Spanish sample from categorical one-factor model

Item parameters

Marginal maximum likelihood estimates and standard errors of item parameters from the two-parameter graded response model for all the physical functioning items are provided in Table 3. The slope parameters for this model ranged from 0.45 to 4.50, and the threshold parameters ranged from −1.92 to 6.06. PFC33 (“Are you able to run 10 miles (16 km)?”) had the lowest slope while PFA44 (“Are you able to put on a shirt or blouse?”) had the highest slope. PFA7 (“How much do physical health problems now limit your usual physical activities (such as walking or climbing stairs)?”) had the smallest category threshold (between cannot do and quite a lot). PFC33 (“Are you able to run 10 miles (16 km)?”) had the largest category threshold (between without any difficulty and with a little difficulty).

Table 3 Spanish sample slope and threshold parameter estimates (SE) and items presenting DIF (20 with uniform DIF* and 30 with non-uniform DIF**)

Effect of item local dependency

Lagrange multiplier tests indicated that there was a very large residual correlation of 0.672 between PFC33 (Are you able to run 10 miles (16 km)?) and PFC7 (Are you able to run 5 miles (8 km)?). In fact, 11 items had a residual correlation of 0.20 or more: PFA13 (Are you able to exercise for an hour?), PFA19 (Are you able to run or jog for 2 miles (3 km)?), PFA33 (Are you able to exercise hard for half an hour?), PFA39 (Are you able to run at a fast pace for 2 miles (3 km)?), PFB5 (Does your health now limit you in hiking a couple of miles (3 km) on uneven surfaces, including hills?), PFB7 (Does your health now limit you in doing strenuous activities such as backpacking, skiing, playing tennis, bicycling or jogging?), PFB51 (Does your health now limit you in participating in active sports such as swimming, tennis, or basketball?), PFC7 (Are you able to run 5 miles (8 km)?), PFC13 (Are you able to run 100 yards (100 m)?), PFC33 (Are you able to run 10 miles (16 km)?), and PFC35 (Does your health now limit you in doing 8 h of physical labor?).

An iterative process was followed in which model fit and parameter changes were evaluated after dropping one locally dependent item at a time. After dropping all 11 locally dependent items, model fit indices improved to CFI = 0.985, TLI = 0.984, and RMSEA = 0.041. Marginal maximum likelihood estimates of item parameters from a two-parameter graded response model after dropping the 11 locally dependent items revealed that the slope parameters ranged from 1.83 to 4.57 while the threshold parameters ranged from −1.86 to 1.70. Because of some difference in model fit, we evaluated DIF for the full set of items and then after dropping the locally dependent items (see below).

Identification of DIF

Fifty of the 114 items were flagged for DIF based on the R² of 2 % (0.0200) or above criterion; 20 uniform and 30 non-uniform (see Table 3). Examination of DIF restricted to 103 items (dropping the 11 items with high residual correlations) showed 44 items with DIF based on the R² of 2 % (0.0200) or above criterion. Thirty-three of these items were among the 50 showing DIF in the full set of 114 items, and 30 of them showed the same direction of DIF. We examine DIF impact for the full set of 114 items below.

DIF impact

The impact of language DIF items on test characteristic curves (TCCs) is shown in Fig. 1. The graph on the left of Fig. 1 shows the TCC for all 114 items while the graph on the right shows the TCC for just the 50 items with DIF. These curves indicate that the expected physical functioning total score is higher for Spanish language than English-language respondents. Figure 2 shows the difference between scores ignoring DIF (initial theta) and those that account for DIF (purified). The mean difference is indicated by the dotted line in the right panel (about −1.2), and the median is shown in the box plot in the left panel (about −1.4). These differences are much larger than the standard errors and are substantial in magnitude (greater than a standard deviation). Accounting for DIF tended to result in higher scores, especially for English-language respondents.

Fig. 1

Impact of DIF on test characteristic curves

Fig. 2

DIF impact at individual level

Stocking-Lord linking constants to transform linearly the 114 Spanish item parameter estimates to the English metric were as follows:

  • Spanish slopes = Spanish calibrated slope/1.21644

  • Spanish thresholds = (Spanish calibrated threshold * 1.21644) − 1.749875

The transformation equations for the 64 non-DIF items were as follows:

  • Spanish slopes = Spanish calibrated slope/0.995125

  • Spanish thresholds = (Spanish calibrated threshold * 0.995125) − 1.117342

The transformed Spanish parameters appear in Online resource 4.
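The transformation equations above can be applied as in the following sketch. The linking constants are those reported in the text; the function name `to_english_metric` and the example item parameters are hypothetical and are not taken from Table 3 or Online resource 4.

```python
def to_english_metric(slope, thresholds, a_const, b_const):
    """Linearly transform graded response model parameters to the target metric.

    slope*     = slope / a_const
    threshold* = threshold * a_const + b_const
    """
    new_slope = slope / a_const
    new_thresholds = [b * a_const + b_const for b in thresholds]
    return new_slope, new_thresholds


# Linking constants reported in the text (multiplicative, additive).
ALL_114_ITEMS = (1.21644, -1.749875)
NON_DIF_64_ITEMS = (0.995125, -1.117342)

# Hypothetical Spanish-calibrated item with four thresholds.
slope, thresholds = 2.0, [-1.5, -0.5, 0.5, 1.5]
print(to_english_metric(slope, thresholds, *ALL_114_ITEMS))
print(to_english_metric(slope, thresholds, *NON_DIF_64_ITEMS))
```

Because the trait scale is transformed with the same constants (theta* = theta × A + B), the quantity slope × (theta − threshold) is unchanged, so the transformed parameters imply the same response probabilities expressed on the English metric.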

Figure 3 shows the associations between CAT-based theta estimates in the Spanish sample (n = 640) based on English parameters (x-axis) and Spanish parameters (y-axis) for the 114 physical functioning items. Figure 4 provides the same but for the 64 non-DIF items. Product-moment correlations between the two CAT-based theta estimates were 0.90 (81 % common variance) and 0.96 (91 % common variance), respectively. Intraclass correlations for the 114 items and for the 64 non-DIF items were 0.89 (CI = 0.87–0.91) and 0.96 (CI = 0.95–0.97), respectively. Hence, the DIF items have a noteworthy effect on the estimated thetas.

Fig. 3

CAT-based theta estimates using English (x-axis) and Spanish (y-axis) parameters for 114 items in Spanish sample (n = 640, ICC = 0.89)

Fig. 4

CAT-based theta estimates using English (x-axis) and Spanish (y-axis) parameters for 64 non-DIF items in Spanish sample (n = 640, ICC = 0.96)

We recommend using English calibrations for the 64 non-DIF items and Spanish calibrations (transformed linearly to English metric) for the other physical functioning items.

Discussion

One of the goals of PROMIS® is to improve precision and enhance the comparability of health outcomes measures [12]. Comparison between different language groups assumes items mean the same to people from the different groups. If subjects respond differently depending on an external variable, group comparisons are problematic. Comparisons of different groups require equivalence in the groups or statistical adjustment for lack of equivalence. Fifty of the 114 items showed differential item functioning in this study of subjects responding to a physical function item bank in English versus Spanish. This indicates that English-language and Spanish-language respondents with the same level of physical functioning respond differently to a substantial number of these items.

One possible explanation for such a high number of items presenting DIF is the validity of the translation. However, the FACIT translation methodology used by PROMIS® to translate the item bank into Spanish [31] is a rigorous and commonly used method. It included representation of people from different countries, which is especially important in Spanish translations to be used in the United States given the diversity of the Spanish-speaking population. A readability analysis of the Spanish item bank was not performed.

If the results of this study are to be generalized to the general United States Latino population, the extent to which the sample of Spanish-language study participants is representative of the United States Latino population is also relevant. According to 2010 Census data, 46 % of the United States Latino population has a high school diploma or some college and 36 % has less than a completed high school education, whereas in our sample 53 % had a high school diploma or some college and 14 % had less than a completed high school education [32]. We know that Spanish was the language of preference, but we do not know the heritage of the subjects who chose to respond to this online survey. Of the total Latino panel, only 2 % selected Spanish as their language of preference, which makes it harder to understand and characterize the sample. Therefore, the generalizability of the results to the United States Latino population requires future study. Our sample had relatively higher levels of education and low levels of acculturation. In addition, these subjects chose to be part of an online panel, possibly compromising the generalizability to older Latinos or those with lower levels of education. Specifically, online sampling might introduce some bias related to computer literacy and its relationship to higher education and socioeconomic status.

Another factor that might need further study is the generalizability of results to older Latinos. In addition, because our results are based on the sample we studied, they cannot be generalized to a clinical sample with a skewed distribution. It is interesting to note that Spanish-speaking respondents had higher percentages selecting the highest score on only 9 items. All of these items ask about more strenuous activities, for example PFC7 “Are you able to run 5 miles (8 km)?”, PFC32 “Are you able to climb up 5 flights of stairs?”, and PFC33 “Are you able to run 10 miles (16 km)?”

When looking at language differences, it appears that Spanish respondents are much more likely to use response options with the word “some”, that is, “With some difficulty” or “Somewhat.” In fact, for all items except PFB51 (“Does your health now limit you in participating in active sports such as swimming, tennis, or basketball?”), for which frequencies are the same for both languages, Spanish respondents selected the response with the word “some” more often than English respondents. Although this has not been seen in other studies, it could also indicate that Spanish respondents choose the middle response more frequently than English respondents, perhaps reflecting a reluctance to commit to an extreme response option. This would be a cultural difference needing further study.

Even though some items had lower discrimination than others, it is still acceptable to include them in the item bank since they are reasonable physical functioning items. However, it is important to note that items with higher discrimination will be selected first in CAT administrations. In order to retain all items in the physical functioning bank, a recalibration was done using a hybrid approach in which English calibration metrics were used for non-DIF items and Spanish calibrations were used for those items presenting DIF. Because some items showed significant language DIF, using English calibrations for all items would lead to inaccurate theta estimates in the Spanish items. Furthermore, there were too many items presenting DIF, so excluding those items was not an option.

As the aging population continues to grow, future studies need to focus on specific physical functioning attributes that decline with aging. Item banks need to ensure that they provide sufficient information regarding the elder population and how the different aspects of physical function decline and are interrelated.