Introduction

The World Health Organization defines health as not only the absence of disease but also the presence of physical, mental, and social well-being [1]. Role functioning (RF) is a key component of social well-being and thus an important outcome in health research. Within this framework, social roles can be viewed as the building blocks for an individual’s societal involvement as a player in socially recognized areas of human effort [2], and the importance of the impact of poor health on this area cannot be overstated.

Role impairment and role disability are associated with a large number of chronic and mental health conditions. A recent study of the impact of mental and physical conditions on role disability using nationally representative data revealed that 53% of US adults have one or more of the 30 studied conditions [3]. Afflicted respondents reported on average 32.1 more disability days in a year compared to healthy matched controls, which was translated to 3.6 billion days of role disability days in the population [3]. Both mental and chronic medical disorders lead to significant role impairment but on different aspects of RF [4]. These results have led to the increased recognition of role disability as a major source of indirect cost of illness and disease burden [3], making it an important construct to assess accurately.

A number of tools measuring the severity of role impairment and functioning have been proposed. Existing measures are quite diverse in their theoretical approach, format, focus of assessment, comprehensiveness, psychometric properties, areas of use, and popularity with researchers [5]. In our recent work, we have been using these tools to develop a generic role functioning item bank using a multistage process, which is described briefly in the methods section. We have already reported the results of the qualitative stages of this work, resulting in the formulation of 75 generic items assessing health impact on functioning across 3 role areas(family life, occupational life and social life) [5]. Some of the developed items used positive wording aiming to broaden the measurement range of the bank (see Table 2 for item text).

The main focus of this paper is to describe the results of the next steps in the process of item bank development, namely the exploration of the factorial structure of the newly developed item bank and the evaluation of the assumption of sufficient unidimensionality for item response theory (IRT) model estimation. IRT models allow the selection of the most appropriate items for each respondent through computerized adaptive testing (CAT) [69] of RF. This is a promising new strategy in the development of improved health outcome measures [10, 11].

Many approaches have been employed in the evaluation of dimensionality [12, 13]. Both exploratory and confirmatory factor analyses (EFA and CFA) have often been used for that purpose, with CFA having the advantage of allowing hypothesis testing [13]. Supplemental strategies in the assessment of dimensionality are bi-factor analysis [14], parallel analyses [15], comparisons between IRT slope parameters computed from larger and smaller sets of the items [16] and tests of methods effects based on multi-trait multi-method analyses [1719].

We have favored an investigative approach in our work in order to build a model that is conceptually meaningful, while providing sufficient information regarding the measurement requirement of unidimensionality. We used a series of CFA that include bi-factor models in our exploration of the bank’s dimensionality. We hypothesized that both item content and item wording can have an effect on the item bank’s factorial structure. The goal of this study was to determine whether these effects would allow the role functioning bank to be considered sufficiently unidimensional for IRT applications. Details about the tested models are provided in the methods section.

Methods

Participants

A total of 2,500 participants were recruited for the study via the Internet by a panel company (www.YouGovPolimetrix.com) in the winter of 2008. To ensure higher variability of responses and better generalizability of results, we aimed to recruit a sample that was equally stratified across age groups (18–34, 35–54, 55+) with gender, race and ethnicity quotas representing the US population. A little more than half of the sample (n = 1,445) included participants with selected chronic conditions and the other half was recruited as general population, but excluding participants with these conditions (n = 1,055). As a result of the skip pattern, there were items with systematically missing data. The models exploring the dimensionality of assessment were developed with the data from the subsample (SS1) of participants who responded to questions from all conceptual domains but excluded items on parent and partner role (N = 1193). Several of the best fitting models were tested in a second subsample (SS2) of participants (N = 414) who were living with a partner and taking care of a child and thus answered additional questions on the parent and partner role as well.

All participants completed a consent form and received an incentive for their participation in the study in the form of “polling points”.

Instruments

The bank included newly developed items assessing role functioning, items from an existing short role functioning bank and 4 items from the Role Physical Scale of the SF-36 Health Survey [20]. All items were presented consecutively one item per screen.

Item development process

A multi-stage process of item bank development is required for the implementation of a computerized test. The details, challenges, and analytic decision required in the process have been described in detail in earlier publications [10, 16, 21]. Briefly, the process goes through the following steps: 1/Construct definition—the construct is defined based on theory and previous empirical results; 2/Item development—review of existing measures and formulation of new items; 3/Data collection for item calibration and testing; 3/Evaluation of assumptions of unidimensionality and local independence; 4/Psychometric analyses and test of model fit; 5/CAT simulations and specifications of a final CAT. We have followed this process in the development of a generic RF item bank.

Conceptual model

Our theoretical conceptualization of the role functioning construct was inspired by the biopsychosocial model of health and disability and the International Classification of Functioning, Disability and Health (ICF) [1, 5]. The conceptual model defined role functioning as involvement in life situations related to family life, partner relationship, household chores, work for pay, studies, social life (including interactions with friends), leisure time activities, community involvement (including volunteer work), and everyday living activities. Consistent with the ICF, the influences of personal and environmental factors on role functioning were postulated, focusing specifically on the effects of life stage, choice, and opportunity for participation in specific roles [5].

We used review of the literature, including review of existing measures and the results of a focus group study in the formulation of the measurement model and item bank items [5]. The measurement model for the construct of health impact on role functioning covers three interrelated domains of social, occupational and family roles and functioning. The social domain includes all role activities related to the community, leisure, social life and entertainment. The occupational role domain includes employment roles and activities as well as the student role. The family domain includes family roles, including the role of a partner as well as household chores and activities.

New items

Seventy-five items assessing the impact of health on role functioning across the model domains (family life (k = 29, 4 positive), occupational life (k = 28, 2 positive), social life (k = 18, 3 positive) were developed. Each item was designed to cover only one of the content domains. All of the items included health attribution (e.g. “health” was specified as the factor influencing the level of functioning) and a 4-week recall period. Most of the items were worded with negative attribution to health e.g. “Did your health make it hard to start work on time over the past 4 weeks?”, but some positively worded items were also included e.g. “Did your health allow you to perform at your best during the past 4 weeks?” All of the new items were presented one per screen. Skip patterns were used for items assessing role functioning in areas that were not relevant for participants (e.g. unemployed people were not asked questions on occupation).

Short RF bank [22]

Eight items were included from a previously developed item bank on role functioning. The items included attributions to health and a 4 week recall. Each of the items covered more than one of the identified content domains e.g. “In the past 4 weeks, how much of the time have you had difficulty in performing work or other daily activities, because of your health?”

Role physical scale items

Four items of the Role Physical Scale of the SF-36 Health Survey were also included in the bank.

Statistical analyses

We used an investigative approach [13] to the assessment of dimensionality including eigenvalue analyses and a series of confirmatory factor analyses (CFA) models to assess the factorial structure of the item bank and the potential presence of wording method effects. Three types of CFA models were hypothesized and subsequently tested. First, we tested models that included only conceptual factors (Models 1a–1d). As a second step, we tested a number of models including both conceptual factors and methods factors aiming to test the presence of wording effects in the items (Models 2a–2c), following an approach that has been previously used in a number of studies on various content scales [19, 23]. As a final step, we tested bi-factor solutions (Models 3a–3c), following previous recommendations to use this approach in the assessment of sufficient unidimensionality of item banks with theoretically driven sub-domains [14]. More specifically, we tested the following model types:

Health impact concepts models (Models 1a–1d)

We hypothesized four different models representing only substantive factors of role functioning (Fig. 1). Model 1a represents a single (unidimensional) factor model of health impact on role functioning. Model 1b is a hierarchical model representing three content domains of role functioning identified as important in previous studies, namely family life, occupational life, and social life [5]. Model 1c postulates one additional “generic” domain, since the bank included items from the Short Role Functioning Bank and SF-36 Role Physical items that were covering more than one of the three content areas described above. Models 1a–1c are nested in models 2a–2c below. Model 1d is a two factor model with two separate substantive factors for positive and negative health impact corresponding to positively and negatively worded items.

Fig. 1
figure 1

Health impact concepts models (Models 1a–1d)

Health impact concepts and methods components models (Models 2a–2c)

Correlated traits, correlated methods framework was used for these models [23] specified by adding two wording method factors for the one, three and four health impact concepts models described above (Fig. 2). All items loaded both on a content factor and on either positive or negative methods factor depending on their wording. All method factor loadings were constrained to be equal, and correlations between method and conceptual factors were fixed to zero.

Fig. 2
figure 2

Health impact concepts and methods components models (Models 2a–2c)

Bi-factor models (Models 3a–3c)

We also used a series of bi-factor models to evaluate unidimensionality and the effect of positive wording of items (Fig. 3). We planned to test a bi-factor model with one general impact factor and four subdomain group factors as described above in the four factor hierarchical model (Model 3a, 3c). All factor item loadings in the models were freely estimated, while factor variances were fixed to identify the model. Factors were assumed to be uncorrelated. We also planned to combine the two approaches and test a bi-factor model that is also taking into account the difference in the item wording through additional methods factors. Model 3a is nested in Model 3b, which strictly speaking is not a bi-factor model.

Fig. 3
figure 3

Bi-factor models (Models 3a–3c). Note: model 3c has the same factor structure as model 3a, but excludes positively worded items

Models including items on partner and children (Models 4a–4d)

We planned to test a selected number of models with SS2 based on the results of the analyses completed with SS1. As results strongly suggested the presence of wording methods effects, we included models that reflected these results. Conceptually models 4a–4d in SS2 correspond to models 2a–2c, and 3c, but as stated earlier they included some additional items in the family domain. All analyses were conducted using MPlus (3.11) software for analyses of categorical data and polychoric correlations with weighted least-squares with mean and variance adjustment (WLSMV) parameter estimation [24]. Despite the wide use of CFA in dimensionality testing, the results of a recent study have suggested that published criteria and cut-off scores for CFA fit indexes are influenced by the number of items and the non-normality of the data and thus may not be appropriate when used as a single approach in the evaluation of dimensionality of large item banks [13]. Thus, patterns of standardized item loadings were examined and used in the model evaluation along with two widely used fit indexes—the comparative fit index CFI [25] and the root-mean-square error of approximation [25] with values of CFI > .90 and RMSEA < .10 for acceptable model fit [26, 27]. Finally, residual correlations were examined to evaluate local misfit. Once the best fitting model was identified, the question of sufficient unidimensionality was evaluated based on the model parameters, in particular the item loadings [12, 14, 28, 29].

Results

Sample characteristics

The mean age of respondents in the entire sample of 2,500 participants was 49 years (range 18–94), 51% were females, 76% Caucasian, 19% of the sample had an educational level lower than high school, 34% were college graduates, and 17% had post-graduate degrees. Proportions of people with chronic conditions were asthma n = 416 (17%), heart disease n = 333 (13%), diabetes n = 367 (15%), auto-immune diseases n = 329 (13%). Participants in SS1 had slightly lower mean age of 47 years (range 18–94) but were comparable on other demographic characteristics. Proportions of people with chronic conditions in this subsample were asthma n = 212 (18%), heart disease n = 116 (10%), diabetes n = 173 (15%), auto-immune diseases n = 107(9%). Participants in SS2 were younger (mean age 44), predominantly male (55%) and had higher proportion of people with asthma n = 97 (23%) and lower proportions of people with heart disease 31 (8%) and auto-immune conditions n = 28 (7%).

Item characteristics

We defined high scores to indicate higher level of functioning, and all items (rating scale 1–5) were scored in that direction. Descriptive statistics showed skewed distribution for the data (mean = 4.20, standard deviation = 1.08, skewness = −1.48, kurtosis = 1.70). Due to the skewed responses, some of the items had very low frequency count cell for some of the response options, so we collapsed response categories for 11 of the items before proceeding with the analyses.

Factor analyses results by model

In eigenvalue analyses, 4 components had eigenvalues greater than one (53.2, 2.6, 1.9 and 1.7), and the large ratio of the first value to the second suggested unidimensional structure of the bank. The results of the hypothesized CFA’s are presented below, and goodness of fit indexes for all CFA models are presented in Table 1.

Table 1 Fit indexes of CFA models of health impact on role functioning item bank

Health impact concepts models (Models 1a–1d)

Model 1a: The fit of the one-factor concept model was poor (CFI = .840, RMSEA = .158). Standardized factor loadings ranged from .680 (item soc15) to .946 (item RB4, RB5).

Model 1b: The hierarchical model with three content domains, where positive items were not modeled, did not improve model fit (CFI = .861, RMSEA = .134). The three domains had item loadings in ranges from .755 (item fam4) to .986 (item RB2) in family life, from .709 (item soc15) to .962 (item soc12) in social life, and from .714 (item work29) to .960 (item RB4, RB5) in occupational life. Second order factor loadings ranged from .935 to .977.

Model 1c: The fit of the four factor hierarchical model was similar to the two factor conceptual model (CFI = .872, RMSEA = .122). Item loadings ranged from .773 to .926 for family life, .727 to .941 for occupational life, .711 to .960 for social life and .884 to .967 for the generic domain. The second-order factor loadings were high (.929–.970).

Model 1d: Modeling positive items as a separate factor of positive health slightly improved fit indexes, but the overall fit of the model was still unacceptable (CFI = .883, RMSEA = .132). Standardized factor loading ranged from 0.847 (item soc15) to .938 (item work28) in the positive health factor, and from .729 (item Fam4) to .947(item RB4, RB5) in the negative health factor. The correlation between the two factors was .721.

Health impact concepts and methods components models (Models 2a–2c)

Model 2a: The addition of methods effects to the conceptual models improved model fit. However, the one concept model with 2 wording effects still had a relatively poor fit (CFI = .884, RMSEA = .131), item loadings range .476–.890.

Model2b: The model with three conceptual factors and 2 wording methods had a marginally acceptable fit (CFI = .908, RMSEA = .107), with item loading ranges of .633–.893 in family life, .720–.866 in social life, and from .642 to .865 in occupational life. The range of second-order factor loadings was .927–970.

Model2c: The hierarchical model with 4 conceptual factors and two methods effect had an acceptable fit (CFI = .925, RMSEA = .09). The item loadings for the four factors, respectively, were for family life from .741 to .898, for social life from .506 to .932, for occupational life from .680 to .915 and for the generic impact factor from .470 to .968. The range of factor loadings on the impact factor was .914–.965.

Bi-factor models (Models 3a–3c)

Model 3a: The bi-factor model with 4 group factors had a marginally acceptable fit (CFI = .895, RMSEA = .108). The item loadings for the overall health impact factor were in general higher (item loadings range .588–.946) than the item loadings in the four group factors which, respectively, were for family life from .217 to .576, for social life from .116 to .696, for occupational life from −.023 to .464 and for the generic impact factor from .084 to .437. The highest loadings in the group factors were for the positively worded items.

Model 3b: For this model, we added a wording methods effect to Model 3a. In this stage we decided to model only the positively worded items because they had higher items loadings in Model 2c. In addition, residual correlations higher than .2 were present in Model 3a only among positively worded items and they were the items with highest loadings on group factors. The bi-factor model with methods effect factor for positively worded items had acceptable fit (CFI = .933, RMSEA = .084). The item loadings for the overall health impact factor were higher (item loadings range .587–.953) than the item loadings in the four group factors which, respectively, were for family life from .208 to .526, for social life from .200 to .403, for occupational life from .016 to .424, for the generic factor from .044 to .430. The item loadings for the positively worded items were high (.599).

Model 3c: The bi-factor model with positive wording factor fit the data best, but the high factor loadings on the methods factor suggested problems with the fit of these items in the underlying unidimensional construct. In addition, comparison of item loadings in the unidimensional model and the general impact factor of the bi-factor solution revealed some large differences, suggesting that positively worded items are deviating from the underlying unidimensional structure. As a result, we excluded the positively worded items and tested a bi-factor model with 4 group factors again. The model had acceptable fit (CFI = .925, RMSEA = .088). In evaluation of local misfit, all residual correlations in the bi-factor model were below .2, thus supporting the fit of this model. The item loadings for the overall health impact factor were again higher (item loadings range .752–.952) than the item loadings in the four group factors which, respectively, were for family life from .200 to .524, for social life from .209 to .405, for occupational life from .004 to .421, for the generic factor from .056 to .435.

To test whether the acceptable fit of the bi-factor model was due solely to the removal of the positively worded items, we reran model 1a, excluding the positively worded items. While all residual correlations were below .2, the overall fit of the reduced unidimensional model was not acceptable (Table 2). We therefore regard model 3c as the best description of the data. In the process of model building, 11 items were excluded due to low item loadings, high residual correlations or poor fit.

Table 2 Parameter estimates for unidimensional and bi-factor models

Models including items on partner and children (Models 4a–4d)

In the subsample of participants who answered questions in the family domain regarding roles related to partner and children, the results on fit mimicked the previous results. The bi-factor model excluding positive items provided the best fit in this subsample as well (Table 1).

Discussion

We used an investigative approach with previously suggested modeling strategies to determine the factorial structure of a new role functioning item bank. We found that a bi-factor model with group factors for family, work, social, and generic function had the best fit. In this model, loadings on the general factor were considerably higher than the loadings on the group factors. These results suggest that the item bank is sufficiently unidimensional for IRT analysis [12, 2830]. Further, when comparing the loadings on the general impact factor in the bi-factor model and the loadings in the unidimensional model (Table 2), we find only negligible differences (mean = −.016, range −.017 to .049), providing additional support to the sufficient unidimensionality of the item bank. Thus, we conclude that the role functioning item bank is sufficiently unidimensional for IRT modeling. However, the remaining covariation between items that can be explained by the additional group factors for family, occupation and social life, suggest that content balancing of these subdomains may be useful in test construction.

We intentionally included some items with positive wording in an attempt to expand the measurement range of the bank, allowing it to assess even minor role functioning disruptions. Previous research on self-assessments including both positively and negatively worded items has suggested such measures can be interpreted as having one underlying latent factor and methods effects reflecting different item wording. The approach has been most extensively discussed in the psychometric literature on the Self-Esteem Scale [31], which in early publications has been presented as having two substantive dimensions based on EFA results [32, 33]. However, in subsequent analyses by CFA approaches, a unidimensional model with methods effects [34] has been proposed as the superior model. Subsequent studies have supported these solutions, even though various degrees of interpretability were assigned to the methods factors [19, 35, 36]. CFA modeling with methods effects has also been applied to scales measuring worry [26], physical anxiety [37], and self-concept [38] with similar conclusions.

In our study, modeling positively worded items improved item fit in the CFA framework, but the model showed very high loadings for the positive items on the methods factors, and large differences in the item slopes from the unidimensional model. As a result, we excluded these items from the item bank. It is worth pointing out that previous studies on methods effects were performed with relatively short (no more than 15 items), unidimensional measures. The large number of items in the bank, as well as the presence of several subdomains, could have impacted the final results in our study. In addition, the decision to include health attribution in all items may also be accountable for the problems with retaining positively worded items.

These specifics of testing our item bank—use of health attribution in all items, inclusion of several related domains of role functioning and large number of items—had broader implications than the exclusion of positively worded items from the final bank. Therefore, we will discuss each of these decisions and their implications below.

Early in the item development process we decided to include health attribution in all the items. Role functioning is a very broad concept that encompasses a variety of indicators and factors that can influence the level of functioning (e.g. opportunity, choice, interest, health). This conceptual breadth has important psychometric implications, including the appropriateness of use of unidimensional versus multidimensional model [14]. We believe it was important to include health attribution in the items in order to narrow down the breath of the measured construct and confine it within the realms of health-related quality of life. While this decision made the construct more narrowly defined, it also had some negative consequences. Health attributions made items longer and more cumbersome. In addition the formulation of positively worded items, describing unaltered or improved role functioning due to health through the use of health attribution presented some challenges in the readability and logic of formulated items. Results of the current study suggested that such items could not be retained in the bank.

Finally, the exploration of the factorial structure and sufficient unidimensionality of an item bank presumes the use of large number of items. It has been pointed out that in these situations, the use of standards of fit indices established with short measures is impractical since the large number of items typical for item banks and the skewness of the data (both conditions present in our data) lead to larger errors and worse fit index values [13]. While the sensitivity of fit indices to factors other than the dimensionality of the data has been demonstrated, to our knowledge there have been no specific alternative cut-off values proposed. An investigative approach and use of bi-factor models has been suggested instead as a more conceptually viable approach for the evaluations of unidimensionality of item banks [13, 14]. In our study, the best fit was demonstrated by bi-factor Model 3a, which had an acceptable fit even though fit indices were slightly below the universally recommended values.

In summary, we explored the factorial structure of an item bank assessing health impact on role functioning and found the bank to be sufficiently unidimensional for IRT applications. The limitations of this study include the skewness of the data, the presence of systematically missing data for some of the items, and the presence of ceiling effect for some participants. The next steps of the item bank development will be the calibration of the items on a common metric and the development of a computerized test of health impact on role functioning.