Introduction

The Patient-Reported Outcomes Measurement Information System (PROMIS) initiative aims to measure patient-reported health status for physical, mental, and social well-being [1]. Self-efficacy is a subdomain of the mental health category on the PROMIS framework. This study is part of a larger study of the development and validation of self-efficacy for managing chronic disease in which five constructs were developed (managing symptoms, managing emotions, managing daily activities, managing medications and treatments, and managing social interactions). This paper focuses on one of the five constructs, the self-efficacy for managing daily activities. This construct was defined as assessing the subject’s self-reported level of confidence performing various basic and instrumental activities of daily living without assistance. Self-efficacy is critical for individuals with chronic conditions in order to successfully perform effective self-care of their conditions [2, 3].

While various self-efficacy scales have been developed to measure disease-specific self-efficacy, such as stroke [4], epilepsy [5], multiple sclerosis [6], cancer [7], arthritis [8], and sickle cell disease [9], only one scale, the Self-Efficacy for Managing Chronic Disease 6-Item Scale (SEMCD6) has been widely used to measure self-efficacy for managing chronic conditions across multiple health conditions [10, 11]. The SEMCD6 includes items related to different domains, such as fatigue, pain, emotional stress, symptoms, activity, and medication; however, it does not provide subscale scores. To address the concerns about the lack of different subscales relevant to managing chronic conditions, we developed five self-efficacy item pools: self-efficacy for managing symptoms, managing emotions, managing daily activities, managing medications and treatments, and managing social interactions. These five constructs target self-efficacy for self-management of various chronic medical conditions and include understudied neurological conditions such as epilepsy, multiple sclerosis, Parkinson’s disease, peripheral neuropathy and stroke.

In this paper, we report the psychometrics of the self-efficacy for managing daily activities instrument. The psychometric properties of the self-efficacy for managing daily activities construct were investigated with both classical test theory and item response theory (IRT) methods [12, 13]. Based on the two measurement theory approaches, the test items in the managing daily activities construct were evaluated at the test and item level to provide comprehensive information of item responses and their relationship to the construct. The summary statistics and scoring manuals for the five self-efficacy item banks are available at the PROMIS website (https://www.assessmentcenter.net/Manuals.aspx).

The overall goal of the study was to develop a patient-reported self-efficacy for managing chronic conditions scale that (1) could demonstrate sound psychometric properties of the IRT assumptions (unidimensionality, precision, and local independence), (2) could be used across different demographic groups, and (3) could demonstrate good reliability and validity.

Methods

Study participants

The field test was conducted in patients with diverse chronic conditions. A total of 1087 subjects completed the self-efficacy for managing daily activities item pool. Participants were recruited via two data collection modes between April 2013 and April 2014: 837 participants with chronic neurologic conditions were recruited in a clinical setting (University of Maryland Neurology Ambulatory Center, Baltimore), and 250 participants with general medical conditions were recruited through an internet survey. All participants completed the 35 test items. The neurologic conditions included epilepsy, multiple sclerosis, Parkinson’s disease, peripheral neuropathy, and stroke. The inclusion criteria for the neurologic sample were as follows: 18 years of age or older, residing in the community, and diagnosed with any of the chronic conditions under study by a treating neurologist. We applied the following exclusion criteria for the clinical sample: (1) cognitive impairment demonstrated by a score ≤20 on the Montreal Cognitive Assessment, (2) inability to give informed consent due to language (e.g., aphasia that interferes with the ability to complete questionnaires, or insufficient understanding of the English language), (3) severe and unstable medical or psychiatric co-morbidities, and (4) being pregnant, a prisoner, or an institutionalized patient.

The general medical conditions internet sample was recruited through a national online panel testing company, Op4G (see op4g.com for more detail). Participants were randomly selected from approximately 250,000 members of Op4G, and the selected sample completed Internet-based surveys using their home computers. Participants had to have at least one of the following chronic medical conditions: chest pain (angina), hardening of the arteries (coronary artery disease), heart failure or congestive heart failure, heart attack (myocardial infarction), stroke or transient ischemic attack (TIA), liver disease, hepatitis, or cirrhosis, kidney disease, arthritis or rheumatism, asthma, chronic lung disease (COPD), chronic bronchitis or emphysema, migraines or severe headaches, diabetes or high blood sugar or sugar in your urine, cancer (other than non-melanoma skin cancer), HIV or AIDS, spinal cord injury, multiple sclerosis, Parkinson’s disease, neuropathy, or epilepsy. They could have multiple conditions or others not listed here as long as they had one of these conditions. They also had to be 18 years of age or older and reside in the community.

This study was approved by the institutional review boards (IRBs) of the Medical University of South Carolina (#Pro00033397), the University of Maryland (#HP-000432550), and the University of Florida (#261-2010).

Self-efficacy for managing daily activities item pool

The initial item pool was developed using the qualitative research methodology approved by the PROMIS group, including literature review of self-efficacy scales, developing an item library, binning and winnowing items, expert researcher ratings (which were done using a Delphi technique), focus groups with patients, cognitive interviews with people with chronic conditions, and expert item revision [14, 15]. Finally, the 35 test items of the self-efficacy for managing daily activities domain were field tested.

The self-efficacy for managing daily activities item pool assesses patient-reported, current level of confidence performing various basic and instrumental activities of daily living without assistance. The item bank is generic rather than disease-specific, instructing the subject to consider all health conditions and all symptoms in their responses. The item pool consisted of 35 test items and a 5-point rating scale: 1 (I am not at all confident); 2 (I am a little confident); 3 (I am somewhat confident); 4 (I am quite confident); and 5 (I am very confident).

Statistical analysis

The psychometric methods employed were introduced by PROMIS [12, 16]. Descriptive statistics were used to describe the demographic characteristics of the sample (Chi-square test, independent t test or Wilcoxon rank-sum test), and traditional (i.e., classic) psychometric methods were used to evaluate the central tendency (mean) and spread (standard deviation) of the test items. SPSS version 21 was used for the descriptive statistics [17]. Item-total correlations and Cronbach’s coefficient alpha were used to evaluate the internal consistency of the test items. The acceptability criterion for item-total correlations was a value greater than 0.4 [18]. For Cronbach’s coefficient alpha, we used a criterion of between 0.90 and 0.95 as acceptable for individual-level measurement [19]. Ceiling and floor effects were also investigated; the criterion was that less than 15 % of the sample score at the extremes (i.e., the minimum or maximum) [20].
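The classical statistics named above can be illustrated with a minimal sketch. It is not the SPSS procedure used in the study: the function names and the toy response matrix are hypothetical, the item-total correlation is computed in its corrected form (each item against the sum of the remaining items), and the ceiling share is computed per item. All three choices are assumptions.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cronbach_alpha(rows):
    """Cronbach's alpha; rows is a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def corrected_item_total(rows):
    """Correlation of each item with the sum of all other items."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]
        result.append(pearson(item, rest))
    return result


def ceiling_share_per_item(rows, high=5):
    """Share of respondents choosing the top category, per item."""
    k = len(rows[0])
    return [sum(r[j] == high for r in rows) / len(rows) for j in range(k)]


# Hypothetical toy data: 5 respondents x 3 items on the 1-5 rating scale.
toy_rows = [[1, 2, 1], [2, 3, 2], [3, 3, 4], [4, 5, 4], [5, 5, 5]]
alpha = cronbach_alpha(toy_rows)
item_total = corrected_item_total(toy_rows)
ceiling = ceiling_share_per_item(toy_rows)

# Apply the criteria stated in the text.
alpha_in_range = 0.90 <= alpha <= 0.95
items_ok = [r > 0.4 for r in item_total]
ceiling_ok = [share < 0.15 for share in ceiling]
```

With real data, `rows` would hold one list of 35 item scores per participant, and the three criterion checks at the end correspond to the thresholds cited from [18–20].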

We used confirmatory factor analyses (CFA) and exploratory factor analysis (EFA) to investigate the underlying structure (dimensionality) of the item bank. Mplus version 7.11 was used to perform the factor analyses [21]. CFA was selected over EFA as the first step [12] because the initial item pool was selected by experts and refined using qualitative research methodologies in order to represent self-efficacy for managing daily activities construct. The CFA was conducted with the weighted least squares with adjustments for the mean and variance (WLSMV) estimation and one-factor solution. The rating scales were treated as categorical variables [22]. Factor structures were analyzed using model fit indices, including comparative fit index (CFI > 0.95 for good fit), Tucker–Lewis Index (TLI > 0.95 for good fit), root mean square error of approximation (RMSEA < 0.08 for adequate fit and 0.06 for good fit), and standardized root mean square residual (SRMR < 0.08 for good fit) [12].
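The fit criteria listed above can be written as a small checklist. The sketch below is illustrative only and is not part of the Mplus analysis; the function name `assess_fit` is hypothetical, and the example call uses the one-factor values reported in the Dimensionality results.

```python
def assess_fit(cfi, tli, rmsea, srmr):
    """Compare CFA fit indices with the cut-offs stated in the text."""
    return {
        "cfi_good": cfi > 0.95,
        "tli_good": tli > 0.95,
        "rmsea_adequate": rmsea < 0.08,
        "rmsea_good": rmsea < 0.06,
        "srmr_good": srmr < 0.08,
    }


# One-factor CFA values reported in the Results section.
example = assess_fit(cfi=0.952, tli=0.949, rmsea=0.09, srmr=0.070)
```

Keeping the RMSEA check as two separate flags mirrors the two-tier criterion (adequate versus good fit) given in the text.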

If the CFA indicated poor model fit, we conducted an EFA to investigate the magnitude of the eigenvalues for the larger factors and the factor loadings in order to detect underlying structural patterns [12]. In the EFA, a polychoric correlation matrix was analyzed using WLSMV estimation. The rating scale on the construct was treated as a categorical variable [22]. We reported the eigenvalues, the amount of variance explained by the model, the ratio of the first and second eigenvalues, and the factor loadings of the test items. The criteria for unidimensionality were as follows: at least 20 % of the variance explained by the first factor, a ratio of the first to the second eigenvalue greater than 4.0, and factor loadings greater than 0.3 on the test items [12].
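The three unidimensionality criteria can be checked mechanically once the eigenvalues and loadings are available. The sketch below assumes that the eigenvalues come from a correlation matrix, so that they sum to the number of items and the first factor's share of variance is its eigenvalue divided by the item count. The function name and the example values are hypothetical.

```python
def unidimensionality_check(eigenvalues, loadings, n_items):
    """Apply the three criteria from the text to EFA output.

    Assumes eigenvalues of a correlation matrix (trace equals n_items).
    """
    first, second = sorted(eigenvalues, reverse=True)[:2]
    variance_first = first / n_items
    ratio = first / second
    return {
        "variance_first_factor": variance_first,
        "eigen_ratio": ratio,
        "meets_variance": variance_first >= 0.20,
        "meets_ratio": ratio > 4.0,
        "meets_loadings": all(abs(value) > 0.3 for value in loadings),
    }


# Hypothetical values for a 20-item pool, not taken from the study.
check = unidimensionality_check(
    eigenvalues=[12.0, 2.0, 1.0],
    loadings=[0.75, 0.80, 0.62, 0.91],
    n_items=20,
)
```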

The residual correlation matrix from the single factor CFA was investigated to evaluate local independence of the item bank. The criterion for the violation of the local independence was defined as a residual correlation greater than 0.2 with any of the remaining test items [12].
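A minimal sketch of this screening step follows. It assumes the residual correlations are available as a square, symmetric matrix and applies the 0.2 threshold to the absolute value, since the flagged correlations reported later are negative. The function name and the example matrix are hypothetical.

```python
def flag_local_dependence(residuals, threshold=0.2):
    """Return (i, j, value) for item pairs whose residual correlation
    exceeds the threshold in absolute value. Indices are 0-based."""
    n = len(residuals)
    flagged = []
    for i in range(n):
        for j in range(i + 1, n):
            if abs(residuals[i][j]) > threshold:
                flagged.append((i, j, residuals[i][j]))
    return flagged


# Hypothetical 3-item residual correlation matrix.
example_residuals = [
    [0.00, -0.25, 0.05],
    [-0.25, 0.00, 0.10],
    [0.05, 0.10, 0.00],
]
pairs = flag_local_dependence(example_residuals)
```

Only the upper triangle is scanned, so each pair is reported once.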

The IRT analyses were conducted using IRTPRO version 2.1 [23]. With a 2-parameter IRT model (the graded response model), the parameters (item discrimination and step thresholds) were estimated for each test item. We used S-X², a Pearson X² statistic, to investigate item fit to the measurement model [24, 25]; the misfit criterion was a p value of less than 0.001 [16]. The standard error of measurement (SEM) was also calculated across theta values to test the precision of the item bank. SEM is the reciprocal of the square root of the test information function (TIF) at the estimated ability and is defined as \(\mathrm{SEM}(\theta_j) = \sqrt{1/I_j(\theta_j)}\), where \(\theta_j\) is the estimated ability and \(I_j\) is the information. When the theta values of test items are calibrated with mean 0 and standard deviation 1, the cut-off for SEM is 0.3, which corresponds to a reliability of approximately 0.90 [26]. The SEM values were presented graphically over the difficulty level of the test items in order to investigate the measurement precision that the item bank attains over the range of self-efficacy scores (T scores).
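The link between the graded response model, test information, and SEM can be sketched in a few lines. This is not the IRTPRO implementation: the function names and item parameters are hypothetical, the logistic form is used without the 1.7 scaling constant, and reliability is taken as 1 − SEM², which assumes a theta metric with unit variance.

```python
import math


def _cumulative(theta, a, thresholds):
    """P(response >= k) for k = 0..m, padded with 1 at the bottom and 0 at the top."""
    inner = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model."""
    cum = _cumulative(theta, a, thresholds)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def grm_item_information(theta, a, thresholds):
    """Item information: sum over categories of P'(theta)^2 / P(theta)."""
    cum = _cumulative(theta, a, thresholds)
    info = 0.0
    for k in range(len(cum) - 1):
        p = cum[k] - cum[k + 1]
        dp = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 0:
            info += dp * dp / p
    return info


def standard_error(theta, items):
    """SEM(theta) = 1 / sqrt(test information); items is a list of (a, thresholds)."""
    total = sum(grm_item_information(theta, a, bs) for a, bs in items)
    return 1.0 / math.sqrt(total)


def reliability(sem_value):
    """Reliability implied by an SEM on a unit-variance theta metric."""
    return 1.0 - sem_value ** 2


# Hypothetical 5-category items: (discrimination, four step thresholds).
toy_items = [(2.0, [-2.0, -1.0, 0.0, 1.0]), (1.5, [-1.5, -0.5, 0.5, 1.5])]
sem_at_zero = standard_error(0.0, toy_items)
```

Under these assumptions, an SEM of 0.3 gives a reliability of 0.91, which is consistent with the cut-off quoted from [26].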

In this study, Logistic Regression Differential Item Functioning using IRT was used to detect the presence of differential item functioning (DIF) based on trait estimates (thetas) [27]. DIF analyses were conducted for five variables, namely gender (male/female), race (white/non-white), ethnicity (Hispanic/non-Hispanic), age (under 65/over 65 years), and data collection mode (clinical setting/internet survey), and for the neurological chronic conditions (neuropathy, stroke, multiple sclerosis, Parkinson’s disease and epilepsy). The comparison models were categorized as uniform (if the effect is constant), non-uniform (if the effect varies conditional on the trait level), and total DIF (sum of uniform and non-uniform DIF) [27]. The detection index for DIF was McFadden’s pseudo R², and the detection criterion was any value greater than 10 % [28].
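The pseudo R² comparison can be illustrated as follows. The sketch assumes the usual nested-model framing for logistic regression DIF: model 1 contains the trait estimate only, model 2 adds the group term, and model 3 adds the trait-by-group interaction. It takes the fitted log-likelihoods as given rather than fitting the ordinal models. The function names and log-likelihood values are hypothetical.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept-only)."""
    return 1.0 - ll_model / ll_null


def dif_r2_changes(ll_null, ll_m1, ll_m2, ll_m3):
    """Pseudo R-squared changes between the three nested DIF models."""
    r1 = mcfadden_r2(ll_m1, ll_null)
    r2 = mcfadden_r2(ll_m2, ll_null)
    r3 = mcfadden_r2(ll_m3, ll_null)
    return {
        "uniform": r2 - r1,      # model 1 vs model 2
        "non_uniform": r3 - r2,  # model 2 vs model 3
        "total": r3 - r1,        # model 1 vs model 3
    }


def flag_dif(changes, criterion=0.10):
    """Names of the comparisons whose R-squared change exceeds the criterion."""
    return [name for name, value in changes.items() if value > criterion]


# Hypothetical log-likelihoods for one item and one grouping variable.
changes = dif_r2_changes(ll_null=-1000.0, ll_m1=-700.0, ll_m2=-695.0, ll_m3=-690.0)
flagged = flag_dif(changes)
```

By construction the total change equals the sum of the uniform and non-uniform changes, matching the definition of total DIF in the text.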

Results

Demographic characteristics

Table 1 presents the demographic characteristics of the sample. Fifty-one percent were female. The sample median age was 55 (SD = 14.7) years, with a range of 18–89 years. The largest age group (37.3 %) was 50–64 years old. In the sample, 73.2 % were white and 23.7 % were non-white. Twenty-two percent had attained an advanced degree. Missing data were 3 % or less for gender (2.9 %), race (3.0 %), ethnicity (2.8 %), and education (2.8 %). The severity of neurological chronic conditions for the clinical sample was rated by clinicians. The sample median disease duration was 7.0 (SD = 10.7) years, and 67.3 % had no-to-mild impairment. There were significant differences in all demographic variables between the clinical and internet samples (p < 0.05), except for gender.

Table 1 Demographic characteristics

Table 2 and Fig. 1 present descriptive statistics for the managing daily activities item pool. A majority of the test items were rated in the highest rating category and demonstrated a ceiling effect (range 33.1–86.0 %). The overall percentages for rating categories 1–5 were 6.3, 5.7, 10.2, 14.6, and 62.9 %, respectively. A majority of the items showed response distributions skewed toward the ceiling (very confident). The test items demonstrated good item-total correlations (0.59–0.85), and internal consistency was considered acceptable for individual-level measurement (Cronbach’s alpha = 0.97).

Table 2 Descriptive statistics for PROMIS self-efficacy for managing daily activities
Fig. 1 Distribution of the sum of rating scale frequency

Dimensionality

The factor analysis with one-factor solution indicated that the CFI (0.952) and SRMR (0.070) met the model fit criteria, and TLI (0.949) was marginally under the model fit criteria; however, RMSEA (0.09) did not meet the model fit criteria. The EFA was conducted to calculate the eigenvalues on the possible underlying factors. The first factor showed an eigenvalue of 24.34, and the other factors showed an eigenvalue less than 2 (see Table 3). Table 4 represents the EFA factor loadings on the 35 test items. All test items showed a high factor loading (0.72–0.92). The sum squared loadings of the 35 items were 24.97, and the ratio of first to second eigenvalue was 12.4. The test items explained 71 % of variance.

Table 3 EFA eigenvalue of factors on the item bank
Table 4 Managing daily activities item bank: item calibrations, step thresholds, factor loadings, and item fit statistics

High residual correlations (−0.22 to −0.29) were found between item 13—“Exercise at a vigorous level for 10 min” and item 2—“eat without help,” 3—“personal hygiene without help,” 26—“use telephone to schedule appointments,” and 34—“can find new ways to manage daily activities when the old way doesn’t work.”

Model fit

Nine items (items 2, 3, 17, 22, 23, 25, 26, 34, and 35) misfit the measurement model (Table 4). In order to investigate the impact of the misfit items, we created a scatter plot of person measures calibrated from the full item pool (35 items) against person measures calibrated from the non-misfit items (26 items). In spite of the narrow error bands, there were no person measures located outside the 95 % confidence interval (i.e., over 5 % of the sample). Based on these findings, we concluded that the misfit items had a negligible or no effect on the person measures.

Item calibrations

Table 4 summarizes the item calibrations for the item bank. The most challenging test items were item 13—“exercise vigorously for 10 min,” item 24—“keep doing my usual activities at work,” and item 30—“maintain a regular exercise program.” The average thresholds of the most challenging items were −0.13, −0.63, and −0.69, respectively. The least challenging test items were item 3—“personal hygiene without help,” item 33—“take medications with correct dose and times,” and item 2—“eat without help.” The average thresholds of the least challenging items were −1.72, −1.76, and −1.86, respectively.

Precision

The theta values were converted into T scores with a mean of 50 and a standard deviation of 10 for a US clinical population of individuals with at least one chronic condition. The raw scores of the item pool were converted into T scores (“Appendix”). The sample demonstrated an average T score of 50.3 (SD = 9.5), a range from 16.31 to 65.11, and a median of 50.1. Figure 2 presents the precision of the 35 test items. The x-axis is the converted T score, and the y-axis is the standard error of measurement, indicating the level of precision. The item bank demonstrated the highest precision (SEM = 0.095) at a T score of 39. Across a wide T score range (20 ≤ T score ≤ 57.0), the item bank showed a high reliability of at least 0.90. T scores over 62 showed a low reliability of less than 0.80, and 14.6 % (n = 159) of the sample fell in this range.
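The relationships behind these figures can be written out as a short worked example. It assumes the standard linear T-score transformation and a theta metric with unit variance, and it assumes that the reported SEM of 0.095 is expressed on the theta metric; none of these points is stated explicitly in the text.

```latex
% T-score transformation and the SEM on each metric
T = 50 + 10\,\theta, \qquad \mathrm{SEM}_T = 10\,\mathrm{SEM}_\theta

% Reliability implied by the SEM on a unit-variance theta metric
\text{reliability}(\theta) = 1 - \mathrm{SEM}_\theta^{2}

% SEM thresholds that correspond to the two reliability levels used above
\text{reliability} \ge 0.90 \iff \mathrm{SEM}_\theta \le \sqrt{0.10} \approx 0.316 \quad (\approx 3.16 \text{ T-score points})
\text{reliability} \ge 0.80 \iff \mathrm{SEM}_\theta \le \sqrt{0.20} \approx 0.447 \quad (\approx 4.47 \text{ T-score points})

% Reliability at the point of highest precision (SEM = 0.095)
1 - 0.095^{2} = 1 - 0.009025 \approx 0.991
```

Read this way, a T score above 62 corresponds to theta above 1.2, the region where the SEM exceeds roughly 0.447.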

Fig. 2 Standard error of measurement of the managing daily activities item bank

Differential item functioning

The item bank showed no DIF for gender (male/female), race (white/non-white), ethnicity (Hispanic/non-Hispanic), age (under 65/over 65 years), data collection mode (clinical setting/internet survey), and neurological chronic conditions. All magnitudes of McFadden’s pseudo R² for DIF were less than 10 %: gender (0.01–3.21 %), race (0.04–2.37 %), ethnicity (0.01–5.32 %), age (0.00–0.90 %), data collection mode (0.01–4.51 %), and the five chronic conditions, namely neuropathy (0.00–3.30 %), stroke (0.04–1.41 %), multiple sclerosis (0.01–1.44 %), Parkinson’s disease (0.02–1.77 %) and epilepsy (0.00–3.30 %).

Final item bank

Item 13 (“I can exercise at a vigorous level for 10 min, i.e., running, jogging”) showed high residual correlations (absolute value over 0.2) with items 2, 3, 26, and 34 (−0.22 to −0.29); however, item 13 was identified as the most challenging item in the bank. Thus, the item was considered important for measuring a wide range of the latent trait, and it was not removed from the item bank. The findings from the investigation of item calibrations and step thresholds of the 35 test items are reported in Table 4.

Discussion

We developed the PROMIS self-efficacy for managing daily activities item bank and evaluated its psychometrics. The item bank demonstrated a single measurement construct and precisely measured individuals with a wide range of abilities. Based on its acceptable psychometric properties, the item bank can be used for a computerized adaptive test (CAT) that measures self-efficacy for managing daily activities in individuals with neurological chronic conditions. In addition, the item bank can be used to create various short forms where necessary to reduce administrative burdens.

In keeping with the purpose of the PROMIS initiative [1], this study established a new domain of self-efficacy as part of the existing PROMIS Domain Framework. Self-efficacy is the last domain in the PROMIS mental health category. The item bank for self-efficacy demonstrated sufficient psychometric properties in comparison with other PROMIS domains, such as anxiety, depression, pain, fatigue and anger [16, 29–31]. Once these item banks were created and validated, they were further used for CAT and for creating short forms of various lengths [16, 29–31]. Since the study item bank demonstrated acceptable psychometric properties, various short forms (i.e., 4 and 8 items) of self-efficacy for managing daily activities can be created while maintaining good psychometric properties. In addition, the item bank can be used to develop a CAT to reduce administrative burdens while maintaining measurement precision.

We expected a slightly higher RMSEA value than the common RMSEA criterion because there was a large number of test items in the item pool [30]. However, since the TLI fell marginally short of the model fit criterion, we conducted an EFA to investigate a potential secondary dimension. The EFA indicated that the item pool sufficiently met the criteria for unidimensionality.

Only one of the 35 test items, item 13 (“exercise vigorously for 10 min”), showed local dependency, with items 2, 3, 26, and 34 (residual correlations = −0.22 to −0.29). High residual correlations estimated by the single factor CFA indicate that the relationships between item 13 and these four items are stronger than the relationships between item 13 and the latent construct. However, the relationships between item 13 and the four items are not interpretable. For example, the residual correlation between item 13 and item 2 was −0.29, indicating that patients having high confidence in exercising vigorously have low confidence in eating. Analyses conducted in this study do not speak to potential explanations for such findings. Item 13 was the most challenging item in the item pool and therefore effective in measuring individuals who respond at the higher ranges of the measure. For these reasons, we did not remove item 13 from the item bank.

The distributions of responses were skewed toward endorsing high confidence levels, indicating that the test items in the bank were relatively unchallenging for our sample. In other words, the sample’s confidence levels were higher than the difficulty of the test items. This finding may explain the low precision (reliability lower than 0.80) for participants with high confidence levels, that is, those with a T score greater than 62 (14.6 % of the sample). Theoretically, the IRT model calibrates parameters based on relationships between person ability and item difficulty. When person ability is equivalent to item difficulty, the item information function carries the most information and yields low standard errors of measurement [32]. However, when there is a discrepancy between person ability and item difficulty, there is a high probability of errors that result in low reliability. In the item bank, the most challenging test items were item 13 (“exercise vigorously for 10 min”), item 24 (“keep doing my usual activities at work”), and item 30 (“maintain a regular exercise program”). Although these test items were the most challenging items in the item bank, other items may not be challenging enough to measure self-efficacy for individuals with higher confidence. In the item pool development process, the research team made efforts to develop particularly challenging items, such as items 12, 13, 14, 15, 16, 19, 20, 23, 24, 28, 30, 31, 34, and 35. Those items were identified as challenging items and were modified based on feedback from patients through focus groups and cognitive testing. For instance, item 28, “I can take care of others (for example, cook for others, help them dress, watch children),” demonstrated a lower proportion of ratings at the ceiling (about 50 %). However, most items demonstrating the highest proportions of ratings at the ceiling are related to basic and instrumental activities of daily living. A possible explanation of the ceiling effect lies in the characteristics of participants who reside in the community, a population in which 67.3 % have no-to-mild impairment levels. Their medical conditions may be stabilized and sufficiently well controlled for living in the community. Therefore, additional research may be needed to develop more challenging test items in order to precisely measure self-efficacy for managing more challenging daily activities.

Although the conceptual domains of self-efficacy were defined and modified by qualitative methods (Delphi, focus groups, and cognitive interviews), the self-efficacy for managing daily activities item bank has test item content similar to that of the PROMIS physical functioning item bank (v. 1.2). In spite of this similarity, we pursued this self-efficacy measure because, from the Delphi phase, being physically active was identified as an important aspect of self-efficacy for managing chronic disease. For instance, item 11, “I can walk a block (about 300 feet or 100 m) on flat ground,” with response options ranging from “I am not at all confident” to “I am very confident,” is very similar to a test item of the PROMIS physical functioning item bank (v. 1.2), “Are you able to walk a block (100 m) on flat ground?”, with the primary difference being the response options, which range from “Without any difficulty” to “Unable to do” [33]. We believe this is a generic limitation of self-reported outcome measures in measuring different conceptual domains. To clearly distinguish the differences and similarities between the self-efficacy item bank and other conceptual constructs, future studies will be needed to investigate the convergent and divergent validity between the self-efficacy item bank and other self-efficacy measures and non-self-efficacy measures, respectively.

The item bank demonstrated no evidence of DIF across gender, race, ethnicity, age, data collection mode, and the five neurological chronic conditions. However, sample size could affect the significance level of DIF when there is a small sample in the comparison group, the reference group, or both. The managing daily activities item bank uses a 5-point rating scale, for which a sample size of at least 200 is needed to detect moderate uniform DIF [34]. Although there was no DIF for demographics and the five neurological chronic conditions, the sample sizes for ethnicity (n = 64) and four chronic conditions, including stroke, multiple sclerosis, Parkinson’s disease, and epilepsy (n = 169–181), were less than 200. These small sample sizes in the comparison group might inflate the type II error rate. Thus, further studies with larger samples (n > 200) are recommended to replicate the DIF tests for ethnicity and the four chronic conditions.

This study has several limitations. First, our sample had a predominance of neurological conditions. While there was no DIF for the data collection mode (clinical site and internet survey), there were significant differences in demographics and the neurological chronic conditions. The different recruitment methods may cause a selection bias, and the sample differences might affect psychometric properties. Secondly, while clinicians rated the item pool and other information for the clinical sample, the collected data was self-reported from the internet sample. Since data administration methods were different for the two samples, the quality of data may not be consistent.

In conclusion, the PROMIS self-efficacy for managing daily activities item bank was developed and tested by quantitative measurement theory methods. The psychometrics of the item bank indicated that self-efficacy for managing chronic conditions can be reliably measured, maintaining an acceptable precision across a wide range of confidence levels. Based on the sound psychometrics, a CAT for the item bank and various short forms (i.e., 4 and 8 items) can be developed in order to reduce the burden of completing the 35-item bank. Once a CAT is developed and tested in clinical settings in people with a range of self-efficacy, the item bank may be improved by tracking the trajectory of change in self-efficacy in a longitudinal study and by introducing more challenging test items that result in improved precision when measuring high levels of self-efficacy. Further research is needed to investigate the untested psychometrics of the item bank, such as predictive validity, test–retest reliability, and responsiveness.