Introduction

Cancer-related fatigue is defined as the perception of unusual tiredness that varies in pattern or severity and can affect the functional ability of cancer survivors [1–3]. A recent literature review of 18 studies measuring symptoms in adults during active treatment found that 62 % of patients experienced fatigue [4]. Fatigue continues to be a prevalent and distressing symptom for cancer survivors years after active treatment ends [5]. Among disease-free breast cancer survivors approximately 3 years post-diagnosis, 41 % reported moderate to severe fatigue levels [6].

Existing self-report fatigue questionnaires vary on a number of factors including questionnaire length, reference period, and the response scale [7]. One key factor that differentiates these questionnaires is whether they provide a single overall score (a unidimensional measure) or multiple scores that reflect different attributes of the fatigue experience (a multidimensional measure). The Piper Fatigue Scale (PFS) is one of the commonly used multidimensional fatigue measures in the cancer research field and includes subdomains of behavioral, affective, sensory, and cognitive/mood attributes of fatigue [8]. The original PFS consisted of 40 questions (items) and the revised PFS (PFS-R) includes 22 questions [8].

While the PFS-R has been translated into multiple languages [9–13], more researchers and clinicians would likely use the PFS if it were shorter. Respondent burden is a concern as most research studies include a battery of self-report measures to capture the health-related quality of life (HRQOL) of cancer patients. In addition, prospective studies are encouraged over cross-sectional studies to capture variations in fatigue experience over time; thus, reducing response burden is a necessity. Finally, to increase research of fatigue in diverse samples of survivors, fatigue instruments must be valid and reliable across different racial/ethnic groups.

The overall goal of this psychometric study was to analyze the PFS-R in a diverse cohort of breast cancer survivors to reduce the number of questions in the scale. To accomplish this goal, the original four subdomain structure of the PFS-R was re-examined. The assumption guiding this study was that if analyses confirmed the multi-dimensionality of the scale, then the shortened scale must maintain its ability to provide reliable measurement for each subdomain.

Methods

Participants

As described elsewhere [6, 14, 15], participants were female breast cancer survivors enrolled in the Health, Eating, Activity, and Lifestyle (HEAL) study. A total of 1,183 women were recruited through the population-based Surveillance, Epidemiology, and End Results (SEER) cancer registries in New Mexico, Los Angeles County, and Western Washington. These women participated in a baseline in-person interview within 1 year after diagnosis. Of those, 944 (80 %) participated in a follow-up assessment approximately 2 years after the first interview, and 858 (73 %) completed an additional HRQOL questionnaire approximately 3 years (40.5 months) after diagnosis.

For this study, 57 women were excluded for recurrent or new primary breast cancer before completing their HRQOL survey, as were 2 women with incomplete fatigue data. The final sample of 799 women included 436 from New Mexico, 195 from Los Angeles County, and 168 from Western Washington. All African-Americans were recruited from Los Angeles and most Hispanic women were recruited from New Mexico. In addition, the African-Americans were restricted to 35–64 years of age to focus on younger breast cancer survivors. Participants were diagnosed with in situ, Stage I, II, or IIIA breast cancer.
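The sample flow described above can be checked with a few lines of arithmetic. The following is only a sketch that restates the counts reported in this section; the variable names are illustrative and not part of the study materials.

```python
# Counts reported in the Participants section.
recruited = 1183           # baseline in-person interviews
follow_up = 944            # first follow-up assessment
hrqol = 858                # completed the HRQOL questionnaire
excluded_recurrence = 57   # recurrent or new primary breast cancer
excluded_incomplete = 2    # incomplete fatigue data

# Final analytic sample after the two exclusions.
final_sample = hrqol - excluded_recurrence - excluded_incomplete

# Final sample by SEER registry site.
site_counts = {
    "New Mexico": 436,
    "Los Angeles County": 195,
    "Western Washington": 168,
}


def pct(part, whole):
    """Percentage of `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


print(final_sample)                                      # 799
print(sum(site_counts.values()))                         # 799
print(pct(follow_up, recruited), pct(hrqol, recruited))  # 80 73
```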

Informed consent was obtained from each participant at each assessment. The study was approved by the Institutional Review Board at each site, in accord with assurances filed with and approved by the U.S. Department of Health and Human Services.

Measures

Demographic and background variables

Age, education, and race/ethnicity data were collected at baseline. Data on marital status, household income, height, employment, menopausal status, smoking, and current weight were collected at the first follow-up survey. Self-reported comorbidities were captured on the HRQOL survey.

Clinical variables

Stage of disease was obtained from each SEER registry database. Medical records were abstracted to obtain treatment information and these data were supplemented with SEER registry data. Women were asked if they were taking tamoxifen (yes/no) at the 24-month follow-up. In the HRQOL assessment, women self-reported about reconstructive surgery and lymphedema.

Fatigue

The PFS-R was included in the HRQOL survey. It is a 22-item scale that measures four subscales: behavior (6 items), affect (5 items), sensory (5 items), and cognition/mood (6 items) [8]. Each item has 11 response categories on a 0–10 metric with verbal descriptors anchoring the endpoints. Each subscale is scored individually and then aggregated together for an overall score, with higher scores reflecting more fatigue. The HEAL study used an adapted version of the PFS-R that asked survivors to rate their fatigue over the past month rather than the past week. This extended reference period was used to minimize the effect of acute situational events and to enhance the assessment of the survivor’s general state of fatigue. Previous studies have found evidence that the PFS-R has acceptable internal consistency and evidence for validity with cancer patients [16–18] including the adapted version used in this study [6].
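As a rough illustration of the scoring structure just described, the sketch below scores one respondent. It assumes that each subscale score is the mean of its items and that the overall score is the mean of all 22 items; the text does not spell out the aggregation rule, so this is an assumption, and the function name and example ratings are hypothetical.

```python
from statistics import mean

# Number of items per subscale, as stated in the text.
SUBSCALE_SIZES = {"behavior": 6, "affect": 5, "sensory": 5, "cognition_mood": 6}


def score_pfs_r(responses):
    """Score one respondent.

    `responses` maps each subscale name to its list of 0-10 item ratings.
    Assumption: subscale score = mean of its items; overall = mean of all items.
    """
    for name, size in SUBSCALE_SIZES.items():
        items = responses[name]
        if len(items) != size:
            raise ValueError(f"{name} expects {size} items, got {len(items)}")
        if any(not 0 <= r <= 10 for r in items):
            raise ValueError(f"{name} has a rating outside 0-10")
    scores = {name: mean(responses[name]) for name in SUBSCALE_SIZES}
    all_items = [r for name in SUBSCALE_SIZES for r in responses[name]]
    scores["overall"] = mean(all_items)
    return scores


# Hypothetical ratings for a single respondent.
example = {
    "behavior": [4, 5, 3, 6, 4, 2],
    "affect": [5, 5, 6, 4, 5],
    "sensory": [7, 6, 6, 7, 4],
    "cognition_mood": [3, 2, 4, 3, 3, 3],
}
scores = score_pfs_r(example)
print(scores)
```

This sketch does not handle respondents who skipped subscales; how such cases are scored would need to follow the scale's scoring manual.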

The HEAL HRQOL questionnaire included an initial screening question that had respondents skip the PFS-R behavioral and affective subscales if they indicated that they had “no fatigue” over the past 4 weeks. These subscales were skipped because the questions asked the respondent to further clarify the fatigue they were currently experiencing. Approximately 36 % of the 799 breast cancer survivors did not complete those two subscales (N = 291), but did complete the sensory and cognitive/mood subscales.

Analyses

Before selecting items for removal from the PFS-R, the dimensionality of the PFS-R was re-examined to determine whether psychometric analyses should be performed at the subscale level (the “multidimensional model” with four subdomains) or at the overall fatigue level (the “unidimensional model” with all 22 items loading on a single fatigue factor).

Confirmatory factor analysis was used to evaluate the fit of the multidimensional and unidimensional models to the HEAL data. In addition, a bifactor model was fit to the data to examine the extent to which the items loaded on a general fatigue factor compared with the items’ loadings on their respective fatigue subdomains. In the bifactor model, larger loadings on the general factor than the subdomain-specific factors would suggest that a dominant single fatigue factor accounted for a majority of the variation observed among the response data. Larger loadings on the subdomain-specific factors than the general factor would suggest that the multidimensional model was a better solution [19].

Figure 1 illustrates the three models tested. Model fit was assessed with multiple indices, using the following criteria for good model fit: root mean squared error of approximation (RMSEA < .06), comparative fit index (CFI > .95), and Tucker–Lewis Index (TLI > .95) [20–23]. MPLUS software (version 6.1) was used for each factor analysis, with a maximum-likelihood estimator for continuous data.
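The three cutoffs can be applied mechanically, as in the minimal sketch below. The function name is hypothetical and the index values in the example are invented for illustration; they are not results from this study. The cutoffs are treated as strict inequalities, which is one reading of the criteria quoted above.

```python
def meets_fit_criteria(rmsea, cfi, tli):
    """Check the three cutoffs quoted in the text (strict inequalities)."""
    checks = {
        "RMSEA < .06": rmsea < 0.06,
        "CFI > .95": cfi > 0.95,
        "TLI > .95": tli > 0.95,
    }
    return all(checks.values()), checks


# Hypothetical index values, not results from this study.
ok, detail = meets_fit_criteria(rmsea=0.05, cfi=0.96, tli=0.94)
print(ok)      # False, because TLI does not exceed .95
print(detail)
```

In practice such cutoffs are usually read alongside the other indices rather than as a single pass/fail rule.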

Fig. 1 Structural equation models fit to the data on the revised Piper Fatigue Scale: multidimensional model, unidimensional model, and bifactor model

The next step included reviewing the psychometric properties of the PFS-R with specific attention given to shortening scale length. Six criteria, described below, were considered for making item selections: (1) content validity, (2) strength of relationship with fatigue, (3) content redundancy, (4) differential item functioning by race and/or education, (5) reliability, and (6) literacy demand.

Content validity of each item examined the extent to which the item reflected a critical attribute of the fatigue domain being measured [24]. The content validity review was especially important for those items being considered for removal based on evidence from the psychometric analyses.

The strength of the relationship between each item and the overall scale was assessed using three statistical methods. First, item-total score correlations were computed in SAS (version 9.2). Second, factor loadings were examined from the factor analyses. Finally, item response theory (IRT) models were used to examine how well each item was related to the fatigue domain (i.e., discrimination) and how the item’s response categories reflected different levels of fatigue (i.e., threshold). The IRTPRO software (ver. 2.1; Scientific Software, Inc.) was used to estimate IRT parameters for Samejima’s Graded Response Model [25–27].
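For readers unfamiliar with item-total correlations, the sketch below shows one common way to compute them. The study used SAS; this Python version is only an illustration. Whether the reported correlations excluded the item from the total (the "corrected" form) is not stated, so the `corrected` flag is an assumption, and the ratings are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(rows, corrected=True):
    """Correlate each item with the scale total.

    `rows` is a list of respondents, each a list of item ratings.
    With corrected=True the item is removed from the total first,
    so an item is not correlated with itself.
    """
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        total = [sum(r) - (r[j] if corrected else 0) for r in rows]
        result.append(pearson(item, total))
    return result


# Hypothetical ratings: 5 respondents x 3 items on the 0-10 metric.
ratings = [[2, 3, 1], [4, 4, 2], [6, 5, 9], [8, 8, 3], [9, 9, 4]]
print(item_total_correlations(ratings))
```

An item with a low value relative to the others in its subscale would be a candidate for closer review.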

Content redundancy was assessed with inter-item Pearson correlations (from SAS), residual correlations abstracted from the factor analyses (from MPLUS), local dependence matrices, and IRT information functions (from IRTPRO). IRTPRO reports a standardized local dependence χ² statistic where values over 10 may be considered problematic [28]. Local dependence suggests there is excessive correlation between the items even after controlling for the underlying fatigue domain being measured. Authors’ expertise was also used to judge content redundancy.
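The screening rule for local dependence can be sketched as a simple filter over pairwise statistics. The sketch assumes the standardized local dependence values have already been exported from the IRT software into a dictionary; the function name, item labels, and values below are hypothetical.

```python
LD_THRESHOLD = 10.0  # cutoff quoted in the text


def flag_locally_dependent(ld_stats, threshold=LD_THRESHOLD):
    """Return item pairs whose LD statistic exceeds the threshold, largest first."""
    flagged = [(pair, value) for pair, value in ld_stats.items() if value > threshold]
    return sorted(flagged, key=lambda entry: entry[1], reverse=True)


# Hypothetical statistics; item labels are placeholders.
ld_stats = {
    ("item_a", "item_b"): 14.2,
    ("item_a", "item_c"): 3.1,
    ("item_b", "item_c"): 10.0,
    ("item_c", "item_d"): 22.7,
}
flagged = flag_locally_dependent(ld_stats)
print(flagged)
```

A value exactly at the cutoff is not flagged here, following the "over 10" wording.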

Differential item functioning was examined to identify possible response bias by race or by education. Due to the small sample size of Hispanic women and those classified as “other” race, differential item functioning could only be examined between non-Hispanic whites and African-Americans. Education was evaluated by comparing response data between those with a high school education or less and those with more than a high school education. Differential item functioning was tested within an IRT framework (using IRTPRO) using Wald χ² to evaluate statistical significance. Because of the large number of response categories (11) per item with the possibility of small sample sizes for each category, a sensitivity analysis was performed comparing the differential item functioning findings using all 11 categories with an alternative 4-category response scale created by collapsing response categories (0 = none; 1–3 = mild; 4–6 = moderate; 7–10 = severe fatigue).
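The collapsing rule used for the sensitivity analysis can be written out directly. The following is a minimal sketch; the function name and the 0 to 3 coding of the collapsed categories are assumptions, while the cut points come from the text.

```python
LABELS = ("none", "mild", "moderate", "severe")


def collapse_rating(rating):
    """Map a 0-10 rating to a 4-category code (0=none, 1=mild, 2=moderate, 3=severe)."""
    if not isinstance(rating, int) or not 0 <= rating <= 10:
        raise ValueError("rating must be an integer from 0 to 10")
    if rating == 0:
        return 0
    if rating <= 3:
        return 1
    if rating <= 6:
        return 2
    return 3


for r in range(11):
    print(r, LABELS[collapse_rating(r)])
```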

Scale reliability was examined using Cronbach’s coefficient alpha. We also examined scale precision with IRT information functions. We ideally wanted to maintain each fatigue subscale’s reliability above .80, which is more than adequate precision for group-level comparisons in cancer research settings [29].
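Coefficient alpha follows the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of the total score), where k is the number of items. The sketch below implements that formula with sample variances; the function name and the ratings are hypothetical and serve only to illustrate the computation.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's coefficient alpha.

    `rows` is a list of respondents, each a list of k item ratings.
    Uses sample variances (n - 1 denominator) for items and total score.
    """
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: 5 respondents x 3 items on the 0-10 metric.
ratings = [[2, 3, 1], [4, 4, 2], [6, 5, 9], [8, 8, 3], [9, 9, 4]]
print(round(cronbach_alpha(ratings), 2))
```

Applied to each retained three-item subscale, a value above .80 would meet the target stated above.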

Literacy demand was evaluated with the Lexile Framework for Reading [30]. A Lexile value is based on two strong predictors of how difficult a text is to comprehend: word frequency and sentence length. Lexile measures provide corresponding grade levels ranging from first grade to post-high school. Scores were averaged across PFS items to produce the mean literacy demand of the scale and corresponding mean reading grade level for the PFS. Lexile analyses have been previously used to evaluate HRQOL questionnaires [31]. The Institute of Medicine guidelines state that health communication materials should be written at a mean 8th grade reading level or below [32].

Together, these six criteria informed the selection of items retained in the PFS-12. The senior author, Barbara Piper, the originator of the PFS, provided oversight of these judgments, drawing on her experience with the scale in different populations and settings worldwide. It was decided that at least three questions from each of the four PFS subscales would remain in the new scale (PFS-12) so that it could still be factor analyzed. Only the important findings that led to the decision to remove or keep an item in the PFS-12 are reported in the “Results” section. After the final selection of the PFS-12 items, factor analysis and differential item functioning testing were not repeated because the items need to be administered together to a new sample.

Results

Characteristics of the 799 breast cancer survivors are summarized in Table 1. Chi-square tests indicated significant differences in distributions of survivors by race/ethnicity for age, marital status, education, employment status, income, BMI, menopausal status, comorbid conditions, cancer stage, treatment type, and lymphedema. This is not surprising given the different SEER registries and methods of recruiting at each site [6].

Table 1 Survivor characteristics by race and ethnicity

Dimensionality of the 22-item PFS-R

The confirmatory factor analysis model results are shown in Table 2. The unidimensional model did not fit the data (RMSEA = .15, CFI = .73, TLI = .70). The multidimensional solution showed adequate fit (RMSEA = .08, CFI = .92, TLI = .91), consistent with how the PFS-R has conventionally been scored and interpreted. The correlations among the four subscales ranged from 0.67 (affective and cognition/mood subscales) to 0.79 (sensory and cognition/mood subscales). The bifactor model showed good fit (RMSEA = .07, CFI = .95, TLI = .94).

Table 2 Factor loadings for a 4-factor, 1-factor, and bifactor model

The bifactor model results and the high correlations among the subscales suggest there may be evidence of a dominant underlying fatigue factor. A one-factor solution accounted for 58 % of the variance observed in the data, while the second factor accounted for only 7 % of the variance. To be consistent with prior use of the PFS-R and to maintain content validity, item reduction analyses proceeded on a subscale-by-subscale basis.

Selecting items for the PFS-12

Table 3 presents a summary of some of the psychometric analyses performed on the items within each subscale, with unshaded items being retained for the PFS-12. Differential item functioning by education level was tested, but none was detected.

Table 3 Review of psychometric findings for items removed and retained in the PFS-12 ordered by subscale (N = 799)

In the behavioral subscale, no item tested positive for differential item functioning. The item “engage in sexual activity” had the lowest item-total score correlation and was found by the IRT model (not presented) to have poor discrimination. The item “fatigue causing distress” was judged to have more affective than behavioral content. While no psychometric issue emerged, the item on “socializing with friends” was thought to be highly related to the item on “engaging in enjoyable activities”, but the latter captured a broader range of impact on activities that could be done alone or in a group setting. The item “engaging in enjoyable activities” had the highest literacy demand but was retained for these content reasons.

In the affective subscale, four items tested positive for differential item functioning between African-Americans and whites. In addition, two item pairs were found to exhibit local dependence. Given this psychometric evidence, we relied on content expertise and literacy demand to select the final three items for the affective subscale of the PFS-12.

For the sensory subscale, two items tested positive for differential item functioning. In addition, three sets of items were found to be locally dependent. One item from each pair was removed to resolve the local dependence.

In the cognitive/mood subscale, two items tested positive for differential item functioning and three item pairs had evidence for local dependence.

For each subscale, the differential item functioning analyses were repeated, as a sensitivity analysis, using the collapsed 4-category response scale. There still was no differential item functioning by education level. Consistent with prior findings, the item “exhilarated to depressed” on the cognitive subscale tested positive for differential item functioning by race. However, the item “able to remember” did not demonstrate differential item functioning with the collapsed categories. The item “lively to listless” on the sensory subscale tested positive for differential item functioning by race for both the 11-point and 4-point scale. With the 4-point scale, the sensory item “energetic to unenergetic” no longer showed differential item functioning by race. Different from prior findings, the sexual activity item in the behavioral subscale tested positive for differential item functioning by race. In the affective subscale, only one item showed differential item functioning by race (“pleasant to unpleasant”), but was retained for content validity reasons.

Overall, the PFS-12 maintained high scale reliability (r = .92) with the original 22-item PFS-R having a reliability of .96. In addition, reliability for the PFS-12 subscales remained above .80: behavior (.89), affective (.87), sensory (.87), and cognition/mood (.87). The formatted version of the PFS-12 and accompanying scoring manual are provided in the Appendix.

Discussion

In a diverse sample of breast cancer survivors approximately 3 years from diagnosis, the 22-item PFS-R was shortened to the 12-item PFS-12 based on multiple criteria including reliability, validity, literacy demand, and response bias (i.e., differential item functioning) by race. This ten-item reduction in the PFS-R has the potential to reduce response burden in future studies while still maintaining a high level of precision for group-level comparisons at the subscale level (i.e., behavioral, affective, sensory, and cognitive/mood aspects of fatigue).

After testing alternative factor analytic models, results indicated that a four-factor model representing the original four PFS-R subdomains fit the data better than a one-factor model. This provides evidence that the four subscales may represent distinct aspects of the fatigue experience as reported by breast cancer survivors.

It is possible, however, that the findings for the multidimensional model may be more methodological artifact than distinct fatigue subdomains. For instance, items within each subscale are worded more similarly than items in other subscales; in addition, items presented next to each other on average will be more related than items further away [33]. If the “distinctiveness” of the factors were purely artifact, this would suggest that the fatigue experience itself similarly affects all aspects of a person’s life: behaviors, sensory, affect, and cognition/mood. In support of this perspective, a dominant underlying factor was found accounting for 58 % of the variance and high correlations were found among the four subscales (all correlations ≥ 0.67). In addition, the bifactor model supported a dominant general fatigue factor after extracting clustering of items within each subscale. This study cannot disentangle whether evidence for a multidimensional model is due to the distinctness of each fatigue subdomain or due to artifact; likely it is a mixture of both.

Retaining the ability to provide scores on the PFS-12 for both overall fatigue and individual subdomains is valuable for researchers. It allows investigators to characterize the extent to which a health condition or treatment affects different aspects of the fatigue experience. In addition, an intervention may differentially affect specific fatigue subdomains; e.g., meditation may improve cognition/mood and have less effect on behavioral fatigue. For other research studies, overall fatigue may only be of interest as a mediating variable or an outcome. For each application, the PFS-12 can be used without extensive response burden.

Evidence in the literature from the US and abroad supports the PFS as a multi-dimensional measure. Three psychometric studies of non-English translations of the PFS-R, Greek [9], French [10], and Italian [34], also identified the same four subscales. In contrast, two studies found evidence for a 3-factor solution combining the domains of sensory with cognition/mood. One of these studies used a Brazilian translated version of the PFS-R [13]; the other used the English version in caregivers of stroke survivors [35]. Lending additional validity to the multi-dimensional model of fatigue, recent evidence suggested that the mechanisms driving fatigue may differ by fatigue dimension: increased inflammation (generally thought to produce fatigue in cancer survivors) was related to the behavioral and sensory aspects of fatigue but not to the more psychological aspects (affective and cognitive subscales) [36].

The review of literacy demand of the PFS-R found items in the sensory subscale required a 3rd grade education or higher. The most demanding items were in the behavioral subscale and required an 8th grade education or higher (with one item that required a post-high school education). This more demanding item was kept in the PFS-12 because it captures a broader range of enjoyable activities and could be applied to experiences alone or with a friend or partner. Future versions of the PFS-R or PFS-12 should consider revising this item to read, “To what degree is the fatigue you are now feeling interfering with your ability to do the activities you enjoy?” This simple modification does not change the item content, but drops the literacy demand to a 9th grade education level. Additional cognitive testing of this revised item is recommended.

This study expands on findings from a published study with the same breast cancer cohort, which found fatigue, measured by the PFS-R, was associated with poorer HRQOL [6]. Specifically, fatigue was associated with pain, cognitive problems, antidepressant use, weight gain/personal appearance, and physical inactivity [6]. Because this previous study provided evidence for the validity of the PFS-R in terms of its association with other related clinical and HRQOL factors, these analyses were not duplicated in this study.

This study had limitations. First, though this psychometric analysis used a population-based sample of breast cancer survivors, results cannot be generalized to other survivor groups or health conditions. In addition, race comparisons are not based on women equally recruited across sites. The Los Angeles site collected all African-American data and New Mexico collected a majority of the Hispanics. Thus, the differential item functioning results by race could have been associated with geographic differences [15]. In addition, eligible African-American women were restricted to 35–64 years of age; thus age differences could be another source for differential item functioning findings. Restricting whites to the same age range would not have yielded a large enough sample for differential item functioning testing.

It is therefore recommended that additional analyses of datasets that included the PFS-R be done to confirm these findings, including datasets that used a 7-day reference period instead of the 30-day reference period used in this study. The psychometric properties of the PFS-12 also need reconfirmation in new samples in which the 12 items are administered together rather than dispersed across the larger PFS-R. Evaluations in women undergoing active treatment and in other cancer populations that include both men and women are also recommended.

Despite these limitations, the brief and reliable PFS-12 should have great value in future patient-centered outcomes research studies to capture the multidimensional aspects of the fatigue experience. Additional research is planned to identify clinically meaningful cut-points on the PFS-12 that classify individuals as mildly, moderately, or severely fatigued and to characterize the association between these cut-points and HRQOL decrements. Together, a psychometrically sound and decision-relevant fatigue measure will enable researchers to provide empirical evidence about the impact of interventions on survivors’ lives that may lead to identifying safer and more effective treatments.