Introduction

Over half of cancer survivors are likely to experience significant physical limitations [1]. Decline in physical function is often associated with a cancer diagnosis and the ensuing initial treatment [2, 3], and such decline can have long-lasting effects extending past treatment and is associated with lower quality of life and increased risk of mortality [4].

Physical function is a key patient-reported outcome (PRO) used to characterize and better understand overall health, level of physical disability, and general well-being. Physical function is a foundation for many commonly used general and cancer-specific (e.g., SF-36 and FACT-G, respectively) PRO measures [5, 6] and the Patient-Reported Outcomes Measurement Information System® (PROMIS®) [7–9]. These measures provide a systematic report of functional well-being, similar to physician-rated performance status measures that are known to have low inter-rater reliability [10]. Physical function PROs offer a comprehensive assessment of body function, impact of disability on physical participation, activity level, and environmental and personal characteristics [11, 12] and incorporate the patient perspective.

PROMIS, a US National Institutes of Health Common Fund initiative, has developed an extensive item response theory (IRT)-calibrated item bank, a collection of self-administered questions, and multiple short-form questionnaires available to measure physical function. This physical function domain was developed to measure a full range of function on one common standardized scale, minimizing ceiling and floor effects where the score is higher or lower than the survey can identify [12, 13], and has demonstrated conceptual validity and reliability [8, 14]. Initial validation of this domain in rheumatoid arthritis and osteoarthritis populations and normal aging cohorts showed that PROMIS physical function measures outperformed legacy instruments (i.e., the Health Assessment Questionnaire [HAQ]) [15]. Subsequent work validated this PROMIS physical function item bank in a more diverse general population sample [16]. The item banks were designed to allow customized short forms of variable length and item content to be created, yet yield comparable, standardized scores across the short forms [17]. However, the comparability of PROMIS physical function short forms in a community-based sample encompassing a broad range of age, disability level, and race–ethnicity has not been extensively tested.

Our study objectives were to evaluate (1) the applicability of the PROMIS physical function measures for a diverse sample of cancer patients, and (2) the psychometric performance of commonly used PROMIS physical function short forms.

Methods

Recruitment

The Measuring Your Health (MY-Health) study recruited a population-based sample of cancer patients from four Surveillance, Epidemiology, and End Results (SEER) Program cancer registries (The Greater Bay Area Cancer Registry covering the San Francisco Bay and surrounding area, the Cancer Registry of Greater California covering the rest of the state except Los Angeles County, the Louisiana Tumor Registry, and the New Jersey State Cancer Registry). We stratified sampling by four race–ethnicity groups (non-Hispanic white [NHW], Hispanic, non-Hispanic black [black], non-Hispanic Asian [Asian]) and three age groups (21–49, 50–64, 65–84), based on the base incidence rates at each registry. The study was approved by Institutional Review Boards at all participating institutions.

Population

Participants in this cohort were identified based on the following SEER eligibility criteria: 21–84 years of age at diagnosis; diagnosed with one of seven cancers (prostate, colorectal, non-small-cell lung, non-Hodgkin lymphoma, female breast, uterine, or cervical); no prior cancer diagnosis (except non-melanoma skin cancer); currently within 6–13 months of diagnosis; and able to read English, Spanish, or Mandarin. Patients without cancer stage information, age, or race–ethnicity information were excluded from this analysis (N = 662, Fig. 1) to ensure all known-group comparisons were done across a single uniform cohort.

Fig. 1 MY-Health cohort flow chart

MY-Health survey

Survey items included self-reported sociodemographic characteristics, receipt of recent treatments, comorbidities, patient-reported outcomes, and selected health behaviors. Pilot testing was conducted in 35 respondents to identify and correct any errors or unclear language and skip patterns in the survey. The SEER registry sites mailed a survey to eligible participants, with additional Spanish or Mandarin Chinese translations sent to persons based on surname or made available upon request. Cover letters in the same language as the survey were sent explaining the reason for the study and requesting participation. Along with a second mailing, phone follow-up was initiated for all non-responders after 3 weeks to encourage return of the survey. When contacted, participants were given the option to complete the survey over the phone in English, Spanish, or Mandarin Chinese. All Spanish and Mandarin translations of PROMIS items followed a strict translation protocol [18] and were done in coordination with the PROMIS Statistical Center at Northwestern University. Participants received a $30 gift card or check after completing the survey.

Demographic and clinical variables

We merged the patient survey data with SEER registry variables. SEER registry variables include age, sex, date of cancer diagnosis, cancer type, and cancer stage. In addition, we included the following self-reported survey variables: receipt of chemotherapy, radiation therapy, or hormonal therapy; surgery; comorbid conditions (number and type); education level; current employment status; annual income; marital status; insurance coverage; and whether the patient was born in the USA. We used the following self-reported race–ethnicity categories (NHW, black, Hispanic, Asian), created following US Census (2010) classification algorithms [19]. When self-reported race–ethnicity was missing (<0.4 % of patients), SEER registry information was used.

Patient-reported outcomes

We evaluated three established PROMIS physical function (PF) short-form measures (PF 4a, PF 6b, and PF 10a) and a custom 16-item form. PROMIS PF short forms are fixed assessments, administered either on paper or electronically. Two forms evaluated here (PF 4a, PF 6b) are the physical function subscales of the PROMIS Adult Profile 29 version 2 and PROMIS Adult Profile 43 version 2, respectively [20]. We selected items for inclusion in the MY-Health survey instrument based on either their inclusion in commonly used short forms, or their frequent selection in the online PROMIS computer adaptive testing (CAT) format. We examined CAT item selection for two different patient groups (0.5 and 1.0 SD below the population mean). Convergent and discriminant validity (types of construct validity) were evaluated with respect to the following PROMIS measures (each showing high internal consistency α in this cohort) [21]: ability to participate in social roles and activities version 1 (10 items, α = 0.98); emotional distress-anxiety (11 items, α = 0.97); emotional distress-depression (10 items, α = 0.97); fatigue (14 items, α = 0.96); and pain interference (11 items, α = 0.98). PROMIS measures are reported as T-scores (0–100 scale) with a mean of 50 and SD of 10. All PROMIS measures except ability to participate in social roles and activities are normalized to the general US population [7]. High scores for physical function and social roles and activity represent better functioning, and high scores for the symptoms represent greater symptom burden. To address convergent and discriminant validity, we also administered the 7-item FACT physical well-being (PWB) subscale (α = 0.84) [5]; a spirituality measure comprising two subdomains (faith and peace), the FACIT-SP-12 version 4 (α = 0.85) [22]; a 5-item financial burden subscale from the PSQ-III (α = 0.83) [23]; and an 8-item acculturation scale for US immigrants (α = 0.94) [24].
To address known-group validity, questions on the use of assistive devices, a single-item patient self-reported ECOG performance status scale used in cancer clinical trials to assess disease impact on daily living abilities [25], comorbid medical conditions (asthma, COPD, arthritis, and overall number), physical activity, stage of disease, cancer site, and demographic variables were included. Hypotheses are described below.
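
As an illustration of the T-score metric described above (mean of 50 and SD of 10 in the reference population), the following minimal sketch converts between standard-deviation units and T-scores. The function names are hypothetical and are not part of any PROMIS scoring software; actual PROMIS scoring relies on IRT-based response-pattern scoring or published lookup tables, which this sketch does not reproduce.

```python
def sd_units_to_t_score(z: float) -> float:
    """Map a score in SD units (0 = reference mean) onto the T-score metric."""
    return 50.0 + 10.0 * z


def t_score_to_sd_units(t_score: float) -> float:
    """Express a T-score as its distance from the reference mean in SD units."""
    return (t_score - 50.0) / 10.0


# The two patient groups used to examine CAT item selection
# (0.5 and 1.0 SD below the population mean) correspond to:
print(sd_units_to_t_score(-0.5))  # 45.0
print(sd_units_to_t_score(-1.0))  # 40.0
```

On this metric, a group difference of 4 T-score points corresponds to 0.4 SD in the reference population.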

Reliability and validity testing

We used standard psychometric procedures to evaluate reliability and validity [26] of each PROMIS PF short form across three age (21–49, 50–64, 65–84) and four race–ethnicity (NHW, black, Hispanic, Asian) groups. We evaluated overall and item-level performance. We estimated internal consistency using Cronbach’s coefficient alpha, with α > 0.70 and α > 0.90 the thresholds for reliable group- and individual-level (inter-individual comparisons at a single time point) measurement, respectively. For structural validity, we evaluated unidimensionality of the PROMIS PF short forms using factor analysis methods, with a mean- and variance-adjusted weighted least square (WLSMV) estimator. Goodness-of-fit indicators included the Comparative Fit Index (CFI) and the Tucker–Lewis Index (TLI). We tested multiple types of construct validity across age and race–ethnicity groups. We examined convergent and discriminant construct validity by calculating Pearson correlations between physical function and other administered scales. The PROMIS PF short forms were expected to be positively correlated with social role participation and another measure of physical function (the FACT physical well-being subscale), negatively correlated with symptom severity (e.g., more fatigue or pain), and weakly correlated with other non-physical function measures (e.g., FACIT spirituality, financial burden, and acculturation). We used Chi-square tests to evaluate known-group validity of expected a priori differences in physical function (all forms), for the total sample. Specific variables, hypotheses and supporting citations, and any minimally important differences between race–ethnicity and age groups (PROMIS physical function T-score ≥ 4, a meaningful important score difference [27]) are described in Table 5. Factor analysis was conducted using Mplus (version 7.1, Los Angeles, CA); all other analyses were conducted using SAS (version 9.3, SAS Institute, Cary, NC).
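
For readers unfamiliar with the internal consistency statistic, the sketch below shows the standard formula for Cronbach’s coefficient alpha and the two thresholds named above. It is written in Python for illustration only; the analyses in this study were run in SAS, and the function names and response data here are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Sample variance of each item across respondents
    item_variances = [variance([r[i] for r in responses]) for i in range(k)]
    # Sample variance of each respondent's summed score
    total_variance = variance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def reliability_level(alpha):
    """Classify alpha against the thresholds used in this section."""
    if alpha > 0.90:
        return "individual-level"
    if alpha > 0.70:
        return "group-level"
    return "below threshold"


# Hypothetical responses: five respondents, four items scored 1-5
example = [
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 1, 1],
    [2, 2, 3, 2],
]
alpha = cronbach_alpha(example)
print(round(alpha, 2), reliability_level(alpha))
```

Alpha rises as items covary more strongly and, other things equal, as the number of items grows, which is one reason alpha alone does not establish unidimensionality.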

Results

Overall, participants in the MY-Health cohort are demographically and clinically diverse, important for establishing generalizability to other cancer populations (Table 1). Nonwhite participants comprised 57 % of the total cohort, and 59 % of participants were under 65 years of age. Eighteen percent of the cohort reported less than a high school education (Hispanics, 37 %; blacks, 22 %; and Asians, 14 %), and 15 and 50 % of the cohort reported household income levels under $60,000. Thirty percent were not born in the USA, and 9 % of the surveys were completed in Spanish or Mandarin Chinese. Cancer incidence by type ranged from cervix (3 %) to breast (30 %); 12 % of patients were diagnosed with stage IV cancer; and about half reported the receipt of chemotherapy (48 %).

Table 1 Demographics and clinical characteristics by race–ethnicity

This cohort reported a mean PROMIS physical function score (using the PF 16-item short-form score) of 44.9 (Table 2), one-half of a standard deviation lower than the US general population mean. Mean differences between the PROMIS PF short-form scores and the 16-item MY-Health form ranged from 0.05 (PF 10a) to 0.80 points (PF 4a), all well within the mean standard error of measurement (2.2–3.9 points). These differences remained consistent across age and race–ethnicity groups. Reliability of all PROMIS physical function short forms was high (α = 0.92–0.96, Table 2) and remained >0.90 when restricted to subgroups based on age and race–ethnic groups (not shown in tables). Floor effects were minimal across all forms, but ceiling effects were evident in PF 4a (34.5 %) and PF 6b (25 %).
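
Floor and ceiling effects of the kind reported above are commonly summarized as the percentage of respondents at the lowest and highest possible score on a form. The sketch below illustrates that calculation under stated assumptions: the function name and data are hypothetical, and the raw score range of 4–20 assumes a 4-item form with each item scored 1–5.

```python
def floor_ceiling_percent(raw_scores, min_score, max_score):
    """Return (floor %, ceiling %): share of respondents at each score extreme."""
    n = len(raw_scores)
    if n == 0:
        raise ValueError("no scores supplied")
    floor_pct = 100 * sum(s == min_score for s in raw_scores) / n
    ceiling_pct = 100 * sum(s == max_score for s in raw_scores) / n
    return floor_pct, ceiling_pct


# Hypothetical summed raw scores for ten respondents on a 4-item form
scores = [20, 20, 4, 12, 15, 18, 20, 9, 16, 20]
print(floor_ceiling_percent(scores, min_score=4, max_score=20))  # (10.0, 40.0)
```

Respondents at the maximum score cannot be distinguished from one another, so a large ceiling percentage signals that a form lacks items for higher-functioning individuals.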

Table 2 Item-level and short-form properties

For structural validity, confirmatory factor analysis for a one-factor model fit to all 16 items generally showed good fit (CFI and TLI = 0.99). Exploratory factor analysis identified one strong factor (eigenvalue = 12.7) and high factor loadings (>0.6) for all items. A second, highly correlated factor (r = 0.83) was identified for items that ask about self-care actions (e.g., wash and dry your body, shampoo your hair) that are only found on the PF 10a (Table 3).

Table 3 Physical function factor loadings (oblique rotation) and correlations

For convergent and discriminant validity, physical function was correlated with other PRO domains as hypothesized (see Table 4) and consistent with the previous literature. There were strong correlations (r ≥ 0.67) with ability to participate in social roles, fatigue, pain, and functioning on the FACT-G PWB scale. The domain was moderately associated (r = −0.38 to −0.50) with depression, anxiety, and sleep disturbance scores. Physical function showed weak-to-moderate correlations (r ≤ 0.26) with spirituality, financial burden, and acculturation.

Table 4 Convergent and discriminant validity by physical function short form

Known-group testing confirmed our a priori hypotheses about differences in physical function (Table 5). Only two subgroup comparisons (cancer site and performance status) showed age and race–ethnic differences that were four or more points higher or lower than the reference groups. Among cancer clinical variables, patients diagnosed with advanced cancer and those who received chemotherapy both reported significantly lower physical function scores (−4.40 and −5.35 points, respectively). Of all cancer types, lung cancer patients had the lowest mean physical function scores (39.1), while men with prostate cancer reported the highest mean scores (49.6).

Table 5 Known-group validity, score differences in physical function (PF 16 short form)

Self-reported comorbidities were associated with large decreases in physical function, both by the number of other conditions reported and by whether COPD or asthma was reported. As expected, the largest differences in physical function (13 points, p < 0.001) were found if a person indicated they had any trouble walking. These findings also were consistent when physical function was evaluated by ECOG performance status, covering a large range of disability. Overall, each decrease in performance status level (normal, some symptoms, <50 % bed rest, >50 % bed rest) was accompanied by a large, statistically significant decrease in physical function, while the scores and standard deviations were consistent across short forms at each level (Fig. 2).

Fig. 2 PROMIS physical function short-form mean and standard deviation by ECOG performance status

We found that scores near the floor and ceiling of this domain were similar across all short forms examined. Groups anticipated to be very low functioning, near the floor of this domain (>50 % bed rest), reported similar scores (differing by <0.5 of a point) across all physical function short forms. The highest-functioning group (reported vigorous activity five or more times a week) had a two-point mean difference between the 4-item and the full 16-item measures, still within the standard error of measurement for both forms (data not shown).

Discussion

This study demonstrated the validity and reliability of PROMIS physical function short forms in a sociodemographically diverse, population-based cohort of cancer patients. We found that scores across all short forms performed consistently across race–ethnic and age groups. Reliability and validity criteria were met for race–ethnic and age groups across all tested physical function forms, providing strong evidence that these measures are accurate, precise, and comparable in a diverse cohort of cancer patients.

Previous work validating the PROMIS physical function bank suggested that, like virtually all extant measures of self-reported physical function, there may be content gaps at the ceiling of the measure (i.e., items that can measure high levels of physical function and athleticism) [28]. We observed similar findings in this lower physical functioning cohort. Ceiling effects were identified for all physical function short forms in use and were notably higher in the 4-item form; floor effects were minimal across all short forms. This suggests that when physical function short forms are administered in higher-functioning populations (a full standard deviation above the US population mean, i.e., a T-score of 60 or higher), custom item selection for higher-functioning individuals becomes increasingly important to ensure accurate measurement. Assessment administration method (fixed item short form vs. CAT) should also be considered in selection, as recent studies show that CAT administration of the PROMIS physical function item bank in both clinical and general population samples reduces this ceiling effect [29, 30], and new items have been added that directly address ceiling and floor effects [31, 32]. However, when the administration of a fixed short form is necessary, our findings suggest that increasing short-form length (i.e., 6b or higher) reduces ceiling effects.

Factor analyses suggest that the four items measuring self-care (e.g., washing hair) may form a separate factor; however, the high correlations with the other physical function items support the unidimensionality of the PROMIS physical function item bank. This replicates results from the initial validation and calibration of the full physical function item bank recommending a parsimonious one-factor solution [15, 30]. Questions focusing on self-care actions may not be as relevant for a general ambulatory cancer population. Because these four self-care items are administered on one short form (10a), understanding the clinical needs of the population is important prior to short-form selection. The PROMIS physical function domain has addressed some of these issues, offering tailored assessment for upper extremity function and use of mobility aids [33], increasing the flexibility and relevance of this domain across a broad range of physical function.

PROMIS currently offers a wide range of physical function short forms, geared toward different patient groups and functional ability. This study confirmed the expectation that longer forms reduce the standard error of measurement (i.e., reliability increased with longer short forms). However, the 6b form reported better internal consistency than the 10a form, with a smaller ceiling effect than that identified in the 4a form. While all forms performed well, the results presented here suggest diminishing gains in precision in the 10a and 16 forms (but lower floor and ceiling effects) compared to the 6b form in this population. Recent work has confirmed these findings [12, 15, 28, 30–32], extending the range of items at the floor and the ceiling. When high precision is necessary in research settings, the PROMIS PF-20 (an extension of the PF 10a form tested here) is coming into broad use as a replacement for the traditional HAQ-DI. It has been found to be more sensitive to change and requires smaller sample sizes without increasing questionnaire burden.

This study has a few notable strengths and limitations. This population is limited to participants diagnosed with cancer, measuring average or lower physical function, limiting the generalizability of these findings to very-high-functioning individuals. However, this population is also a strength of this study as the broad cancer inclusion criteria (seven cancers, all stages) allowed for a wide range of disability levels encountered in many medical conditions. In addition, by using a large, community-based patient cohort with verified diagnoses and clinical characteristics, this study extends previous work that reported only self-report illness status or small clinical samples to a diverse community-based cohort. Furthermore, PROMIS measures are designed to provide cross-condition comparisons using a standard, non-cancer-specific scale of measurement. Therefore, these findings are relevant and applicable across a full range of physical function from individuals with little to no impairment to those on bed rest.

An additional limitation regarding the sample is the relatively low participation rate of eligible patients. The overall response rate for this study (which counts those who could not be contacted, had died, or were later deemed ineligible) was approximately 31 % and was higher among those able to be reached by study staff (53 %). While low, these rates are consistent with large, SEER-based surveys of recently diagnosed cancer patients [34–36]. Additionally, this study specifically targeted and oversampled patients from underrepresented populations and patients with metastatic disease. As a result, these groups had lower response rates (5–7 % lower) than younger, white, or non-metastatic study participants.

A third limitation is that this paper focused on reliability and validity of common short forms across age and race–ethnic groups using classical test theory methods. Further work evaluating this domain with psychometric criteria such as differential item functioning (DIF), which identifies systematic differences in how groups respond to specific items, is an important and complementary effort that is currently underway. For example, past evaluations have determined that DIF by age group may be especially important for physical function [37].

Finally, it is important to note that while clinical characteristics about cancer type and stage were reported from the registry, treatment and comorbidity information were self-reported by patients. These two variables may be less accurate than other methods of data collection, such as medical record abstraction, and can be associated with an information bias. However, these are standard questions used in other national cancer surveys [34]. Therefore, we feel confident this information is sufficient to evaluate known-group validity.

The final study limitation is the inability to evaluate the PF 8b, an 8-item PROMIS PF short form, because we did not administer all eight items in this survey. Therefore, these findings cannot be extended to this short form. However, the 6b short form is entirely contained within the 8b, suggesting that the 8b will perform as well as, if not better than, the 6b.

Conclusions

This study confirms the validity and reliability of the PROMIS physical function item bank and short forms across a wide range of age and race–ethnic groups reflecting the extensive diversity of the US population. It shows that these short forms can precisely measure meaningful group differences in cancer patient populations, accurately reflecting both disease burden and comorbidities across all versions. While some isolated measurement issues were identified and should be considered when selecting a short form, their impact on the normalized scoring is minimal.