Over the last decade, considerable progress has been made towards the standardization of methods for assessing patient-reported outcomes (PROs) [1,2,3,4]. The Patient-Reported Outcomes Measurement Information System® (PROMIS®) has contributed to this progress through consistent implementation of the PROMIS methodology [5,6,7] for the development of item banks and short forms in more than 100 domains of physical, mental, and social health (www.HealthMeasures.net). Many recent studies have demonstrated the validity of these measures for use in a diverse range of contexts and disease populations [8,9,10,11]. The PROMIS framework is widely cited and used when assessing common symptoms and functional domains of health-related quality of life [12,13,14,15,16,17,18,19].

Qualitative item review is particularly critical to ensure that the items capture relevant patient concerns in each domain, and that they are unambiguous and intelligible to people with a range of literacy levels [20, 21]. Qualitative item review also helps to ensure consistency of style, response options, and recall periods. For recall periods specifically, PROMIS investigators sought to identify the option(s) that would reduce the potential for bias in responding by drawing upon research from several disciplines, including memory encoding and recall [22,23,24,25], and judgment and decision-making [26,27,28]. For most PROMIS domains, the qualitative item review process led to the selection of a 7-day recall period as a general convention. The physical function domain is one of a few exceptions [29, 30]. The PROMIS Physical Function domain does not specify a recall period because of a prevailing preference in this particular domain to focus on self-evaluations of current capability rather than specific recollections over a defined time period (e.g., over the last 24 hours or 7 days). The latter approach (asking about functioning over a specified time period rather than perception of current capability) introduces uncertainty about the extent of functioning when respondents have not engaged in the activities described within the specified time period (e.g., getting in and out of a car, climbing stairs, exercising) [6, 20]. Thus, the PROMIS Physical Function items are phrased in the present tense to capture patients’ evaluations of their current capability to carry out various physical activities. Items that reflect an activity that patients have recently performed are expected to be answered on the basis of that recent experience. In cases where a physical function item’s exemplar activity has not been performed recently, patients estimate their capability based on recent experience with similar tasks and/or reasoned estimation of their current physical capability [6].

Empirical evidence regarding the use of various recall periods has been mixed. Generally, the length of recall period is inversely related to the accuracy of recall [31,32,33,34]. Shorter recall periods can lead to the under-reporting of symptoms in some conditions, while longer recall periods can lead to over-reporting [35]. Still others have found no significant effects based on the length of recall and have recommended that recall periods be selected as needed to meet the needs of the administering clinicians/researchers [36, 37], including the possibility of using multiple “ecological momentary assessments” to acquire more robust data regarding respondents’ experiences over time [38].

In the current study, the influence of recall period on self-report was evaluated by administering 31 PROMIS Physical Function items to a large online sample. Specifically, these items were administered to samples from two distinct populations using three different recall conditions and two different administration conditions. The primary aim was to evaluate whether, and to what extent, the use of different recall periods and reminder options might lead to significantly different means among the items as a set and individually. The three recall options were: (1) no recall period (i.e., the current PROMIS approach); (2) 24-hour recall; and (3) 7-day recall. A second independent variable in this experiment was the use of reminders regarding the recall period: one mention of the time frame at the beginning of the assessment (i.e., no reminders) versus a reminder with every item. The impetus for and design of this study are a consequence of guidance and feedback received from the Food and Drug Administration (FDA) Center for Drug Evaluation and Research (CDER) Qualification Review Team (QRT) during the development of a Drug Development Tool (DDT) Clinical Outcome Assessment (COA) of PROMIS Physical Function in oncology.

Methods

The study protocol was reviewed by the Institutional Review Board Office of Northwestern University (IRB ID STU00205190) and exempted from full review.

Participants

Participants included 2400 English-speaking individuals who were recruited online between May 16 and May 25, 2017 by Opinions For Good (Op4G), a market research firm that maintains relationships with a large panel of survey respondents. Of these, 1001 respondents (40% female) were invited to participate because they were currently undergoing treatment for a cancer diagnosis (the “Cancer” sample). The remaining 1399 respondents (50.5% female) were recruited as part of a representative sample of the U.S. population with respect to age, gender, race/ethnicity, and education (the “General Population” sample). All participants gave their consent by actively opting in, clicking “I agree” on a customized informed consent page. For the largest sub-group (no recall, no reminder, general population; N = 598), representative proportions were achieved for gender, age, and education, though the joint representativeness for these demographic characteristics was not fully achieved in all race/ethnicity groups. Demographic characteristics of the cancer and general population samples are in Table 1.

Table 1 Descriptive statistics for the general population sample (n = 1399) and the cancer sample (n = 1001)

Materials

Thirty-one items from the PROMIS Physical Function item bank were administered online, either in the standard PROMIS format without a specified recall period or with a recall period of either 7 days or 24 hours. These 31 items included the 10 items in the PROMIS Short Form v2.0 - Physical Function 10b [39, 40], the 16 items previously validated for use with a diverse U.S. population-based cohort of cancer patients [9], and 10 additional items that were suggested for inclusion by the U.S. Food and Drug Administration in a collaborative project. Because five items overlap between the 10b and 16-item short forms, the total Physical Function (PF) item count is 31 (10 + 16 + 10 − 5). In addition to the Physical Function items, participants were asked to complete the 10-item PROMIS Scale v1.2 - Global Health [41] and the Functional Assessment of Cancer Therapy - General [42], and to provide information regarding demographics, prior diagnoses, and co-morbidities.

Procedure

In order to test the effect of recall periods, participants were randomly assigned to one of three groups: 799 participants (201 and 598 from the cancer and general population samples, respectively) were administered the items without any reference to a recall period; 801 participants (400 and 401 from the cancer and general population samples, respectively) were administered the items with a 24-hour recall period (i.e., “Over the last 24 hours, …”); and 800 participants (400 and 400 from the cancer and general population samples, respectively) were administered the items with a 7-day recall period (i.e., “Over the last 7 days, …”). To test the effect of reminders of the recall period, participants in the 24-hour and 7-day recall groups were assigned, via a second randomization, into one of two reminder conditions: recall period presented only once at the beginning of the survey; or recall period presented with each of the 31 items. This recruitment and randomization scheme ensured enrollment of at least 200 people in each group.
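
The two-stage assignment described above can be sketched as follows. This is only an illustration: the text does not report the allocation probabilities or the software used, so the equal-probability draws, the function name `assign_conditions`, and the condition labels are all assumptions made for the example.

```python
import random

# Hypothetical labels for the conditions described in the Procedure.
RECALL_CONDITIONS = ["none", "24h", "7d"]
REMINDER_CONDITIONS = ["once", "every_item"]


def assign_conditions(participant_ids, seed=0):
    """Assign each participant a recall condition, then (for the 24-hour
    and 7-day groups only) a reminder condition via a second draw.

    Equal allocation probabilities are an assumption of this sketch.
    """
    rng = random.Random(seed)
    assignments = {}
    for pid in participant_ids:
        recall = rng.choice(RECALL_CONDITIONS)
        # The no-recall group has no time frame to be reminded of.
        reminder = rng.choice(REMINDER_CONDITIONS) if recall != "none" else None
        assignments[pid] = (recall, reminder)
    return assignments
```

Crossing the three recall conditions with the two reminder conditions in this way yields five cells per sample (no recall, plus 24-hour and 7-day recall each with and without item-level reminders).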

Analyses

For our unidimensionality analyses, we sought to confirm that a single-factor structure underlies the PROMIS Physical Function-Oncology 31-item set. Using our full combined sample (N = 2400), we first examined item-total score correlations (i.e., corrected for item overlap) to identify any correlations < 0.40 as indicating possible low item-construct association. We reviewed all inter-item correlations for negligible correlations (e.g., < 0.10), suggesting unrelated item pairs, and extremely high correlations (e.g., > 0.90), suggesting potential item content redundancy. We also obtained an estimate of internal consistency reliability (Cronbach’s alpha). Next, we conducted a single-factor confirmatory factor analysis (CFA) to determine if all factor loadings were ≥ 0.50, all residual correlations were ≤ 0.20, and the overall model’s fit statistics indicated good fit (e.g., root mean square error of approximation (RMSEA) ≤ 0.10, Comparative Fit Index (CFI) ≥ 0.95, Tucker-Lewis Index (TLI) ≥ 0.95, standardized root mean square residual (SRMR) ≤ 0.08), thereby confirming a unidimensional model [43,44,45,46,47,48]. Then, for bifactor modeling, we began by conducting an exploratory factor analysis (EFA), with plans to extract two to three factors. We obtained percent of variance accounted for by eigenvalues 1–3; we also calculated the eigenvalue 1-to-2 ratio, with values ≥ 4.0 suggestive of a single, dominant first factor. Subsequently, we conducted a confirmatory bifactor analysis (CBFA), using our EFA findings to establish two to three evidence-based specific factors. We calculated omega, McDonald’s omega hierarchical (omega-H), and explained common variance (ECV). Omega-H values ≥ 0.70 are considered suggestive of sufficient unidimensionality, while ECV values > 0.50 are interpreted as a majority percentage of “common variance” having been explained by a single, general factor [49,50,51,52].
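
For readers less familiar with the bifactor indices named above, the sketch below shows how omega, omega-H, and ECV are commonly computed from standardized bifactor loadings. It assumes each item loads on the general factor and on exactly one specific factor, with uncorrelated factors; the function name and input layout are illustrative and are not taken from the study's software.

```python
def bifactor_indices(general, specific, groups):
    """Compute (omega, omega_h, ecv) from standardized bifactor loadings.

    general[i]  : loading of item i on the general factor
    specific[i] : loading of item i on its specific factor
    groups[i]   : label of the specific factor that item i belongs to
    Assumes orthogonal factors and one specific factor per item.
    """
    # Unique (residual) variance of each standardized item.
    unique = [1.0 - g ** 2 - s ** 2 for g, s in zip(general, specific)]

    # Variance in the total score attributable to the general factor.
    general_var = sum(general) ** 2

    # Variance attributable to each specific factor, summed over factors.
    sums_by_group = {}
    for s, k in zip(specific, groups):
        sums_by_group[k] = sums_by_group.get(k, 0.0) + s
    specific_var = sum(v ** 2 for v in sums_by_group.values())

    total_var = general_var + specific_var + sum(unique)
    omega = (general_var + specific_var) / total_var
    omega_h = general_var / total_var

    # ECV: share of common variance explained by the general factor.
    g2 = sum(g ** 2 for g in general)
    s2 = sum(s ** 2 for s in specific)
    ecv = g2 / (g2 + s2)
    return omega, omega_h, ecv
```

Under these definitions omega-H can never exceed omega, and both indices approach 1 as the specific-factor and unique variances shrink relative to the general factor.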

For Stage 1 DIF analyses, we implemented a hybrid logistic ordinal regression (LOR) and IRT approach to DIF detection. This involved the use of an IRT-derived ability (or trait) estimate for LOR modeling rather than a traditionally modeled summed-score ability (trait) term. In this stage we sought to identify items flagged for DIF by our investigated DIF factors: (1) cancer vs. general population (both “no recall” only); (2) recall time period. These analyses used (1) “general population-no recall” as the reference group and “cancer-no recall” as the focal group, and (2) “no recall” as the reference group and both “24-hour recall” and “7-day recall” as focal groups. For the cancer vs. general population (no recall only) DIF factor, tested item content and time frame context were identical across tested groups. This is analogous to gender DIF factor testing, where items are fixed and tested groups vary (female vs. male), with the null hypothesis being items do not perform differently per gender status. For the recall time period DIF factor, tested item content was again identical across tested groups, while time frame context was allowed to vary. This is analogous to language DIF factor testing, where common-content items are presented in distinct languages (e.g., English and Spanish), tested groups have common characteristics except for their language status (English-speaking vs. Spanish-speaking), and the null hypothesis is that items do not perform differently per language-presentation status. We used a McFadden pseudo-R2 change criterion of ≥ 0.02 to flag items for DIF and utilized the lordif R package, version 0.3-3, for conducting the DIF analyses [53]. Our DIF analyses evaluated uniform DIF (Model 1 vs Model 2), non-uniform DIF (Model 2 vs Model 3), and overall or total DIF (Model 1 vs Model 3). In lordif-based Stage 1 DIF analyses, the initial run employs the full set of tested items as anchors. In subsequent iterations, if item performance differences are found, such “flagged” items are removed as anchors, creating an empirically purified anchor set. Iterations continue until no additional item performance differences are identified and a final DIF-free item anchor set is established.
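
The flagging rule described above can be illustrated with a short sketch. It is not the lordif implementation: it assumes the log-likelihoods of the intercept-only (null) model and of the three nested models for one item are already available, and the function names are hypothetical. Model 1 is taken to contain the trait only, Model 2 adds the group term, and Model 3 adds the trait-by-group interaction.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo-R2: 1 - (log-likelihood of model / log-likelihood of null)."""
    return 1.0 - ll_model / ll_null


def dif_flags(ll_null, ll_m1, ll_m2, ll_m3, threshold=0.02):
    """Return pseudo-R2 changes and flags for one item.

    A comparison is flagged when the pseudo-R2 change meets or exceeds
    `threshold` (0.02 is the criterion stated in the text).
    """
    r2_m1, r2_m2, r2_m3 = (mcfadden_r2(ll, ll_null) for ll in (ll_m1, ll_m2, ll_m3))
    changes = {
        "uniform": r2_m2 - r2_m1,      # Model 1 vs Model 2
        "non_uniform": r2_m3 - r2_m2,  # Model 2 vs Model 3
        "total": r2_m3 - r2_m1,        # Model 1 vs Model 3
    }
    return {name: (delta, delta >= threshold) for name, delta in changes.items()}
```

With illustrative log-likelihoods of -1000 (null), -700, -698, and -697, every change is below 0.01 and nothing is flagged; an item would need a markedly larger improvement from the group terms to reach the 0.02 criterion.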

For Stage 2 DIF score impact studies, we planned to analyze the potential score impact of using a common set of item parameters for scoring vs. group-specific item parameters for any items flagged for DIF. We planned to conduct unadjusted (using a common set of item parameters) vs. DIF-adjusted (using a common subset of item parameters plus group-specific item parameters for flagged items) score difference analyses to obtain the following scoring impact evidence: (a) Pearson correlation (unadjusted vs. adjusted scores); (b) mean difference (unadjusted minus adjusted score); (c) standard deviation (SD) of the score differences; (d) root mean squared difference (RMSD) of the score differences; and (e) percentage of individual case score differences greater than their associated unadjusted score standard error (SE). We prepared to utilize the statistical program IBM SPSS, version 25.0.0.1, for conducting these Stage 2 DIF analyses [54]. All score estimates and score-related statistics, unless specifically noted otherwise, would be reported in or based on the theta metric (mean = 0; SD = 1).
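
The five score-impact statistics listed above, (a) through (e), can be sketched as follows. The study planned to run these in SPSS; this Python version is only an illustration, and it assumes that item (e) compares the absolute score difference with each case's unadjusted standard error. The function name and return keys are hypothetical.

```python
import math
from statistics import mean, stdev


def score_impact(unadjusted, adjusted, se_unadjusted):
    """Summarize the impact of DIF adjustment on theta scores."""
    diffs = [u - a for u, a in zip(unadjusted, adjusted)]

    # (a) Pearson correlation between unadjusted and adjusted scores.
    mu, ma = mean(unadjusted), mean(adjusted)
    cov = sum((u - mu) * (a - ma) for u, a in zip(unadjusted, adjusted))
    ss_u = sum((u - mu) ** 2 for u in unadjusted)
    ss_a = sum((a - ma) ** 2 for a in adjusted)
    pearson_r = cov / math.sqrt(ss_u * ss_a)

    # (d) Root mean squared difference.
    rmsd = math.sqrt(sum(d ** 2 for d in diffs) / len(diffs))

    # (e) Percentage of cases whose |difference| exceeds their own SE
    # (use of the absolute value is an assumption of this sketch).
    n_exceed = sum(1 for d, se in zip(diffs, se_unadjusted) if abs(d) > se)

    return {
        "pearson_r": pearson_r,
        "mean_diff": mean(diffs),   # (b) unadjusted minus adjusted
        "sd_diff": stdev(diffs),    # (c) sample SD of the differences
        "rmsd": rmsd,
        "pct_exceeding_se": 100.0 * n_exceed / len(diffs),
    }
```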

Group differences in raw item-level and IRT-scored scale-level (theta) scores across the recall periods and the reminder conditions were evaluated using fixed main effects analyses of variance (ANOVA). In follow-up multiple regression analyses, we adjusted for age, gender, education, race/ethnicity, English fluency, and co-morbidities in both the general population and cancer samples. In addition, in the cancer sample, we adjusted scores for time since diagnosis, primary cancer site, and type of treatment. Model-estimated mean differences were derived for each group; effect size differences were evaluated in T-score units (mean of 50 and standard deviation of 10), consistent with the PROMIS scoring metric [5, 6]. While the DIF analyses investigate the possibility that individual items may perform differently with one recall period versus another, the analysis of group differences in theta/T-scores for the different recall periods shows whether there is any consequence associated with DIF at the scale level.

All PROMIS Physical Function measures are optimally scored using IRT-based expected a posteriori (EAP) scoring methods, which rely on item-level calibrations (i.e., discrimination and threshold parameters) to convert a raw sum-score to a weighted distribution-based score centered on the U.S. general population (T-score) [5, 6]. Score conversions were completed using an electronic summed-score-to-IRT-score conversion table. Differences were considered trivial if below 2 T-score units (i.e., effect size 0.2) for the group differences and less than 0.2 SD (effect size) units for the item-level differences [55]. Higher scores on the PROMIS Physical Function metric indicate better physical functioning. No a priori hypotheses were made regarding the expectation of significant effects in either direction, though we did expect, given the large sample size (n = 2400) and multiple comparisons (31 items across 10 possible conditions), that some item-level differences would be observed by chance alone. With 200 patients per group, setting alpha = .05, this study had 85% power to detect a group difference of 0.3 SD units (3 T-score units). We did not adjust alpha for multiple comparisons.
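
The stated power figure can be checked approximately with a normal-approximation calculation for a two-sided comparison of two independent groups. This is a simplification adopted for illustration (the study's own power calculation method is not described), and the function names are hypothetical. The sketch also shows the linear relation between the theta metric and the T-score metric.

```python
from statistics import NormalDist


def theta_to_t(theta):
    """Convert a theta score (mean 0, SD 1) to a T-score (mean 50, SD 10)."""
    return 50.0 + 10.0 * theta


def two_group_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size_d is the standardized mean difference; equal group sizes
    and known, equal variances are assumed.
    """
    z = NormalDist()
    noncentrality = effect_size_d * (n_per_group / 2.0) ** 0.5
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    # Probability of rejecting in either tail under the alternative.
    return (1.0 - z.cdf(z_crit - noncentrality)) + z.cdf(-z_crit - noncentrality)


power = two_group_power(0.3, 200)  # close to the 85% reported in the text
```

Under this approximation, a difference of 0.3 SD units (3 T-score units) with 200 respondents per group gives power of roughly 0.85, in line with the figure above.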

Results

From our unidimensionality analyses, we confirmed that a single-factor structure underlies the PROMIS Physical Function-Oncology 31-item set. With our full combined sample (N = 2400), we examined item-total score correlations; no correlations were < 0.40, and thus there was no indication of possible low item-construct association. In our review of inter-item correlations we found no negligible correlations (< 0.10) and no extremely high correlations (> 0.90); therefore, all items appeared sufficiently inter-related, and no items appeared redundant (minimum and maximum inter-item correlations were 0.34 and 0.88, respectively). Our estimate of Cronbach’s alpha (internal consistency reliability) was 0.98. In our single-factor CFA, all factor loadings were ≥ 0.50, all residual correlations were ≤ 0.20, and the overall model’s fit statistics indicated generally good fit, supporting a unidimensional model (RMSEA = 0.107, marginally above the 0.10 criterion; CFI = 0.963; TLI = 0.961; SRMR = 0.056). For our bifactor modeling, we first extracted two factors in our EFA (a potential third factor had no item loadings ≥ 0.30). We obtained the percent of variance accounted for by eigenvalues 1 (76.3%), 2 (5.0%), and 3 (2.2%); we also calculated the eigenvalue 1-to-2 ratio (15.1). EFA findings were all suggestive of an essentially unidimensional factor structure. In our confirmatory bifactor analysis (CBFA), using two EFA evidence-based specific factors, omega was 0.99, omega-H was 0.95, and ECV was 0.96. Our omega-H value was ≥ 0.70, suggestive of sufficient unidimensionality, and our ECV value was > 0.50, representing a majority percentage of “common variance” as explained by a single, general factor. Thus, our CBFA general factor accounted for 95% of PROMIS Physical Function-Oncology total score variance (omega-H). Model fit statistics from the bifactor model indicated good fit (i.e., RMSEA = 0.078, CFI = 0.982, TLI = 0.979, SRMR = 0.025).

Initial lordif analyses used the full 31-item set of tested PROMIS Physical Function-Oncology items as anchors. In all subsequent iterations within each of the lordif analyses conducted, no item performance differences were identified. Thus, in Stage 1 of the planned DIF analyses, no items were flagged for the DIF factor cancer vs. general population (cancer: n = 1001; general population: n = 1399), and no items were flagged for the DIF factor recall time period (no recall period: n = 799; 7-day recall period: n = 800; 24-hour recall period: n = 801). As a result, for all analyses, the full 31-item set served as the purified or DIF-free anchor set. Therefore, our empirical conclusion is that the use of a no recall vs. 24-hour recall and vs. 7-day recall period did not create item differences either within or across cancer and general population samples big enough to detect via our effect size-based analyses or important enough to impact scores. No Stage 2 DIF score impact studies were required. We therefore used the established (existing) PROMIS PF item parameters for all subsequent PROMIS PF scoring.

While our DIF analyses assessed IRT model differences between recall conditions, observed group differences for the recall period conditions were evaluated using analysis of variance (ANOVA). Unadjusted means and standard deviations for IRT-scored scale-level scores for each condition are reported in Table 2. The mean T-score difference between the cancer sample and the general population sample (11.1 T-score points) reflects a substantial impact of cancer upon physical functioning. We observed the full range of T-scores (11.6–62.8) in both the cancer and general population samples. These minimum and maximum scores represent the floor (1% of general population sample; 1% of cancer sample) and ceiling (26% of general population sample; 3% of cancer sample) on the 31-item PF measure.

Table 2 ANOVA-based T-score means and standard deviations by sample, recall period, and reminder condition

Analyses of variance (ANOVA) comparing the main effects of the recall and reminder conditions were conducted separately in the cancer and general population samples. A significant difference among groups was found in the general population sample (F(4, 1374) = 2.67; p = .03) but not the cancer sample (F(4, 960) = 2.35; p = .052). In the general population, slightly higher (better) physical function scores were observed when using a recall period. Table 3 shows the estimated mean differences from multiple regression models adjusting for age, gender, education, race/ethnicity, English fluency, presence of co-morbidities, and—for the cancer sample—the time since diagnosis, primary site of cancer diagnosis, and the cancer treatment type. Based on the evidence for statistical significance in the general population ANOVA, pairwise tests of significance were conducted between estimated adjusted means for the PROMIS standard condition (no recall) and each of the other recall and reminder conditions. Only the 24-hour recall condition with reminders at every item was significantly different from the PROMIS standard “no recall” condition (Table 3; p < .01).

Table 3 Multiple regression model estimated means and differences based on sample and population

Results for the item-level analyses are provided in the Supplementary Materials (Table S1). In the general population sample, the overall number of non-trivial differences at the item level was small (6 of 124 differences with |d| > 0.2; 1 with |d| > 0.3). At the total score level, the 24-hour recall condition with reminders was most distinct from the PROMIS no recall standard. However, 5 of the 6 differences with |d| > 0.2 among individual items were actually in the 7-day recall condition with reminders. In all five cases, the 7-day recall condition with reminders had lower (worse) average responses than the no recall condition. We note with caution (due to the post hoc exploratory nature of these analyses) that these 5 items are targeted to moderate-to-strenuous physical activities (e.g., “doing 2 hours of physical labor”, “running, lifting heavy objects”). In the cancer sample, the overall number of non-trivial differences at the item level was also relatively small (19 of 124 differences with |d| > 0.2). All but one of these differences were in the 24-hour recall conditions and reflected higher (better) physical function than responses in the no recall condition.
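
As an illustration of the item-level effect sizes discussed above, the sketch below computes a standardized mean difference between two sets of item responses and applies the 0.2 threshold for a non-trivial difference. The pooled-SD form of Cohen's d is an assumption, since the text does not state which variant was used, and the function names are hypothetical. The 124 comparisons per sample correspond to 31 items compared against the no recall condition in each of the 4 recall-by-reminder conditions.

```python
import math
from statistics import mean, stdev


def cohens_d(responses_a, responses_b):
    """Standardized mean difference (a minus b) using the pooled sample SD."""
    n_a, n_b = len(responses_a), len(responses_b)
    pooled_var = (
        (n_a - 1) * stdev(responses_a) ** 2 + (n_b - 1) * stdev(responses_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(responses_a) - mean(responses_b)) / math.sqrt(pooled_var)


def is_non_trivial(d, threshold=0.2):
    """Flag a difference whose magnitude exceeds the threshold in SD units."""
    return abs(d) > threshold


# 31 items x 4 recall/reminder conditions compared with no recall.
n_comparisons = 31 * 4
```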

In our cognitive interviews, most patients considered the absence of a directed recall period as appropriate when responding to the 31-item custom PROMIS Physical Function questionnaire. Specifically, patients were asked the following question regarding recall period appropriateness: “In general, when thinking about your ability to carry out the physical activities we discussed today, is it better to consider a specific timeframe or respond according to your current capability?” In response, the majority of participants (74.2%) reported it is better to consider or respond to one’s current capability (without a directed recall period) as opposed to a specific timeframe. Patients further explained that responding to one’s current capability is ideal for considerations of physical function. The minority of patients (n = 8, 25.8%) who reported it is better to consider a specific timeframe were asked what timeframe they would recommend for responding to questions of physical function. Patients’ recommendations included a range of different timeframes, such as time since diagnosis, overall experience, during treatment, since completing treatment, and past 1–2 months. Of the alternative recall period suggestions made by patients, none are appropriate for assessing measurable improvements in physical function in a clinical trial setting.

Moreover, based on participant responses to questions regarding recall period, a majority (n = 18; 58%) considered their current capability when responding to questions of physical function. In addition to responding according to their current capability, participants reported considering physical capability since diagnosis or treatment (16.1%) or over the past weeks or months (12.9%) when responding. Of the 17 patients who were asked whether it was difficult to respond to the questionnaire without a directed recall period, 16 (94.1%) reported having no problems.

Discussion

PROMIS provides a framework for measuring a range of common symptoms and functional abilities, including a large item bank for physical function. To help respondents answer questions that include physical capabilities that may not have been experienced in the recent past, PROMIS has opted to use a present tense, “no recall period” framing for each item. This study evaluated whether or not a recall period (7 days or 24 hours) makes any measurable difference in the way people respond to individual questions relative to one another, and whether it affects the score one would obtain on the PROMIS physical function metric. We found no important differences on physical function item responses, or physical function score, across the studied recall periods, suggesting that recall period has little to no effect.

Analyses of variance in the general population sample indicated a significant difference among groups, but subsequent analyses suggested that the differences were small and difficult to interpret. Among the group-level mean scores for all 31 PROMIS Physical Function items, only the 24-hour recall condition with reminders at every item was significantly different from the standard PROMIS condition with no specified recall period. The item-level analyses indicated that the items with non-trivial differences had small effects, and that they indicated lower physical function when they used 7-day recall conditions with reminders for infrequent moderate-to-strenuous activity. This suggests evidence for a small effect among individuals who infrequently engage in physically demanding tasks. When consistently reminded to reference one’s response to the past week, respondents reported slightly worse physical function. It is unclear whether this constitutes over-reporting of physical function difficulties based on experience with the exemplars in these more difficult items or the potential influence of consistent reminders.

In the cancer sample, the analysis of variance results were not statistically significant. The item-level analyses indicated relatively few non-trivial differences in means relative to the PROMIS standard protocol, and these differences were present under different conditions than those found in the general population. That is, the non-trivial differences in the cancer sample were generally present only in the 24-hour recall conditions (with and without reminders), and these were all in the direction of higher physical functioning. It seems that the use of brief recall periods may have prompted respondents in the cancer sample to evaluate their physical function more positively than the other conditions. While scores in the cancer sample were more than 1 SD lower than those in the general population sample on average, it seems likely that slightly higher scores in the 24-hour condition for the cancer sample are less generalizable (i.e., more specific to very recent experiences) than those in the other conditions.

Future studies could further inform these findings by targeting additional patient populations, examining effects of anti-cancer treatment, employing alternative methods of data collection, and/or incorporating a broader range of content relating to physical function. It would be particularly useful to evaluate whether the slightly elevated scores in the 24-hour condition subgroups of the cancer sample would be maintained in a longitudinal sample with daily assessments over a 7-day period. Such a study would provide evidence regarding the generalizability (or its absence) of these shorter recall periods when they are used outside of longitudinal data collection.

Limitations of this work should also be noted. These analyses did not differentiate among the many forms of cancer, the severity of the cancer being treated, or features of the treatment regimen. It is possible that these and other characteristics may influence responses to different recall periods. Further, generalizability of these findings may be limited to the population represented by this relatively young online sample. Replication of this study in clinical and community-based samples would be reassuring.

The results presented herein support the use of the standard PROMIS “no recall period” approach to measuring physical function. In both cancer and general population samples, there were no score differences when a 7-day recall was compared to no recall for the overall score. This was true even when respondents were reminded of the recall period with each item. When compared to 24-hour recall, there was a small but statistically significant difference such that the “every item reminder” group indicated better physical function by 2.1 T-score units. Although the 2.1-point difference exceeded our a priori 2-point threshold for a trivial difference, we note that the 0.21 effect size of that difference is quite small. Other than this, there was little to no evidence of meaningful differences between the current PROMIS standard protocol for Physical Function (no recall period) and alternative recall periods. In cases where differences were suggested, it seems that the absence of a recall period tends to elicit slightly lower (worse) physical function scores. We could find no evidence, in the cancer sample or the general population, to suggest that there are substantial differences between the standard PROMIS “no recall” condition and the use of a 7-day recall. We believe that an important finding of this study is that the magnitude of observed differences between recall conditions was small to negligible across all conditions. The “no recall” condition is the standard context in PROMIS Physical Function assessment; however, there is insufficient evidence to suggest that use of a 24-hour or 7-day recall period substantially alters the assessment. This was especially true in the cancer sample.