Introduction

Patient-reported outcomes (PROs), including health-related quality of life (HRQoL), are key endpoints in oncology clinical trials to assess the clinical benefit of new treatment strategies for the patients [1, 2]. However, the analysis of PRO data remains challenging [3], as no standardization of statistical methods has yet been proposed, thus rendering comparison of PRO results between clinical trials difficult. Several methods have been proposed for the longitudinal analysis of PRO data including the linear mixed model [4, 5] and the time to deterioration (TTD) approach [6,7,8]. This latter method has recently been proposed and is regularly used as a modality of longitudinal analysis in phase III clinical trials [3]. The TTD approach requires a clear definition of what is considered to be a “deterioration”, and this will depend on the cancer setting, the reference score and the minimal important difference (MID) used to qualify the deterioration, as well as on censoring rules. The deterioration could be considered as “reversible”, “definitive” (i.e., a deterioration maintained over time up until the last PRO assessment available for each patient), or “confirmed” (i.e., a deterioration sustained for a defined time period, which might be equal to or less than the time lapse defined in the preceding definition). Composite events may also be considered, for example, including death in the event definition. Since the variability in the TTD definitions may have an impact on the comparison of results between trials, some initial recommendations have previously been proposed in a paper published online in 2013, and they emphasized the need to adapt the definition based on the therapeutic setting [6]. For example, in the adjuvant setting, it has been proposed to use the time to first clinically significant deterioration as compared to the baseline score (“reversible” deterioration) [8]. 
In contrast, in advanced or metastatic disease, the time until definitive deterioration (whether or not death is included as an event) seemed to be more appropriate. The recommended definition of the time until definitive deterioration is the time to the first clinically significant deterioration compared to the baseline score, and with no further significant improvement compared to the baseline score (“definitive” deterioration) [7].
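The difference between a "reversible" (first) and a "definitive" deterioration can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed trials. It assumes a scale on which higher scores mean better HRQoL, at least one follow-up assessment per patient, and censoring at the last available assessment when no event occurs. It reads "definitive" as the start of the final unbroken run of assessments that are at least one MID below baseline. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def is_deteriorated(score: float, baseline: float, mid: float) -> bool:
    """True if the score is at least `mid` points below the baseline score.

    Assumption: higher scores mean better HRQoL.
    """
    return baseline - score >= mid


def time_to_first_deterioration(
    times: Sequence[float], scores: Sequence[float], baseline: float, mid: float
) -> Tuple[float, bool]:
    """Return (time, event) for the first ("reversible") deterioration."""
    for t, s in zip(times, scores):
        if is_deteriorated(s, baseline, mid):
            return t, True
    # No deterioration observed: censor at the last available assessment.
    return times[-1], False


def time_to_definitive_deterioration(
    times: Sequence[float], scores: Sequence[float], baseline: float, mid: float
) -> Tuple[float, bool]:
    """Return (time, event) for a deterioration maintained until the last assessment."""
    event_time = None
    # Walk backwards from the last assessment: the event time is the start of
    # the final unbroken run of deteriorated assessments.
    for t, s in zip(reversed(times), reversed(scores)):
        if not is_deteriorated(s, baseline, mid):
            break
        event_time = t
    if event_time is None:
        return times[-1], False
    return event_time, True


# Illustrative patient: baseline score 80, MID of 10 points, monthly assessments.
months = [1, 2, 3, 4, 5]
scores = [75, 68, 78, 65, 60]
first = time_to_first_deterioration(months, scores, baseline=80, mid=10)
definitive = time_to_definitive_deterioration(months, scores, baseline=80, mid=10)
```

For this illustrative patient, the first deterioration occurs at month 2 but is followed by a recovery at month 3, so the definitive deterioration only starts at month 4. The same data therefore yield two different event times depending on the definition chosen.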

Despite these initial recommendations, however, various other definitions have been used in randomized clinical trials (RCTs) [9, 10], thereby limiting comparisons of results across RCTs, as with other traditional time-to-event endpoints, such as progression-free survival [11].

The objective of this study was to perform a systematic review of how the TTD approach has been used in phase III RCTs to analyze longitudinal PRO data. The primary objective was to examine the level of clarity of the TTD definition used. The secondary objective was to assess the concordance between PRO data using the TTD approach, and the primary endpoint(s), in the studies included in the review.

Methods

Search methods for identification of studies

A systematic literature search covering publications since 2014 was conducted in PubMed/MEDLINE and the Cochrane Library, supplemented by manual searching. The search strategies combined different terms representing PROs, quality of life and the TTD approach. A filter for RCTs was applied in PubMed/MEDLINE, using the Cochrane highly sensitive search strategy, sensitivity-maximizing version (2008 revision), as described in the Cochrane Handbook for Systematic Reviews of Interventions [12]. The full search strategies for PubMed/MEDLINE and the Cochrane Library are listed in Supplementary Table A.

Criteria for considering studies

All phase III RCTs in oncology published between January 2014 and June 2018 were eligible if they included a PRO endpoint and used the TTD approach to analyze PRO data. Among the articles identified, only original articles written in English were considered. We chose 2014 as the starting publication date for studies to be included in this review, as it was the year after the online publication (in 2013) of the initial article proposing recommendations for the TTD approach [6].

Selection of studies

Two authors (E.C., B.C.) independently assessed the eligibility of all articles identified by the searches. Articles that did not meet the inclusion criteria after screening of the titles and abstracts or the full text were excluded. The reasons for exclusion of ineligible studies were recorded. Disagreements were resolved through discussion between the two reviewers. In case of unresolved disagreements, a third author (A.A.) was consulted. Duplicate studies were identified and excluded, and only the original article(s) reporting PRO data by treatment arm were selected.

Data extraction

Two authors (E.C., A.A.) independently extracted information defined in a data collection form for each selected study. All extracted data were cross-checked, and discrepancies were resolved after discussion between the two reviewers.

The data extraction form recorded the following:

  • general information about the study and PRO assessment, i.e., year of publication, study location (i.e., the trial was considered as an international study if it recruited patients in more than one country), publication status (i.e., secondary publication dedicated to PRO results versus PRO results presented in the main publication), cancer site, disease stage, primary endpoint, PRO endpoint status, reporting of the baseline PRO sample size, PRO population considered for the analyses, questionnaire(s) used and timing of PRO assessments;

  • items specific to the TTD approach: the definition of deterioration used, the reference score used to qualify the deterioration, the MID considered, whether composite events were considered and if so, the events included in the composite, missing data and sensitivity analysis;

  • the reporting of the results, e.g., description of the targeted/expected baseline PRO scores, results presentation and reported events. If the study did not explicitly state that it focused only on specific dimensions of the questionnaire(s) used (“targeted dimensions”), then we considered all dimensions of the questionnaire(s) used, which we called the “expected dimensions”.

For the primary objective of this review, the clarity of the TTD definition used was assessed separately for main publications and for secondary publications dedicated to PRO results, according to 5 key criteria, namely: (1) the definition of deterioration (e.g., first deterioration, deterioration sustained for a defined time period, deterioration maintained over time); (2) the reference score used to qualify the deterioration; (3) the MID; (4) whether a composite definition was used (and if so, the events included in the composite); and (5) the manner in which missing baseline data were handled. Each criterion was coded as “defined” (2 points), “unclear” (1 point) or “not defined” (0 points). A score ranging from 0 to 10 was thus created, with a higher score indicating greater clarity of the TTD definition used.
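The scoring rule above can be written out as a short Python sketch. The point values and the 5 criteria come from the text. The dictionary keys and the function name are hypothetical labels chosen for illustration, not identifiers from the data collection form.

```python
# Points awarded per coding level, as described in the text.
POINTS = {"defined": 2, "unclear": 1, "not defined": 0}

# Hypothetical short labels for the 5 key criteria.
CRITERIA = (
    "deterioration_definition",
    "reference_score",
    "mid",
    "composite_events",
    "missing_baseline",
)


def clarity_score(coding: dict) -> int:
    """Sum the points over the 5 criteria; the result lies between 0 and 10."""
    missing = [c for c in CRITERIA if c not in coding]
    if missing:
        raise ValueError(f"criteria not coded: {missing}")
    return sum(POINTS[coding[c]] for c in CRITERIA)


# Illustrative coding of one publication (invented values, not review data).
example = {
    "deterioration_definition": "defined",
    "reference_score": "unclear",
    "mid": "defined",
    "composite_events": "defined",
    "missing_baseline": "not defined",
}
score = clarity_score(example)
```

In this invented example, the publication scores 2 + 1 + 2 + 2 + 0 = 7 out of 10.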

The secondary objective regarding the concordance between PRO data using the TTD approach and the primary endpoint(s) of the study was also investigated. Results obtained for the primary endpoint(s) were collected and classified as being “in favor of the experimental arm”, “in favor of the control arm”, “unclear” (i.e., impossible to identify in which direction the results tended) or “no difference” (i.e., clear statement of an absence of statistically significant difference), according to the statistical significance reported in the results. The same classification was performed for the PRO endpoint using the TTD results. Results were classified as being “in favor of the experimental arm” or “in favor of the control arm” for the PRO endpoint, if more than half of the statistically significant targeted/expected dimensions were in favor of the experimental or control arm, respectively.
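The "more than half" rule for the PRO endpoint can be sketched in Python as follows. Only that rule is taken from the text. Three choices are assumptions made for illustration: a study with no statistically significant dimension is mapped to "no difference", an exact tie is mapped to "unclear", and two "unclear" results are not counted as concordant. The function names are hypothetical.

```python
from typing import Sequence


def classify_pro_endpoint(significant_directions: Sequence[str]) -> str:
    """Classify the PRO endpoint from its statistically significant dimensions.

    `significant_directions` holds one entry per significant dimension,
    either "experimental" or "control" (the arm the result favors).
    """
    n = len(significant_directions)
    if n == 0:
        return "no difference"  # assumption: no significant dimension at all
    n_experimental = sum(d == "experimental" for d in significant_directions)
    n_control = sum(d == "control" for d in significant_directions)
    if n_experimental > n / 2:
        return "in favor of the experimental arm"
    if n_control > n / 2:
        return "in favor of the control arm"
    return "unclear"  # assumption: an exact tie has no clear direction


def is_concordant(primary: str, pro: str) -> bool:
    """Concordant when both endpoints fall in the same, non-"unclear" class."""
    return primary == pro and primary != "unclear"


# Illustrative study: 3 of 4 significant dimensions favor the experimental arm.
pro_class = classify_pro_endpoint(
    ["experimental", "experimental", "control", "experimental"]
)
concordant = is_concordant("in favor of the experimental arm", pro_class)
```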

Data analysis

We performed descriptive analyses of the articles identified. Qualitative variables are described as absolute and relative frequencies, while quantitative variables are described as median and range.

Analyses were conducted using SAS version 9.3 (SAS Institute Inc., Cary, NC, USA).

Results

The systematic literature search yielded 1549 records published between January 2014 and June 2018. After duplicates were removed, 1229 records were screened for eligibility, and a total of 39 studies (2.5%) were finally identified as relevant according to the predefined inclusion criteria (Fig. 1 and Supplementary Table B) [9, 10, 13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49].

Fig. 1 Flow of information through the different phases of the systematic review. RCTs randomized clinical trials, PROs patient-reported outcomes, TTD time to deterioration

General information about selected studies

Out of the 39 selected studies, the majority were international (n = 32, 82.1%), and in most cases a secondary publication dedicated to the PRO results was provided (n = 32, 82.1%) (Table 1). The two main cancer sites were lung (n = 12, 30.8%) and gastrointestinal (n = 11, 28.2%) cancer. Thirty-six studies (92.3%) were on advanced and/or metastatic cancer and 3 studies (7.7%) on localized cancer only. Seven studies (17.9%) investigated a co-primary endpoint, considering overall and progression-free survival. PRO was considered as a secondary endpoint in 32 studies (82.1%) and as an exploratory endpoint in 6 studies (15.4%), and 1 study (2.6%) did not clearly define the status of the PRO endpoint. The population used for the longitudinal analysis of PRO was defined as intention-to-treat in 12 studies (30.8%) and as modified intention-to-treat in 19 studies (48.7%), was not clearly defined in 3 studies (7.7%) and was not reported in 5 studies (12.8%). The definition of the modified intention-to-treat PRO population varied by article. Among the 12 studies that declared an intention-to-treat population for the analysis, only 2 (16.7%) considered observations with missing baseline data, 1 (8.3%) had no missing baseline scores, 5 (41.7%) did not clearly define how missing baseline data were handled, and 4 (33.3%) did not report this information. Among all studies, the European Organisation for Research and Treatment of Cancer (EORTC) questionnaires were the most frequently used instruments (n = 26, 66.7%).

Table 1 General characteristics of the studies included (n = 39)

Methodology of the time to deterioration approach

The deterioration was defined as definitive in 10 studies (25.6%), corresponding to a deterioration maintained over time until the last PRO assessment available for each patient, and as confirmed in 7 studies (17.9%), corresponding to a deterioration sustained for a defined time period that could be equal to or less than the time lapse defined in the preceding definition (Table 2). The deterioration was explicitly stated as a first deterioration in 16 studies (41.0%) and was implied, but not explicitly stated, to be a first deterioration in 8 studies (20.5%). The definition of deterioration was unclear in 1 study (2.6%). Two studies (5.1%) applied several definitions of deterioration in their main analysis [20, 24]. The baseline score was explicitly stated as the reference score to qualify the deterioration in most studies (n = 31, 79.5%) and was either not defined or unclear for the remaining 8 studies (20.5%). The MID was defined for all targeted/expected dimensions in 35 studies (89.7%), for some dimensions in 3 studies (7.7%) and was not defined in one study (2.6%). This MID was generally the same for each dimension (n = 29, 76.3%), and only 8 studies (21.1%) used a different MID according to the dimension. Composite definitions of PRO deterioration were considered in 16 studies (41.0%), including deterioration on any of three PRO scales (n = 1, 6.3%), an increase in analgesic use (n = 1, 6.3%), death (n = 14, 87.5%) and disease progression (n = 5, 31.3%) in the event definition. Five studies (31.3%) included both death and disease progression in the composite definition. The majority of studies did not report how intermittent (n = 37, 94.9%) or monotone (n = 24, 61.5%) missing data were handled. Regarding studies in advanced and/or metastatic cancer (n = 36, 92.3%), only 12 studies (33.3%) considered death in the deterioration definition in their main analysis. 
Among the 24 studies (66.7%) that did not consider death in the main analysis, only 1 study (4.2%) integrated it in a sensitivity analysis.

Table 2 Definition of the time to deterioration approach considered in the studies included (n = 39)

Examination of the clarity of the time to deterioration definition used

The median clarity score based on the 5 key criteria was 7 (range 5–9) when PROs were reported in the main publication (n = 7, 17.9%) and 9 (range 4–10) for the secondary publications dedicated to PRO results (n = 32, 82.1%). Among the papers reporting PRO data in the main publication, 4 studies (57.1%) clearly defined the deterioration, 3 studies (42.9%) defined the reference score, all studies defined both the MID and composite events, and none defined how missing baseline scores were handled. Among the 32 secondary publications, 26 studies (81.2%) clearly defined the deterioration, 28 studies (87.5%) defined the reference score and the MID, all studies defined composite events and 17 studies (53.1%) described their method of handling missing baseline scores (Table 3).

Table 3 The 5 key criteria used to assess the clarity of the time to deterioration definition used in the studies included (n = 39)

Reporting of the results

The targeted/expected baseline PRO scores were reported for all scores in 23 studies (59.0%), for some scores in 3 studies (7.7%) and were not reported in 13 studies (33.3%). Most of the studies presented the results for all targeted/expected dimensions (n = 33, 84.6%), and 6 studies (15.4%) presented the results for only some dimensions. In most studies, the number of events for each targeted/expected dimension was not reported (n = 24, 61.5%). The same was observed for composite definitions (n = 12, 75.0%) (Table 4).

Table 4 Reporting of the patient-reported outcome (PRO) results of the studies included (n = 39)

Concordance between PRO data using the time to deterioration approach and the primary endpoint(s)

Among all studies involved, the primary endpoint(s) was (were) either in favor of the experimental arm, or did not differ between treatment arms. The results obtained for the primary endpoint(s) and the PRO endpoint were concordant in 27 studies (69.2%). Among these, results were in favor of the experimental arm in 23 studies (85.2%) and found no difference between treatment arms in 4 studies (14.8%) (Table 5).

Table 5 Concordance of the results between patient-reported outcome (PRO) data using the time to deterioration approach and the primary endpoint(s)

Discussion

This review aimed to study how the TTD approach has been used in phase III RCTs to analyze longitudinal PRO data. We identified 39 studies published between January 2014 and June 2018 that used the TTD approach. A number of studies used a composite definition of deterioration (41.0%), mostly including death (with or without disease progression) in the composite event. The choice of the events included in the TTD definition should be made carefully and must be justified. Indeed, including disease progression in the TTD definition could be controversial, since disease progression is a tumor-centered endpoint and not a patient-centered endpoint [50]. In this review, 5 studies included disease progression in the TTD definition. In a composite endpoint, each component should be of similar importance to the patient and thus be clinically relevant [51], and the homogeneity of treatment effects across components should be investigated [3, 52]. However, HRQoL is generally assessed until disease progression in RCTs. In such a case, ignoring disease progression would result in informative censoring.

In this review, different definitions of deterioration were reported, and the definitions differed even between two similar clinical trials. For example, two RCTs were performed in metastatic castration-resistant prostate cancer, comparing enzalutamide to placebo [24, 34]. Both trials assessed PRO using the Functional Assessment of Cancer Therapy-Prostate (FACT-P) and Brief Pain Inventory-Short Form questionnaires and analyzed the deterioration of the FACT-P total score as well as pain progression. In the AFFIRM trial [24], the FACT-P total score deterioration was defined as a first deterioration of at least 10 points compared to the baseline score, or death from any cause. In the PREVAIL trial [34], the FACT-P total score deterioration was defined as a first deterioration of at least 10 points compared to the baseline score. Thus, death was not considered as an event in the TTD definition in this latter study. Consequently, results cannot be directly compared between these two trials. Since the TTD approach seems to be used regularly to analyze longitudinal PRO data, it is essential to propose a standardized definition of deterioration that is adapted to the cancer site and setting (localized or advanced/metastatic) and easy to follow in practice. Some initial recommendations were published online in 2013 [6], proposing a standardized definition by cancer setting (adjuvant or advanced/metastatic setting). In metastatic cancer, a definitive deterioration, whether or not it includes death as an event, was the recommended definition. However, the AFFIRM and PREVAIL trials both failed to follow these recommendations. While the majority of the studies in this review were performed in advanced/metastatic cancer (92.3%), the deterioration considered was mostly a first deterioration, whether this was explicitly stated (44.4%) or not (22.2%). 
In contrast, the deterioration was defined as definitive or as confirmed in 19.4% of studies each. In general, very few papers have considered the 2013 recommendations [6]. Due to the low number of papers published per year, it was not possible to investigate whether any improvement in adherence to the recommendations occurred over time. This failure to take these recommendations into account may have several explanations. First, communication about this methodological work may be insufficient. Second, researchers may consider these recommendations inappropriate for their setting. Third, the statistical analysis plans of the trials, including the definition of TTD, may have been written prior to the online publication of the recommendations in 2013. Finally, this could be due to the complexity of performing the analyses. Indeed, investigating a definitive deterioration of HRQoL requires checking that the deterioration is maintained at all subsequent measurement times. In order to facilitate the application of the TTD approach and to allow some standardization, this approach has been implemented in R software [53]. Moreover, the development of international guidelines is ongoing with the Setting International Standards in Analyzing Patient-Reported Outcomes and Quality of Life Endpoints Data (SISAQOL) project and will include the TTD approach [54].

At this time, no recommendation has been made on how death should be taken into account. In this review, 87.5% of the studies using a composite definition considered death as an event. Ignoring death in advanced cancer could result in informative censoring. However, if death occurs a long time after the last available HRQoL assessment, including it as an event could overestimate the TTD, which may then be a better reflection of overall survival if most of the observed events are deaths. One could thus include death as an event only if it occurred within a reasonable length of time after the last HRQoL assessment [55]. An alternative would be to consider death as a competing risk with the event of interest, i.e., the deterioration of HRQoL [45].
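The option of counting death as an event only when it occurs within a reasonable time after the last HRQoL assessment can be sketched in Python. This is an illustration under stated assumptions, not a rule drawn from the reviewed trials or from reference [55]. The `window` parameter and the function name are hypothetical, and what counts as a "reasonable" window would have to be justified for each trial.

```python
from typing import Optional, Tuple


def ttd_with_death(
    deterioration_time: Optional[float],
    last_assessment_time: float,
    death_time: Optional[float],
    window: float,
) -> Tuple[float, bool]:
    """Return (time, event) for a composite of HRQoL deterioration and death.

    A death counts as an event only if it occurs no later than `window`
    time units after the last HRQoL assessment (hypothetical rule).
    Otherwise the patient is censored at the last assessment.
    """
    if deterioration_time is not None:
        return deterioration_time, True
    if death_time is not None and death_time - last_assessment_time <= window:
        return death_time, True
    return last_assessment_time, False


# Illustrative cases (times in months, window of 2 months).
early_death = ttd_with_death(None, last_assessment_time=6, death_time=7, window=2)
late_death = ttd_with_death(None, last_assessment_time=6, death_time=12, window=2)
deteriorated = ttd_with_death(4, last_assessment_time=6, death_time=7, window=2)
```

In the second case, the death at month 12 falls outside the window, so the patient is censored at month 6 rather than contributing an event time driven by survival.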

The use of the TTD approach requires specifying important details. Hence, the primary objective of this review was to assess the clarity of the TTD definition according to 5 key criteria. Not surprisingly, the clarity of the TTD definition was slightly better in the secondary papers than when PRO results were reported in the main publication. Overall, the clarity score was rather high for all studies. However, since only seven main publications were identified, no statistical comparison with the secondary publications dedicated to PRO results was performed. The manner of handling missing baseline scores was the least well-defined criterion. A majority of studies did not clearly report how missing data at baseline were considered. Only two studies treated missing baseline scores as censored observations [20, 31]. In the TTD approach, the baseline score is a key element. When baseline scores are missing, it is necessary to know how these data are handled in order to apply the TTD approach. Most of the time, in this review, there was no mention of how intermittent or monotone missing data were handled over time. Only 2 studies assumed that the deterioration occurred at the time of the missing value [24, 44], which could make it possible to take account of informative missing data.

The secondary objective of this review was to assess the concordance between PRO data using the TTD approach and the primary endpoint(s) of the study. This objective highlighted the consistency and complementarity of the results in 27 studies (69.2%).

The TTD is strongly influenced by the measurement times. Although the measurement times are generally similar between treatment arms, the frequency and interval between two consecutive assessments can vary between RCTs. Most of the clinical trials in this review assessed PROs at every treatment cycle. However, some of them assessed PROs every 3 months [36] or every 3 and then every 6 months [45]. Furthermore, the assessment time may depend on the disease course and the expected change in HRQoL. The median TTD could be overestimated if the measurement times are too far apart. This is a common problem for time-to-event endpoints [56], and interval censoring could be a feasible solution to be investigated. The TTD approach should be applied only with a reasonable number of measurement times [3]. The EORTC recommends at least three HRQoL assessments in clinical trials [57] in order to assess the impact of treatment on HRQoL over time. To apply the TTD, three measurement times seem to be the strict minimum. We chose to report here the number of studies with at least four measurement times. In this review, most studies fulfilled this criterion.

In future research, to consider the PRO as a primary or co-primary endpoint using the TTD approach, certain assumptions have to be made [58]. These assumptions should be based on TTD data from previous studies. Furthermore, assuming that the measurement times can impact this approach, the timing of the PRO assessments should be similar to that used in the prior study on which the TTD assumptions are based. Indeed, more intensive assessment of PRO in one arm compared to the other could introduce bias in the TTD. In particular, if there are more missing data in one arm than in the other, then the TTD could be overestimated in the arm with more missing data. If it is not possible to make assumptions about the TTD, then TTD should not be used as the primary endpoint for PROs. Alternative approaches should be preferred, such as the mean difference in PRO scores measured at two different timepoints, or the difference between treatment arms, according to the MID. In any case, the PRO questionnaire should be administered at the earliest time(s) when MIDs are expected to occur.

This literature review has several limitations. First, it was limited to a period of about 4.5 years, from January 2014 to June 2018. This was because the initial paper providing recommendations on the deterioration definition was published in late 2013. The objective was thus to observe whether the papers published since 2014 followed these initial recommendations. However, this review covers a short period and it would be interesting to repeat this work in a few years from now. Indeed, this methodology is still a relatively novel approach to analyzing PRO data.

Conclusion

This systematic review highlights the heterogeneity of the definitions used for TTD, some of which may not be adapted to the disease setting. There is a compelling need to standardize the TTD approach adapted to the cancer site and setting.