Introduction

Health-related quality of life (HRQoL) is a multidimensional, subjective, and dynamic concept, incorporating at least three functional domains: physical, psychological, and social functioning, as well as symptoms due to disease and treatment [1].

In cancer clinical trials, HRQoL is often prospectively assessed using the European Organization for Research and Treatment of Cancer (EORTC) Core Quality of Life Questionnaire (QLQ-C30) [2]. The QLQ-C30 is a widely used cancer questionnaire composed of 30 ordinal items assessing 15 scales: global health status/QoL (GHS/QoL), five functional domains (physical, role, cognitive, emotional and social), three multi-item symptom scales (fatigue, pain, nausea and vomiting), and six single-item symptom scales (diarrhea, constipation, insomnia, appetite loss, dyspnea, and perceived financial impact). For each domain, a raw score is first calculated as the average of all contributing items, then standardized by linear transformation to a scale from 0 to 100 according to the scoring procedure recommended by the EORTC [3]. A high score on functional scales and GHS/QoL represents a high/healthy level of function and high global HRQoL, respectively, whereas a high score on symptom scales indicates a high level of symptomatology. The QLQ-C30 is now used in numerous studies, enabling comparison of results. It is also often associated with disease-specific modules.
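The two-step scoring described above (raw score as the item average, then linear transformation to 0-100) can be sketched as follows. This is a minimal illustration, not the official EORTC scoring code: the function name and arguments are hypothetical, and the item range (3 for items answered on a 1-4 scale, 6 for the two 1-7 GHS/QoL items) and the rule of scoring a scale only when at least half of its items are answered are assumptions taken from the usual reading of the scoring manual [3].

```python
from statistics import mean

def qlq_c30_score(responses, item_range=3, functional=False):
    """Standardized 0-100 score for one QLQ-C30 scale (illustrative sketch).

    responses  : item responses of the scale; None marks a missing item
    item_range : max minus min response value (assumed 3 for 1-4 items,
                 6 for the 1-7 GHS/QoL items)
    functional : True reverses the scale so that a high score means
                 a high level of functioning
    """
    answered = [r for r in responses if r is not None]
    # Assumed missing-item rule: score only if at least half the items are answered.
    if 2 * len(answered) < len(responses):
        return None
    raw = mean(answered)              # raw score = average of contributing items
    scaled = (raw - 1) / item_range   # map the raw score onto 0..1
    if functional:
        scaled = 1 - scaled           # functional scales are reversed
    return 100 * scaled
```

For instance, a single symptom item answered with the second category gives a score of about 33.3, while a functional scale with all items answered "1" gives 100.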

In cancer clinical trials, HRQoL is generally collected at different assessment times predefined in the study protocol. To evaluate the impact of the treatment on the change in HRQoL over time, at least three assessments of HRQoL are recommended [4, 5]: at baseline (before the start of treatment), during treatment, and at the end of treatment.

Using appropriate methods to analyze such longitudinal data is essential, but the analysis strategies used are still not homogeneous [4, 6]. The choice of an appropriate methodology for longitudinal analyses of HRQoL would enable homogenization of results across different therapeutic situations and tumor sites, thereby ensuring greater comparability of the results between trials [7, 8].

Two main longitudinal statistical approaches are used to analyze HRQoL in cancer clinical trials, namely linear mixed models (LMM) [9] and time-to-event modeling [10]. Time-to-event modeling, i.e., modeling the time to HRQoL score deterioration, seems more accessible and intuitive for clinicians, but LMM are more widely used in practice, even though they are more complex. Both approaches use the standardized score recommended by the EORTC. In recent years, the longitudinal partial credit model (LPCM), an alternative strategy from Item Response Theory (IRT) based on generalized linear mixed models (GLMM) for categorical data, has been proposed [11, 12]. LPCM has also previously been compared with the two former approaches in a large simulation study [13]. To complement these results, the objective of this article is to evaluate these methods and to propose recommendations to standardize longitudinal analysis of HRQoL data in cancer clinical trials. The methods are first described and compared via statistical, methodological, and practical arguments, then applied to real HRQoL data from selected clinical cancer trials or published prospective databases in a variety of therapeutic situations and tumor sites. The advantages and disadvantages of the methods are discussed, and recommendations are proposed.

Materials and methods

Study selection

In total, seven French studies from a collaborating group were selected according to the following criteria: published randomized phase 2/3 clinical trials or prospective cohort studies comparing two treatments or groups of patients in adjuvant, advanced, or palliative settings in different cancer sites, with longitudinal collection of HRQoL data with the QLQ-C30.

The scoring procedure recommended by the EORTC was used to calculate the standardized scores. To allow comparison between studies, all analyses were performed on the modified intent-to-treat (mITT) population, i.e., including all ITT patients with HRQoL data available at baseline [4].

Statistical analysis

Longitudinal analyses were performed using the three approaches described above. First, LMM that modeled the change in HRQoL score over time for each domain were used [12, 13]. This model combined fixed effects, i.e., group effect, time effect (time was considered as a continuous variable) and group-by-time interaction effect (difference in HRQoL change between groups), and random effects, i.e., random intercept and random slope. The random effects take into account the correlation between the different observations for the same patient and represent the individual deviation from the average intercept and average slope. The model was as follows:

$$Y_{i} \left( t \right) = \beta _{0} + ~\beta _{1} t + \beta _{2} {\text{grp}}_{i} + \beta _{3} \left\{ {{\text{grp}}_{i} \times t} \right\} + u_{{0i}} + ~u_{{1i}} t + ~\varepsilon _{i} \left( t \right)$$

where \({Y}_{i}\left(t\right)\) denotes the HRQoL score for patient \(i\) at time \(t\), which is assumed to be normally distributed, and \({\varepsilon }_{i}\left(t\right) \sim N\left(0,{\sigma }^{2}\right)\) represents the error term. The vector of the random effects \({u}_{0i}\) and \({u}_{1i}\) is assumed to be normally distributed with a mean of zero and an unconstrained covariance matrix.

Most of the time, the group effect was null, i.e., there was no difference between groups at baseline as is usual in randomized clinical trials, and no fixed group effect was kept in the model.
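To make the roles of the fixed and random effects in the model above concrete, the following sketch simulates one patient's score trajectory from it. It is an illustration only, not the code used for the analyses: the function name, the example parameter values, and the parameterization of the random-effects covariance matrix by two standard deviations and a correlation are assumptions.

```python
import math
import random

def simulate_patient(times, grp, beta, sd_u0, sd_u1, corr, sd_eps, rng):
    """Simulate Y_i(t) at each time for one patient (illustrative sketch).

    beta          : (beta0, beta1, beta2, beta3) fixed effects
    sd_u0, sd_u1  : standard deviations of the random intercept and slope
    corr          : correlation between the random intercept and slope
    sd_eps        : standard deviation of the error term
    rng           : a random.Random instance
    """
    b0, b1, b2, b3 = beta
    # Draw correlated random effects (u0, u1) from a bivariate normal.
    z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
    u0 = sd_u0 * z0
    u1 = sd_u1 * (corr * z0 + math.sqrt(1 - corr ** 2) * z1)
    return [
        b0 + b1 * t + b2 * grp + b3 * grp * t   # fixed part
        + u0 + u1 * t                           # patient-specific deviation
        + rng.gauss(0, sd_eps)                  # measurement error
        for t in times
    ]

# Hypothetical values: no baseline group effect (beta2 = 0), as in most studies here.
scores = simulate_patient([0, 3, 6], grp=1, beta=(70, -2, 0, 1),
                          sd_u0=10, sd_u1=1, corr=-0.3, sd_eps=8,
                          rng=random.Random(1))
```

Setting all three standard deviations to zero recovers the mean trajectory of the group, which is what the fixed effects describe.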

Second, the time-to-event approach was used, in which the deterioration of the HRQoL score is considered as an event. Due to substantial variability in the event definitions, a first set of recommendations was made regarding the definition of the time to deterioration [10]. Accordingly, for the adjuvant setting, we considered the time to first deterioration (TTD), defined as the time from randomization/inclusion in the study to the observation of the first clinically significant deterioration of the HRQoL score as compared to the baseline score. Patients without significant deterioration were censored at the time of the last HRQoL assessment. Patients with only a baseline score (i.e., with no follow-up) were censored one day after baseline. For the advanced or metastatic settings, we considered the time until definitive deterioration (TUDD), defined as the time from randomization/inclusion in the study to the observation of the first clinically significant deterioration of the HRQoL score as compared to the baseline score, with no further clinically significant improvement as compared to the baseline score. Patients without clinically significant deterioration and those whose deterioration was followed by a significant improvement were censored at the time of the last HRQoL assessment. Note that the TTD/TUDD approaches assume that right-censoring is independent of time to deterioration. Thus, the right-censored patients must be comparable to the patients still at risk regarding their risk of HRQoL deterioration. Finally, the responder threshold to qualify an individual change in TTD/TUDD was fixed at ten points, as usually considered for EORTC HRQoL questionnaires [1, 14]. Sensitivity analyses were then performed considering the best previous (instead of baseline) score as the reference score [10], and death as an event (only added for TUDD compared to the baseline score).
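The event definitions above can be sketched for one patient and one scale as follows. This is a simplified illustration with the baseline score as reference, not the implementation used in the study: the function name and arguments are hypothetical, times are assumed to be in days since baseline, and "definitive" is interpreted here as a deterioration that is still present at every later assessment, which is one possible reading of the TUDD definition. The sensitivity analyses (best previous score as reference, death as an event) are not covered.

```python
def time_to_deterioration(times, scores, threshold=10.0, definitive=False,
                          higher_is_better=True):
    """Return (time, event) for one patient and one scale (illustrative sketch).

    times, scores    : assessments in chronological order; index 0 is baseline
    threshold        : responder threshold in score points
    definitive       : False -> TTD (first deterioration)
                       True  -> TUDD (deterioration kept at all later visits)
    higher_is_better : True for functional scales and GHS/QoL,
                       False for symptom scales
    """
    if len(times) < 2:
        return (1, 0)  # baseline only: censored one day after baseline
    sign = 1 if higher_is_better else -1
    # Positive value = worsening relative to the baseline score.
    worsening = [sign * (scores[0] - s) for s in scores[1:]]
    flags = [w >= threshold for w in worsening]
    for j, flagged in enumerate(flags):
        if flagged and (not definitive or all(flags[j:])):
            return (times[j + 1] - times[0], 1)   # event observed
    return (times[-1] - times[0], 0)              # censored at last assessment
```

With assessments at days 0, 30, 60 and 90 and functional scores 80, 65, 75 and 60, the TTD event occurs at day 30, whereas under the assumed TUDD rule the event occurs at day 90, because the score at day 60 is back within ten points of baseline.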

Third, LPCM [11,12,13] that considered the item responses instead of the score over time were used. An LPCM can be seen as a GLMM with a multinomial logit link function. It models the probability that an individual i selects category k of item j (k varies from 1 to mj with mj the number of possible response categories for item j) at visit t given her/his latent trait \({\theta }_{i}^{(t)}\) representing her/his level of HRQoL at time t (time was considered as a continuous variable), and the difficulty parameters \({\delta }_{j,1},\dots ,{\delta }_{j,{m}_{j}}\):

$$P\left( {X_{{i,j}} = k~|~\theta _{i}^{{\left( t \right)}} ,~\delta _{{j,1}} , \ldots ,~\delta _{{j,m_{j} }} } \right) = \frac{{\exp ~(k\theta _{i}^{{\left( t \right)}} - \mathop \sum \nolimits_{{p = 1}}^{k} \delta _{{j,p}} )}}{{\mathop \sum \nolimits_{{h = 1}}^{{m_{j} }} \exp ~(h\theta _{i}^{{\left( t \right)}} - \mathop \sum \nolimits_{{p = 1}}^{h} \delta _{{j,p}} )}}$$

The latent variable, assumed to be normally distributed, was linearly decomposed similarly to the first approach (LMM) with fixed and random effects:

$$\theta _{i} ^{{\left( t \right)}} = ~\beta _{0} + ~\beta _{1} t + \beta _{2} {\text{grp}}_{i} + \beta _{3} \left\{ {{\text{grp}}_{i} \times t} \right\} + u_{{0i}} + ~u_{{1i}} t$$
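As a numerical illustration of the two formulas above, the sketch below first builds the latent trait from its linear decomposition and then evaluates the category probabilities of one item, indexed from 1 to m_j exactly as written in the formula. The function names and parameter values are hypothetical, and no estimation is performed: the fixed effects, random effects, and difficulty parameters are simply taken as given.

```python
import math

def latent_trait(t, grp, beta, u0=0.0, u1=0.0):
    """theta_i^(t) from the fixed effects beta and the random effects u0, u1."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * t + b2 * grp + b3 * grp * t + u0 + u1 * t

def pcm_probabilities(theta, deltas):
    """P(X = k | theta, deltas) for k = 1..m_j (illustrative sketch).

    deltas : [delta_{j,1}, ..., delta_{j,m_j}] difficulty parameters of item j
    """
    logits = []
    cumulative = 0.0
    for k, delta in enumerate(deltas, start=1):
        cumulative += delta                    # sum of delta_{j,p} for p = 1..k
        logits.append(k * theta - cumulative)  # numerator exponent for category k
    shift = max(logits)                        # subtract the max for numerical stability
    weights = [math.exp(value - shift) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical four-category item evaluated at visit t = 2 for a patient in group 1.
theta = latent_trait(2, 1, (0.5, 0.1, 0.0, -0.2))
probabilities = pcm_probabilities(theta, [0.0, -0.5, 0.3, 1.0])
```

The probabilities always sum to one, and increasing the latent trait shifts probability mass towards the higher response categories.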

LPCM is based on three fundamental IRT assumptions [15], namely unidimensionality (the latent trait is a scalar), monotonicity (the item response functions are increasing), and local independence (the item responses are conditionally independent given the latent trait). Statistical longitudinal analyses were performed using SAS software, Stata commands [16, 17] and R package QoLR [18].

Results

Methodological and practical comparison

Table 1 summarizes the main features of each of the three longitudinal approaches, considering both methodological and practical arguments related to the response variable, modeling, results, as well as the interpretation and readability of the results.

Table 1 Methodological comparison of the three longitudinal methods

Using the LMM, the outcome is the HRQoL score, which is considered as a continuous variable, although the number of possible values of the HRQoL score depends on the number of items contributing to the scale. For example, for single-item scales (six single-item symptoms for QLQ-C30) with four response categories, only four values exist for the corresponding HRQoL score. Figure 1 illustrates how the 30 items of the QLQ-C30 are distributed to calculate the 15 scale-specific HRQoL scores. Time-to-event modeling approaches raise the same concern: a change of one unit in single-item scales that have four response categories corresponds to a HRQoL score difference of 33 points. Thus, particular attention should be paid to the distribution of the EORTC scores in order to use the appropriate individual threshold to qualify the deterioration instead of systematically considering a difference of ten points per scale. Only the LPCM approach can avoid such pitfalls by considering the response to the items as the outcome instead of the HRQoL score. A limitation of the LMM approach is the Gaussian assumption: the score variable could have a non-symmetrical distribution and the LMM treats the score as a continuous instead of a categorical variable. In this regard, LPCM seems more appropriate [11], but it also relies on three strong assumptions. Another advantage of the LPCM is that it makes it possible to directly use the responses to the items, and not only the summary HRQoL score. Indeed, patients can obtain the same HRQoL score with different responses to the items. However, few adapted programs are available to manage GLMM with both random intercept and slope. A SAS program using PROC NLMIXED and a Stata program using the glamm procedure (https://www.glamm.org/) give similar results but the Stata glamm procedure is time-consuming.
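The discreteness issue discussed above can be checked by enumerating every response pattern of a scale. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, all items are assumed to have four response categories coded 1 to 4, and the scores are computed for a symptom-type scale (no reversal).

```python
from itertools import product

def scale_score(responses, item_range=3):
    """0-100 score of a symptom-type scale from its item responses."""
    raw = sum(responses) / len(responses)
    return 100 * (raw - 1) / item_range

def possible_scores(n_items, n_categories=4):
    """All distinct scores (rounded to one decimal) a scale can take."""
    patterns = product(range(1, n_categories + 1), repeat=n_items)
    return sorted({round(scale_score(p, n_categories - 1), 1) for p in patterns})

single_item = possible_scores(1)   # four possible values only
two_items = possible_scores(2)     # seven possible values

# Two different response patterns leading to the same summary score.
same_score = scale_score((1, 4)) == scale_score((2, 3))
```

On a single-item scale the smallest non-zero change is therefore about 33 points, so a ten-point threshold is crossed by any one-category change, and the patterns (1, 4) and (2, 3) on a two-item scale give the same score of 50 despite different item responses.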

Fig. 1 Distribution of the 30 items in the HRQoL score calculation for EORTC QLQ-C30

Techniques for dealing with multiple comparisons are available with all three approaches, even though, in practice, type I error adjustment is rarely taken into account in the analysis, except when HRQoL is the primary endpoint [6]. Concerning the management of missing data, likelihood-based methods such as LMM or GLMM provide unbiased estimates under missing completely at random (MCAR) or missing at random (MAR) assumptions [19], contrary to the time-to-event approach. Non-informative missing data reduce only the statistical power in all three strategies [4]. Time-to-event analysis, as well as LMM and LPCM, provides biased estimates in case of informative drop-out. Only joint modeling of HRQoL measurement and the missing data process can produce unbiased estimates [19]. The compliance over time should always be described and compared between treatment arms. Moreover, the reason for missing HRQoL forms is an important issue and should be recorded in clinical cancer trials, so that the missing data mechanism can at least be characterized.

Concerning the interpretation and readability of the results, time-to-event analyses are more easily interpretable for clinicians because of their ubiquitous use in oncology. TTD allows a direct interpretation of the results in terms of clinical relevance, since the responder threshold is integrated within the definition of deterioration. Note that the mean HRQoL change for LMM can also be interpreted in accordance with the group-level minimal important difference (MID). Finally, the IRT-based model remains very difficult to interpret even for a statistician. The different graphical outputs available with the three approaches are illustrated in Fig. 2. For each of the three methods, a summary graph with all the scales, such as a forest plot showing the estimated effect and its 95% confidence interval, could also be prepared [20].

Fig. 2 Graphical outputs for LMM, LPCM and TTD/TUDD. a, b Individual (point) and mean (line) predicted values. c Kaplan–Meier survival estimate

Selected randomized clinical trials and prospective cohorts

Randomized clinical trials or prospective cohorts from a French collaborating group were selected in a variety of therapeutic situations and tumor sites. The CO-HO-RT trial [21], APAD trial [22, 23], and Response Shift study [24] involved breast cancer patients in the adjuvant setting; MIROX [25] involved metastatic colorectal cancer patients, PRODIGE5/ACCORD17 [26] involved advanced esophageal cancer patients, PRODIGE4/ACCORD11 [27] involved metastatic pancreatic cancer patients, and TEMAVIR [28] included patients with unresectable glioblastoma. Table 2 describes the clinical trials selected, including the trial acronym and ClinicalTrials.gov identifier, control and experimental arms, primary endpoint, and details about HRQoL assessment.

Table 2 Description of the clinical trials or prospective cohort selected

Application on the selected databases

Table 3 summarizes the results obtained with the three different approaches. Specifically, we report for each method: the scales with a significant improvement/deterioration over time (LMM, LPCM) and the scales with a significant difference in the experimental group compared to the standard group (LMM, LPCM: group-by-time interaction effect, TTD/TUDD: hazard ratio). The number of significant scales and their interpretation are also given. Additionally, all the patient-reported outcome (PRO) results that should be reported in randomized clinical trials (RCTs) are given for one example (PRODIGE5/ACCORD17 trial) in Supplementary Table 1 (LMM and LPCM) and Supplementary Table 2 (TTD/TUDD).

Table 3 Evaluation of the three longitudinal methods on the selected databases

For adjuvant situations, in all breast cancer clinical trials, the significant scales and their interpretation were the same between LMM and LPCM, except in one study, where LPCM found an additional significant scale. Similar results (except for one study) were also observed between the TTD analyses based on a 10-point responder threshold compared to the baseline score and those compared to the best previous score. However, results were concordant between the LMM/LPCM and TTD approaches for the APAD trial only, although the number of significant scales was lower with the TTD approaches (5 for LMM/LPCM vs 3 or 4 for TTD). In the other two studies, the TTD approach found only one significant scale, which was not among the scales found to be significant by the LMM and LPCM methods in three of the four cases.

For advanced disease, for the PRODIGE5/ACCORD17 trial in esophageal cancer and TEMAVIR trial in glioblastoma, we observed a similar number of significant scales for LMM and LPCM, always with the same interpretation. Moreover, only one or two scales were significant considering the TUDD, and the scales identified with LMM/LPCM and TUDD were always different.

For metastatic disease, whatever the tumor site (colorectal and pancreatic cancer), the results were similar between LMM and LPCM. Indeed, the scales that differed (pain for LMM in the PRODIGE4/ACCORD11 trial and fatigue for LPCM in the MIROX trial) were non-significant but always at the limit of significance. Finally, the number and type of significant scales were different between LMM/LPCM and TUDD. Moreover, among the three event definitions considered for the TUDD approach, the results were also discordant, in particular when death was added as an event.

Discussion

This article compares the two most common methods for longitudinal analysis of HRQoL in cancer clinical trials, the LMM and TTD/TUDD approaches, and an alternative strategy based on IRT, namely the LPCM, through statistical, methodological, and practical arguments.

From a statistical point of view, the LPCM approach is better suited than LMM and TTD/TUDD to the construction of EORTC questionnaires. Indeed, the HRQoL scores for dimensions based on few items are considered as continuous variables whereas, in fact, they present the characteristics of ordinal variables [11, 12]. However, a previous simulation study comparing these three approaches found that the LMM was the most powerful method in all the scenarios considered, ahead of the LPCM [13]. This study also found that the statistical power of the TTD/TUDD approach was low, especially for single-item scales (even with a large sample size), but the case where death or drop-out was integrated into the event definition was not considered. Finally, the LMM is a well-established approach that is more intuitive and easier to perform, contrary to LPCM, which is difficult to understand and to interpret, even for a statistician, and is not implemented in the main statistical software packages. Nevertheless, the use of the LPCM could be argued and justified for single-item scales.

For the LMM, time was treated as continuous, which implies making an assumption about the relationship between time and HRQoL. Note that the linearity assumption considered could be relaxed by including, for example, a quadratic term or by using splines that would allow a flexible form for the HRQoL trajectories. For the LPCM, time was also treated as continuous. In both the LMM and the LPCM, time could also be treated as a discrete variable.
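As an illustration of the quadratic option, the LMM given in the statistical analysis section could be extended as follows. This is one possible specification, not a model fitted in this study: the additional coefficients \(\beta_{4}\) (curvature of the mean trajectory) and \(\beta_{5}\) (difference in curvature between groups) are assumptions introduced here, and the random effects are left unchanged.

$$Y_{i} \left( t \right) = \beta _{0} + \beta _{1} t + \beta _{2} {\text{grp}}_{i} + \beta _{3} \left\{ {{\text{grp}}_{i} \times t} \right\} + \beta _{4} t^{2} + \beta _{5} \left\{ {{\text{grp}}_{i} \times t^{2} } \right\} + u_{{0i}} + u_{{1i}} t + \varepsilon _{i} \left( t \right)$$

Setting \(\beta_{4} = \beta_{5} = 0\) recovers the linear model, so the linearity assumption can be examined by testing these two coefficients.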

From a practical point of view, to promote quicker and more systematic analysis of HRQoL data with the three methods in clinical trials in oncology, we developed several commands providing automatic and reliable analyses with the statistical software Stata [16, 17] and R [18]. Moreover, SAS and Stata codes to implement the LPCM are also available from the authors on request.

This article also compares the three methods on real HRQoL data from seven clinical cancer trials and French published prospective databases in adjuvant, advanced or palliative settings, and in different tumor sites. In the majority of cases, we observed concordant results between the LMM and LPCM approaches (significant scales and interpretation of the results, i.e., in favor of the experimental or control arm). However, discordant results were observed between the TTD/TUDD approach and the two others. This issue was not unexpected, since it had already been raised in two glioblastoma trials [7, 8]. These discordant results are coherent with regard to the different criteria considered: GLMM investigate the change in HRQoL score (LMM) or the latent trait level (LPCM), whereas the TTD/TUDD approach studies the time until the occurrence of a HRQoL score deterioration (whether definitive or not). For TTD/TUDD analyses, the event definition is a major difficulty, especially since the choice of the reference score to determine the deterioration (for example, baseline or best previous score) and the possible inclusion of death in the event definition (as recommended in the palliative setting) could produce discordant results. Besides, in the case where death is included in the event definition, it is necessary to remain vigilant with regard to death occurrence in relation to the questionnaire collection time: with too many early or late deaths, TUDD analysis could coincide with overall survival analysis. On the other hand, if death is not included in the event definition, it may introduce informative censoring and bias the results. Finally, our study has a limitation: the choice of a ten-point responder threshold, for all scales, to qualify an individual deterioration in the TTD/TUDD approaches. Indeed, for single-item scales, for example, a ten-point difference is achieved with a movement of only one response level, whereas for other scales it would require more than this. The choice of ten points cannot be clinically meaningful across all subscales. In fact, the EORTC is currently working on the definition of MID for group-level as well as responder threshold per EORTC questionnaire and cancer sites [29]. These new recommendations could then be used and will be better adapted according to the type of analysis.

Overall, these methods seem to provide complementary interpretations and information. The LMM and LPCM are the most powerful methods on simulated data, while the TTD/TUDD approach gives more clinically understandable results. Thus, one method does not outperform the others and we would recommend combining the LMM and TTD/TUDD approaches, except for single-item scales, for longitudinal analysis of HRQoL data (if HRQoL is a secondary endpoint) in cancer clinical trials.

This statement has been supported by the glioblastoma trials previously cited [7, 8]. In fact, a secondary paper on HRQoL data has been published for the AVAglio trial using both LMM and TTD approaches [30], which enabled the comparison of the results with the RTOG0825 trial [8]. For single-item scales, the TTD/TUDD approach is not appropriate, and LPCM is more appropriate than LMM and seems easy to implement because, in this case, LPCM is only a classical GLMM for ordinal data.

An article fully dedicated to reporting HRQoL data should be systematically proposed after the first publication of the trial’s results and should be written according to the recommendations of Calvert et al. [31]. In particular, statistical approaches for dealing with missing data (such as sensitivity analysis with joint modeling) and type I error adjustment must be explicitly detailed. Moreover, it is highly recommended and appreciated when the HRQoL results are published relatively soon after the main paper comes out.

Our work and views are consistent with the objective of the SISAQOL Consortium [32, 33] to propose recommendations for standardizing analyses of patient-reported outcome data in cancer clinical trials. Indeed, before planning clinical cancer trials with HRQoL as a primary/co-primary endpoint, it is essential to harmonize the methodology for HRQoL analysis and the reporting of the results. This seems the most reliable way to obtain comparable results between trials, which can then inform the assumptions used to plan future clinical trials.

Conclusion

In conclusion, these results argue for recommending the combined use of the LMM and TTD/TUDD approaches (except for single-item scales) in HRQoL-specific publications, as a step towards a consensus. The choice of the method should also be guided by the clinical objective, depending on whether the objective is to show a difference in the evolution of the mean score over time (LMM, LPCM) or a difference in the risk of HRQoL deterioration over time (TTD/TUDD).

Standardization of the longitudinal analysis of HRQoL is an essential step towards confirming its position as a primary or co-primary endpoint in cancer clinical trials, ultimately leading to a change in clinical practice in light of HRQoL data.