Introduction

Chronic graft versus host disease (cGVHD), a serious and life-threatening condition, occurs in approximately 30–70% of patients who receive allogeneic hematopoietic stem cell transplantation (HSCT) [1,2,3]. It is characterized by complex allogeneic and autoimmune dysregulation of the immune system. Symptoms may impact multiple organs with a predilection for oral and ocular mucosa, skin, lung, liver, gastrointestinal, and genitourinary tract epithelium. Prior to August 2017, there were no approved second-line treatments and no standard of care. Conducting a randomized, controlled trial in refractory cGVHD is challenging because of its rarity, life-threatening nature, and difficulty identifying a comparator arm.

On August 2, 2017, the Food and Drug Administration (FDA) approved ibrutinib (IMBRUVICA, AbbVie Inc.) for the treatment of patients with cGVHD after failure of one or more lines of systemic therapy. Ibrutinib is the first FDA-approved drug for the treatment of cGVHD. Approval was based on results from study PCYC-1129-CA (NCT02195869), a single-arm trial of 42 patients with cGVHD after progression on first-line corticosteroid therapy and requiring additional therapy. The primary endpoint was a clinician-reported outcome (ClinRO): best overall cGVHD response rate (BORR) per the 2005 National Institutes of Health (NIH) Consensus Panel Response Criteria with modification to align with the updated 2014 NIH Consensus Panel Response Criteria. Using this ClinRO, ibrutinib demonstrated a BORR of 66.7% (n = 28, 95% CI 50.5%, 80.4%), which included both partial and complete responders. Median time to response was 12.3 weeks. A sustained response (≥ 20 weeks) was demonstrated in 48% of the patients for whom there was no available therapy. In addition, responses were seen across different organ involvement within the first 3 months of treatment. A sustained response in approximately half of patients with an unmet medical need can be considered clinically meaningful. Importantly, the ClinRO results were supported by favorable patient-reported outcome (PRO) results [4].

Recent legislation, including the Twenty-first Century Cures Act, has highlighted the importance of capturing patient input to inform medical product development. As the symptom burden associated with cGVHD is high, PRO measures are especially useful for capturing relevant symptoms and describing patients' experience with treatments and interventions. One commonly used PRO strategy in clinical trials to capture patient experience is the inclusion of fit-for-purpose PRO measures. The FDA defines “fit-for-purpose” as “a conclusion that the level of validation associated with a medical product development tool is sufficient to support its context of use” [5]. The Lee Chronic GVHD Symptom Scale (hereafter referred to as the LSS) is a PRO measure developed in 2002 to assess the heterogeneous symptom bother and impacts of cGVHD [6] and was included in the registration trial for ibrutinib.

In previous research, patients with cGVHD who completed PRO measures were found to experience detrimental effects on their physical functioning and other symptoms when disease symptoms increased in severity [7]. In another study, newly diagnosed cGVHD patients who met clinical criteria for response had a greater reduction in symptom burden [8].

Despite the advantages of collecting PRO data in rare disease trials such as those conducted in cGVHD, one challenge is the frequent absence of a control arm, leading to concern that patients may overestimate benefit when aware of treatment assignment [9]. In this trial, additional limitations of the PRO results included identification of a responder definition, and other instrument shortcomings. In this manuscript, we report the FDA review of PRO data from Study PCYC-1129-CA, and the decision to incorporate descriptive data from a PRO measure in the FDA label to support the primary clinical results.

Materials and methods

Study participants

Study PCYC-1129-CA was a multicenter, single-arm, open-label phase 1b/2 trial of ibrutinib in patients with steroid dependent or refractory cGVHD after allogeneic HSCT. Patients were required to have received ≤ 3 prior therapies for cGVHD. Patients were also required to have either > 25% body surface area erythematous rash or NIH mouth score > 4 [10]. Patients in the trial were asked to complete screening, treatment and follow-up assessments. A total of 45 patients were enrolled, and 43 treated. One patient who received ibrutinib was excluded due to relapse of underlying disease at baseline, resulting in a population of 42 patients. The design of this trial has been described in further detail elsewhere [11].

PRO measures

The LSS consists of 30 items that are used to create 7 subscales: Skin (5 items), Eye (3 items), Mouth (2 items), Lung (5 items), Nutrition (5 items), Energy (7 items), and Psychological (3 items) [6]. For each item, patients rate how bothered they were by symptoms (e.g., mouth ulcers), impacts (e.g., avoiding certain foods) and medical interventions (e.g., use of eye drops) over the past month using a 5-point response scale with the options: 0 = Not at all, 1 = Slightly, 2 = Moderately, 3 = Quite a bit and 4 = Extremely. Items are summed to generate subscale scores, and the LSS total score is calculated as the average of the subscale scores. All calculated scores are linearly transformed to a 0–100 scale (per scoring algorithm). A higher score indicates more bother from cGVHD symptoms. A decrease (improvement) of ≥ 7 points on the LSS total score has been published as a clinically meaningful difference. This 7-point threshold was calculated using distribution methods (i.e., half a standard deviation of the baseline LSS total score for the population) [6].
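The scoring steps described above can be sketched as follows. This is a minimal illustration only, not the official LSS scoring algorithm: the function names are hypothetical, the item-to-subscale mapping is represented only by the item counts given in the text, and handling of missing items is omitted.

```python
from statistics import mean

# Number of items per subscale, as listed in the text (30 items in total).
SUBSCALE_ITEM_COUNTS = {
    "Skin": 5, "Eye": 3, "Mouth": 2, "Lung": 5,
    "Nutrition": 5, "Energy": 7, "Psychological": 3,
}
MAX_ITEM_SCORE = 4  # items are rated 0 (Not at all) to 4 (Extremely)


def subscale_score(item_responses):
    """Sum the 0-4 item responses and rescale linearly to 0-100."""
    raw = sum(item_responses)
    max_raw = MAX_ITEM_SCORE * len(item_responses)
    return 100.0 * raw / max_raw


def total_score(responses_by_subscale):
    """Average the transformed subscale scores to obtain the total score."""
    return mean(subscale_score(items) for items in responses_by_subscale.values())


# Hypothetical patient: every item answered "Moderately" (2), except the
# Skin items, which are all answered "Extremely" (4).
patient = {name: [2] * n for name, n in SUBSCALE_ITEM_COUNTS.items()}
patient["Skin"] = [4] * SUBSCALE_ITEM_COUNTS["Skin"]

print(subscale_score(patient["Skin"]))  # 100.0
print(total_score(patient))             # six subscales at 50 and one at 100
```

Because the total is an average of subscale scores rather than of items, a two-item subscale such as Mouth carries the same weight in the total as the seven-item Energy subscale.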

In addition to the LSS, the drug sponsor collected the Patient Self-Report section (Form B) of the NIH cGVHD Response Assessment Form [12]. For this study, FDA focused on two global items:

  • Item 1. “Overall, do you think that your chronic graft versus host disease is mild, moderate, or severe?” (0 = None; 1 = Mild; 2 = Moderate; 3 = Severe)

  • Item 3. “Compared to a month ago, overall would you say your chronic GVHD symptoms are” (3 = Very much better, 2 = Moderately better, 1 = A little better, 0 = About the same, − 1 = A little worse, − 2 = Moderately worse, − 3 = Very much worse).

Data were collected at week 1 (baseline), week 13, and every 12 weeks thereafter, with additional assessments at the progressive disease visit (if applicable), end of treatment (EoT) visit and response follow-up visits. A late protocol amendment was implemented to capture an additional PRO assessment at week 5.

Statistical analysis for PRO

The PRO measure was used to assess the secondary endpoint: “Change in symptom burden measured by the Lee cGVHD Symptom Scale,” with no adjustment for Type I error. The analyses presented by the applicant were replicated by the FDA and are presented in the results section. Summary statistics were used to describe the LSS total and subscale scores over study visits. A responder analysis was conducted using a ≥ 7-point improvement on the LSS total score as the response threshold based on the previous literature. Patients who experienced a ≥ 7-point improvement are referred to as patients with a PRO response in this paper. A sub-group analysis of PRO responders by the clinical outcome, best overall response, was also investigated. Finally, mean change from baseline on the LSS total score was assessed.
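The responder definition can be expressed in a few lines. The sketch below is illustrative and assumes that change is computed as the visit score minus the baseline score, so that improvement is negative; the patient data and function names are hypothetical, and the analyses reported here were run in SAS.

```python
RESPONSE_THRESHOLD = 7.0  # points of improvement on the LSS total score


def change_from_baseline(baseline, post_baseline_scores):
    """Return change scores (visit minus baseline); negative values are improvements."""
    return [score - baseline for score in post_baseline_scores]


def is_pro_responder(baseline, post_baseline_scores, threshold=RESPONSE_THRESHOLD):
    """True if any post-baseline visit shows a decrease of at least `threshold` points."""
    return any(
        change <= -threshold
        for change in change_from_baseline(baseline, post_baseline_scores)
    )


# Hypothetical LSS total scores: patient -> (baseline, [post-baseline visits]).
patients = {
    "A": (40.0, [38.0, 31.5, 30.0]),  # improves by 8.5 points at the second visit
    "B": (25.0, [24.0, 22.0]),        # best improvement is 3 points
    "C": (33.0, [26.0]),              # improves by exactly 7 points
}
responders = [pid for pid, (base, visits) in patients.items()
              if is_pro_responder(base, visits)]
print(responders)  # ['A', 'C']
```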

Post hoc FDA analysis

Completion rate for the LSS was calculated as the number of patients completing > 50% of items at each PRO assessment divided by the number of patients expected to complete the LSS at that assessment (i.e., patients still on treatment). The denominator did not include patients who had progressed or died [13].
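As a concrete reading of this definition, the sketch below computes a completion rate for one visit. The data and names are hypothetical, and the list of expected patients is assumed to have been restricted beforehand to those still on treatment.

```python
N_ITEMS = 30  # number of LSS items


def is_complete(n_items_answered, n_items=N_ITEMS):
    """A form counts as completed when more than 50% of items are answered."""
    return n_items_answered > 0.5 * n_items


def completion_rate(items_answered_by_patient, expected_patients):
    """Completed forms divided by the number of patients expected at the visit.

    `expected_patients` holds the IDs of patients still on treatment; patients
    who progressed or died are assumed to have been removed by the caller.
    """
    completed = sum(
        1 for pid in expected_patients
        if is_complete(items_answered_by_patient.get(pid, 0))
    )
    return completed / len(expected_patients)


# Hypothetical visit: four patients expected; P3 answered exactly half of the
# items (not more than 50%), and P4 returned no form.
answered = {"P1": 30, "P2": 16, "P3": 15}
expected = ["P1", "P2", "P3", "P4"]
print(completion_rate(answered, expected))  # 0.5
```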

The sensitivity analyses outlined below addressed two limitations: (1) the ≥ 7-point threshold may not be meaningful and (2) open-label bias may have overestimated the treatment benefit.

First, the relationship between PRO response and clinical response was examined using descriptive statistics. Next, the threshold for meaningful change for the LSS total score was assessed using anchor-based methods supplemented with cumulative distribution function (CDF) curves, as suggested in the FDA Guidance for Industry for PROs [9]. Here, the PRO measurement results are defined (i.e., anchored) in terms of change on a measure external to the LSS. The anchors were patient-reported change from baseline on global cGVHD severity (item 1) and change in overall cGVHD symptoms (item 3) from the NIH Response Assessment. Due to the small sample size, adjacent response options on global change (item 3) were collapsed (e.g., Better = Very much better, Moderately better, A little better). This resulted in three categories: Worse, About the same and Better. Change from baseline to week 13 was used for global severity (item 1), where < 0 = Better, 0 = No change and > 0 = Worse. Week 13 was used due to reduced sample size at subsequent assessments. The mean change in LSS total score of the improvement group was considered as the threshold for meaningful change.
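The anchor-based calculation can be illustrated with a short sketch. The data below are invented, the function names are hypothetical, and the empirical CDF helper is only a simple stand-in for the CDF curves described above.

```python
from statistics import mean


def collapse_global_change(rating):
    """Map item 3 ratings (-3..3) to 'Worse', 'About the same', or 'Better'."""
    if rating > 0:
        return "Better"
    if rating < 0:
        return "Worse"
    return "About the same"


def anchor_based_threshold(records):
    """Mean LSS total change among patients whose anchor category is 'Better'.

    `records` is a list of (item3_rating, lss_change_from_baseline) pairs.
    """
    better = [change for rating, change in records
              if collapse_global_change(rating) == "Better"]
    return mean(better)


def empirical_cdf(changes, cutoff):
    """Proportion of patients with a change score at or below `cutoff`."""
    return sum(1 for c in changes if c <= cutoff) / len(changes)


# Hypothetical week 13 data: (item 3 rating, LSS total change from baseline).
week13 = [(3, -12.0), (2, -8.0), (1, -4.0), (0, -1.0), (0, 2.0), (-1, 5.0)]
print(anchor_based_threshold(week13))  # -8.0

better_changes = [c for r, c in week13 if collapse_global_change(r) == "Better"]
print(empirical_cdf(better_changes, -7.0))  # share of "Better" patients improving by >= 7 points
```

Evaluating `empirical_cdf` over a grid of cutoffs for each anchor category gives the curves that are compared for separation between groups.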

Baseline differences on the LSS total score, subscales and psychological items were explored between patients with and without a PRO response. This was presented under the assumption that certain subscales may have been more sensitive due to this being an open-label study.

Finally, we looked at floor and ceiling effects for each item. These effects were considered present if more than 20% of baseline responses were in the highest (ceiling) or lowest (floor) response categories. For example, a floor effect for an item would exist if more than 20% of patients responded “Not at all” to being bothered, whereas a ceiling effect would be present if > 20% of patients responded as being “extremely” bothered.
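The floor and ceiling rule amounts to a simple proportion check per item. The sketch below applies it to one hypothetical item; the names and data are illustrative.

```python
EFFECT_CUTOFF = 0.20    # flag an effect when more than 20% of responses sit at an extreme
LOWEST, HIGHEST = 0, 4  # "Not at all" and "Extremely"


def floor_ceiling_flags(baseline_responses, cutoff=EFFECT_CUTOFF):
    """Return whether an item shows a floor and/or ceiling effect at baseline."""
    n = len(baseline_responses)
    floor_share = baseline_responses.count(LOWEST) / n
    ceiling_share = baseline_responses.count(HIGHEST) / n
    return {"floor": floor_share > cutoff, "ceiling": ceiling_share > cutoff}


# Hypothetical baseline responses to one item from ten patients:
# four answered 0 (40%) and one answered 4 (10%).
item = [0, 0, 0, 1, 2, 2, 3, 4, 1, 0]
print(floor_ceiling_flags(item))  # {'floor': True, 'ceiling': False}
```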

Analyses were performed on the pooled phase 1b/2 data (i.e., all-treated population). All analyses were completed using SAS software (release 9.4, SAS Institute, Inc., Cary, NC).

Results

Forty-two patients were included in the all-treated analysis population. Median age was 56 years (range 19, 74 years) and 52.4% were male. Median duration of time on treatment was 4.4 months (range 7 days, 24.9 months), with 12 responding patients still on treatment at end of study.

PRO: completion rates

All 42 patients completed the baseline assessment. Post-baseline completion for the LSS was > 83% at all other designated clinic visits, except for the week 5 visit, which was added as a late protocol amendment (Table 1 in Online Appendix). Ten patients completed a week 5 assessment; however, the completion rate is unclear because the denominator (number of patients eligible) after the protocol amendment was not adequately described in the submission.

PRO: mean change from baseline

Mean change from baseline for the LSS total and subscale scores was reported for each study visit for patients who completed an assessment. Over the first 12 months of treatment, the mean change from baseline for LSS total score monotonically improved from − 1 (standard deviation (SD) = 10, N = 10) at week 5 to − 9 (SD = 12, N = 15) at week 49 (Table 1). The largest changes were observed for the Skin and Eye subscales (Figs. 1 and 2 in Online Appendix). Item-level change for these two subscales indicated that no single item was responsible for the change.

Table 1 Mean change from baseline for LSS total score

Patients with ≥ 7-point improvement on LSS total score

Analyses submitted to FDA reported that 18 (42.9%) patients had a ≥ 7-point improvement on the LSS total score at any point during the assessment period (Table 2). Seventeen of these 18 patients were classified by the investigator as experiencing a clinical partial response or better.

Table 2 Number of patients by clinical response and PRO LSS total score responders

Post hoc FDA analysis

FDA further explored duration of PRO response using the ≥ 7-point threshold. Ten out of 42 (23.8%) patients had a ≥ 7-point improvement on the LSS total score at any point that was maintained for ≥ 2 consecutive visits. Of these ten sustained responses, one patient had an initial response that was captured at the EoT visit and was sustained at a follow-up visit. Among PRO responders, the mean change from baseline was − 14.2 (SD = 5.7, range − 7.1, − 27.7), and the median time to an improvement of ≥ 7 points was 2.9 months (range 0.9, 16.7; Fig. 1).
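One way to operationalize a response maintained for ≥ 2 consecutive visits is sketched below. The treatment of a missing visit as breaking the run is an assumption made for this illustration, not a rule stated in the review, and the function name is hypothetical.

```python
def has_sustained_response(changes_by_visit, threshold=7.0, min_consecutive=2):
    """True if the improvement threshold is met at consecutive visits.

    `changes_by_visit` lists change from baseline in visit order (negative =
    improvement); None marks a missing assessment, which is assumed here to
    interrupt a run of responses.
    """
    run = 0
    for change in changes_by_visit:
        if change is not None and change <= -threshold:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0
    return False


print(has_sustained_response([-8.0, -9.0, -2.0]))  # True: two visits in a row
print(has_sustained_response([-8.0, -2.0, -9.0]))  # False: responses not consecutive
print(has_sustained_response([-8.0, None, -9.0]))  # False under the missing-visit assumption

# The reported proportions follow directly from the counts in the text.
print(round(100 * 18 / 42, 1))  # 42.9 (any >= 7-point improvement)
print(round(100 * 10 / 42, 1))  # 23.8 (sustained for >= 2 consecutive visits)
```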

Fig. 1 Swimmer’s plot for PRO LSS total score responders (N = 18)

FDA assessed the meaningfulness of the 7-point change threshold. The Spearman correlation between change from baseline at week 13 on the global severity of cGVHD item and change on the LSS total score was 0.5, and the correlation between patient global impression of change and change from baseline at week 13 on the LSS total score was − 0.3. This suggests these anchors were appropriate [14]. At week 13, nine (29%) patients reported less (better) severity of their GVHD symptoms and 13 (42%) patients reported an improved (better) change in GVHD symptoms over the past month (Table 3). No patients had worse severity when comparing their week 13 score to baseline, but three patients reported their symptoms had gotten worse over the past month based on the global impression of change. The patient global severity item possibly overestimated the effect (effect size > 0.5, Table 3); therefore, we focus on the LSS threshold using global impression of change as the anchor (item 3). This analysis suggested a meaningful threshold to be 6.4 or greater. The CDF curves (Fig. 8a, b in Online Appendix) revealed a separation between the stable and better groups. There were too few patients who reported worsening symptoms to interpret that curve, and worsening was not included.
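For readers who want to reproduce this kind of anchor check, a Spearman correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below uses only the standard library; the data are invented and the helper names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical week 13 data: change in global severity (item 1) and change
# in LSS total score, both computed as week 13 minus baseline.
severity_change = [-1, 0, 0, -1, 0, -2]
lss_change = [-10.0, -2.0, 1.0, -8.0, -3.0, -15.0]
print(spearman(severity_change, lss_change))
```

A positive coefficient is expected for the severity anchor because both quantities decrease with improvement, whereas the global impression of change item is coded so that higher values mean improvement, giving a negative coefficient against LSS change.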

Table 3 Anchor-based analysis of Lee Chronic GVHD Scale total score

Further LSS analyses

FDA analysis noted floor effects at baseline for 20 of the 30 items. For 10 of these items, more than 50% of patients endorsed the lowest response category (0 = Not at All). Three items had ceiling effects (Table 4).

Table 4 Floor and ceiling effects for the LSS items

For medical intervention items, no patients reported bother with the intravenous line/feeding tube item, and only 2 patients reported slight or moderate bother on the use of oxygen item throughout the study. For the item assessing bother associated with eye drop use, at any time during the study, more than 50% of patients reported being bothered by the frequent use of eye drops.

Baseline characteristics of patients with ≥ 7-point improvement on the LSS

To understand whether patients with a PRO response reported different responses to the LSS items at baseline, we looked at descriptive statistics for the baseline assessment by PRO response. Overall, the patients with a PRO response had, on average, higher scores on the LSS total score and 6 of 7 of the subscales at baseline (Table 5).

Table 5 Baseline descriptive statistics for the LSS subscales and total scores by LSS PRO responder sub-group

Discussion

Results from our FDA analyses suggest that patients who experienced a clinical response while on ibrutinib were also likely to self-report reduced bother in their symptoms using the LSS. This review of the PRO data was challenging due to several important limitations of the trial design, assessment tool and analysis. Our review focused on three main issues: (1) responder definition/duration of response on the LSS total score, (2) appropriateness of the LSS instrument, and (3) study design (single-arm trial/concern for bias).

Responder definition: clinically meaningful change threshold for the LSS

Clinically meaningful change for the LSS has been proposed as a decrease of ≥ 7 points on the LSS total score [6]. This threshold was arrived at using a distribution-based method [15]. These methods are data driven, do not reflect the patient’s assessment, and are often considered to be supportive to anchor-based methods. However, including anchors in trials may not always be feasible, and there is some evidence that the two approaches show more commonality than difference [16]. Using the patient-rated global items, an anchor-based calculation was applied and supplemented with CDF curves. The threshold estimated was a change of 6.4 points on the LSS total score, which corresponds closely to the 7-point threshold estimated by Lee et al. [6]. The CDF plots also suggested threshold values close to this magnitude to be reasonable. The majority of PRO responders (78%) had a change that was at least 11 points (4 points greater than 7), suggesting that even if a slightly more conservative threshold had been chosen, few patients would have been reclassified.

These estimations are limited by a trial not designed for these analyses (e.g., the anchor item has a 1-month recall period but was used to assess mean change from baseline at 13 weeks) and by a very small sample size. Despite these limitations, there was concordance observed with the literature. Given this and the results presented, the FDA review concluded that the magnitude of the patient-reported responses was supportive of meaningful clinical benefit.

Limitations of the LSS

The LSS was developed in 2002 to evaluate how bothersome patients find their symptoms [6]. This questionnaire has been well established for use in clinical practice; however, it was not designed as a standalone outcome measure for clinical trials to support regulatory action. Despite this, there are benefits to using patient-reported symptoms, which can include bother and impact and which complement clinician-assessed symptoms. Further study should be done on whether PRO measures are being routinely employed in clinical practice for the assessment of cGVHD. Currently, care guidelines, such as the guidelines published by the National Comprehensive Cancer Network, do not include recommendations to capture patient reports of symptoms [17].

The LSS only measures symptom bother; other important aspects of symptoms such as severity and interference with function are not captured. Symptom bother can be a challenging concept to measure and can vary as a function of disease stage and individual tolerance. For example, patients may report being bothered by a symptom that is not very severe, or, a patient may come to tolerate a symptom and report less “bother” even though the symptom remains severe. Because of these challenges, FDA generally recommends measuring symptom severity or frequency, where appropriate, as these concepts might be more sensitive to the treatment effect. However, FDA recognizes that bother, burden, or interference can provide additional important information once severity or frequency is established.

In this trial, patient-reported global severity of cGVHD and impression of change in cGVHD symptoms were assessed. FDA analyzed the relationship between these global items and the LSS total score and found moderate correlations between global severity and LSS total score (baseline and week 13). Evidence of agreement for patients reporting improvement at the same PRO assessment on both a global item and the LSS total score was weak (Table 2 in Online Appendix). Poor agreement could be due to the threshold, or differences between a single-item measure versus a multi-item measure, or because each instrument measures a slightly different concept, or finally some combination of these issues. In general, because this trial was not designed to optimally carry out additional assessments of the measurement properties of the LSS, some findings are difficult to interpret. At best, we found moderate evidence that in addition to improvements in bother, patients were reporting decreases in overall disease and symptom severity.

Another challenge with the LSS concerns the subscales and item content. For example, it is unclear why items measuring bother by joint and muscle aches are scored as part of the Energy subscale. As our primary focus was on the LSS total score, the mapping of items to subscales was not further explored. If future studies focus on change in the subscales, additional work will be required to determine whether the subscales can be considered fit-for-purpose. Additionally, as an exploratory trial objective was to evaluate the effect of ibrutinib on cGVHD symptom bother, the inclusion of items measuring bother due to other non-investigational medical treatments on the LSS was problematic. The impact of these items was considered small as only one of the medical treatments (use of eye drops) was prevalent in the trial population.

Finally, many items were found to have either floor or ceiling effects. This remains a limitation of LSS, although these effects may be difficult to avoid entirely given the clinical reality that cGVHD symptomatology is heterogeneous. For instance, while it is important to cover all symptoms/outcomes, not all symptoms and functional impacts will occur for an individual patient. This was observed in this trial and can result in large proportions of patients responding with the lowest (floor) response options, making it difficult to distinguish between patients due to a lack of sensitivity. Despite the large proportion of items with floor effects in this trial, we still observed change.

The decision to incorporate LSS results in FDA labeling of ibrutinib should not be construed as endorsement of the current LSS questionnaire as “fit-for-purpose” for a clinical outcome assessment to quantify benefit for cGVHD in registration trials. We support efforts to further modify and improve the LSS and other measures of cGVHD symptoms and impacts for use in future cGVHD trials. Efforts to improve the measurement of cGVHD symptoms could consider the limitations outlined in this manuscript, particularly around alignment of domains and item content. Instrument developers are encouraged to meet with FDA to obtain specific recommendations on how to adapt existing PRO instruments for regulatory purposes early in their drug development program.

Limitations of the trial design: open-label trials and concern for bias

Open-label studies are susceptible to bias through knowledge of treatment assignment, which may lead to expectation of treatment benefit. Additionally, concomitant treatments that would be expected to affect patients’ reports of their symptoms (e.g., topical treatments) should be standardized, recorded and analyzed. This was not the case in this trial, and this information could not be incorporated into our analysis.

The degree to which PRO results are influenced by response bias in open-label trials is poorly understood and there are no agreed-upon methods to account for this potential effect. FDA hypothesized that knowledge of treatment assignment may provide emotional benefit, and that the psychological subscale may be more susceptible to overestimation of treatment benefit. However, our analysis of the psychological subscale did not find evidence to suggest that PRO responders were overly influenced by improvements on this subscale. Skin and Eye subscales were most improved; however, skin and eye are hallmark symptoms of cGVHD, and patients reported high skin and eye bother at baseline, in part due to the inclusion criteria requiring patients have either > 25% erythematous rash or > 4 total mouth score per NIH criteria. This requirement may have influenced the dominance of these domains. Alternatively, this could indicate that the other cGVHD symptoms were not as relevant to the study population.

Another assumption explored was the notion that if a response was driven by being in an open-label trial (perception of symptom benefit in the absence of true therapeutic efficacy), the observed PRO benefit would occur early and be less durable, as treatment side effects and untreated disease symptoms would overcome this response bias. In this trial, assessment of early improvement was limited because the time between baseline and first on-treatment assessment was 13 weeks for three quarters of the patients enrolled, which corresponds to the median time to response on the LSS total score of 2.9 months (range 0.9, 16.7). While an amendment was made to include a week 5 assessment, only 10 patients completed this assessment. The fact that half the responses occurred or were still present after 3 months of treatment, and that more than half of the PRO responders had a response that lasted 2 or more visits, suggested that responses were not all early and of short duration.

Finally, results were consistent with previous research identifying an association between PR/CR (i.e., the clinical response) and an LSS improvement [18]. In another study of imatinib or rituximab to improve cutaneous sclerosis in patients with cGVHD, the authors observed a significant decrease in the Skin subscale for patients in the imatinib arm. This was generally in line with clinical findings [19].

Based on FDA sensitivity analyses, it was felt to be unlikely that the PRO results were heavily influenced by response bias due to being an open-label trial. However, patients may still have perceived a larger magnitude of benefit knowing they were on an investigational agent, and this remains a significant limitation of the study design. Importantly, the PRO results were not the primary endpoint of the study and were providing supportive evidence of treatment efficacy demonstrated by clinical evaluation of both signs and symptoms of cGVHD.

Our focus was to describe analyses of the Lee Symptom Scale and its use in the regulatory context of the ibrutinib approval. We recognize an important area of future research is to understand the relationship between PROs and clinical adverse events (e.g., infections such as pneumonia).

Conclusion

Study PCYC-1129-CA demonstrated favorable clinician-reported cGVHD efficacy results that were complemented by results from PRO data, supporting the FDA’s positive benefit–risk assessment leading to regular approval. Limitations of the PRO results include single-arm trial design, responder definition, and instrument shortcomings. These limitations were thoroughly explored through additional FDA post hoc analyses. Despite the limitations identified with the LSS as a clinical outcome assessment for regulatory use, the tool is familiar to physicians treating cGVHD and the FDA review concluded these results were important to convey to treating physicians in the product label. Modification of the LSS to improve its use as a clinical outcome assessment tool for regulatory decision-making should be considered for future trials.