Introduction

Exposure–response (E–R) modeling of clinical endpoints is important in drug development for facilitating informative dose selection. A widely used class of E–R models is the indirect response (IDR) models [1]. These models are most often used to describe pharmacodynamic endpoints whose response to drug exposure is delayed. However, many clinical trial endpoints are based on disease scores that are not physiological variables. For example, two types of commonly used efficacy measurements in psoriasis are the Psoriasis Area and Severity Index (PASI) score, ranging from 0 to 72 in 0.1 increments, and Physician’s Global Assessment (PGA) scores, a 6-point scale measuring disease severity (0 = cleared, 1 = minimal, 2 = mild, 3 = moderate, 4 = marked, and 5 = severe) [2]. Clinical trial endpoints typically include proportions of patients achieving various criteria including the following: PASI 75, PASI 90, and PASI 100, representing at least 75%, at least 90%, and 100% improvement in PASI score from baseline, respectively; PGA score = 0, or PGA score ≤1. Applications of IDR models to categorical clinical endpoints have emerged in the last decade via the latent variable approach [3].

E–R modeling using PASI scores as continuous variables has been conducted before [4,5,7]; however, evidence that such models can accurately predict the PASI criteria appears lacking. PGA scores are most effectively analyzed as an ordered categorical endpoint [8]. The PASI criteria, namely PASI 75, PASI 90, and PASI 100, can be combined into one ordered categorical endpoint Pc having four possible outcomes: Pc = 0, if achieving PASI 100; Pc = 1, if achieving PASI 90 but not PASI 100; Pc = 2, if achieving PASI 75 but not PASI 90; and Pc = 3, if not achieving PASI 75. Conceivably, because Pc and PGA both measure the same disease activity, their E–R characteristics should be similar, and jointly modeling them should therefore allow better integration of information. When modeling a continuous and a categorical endpoint measuring the same disease activity, Hu et al. [9] recently showed that joint modeling of endpoints could be more parsimonious and yet better describe the individual endpoints, compared with separately modeling the endpoints. This report addresses whether this type of improvement is possible when modeling two categorical endpoints. The answer is by no means a priori clear, due to the difference in nature between continuous and categorical endpoints. Additionally, with two categorical endpoints, the optimal choice of a common latent variable may not be obvious.
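The Pc definition above can be sketched in a few lines of Python. This is an illustration only: the function name `pasi_to_pc` and its arguments are hypothetical, and it assumes the usual reading of the criteria as "at least" 75% or 90% improvement from baseline.

```python
def pasi_to_pc(baseline: float, current: float) -> int:
    """Map a (baseline, post-baseline) PASI pair to the ordered category Pc.

    Pc = 0 is the best outcome (PASI 100) and Pc = 3 the worst
    (PASI 75 not achieved). Hypothetical helper, not from the analysis.
    """
    if baseline <= 0:
        raise ValueError("baseline PASI must be positive")
    # Percent improvement from baseline
    improvement = 100.0 * (baseline - current) / baseline
    if improvement >= 100.0:
        return 0  # PASI 100
    if improvement >= 90.0:
        return 1  # PASI 90 but not PASI 100
    if improvement >= 75.0:
        return 2  # PASI 75 but not PASI 90
    return 3      # PASI 75 not achieved
```

For instance, a patient going from a baseline PASI of 20 to a score of 2 has a 90% improvement and falls in category Pc = 1.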

Psoriasis is a chronic immune-mediated skin disorder [10,11,12]. Interleukin (IL)-12 and -23 have been implicated in the pathogenesis of psoriasis [13,14,15], and agents that block IL-12 and IL-23 have demonstrated efficacy in the treatment of moderate-to-severe plaque psoriasis [14]. Guselkumab is a monoclonal antibody that specifically blocks IL-23. Using data from a Phase 2b dose-ranging clinical trial of guselkumab in psoriasis [2], this report investigates the potential benefit of jointly modeling two categorical clinical endpoints and the source of potential improvement, and discusses the impact of latent variable choice.

Methods

Study design

A Phase 2, randomized, double-blind, parallel, dose-ranging study was conducted in patients with moderate-to-severe plaque psoriasis. Approximately 240 patients were randomly assigned to treatment with subcutaneous injection of guselkumab 5, 50, or 200 mg at Weeks 0 and 4 followed by every-12-week (q12w) dosing, or 15 or 100 mg with every-8-week (q8w) dosing, or placebo. The placebo group crossed over to the 100 mg q8w dosing at Week 16. The last dose was given at Week 40. Data from the Week 40 database lock were used for analysis. The detailed study design has been previously published [2].

Guselkumab serum concentration, antibodies, PASI, and PGA measurements

Serum samples for measuring guselkumab concentrations, along with PASI and PGA scores, were collected every 4 weeks (q4w) during Weeks 0–40. At visits when study patients received the study agent, blood samples were collected prior to study agent administration. A validated electrochemiluminescence immunoassay with a lower limit of quantification (LLOQ) of 0.01 μg/mL at a minimum required 1:10 dilution was used to measure serum guselkumab concentrations. A small number (5.3% in total) of post-dose pharmacokinetic (PK) measurements were below LLOQ and were excluded from analysis. Serum samples for the evaluation of antibodies to guselkumab were collected at Weeks 0, 16, and 40. Antibodies were detected using a validated sensitive and drug-tolerant electrochemiluminescence immunoassay method using the MSD platform. The observed sensitivity of this anti-drug antibody (ADA) assay was 3.1 ng/mL for antibodies to guselkumab in human serum that did not contain guselkumab; as validated, 15 ng/mL of antibodies to guselkumab could be detected in the presence of up to 3125 ng/mL of guselkumab in human serum samples. Study patients were classified as having a positive antibody status if antibodies to guselkumab were detected in the sample at any visit after guselkumab treatment. The final dataset contained 238 patients with 2014 PK measurements, 2220 post-baseline PASI scores, and 2456 PGA scores.

Population pharmacokinetics model

The population PK analysis was conducted for guselkumab to generate individual parameters that adequately describe patient PK profiles to facilitate E–R modeling. Based on earlier experience [5], a confirmatory population PK analysis [16, 17] was implemented using a one-compartment model with first-order absorption and first-order elimination [apparent clearance (CL/F), apparent volume of distribution (V/F), and absorption rate constant (k a )]. Between-subject variability (BSV) on CL/F, V/F and k a was included using lognormal distributions. Correlation between the BSVs on CL/F and V/F was also included, along with baseline body weight effects on CL/F and V/F using a power model standardized to the median baseline body weight of 90 kg. An additive-plus-proportional residual error model was used, with the standard deviation (SD) of the additive component fixed at approximately 0.0029 μg/mL, i.e., the SD of a uniform distribution U(0, LLOQ) with LLOQ = 0.01 μg/mL (LLOQ/√12). In our experience, compared with estimating this SD, fixing it usually results in a similar data likelihood, nearly identical PK model parameter estimates, and a slightly shortened run time, and can occasionally stabilize parameter estimation. Individual empirical Bayesian PK parameter estimates for patients were obtained for the E–R model development.
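As a rough sketch of the structural PK model described above, the following Python functions compute the fixed additive SD from the LLOQ, apply a power-model body weight effect, and evaluate the single-dose concentration for a one-compartment model with first-order absorption and elimination. All function names are hypothetical, and any parameter values passed to them are illustrative placeholders rather than the estimates in Table 1. Multiple doses would be handled by superposition, which is not shown.

```python
import math


def additive_sd_from_lloq(lloq: float) -> float:
    """SD of a uniform distribution on (0, LLOQ), i.e. LLOQ / sqrt(12)."""
    return lloq / math.sqrt(12.0)


def weight_scaled(theta_pop: float, weight_kg: float, exponent: float,
                  ref_weight_kg: float = 90.0) -> float:
    """Power model for a body weight effect, standardized to 90 kg."""
    return theta_pop * (weight_kg / ref_weight_kg) ** exponent


def one_cmt_conc(t: float, dose: float, cl_f: float, v_f: float,
                 ka: float) -> float:
    """Concentration at time t after a single extravascular dose.

    One-compartment model, first-order absorption (ka) and first-order
    elimination (ke = CL/F divided by V/F). Units are the caller's
    responsibility (e.g. mg, L/day, L, 1/day gives mg/L).
    """
    ke = cl_f / v_f
    if math.isclose(ka, ke):
        raise ValueError("ka == ke requires the limiting form of the equation")
    return dose * ka / (v_f * (ka - ke)) * (math.exp(-ke * t) - math.exp(-ka * t))
```

With LLOQ = 0.01 μg/mL, `additive_sd_from_lloq(0.01)` is about 0.0029, matching the fixed value quoted in the text.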

Latent variable indirect response model framework

The latent variable approach presumes an underlying latent variable such that the endpoint occurs when the latent variable crosses certain thresholds. More precisely, let n be the number of categories of an ordered categorical variable Y, L(t) be the latent variable, and α k be the thresholds where k = 1, 2, …, n−1, such that:

$${\text{Y}} \le {\text{k}} \Leftrightarrow {\text{L}}({\text{t}}) < \alpha_{k}$$

The latent variable L(t) is modeled as:

$${\text{L}}({\text{t}}) = {\text{M}}({\text{t}}) + \sigma \varepsilon$$
(1)

where M(t) is the model predictor, ε is distributed with mean 0 and variance 1, and σ is the error standard deviation. Assuming that ε follows the standard normal distribution, then:

$${\text{prob}}[{\text{Y}} \le {\text{k}}] = {\text{prob}}[{\text{L}}({\text{t}}) < \alpha_{k} ] = {\text{prob}}[\varepsilon < (\alpha_{k} - {\text{M}}({\text{t}}))/\sigma ] = \varPhi [(\alpha_{k} - {\text{M}}({\text{t}}))/\sigma ]$$

In this setting of latent variables, σ is not identifiable and may be assumed to be equal to 1; this gives:

$$\varPhi^{ - 1} [{\text{prob}}({\text{Y}} \le {\text{k}})] = \alpha_{k} - {\text{M}}({\text{t}})$$
(2)

which corresponds to probit regression. Assuming ε follows a logistic distribution leads to logit regression. In the context of E–R modeling, this derivation was first given in Hutmacher et al. [18]. To stabilize parameter estimation, α k are typically better re-parameterized; e.g., for n = 3, as (α 1 , d 0 , d 2) with d 0 , d 2 > 0 such that α 0 = α 1 − d 0 and α 2 = α 1 + d 2.
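The probit relationship in Eq. 2 and the threshold re-parameterization can be sketched as follows. The function names are hypothetical, and the numbers in the usage note are arbitrary. The number of categories returned is one more than the number of thresholds supplied.

```python
from statistics import NormalDist


def thresholds_from_increments(alpha1: float, d0: float, d2: float) -> list:
    """Re-parameterized thresholds: alpha0 = alpha1 - d0, alpha2 = alpha1 + d2.

    Requiring d0, d2 > 0 keeps the thresholds strictly ordered.
    """
    if d0 <= 0 or d2 <= 0:
        raise ValueError("d0 and d2 must be positive")
    return [alpha1 - d0, alpha1, alpha1 + d2]


def category_probs(thresholds: list, m_t: float, sigma: float = 1.0) -> list:
    """Category probabilities for an ordered probit model.

    Cumulative probabilities follow prob(Y <= k) = Phi((alpha_k - M(t)) / sigma);
    sigma defaults to 1 because it is not identifiable.
    """
    phi = NormalDist().cdf
    cumulative = [phi((a - m_t) / sigma) for a in thresholds] + [1.0]
    probs = [cumulative[0]]
    for i in range(1, len(cumulative)):
        # Probability of a single category is a difference of cumulative terms
        probs.append(cumulative[i] - cumulative[i - 1])
    return probs
```

For example, `category_probs(thresholds_from_increments(0.0, 1.0, 1.0), 0.0)` returns four probabilities that sum to 1 and are symmetric around the middle threshold.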

The latent variable representation Eq. 2 allows mechanism-based models to be used for M(t). Between-subject variability is typically modeled at the intercept level with an additive normal distribution η ~ N(0, ω 2). Modeling the total treatment effect as the sum of placebo effect f p (t) and drug effect f d (t), this leads to the mixed-effect probit regression, as follows:

$$\varPhi^{ - 1} [{\text{prob}}({\text{Y}} \le {\text{k}})] = \alpha_{k} + {\text{f}}_{\text{p}} ( {\text{t)}} + {\text{f}}_{\text{d}} ( {\text{t) + }}\eta$$
(3)

The placebo effect may typically be modeled empirically, e.g., with an exponential function:

$${\text{f}}_{\text{p}} ({\text{t}}) = - F_{p} {\text{exp}}( - r_{p} \;{\text{t}})$$
(4)

where F p is the maximum placebo effect and r p is the rate of onset. The drug effect was modeled using a latent variable R(t), governed by:

$$\frac{dR(t)}{dt} = k_{in} \left( 1 - \frac{C_{p}}{IC_{50} + C_{p}} \right) - k_{out} R(t)$$
(5)

where C p is drug concentration, and k in , IC 50, and k out are parameters in a Type I IDR model. It was further assumed that at baseline R(0) = 1, yielding k in  = k out . The reduction of R(t) was assumed to drive the drug effect through:

$${\text{f}}_{\text{d}} ({\text{t}}) = {\text{DE}}[1 - {\text{R}}({\text{t}})]$$
(6)

where DE is a parameter to be estimated that determines the magnitude of drug effect.
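A minimal numerical sketch of Eqs. 5 and 6 is given below, using a fixed-step fourth-order Runge–Kutta scheme in plain Python. This is not the estimation code used in the analysis (which was run in NONMEM); the function names, step size, and the concentration function passed in are assumptions made for illustration.

```python
def simulate_idr(conc, kout: float, ic50: float, t_end: float, dt: float = 0.1):
    """Integrate dR/dt = kin * (1 - C/(IC50 + C)) - kout * R with R(0) = 1.

    `conc` is a function of time returning the drug concentration C_p(t).
    The baseline condition R(0) = 1 implies kin = kout.
    Returns (times, r_values).
    """
    kin = kout

    def rhs(t, r):
        c = conc(t)
        return kin * (1.0 - c / (ic50 + c)) - kout * r

    n_steps = int(round(t_end / dt))
    t, r = 0.0, 1.0
    times, values = [t], [r]
    for _ in range(n_steps):
        # Classical RK4 step
        k1 = rhs(t, r)
        k2 = rhs(t + dt / 2, r + dt * k1 / 2)
        k3 = rhs(t + dt / 2, r + dt * k2 / 2)
        k4 = rhs(t + dt, r + dt * k3)
        r += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
        times.append(t)
        values.append(r)
    return times, values


def drug_effect(r: float, de: float) -> float:
    """Eq. 6: f_d(t) = DE * (1 - R(t))."""
    return de * (1.0 - r)
```

Two sanity checks follow directly from Eq. 5: with zero concentration R(t) stays at its baseline of 1, and with a constant concentration equal to IC50 it settles at 0.5, so the drug effect approaches DE/2.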

Theoretically, the representation of drug effect in Eqs. 3–6 has been shown to be equivalent to a change-from-baseline latent-variable IDR model [19], under which k out may be interpreted as the rate of drug effect onset and offset, and DE may be interpreted as the baseline of the latent variable prior to normalization [3]. Theoretical characteristics of general change-from-baseline IDR models, which have one less parameter than their corresponding IDR models, have been derived [19, 20]. Change-from-baseline latent IDR models are needed in the modeling of categorical endpoints because the latent variable is determined only up to a constant and therefore needs to be normalized [3, 18]. For more details on the theoretical characteristics of latent variable IDR models, see [3].

No covariate effects were explored for the E–R model due to the small sample size.

E–R modeling of PASI criteria and PGA scores

Equations 3–6 were first fitted to Pc and PGA data separately, and then simultaneously, with BSV correlation and shared parameters explored. Practically, maximum sharing could occur if the underlying latent variables for the two endpoints differ by only a scale factor, in which case only one parameter Sc could be used to jointly model Pc and PGA, as follows:

$$\varPhi^{ - 1} [{\text{prob}}({\text{Pc}} \le {\text{k}})] = \alpha_{k,Pc} + {\text{f}}_{\text{p}} ( {\text{t)}} + {\text{f}}_{\text{d}} ( {\text{t) + }}\eta$$
(7)
$$\varPhi^{ - 1} [{\text{prob}}({\text{PGA}} \le {\text{k}})] = \alpha_{k,PGA} + Sc[{\text{f}}_{\text{p}} ( {\text{t)}} + {\text{f}}_{\text{d}} ( {\text{t) + }}\eta ]$$
(8)

with fp(t) and fd(t) given by Eqs. 4–6. In theory, sharing of the intercept parameters could also occur, which would imply similarities between the Pc and PGA category separations. However, this type of similarity may be unlikely in practice.

Fitting Eqs. 3–8 simultaneously to the Pc and PGA data reduces the total number of fixed and random effect parameters by 4 and 1, respectively, compared with separately modeling the endpoints.
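The shared-latent-variable structure of Eqs. 7 and 8 can be sketched as below. The function names and the way the inputs are passed are hypothetical; the point is only that the two endpoints share f p (t), f d (t), and η, with the PGA predictor multiplied by the single scale factor Sc and each endpoint keeping its own intercepts.

```python
import math
from statistics import NormalDist


def placebo_effect(t: float, f_p: float, r_p: float) -> float:
    """Eq. 4: f_p(t) = -F_p * exp(-r_p * t)."""
    return -f_p * math.exp(-r_p * t)


def joint_cumulative_probs(alphas_pc, alphas_pga, sc, fp_t, fd_t, eta):
    """Cumulative probabilities prob(Pc <= k) and prob(PGA <= k), Eqs. 7-8.

    alphas_pc, alphas_pga: endpoint-specific intercepts (ordered lists).
    sc: scale factor applied to the shared predictor for PGA.
    fp_t, fd_t, eta: placebo effect, drug effect, and subject random effect.
    """
    phi = NormalDist().cdf
    shared = fp_t + fd_t + eta  # common latent predictor for both endpoints
    p_pc = [phi(a + shared) for a in alphas_pc]
    p_pga = [phi(a + sc * shared) for a in alphas_pga]
    return p_pc, p_pga
```

Under this structure, the separate models' two sets of (F p , r p , DE, IC 50 , k out ) collapse to one set plus Sc, and the two BSV terms collapse to one, which is where the reduction of 4 fixed-effect parameters and 1 random-effect parameter comes from.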

Model estimation and evaluation

The sequential PK/PD modeling approach was used by first fixing the individual empirical Bayesian PK parameter estimates [18]. NONMEM version 7.3 was used for all modeling [21]. The LAPLACE estimation option was used for population PK parameter estimation, and the importance sampling (IMP) estimation option was used for E–R modeling. Most models in this analysis were pre-specified. Where the need for an additional model parameter was assessed in E–R modeling, a decrease in the NONMEM minimum objective function value (OFV) of 10.83, corresponding to a nominal p value of 0.001 for one additional parameter, was used as the threshold for an improved model fit. Visual predictive checks (VPCs) were used for model evaluation, each simulating 500 replicates [22].
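The correspondence between an OFV decrease of 10.83 and a nominal p value of 0.001 can be checked with a short calculation. The sketch below assumes the usual likelihood ratio argument: the OFV is minus twice the log-likelihood, and the difference between nested models differing by one parameter is compared with a chi-square distribution with 1 degree of freedom. The function name is hypothetical.

```python
import math


def lrt_p_value_1df(delta_ofv: float) -> float:
    """Upper-tail probability of a chi-square(1) variable at delta_ofv.

    For 1 degree of freedom, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)),
    where Z is standard normal.
    """
    if delta_ofv < 0:
        raise ValueError("delta_ofv must be a non-negative OFV decrease")
    return math.erfc(math.sqrt(delta_ofv / 2.0))
```

Evaluating `lrt_p_value_1df(10.83)` gives a value very close to 0.001, while the more familiar 3.84 corresponds to about 0.05.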

Results

Demographics and baseline characteristics

Baseline body weight, the only influential PK covariate, ranged between 45 and 175 kg, with a mean (SD) of 91 (22) kg. Fourteen patients had positive ADA status, nearly evenly spread across the six treatment groups. More detailed demographics and baseline covariates were reported previously [2].

Guselkumab population pharmacokinetic modeling

Parameter estimates of the confirmatory analysis are given in Table 1. Standard goodness-of-fit diagnostics (shown in Figure S1 in Supplementary Material) indicated no anomalies. PK parameter estimates and their standard errors appeared within expectations and were comparable with those of Phase 1 [3, 23]. Estimating the standard deviation of the additive component of the residual error (instead of fixing it at approximately 0.0029) resulted in an estimate of 0.0016 with an associated relative standard error (RSE) of nearly 100%, a nearly identical NONMEM OFV, and nearly identical estimates for the remaining parameters. This supported fixing this parameter based on the theoretical consideration. Figure 1 shows the VPC results. The observed 95th percentiles for the 5 mg group were abnormally high, which might be partly due to the fact that two patients incorrectly received higher doses. While the VPC accounted for the effect of the incorrect doses, the added variability might not have been fully accounted for. The exact reason is unclear, and the finding could also be due to sampling variation. Overall, the model reasonably described the observed data. The number of patients with positive ADA status was below 20, the pre-specified threshold for reliable estimation in the confirmatory analysis [16, 17], and ADA status was therefore not included as a covariate. With over eight PK samples per patient, any potential ADA effect was considered unlikely to substantially affect the quality of the individual empirical Bayesian parameter estimates, whose generation for subsequent E–R modeling was the main objective of this PK analysis.

Table 1 Population pharmacokinetic model parameter estimates
Fig. 1 Visual predictive check results of the guselkumab population pharmacokinetics model by treatment group

PASI response model

Equations 3–6 were fitted to the PASI response data. Estimation of the model was stable, and model parameter estimates are given in Table 2. Estimation precision was reasonable. VPC results are shown in Fig. 2, and the model reasonably described the observed data.

Table 2 PASI criteria and PGA exposure–response model parameter estimates
Fig. 2 Median model predictions at planned observation times and 90% prediction intervals (PI), in overlay with observed Psoriasis Area and Severity Index (PASI) response frequencies, for the separate model. PASI75, PASI90, PASI100: proportion of subjects achieving 75, 90, or 100% PASI reduction from baseline

PGA model

The number of PGA scores of 4 and 5 was relatively small, totaling 209, and these scores were therefore combined with PGA scores of 3. Consequently, n = 4 was used in Eqs. 3–6 to model the probability of achieving PGA scores of k = 0, 1, 2, or 3. The model parameter estimates are given in Table 2. Estimation precision was reasonable. VPC results are shown in Fig. 3, and the model reasonably described the observed data.

Fig. 3 Median model predictions at planned observation times and 90% prediction intervals (PI), in overlay with observed Physician’s Global Assessment (PGA) response frequencies, for the separate model. PGA0, PGA = 0; PGA01, PGA = 0 or 1; PGA012, PGA = 0, 1, or 2

PASI—PGA correlation model

Because Pc and PGA are both measures of disease activity, patients responding to one type of measure may be expected to also respond to the other type; therefore their corresponding BSV terms (η Pc and η PGA in Table 2) may be expected to be correlated. To investigate this, a BSV correlation model was employed by fitting the Pc and PGA data simultaneously, applying Eqs. 3–6 to Pc and PGA with distinct fixed and random effect parameters, and incorporating an additional correlation effect between the η Pc and η PGA using a bivariate normal distribution. This resulted in a NONMEM OFV decrease of over 300, indicating a significant improvement of the fit. The model parameter estimates are given in Table 2. The parameters for individual Pc and PGA components were similar to those obtained with the separate models. The correlation between η Pc and η PGA was estimated as 0.94. While positive correlation was expected, the high correlation magnitude suggested that one common BSV term could be sufficient.

Joint PASI—PGA model

Modeling Pc and PGA data together also provided a framework to assess the similarity of fixed effects between endpoints along with that of random effects. It could be hypothesized that, based on binding, a single latent variable could govern both endpoints through IDR models. Indeed, comparing the separate Pc and PGA model parameter estimates in Table 2 shows that the rate parameters r p and k out were relatively similar between the endpoints, along with the potency parameter IC50. Furthermore, the maximum effect parameters F p and DE appeared to differ by a factor within a relatively narrow range of 1.5–2, and the standard deviations of BSV differed similarly as well. This suggested that the underlying latent variables for the two endpoints could indeed differ by only a scale factor, and motivated the use of only one parameter Sc to jointly model Pc and PGA as in Eqs. 7–8.

The final joint model, obtained by fitting Eqs. 4–8 simultaneously to the Pc and PGA data, resulted in a NONMEM OFV increase of 64 compared with the correlation model, while using four fewer fixed-effect parameters and two fewer random-effect parameters. Compared with the separate models, the joint model used four fewer fixed-effect parameters and one fewer random-effect parameter, and yet decreased the OFV by over 200, indicating a significant improvement of the fit. This showed a similar nature of improvement in fit compared with the joint analysis of continuous and ordered categorical data [9]. The joint model parameter estimates are given in Table 2. For Pc, the joint model parameter estimates were generally similar to those obtained with the separate model. Estimation precision improved, more notably for those parameters with RSE > 20% in the separate model. This may be expected, as the parameters that are less precisely estimated may benefit more from the additional information (from PGA data). VPC results of the joint model are given in Figs. 4 and 5, which showed a similarly reasonable description of the data as Figs. 2 and 3. This is consistent with the reasonable RSE magnitudes and the similarity between the parameter estimates of the joint and the separate models.
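As a rough consistency check on these OFV comparisons, using only the bounds reported in the text (the Δ notation is introduced here for illustration and is not part of the original analysis), the change from the separate models to the joint model can be decomposed through the correlation model:

$$\Delta {\text{OFV}}_{{\text{joint}} - {\text{separate}}} = \Delta {\text{OFV}}_{{\text{correlation}} - {\text{separate}}} + \Delta {\text{OFV}}_{{\text{joint}} - {\text{correlation}}} < - 300 + 64 = - 236$$

This agrees with the reported OFV decrease of over 200 for the joint model relative to the separate models.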

Fig. 4 Median model predictions at planned observation times and 90% prediction intervals (PI), in overlay with observed Psoriasis Area and Severity Index (PASI) response frequencies, for the joint model. PASI75, PASI90, PASI100: proportion of subjects achieving 75, 90, or 100% PASI reduction from baseline

Fig. 5 Median model predictions at planned observation times and 90% prediction intervals (PI), in overlay with observed Physician’s Global Assessment (PGA) response frequencies, for the joint model. PGA0, PGA = 0; PGA01, PGA = 0 or 1; PGA012, PGA = 0, 1, or 2

Discussion

A parsimonious population E–R model for two ordered categorical endpoints, i.e., PASI criteria (PASI 75, PASI 90, and PASI 100) and PGA scores, was developed based on the Phase 2 dose-ranging study of guselkumab. This was motivated by the previous development of joint modeling of a continuous and an ordered categorical endpoint [9]. Similar to the previous scenario [9], the joint model achieved significant improvement in model fit in terms of NONMEM OFV, while using fewer parameters than the separate model. The correlation model investigation demonstrated that the improved fit of the joint model is due to the BSV correlation. However, unlike in the previous scenario [9] where additional BSV terms for the categorical endpoint could be estimated by leveraging information from the continuous endpoint, the joint model for two categorical endpoints had essentially the same number of parameters. Hence, its description of individual endpoints apparently did not improve in terms of the VPCs. In principle, separate modeling of the endpoints should result in their unbiased estimation, provided it is reasonably supported by the data. Joint modeling cannot improve on estimation accuracy; actually, accuracy could deteriorate if the joint mechanism is inappropriately assumed. Where the joint modeling stands to gain is in the precision of estimation, as evidenced by the improvement of RSEs in Table 2, achieved through the use of all relevant information. Improvement in precision is important for predictions and decision making. In terms of drug development, improving precision allows the trial objectives to be achieved with fewer subjects and reduced costs.

As endpoints, Pc is a measure of change from baseline by nature while PGA is an absolute measure. This raises the conceptual question of why their analyses could be pooled. From a theoretical perspective, this may be due to the fact that the categories analyzed represent relatively large improvements in efficacy, thus making the conceptual difference smaller. Indeed in psoriasis drug development, Pc and PGA are used somewhat interchangeably as efficacy measures. Furthermore, when endpoints are measured at the same time, the residual level correlation may also be modeled [24]; this is not expected to affect the separate endpoint predictions but can be important in the predictions of achieving joint criteria using both endpoints [25].

Since the categorical endpoint Pc is derived from PASI scores, it is natural to ask whether PASI score could be used as a latent variable to model Pc. Indeed, previous PASI-related longitudinal E–R modeling [4,5,7], including that used to guide the design of the Phase 2 dose-ranging study of guselkumab, first established an E–R model of PASI score as a continuous variable, and then used the model to predict the various categories of Pc. To investigate this issue, it is helpful to first examine the predictive ability of the earlier model [5], which is of interest by itself. For this purpose, an external VPC of the earlier model was conducted with the Phase 2 PASI data by treatment, shown in Fig. S2 in Supplementary Material. The model predicted the median PASI scores very well, which in part explained why the Phase 2 study met its objectives by achieving the targeted treatment separations. On the other hand, the continuous PASI score model could not sufficiently predict Pc. As Fig. S2 shows, the model under-predicted the 5th percentiles, which could not be improved even after updating the model with Phase 2 data. To our knowledge, no continuous PASI score models have yet been published that accurately describe Pc. This could be explained by two reasons. Firstly, most previous model development used the additive-plus-proportional residual error model [4,5,7], which is ill-behaved in this case, as the likelihood conceptually could become arbitrarily large due to the fact that PASI = 0 may be observed. Secondly, PASI scores are Bounded Outcome Scores (BOS), which take a discrete, restricted set of values on a finite range [26]. BOS data often demonstrate non-standard, e.g., J- or U-shaped, distributions, and data transformations aiming to achieve normality can be much more difficult than for many other types of skewed distributions.
For example, beta-regression has often been used for BOS modeling using a transformation of original scores to the open interval (0, 1), by shrinking values at the boundary by an apparently innocuous small factor of 0.01 [27, 28]. Theoretical examination of the effect of this small factor shows that reducing its value sufficiently could make the influence of boundary values arbitrarily large, thus rendering the interpretation of the results dubious. Other approaches explicitly dealing with the nature of BOS have been developed, including Estimating Transformations [29], Coarsened Grid with or without transformations [26, 30], and a nonparametric approach [31]. Each of these explicit BOS approaches has conceptual appeal; indeed, extensive efforts applying these approaches led to various degrees of improvement in describing the PASI score distributions. Unfortunately, none resulted in any improvement in describing Pc. This appears to be due to the fact that, in order to accurately describe Pc, the continuous PASI score model must accurately predict not only the mean data, but also the entire distribution. This is extremely difficult for a clinical endpoint with highly skewed distributions. The high efficacy of guselkumab resulted in just such a data distribution with most scores on the low end, near or achieving 0. For this reason, modeling Pc directly as an ordered categorical variable through a presumed unobservable latent variable is likely a more effective approach than using the “true” latent variable, even though the latter is observed (and thus no longer “latent”).
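The sensitivity of beta-regression to the boundary shrinkage factor can be illustrated with a small calculation. The sketch below is not the transformation used in [27, 28]; it simply assumes that a score at the lower boundary is moved to a small value `delta` in (0, 1) and evaluates the log-density of a beta distribution there, with arbitrary shape parameters a = 2 and b = 5 chosen for illustration.

```python
import math


def beta_logpdf(y: float, a: float, b: float) -> float:
    """Log-density of a Beta(a, b) distribution at y in (0, 1)."""
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y) - log_beta_fn


def boundary_loglik(delta: float, a: float = 2.0, b: float = 5.0) -> float:
    """Log-likelihood contribution of a boundary score shifted to `delta`.

    The shape parameters are illustrative assumptions, not fitted values.
    """
    return beta_logpdf(delta, a, b)


# The contribution keeps decreasing as delta shrinks, without bound
for delta in (1e-2, 1e-4, 1e-6):
    print(delta, round(boundary_loglik(delta), 2))
```

Because the term (a − 1) log(delta) diverges as delta approaches 0 whenever a ≠ 1, a single boundary observation can dominate the fit if the shrinkage factor is made small enough, which is the concern raised above.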

While the Phase 1 model had limited ability to predict Pc, it led to a Phase 2 study that, in light of its outcome, was considered well-designed. This shows that the model does not have to be perfect to make correct decisions. Indeed, if the model were perfect, there would be no need for additional clinical trials. The drug development paradigm does demand increased precision as the stages progress, as later stage decisions incur increased costs. That is, levels of precision that are acceptable at early stages may not suffice for later stages, and improvements should be consistently sought as information accumulates.

A general issue with ordered categorical data is how many levels should be modeled. Increasing the number of categories modeled increases the use of available information in the data, provided that estimation of the intercept parameters can be supported. On the other hand, not all levels are of similar interest. For example, achieving PGA = 0 and PGA ≤ 1 are considered more relevant efficacy markers. During the development of the separate PGA model, we initially attempted to further combine PGA level 2 with those ≥3. However, this resulted in model non-convergence. Detailed investigation showed that having the PGA level 2 modeled is important for the estimation of placebo effect, which also affects the entire model estimation [3]. In Fig. 3, the placebo effect for PGA is visually discernible only in the PGA ≤ 2 group. This supported the relevance of retaining this level in the model even though it is not of direct clinical interest. While in theory placebo effect estimation could be supported by the relative differences among active treatment groups and thus not always require observing a trend in the placebo group, confounding may occur and precise estimation may become difficult in such situations, especially when the effect is small. It is of interest to note that estimation of the separate PASI Pc model could be supported without the need of including additional levels, e.g., PASI 50. Unnecessarily including such levels would not only deviate from the main interest of modeling, but also cause lack of fit due to the skewness of the PASI score distribution discussed above.

While the importance of model precision may seem obvious, there can be a tendency to sacrifice the use of relevant information for the sake of convenience. For example, there is a view that landmark E–R analysis is often sufficient even when longitudinal data are available [32]. This optimistic view may be due to the fact that decisions often focus on predictions based on only point estimates, and the uncertainties are ignored. This may in part have contributed to the fact that few publications report observations that confirm earlier phase model predictions, especially for clinical efficacy modeling based on proof-of-concept trials where the uncertainty may be most pronounced [5]. Recent IDR model developments have made longitudinal modeling more adaptable to clinical efficacy endpoints of various types, and relevant information should be used to increase the E–R analysis precision whenever possible.