1 Introduction

In oncology and other therapeutic areas, it is typical in clinical practice for patients to receive not a single frontline treatment, but a series of treatments, with each treatment based on the patient’s treatment history and intermediate outcomes. This is certainly the case for acute myelogenous leukemia (AML) and myelodysplastic syndrome (MDS), cancers of the myeloid line of blood cells, where patients are treated with a frontline chemotherapy combination to induce complete remission, and if the patient suffers a relapse or resistant disease, a salvage chemotherapy or hematopoietic stem cell transplant is given in a second attempt to induce remission. Estey et al. [4] analyze the results of an AML-MDS trial with this very design, in which 210 patients with leukemia were randomized to one of four combination frontline treatments. The salvage therapies in this study were not randomized and were ignored in the original analysis. The concern for AML patients is that the optimal frontline chemotherapy may very well depend on the salvage treatment that could follow. Additionally, age, cytogenetic abnormality, and other characteristics are known to influence disease remission and may further alter the choice of frontline therapy for treating AML. By incorporating these patient characteristics, the optimal frontline treatment may be tailored, but ignoring the salvage therapies and post-baseline characteristics may bias the results. Taking the salvage therapies into account, Wahed and Thall [16] analyzed these data using a likelihood-based approach, and Xu et al. [19] improved upon it using Bayesian non-parametrics, though neither incorporates patient heterogeneity. In this paper, we individualize the optimal treatment at each stage by taking patient covariates into account within the framework of dynamic treatment regimes.

A dynamic treatment regime (DTR), also known as an adaptive treatment strategy (ATS), is a sequence of decision rules that guides treatment choices over the course of therapy. The sequence of treatments a patient receives depends on the patient’s health status, response to prior treatments, and other patient characteristics (Murphy [11]; Robins et al. [14]; Chakraborty and Murphy [3]). The goal is to find a DTR that optimizes the overall outcome, commonly taken as overall survival in cancer or other rapidly fatal diseases (Wahed and Tsiatis [17]; Wahed and Thall [16]; Huang et al. [9]). Techniques for analyzing DTRs must properly account for patient responses and the sequence of treatments in order to correctly identify the optimal treatment at each stage. To see this, consider a game of chess, where each player’s turn corresponds to a stage of a DTR, each move corresponds to a treatment assignment, and achieving checkmate corresponds to optimizing the outcome. The player’s best move in each turn depends on his/her previous and future moves. If the chess player optimizes each move individually, without regard to past or future moves, he/she will likely not achieve checkmate. Similarly, if the treatments at each stage of a DTR are compared without regard to past and future treatments, biased results may occur. To this end, there are primarily two approaches: structural mean models (direct search) and nested mean models (inductive search).

Structural mean models use weighting techniques found in survey sampling to consistently estimate the mean outcome for a specific frontline and salvage therapy (regime), allowing a direct comparison of the outcome across regimes. Inverse-probability-of-treatment-weighted (IPTW) estimators, as their name suggests, average the outcome for all subjects following a specific regime, while weighting each observation by the inverse of the probability of receiving the treatments prescribed by the regime, similar to the Horvitz–Thompson estimator (Horvitz and Thompson [8]). Another method, the g-computation estimator, first finds the mean of each intermediate outcome of a particular regime and then weights these means by the proportion who followed each path to form a mean of the regime (Robins [13]). For a detailed look at these methods in the survival analysis setting, see [10]. In contrast, nested mean models use backwards induction to find the optimal treatment at each stage of treatment modification. Murphy [11], Robins [15], and others pioneered the use of backwards induction in statistics via Q-learning, A-learning, and g-estimation to identify such optimal regimes. These algorithms work backwards in time, identifying at each stage which treatment has the largest expected outcome and creating pseudo data for each subject by replacing his/her observed outcomes with the estimated optimal expected outcome at later stages, given prior observed outcomes and covariate information. The optimal treatment at each stage is the one with the largest expected value of this pseudo data. As in a basic randomized clinical trial, a subgroup analysis can be performed for either structural or nested mean models to see if the marginal results hold throughout, or if the optimal treatment regime depends on patient information.

As ever larger studies collect more patient data, it is natural to turn to variable selection methods when searching for the optimal regime. Common approaches to variable selection include, but are not limited to, the forward, backward, and step-wise selection methods, which by their nature are discrete processes, and the least absolute shrinkage and selection operator (LASSO) and its derivatives, which are continuous processes. To operate, all of these methods rely on a measure of model fit or prediction error, such as the sum of squared errors, the leave-one-out cross-validation estimate of prediction error, or the Akaike information criterion (AIC). These variable selection methods are designed to sift through a large collection of variables and identify those that most greatly reduce the variability and increase the accuracy of the estimator, which Gunter et al. [5] define as predictive variables. However, in the realm of dynamic treatment regimes, we are interested in variables that are not only predictive, but also help prescribe the optimal treatment for a given patient. Such variables, known as prescriptive variables (Hollon and Beck [7]), must qualitatively interact with treatment. For a nested mean model approach, Gunter et al. [5] proposed two ranking methods that sort variables according to how likely they are to qualitatively interact with treatment, and provided a four-step algorithm involving LASSO regression on nested subsets of covariates for selecting important predictive variables. Zhang [20] generalized from the least squares regression model and offered a simpler, more effective two-step method involving multivariate adaptive regression spline (MARS) models and LASSO-penalized logistic regression to identify prescriptive covariates.

Most authors employing structural mean models perform a single marginal analysis, comparing dynamic treatment regimes for the entire sample of patients (Wahed and Thall [16]; Xu et al. [19]). Those that perform a subgroup analysis using conditional models do so by conditioning on baseline information only (Hernan et al. [6]; Chakraborty and Murphy [3]). While these conditional models shed some light on the regime effects across baseline covariates, they lack the ability of Q-learning and other backwards induction techniques to use past and current patient information to prescribe the optimal treatment at each stage. This paper explores a new method for optimizing dynamic treatment regimes using sequential structural mean models that incorporate current patient information at every stage (decision point), and uses an effective prescriptive variable selection method following Zhang [21]. Additionally, we use a simulation study to evaluate these methods and apply them to the phase II study mentioned above concerning 210 patients with acute myeloid leukemia (AML) or high-risk myelodysplastic syndrome (MDS) (Estey et al. [4]).

2 Dynamic Treatment Regimes and Corresponding Terminology

Consider a two-stage sequential multiple assignment randomized trial (SMART) design structured after the AML-MDS trial, where patients are randomized to one of four induction therapies, \(\mathcal {A}=\{a_1,a_2,a_3,a_4\}\) (see Fig. 1). A patient could die, the disease could become resistant to the initial treatment, the patient could respond (complete remission), or he/she could experience disease progression after complete remission. For each of the induction therapies, if treatment resistance or progression following complete remission is observed, patients are further randomized to one of two salvage treatments, \(\mathcal {B}=\{b_1,b_2\}\). This design allows for inference on sixteen DTRs that might be carried out in clinical practice, namely \(d(A_i=a_j;B_{1i}=b_k,B_{2i}=b_l),\ j=1,...,4,\ k=1,2,\ l=1,2\) where \(d(A_i;B_{1i},B_{2i})\) stands for “Treat with \(A_i\); if the patient is resistant to \(A_i\) treat with \(B_{1i}\), or if the patient responds to \(A_i\) (complete remission) but later experiences disease progression treat with \(B_{2i}\).” Our goal is to find the optimal treatment regime among these that maximizes expected survival time.

Fig. 1 Possible pathways, transition times, and salvage therapy following induction treatment

Let \(T^D_i\), \(T^{R}_i\), \(T^{RD}_i\), \(T^C_i\), \(T^{CP}_i\), \(T^{PD}_i\), and \(T^{CD}_i\), respectively, denote the observed time to death if neither remission nor resistance was observed, the observed time to resistance and the observed time from resistance to death if resistance is observed, the observed time to complete remission, the observed time from complete remission to disease progression, the observed time from progression to death, and the observed time from complete remission to death if complete remission is observed. Using the above sojourn times, each patient’s survival time can be expressed as

$$\begin{aligned} T_{i}=\left\{ \begin{array}{ll} T^D_i, &{} R_{1i}=0 \\ T^R_i+T^{RD}_i, &{} R_{1i}=1 \\ T^{C}_i+T^{CP}_i+T^{PD}_i, &{} R_{1i}=2,\ R_{2i}=1\\ T^C_i+T^{CD}_i, &{} R_{1i}=2,\ R_{2i}=0, \end{array}\right. \end{aligned}$$

where \(R_{1i}\) indicates whether a patient fails, is resistant, or experiences complete remission, and \(R_{2i}\) indicates whether or not those that experienced complete remission later experience disease progression. \(R_{1i}\) and \(R_{2i}\) index the paths of each treatment regime.

In the presence of non-informative right censoring, one might consider the restricted survival time where total follow-up time is limited to L, where L is some value less than the maximum survival time for all patients. Therefore, the survival time for all patients will be truncated at L, \(T^L=\)min\((T,L)\). For ease of notation, we will drop the superscript and simply use T. We will denote the \(i^{th}\) patient’s censoring time by \(C_i\) and the survival distribution of \(C_i\) by \(K(t)=P(C_i>t)\). Define \(U_i=\)min\((T_i,C_i)\) and \(\Delta _i=I(T_i \le C_i)\), respectively, to be the observed time to event (death or censoring) and the death indicator. It is possible that \(C_i<T_i\), so that for a single patient some of the sojourn times are censored while others are observed. Therefore, \(U_i\) can be expressed as

$$\begin{aligned} U_{i}=\left\{ \begin{array}{ll} U^D_i, &{} R_{1i}=0 \\ T^R_i+U^{RD}_i, &{} R_{1i}=1 \\ T^C_i+T^{CP}_i+U^{PD}_i, &{} R_{1i}=2,\ R_{2i}=1\\ T^C_i+U^{CD}_i, &{} R_{1i}=2,\ R_{2i}=0, \end{array}\right. \end{aligned}$$

where \(R_{1i}\)=0 if a patient fails or is censored prior to observing \(R_{1i}\); \(R_{2i}\)=0 if a patient dies after complete remission or is censored after complete remission prior to observing \(R_{2i}\); \(U^D_i\) = min\((T^D_i,C_i)\); \(U^{RD}_i\) = min\((T^{RD}_i,C_i-T^R_i)\); \(U^{PD}_i\) = min\((T^{PD}_i,C_i-T^{CP}_i-T^C_i)\); and \(U^{CD}_i\) = min\((T^{CD}_i,C_i-T^C_i)\).

Then, we introduce further indicators for the first and second stage treatments: \(Z^{(A)}_{ji}\)=\(I\{A_i=a_j\}\) equals 1 if patient i received the \(j^{th}\) induction therapy and 0 otherwise; \(Z^{(B_1)}_{ki}\)=\(I\{B_{1i}=b_k\}\) and \(Z^{(B_2)}_{li}\)=\(I\{B_{2i}=b_l\}\) denote the salvage treatment assignment indicators, defined only if \(R_{1i}\)=1 or \(R_{2i}=1\), respectively; and \(G^H_i(t)\) denotes the information collected on patient i prior to time t. Using the observed data, one can create treatment regime indicators as \(d_i(a_j;b_k,b_l)\) = \(Z^{(A)}_{ji}\Big (I\{R_{1i}=0\} + I\{R_{1i}=1\}Z^{(B_1)}_{ki} +I\{R_{1i}=2\}I\{R_{2i}=1\}Z^{(B_2)}_{li} + I\{R_{1i}=2\}I\{R_{2i}=0\} \Big )\).

By design, treatments are assigned independently of prognosis or any observed data measured prior to the second stage. Therefore, \(P\big (Z^{(A)}_{ji}=1\big )=\pi ^{(A)}_{j},\) \(P\big (Z^{(B_1)}_{ki}=1\big |R_{1i}=1\big )=\pi ^{(B_1)}_{k},\) and \(P\big (Z^{(B_2)}_{li}=1\big |R_{1i}=2,R_{2i}=1\big )=\pi ^{(B_2)}_{l},\) where \(\pi ^{(A)}_j\), \(\pi ^{(B_1)}_k\), and \(\pi ^{(B_2)}_l\) are known randomization probabilities. These three conditions are often referred to as the no unmeasured confounders or sequential randomization assumption. This ‘no unmeasured confounders’ condition holds even if the second-stage randomization probabilities depend on the first-stage treatment assignments.

3 Structural Mean Models for Dynamic Treatment Regimes

3.1 Structural Mean Models Conditional on Baseline Information

To estimate the mean of each dynamic treatment regime, one can use structural mean models and then compare the means to determine the optimal regime. Inverse-probability-of-treatment weighting (IPTW) and g-computation are two such methods. Wahed and Tsiatis [17] provide a nice discussion of the former in the context of survival analysis with no adjustment for covariates.

For an efficacy estimand for \(A=a_j\), \(T_i\) would be set to missing when \(R_{1i}=1\) or \(R_{2i}=1\), and a pattern-mixture model would be employed. If intention to treat were followed for an effectiveness estimand for \(A=a_j\), the observed \(T_i\) would be utilized when \(R_{1i}=1\) or \(R_{2i}=1\), regardless of \(B_{1i}\) or \(B_{2i}\). Our focus will be to estimate the effectiveness estimand \(\mu _{jkl}=E[T_i|d_i(a_j;b_k,b_l)=1]\), \(j=1,2,3,4,\ k,l=1,2,\) the mean survival time for those following a given regime, for specific A, \(B_{1}\), and \(B_{2}\). Since our SMART design allows us to confidently assume no unmeasured confounders, each regime mean is representative of the expected outcome had the entire sample of patients followed that regime. Recall that patients following \(d(a_j;b_k,b_l)\) are a mixture of four groups. We can use data from these patients to infer about \(\mu _{jkl}\), accounting for the two stages of randomization. If there were no randomization, and if everyone in the sample were treated using the same DTR, we would use the sample average \(n^{-1}\sum _{i=1}^{n}T_i\) to estimate \(\mu \). If there were only one stage of randomization, we would consider using \(\sum _{i=1}^{n}{Z^{(A)}_{ji}}T_i/\sum _{i=1}^n Z^{(A)}_{ji}\)=\(n^{-1}\sum _{i=1}^{n}(Z^{(A)}_{ji}/\hat{\pi }^{(A)}_{j})T_i\) \(\approx \) \(n^{-1}\sum _{i=1}^{n}(Z^{(A)}_{ji}/\pi ^{(A)}_{j})T_i\). To account for the two stages of randomization, we consider the quantity

$$\begin{aligned} W_{jkli}= & {} \frac{Z^{(A)}_{ji}}{\pi ^{(A)}_{j}}\left( I\{R_{1i}=0\} + I\{R_{1i}=1\}\frac{Z^{(B_1)}_{ki}}{\pi ^{(B_1)}_{k}} +I\{R_{1i}=2\}I\{R_{2i}=1\}\right. \\&\quad \left. \frac{Z^{(B_2)}_{li}}{\pi ^{(B_2)}_{l}}+I\{R_{1i}=2\}I\{R_{2i}=0\} \right) . \end{aligned}$$

Note that \(W_{jkli}T_i\) is nonzero only for patients who are treated according to \(d(a_j;b_k,b_l)\), and based on the assumptions in Sect. 2, \(W_{jkli}T_i\) has expectation equal to \(\mu _{jkl}\), which implies that an unbiased estimator of \(\mu _{jkl}\) is simply the empirical average, \(\frac{1}{n}\sum _{i=1}^nW_{jkli}T_i\). When censoring is present, the above result should be modified slightly. Using the observed data, the estimator for \(\mu _{jkl}\) becomes \(n^{-1}\sum _{i=1}^n\big (\Delta _i/\hat{K}(U_i)\big )W_{jkli}U_i,\) where \(\hat{K}(t)\) is the Kaplan–Meier estimator or any other consistent estimator of the censoring survival distribution.
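To make the weighting concrete, the following is a minimal Python sketch (not part of the original analysis) of the censoring-adjusted IPTW estimate of \(\mu _{jkl}\). The censoring survivor function \(\hat{K}(t)\) is estimated by a simple Kaplan–Meier fit that treats censoring as the event; the array names and the small lower bound on \(\hat{K}\) are illustrative assumptions.

```python
import numpy as np

def censoring_km(u, delta):
    """Kaplan-Meier estimate of K(t) = P(C > t), obtained by treating censoring
    (delta == 0) as the 'event'; ties are handled crudely, which suffices for a sketch."""
    order = np.argsort(u)
    u_sorted, cens = u[order], 1 - delta[order]
    n = len(u_sorted)
    at_risk = n - np.arange(n)                       # subjects still at risk at each sorted time
    surv = np.cumprod(1.0 - cens / at_risk)          # K evaluated just after each sorted time

    def K(t):
        idx = np.searchsorted(u_sorted, t, side="right") - 1
        return np.where(idx < 0, 1.0, surv[np.clip(idx, 0, n - 1)])

    return K

def regime_mean(u, delta, r1, r2, z_a, z_b1, z_b2, pi_a, pi_b1, pi_b2):
    """Estimate mu_jkl = n^{-1} sum_i (Delta_i / K_hat(U_i)) W_jkli U_i for regime
    d(a_j; b_k, b_l); z_a, z_b1, z_b2 are 0/1 indicators that the subject received the
    regime's treatments at each stage (set z_b1/z_b2 to 0 when the stage was not reached)."""
    w = (z_a / pi_a) * ((r1 == 0)
                        + (r1 == 1) * z_b1 / pi_b1
                        + (r1 == 2) * (r2 == 1) * z_b2 / pi_b2
                        + (r1 == 2) * (r2 == 0))
    K = censoring_km(u, delta)
    return float(np.mean(delta / np.maximum(K(u), 1e-8) * w * u))
```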

In a basic randomized clinical trial, the mean outcome for each treatment group is estimated and compared to see which treatment has the largest expected outcome, assuming larger outcomes are better. Similarly, the marginal estimators above are useful for comparing the mean outcomes across treatment regimes to identify which treatment regime has the largest expected outcome. As in a basic randomized clinical trial, a subgroup analysis can be performed to see if the marginal results hold throughout, or if the optimal treatment regime depends on patient characteristics. Following Robins et al. [14], Orellana and Rotnitzky [12], and Wang and Zhao [18], the estimator for mean survival time in the presence of censoring can be extended to the regression setting to adjust for baseline covariates using an accelerated failure time (AFT) model via the least squares estimating equation

$$\begin{aligned} \mathcal {U}^{}_n(\varvec{\theta })=\sum _{i=1}^n\sum _{j=1}^4\sum _{k=1}^2\sum _{l=1}^2\frac{\Delta _i}{\hat{K}(U_i)}W_{jkli}\bigg \{\frac{\partial m}{\partial \varvec{\theta }}\bigg \}^T\Bigg [\text {log}U_i-m\Big (X_{i},\varvec{d}_i,\varvec{\theta }\Big )\Bigg ]=0, \end{aligned}$$
(1)

where \(\{\}^T\) is the transpose operator, \(X_{i}\) is a vector of baseline covariates from \(G^H_i(0)\), \(\varvec{d}_i\)=\(\big [d_i(a_1;b_1,b_1),\) ...,\(d_i(a_4;b_2,b_2)\big ]^{T}\), \(m\big (X_{i},\varvec{d}_i,\varvec{\theta }\big )\) is the mean function, \(\epsilon _i\)=\(\text {log}T_i-m(X_i,\varvec{d}_i,\varvec{\theta })\), and \(\mu \big (X_{i},\varvec{d}_i,\varvec{\theta }\big )\) \(\equiv \) \(\text {exp}\big \{m\big (X_{i},\varvec{d}_i,\varvec{\theta }\big )\big \}\)E[\(e^{\epsilon _i}\)] \(\propto \) \(\text {exp}\big \{m\big (X_{i},\varvec{d}_i,\varvec{\theta }\big )\big \}\) (note that E[\(e^{\epsilon _i}\)] may be very different from 1, but as a common multiplicative constant it does not affect which regime maximizes the mean). For example, \(m\big (X_{i},\varvec{d}_i,\varvec{\theta }\big )\) could be modeled as

$$\begin{aligned} m\big (X_{i},\varvec{d}_i,\varvec{\theta }\big )=X_{i}^{T}\varvec{\beta } + d_i(a_1;b_1,b_1)X_{i}^{T}\varvec{\alpha }_{111}+\dots +d_i(a_4;b_2,b_2)X_{i}^{T}\varvec{\alpha }_{422}, \end{aligned}$$
(2)

where \(\varvec{\theta }=\{\varvec{\beta }^{T},\varvec{\alpha }_{111}^{T},\varvec{\alpha }_{112}^{T},\cdots ,\varvec{\alpha }_{422}^{T}\}^T\), and \(X_{i}\) contains an element equal to 1 corresponding to an intercept term. One may prefer to model \(T_i\) instead of log\(T_i\) with \(m\big (X_{i},\varvec{d}_i,\varvec{\theta }\big )\) given as

$$\begin{aligned} m\big (X_{i},\varvec{d}_i,\varvec{\theta }\big )=\text {exp}\{X_{i}^{T}\varvec{\beta } + d_i(a_1;b_1,b_1)X_{i}^{T}\varvec{\alpha }_{111}+\dots +d_i(a_4;b_2,b_2)X_{i}^{T}\varvec{\alpha }_{422}\}, \end{aligned}$$

which utilizes a log link between the mean outcome and the linear predictor. The preliminary optimal treatment regime, the one with the largest expected outcome, is given by

$$\begin{aligned} d^{opt}(X_{i})=\{d(a_{j^{*}};b_{k^{*}},b_{l^{*}}):\ a_{j^{*}},b_{k^{*}},b_{l^{*}}=\underset{a_j,b_k,b_l}{{\textit{argmax}}}\ \mu _{}(X_{i},\varvec{d}_i,\varvec{\theta })\}. \end{aligned}$$
(3)

We use the term ‘preliminary’ when referring to an optimal regime that is conditional on baseline information, but marginalized over stage 2 information. The optimal frontline treatment is given by \(A^{opt}(X_i)\)=\(\underset{a_j}{{\textit{argmax}}}\ \{\underset{b_k,b_l}{{\textit{max}}}\ \mu _{}(X_{i},\varvec{d}_i,\varvec{\theta })\}\). This is the frontline therapy corresponding to the preliminary optimal regime.

To implement this estimating equation, one would create sixteen copies of the analysis data set, one for each regime, each with a distinct value of \(\big (\Delta _i/\hat{K}(U_i)\big )W_{jkli}\). The indicators \(d_i(a_{j^{'}};b_{k^{'}},b_{l^{'}})\), where \(j^{'}\ne j\) or \(k^{'}\ne k\) or \(l^{'}\ne l\), would be artificially set to zero so that the observations with non-zero weights in a given copy of the data set belong to only one regime. This effectively replicates the observations that are consistent with more than one regime (Chakraborty and Murphy [3]), which is needed because the regime “arms” of the study do not form mutually exclusive groups of patients. These sixteen data sets would then be stacked one on top of another and submitted to a software package for a weighted regression. Treating \(\hat{K}(U_i)\) as known, the empirical sandwich estimator of the covariance matrix for the parameter estimates can be used to draw inference when comparing regime means. This is important since the regimes are not independent. When treatment assignment is not random, as will be the case in Sects. 5 and 6, the treatment assignment probabilities can be modeled using logistic regression; this is important in order to maintain the no unmeasured confounders assumption. In this case, both \(\hat{K}(U_i)\) and \(\hat{W}_{jkli}\) would be treated as known and the empirical sandwich covariance estimator still employed, as is customary when regression models are fit with estimated inverse probability weights. Many authors make it a point to explore this additional source of variability, but that is outside the scope of this paper and beyond the interest of most applications.
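As an illustration of this stacking scheme, here is a minimal Python sketch assuming pandas/statsmodels and illustrative column names (`U`, `id`, `regime`, `w`); it is not the authors' code. The design uses a cell-means parameterization (a separate intercept and slope per regime), which is equivalent to model (2) up to reparameterization, and the sandwich covariance is clustered on subject id to reflect the replicated observations.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def stack_regime_copies(df, regimes, weight_fn):
    """Create one copy of the analysis data set per regime d(a_j; b_k, b_l), each
    carrying the weight (Delta_i / K_hat(U_i)) * W_jkli.  weight_fn(df, j, k, l)
    returns that weight and is zero for subjects inconsistent with the regime."""
    copies = []
    for (j, k, l) in regimes:
        cp = df.copy()
        cp["regime"] = f"a{j};b{k},b{l}"
        cp["w"] = weight_fn(df, j, k, l)
        copies.append(cp)
    return pd.concat(copies, ignore_index=True)

def fit_weighted_aft(stacked, baseline_cols):
    """Weighted least squares fit of log U on regime indicators and their interactions
    with baseline covariates (a cell-means version of model (2)), with a sandwich
    covariance clustered on subject id to account for the replicated rows."""
    D = pd.get_dummies(stacked["regime"], dtype=float)          # regime indicators d_i
    X = D.copy()
    for col in baseline_cols:
        for reg in D.columns:
            X[f"{reg}:{col}"] = D[reg] * stacked[col]           # regime-by-covariate terms
    return sm.WLS(np.log(stacked["U"]), X, weights=stacked["w"]).fit(
        cov_type="cluster", cov_kwds={"groups": stacked["id"]})
```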

It should be noted that the above regression model only incorporated baseline information, yet patient information is available throughout the trial. Using the law of total expectation, the mean survival time under a regime of interest that is conditional on all possible patient information is given by

$$\begin{aligned}&E[T_{i}|X_i,\bar{X}^R_i,\bar{X}^C_i,\bar{X}^P_i,d_i(A;B_1,B_2)=1]\nonumber \\&\quad =P(R_{1i}=0|A_i,X_{i})E\left[ T^D_{i}|A_i,X_{i},R_{1i}=0\right] \nonumber \\&\qquad + P(R_{1i}=1|A_i,X_{i})\bigg \{E[T^{R}_{i}|A_i,{X}_{i},R_{1i}=1]+E[T^{RD}_{i}|A_i,B_{1i},\bar{X}^R_{i},R_{1i}=1]\bigg \} \nonumber \\&\qquad +P(R_{1i}=2|A_i,X_{i})P(R_{2i}=1|R_{1i}=2,A_i,\bar{X}^C_i)\bigg \{E[T^C_{i}|A_i,{X}_{i},R_{1i}=2]\nonumber \\&\qquad +E[T^{CP}_{i}|A_i,\bar{X}^C_{i},R_{1i}=2,R_{2i}=1]+E[T^{PD}_{i}|A_i,B_{2i},\bar{X}^P_{i},R_{1i}=2,R_{2i}=1]\bigg \} \nonumber \\&\qquad + P(R_{1i}=2|A_i,X_{i})P(R_{2i}=0|R_{1i}=2,A_i,\bar{X}^C_i)\bigg \{E[T^{C}_{i}|A_i,X_{i},R_{1i}=2]\nonumber \\&\qquad +E[T^{CD}_{i}|A_i,\bar{X}^C_{i},R_{1i}=2,R_{2i}=0]\bigg \}, \end{aligned}$$
(4)

where \(\bar{X}^R_{i}\), \(\bar{X}^C_{i}\), and \(\bar{X}^P_{i}\) are vectors of covariates from \(G^H_i(T^R_i)\), \(G^H_i(T^C_i)\), and \(G^H_i(T^C_i+T^{CP}_i)\), respectively. Because we have no unmeasured confounders, it is as though we can peek into alternate universes and see all of the potential outcomes a prospective patient would have for each of the different response groups, given his/her information at each stage. We gather all of that patient information together and combine it into a composite score in Eq. (4). In practice, Eq. (4) is not very useful unless we are willing to consider specific patient information for every intermediate outcome. Nevertheless, one can set up estimating equations, for example,

$$\begin{aligned} \sum _{i=1}^n\frac{\Delta _i}{\hat{K}(U_i)}I\{R_{1i}=1\}\bigg \{\frac{\partial m^{RD}}{\partial \varvec{\theta }^{RD}}\bigg \}^T\Big \{\text {log}U^{RD}_i-m^{RD}\big (\bar{X}^{R}_i,A_i,B_{1i},\varvec{\theta }^{RD}\big )\Big \}=0 \end{aligned}$$
(5)

with mean model \(m^{RD}\big (\bar{X}^{R}_i,A_i,B_{1i},\varvec{\theta }^{RD}\big )\) and \(\mu ^{RD}\big (\bar{X}^{R}_i,A_i,B_{1i},\varvec{\theta }^{RD}\big )\) \(\equiv \) \(E[T^{RD}_i|\bar{X}^{R}_i,A_i,B_{1i},\varvec{\theta }^{RD}]\) \(\approx \) \(\text {exp}\Big \{m^{RD}\big (\bar{X}^{R}_i,A_i,B_{1i},\varvec{\theta }^{RD}\big )\Big \}\) for those with \(R_{1i}=1\), to model the sojourn times of each path. Then, a g-computation model for \(E[T_{i}|X_i,\bar{X}^R_i,\bar{X}^C_i,\bar{X}^P_i,d_i(A;B_1,B_2)=1]\) that is conditional on baseline and follow-up information can be created using

$$\begin{aligned}&\mu \Big (X_i,\bar{X}^R_i,\bar{X}^C_i,\bar{X}^P_i,d_i(A;B_1,B_2)=1,\varvec{\theta },\varvec{\psi }\Big )\nonumber \\&\quad =P(R_{1i}=0|A_i,X_{i},\varvec{\psi }_1)\bigg \{\mu ^D(A_i,X_{i},\varvec{\theta }^D)\bigg \} \nonumber \\&\qquad + P(R_{1i}=1|A_i,X_{i},\varvec{\psi }_1)\bigg \{\mu ^R(A_i,{X}_{i},\varvec{\theta }^R)+\mu ^{RD}(A_i,B_{1i},\bar{X}^R_{i},\varvec{\theta }^{RD})\bigg \} \nonumber \\&\qquad +P(R_{1i}=2|A_i,X_{i},\varvec{\psi }_1)P(R_{2i}=1|A_i,\bar{X}^C_{i},\varvec{\psi }_2)\bigg \{\mu ^C(A_i,{X}_{i},\varvec{\theta }^C)\nonumber \\&\qquad +\mu ^{CP}(A_i,\bar{X}^C_{i},\varvec{\theta }^{CP})+\mu ^{PD}(A_i,B_{2i},\bar{X}^P_{i},\varvec{\theta }^{PD})\bigg \} \nonumber \\&\qquad + P(R_{1i}=2|A_i,X_{i},\varvec{\psi }_1)P(R_{2i}=0|A_i,\bar{X}^C_{i},\varvec{\psi }_2)\bigg \{\mu ^C(A_i,X_{i},\varvec{\theta }^C)\nonumber \\&\qquad +\mu ^{CD}(A_i,\bar{X}^C_{i},\varvec{\theta }^{CD})\bigg \}, \end{aligned}$$
(6)

where \(\varvec{\theta }=[\varvec{\theta }^D,\varvec{\theta }^R,\varvec{\theta }^{RD},\varvec{\theta }^C,\varvec{\theta }^{CP},\varvec{\theta }^{PD},\varvec{\theta }^{CD}]\) and \(\varvec{\psi }=[\varvec{\psi }_1,\varvec{\psi }_2]\). \({P}(R_{1i}=r|A_i,X_i,\varvec{\psi }_1)\) and \({P}(R_{2i}=s|R_{1i}=2,A_i,\bar{X}^C_i,\varvec{\psi }_2)\) can be modeled through logistic regression. The optimal regime, the one with the largest expected outcome, is given by

$$\begin{aligned} d^{opt}(X_i,\bar{X}^R_i,\bar{X}^C_i,\bar{X}^P_i)=\{d(a_{j^{*}};b_{k^{*}},b_{l^{*}}):\ a_{j^{*}},b_{k^{*}},b_{l^{*}}=\underset{a_j,b_k,b_l}{{\textit{argmax}}}\ \mu \Big (X_i,\bar{X}^R_i,\bar{X}^C_i,\bar{X}^P_i,d_i(a_j;b_k,b_l)=1,\varvec{\theta },\varvec{\psi }\Big )\} \end{aligned}$$
(7)

and the optimal frontline treatment is given by \(A^{opt}(X_i,\bar{X}^R_i,\bar{X}^C_i,\bar{X}^P_i)\)=\(\underset{a_j}{{\textit{argmax}}}\ \Big \{\underset{b_k,b_l}{{\textit{max}}}\ \mu \Big (X_i,\bar{X}^R_i,\bar{X}^C_i,\bar{X}^P_i,d_i(a_j;b_k,b_l)=1,\varvec{\theta },\varvec{\psi }\Big )\Big \}.\)

Equation (6) represents a weighted average of patient outcomes across the different response groups.
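As a minimal sketch (illustrative argument names, not the authors' code), the composite in (6) can be assembled for one subject under one regime once the component models have been fitted; the inputs are the predicted response probabilities and predicted mean sojourn times, each already evaluated at the regime's treatments and the subject's covariates.

```python
def gcomp_mean(p_r, p_prog_given_cr, mu):
    """Compose the g-computation mean (6) for one subject under one regime.
    p_r = (P(R1=0), P(R1=1), P(R1=2)) given A and X; p_prog_given_cr = P(R2=1 | R1=2);
    mu maps component labels ('D','R','RD','C','CP','PD','CD') to predicted mean sojourn times."""
    p_r0, p_r1, p_r2 = p_r
    return (p_r0 * mu["D"]
            + p_r1 * (mu["R"] + mu["RD"])
            + p_r2 * p_prog_given_cr * (mu["C"] + mu["CP"] + mu["PD"])
            + p_r2 * (1.0 - p_prog_given_cr) * (mu["C"] + mu["CD"]))
```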

The prospective patient has not yet experienced his/her sample path, and in a sense has missing data for all stages after baseline. To estimate the mean outcome for the prospective patient under a regime of interest requires us to integrate (6) over the probability measure of the covariates that are missing for this patient, the covariates collected after baseline. This leaves us with an estimated mean under the regime of interest, given the baseline information we have on the prospective patient. Most authors comparing dynamic treatment regimes using structural models, such as IPTW and g-computation estimators, routinely integrate over all patient information, except treatment assignment, performing a marginal comparison of regimes. In our approach, we integrate (6) over all stage 2 information, except for stage 2 treatment assignment, facilitating a comparison of treatment regimes conditional on baseline information, similar to (1) and (2). Integrating (6) produces a preliminary optimal regime \(d^{opt}(X_i)\) in (7), and \(A^{opt}(X_i)\)=\(\underset{a_j}{{\textit{argmax}}}\ \Big \{\underset{b_k,b_l}{{\textit{max}}}\ \mu \Big (X_i, d_i(a_j;b_k,b_l)=1,\varvec{\theta },\varvec{\psi }\Big )\Big \}\).

If one is willing to assume that, conditional on patient information, the response proportions are independent of the corresponding mean sojourn times (with respect to the covariates), then the integration can be performed piece-wise for each component. To operationalize this, one would first fit the component model for response or sojourn time with all of the significant terms through stage 2. The predicted values of this model would then be regressed on the same covariates as before, except for any stage 2 covariates. This effectively averages the model over the stage 2 covariates, leaving a model that is conditional on baseline information only. These integrated component models can then be combined to form (6). See Appendix A. When treatment assignment is not random, as will be the case in Sects. 5 and 6, all variables that are confounded with treatment assignment should be included in the sojourn time models; this is important in order to maintain the no unmeasured confounders assumption. To obtain a suitable variance estimator, one could apply the delta method to the integrated form of (6), though this would be impractical given the complexity of the g-computation model. Instead, one can simply rely on the bootstrap.
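The following is a hedged Python sketch of this two-pass integration for a single component, assuming pandas/statsmodels and illustrative column names: the component model is fit on covariates through stage 2, and its fitted values are then regressed on the baseline covariates only, which averages the component over the stage 2 covariates.

```python
import numpy as np
import statsmodels.api as sm

def integrate_over_stage2(df, outcome_col, stage12_cols, baseline_cols, weights=None):
    """(i) Fit the component mean model using covariates through stage 2; (ii) regress
    that model's fitted values on the baseline covariates only.  The second fit yields
    a component model conditional on baseline information alone."""
    w = weights if weights is not None else np.ones(len(df))
    stage2_fit = sm.WLS(df[outcome_col], sm.add_constant(df[stage12_cols]), weights=w).fit()
    baseline_fit = sm.OLS(stage2_fit.fittedvalues, sm.add_constant(df[baseline_cols])).fit()
    return stage2_fit, baseline_fit
```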

3.2 Tailoring the Salvage Therapy

Regardless of whether g-computation (6) or IPTW (1) and (2) is used, to tailor the stage 2 treatment prescribed by the preliminary optimal regime, the mean models for the stage 2 sojourn times, i.e. \(E[T^{RD}_{i}|A_i,B_{1i},\bar{X}^R_{i},R_{1i}=1]\) and \(E[T^{PD}_{i}|A_i,B_{2i},\bar{X}^P_{i},R_{1i}=2,R_{2i}=1]\), can be examined using the estimating equations

$$\begin{aligned} \sum _{i=1}^n\frac{\Delta _i}{\hat{K}(U_i)}I\{R_{1i}=1\}\Big \{\frac{\partial m^{RD}}{\partial \varvec{\theta }^{RD}}\Big \}^T\Big \{\text {log}U^{RD}_i-m^{RD}\big (\bar{X}^{R}_i,A_i,B_{1i},\varvec{\theta }^{RD}\big )\Big \}=0\quad \end{aligned}$$
(8)

and

$$\begin{aligned} \sum _{i=1}^n\frac{\Delta _i}{\hat{K}(U_i)}I\{R_{1i}=2\}I\{R_{2i}=1\}\Big \{\frac{\partial m^{PD}}{\partial \varvec{\theta }^{PD}}\Big \}^T\Big \{\text {log}U^{PD}_i-m^{PD}\big (\bar{X}^{P}_i,A_i,B_{2i},\varvec{\theta }^{PD}\big )\Big \}=0. \end{aligned}$$
(9)

By evaluating \(\mu ^{RD}\big (\bar{X}^{R}_i,A_i,B_{1i},\varvec{\theta }^{RD}\big )\) and \(\mu ^{PD}\big (\bar{X}^{P}_i,A_i,B_{2i},\varvec{\theta }^{PD}\big )\) at \(A_i=A^{opt}(X_i)\), the optimal stage 2 treatment given optimal stage 1 treatment can be identified using

$$\begin{aligned} B_{1}^{opt}(\bar{X}^{R}_i)=\underset{b_k}{{\textit{argmax}}}\ \mu ^{RD}\big (\bar{X}^{R}_i,A_i=A^{opt}(X_i),B_{1i}=b_{k},\varvec{\theta }^{RD}\big ) \end{aligned}$$
(10)

and

$$\begin{aligned} B_{2}^{opt}(\bar{X}^{P}_i)=\underset{b_l}{{\textit{argmax}}}\ \mu ^{PD}\big (\bar{X}^{P}_i,A_i=A^{opt}(X_i),B_{2i}=b_l,\varvec{\theta }^{PD}\big ) \end{aligned}$$
(11)

for \(R_{1i}=1\), and \(R_{1i}=2\) and \(R_{2i}=1\), respectively. Just as with the IPTW estimator, by treating \(\hat{K}(U_i)\) and the covariates as known, one can rely on the large sample theory for variance estimation available in any standard regression software. The optimal treatment regime using sequential structural mean models can then be constructed as “Treat with \(A^{opt}(X_i)\); if resistance is observed, treat with \(B^{opt}_{1}(\bar{X}^{R}_i)\); if disease progression after complete remission is observed, treat with \(B^{opt}_{2}(\bar{X}^{P}_i)\).” The beauty of constructing optimal dynamic treatment regimes in this way is that if additional stage 2 patient information is not available, a salvage treatment based on baseline information can still be prescribed using \(d^{opt}(X_i)\). Although we have demonstrated this technique for optimizing a dynamic treatment regime on a specific two-stage SMART design, the methods are easily generalized to other SMART designs with an arbitrary number of stages. First, conditional on baseline covariates, a preliminary optimal regime is estimated using the g-computation or IPTW estimator. Then, conditional on information up to stage 2 (including frontline treatment assignment and response status prior to stage 2), the g-computation or IPTW estimator is used again to estimate the mean outcome for each treatment regime over the remaining stages. This process continues until the last stage, where the g-computation or IPTW estimator reduces to a simple regression comparing the last-stage treatment assignments. Each successive g-computation or IPTW estimator tailors the optimal treatment assignment at the current stage and provides a strategy for the remaining stages, given past treatment assignments and patient data. This is particularly useful in the event that no future information becomes available for the prospective patient. All authors we have encountered who use structural models (IPTW) do so using only baseline information, prescribing the optimal treatment regime using \(d^{opt}(X_i)\), but naturally it is best to re-evaluate the strategy as more information becomes available. This is what we propose.
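Operationally, the tailoring step in (10) and (11) just evaluates the fitted stage 2 sojourn-time models at the estimated optimal frontline treatment and takes the argmax over the salvage options. A minimal sketch follows; `design_fn` is a hypothetical helper that builds the fitted model's design row for a given treatment combination, and `stage2_fit` is assumed to predict the mean sojourn time.

```python
def tailor_salvage(stage2_fit, patient, a_opt, b_options, design_fn):
    """Return the salvage option with the largest predicted mean sojourn time when the
    fitted stage 2 model is evaluated at the estimated optimal frontline treatment
    (equations (10)-(11)), together with all predictions for inspection."""
    preds = {b: float(stage2_fit.predict(design_fn(patient, a_opt, b))) for b in b_options}
    return max(preds, key=preds.get), preds
```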

3.3 Comparison with Q-learning

From the reinforcement learning literature in the field of DTRs, the Bellman equations (Bellman [1]) identify the optimal treatment at each stage and lead to the Q-functions that comprise Q-learning. For our two stage SMART design, assuming no unmeasured confounders, these would be

$$\begin{aligned} \mathcal {Q}_{B_{1}}\big (A_i,\bar{X}^R_i,B_{1i}=b_k\big )= & {} E[T^{RD}_{i}|A_i,B_{1i}=b_k,\bar{X}^R_{i},R_{1i}=1],\\ \mathcal {Q}_{B_{2}}\big (A_i,\bar{X}^{P}_i,B_{2i}=b_l\big )= & {} E[T^{PD}_{i}|A_i,B_{2i}=b_l,\bar{X}^P_{i},R_{1i}=2,R_{2i}=1],\\ \mathcal {Q}_A\big (X_i,A_i=a_j\big )= & {} E\big [H^{(A)}_i|X_i,A_i=a_j\big ], \end{aligned}$$

where

$$\begin{aligned} H^{(A)}_i=\left\{ \begin{array}{ll} T^D_i,\ \ &{} \text {if } R_{1i}=0\\ T^R_i+\underset{b_k}{{\text {max }}}\mathcal {Q}_{B_{1}}\big (A_i,\bar{X}^R_i,B_{1i}=b_k\big ),\ \ &{} \text {if } R_{1i}=1\\ T^C_i+T^{CP}_i+\underset{b_l}{{\text {max }}}\mathcal {Q}_{B_{2}}\big (A_i,\bar{X}^{P}_i,B_{2i}=b_l\big ),\ \ &{} \text {if } R_{1i}=2,\ R_{2i}=1\\ T^C_i+T^{CD}_i,\ \ &{} \text {if }R_{1i}=2,\ R_{2i}=0, \end{array} \right. \end{aligned}$$

with \(A^{opt}(X_i)\) \(\equiv \) \(\underset{a_j}{{\text {argmax }}}\mathcal {Q}_A\big (X_i,A_i=a_j\big )\), \(B^{opt}_{1}(\bar{X}^R_i)\) \(\equiv \) \(\underset{b_k}{{\text {argmax }}}\mathcal {Q}_{B_{1}}\big (A_i=A^{opt}(X_i),\bar{X}^R_i,B_{1i}=b_k\big )\), and \(B^{opt}_{2}(\bar{X}^P_i)\) \(\equiv \) \(\underset{b_l}{{\text {argmax }}}\mathcal {Q}_{B_{2}}\big (A_i=A^{opt}(X_i),\bar{X}^{P}_i,B_{2i}=b_l\big )\). The similarity between Q-learning and g-computation is striking, except that Q-learning averages over stage 2 information and estimated optimal stage 2 treatment assignment, whereas g-computation averages over stage 2 information while holding stage 2 treatment assignment fixed when searching for \(A^{opt}(X_i)\). To see this, compare the expected value of \(H^{(A)}_i\) using the law of total expectation with Equation (4). Regardless of what estimation method is used (structural or nested models), the optimal choice of frontline treatment depends on what salvage treatment is taken. To identify \(A^{opt}(X_i)\), Q-learning (through the use of pseudo data \(H^{(A)}_i\)) assumes that those who move to stage 2 take their optimal salvage therapy. The optimization of \(A_i\) is marginalized over all \(B_{1}^{opt}(\bar{X}_i^{R})\) and \(B_{2}^{opt}(\bar{X}_i^{P})\). Q-learning estimates all of the best strategies over later stages, combines them into an average best strategy, and assigns the optimal frontline treatment based on this average best strategy given baseline information. Structural mean models consider an average patient over later stages; they identify the single best strategy for the average patient and assign the optimal frontline treatment based on this best strategy given baseline information. Interestingly, all of the same steps are applied in Q-learning and g-computation, the only difference being their order. G-computation first finds the sojourn means conditional on response status, combines them using the law of total expectation, integrates over stage 2 information, and then applies the max and argmax operators to identify \(A^{opt}(X_i)\). On the other hand, Q-learning first finds the sojourn means conditional on response status, applies the max operators to the stage 2 sojourn means, integrates over stage 2 information, combines them using the law of total expectation, and then applies the argmax operator to identify \(A^{opt}(X_i)\). Compared to the Bellman equations that give rise to Q-learning, structural mean models are robust to extreme observations at later stages when choosing the optimal treatment at the current stage. They allow an appropriate comparison of the treatment regimes while avoiding the non-regularity issues of pseudo data associated with backwards induction techniques. This facilitates standard large sample theory and the bootstrap for constructing confidence intervals and performing hypothesis tests. For a more detailed comparison, see Appendix A.
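For comparison, here is a minimal sketch of the Q-learning pseudo outcome \(H^{(A)}_i\) given fitted stage 2 Q-functions; it is illustrative only. The column names and the `design_b1`/`design_b2` helpers are assumptions, and `qb1_fit`/`qb2_fit` are assumed to predict mean stage 2 sojourn times on the original time scale (exponentiate first if they were fit on the log scale).

```python
import numpy as np

def pseudo_outcome(df, qb1_fit, qb2_fit, design_b1, design_b2, b_options=(0, 1)):
    """Construct H_i^(A): observed survival for subjects who never reach stage 2, and
    observed sojourn up to stage 2 plus the maximum predicted stage 2 sojourn over the
    salvage options for those who do."""
    h = np.empty(len(df))
    for i, row in df.reset_index(drop=True).iterrows():
        if row["R1"] == 0:
            h[i] = row["T_D"]
        elif row["R1"] == 1:
            best = max(float(qb1_fit.predict(design_b1(row, b))) for b in b_options)
            h[i] = row["T_R"] + best
        elif row["R1"] == 2 and row["R2"] == 1:
            best = max(float(qb2_fit.predict(design_b2(row, b))) for b in b_options)
            h[i] = row["T_C"] + row["T_CP"] + best
        else:
            h[i] = row["T_C"] + row["T_CD"]
    return h
```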

The IPTW and g-computation estimators above allow us to identify the optimal treatment regime given patient information. What may not be immediately clear from these estimators is the functional dependence they outline between the patient information and the optimal treatment regime. This is especially true for the g-computation estimator. The variable selection method presented next allows us to identify which of the covariates in these structural mean models are prescriptive and to describe the functional dependence between these prescriptive variables and the optimal treatment regime.

4 Prescriptive Variable Selection for Structural Mean Models

Variable selection methods such as forward, backward, and step-wise selection, as well as the least absolute shrinkage and selection operator (LASSO) and its derivatives, help to find the covariates and their interactions that are predictive when estimating the expected value of \(Y_i\), but they do not clearly identify for which values of the covariates the choice of optimal treatment changes, where we take the optimal treatment to be the one with the largest expected outcome. Gunter et al. [5] define predictive variables as those used to reduce the variability and increase the accuracy of the estimator, whereas variables that help prescribe the optimal treatment for a given patient are called prescriptive (Hollon and Beck [7]). When estimating the mean outcome, it is best to collect as many predictive variables as possible; however, only those predictive variables that are also prescriptive are needed when deciding between treatments (Fig. 2). In order for a variable to be prescriptive, it must qualitatively interact with treatment. A variable X is said to qualitatively interact with the treatment Z if there exist at least two distinct non-empty sets within the space of X for which the optimal treatment is different. That is, there exist disjoint, non-empty sets \(S_1,S_2\subset space(X)\) for which \(\underset{z}{{\textit{argmax}}}\ E[Y|X=x_1,Z=z]\ne \underset{z}{{\textit{argmax}}}\ E[Y|X=x_2,Z=z]\) for all \(x_1\in S_1\) and \(x_2\in S_2\).
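As a hypothetical numerical illustration (not from the trial data), consider a single binary treatment \(Z\in \{0,1\}\), a single covariate \(X\), and the mean model

$$\begin{aligned} E[Y|X,Z]=1+Z(2-4X). \end{aligned}$$

Here \(\underset{z}{{\textit{argmax}}}\ E[Y|X=x,Z=z]=1\) when \(x<1/2\) and \(=0\) when \(x>1/2\), so the sets \(S_1=\{x:x<1/2\}\) and \(S_2=\{x:x>1/2\}\) satisfy the definition and \(X\) is prescriptive. By contrast, under \(E[Y|X,Z]=1+2Z+4X\) the treatment effect is positive for every \(x\), so \(X\) is predictive but not prescriptive.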

Fig. 2 Depicting a covariate X that is either predictive or prescriptive when identifying the optimal treatment

Working in the single-stage setting using backwards induction, Gunter et al. [5] propose two different ranking methods to sort variables according to how likely they are to qualitatively interact with treatment, and provide a four-step algorithm involving LASSO regression on nested subsets of covariates for selecting important predictive variables. Following their work, Zhang [20] generalizes from least squares regression to multivariate adaptive regression spline (MARS) models and offers a simpler, more effective two-step method: (i) MARS models are used to model the outcome of interest and simultaneously select predictive (and prescriptive) variables from a larger set of candidates, and (ii) the treatment interaction contrast is used as the outcome in a penalized logistic regression with LASSO (\(L_1\)-logistic regression) to identify which of the significant interactions are not only predictive, but also prescriptive.

Expanding on the ideas in Zhang [20], we use a two-step approach for prescriptive variable selection in structural mean models. For models of the form in Eqs. (1) and (2), the first step implements a variable selection method to identify significant interaction effects between baseline covariate information and treatment regimes when estimating mean survival time. Equation (3) is used to create a categorical variable indicating the estimated preliminary optimal treatment regime for each subject, given his/her baseline covariate information. In the second step, we use the estimated preliminary optimal treatment regime as the outcome in a classification method such as multinomial logistic regression, using significant baseline covariates from (2) in step 1 as predictors. Any baseline effects deemed significant in the second step are prescriptive variables that qualitatively interact with treatment regime when estimating mean survival time and prescribe the preliminary optimal treatment regime. An analogous two-step method can be used for models of the form in Eqs. (5) and (6).
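A minimal Python sketch of this two-step procedure (illustrative only): step 1 is represented here by an already-fitted and already-selected structural mean model `aft_fit`, each subject's estimated preliminary optimal regime is computed via the argmax in (3), and step 2 regresses that label on the baseline covariates with multinomial logistic regression. The `design_fn` helper and the use of statsmodels' MNLogit are assumptions; in practice only a few of the sixteen regimes may appear among the estimated optima.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def two_step_prescriptive(aft_fit, baseline, regimes, design_fn):
    """Step 2 of the prescriptive variable selection: classify each subject's estimated
    preliminary optimal regime (the argmax of the fitted mean model over regimes) on
    baseline covariates; covariates retained here are the prescriptive ones."""
    labels = []
    for _, row in baseline.iterrows():
        preds = {r: float(aft_fit.predict(design_fn(row, r))) for r in regimes}
        labels.append(max(preds, key=preds.get))             # argmax in (3)
    y = pd.Categorical(labels, categories=regimes).codes
    step2 = sm.MNLogit(y, sm.add_constant(baseline)).fit(disp=False)
    return np.array(labels), step2
```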

At this point, the reader might be left wondering what the purpose of the classification method in the second step is. After all, we have everything we need from the first step to assign the preliminary optimal treatment regime conditional on baseline information. If we were estimating marginal means for each treatment regime, we would directly compare the 16 means and identify the largest one with, say, a forest plot. By conditioning on baseline information, we could create a separate forest plot for each combination of baseline covariates, but as the number of baseline covariates increases this becomes tedious, and it may not suggest a clear functional relationship between the baseline covariates and the optimal regime. The importance of the second step is twofold: (i) if the variable selection process from the first step included many baseline covariates, the second step narrows our focus to those baseline covariates that are prescriptive, and (ii) once we have narrowed our focus, the second step allows us to see the functional dependence between the preliminary optimal treatment regime and these prescriptive covariates. Both of these points are especially important when using the g-computation estimator.

In a sense, we are modeling our model, using a classification method to model the argmax of the g-computation or IPTW estimator. We admit that this extra layer of modeling may introduce more misclassification than using the g-computation or IPTW estimator alone, but it allows us to clearly and succinctly describe the prescription of our g-computation or IPTW model. It should also be noted that only the prescriptive variables need be collected to prescribe according to the classification model, saving hospital and patient resources. If additional accuracy is desired, the g-computation or IPTW estimator can be used to prescribe the optimal treatment regimes, and the classifier in step two can be used to describe the prescription mechanism.

Regardless of whether g-computation (6) or IPTW (1) and (2) is used, the same two step method can be used when tailoring the stage 2 treatment prescribed by the estimated preliminary optimal regime. The first step implements a quantitative variable selection method to identify significant interaction effects when estimating the mean sojourn time from stage 2 to death. Equations (10) and (11) are used to create a categorical variable indicating the estimated optimal stage 2 treatment given the estimated optimal stage 1 treatment and patient information. In the second step, we use the estimated optimal stage 2 treatment as the outcome in a classification method such as logistic regression, using information up to stage 2 as predictors. Any effects deemed significant in the second step are prescriptive variables that qualitatively interact with stage 2 treatment when estimating the mean sojourn time from stage 2 to death. The estimated treatment means that are grouped and paneled by the prescriptive variables are used to confirm and report the results.

With structural mean models like the g-computation or IPTW estimators, each treatment regime must be directly compared to determine the optimal one. When this comparison is also conditional on patient information, this technique for optimizing dynamic treatment regimes quickly becomes unwieldy. The two-step prescriptive variable selection procedure supports the tailored optimization of dynamic treatment regimes using sequential structural mean models by eliminating from consideration any suboptimal treatment regimes and sifting out the covariates that prescribe the optimal treatment regimes. This variable selection method is easily applied at every stage of the SMART design.

5 Simulation

We conducted a simulation experiment to evaluate the optimization of frontline and salvage treatment given patient information using the methods described in Sects. 3 and 4. We consider a two-stage SMART design identical to the acute myelogenous leukemia or myelodysplastic syndrome (AML-MDS) trial design presented in Sect. 2 and analyzed in Sect. 6.

The scenario was generated to closely mimic the AML-MDS data described in Sect. 6. Subjects were randomly assigned to one of four induction therapies, \(\mathcal {A}\)=\(\{\)(1)FAI, (2)FAI+ATRA, (3)FAI+G, or (4)FAI+G+ATRA\(\}\). The simulated population experienced one of three possible cytogenetic abnormalities with equal probability, and age was generated using a Weibull distribution truncated between 20 and 90 years. Response status \(R_{1i}\) depended on frontline treatment, age, and cytogenetic abnormality, while response status \(R_{2i}\) depended on frontline treatment only. If \(R_{1i}=1\), assignment to follow-up therapy \(\mathcal {B}\)=\(\{\)(0)Other treatment, (1)HDAC\(\}\) depended on age, while if \(R_{2i}=1\), assignment to follow-up therapy depended on \(\text {log}T^{CP}_i\). Sojourn times followed various Weibull distributions, with means depending on frontline treatment, age, cytogenetic abnormality, and, where appropriate, earlier sojourn times and follow-up therapy. Details of the data generation are provided in the Web Appendix with related SAS code.

In this scenario, \(n=1000\) observations (training data) were simulated, and the g-computation and IPTW regression estimators were fit. For each subject, the estimated preliminary optimal regime conditional on baseline information was identified, and logistic regression (written as logistic\(^{gcomp}\) and logistic\(^{IPTW}\)) was used to identify the functional dependence between the preliminary optimal regime and baseline covariates. For those subjects for whom \(R_{1i}=1\) or \(R_{2i}=1\), the sojourn models for \(\text {log}T^{RD}_i\) and \(\text {log}T^{PD}_i\), respectively, were evaluated at \(\hat{A}^{opt}(X_i)\), and \({B}_{1}^{opt}(\bar{X}^R_i)\) and \({B}_{2}^{opt}(\bar{X}^P_i)\) were estimated. Logistic regression (also written logistic\(^{gcomp}\) and logistic\(^{IPTW}\)) was used to identify the functional dependence between the estimated optimal salvage treatment and patient information up to stage 2. The g-computation, IPTW, and classification models were then applied to a new set of \(n=100,000\) observations (test data) to determine how well the models classify subjects to their optimal frontline treatment, salvage treatment, and treatment regime, and how well the models agree with one another. The classification rate is calculated on the \(n=100,000\) test subjects. This process was replicated 5000 times, and the average classification rates are reported in Tables 1 and 2.
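As a minimal sketch of how one agreement cell is computed on the test data (illustrative wrappers, not the simulation code): each fitted approach is represented by a function that returns its prescribed frontline treatment, salvage treatment, or regime for every test subject, and the replicate-level rates are then averaged over the 5000 simulations.

```python
import numpy as np

def agreement_rate(prescribe_a, prescribe_b, test_df):
    """Proportion of test subjects for whom two fitted approaches (e.g. g-computation
    and its logistic classifier) prescribe the same treatment or regime."""
    a = np.asarray(prescribe_a(test_df))
    b = np.asarray(prescribe_b(test_df))
    return float(np.mean(a == b))
```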

In row 1 of each table, g-computation vs logistic\(^{gcomp}\) compares the proportion of times the argmax of the g-computation estimator produces the same result as the classification model of the argmax of the g-computation estimator for (a) frontline treatment and (b) salvage treatment. The remaining rows are interpreted similarly. The results in Table 1 are under correct model specification for the g-computation model and nearly correct model specification for the IPTW model; we say ‘nearly’ since the data generative process followed the g-computation estimator, so the IPTW model cannot be exactly correct. By correct model specification we mean that there was no model selection in either step of the two-step method; the correct models were known and fit. This gives us an idea of how the g-computation, IPTW, logistic\(^{gcomp}\), and logistic\(^{IPTW}\) models perform under the most ideal circumstances in the given scenario. As expected, the IPTW model and its associated logistic\(^{IPTW}\) model agreed with one another over 99% of the time when identifying the optimal frontline treatment. This is not surprising since the IPTW estimating equations replicate the observations belonging to multiple regimes, and the mean model fits a separate slope/intercept for every regime, so any covariates that interact with treatment regime can be nearly perfectly captured in the logistic\(^{IPTW}\) model. On the other hand, the g-computation model fits separate parameters for the covariates across the mean sojourn times, and the associated logistic\(^{gcomp}\) model may not perfectly capture these relationships. Nevertheless, the logistic\(^{gcomp}\) model also agreed with the g-computation model over 99% of the time, on average. The results in Table 2 incorporate backward variable selection using AIC in step one and backward variable selection using significance level in step two of the proposed variable selection method for each of the 5000 replications. This gives us a sense of how the g-computation, IPTW, logistic\(^{gcomp}\), and logistic\(^{IPTW}\) models perform under usual model building circumstances in the given scenario. Backward selection was chosen because there were relatively few candidate covariates at each stage; methods such as LASSO work particularly well when there are many candidate variables.

Table 1 Agreement rates (se) under correct model specification. 5000 simulations of \(n=1000\)

In row 1, g-computation vs logistic\(^{gcomp}\) compares the proportion of times the argmax of the g-computation estimator produces the same result as the classification model of the argmax of the g-computation estimator for (a) frontline treatment, (b) tailored salvage treatment, and (c) tailored treatment regime. The remaining rows are interpreted similarly.

Table 2 Agreement rates (se) using backward selection for model building. 5000 simulations of \(n=1000\)

In row 1, g-computation vs logistic\(^{gcomp}\) compares the proportion of times the argmax of the g-computation estimator produces the same result as the classification model of the argmax of the g-computation estimator for (a) frontline treatment and (b) tailored salvage treatment. The remaining rows are interpreted similarly.

6 Optimizing AML Treatment Regimes

6.1 Study Overview

In this section, we apply the methods discussed previously to the AML-MDS trial concerning 210 patients with leukemia (Wahed and Thall [16]; Xu et al. [19]). The data set arose from a randomized trial of four combination chemotherapies given as frontline treatments to patients with poor prognosis acute myelogenous leukemia (AML) or myelodysplastic syndrome (MDS). Chemotherapy of AML or MDS proceeds in stages. A ‘remission inducing’ chemotherapy combination is given first, with the aim of achieving a complete remission (CR), which is defined as the patient having less than 5% blast cells, a platelet count greater than \(10^5\)/mm\(^3\), and a white blood cell count greater than \(10^3\)/mm\(^3\), based on a bone marrow biopsy. If the induction chemotherapy does not achieve a CR, or a CR is achieved but the patient suffers a relapse, then salvage chemotherapy usually is given in a second attempt to achieve a CR. The AML-MDS trial used a 2\(\times \)2 factorial design with chemotherapy combinations fludarabine plus cytosine arabinoside plus idarubicin (FAI), FAI plus all-trans-retinoic acid (FAI+ATRA), FAI plus granulocyte colony stimulating factor (FAI+G), and FAI plus all-trans-retinoic acid plus granulocyte colony stimulating factor (FAI+G+ATRA). The primary aim was to assess the effects of adding ATRA, G, or both to FAI on the probability of success, defined as the patient being alive and in CR at 6 months.

Table 3 Initial outcomes following frontline treatment

Because there were many different salvage treatments, we classified salvage as either containing high-dose cytosine arabinoside (HDAC) or not (Other treatment). In the AML-MDS trial, patients were randomized among the four induction combinations, whereas the salvage treatments \(B_1\) and \(B_2\) were chosen subjectively by the attending physicians, patient by patient. Consequently, considering the multicourse structure of the patients’ actual therapy, the data are observational because salvage treatments were not chosen by randomization. By modeling the stage 2 treatment assignment probabilities, incorporating all covariates that explain treatment assignment, the IPTW regression estimator remains consistent. Similarly, by incorporating all confounders of stage 2 treatment assignment into the stage 2 intermediate outcome component models in Eq. (6), the g-computation estimator also remains consistent. This no unmeasured confounders assumption is important for our causal interpretation of counterfactual/potential outcomes, allowing us to consistently estimate the mean outcome under a regime of interest for the entire sample of patients. We found that assignment to \(B_1\) treatments was associated with age, while assignment to \(B_2\) treatments was associated with log\(T^{CP}\). Tables 3 and 4 summarize the counts for the seven possible events illustrated in Fig. 1 for the leukemia data. These include the three induction therapy outcomes (indexed by \(R_{1i}\)) for each treatment arm and the four possible subsequent outcomes.

Table 4 Outcomes following CR or resistant disease

It is well known that age and type of cytogenetic abnormality are highly reliable predictors of the probability of CR and survival time in AML or MDS. In particular, cytogenetic abnormality characterized by missing portions of the fifth and seventh chromosomes (denoted by (−5,−7)) and older age are strongly associated with a lower probability of CR and shorter survival time. Because this trial’s entry criteria required patients to have at least one unfavorable prognostic characteristic, the distributions of age and cytogenetic abnormality were different from those seen in the population of newly diagnosed AML-MDS patients. For example, only four patients had the comparatively favorable cytogenetic abnormalities INV16, an inversion of the 16th chromosome, or T(8,21), a translocation between chromosomes 8 and 21. Consequently, to take advantage of cytogenetic abnormality as a prognostic variable in our regression analyses, we grouped it into three categories: poor {(−5,−7)}; intermediate {diploid, −Y, or insufficient metaphases to classify}; good {+8, 11Q, INV16, T(8,21), MISC}.

To ensure stability of the model fits, six of the seven component models were fitted by restricting the time to the particular event to a fixed upper limit, with the limits set by first examining the observed distribution of each event time. Specifically, the variables \(U^D\), \(T^C\), \(U^{RD}\), \(U^{CD}\), \(T^{CP}\), and \(U^{PD}\) were restricted to 100, 110, 1408, 692, 1326, and 2274 days, respectively. The covariates for the mean sojourn time models are presented in Tables 5 and 6. Backward variable selection, using AIC as the criterion for optimality, was used to determine the significant covariates and their possible two-way interactions in each model. Frontline and salvage treatment were forced into each model.

Unfortunately, many AML patients undergoing chemotherapy to induce CR die during this process, before either CR is achieved or it can be determined that the patient’s disease is resistant to the induction chemotherapy. Such deaths may be attributed to either the leukemia or the chemotherapy (so-called ‘regimen-related death’), but because both the disease and the treatment cause low white blood cell counts and other adverse events, it often is very difficult to identify a sole cause of death. The patients in this study were especially susceptible to induction death due to their poor prognosis at entry to the trial, with an overall rate of death during induction chemotherapy of 33% (69/210), varying from 28% to 38% across the four induction regimens (p-value 0.70; generalized Fisher exact test). In the fitted model for the three induction event times (Table 5), no baseline covariate was significantly associated with \(\text {log}T^D\). There did not appear to be any significant difference among the induction treatment effects on \(\text {log}T^D\), although ATRA may have had a slightly deleterious effect in that, among the 69 patients who died during induction, the patients in the two ATRA arms died a few days sooner, on average.

Table 5 Models for sojourn time to death, time to resistance, and time to complete remission

Resistance to induction treatment occurred in 39 (18.6%) patients, relatively more frequently among patients receiving FAI and FAI+ATRA (31% and 24%, respectively) than among those who received FAI+G or FAI+G+ATRA (7.8% and 10%, respectively). The times to treatment resistance were similar across the four induction treatments, but with greater variability in the FAI+G arm. Among the 39 patients who were resistant to frontline treatment, 27 were given HDAC as salvage treatment. Two patients in this cohort were censored before death was observed. Using backward variable selection with AIC as the criterion of optimality, the factors associated with time from induction treatment resistance to death are presented in Table 6. About half (48.6%) of the 210 patients achieved CR, with CR rates of 37%, 48%, 53%, and 56% in the FAI, FAI+ATRA, FAI+G, and FAI+G+ATRA arms, respectively. Of the 102 patients who achieved CR, 93 (91%) had disease progression before death or loss to follow-up. Among these, 53 (57%) received HDAC as salvage treatment. Since only nine patients died in CR, an intercept-only model was used for modeling \(T^{CD}\).

Table 6 Models for sojourn time from resistance to death, complete remission to disease progression, and from progression to death

Interaction effects imply the inclusion of the corresponding lower-order terms; i.e., no interaction was included without its main effects.

6.2 Strategy effects

Figure 3a shows the results of the classification model for the argmax of the g-computation model using the proposed two-step prescriptive variable selection method. Figure 3a depicts the proportion of patients having, or equivalently the estimated probability of having, a particular preliminary optimal treatment regime, given baseline information. Using the integrated form of Eq. (6), Equation (7) served as the outcome in a logistic regression. The significant baseline covariates from the g-computation model were then the candidate variables in a second variable selection step for the classification model, to determine which covariates are prescriptive. Both cytogenetic abnormality and age are prescriptive when determining a patient’s preliminary optimal treatment regime using baseline information. For all patients with poor or intermediate cytogenetic abnormalities, and for most of those with good cytogenetic abnormalities, the preliminary regime \(d\big (\)(2)FAI+ATRA; (0)Other treatment, (1)HDAC\(\big )\) was optimal. For those with good cytogenetic abnormality who are over 70 years of age, the preliminary optimal treatment regime is \(d\big (\)(3)FAI+G; (1)HDAC, (0)Other treatment\(\big )\). This is consistent with the marginal results of Wahed and Thall [16], who determined that the optimal dynamic treatment regime, marginalized over baseline information, was \(d\big (\)(2)FAI+ATRA; (0)Other treatment, (1)HDAC\(\big )\). Our IPTW regression model found that \(d\big (\)(2)FAI+ATRA; (0)Other treatment, (1)HDAC\(\big )\) was the preliminary optimal treatment regime for all patients, regardless of baseline information. This is not surprising, since the IPTW and g-computation estimators are constructed very differently from one another. For each of the preliminary optimal treatment regimes prescribed using the two-step variable selection method, Fig. 3b plots the estimated mean survival time from the g-computation model with 90% point-wise bootstrap confidence intervals, using the 5th and 95th percentiles of the bootstrapped sampling distribution of the mean (500 bootstrap re-samples). Though the confidence interval is wide, 20-year-old patients with intermediate cytogenetic abnormalities are estimated to live over 8 years on average from commencement of regime d((2)FAI+ATRA; (0)Other, (1)HDAC), compared to 3 years or less for other ages and cytogenetic groups.
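The point-wise intervals in Fig. 3b follow the usual percentile bootstrap. A minimal sketch is given below, where gcomp_mean(data, regime, x0) is a placeholder for the g-computation estimate of mean survival under a given regime at baseline covariate profile x0; it stands in for, but is not, the paper's implementation.

# Percentile bootstrap sketch for a 90% point-wise interval (5th and 95th percentiles,
# 500 re-samples); gcomp_mean is a hypothetical estimator supplied by the caller.
import numpy as np

def bootstrap_ci(df, regime, x0, gcomp_mean, n_boot=500, seed=0):
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(df), size=len(df))  # resample patients with replacement
        estimates.append(gcomp_mean(df.iloc[idx], regime, x0))
    lower, upper = np.percentile(estimates, [5, 95])
    return gcomp_mean(df, regime, x0), lower, upper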

Fig. 3

a Classification model for argmax of g-computation model using the proposed two-step prescriptive variable selection method. b Mean plot of g-computation model with 90% point-wise bootstrap confidence intervals. c Classification model for argmax of log\(T^{PD}\) using the proposed two-step prescriptive variable selection method. d Mean plot of log\(T^{PD}\) model with 90% point-wise bootstrap confidence intervals. Preliminary regimes \(d(A,B_1,B_2)\), where \(A=(1)\) FAI, (2) FAI+ATRA, (3) FAI+G, or (4) FAI+G+ATRA; \(B_1\), \(B_2\) = (0) Other treatment or (1) HDAC

For those who experienced disease progression after complete remission, the optimal salvage therapy can be tailored according to Fig. 3c, which shows the results of the classification model for the argmax of the \(\text {log}T^{PD}\) model using the proposed two-step prescriptive variable selection method. As indicated in Table 6, the model for \(\text {log}T^{PD}\) depends on age, cytogenetic group, \(\text {log}T^{C}\), \(\text {log}T^{CP}\), and several interaction effects, but Fig. 3c shows that only \(\text {log}T^{CP}\) is needed to prescribe the optimal salvage therapy. Figure 3c depicts the proportion of patients having, or equivalently the estimated probability of having, a particular optimal salvage treatment, given patient information up to stage 2. For those who took (2) FAI+ATRA as their frontline treatment and experienced disease progression after complete remission, (1) HDAC remained their optimal salvage therapy so long as their \(\text {log}T^{CP}\) exceeded 6. However, for patients with \(\text {log}T^{CP}\) less than 6 following treatment with (2) FAI+ATRA, (0) Other treatment was most often the optimal salvage therapy. A similar result holds for those treated with frontline therapy (3) FAI+G, except that the decision point for altering the salvage treatment occurs near \(\text {log}T^{CP}=5\). Since the IPTW regression model identified \(d\big (\)(2) FAI+ATRA; (0)Other treatment, (1) HDAC\(\big )\) as the preliminary optimal treatment regime regardless of baseline information, its corresponding graph for tailoring the salvage treatment when \(R_{2i}=1\) is the top panel of Fig. 3c. For those who experienced resistant disease, no further tailoring of the optimal salvage treatment was possible, since the optimal salvage therapy was Other treatment for everyone experiencing resistant disease. It should be noted that the logistic curves in Fig. 3c are smoother than the near-discontinuous curves in Fig. 3a. This indicates that there are other covariates that aid in the prescription of the optimal salvage therapy but were not deemed significant by the backward variable selection process of the classification model. When age is included in the classification model for the argmax of the \(\text {log}T^{PD}\) model, the logistic curves in Fig. 3c become sharper when paneled by age, though the functional dependence and the overall prescription remain the same. For each of the optimal salvage therapies, Fig. 3d plots the estimated mean survival time from disease progression for the g-computation model with 90% point-wise bootstrap confidence intervals, using the 5th and 95th percentiles of the bootstrapped sampling distribution of the mean (500 bootstrap re-samples).
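The tailoring rule in Fig. 3c amounts to a logistic classification of the fitted argmax. A hedged sketch follows, in which predict_log_tpd stands in for the fitted \(\text {log}T^{PD}\) model and the column names are hypothetical; the decision point discussed above corresponds to the covariate value at which the fitted curve crosses 0.5.

# Illustrative sketch of the stage 2 classification step: label each progressing patient
# with the salvage therapy maximizing the fitted log T^PD model, then regress that label
# on log T^CP. `predict_log_tpd` and the column names are placeholders, not the paper's code.
import statsmodels.api as sm

def salvage_classifier(df, predict_log_tpd):
    # Label is 1 if HDAC yields the larger predicted log survival time after progression.
    label = (predict_log_tpd(df.assign(b2_hdac=1)) >
             predict_log_tpd(df.assign(b2_hdac=0))).astype(int)
    # Logistic classification model in the single prescriptive covariate log T^CP.
    X = sm.add_constant(df[["log_tcp"]])
    fit = sm.Logit(label, X).fit(disp=0)
    # The fitted probability crosses 0.5 where const + beta * log_tcp = 0.
    cutoff = -fit.params["const"] / fit.params["log_tcp"]
    return fit, cutoff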