In clinical practice, multiple treatment alternatives are often available for the problem at hand. In such cases, there is a clear need for decision rules that indicate which treatment alternative should ideally be administered to each client or patient under study. Retrieving such optimal decision rules (also referred to as optimal treatment regimes) is a key methodological challenge in the highly topical areas of precision health and precision medicine (Chakraborty & Moodie, 2013; Chakraborty & Murphy, 2014; Huang et al., 2019; Kosorok & Moodie, 2015; Laber et al., 2014; Lou et al., 2018; Schulte et al., 2014; Trivedi, 2016; Whitcomb, 2019).

For the estimation of optimal treatment regimes [limited here to single-decision regimes (Tsiatis et al., 2020), while, for the time being, leaving multiple-decision regimes in a multistage treatment context aside], one has to rely on data for a sample of clients, each of whom underwent one of the treatment alternatives, within the context of either a randomized clinical trial or an observational study. In the present paper we will reanalyze, as a motivating running example, data from a randomized clinical trial on three types of aftercare administered to 224 younger women with early-stage breast cancer (the Breast Cancer Recovery Project: Scheier et al., 2005, 2007). Prior to the aftercare, the majority of the women underwent a lumpectomy, removal of axillary nodes, and combined radiation and chemotherapy. The three alternative types of aftercare were (1) standard medical care, (2) standard medical care plus a nutrition intervention (how to adopt a low-fat, high-fruit/vegetable diet), and (3) standard medical care plus an education intervention (with information on breast cancer and training of coping skills). Before the start of aftercare, 11 pretreatment characteristics were measured. We focus here on improvement in physical functioning as the outcome variable. The key question to be addressed is then which types of women (in terms of pretreatment characteristics) would benefit most, in terms of improvement in physical functioning, from which type of aftercare.

When looking for an optimal treatment regime, one typically does so within a prespecified family or class of treatment regimes. In the present paper we will focus for this purpose on the family of classification trees, because of their insightfulness and easy interpretability. (For examples of classification trees, see Figs. 3 and 4.) At this point, however, an obstacle arises: The problem of estimating an optimal decision tree-based treatment regime has been satisfactorily analyzed at a theoretical level (Laber & Zhao, 2015; Tsiatis et al., 2020; Zhang et al., 2012), and a ready-made, easily accessible software solution has been made available for the case of two treatment alternatives (Holloway et al., 2023; Zhang et al., 2012); however, a similar readily accessible solution for the case of more than two treatment alternatives (as in the running example outlined above) is not yet available (with, e.g., Laber & Zhao, 2015, sharing only a very partial and limited piece of pseudocode in supplementary materials). In the present paper, we fill this gap.

The remainder of this paper is structured as follows: In Sect. 2 we will first formalize and analyze the primary problem that is the focus of the present paper (i.e., the lack of a readily accessible methodology for estimating optimal tree-based treatment regimes in the case of strictly more than two treatment alternatives); subsequently, we will propose a solution for it. Second, we will also briefly discuss two secondary problems that arise during the estimation, along with a proposed solution for them. Section 3 will present an evaluation of the proposed methodology in a simulation study, and Sect. 4 an application of it to the data from the Breast Cancer Recovery Project. We will end with a few concluding remarks.

Method

Primary problem

Notation and formalization

The methodology we propose in the present paper is applicable to data from both randomized clinical trials (RCTs) and observational studies. For simplicity's sake, however, we will focus primarily on the RCT case.

We therefore assume that data are available from an RCT involving I individuals (clients or patients), indexed 1, …, i, …, I. In the running example of the Breast Cancer Recovery Project, I = 224.

In the RCT, each individual is randomly assigned to one out of K treatment alternatives (1, …, k, …, K). In the running example, K = 3. We further denote the variable that indicates the treatment alternative to which each individual is assigned in the RCT by A.

From each individual, a set of P pretreatment or baseline characteristics X is collected prior to treatment assignment, with X = (X1, …, Xp, …, XP). In the running example, P = 11. In addition, an outcome variable Y is measured; without loss of generality, we further assume that higher values on Y are better.

A treatment regime g then can be defined as a mapping:

$$\begin{array}{c}g:\operatorname{Range}\left(\mathbf X\right)\rightarrow\left\{1,\dots,k,\dots,K\right\}\\\mathbf x\mapsto g\left(\mathbf x\right)\end{array}.$$

Note that this definition also includes constant functions as treatment regimes—that is to say, so-called trivial or one-size-fits-all treatment regimes, which imply that all individuals are assigned to one and the same treatment alternative.

To formalize treatment regime optimality, we firstly have to prespecify a family of treatment regimes \(\mathcal{G}\) (with most families also including the one-size-fits-all regimes as special cases). In the present paper, \(\mathcal{G}\) is the family of classification trees. Secondly, we need the concept of potential outcomes (Rubin, 1974), with, in the case of K treatment alternatives, K potential outcome variables, \(\mathbf{Y}^{\mathbf{\ast}} = \left(Y^{\ast1},\dots,Y^{\ast k},\dots,Y^{\ast K}\right)\), where \(Y^{\ast k}\) denotes the outcome that would have been observed if the individual under study were assigned to treatment alternative k. (One may note that, as each individual is assigned to a single treatment alternative, effectively only one of the potential outcome variables is observed—which comes down to a problem of structural missingness; we will return to this issue in the discussion of the secondary problems.) Thirdly, we need an optimality criterion to define within the search space \(\mathcal{G}\) an optimal treatment regime gopt. Several criteria are possible in this regard, with the criterion most often used in practice being that of maximizing the expected potential outcome (Tsiatis et al., 2020). If we denote by \(Y^{\ast g\left(\mathbf X\right)}\) the random variable that takes for individual i the value \(Y^{\ast g\left(\mathbf X\left(i\right)\right)}\left(i\right)\), this criterion can be written as

$$g^{opt}=\underset{g\in\mathcal G}{\mathrm{argmax}\;}E\left[Y^{\ast g\left(\mathbf X\right)}\right]=\underset{g\in\mathcal G}{\mathrm{argmax}\;}E_{\mathbf X}\left[E\left[Y^{\ast g\left(\mathbf X\right)}\left|\mathbf X\right.\right]\right].$$
(1)

Analysis of primary problem

We will now analyze the primary problem as formalized by Eq. (1), with \(\mathcal{G}\) being the family of classification trees. We will do so while generalizing derivations by Zhang et al. (2012) and Tsiatis et al. (2020). Ultimately, our analysis will result in a (counterintuitive) transformation of our primary optimal treatment regime estimation problem into a supervised classification problem, with the truly optimal treatment alternative for each client acting as the “supervisor” (i.e., as that client’s “true class”). This is counterintuitive indeed, as the truly optimal treatment alternative for each client is typically unknown. The key to resolving this apparent paradox is that, in an initial, preparatory stage, the true class for each client (as well as the so-called client-specific misclassification costs that will be further explained below) is estimated on the basis of the RCT data (including the observed treatment assignment in that RCT). The latter will be further discussed below as the first of the secondary problems.

For our analysis, we first denote

$$E\left[Y\left|A=a,\mathbf X\right.\right]=\mu\left(a,\mathbf X\right).$$

We further assume (Rubin, 2005) that \(Y\left(i\right)=Y^{\ast A\left(i\right)}\left(i\right)\) (the so-called consistency assumption). If we also assume independence of \(\mathbf{Y}^{\mathbf{\ast}}\) and A conditional on X (which holds trivially in the case of an RCT and requires a so-called assumption of no unmeasured confounders in observational studies), it follows that

$$\mu\left(a,\mathbf X\right)=E\left[Y^{\ast a}\left|\mathbf X\right.\right].$$

In that case, (1) can be rewritten as

$$g^{opt}=\underset{g\in\mathcal G}{\mathrm{argmax}\;}E_{\mathbf X}\left[\mu\left(g\left(\mathbf X\right),\mathbf X\right)\right].$$

Since \(\underset k{\mathrm{max}}\;\mu\left(k,\mathbf X\right)\) does not depend on g, and since the difference \(\underset k{\mathrm{max}}\;\mu\left(k,\mathbf X\right)-\mu\left(g\left(\mathbf X\right),\mathbf X\right)\) vanishes whenever \(g\left(\mathbf X\right)\) coincides with \(\underset k{\mathrm{argmax}}\;\mu\left(k,\mathbf X\right)\), this further implies that

$$\begin{array}{l}g^{opt}=\underset{g\in\mathcal G}{\mathrm{argmin}\;}E_{\mathbf X}\left[\underset k{\mathrm{max}}\;\mu\left(k,\mathbf X\right)-\mu\left(g\left(\mathbf X\right),\mathbf X\right)\right]\\=\underset{g\in\mathcal G}{\mathrm{argmin}\;}E_{\mathbf X}\left[1_{g\left(\mathbf X\right)\neq\underset k{\mathrm{argmax}\;}\mu\left(k,\mathbf X\right)}\left(\underset k{\mathrm{max}\;}\mu\left(k,\mathbf X\right)-\mu\left(g\left(\mathbf X\right),\mathbf X\right)\right)\right].\end{array}$$

Hence, assuming that for each client i, (estimates of) all \(\mu\left(k,{\mathbf x}_i\right)\) values are available (and, hence, also the values of \(\underset k{\mathrm{max}}\:\mu\left(k,{\mathbf x}_i\right)\) and \(\underset k{\mathrm{argmax}}\:\mu\left(k,{\mathbf x}_i\right)\)), at the sample level one has to look for a classification tree g that minimizes

$$\frac1I\sum_i1_{g\left({\mathbf x}_i\right)\neq\underset k{\mathrm{argmax}}\;\mu\left(k,{\mathbf x}_i\right)}\left[\underset k{\max\;}\mu\left(k,{\mathbf x}_i\right)-\mu\left(g\left({\mathbf x}_i\right),{\mathbf x}_i\right)\right].$$
(2)

Such a tree is typically built by starting from a root node that includes all units (individuals), and by subsequently recursively partitioning each of the end nodes of the current tree on the basis of a covariate and a split point chosen so as to minimize (2).
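As a concrete illustration of this criterion, the following minimal sketch (in R, with hypothetical names) computes the average misclassification cost of Eq. (2) for a candidate regime, given an I × K matrix mu_hat of estimated \(\mu\left(k,{\mathbf x}_i\right)\) values and a vector g_x of treatment alternatives assigned by the regime:

```r
## Average misclassification cost of Eq. (2) for a candidate regime.
## mu_hat: I x K matrix with mu_hat[i, k] an estimate of mu(k, x_i);
## g_x:    length-I vector with the alternative g(x_i) assigned by the regime.
regime_cost <- function(g_x, mu_hat) {
  true_class <- max.col(mu_hat, ties.method = "first")   # argmax_k mu(k, x_i)
  row_id     <- seq_len(nrow(mu_hat))
  best       <- mu_hat[cbind(row_id, true_class)]        # max_k mu(k, x_i)
  assigned   <- mu_hat[cbind(row_id, g_x)]               # mu(g(x_i), x_i)
  mean((g_x != true_class) * (best - assigned))
}
```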

To arrive at a deeper understanding of Eq. (2), the interpretations listed in Table 1, which involve a transition from “individual” and “treatment alternative” to the more generic terms of “unit” and “class” that are customary in the classification domain, may be helpful. These interpretations imply that Eq. (2) comes down to a search for a classification tree that minimizes an average or total misclassification cost.

Table 1 Interpretation of terms of Eq. (2)

To further put this in a broader context, we may consider four different possible types of misclassification costs (summarized in Table 2) that are distinguished in the classification domain (see, e.g., Höppner et al., 2022, and references therein), the fourth of which applies to our primary problem.

Table 2 Types and forms of misclassification costs

The first line of this table corresponds to what Feng et al. (2021) call the “classical classification paradigm,” which aims to minimize the overall misclassification rate. In contrast, the lower three lines correspond to three types of “cost-sensitive learning.” In particular, the class-dependent type implies that the misclassification cost may vary across pairs of true and assigned classes. Note that the index i in T(i) implies unit dependency only via the true class of unit i, with the misclassification costs being fully captured by a K × K matrix. Note further that the latter matrix may be asymmetric, as in the example of classification of email messages as non-spam versus spam, where erroneously misclassifying truly non-spam mail as spam may be more costly than the other way around. As an example of unit- (instance-, example-, or case-)dependent costs, one may think of credit scoring, where the cost of misclassifying a customer as creditworthy (vs. not creditworthy) may depend on the amount that the customer in question wants to borrow.

In our case, Eq. (2) implies that the misclassification cost depends both on the client (unit) and on the treatment alternative (class) to which the client is (mis)classified by the treatment regime. Hence, we are dealing with a unit- and class-dependent misclassification cost, indeed.

Solution to the primary problem

In the search for a readily accessible software tool for estimating classification trees with minimal misclassification cost, an obvious choice is the mainstream package for tree estimation rpart (Therneau et al., 2022). The rpart package can deal with misclassification costs of different types: A constant cost is the default option, the class-dependent case can be dealt with by making use of the option of a loss matrix, and the unit-dependent case can be handled by using the option of case weights. For the unit- and class-dependent case, however, the situation is somewhat different. If K = 2 and if correct classifications go with a zero cost, one can make use of the fact that the I × 2 misclassification cost matrix can be reduced to a vector of length I; the latter can be included in rpart via the weight vector. If K > 2, however, so far no solution has been made available.
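For instance, the following minimal sketch (with toy data, not part of the actual analyses) illustrates these two standard rpart options, that is, the loss matrix for class-dependent costs and the case weights for unit-dependent costs:

```r
library(rpart)
set.seed(1)
## toy data with K = 3 classes and two covariates (for illustration only)
dat <- data.frame(true_class = factor(sample(1:3, 150, replace = TRUE)),
                  X1 = rnorm(150), X2 = rnorm(150))

## class-dependent costs: a K x K loss matrix with zeros on the diagonal
L <- matrix(c(0, 1, 4,
              2, 0, 1,
              1, 3, 0), nrow = 3, byrow = TRUE)
fit_class <- rpart(true_class ~ X1 + X2, data = dat, method = "class",
                   parms = list(loss = L))

## unit-dependent costs: one case weight per unit (for K = 2 with zero-cost
## correct classifications, the I x 2 cost matrix reduces to such a weight vector)
w <- runif(150, 0.5, 2)
fit_unit <- rpart(true_class ~ X1 + X2, data = dat, method = "class", weights = w)
```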

We propose for this purpose a novel code snippet multivalued, a fully documented version of which (including all code used in the analysis of the running example, as reported below) is available through GitHub (https://github.com/KULeuven-PPW-OKPIV/multivalued), and a significant part of which is also available in the online supplementary material. This code snippet is to be used in conjunction with rpart, and is based on the following building blocks:

1. The option in rpart to make use of a user-defined splitting function (Therneau, 2022)

2. The possibility to enter in the optional cost matrix in rpart a rectangular I × K matrix

3. The possibility to enter in the weight vector of rpart the unit numbers, which subsequently can be used to select for each unit the relevant row from the cost matrix

Note that the user-defined splitting function implemented in the code snippet implies that in the tree building, the total misclassification cost as formalized by Eq. (2) is directly minimized (rather than the Gini index that is used as default criterion in rpart). Note further that, whereas utilizing a user-defined split function in rpart can slow down the package’s execution (Therneau, 2022), in our case the slowing down appeared not to be prohibitively large (with, e.g., each of the analyses in our illustrative application taking less than one second).
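To make the three building blocks more tangible, the following is a minimal, hedged sketch (not the multivalued code itself, and with names of our own choosing) of an rpart user-written splitting method that directly minimizes the cost criterion of Eq. (2); it handles continuous covariates only and passes the I × K cost matrix via parms and the unit numbers via the weight vector:

```r
library(rpart)

## best single treatment alternative for the units in a node, and its total cost
node_cost <- function(ids, parms) {
  colcost <- colSums(parms$cost[ids, , drop = FALSE])
  list(label = which.min(colcost), deviance = min(colcost))
}

itemp <- function(y, offset, parms, wt) {        # initialization function
  list(y = y, parms = parms, numresp = 1, numy = 1,
       summary = function(yval, dev, wt, ylevel, digits)
         paste("assigned alternative =", yval, ", cost =", format(signif(dev, digits))))
}

etemp <- function(y, wt, parms) node_cost(wt, parms)   # wt carries the unit numbers

stemp <- function(y, wt, x, parms, continuous) {       # split evaluation function
  if (!continuous) stop("categorical covariates are not handled in this sketch")
  n <- length(wt)                                      # observations arrive sorted by x
  parent <- node_cost(wt, parms)$deviance
  goodness <- numeric(n - 1)
  for (j in seq_len(n - 1)) {
    left  <- node_cost(wt[1:j], parms)$deviance
    right <- node_cost(wt[(j + 1):n], parms)$deviance
    goodness[j] <- parent - (left + right)             # reduction in total cost
  }
  list(goodness = pmax(goodness, 0), direction = rep(-1, n - 1))
}

## hypothetical usage, once true_class (argmax of the estimated potential outcomes)
## and the I x K matrix cost have been computed (see the secondary problems below):
## fit <- rpart(true_class ~ ., data = covariate_data,
##              weights = seq_len(nrow(covariate_data)),
##              method  = list(init = itemp, eval = etemp, split = stemp),
##              parms   = list(cost = cost),
##              control = rpart.control(cp = 0, xval = 20))
```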

Secondary problems

Two secondary problems show up in the estimation of tree-based optimal treatment regimes and require a satisfactory solution. The first of these pertains to the estimation of the true class and the misclassification costs for each client. This estimation is hampered by the structural missingness in typical observational studies and RCTs on treatment evaluation, which arises because every individual is assigned to a single treatment alternative only, so that for each individual only a single potential outcome is observed. In the literature, various solutions have been proposed to deal with this issue (Laber & Zhao, 2015; Tsiatis et al., 2020; Zhang et al., 2012). A first solution comes down to the estimation of a regression-based outcome model, in which the outcome is modeled as a function of baseline characteristics, treatment assignment, and the interaction between them (Tsiatis et al., 2020). This approach (also called Q-learning, with “Q” referring to “quality”), however, strongly depends on whether the outcome model in question has been correctly specified. As a way out, one may consider relying on flexible modeling solutions, such as random forests (Laber & Zhao, 2015). Still another possibility is to revert to more robust estimators of potential outcomes and their contrasts (Tsiatis et al., 2020; Zhang et al., 2012); such estimators include inverse probability weighted estimators (IPWE, with the probabilities in question pertaining to propensity models that deal with observational studies) and augmented inverse probability weighted estimators (AIPWE, with the augmentation pertaining to a term stemming from an outcome model). A particularly promising characteristic of AIPWEs is their so-called double robustness, meaning that they can be shown to be consistent if either the propensity model or the outcome model has been correctly specified (with the propensity model being trivially correct in the case of an RCT).
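As a minimal sketch (with hypothetical names) of the AIPWE construction just described: for each individual i and arm k, an augmented inverse probability weighted pseudo-outcome combines the inverse-probability-weighted observed outcome with an augmentation term from an outcome model; row-wise maxima and argmaxima then provide the estimated misclassification costs and true classes.

```r
## AIPWE pseudo-outcomes: Y observed outcomes, A received treatment (1..K),
## mu_hat an I x K matrix of outcome-model predictions, pi_hat a length-K vector
## of propensities (theoretical randomization probabilities or arm proportions).
aipwe_potential_outcomes <- function(Y, A, mu_hat, pi_hat) {
  out <- matrix(NA_real_, length(Y), length(pi_hat))
  for (k in seq_along(pi_hat)) {
    ind <- as.numeric(A == k)
    out[, k] <- ind * Y / pi_hat[k] - (ind - pi_hat[k]) / pi_hat[k] * mu_hat[, k]
  }
  out
}

## hypothetical follow-up: estimated true class and unit- and class-dependent costs
## po         <- aipwe_potential_outcomes(Y, A, mu_hat, pi_hat)
## true_class <- max.col(po, ties.method = "first")
## cost       <- apply(po, 1, max) - po
```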

The second of the secondary problems to be dealt with is the issue of replicability. This is an issue of concern because of the replicability crisis in many disciplines of science, not least psychology (see, e.g., Klein et al., 2018). Moreover, the issue is of particular relevance in the estimation of treatment regimes, and more generally in exploratory subgroup analyses in clinical applications, as in the past such analyses have not infrequently led to conclusions that could not be replicated (e.g., Rothwell, 2005). As a consequence, some authors have even dismissed exploratory subgroup analyses as “data dredging” (Feinstein, 1998; Rothwell, 2005). This replicability problem relates in part to the structural missingness referred to above and the estimation uncertainty it implies. Furthermore, it also relates directly to the fact that, compared to main effects, considerably larger sample sizes are needed for a reliable estimation of treatment by subgroup interactions, and in particular for the detection of the qualitative or disordinal interactions that constitute the basis of nontrivial treatment regimes (Brookes et al., 2004). On top of all this comes the long-known problem of the instability of trees with regard to selected variables and split points (Breiman et al., 1984). (One might argue that the mainstream tree estimation package rpart includes a cross-validation procedure, which could be considered a kind of protection that safeguards replicability; in the package, however, cross-validation is used only in a pruning procedure after the tree building, and as such offers only a limited way out for the instability issue.) As a solution for all these problems, we propose appealing to multiverse analysis (Steegen et al., 2016), that is to say, making use of a broad range of analysis options and subsequently checking which results consistently emerge across this range.

Simulation study

We set up a Monte Carlo simulation study to evaluate the performance of three variants of the methodology to estimate optimal tree-based treatment regimes in RCTs for which we proposed an accessible software solution. The three variants are based on three different ways to estimate the true class and the misclassification costs for each client: (1) a flexible outcome model based on random forests (more information on which will be provided in the Application section), (2) a classical AIPWE approach based on a linear regression outcome model (that includes all baseline covariates and their interaction with treatment) and theoretical propensity probabilities, and (3) the same AIPWE approach in which the theoretical propensity probabilities are replaced by empirical proportions (which, somewhat paradoxically, has been shown to yield a more efficient estimation: see Tsiatis, 2006, p. 206; Tsiatis et al., 2020, p. 44–46). We wanted to evaluate the performance of these methods across a range of settings that varied in terms of data characteristics that could affect that performance. In particular, we generated data \(\left(Y_i,A_i,X_{1i},\dots,X_{5i}\right),\)  \(i=1,\dots,I\), with \(\left(X_1,\dots,X_5\right)\)  iid standard normal, A multinomial with \(P\left(A=a_k\right)=\frac1K,\) and Y generated according to

$$Y_i=1.0+0.25X_{1i}+0.25X_{2i}-0.25X_{5i}-\theta1_{a_i\neq g^{opt}\left(\mathbf{x}_i\right)}+E_i,$$

with θ a prespecified constant, \(g^{opt}\left(\mathbf{x}_i\right)\) the truly optimal treatment alternative for individual i as defined below, and \(E_i\) standard normal. In the data generation, we further systematically varied the following four data characteristics in a full factorial design (with 500 data sets within each cell; a small data-generation sketch, under simplified assumptions, follows the list below):

(1) The number of arms in the RCT, K = 3, 4

(2) The sample size per arm, nK = 50, 100, 200

(3) The effect size of the difference in outcome between optimal and non-optimal treatment alternatives in each relevant subgroup, θ = 0.50, 1

(4) The true optimal treatment regime (OTR) underlying the data, based on three scenarios (a tree-based, a non-tree-based, and a one-size-fits-all or trivial one), as depicted in Fig. 1

Fig. 1 True optimal treatment regimes used in simulation study with tree-based (row 1), non-tree-based (row 2), and one-size-fits-all or trivial (row 3) scenario and with three (column 1) and four (column 2) treatment alternatives
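To make the data-generating model above concrete, the following minimal sketch generates one data set for K = 3; the regime g_opt_hypo is a hypothetical stand-in for the true tree-based regime of Fig. 1 (which is not reproduced here), and all names are ours.

```r
## One simulated RCT data set under the tree-based scenario (K = 3).
simulate_rct <- function(I = 150, K = 3, theta = 0.5) {
  X <- matrix(rnorm(I * 5), I, 5, dimnames = list(NULL, paste0("X", 1:5)))
  A <- sample(1:K, I, replace = TRUE)                    # P(A = k) = 1/K
  ## hypothetical stand-in for the true tree-based regime of Fig. 1
  g_opt_hypo <- function(x) if (x["X1"] <= 0) 1 else if (x["X2"] <= 0) 2 else 3
  gopt <- apply(X, 1, g_opt_hypo)
  Y <- 1.0 + 0.25 * X[, 1] + 0.25 * X[, 2] - 0.25 * X[, 5] -
    theta * (A != gopt) + rnorm(I)
  data.frame(Y = Y, A = factor(A), X)
}
```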

We subjected each simulated data set to each of the three OTR estimation methods under study in conjunction with rpart. For the pruning of the estimated optimal tree-based treatment regimes, we used 20-fold cross-validation.
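A minimal sketch of this pruning step, assuming a tree grown with rpart.control(xval = 20, cp = 0) such as the one sketched in the Method section; selecting the complexity parameter with the smallest cross-validated error is one common rule and may differ from the exact rule used here:

```r
library(rpart)
## prune an rpart fit back to the subtree with minimal cross-validated error
prune_min_xerror <- function(fit) {
  cp_best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  prune(fit, cp = cp_best)
}
```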

We focused on four performance aspects:

(1) The expected outcome of each estimated regime, in terms of its normalized performance gain (NPG), which compares the benefit gained by administering that regime over administering the marginally best treatment alternative \(a^{opt}\) to the benefit that could have been gained theoretically,

$$NPG=\frac{E\left(Y^{\ast\widehat g^{opt}}\right)-E\left(Y^{\ast a^{opt}}\right)}{E\left(Y^{\ast g^{opt}}\right)-E\left(Y^{\ast a^{opt}}\right)},$$

where \(E\left(Y^{\ast\widehat g^{opt}}\right)\) and \(E\left(Y^{\ast a^{opt}}\right)\) were calculated on the basis of a simulated “super-sample” of \(10^6\) observations generated on the basis of the true model underlying the simulated data.

(2) Classification accuracy, that is, the proportion of patients assigned to their truly optimal treatment alternative.

(3) The “Type I error rate” (with a slight abuse of terminology), that is, the proportion of data sets in a cell generated under a one-size-fits-all scenario for which a method erroneously yielded a nontrivial estimated OTR.

(4) The “Type II error rate” (with again a slight abuse of terminology), that is, the proportion of data sets in a cell generated under a nontrivial scenario for which a method erroneously yielded a one-size-fits-all estimated OTR.

To identify the most important effects, we subjected the four outcome measures to a repeated-measures ANOVA, with the within-factor pertaining to the type of OTR estimation method and the between-factors to the data characteristics. We will further only discuss effects with an effect size η² ≥ 0.05 (see Table 3).

Table 3 Effect size (η²) of effects in simulation study with η² ≥ .05

The sizable effects were primarily main effects, the content of which can be derived from Table 4. As could have been expected, better performance (in terms of expected outcome and classification accuracy) was found for tree-based as compared to non-tree-based scenarios, for a larger effect size of the differences in outcome between optimal and non-optimal treatment alternatives, and for larger sample sizes. Furthermore, inferential error rates also appeared to be lower in the case of a larger effect size of the differences in outcome between optimal and non-optimal treatment alternatives and of larger sample sizes; in addition, the random forest-based method appeared to yield slightly better inferential results, this being especially true for the “Type I error rate” in the case of smaller sample sizes per treatment arm (Fig. 2).

Table 4 Marginal means on four outcome variables for categories of five design variables in simulation study. Marginal means for main effects with effect size η² ≥ .05 are depicted in bold
Fig. 2 Method by sample size per arm interaction for outcome measure “Type I error rate” in simulation study

Illustrative application

Analysis

The data of the Breast Cancer Recovery Project are publicly available as part of the R package quint (Dusseldorp et al., 2016, 2022). As already indicated above, all code for the analyses described in the present section is available through GitHub (https://github.com/KULeuven-PPW-OKPIV/multivalued) and in the online supplementary material.

We analyze the Breast Cancer Recovery data, with all 11 pretreatment characteristics measured at baseline as covariates, including age, physical functioning, and number of comorbidities (such as diabetes, migraine, arthritis, angina, …), and with physical functioning measured at 9-month follow-up minus physical functioning measured at baseline (i.e., improvement in physical functioning) as outcome variable; note that possible objections against the use of a change score as an outcome variable (e.g., Senn, 2006) do not apply here, as we also included physical functioning at baseline as covariate. We estimate the misclassification costs via, on the one hand, a flexible outcome modeling technique, that is to say, random forests, and, on the other hand, AIPWE.

For the random forests, we make use of the ranger package (Wright & Ziegler, 2017), with default values for the tuning parameters (including 500 trees per forest, and with the square root of the total number of variables as the size of the random subset of variables considered for the splits in each tree). Moreover, we impose the constraint that the treatment assignment variable is always a member of the subset of variables considered for splits, to avoid a bias of the random forest towards trivial one-size-fits-all regimes (which would be implied by a random non-inclusion of treatment assignment). Importantly, this constraint only means that treatment assignment is always considered as a candidate for the splits; it thus still leaves room for trees without treatment assignment as a splitting variable and, hence, without treatment effect heterogeneity. Ultimately, it is up to the data to decide whether or not treatment effect heterogeneity will show up in the random forest. This further implies that the constrained random forest approach can perfectly well lead to one-size-fits-all estimated optimal treatment regimes (as will also appear below).
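A minimal sketch of this random forest step, under a hypothetical data layout (outcome Y, treatment factor A, and baseline covariates in a data frame dat); the actual analysis code is available through the GitHub repository mentioned above:

```r
library(ranger)

## Fit a random forest outcome model with treatment forced into every candidate
## split set, and predict mu_hat(k, x) for each arm k by setting A to k.
rf_mu_hat <- function(dat, covariates, seed = 1) {
  set.seed(seed)                       # one of the random starts of the multiverse
  f <- as.formula(paste("Y ~ A +", paste(covariates, collapse = " + ")))
  fit <- ranger(f, data = dat, num.trees = 500,
                always.split.variables = "A")
  sapply(levels(dat$A), function(k) {
    newdat   <- dat
    newdat$A <- factor(k, levels = levels(dat$A))
    predict(fit, data = newdat)$predictions
  })                                   # I x K matrix of predicted outcomes
}
```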

For the AIPWE, we use for the propensity part both the theoretical probabilities and the empirical proportions of individuals assigned to the three treatment arms (standard medical care: 0.34, nutrition: 0.35, education: 0.31); for the outcome model part, we use a regression model that includes as predictors all baseline characteristics as well as their interactions with treatment assignment.
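For the outcome-model part of the AIPWE, a corresponding minimal sketch (same hypothetical data layout as above) fits a linear regression with all baseline covariates, treatment, and their interactions, and predicts the outcome under each arm; the resulting matrix can be plugged into the AIPWE pseudo-outcome sketch given earlier:

```r
## Linear outcome model with treatment-by-covariate interactions.
lm_mu_hat <- function(dat, covariates) {
  f <- as.formula(paste("Y ~ A * (", paste(covariates, collapse = " + "), ")"))
  fit <- lm(f, data = dat)
  sapply(levels(dat$A), function(k) {
    newdat   <- dat
    newdat$A <- factor(k, levels = levels(dat$A))
    predict(fit, newdata = newdat)
  })   # I x K matrix of model-based predictions mu_hat(k, x_i)
}
```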

For the multiverse analysis, in addition to using both random forests and AIPWE for the misclassification cost estimation, and in addition to using both theoretical probabilities and empirical proportions for the inverse probability weighting, we use three random starts for the random forests; furthermore, we repeat the 20-fold cross-validation for the pruning in rpart five times. All this results in 3 (random forest starts) × 5 (pruning cross-validation repetitions) = 15 random forest-based solutions, and 2 (theoretical probabilities vs. empirical proportions) × 5 (pruning cross-validation repetitions) = 10 AIPWE-based solutions.

Results

Out of the 15 random forest-based solutions, nine solutions are a tree with a root only, that is to say, a trivial, one-size-fits-all treatment regime in which all individuals are assigned to education; furthermore, five solutions are a nontrivial regime that involves the baseline characteristics age, number of comorbidities, and physical functioning at baseline. This regime is graphically represented in Fig. 3. It implies that relatively younger women with a higher number of comorbidities (such as diabetes, migraine, etc.) and a relatively higher level of physical functioning at baseline should ideally receive the nutrition intervention, whereas relatively older women should ideally receive standard medical care. Finally, one nontrivial solution involves age only.

Fig. 3 Nontrivial random forest-based optimal decision tree for Breast Cancer Recovery data

Out of the 10 AIPWE-based solutions, 6 solutions pertain to a trivial, one-size-fits-all treatment regime in which all individuals are assigned to education; 2 solutions pertain to the nontrivial regime that is graphically represented in Fig. 4 and that involves the number of comorbidities once as the only covariate (with women with a relatively higher number of comorbidities ideally being assigned to the nutrition intervention); 1 nontrivial solution further involves the number of comorbidities twice and 1 nontrivial solution involves the number of comorbidities twice in addition to age.

Fig. 4 Nontrivial AIPWE-based optimal decision tree for Breast Cancer Recovery data

Discussion

Taken together, our analyses yielded only fairly weak evidence for a nontrivial optimal treatment regime, with 60% of both the random forest-based and the AIPWE-based solutions being of the one-size-fits-all type, and with the education intervention overall seeming the most beneficial. That being said, 9 out of the 25 obtained solutions included some indication that clients who suffer from a higher number of comorbidities (diabetes, migraine, arthritis, angina, etc.) might benefit more from a nutrition intervention.

With about 75 clients per arm, this RCT is situated between the first and the second sample size levels of the simulation study in the previous section. This goes with somewhat higher “Type I” and “Type II” error rates and, hence, with somewhat higher inferential uncertainty.

Concluding remarks

In this paper we proposed a readily accessible methodology for estimating optimal classification trees that minimize a loss function involving a unit- and class-dependent misclassification cost in the case of a classification problem with K > 2 classes. In conjunction with a multiverse approach to both misclassification cost and tree estimation, the proposed methodology provided an insightful yet nuanced solution to a problem of optimal treatment assignment with K > 2 treatment alternatives.

We focused here on the use of our proposed methodology within the context of RCTs. The methodology, however, and in particular the AIPWE variant with propensity probabilities, is in principle also applicable to data from observational studies. To be sure, this type of application requires the (untestable) assumption that there are no unmeasured covariates that are associated with both treatment assignment and the potential outcomes (“no unmeasured confounders”: Tsiatis et al., 2020, p. 27ff.), in addition to other assumptions such as positivity, that is, the requirement that the probability of assignment to each treatment alternative be strictly positive in every area of the covariate space.

The proposed methodology was introduced as a tool to estimate optimal tree-based single-decision regimes that involve K > 2 treatment alternatives. However, in line with the suggestion made by Holloway et al. (2023) for their K = 2 method, by calling our proposed methodology repeatedly, its use can also be extended to multiple-decision regimes in a multistage treatment context (with the additional option to include in that case at some decision point also all evolving client information available at that point as covariates).

The proposed methodology was further introduced as a tool to address treatment-related decisions in a clinical context. That being said, the methodology could also be used to address formally similar problems in other areas. As a first example, one may think of the choice between K > 2 types of churn management in customer retention (Lemmens & Gupta, 2020), a context in which the misclassification cost can be taken literally and expressed in monetary terms. As a second example, one may think of the identification of optimal data-analytic regimes in data-analytic benchmarking that involves a comparison of K > 2 data-analytic methods (Doove et al., 2017); note that the latter type of application is technically simpler, as in a benchmarking context the outcome of all methods under study when applied to a data set at hand is typically known.

Finally, multiverse approaches similar to the ones that we proposed and applied to cope both with replicability challenges in optimal treatment regime estimation and with stability challenges in tree-based analyses, may be applied more broadly as well. As examples one may think of optimal tree-based treatment regime estimation in the case of two treatment alternatives, of optimal treatment regime searches within families of regimes other than the tree-based ones, and of tree analyses based on structures other than simple classification trees, such as regression trees and model-based recursive partitioning (Zeileis et al., 2008).