1 Introduction

Uplift modeling aims to estimate the incremental impact of a treatment, such as a marketing campaign or a drug, on an individual’s behavior. These approaches are very useful in applications such as personalized medicine and advertising, as they allow targeting the specific part of a population on which the treatment will have the greatest impact. Uplift estimation is based on groups of people who have received different treatments. A major difficulty is that the data are only partially known: it is impossible to know for an individual whether the chosen treatment is optimal because their responses to alternative treatments cannot be observed. Several works address challenges related to uplift modeling, among which uplift decision tree algorithms have become widely used [15, 17].

Despite their usefulness, current uplift decision tree methods have limitations such as local splitting criteria. A split criterion decides whether to divide a terminal node. However, these splits are independent of each other, and a pruning step is then needed to ensure generalization and avoid overfitting. Moreover, these methods require parameters to be set. In this paper, we present UB-DT (Uplift Bayesian Decision Tree), a parameter-free method for uplift decision trees based on the Bayesian paradigm. Contrary to state-of-the-art uplift decision tree methods, we define a global criterion designed for an uplift decision tree. A major advantage of a global tree criterion is that it removes the need for a pruning step, since it acts as a regularizer against overfitting. We transform uplift tree learning into an optimization problem with respect to this criterion, and a search algorithm is used to find the decision tree that optimizes it. Moreover, our approach is easily extended to random forests, and we propose UB-RF (Uplift Bayesian Random Forest). We compare both UB-DT and UB-RF against state-of-the-art uplift modeling approaches through a benchmarking study.

This paper is organized as follows. Section 2 introduces an overview of uplift modeling and related work. Section 3 presents UB-DT. We conduct experiments in Sect. 4 and conclude in Sect. 5.

2 Context and Literature Overview

2.1 Uplift Problem Formulation

Uplift is a notion introduced by Radcliffe and Surry [11] and defined in Rubin’s causal inference models [14] as the Individual Treatment Effect (ITE).

We now outline the notion of uplift and its modeling. Let X be a group of N individuals indexed by \(i: 1 \ldots N\), where each individual is described by a set of variables \(\mathbb {K}\). \(X_i\) denotes the set of values of \(\mathbb {K}\) for individual i. Let Z be a variable indicating whether or not an individual has received a treatment. Uplift modeling is based on two groups: the individuals having received a treatment (denoted \(Z=1\)) and those without treatment (denoted \(Z=0\)). Let Y be the outcome variable (for instance, the purchase or not of a product). We denote by \(Y_i(Z=1)\) the outcome of an individual i when they receive the treatment and by \(Y_i(Z=0)\) their outcome without treatment. The uplift of an individual i, denoted by \(\tau _i\), is defined as: \(\tau _i = Y_i(Z=1) - Y_i(Z=0)\).

In practice, we never observe both \(Y_i(Z=1)\) and \(Y_i(Z=0)\) for the same individual, and thus \(\tau _i\) cannot be directly calculated. However, uplift can be empirically estimated by considering two groups: a treatment group (individuals with a treatment) and a control group (individuals without treatment). The estimated uplift of an individual i, denoted by \(\hat{\tau }_{i}\), is then computed using the CATE (Conditional Average Treatment Effect) [14]:

$$\begin{aligned} CATE : \hat{\tau }_{i} = \mathbb {E}[Y_i(Z=1)|X_i] - \mathbb {E}[Y_i(Z=0)|X_i] \end{aligned}$$
(1)

As the real value of \(\tau _i\) cannot be observed, it is impossible to directly use machine learning algorithms such as regression to infer a model to predict \(\tau _i\). The next section describes how uplift is modeled in the literature.
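To make Eq. 1 concrete, the following minimal sketch (hypothetical toy data, not from the paper) estimates the CATE within a segment of comparable individuals as the difference of mean outcomes between the treated and control groups:

```python
import numpy as np

def empirical_cate(y, z):
    """Difference of mean outcomes between treated (z=1) and control (z=0)
    individuals of a segment with comparable covariates."""
    y, z = np.asarray(y), np.asarray(z)
    return y[z == 1].mean() - y[z == 0].mean()

# toy segment: 6 treated and 6 control individuals with binary outcomes
y = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0]
z = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(empirical_cate(y, z))  # 4/6 - 2/6 = 0.333...
```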

2.2 Related Work

Uplift Modeling Approaches. Uplift modeling approaches fall into two categories. The first one (called metalearners) is made up of methods that take advantage of standard machine learning algorithms to estimate the CATE. One of the most intuitive approaches is the two-model approach. It consists of fitting two independent classification models, one for the treated group and another for the control group. The estimated uplift is then the difference between the estimations of the two classification models. While this approach is simple, intuitive and allows the use of any machine learning algorithm, it also has known weaknesses with particular patterns [12]. The causal inference community has also proposed other metalearners such as X-learner [8], R-learner and DR-learner [7].
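As an illustration, the two-model approach can be sketched as follows (any base classifier can be plugged in; scikit-learn’s gradient boosting is used here only as an example):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_two_models(X, y, z):
    """Fit one classifier on the treated group and one on the control group."""
    X, y, z = np.asarray(X), np.asarray(y), np.asarray(z)
    model_t = GradientBoostingClassifier().fit(X[z == 1], y[z == 1])
    model_c = GradientBoostingClassifier().fit(X[z == 0], y[z == 0])
    return model_t, model_c

def predict_uplift(model_t, model_c, X_new):
    """Estimated uplift = P(Y=1 | X, Z=1) - P(Y=1 | X, Z=0)."""
    return (model_t.predict_proba(X_new)[:, 1]
            - model_c.predict_proba(X_new)[:, 1])
```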

The second category is closer to our work. It gathers methods tailored for uplift modeling, such as tree-based algorithms. Trees are built by recursive partitioning, splitting the root node into child nodes according to a splitting criterion. [15] defines a splitting criterion that compares the probability distributions of the outcome variable in each of the treatment groups using weighted divergence measures such as the Kullback-Leibler divergence (KL), the squared Euclidean distance (ED) and the chi-squared divergence. [17] proposes the Contextual Treatment Selection algorithm (CTS), where the splitting criterion directly maximizes a performance measure called the expected performance. Causal machine learning algorithms have also been developed, such as Causal Trees [1] and Causal Forests [2].
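A simplified sketch of such a divergence-based split criterion is given below (our own reading; the actual criteria of [15] also include normalization terms that are omitted here). The gain of a candidate split is the weighted divergence between the per-treatment outcome distributions after the split minus the divergence before it:

```python
import numpy as np

def ed_divergence(p_treat, p_ctrl):
    """Squared Euclidean distance between the class distributions of the
    treatment and control groups (one of the divergences used in [15])."""
    return float(np.sum((np.asarray(p_treat) - np.asarray(p_ctrl)) ** 2))

def split_gain(parent, left, right):
    """Each argument is a dict with 'p_t' and 'p_c' (class distributions per
    treatment group) and 'n' (number of instances). Higher gain = better split."""
    n = left["n"] + right["n"]
    after = (left["n"] / n) * ed_divergence(left["p_t"], left["p_c"]) \
          + (right["n"] / n) * ed_divergence(right["p_t"], right["p_c"])
    return after - ed_divergence(parent["p_t"], parent["p_c"])
```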

Uplift Tree Splitting Criterion and Bayesian Approaches. Building an uplift tree requires discretizing variables to detect areas with homogeneous treatment effects. The global criterion of UB-DT used to select a variable at a node takes advantage of a univariate parameter-free Bayesian approach for density estimation through discretization, called UMODL [13]. More precisely, UMODL applies a Bayesian approach to select the most probable uplift discretization model M given the data. This means finding the model M that maximizes the posterior probability P(M|Data), hence maximizing \(P(M) \times P(Data|M)\). Finally, a global criterion within the Bayesian framework for decision trees is given in [16], but it does not deal with uplift.

3 UB-DT: Uplift Decision Tree Approach

UB-DT is made up of two ingredients: a global criterion \(C({T}) \) for a binary uplift decision tree T and a tree search algorithm to find the most probable tree. We start by presenting the structure of an uplift tree model. Then we describe the new global criterion for an uplift decision tree and the algorithm that returns the best tree. Finally, we show how the approach is straightforwardly extended to random forests.

3.1 Parameters of an Uplift Tree Model T

Fig. 1. Example of an uplift tree model. Internal nodes are described by the segmentation variable \(X_s\) and the distribution of instances in each of the two children \(\{N_{si}\}\). Leaf nodes containing a treatment effect (i.e. \(W_l=1\)) are described by the class distribution for each treatment; this applies to leaves 4, 5 and 7. Leaf nodes containing no treatment effect (i.e. \(W_l=0\)) are only described by the class distribution (this is the case of leaf 6).

We define a binary uplift decision tree model T by its structure and by the distribution of instances and class values in this structure. The structure of T consists of the set of internal nodes \(\mathbb {S}_T\) and the set of leaf nodes \(\mathbb {L}_T\). The distribution of the instances in this structure is described by the partition of the segmentation variable \(X_s\) for each internal node s, the class frequency in each leaf node where there is no treatment effect, and the class frequency per treatment in the leaf nodes with a treatment effect. More precisely, T is defined by:

  • the subset of variables \(\mathbb {K}_T\) used by model T. This includes the number \(K_T\) of selected variables and their choice among the set \(\mathbb {K}\) of variables provided in the dataset; we note \(K = |\mathbb {K}|\).

  • a binary variable \(I_n\) indicating the choice of whether each node n is an internal node (\(I_n=1\)) or a leaf node (\(I_n=0\)).

  • the distribution of instances in each internal node s, which is described by the segmentation variable \(X_s\) of the node s and how the instances of s are distributed on its two child nodes.

  • a binary variable \(W_l\) indicating for each leaf node l if there is a treatment effect (\(W_l=1\)) or not (\(W_l=0\)). If \(W_l=0\), l is described by the distribution of the output values \(\{N_{l.j.}\}_{1\le j\le J}\), where \(N_{l . j.}\) is the number of instances of output value j in leaf l. If \(W_l=1\), l is described by the distribution of the class values per treatment \(\{N_{l.jt}\}_{1\le j\le J, 1\le t\le 2}\), where \(N_{l.jt}\) is the number of instances of output value j and treatment t in leaf l.

These parameters are automatically optimized by the search algorithm (presented in Sect. 3.4) and are not fixed by the user. In the rest of the paper, the notations \(N_{s .}\), \(N_{s i .}\), \(N_{l.}\) and \(N_{l ..t} \) will additionally be used to designate, respectively, the number of instances in node s, in the \(i^{t h}\) child of node s, in leaf l, and with treatment t in leaf l.
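For concreteness, the tree parameters above can be held in a structure such as the following (a hypothetical sketch; the field names are ours, not the authors’):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class UpliftNode:
    is_internal: bool = False                 # I_n: internal node (True) or leaf (False)
    split_variable: Optional[str] = None      # X_s, defined for internal nodes only
    children: List["UpliftNode"] = field(default_factory=list)  # two children when internal
    treatment_effect: bool = False            # W_l, defined for leaf nodes only
    class_counts: Dict[int, int] = field(default_factory=dict)  # N_{l.j.} when W_l = 0
    class_counts_per_treatment: Dict[int, Dict[int, int]] = field(default_factory=dict)  # N_{l.jt} when W_l = 1
```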

3.2 Uplift Tree Evaluation Criterion

We now present the new global criterion \(C({T}) \), which evaluates an uplift tree model. UB-DT applies a Bayesian approach to select the most probable uplift tree model T, i.e. the one that maximizes the posterior probability P(T|Data). This is equivalent to maximizing the product of the prior and the likelihood, i.e. \(P(T) \times P(Data|T)\). Taking the negative log turns the maximization problem into a minimization one: \(C({T}) = - \log \left( P(T) \times P(Data|T)\right) \); \(C({T}) \) is the cost of the uplift tree model T, and T is optimal if \(C({T}) \) is minimal. By exploiting the hierarchy of the uplift tree parameters presented above and assuming a uniform prior, we express \(C({T}) \) as follows (cf. Eq. 2):

$$\begin{aligned} \begin{aligned}&C(T) =\underbrace{\log (K+1) + \log \left( \begin{array}{c} K+K_T-1 \\ K_T \end{array}\right) }_{\text {Variable selection}}\\&+\underbrace{\sum _{s \in \mathbb {S}_{T}} \left( \log 2+\log K_T+\log (N_{s.}+1)\right) }_{\text {Prior of internal nodes}}+\underbrace{\sum _{l \in \mathbb {L}_T}\log 2}_{\text {Treatment effect }W_l}\\&+\underbrace{\sum _{l \in \mathbb {L}_T} \log 2 +\sum _{l \in \mathbb {L}_T}(1-W_l)\log \left( \begin{array}{c} N_{l .}+J-1 \\ J-1 \end{array}\right) + \sum _{l \in \mathbb {L}_T}W_l\sum _{t}\log \left( \begin{array}{c} N_{l..t}+J-1 \\ J-1 \end{array}\right) }_{\text {Prior of leaf nodes}}\\&+\underbrace{\sum _{l \in \mathbb {L}_T}(1-W_l)\log \frac{N_{l .} !}{N_{l.1.} ! N_{l.2.} ! \ldots N_{l.J.} !}+ \sum _{l \in \mathbb {L}_T}W_l\sum _{t}\log {\frac{N_{l..t}!}{N_{l.1t}! \ldots N_{l.Jt}!}}}_{\text {Tree likelihood}} \end{aligned} \end{aligned}$$
(2)

The next section proves Eq. 2.
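As a rough illustration, the per-leaf terms of Eq. 2 can be computed as follows (our own sketch, not the authors’ code; natural logarithms and log-gamma are used for the factorials, and the variable-selection and internal-node terms of Eq. 2 would be added on top):

```python
from math import lgamma, log

def log_binom(n, k):
    """log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_multinomial(counts):
    """log of N! / (n_1! ... n_J!) for a list of class counts."""
    return lgamma(sum(counts) + 1) - sum(lgamma(c + 1) for c in counts)

def leaf_cost(counts_per_treatment, has_effect, J):
    """Contribution of one leaf to C(T): choice of leaf vs internal node,
    choice of W_l, prior on the class distribution(s), and likelihood term."""
    cost = log(2) + log(2)                     # I_l and W_l, each uniform on {0, 1}
    if has_effect:                             # W_l = 1: one distribution per treatment
        for counts in counts_per_treatment:    # counts = [N_{l.1t}, ..., N_{l.Jt}]
            cost += log_binom(sum(counts) + J - 1, J - 1)   # prior term
            cost += log_multinomial(counts)                  # likelihood term
    else:                                      # W_l = 0: a single distribution over the leaf
        counts = [sum(c) for c in zip(*counts_per_treatment)]
        cost += log_binom(sum(counts) + J - 1, J - 1)
        cost += log_multinomial(counts)
    return cost

# example: a leaf with J = 2 classes and per-treatment counts [[30, 10], [15, 25]]
print(leaf_cost([[30, 10], [15, 25]], has_effect=True, J=2))
```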

3.3 \(C({T}) \): Proof of Equation 2

We express the prior and the likelihood of a tree model, resp. P(T) and P(Data|T), according to the hierarchy of the uplift tree parameters. Assuming independence between all the nodes, the prior probability of an uplift decision tree is defined as:

$$\begin{aligned}&P(T) = P\left( \mathbb {K}_T\right) \times \nonumber \\&\prod _{s \in \mathbb {S}_T} P\left( I_s\right) P\left( X_s \mid \mathbb {K}_T\right) P\left( N_{s i .} \mid \mathbb {K}_T, X_s, N_{s .}, I_s\right) \times \nonumber \\&P\left( \{W_l\}\right) \times \prod _{l \in \mathbb {L}_T} P\left( I_l\right) \left[ (1-W_l) \times P\left( \{N_{l . j}\} \mid \mathbb {K}_T, N_{l .}\right) +W_l\times \prod _{t}P\left( \{N_{l . jt}\} \mid \mathbb {K}_T, N_{l ..t}\right) \right] \end{aligned}$$
(3)

The first line is the prior probability of the variable selection, the second line gives the prior of the internal nodes, and the third line the prior of the leaf nodes.

Variable Selection Probability. A hierarchical prior is chosen: first the choice of the number of selected variables \(K_T\), then the choice of the subset \(\mathbb {K}_T\) among the K variables of \(\mathbb {K}\). Using a uniform prior, \(K_T\) can take any value between 0 and K with equal probability. For the choice of the subset \(\mathbb {K}_T\), we assume that every subset has the same probability. The prior of the variable selection is then:

$$P(\mathbb {K}_T)=\frac{1}{K+1}\frac{1}{\left( \begin{array}{c} K+K_T-1 \\ K_T \end{array}\right) }$$

Prior of Internal Nodes. Each node can either be an internal node or a leaf node with equal probability, which implies \(P(I_s)=\frac{1}{2}\).

The choice of the segmentation variable is equiprobable between 1 and \(K_T\). We obtain:

$$\begin{aligned} P(X_s|\mathbb {K}_T)=\frac{1}{K_T} \end{aligned}$$

All splits of an internal node s into two intervals are equiprobable. We then obtain:

$$\begin{aligned} P\left( N_{s i .} \mid \mathbb {K}_T, X_s, N_{s .}, I_s\right) =\frac{1}{N_{s.}+1} \end{aligned}$$

Prior of Leaf Nodes. As for internal nodes, each node can either be an internal node or a leaf node with equal probability, leading to \(P(I_l) = \frac{1}{2}\). For each leaf node, we assume that the treatment can have an effect or not with equal probability, so we get:

$$\begin{aligned} P(\{W_l\})=\prod _{l}\frac{1}{2} \end{aligned}$$

In the case of a leaf node l where there is no treatment effect (\(W_l = 0\)), UB-DT describes a single distribution of the class variable. Assuming that each of the possible class distributions is equiprobable, we again end up with a combinatorial term:

$$P\left( \{N_{l . j}\} \mid \mathbb {K}_T, N_{l .}\right) =\frac{1}{\left( \begin{array}{c} N_{l .}+J-1 \\ J-1 \end{array}\right) }$$

In a leaf node with a treatment effect (\(W_{l} = 1\)), UB-DT describes two distributions of the outcome variable, with and without the treatment. Given a leaf l and a treatment t, we know the number of instances \(N_{l..t}\). Assuming that each of the distributions of class values is equiprobable, we get:

$$P\left( \{N_{l . jt}\} \mid \mathbb {K}_T, N_{l ..t}\right) =\frac{1}{\left( \begin{array}{c} N_{l..t}+J-1 \\ J-1 \end{array}\right) }$$

Tree Likelihood. After defining the tree’s prior probability, we establish the likelihood of the data given the tree model. The class distributions depend only on the leaf nodes. For each multinomial distribution of the outcome variable (a single distribution or two distinct distributions per leaf, depending on whether the treatment has an effect or not), we assume that all possible observed data \(D_l\) consistent with the multinomial model are equiprobable. Using multinomial terms, we end up with:

$$\begin{aligned} \begin{aligned} P(Data \mid T)&=\prod _{l \in \mathbb {L}_T}P(D_{l}\mid T) \\&=\prod _{l \in \mathbb {L}_T}\left[ (1-W_l) \times \frac{1}{N_{l .} !/\left( N_{l.1.} ! N_{l.2.} ! \ldots N_{l.J.} !\right) }+W_l\times \prod _{t}\frac{1}{N_{l..t}!/\left( N_{l.1t}! \ldots N_{l.Jt}!\right) }\right] \end{aligned} \end{aligned}$$
(4)

By combining the prior and the likelihood (resp. Eq. 3 and 4) and taking their negative log, we obtain C(T), which proves Eq. 2.

3.4 Search Algorithm

The induction of an optimal uplift decision tree from a dataset is NP-hard [10]. Thus, learning the optimal decision tree requires an exhaustive search and is limited to very small datasets. As a result, heuristic methods are required to build uplift decision trees. Algorithm 1 (see below) selects the best tree according to the global criterion: it applies a split, among all possible splits over all terminal nodes, only if it decreases the global criterion of the tree, and it continues as long as the global criterion improves. Since a decision tree is a partitioning of the feature space, the prediction for a new instance is the average uplift in its corresponding leaf. The algorithm is deterministic and thus always leads to the same local optimum. Experiments show the quality of the resulting trees.

Algorithm 1 (pseudocode figure)
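Since the pseudocode of Algorithm 1 is not reproduced here, the following sketch gives our reading of the greedy search described above (the helpers `make_root_leaf`, `candidate_splits` and `with_split` are hypothetical placeholders, not part of the paper):

```python
def grow_uplift_tree(data, cost):
    """Greedily apply the split that most decreases the global criterion C(T),
    and stop when no split improves it (sketch, not the authors' code)."""
    tree = make_root_leaf(data)                 # start from a single leaf
    best_cost = cost(tree)
    while True:
        best_split = None
        for leaf in tree.leaves():              # every terminal node
            for split in candidate_splits(leaf):        # every variable / cut point
                c = cost(tree.with_split(leaf, split))
                if c < best_cost:
                    best_cost, best_split = c, (leaf, split)
        if best_split is None:                  # no improving split left
            return tree
        tree = tree.with_split(*best_split)
```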

3.5 UB-RF

UB-DT is easily extended to random forests. For that purpose, a split is chosen at random among all the splits that improve the global criterion. The number of trees is set by the analyst, and the prediction of a forest is the average of the predictions of its trees.
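A possible sketch of this UB-RF variant, reusing the hypothetical helpers above: each tree picks a random improving split instead of the best one, and the forest averages the tree predictions.

```python
import random

def grow_random_uplift_tree(data, cost, rng=None):
    """One UB-RF tree (sketch): pick a split at random among those that
    improve the global criterion C(T)."""
    rng = rng or random.Random()
    tree = make_root_leaf(data)                 # hypothetical helper, as above
    current_cost = cost(tree)
    while True:
        improving = [(leaf, split)
                     for leaf in tree.leaves()
                     for split in candidate_splits(leaf)
                     if cost(tree.with_split(leaf, split)) < current_cost]
        if not improving:
            return tree
        leaf, split = rng.choice(improving)
        tree = tree.with_split(leaf, split)
        current_cost = cost(tree)

def forest_predict(trees, x):
    """Forest prediction: average of the per-tree uplift predictions."""
    return sum(t.predict_uplift(x) for t in trees) / len(trees)
```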

4 Experiments

We experimentally evaluate the quality of UB-DT as an uplift estimator and compare UB-DT and UB-RF with state-of-the-art uplift modeling approaches.

We use the following state-of-the-art methods: (1) metalearners: the two-model approach (2M), X-learner and R-learner, each with XGBoost; (2) uplift trees: CTS-DT, KL-DT, Chi-DT, ED-DT; (3) uplift random forests: CTS-RF, KL-RF, Chi-RF, ED-RF [15]; (4) causal forests (all forest methods were used with 10 trees).

4.1 Is UB-DT a Good Uplift Estimator?

To measure the quality of the estimated uplift, we need to know the real uplift, and therefore we use synthetic data. Figure 2 depicts two synthetic uplift patterns where \(P(Y = 1|X, T = 1)\) and \(P(Y = 1|X, T = 0)\) are known for each instance. The grid pattern can be considered a tree-friendly pattern, whereas the continuous pattern is much more difficult. We generated several datasets according to these patterns with different numbers of instances (also called data size) ranging from 100 to 100,000. Uplift models were built using 10-fold stratified cross-validation, and the RMSE (Root Mean Squared Error) was used to evaluate the performance of the models.

Fig. 2. Uplift for two synthetic patterns. Figure 2a (grid pattern): uplift values for each cell. Figure 2b (continuous pattern): \(P(Y|T=0,x_1,x_2)=1-(x_1+x_2)/20\) while \(P(Y|T=1,x_1,x_2)=(x_1+x_2)/20\).
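As an illustration, the continuous pattern of Fig. 2b can be simulated and scored as follows (a sketch; we assume \(x_1, x_2 \in [0, 10]\) so that the probabilities stay in [0, 1], in which case the true uplift is \((x_1+x_2)/10 - 1\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 10, n)    # assumed feature range
z = rng.integers(0, 2, n)                                 # random treatment assignment
p = np.where(z == 1, (x1 + x2) / 20, 1 - (x1 + x2) / 20)  # P(Y=1 | T, x1, x2)
y = rng.binomial(1, p)
tau_true = (x1 + x2) / 20 - (1 - (x1 + x2) / 20)          # known uplift, in [-1, 1]

def uplift_rmse(tau_true, tau_pred):
    """RMSE between the known synthetic uplift and a model's estimate."""
    return float(np.sqrt(np.mean((np.asarray(tau_true) - np.asarray(tau_pred)) ** 2)))
```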

Results. Figure 3 gives the RMSE for the two synthetic patterns according to the data size for different uplift methods. We see that UB-DT is a good estimator of uplift: its RMSE decreases and converges to zero as the data size increases, both for the grid and the continuous patterns. This is the expected behavior of a good uplift estimator and shows that UB-DT, thanks to its global criterion, avoids overfitting. The two-model approach with decision trees also shows competitive performance. UB-DT clearly outperforms the other tree-based methods, which have similar performance. On the continuous pattern, the KL-DT, Chi-DT, ED-DT and CTS-DT approaches perform poorly (their RMSE is around 0.5). To avoid a cluttered visualisation, their performance is not included in Fig. 3b.

Fig. 3. RMSE of tree-based approaches according to the data size.

4.2 UB-DT and UB-RF Versus State of the Art Methods

Datasets. We conducted experiments on 8 real and synthetic datasets widely used in the uplift modeling community: (1) Hillstrom (a classical dataset for uplift modeling containing data of customers who either received emails featuring men’s/women’s products or received no email); (2) Criteo [5] (a marketing dataset for uplift modeling); (3) Bank [9] (a marketing campaign conducted by a bank); (4) Information (a marketing dataset in the insurance domain, part of the Information R package); (5) Megafon (a synthetic dataset generated by a telecom company); (6) Starbucks (an advertising promotion tested to improve customer purchases); (7) Gerber [6] (a policy-relevant dataset used to study the effect of social pressure on voter turnout); (8) Right Heart Catheterization (RHC) [3] (a real dataset from the medical domain; the treatment indicates whether a patient received a RHC and the outcome is whether the patient died at any time up to 180 days after admission to the study).

Each dataset was used with different settings of the treatment and outcome variables. For all datasets, the treatment and outcome variables are binary. Table 1 provides the most relevant specifications of the datasets.

Table 1. Summary of dataset specifications

Results. We evaluate the uplift models using the qini metric [4]. Qini is a variant of the Gini coefficient. Its values lie in \([-1,1]\); the higher the value, the larger the impact of the predicted optimal treatment. Figure 4a (resp. Fig. 4b) shows the overall average ranking of the tree-based methods (resp. of the metalearners and forest-based methods) according to their qini performance on each dataset. Compared to the other tree-based methods and to the two-model approach with decision trees, Fig. 4a shows that UB-DT achieves the best performance. Table 2 reports the qini results and shows that UB-DT is also a good estimator of the uplift on real data. Figure 4b shows that UB-RF and 2M obtain the best ranks. Table 3 indicates that the random forest strategy improves the performance of the uplift models (qini values are higher with UB-RF than with UB-DT). UB-RF has the best performance on 4 out of the 14 experiments.

Table 2. Average qini values and standard deviation (multiplied by 100). The best qini value for each dataset is marked in bold.
Table 3. Average qini values and standard deviation (multiplied by 100) across datasets and uplift approaches. The best value for each dataset is marked in bold.
Fig. 4. Overall average ranking of the uplift approaches.
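For reference, one common way to compute a qini-style score is sketched below (our own formulation in the spirit of [4]; the paper’s exact normalization to \([-1, 1]\), which divides by the area of an optimal model, is omitted here):

```python
import numpy as np

def qini_curve(y, z, uplift_pred):
    """Cumulative incremental gain when targeting individuals in decreasing
    order of predicted uplift."""
    order = np.argsort(-np.asarray(uplift_pred))
    y, z = np.asarray(y)[order], np.asarray(z)[order]
    n_t, n_c = np.cumsum(z), np.cumsum(1 - z)            # treated / control seen so far
    r_t, r_c = np.cumsum(y * z), np.cumsum(y * (1 - z))  # responders in each group
    scale = np.divide(n_t, n_c, out=np.zeros(len(y)), where=n_c > 0)
    return r_t - r_c * scale                             # incremental responders

def qini_area(y, z, uplift_pred):
    """Area between the qini curve and the random-targeting diagonal
    (unnormalized)."""
    gains = qini_curve(y, z, uplift_pred)
    diagonal = np.linspace(0, gains[-1], len(gains))
    return float(np.sum(gains - diagonal))
```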

5 Conclusion and Perspectives

In this paper, we presented UB-DT, a new parameter-free method for uplift decision trees. We designed a Bayesian approach to select the most probable uplift tree model T, i.e. the one that maximizes the posterior probability P(T|Data). Contrary to state-of-the-art uplift decision tree approaches, UB-DT relies on a global criterion to build a tree, so that the splits in one node depend on the splits in the other nodes. This approach avoids overfitting and the need for a pruning step. A search algorithm finds the tree that optimizes this criterion. We showed that our approach is easily extended to random forests and defined UB-RF. Evaluations on real and synthetic datasets show that UB-DT is a good uplift estimator and that our tree and forest methods perform competitively with state-of-the-art uplift modeling approaches, including non-tree methods.

This work opens several perspectives. The study of general trees (with more than two child nodes) is promising. In addition, the case of multiple treatments remains open in uplift modeling. Moreover, the search algorithm leads to a local optimum and may produce under-fitted uplift trees. To overcome this horizon effect, it would be interesting to use a post-pruning algorithm [16].