
1 Introduction

Direct marketing is one of the biggest contributors to overall marketing cost. Most companies use data mining tools to develop predictive models that identify customers who are likely to respond to direct marketing campaigns. Improper use of these models, however, may lead to unexpectedly low response rates, mainly because many customers are annoyed by repetitive phone calls, emails, etc. Companies therefore want to target their customers precisely, to retain brand value and improve customer satisfaction.

Traditionally, response models predict the likelihood of response for a particular customer given a marketing action, but they do not account for the likelihood of response in the absence of that action. Incremental response modeling, shown in Fig. 1 (also known as uplift modeling, differential response modeling, or true uplift modeling), removes this drawback and aims to choose customers effectively. Incremental response models identify the customers for whom there is a true uplift [5]: we want to target customers who respond positively because of the direct marketing campaign and remove from the targeting list those who would respond regardless of whether they are marketed to.

Fig. 1. True response uplift.

In true uplift modeling, historical treatment and control campaign datasets are used to build a predictive model. The treatment dataset consists of customers to whom the marketing communication was sent, while the control set consists of customers who were not contacted but were kept aside for measurement purposes. The control set still produces some response, called the natural response. The response from the treatment set consists of the natural response plus the uplift in response due to the campaign, as shown in Fig. 2.

Fig. 2. Potential incremental response [10].

This uplift due to the campaign is what matters to the marketing division. For a particular customer, true uplift can be predicted by modeling the difference between the probability of response given treatment and the probability of response given no treatment (control). The customers with positive uplift can then be targeted. Standard classification models cannot account for the control set and thus cannot be used directly. In this work, we model uplift using decision trees [1] based on various modified splitting criteria. We propose a model that constructs an uplift decision tree and then applies logistic regression based uplift models to the observations at each terminal node to obtain the prediction. Experimental results show that the proposed method outperforms various other uplift modeling techniques.

2 Literature Review

Despite the importance of the topic, surprisingly little work has been done on uplift modeling. Broadly, there are two ways of building uplift models. The first is the two-model approach: separate classification models are built on the treatment set and the control set to predict the probability of response in each. The treatment and control datasets are each divided into training and test data, and a model is constructed on each training set using a standard machine learning classifier. The difference between the two predicted probabilities, from the treatment model and the control model, gives the predicted uplift. The models are then used to predict the probability of response given treatment and the probability of response given no treatment (control) on the test data (consisting of both treatment and control observations) for evaluation. If uplift follows a different pattern from the response probabilities in treatment and control, this approach may not perform well.
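As a minimal sketch of the two-model approach (assuming scikit-learn and pandas; the function, data layout, and column names below are illustrative, not taken from any paper's implementation):

```python
from sklearn.linear_model import LogisticRegression

def two_model_uplift(treatment, control, features, new_data):
    """Predict uplift as the difference of two separately fitted models.

    `treatment` and `control` are DataFrames with the feature columns
    listed in `features` and a binary response column "Y" (illustrative
    schema).
    """
    model_t = LogisticRegression(max_iter=1000).fit(treatment[features], treatment["Y"])
    model_c = LogisticRegression(max_iter=1000).fit(control[features], control["Y"])
    # predicted uplift = P(Y=1 | treated, X) - P(Y=1 | not treated, X)
    p_t = model_t.predict_proba(new_data[features])[:, 1]
    p_c = model_c.predict_proba(new_data[features])[:, 1]
    return p_t - p_c
```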

The second technique predicts the uplift directly from the treatment and control datasets using a single model. The complete dataset is divided into training and test data, each containing both treatment and control observations. A single model is developed on the training data and evaluated on the test data. Papers that address decision tree based models for uplift prediction include [2, 3, 7-9, 11]. [7] was the first paper to explicitly discuss uplift modeling; it provided various illustrations of uplift on real world data and proposed a modified decision tree algorithm that was further detailed in [8]. [3] used a splitting criterion based on the incremental response rate (∆∆P) to construct decision trees; this algorithm did not account for population size and thus placed unwanted weight on small populations with larger uplift. In [9], trees are constructed with an information theory based splitting criterion more in line with modern machine learning algorithms: the divergence (Euclidean or Kullback-Leibler) between the treatment and control response distributions is used as the splitting criterion. The authors also presented generalized formulas for the gain and normalization factor and used a variance based pruning technique to construct the tree. [11] further accounted for the multiple-treatment case and confirmed through experiments that decision tree based uplift models show significant improvement over previous uplift modeling techniques.

Regression based techniques for uplift modeling have also been addressed in the literature [4-6]. [5] developed a model by first adding interaction terms between the independent variables and the treatment-control flag variable; he then built a model on this dataset and used it with great success for years in marketing at Fidelity Investments. In [12], a two-model approach is followed using logistic regression. [4] proposed a model based on the net weight of evidence, along with a modified version of the information value for variable selection in uplift modeling. Most papers in the literature used artificially created data for their experiments. In this paper, we propose a two-step approach to incremental response modeling and use real world direct marketing campaign datasets for our experiments.

The remainder of the paper is organized as follows. Section 3 describes the notation and defines the divergence measures used. Section 4 presents our approach, along with the newly proposed method. Section 5 reports the results of the experiments. Section 6 concludes the paper, and Section 7 outlines the scope for future research.

3 Notations and Definitions

Let us introduce and formalize the notation used in the following sections. \( Y \in \{1, 0\} \) denotes the binary response variable for a marketing campaign (1 for response, 0 for no response), the flag \( F \in \{1, 0\} \) denotes the binary variable that is 1 for a treatment observation and 0 for a control observation, and \( X_1, \ldots, X_m \in \mathbb{R} \) denote the independent variables. \( P^T(\cdot) \) and \( P^C(\cdot) \) denote probabilities with respect to the treatment and control datasets respectively. We want to predict the uplift as the difference in predicted probabilities between the treatment and control groups, i.e. \( P(Y = 1 \mid F = 1, X) - P(Y = 1 \mid F = 0, X) \), which we denote \( P^T(Y = 1 \mid X) - P^C(Y = 1 \mid X) \). We use the Laplace correction when estimating these probabilities. For any test A of a numeric variable X at split value v at a node, let \( a_L \) and \( a_R \) denote the two outcomes (X ≤ v) and (X > v) of the test, and let N(Left) and N(Right) denote the corresponding numbers of observations. As we consider only binary splits, any test on a categorical variable likewise has two outcomes \( a_L \) and \( a_R \). Finally, N = N(Left) + N(Right) denotes the number of observations at the node being split.
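For instance, with the Laplace correction the response probability at a node with n responders among N observations is estimated as (n + 1)/(N + 2) for a binary outcome; a one-line helper (our naming, purely illustrative) might look like:

```python
def laplace_prob(responders, total, classes=2):
    # Laplace-corrected estimate of P(Y = 1): avoids zero or one
    # probabilities at sparsely populated nodes.
    return (responders + 1) / (total + classes)
```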

4 Proposed Method

In our work, we tried several new splitting criteria for uplift decision trees, along with those presented in [11], on real world direct marketing campaign data. A drawback of the pure decision tree approach is that the predicted posterior uplift values are not a faithful representation of the actual uplift, because every observation in a terminal node receives the same predicted posterior uplift value.

We propose a two-step model for incremental response modeling, as shown in Fig. 3. In the first step, we build an uplift decision tree using traditional and modified divergence measures as splitting criteria; in the second step, we build uplift models on the observations at each terminal node using traditional uplift regression methods. This is equivalent to segmenting the complete data (treatment and control) with an uplift decision tree and building an uplift model in each segment, analogous to a cluster-weighted modeling approach. In this paper, we use the two-model approach in the second step: a logistic regression model is fitted separately to the treatment and control data at each terminal node, and the predicted uplift probability is the difference of the two individual predicted probabilities.

Fig. 3. Proposed hybrid (two-step) model.

In the first step, we grow the uplift tree only to a moderate depth, so that enough observations remain at the terminal nodes to fit the two-model logistic regression. For the logistic regression, variable selection was first done using the net information value [10] of the variables, and some of those variables were then removed after checking for multi-collinearity. Because a logistic regression model is fitted separately to the treatment and control samples at each terminal node, we constrained the minimum number of treatment and control observations that must remain at a node after each split of the uplift decision tree. The difference between the predicted probabilities of the treatment and control models at each terminal node is the predicted posterior uplift probability. Applying logistic regression at the terminal nodes is expected to give better results, as it assigns an individual posterior probability to each observation reaching the node rather than the same value to the entire node. As expected, the proposed model outperformed all purely decision tree based models.
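A minimal sketch of the second step, assuming the uplift tree from step one can assign every observation to a terminal node (all names and the data layout are illustrative assumptions, not the paper's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_leaf_models(data, leaf_ids, features):
    """Fit a treatment/control logistic pair on each terminal node.

    `data` is a DataFrame with feature columns `features`, a binary
    response "Y" and a flag "F" (1 = treatment, 0 = control);
    `leaf_ids` assigns each row to a terminal node of the uplift tree.
    """
    models = {}
    for leaf in np.unique(leaf_ids):
        seg = data[leaf_ids == leaf]
        t, c = seg[seg["F"] == 1], seg[seg["F"] == 0]
        models[leaf] = (
            LogisticRegression(max_iter=1000).fit(t[features], t["Y"]),
            LogisticRegression(max_iter=1000).fit(c[features], c["Y"]),
        )
    return models

def predict_uplift(models, data, leaf_ids, features):
    """Predicted posterior uplift = leaf-wise difference of the two models."""
    uplift = np.empty(len(data))
    for leaf, (m_t, m_c) in models.items():
        mask = leaf_ids == leaf
        uplift[mask] = (m_t.predict_proba(data[mask][features])[:, 1]
                        - m_c.predict_proba(data[mask][features])[:, 1])
    return uplift
```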

We implemented the decision tree described in [9], using various other splitting criteria along with the Euclidean and Kullback-Leibler divergence measures.

For any split test A at a node, the gain under a divergence measure D is defined as:

$$ D_{gain}(A) = D\left( P^T(Y) : P^C(Y) \mid A \right) - D\left( P^T(Y) : P^C(Y) \right) $$
(1)

where

$$ D\left( P^T(Y) : P^C(Y) \mid A \right) = \frac{N(\text{Left})}{N} D\left[ P^T(Y \mid a_L) : P^C(Y \mid a_L) \right] + \frac{N(\text{Right})}{N} D\left[ P^T(Y \mid a_R) : P^C(Y \mid a_R) \right] $$
(2)

Let \( P^T(Y) = [a_1 = P^T(Y = 1),\; a_0 = P^T(Y = 0)] \) and \( P^C(Y) = [b_1 = P^C(Y = 1),\; b_0 = P^C(Y = 0)] \). The divergence measures for the distributions \( P^T(Y) \) and \( P^C(Y) \) are defined in (3)-(8).

$$ \text{Euclidean divergence measure: } ED = \sum\nolimits_i \left( a_i - b_i \right)^2 $$
(3)
$$ \text{Hellinger divergence measure: } HD = \sqrt{ \sum\nolimits_i \left( \sqrt{a_i} - \sqrt{b_i} \right)^2 } $$
(4)
$$ \text{Kullback-Leibler divergence measure: } KL = \sum\nolimits_i a_i \log\left( \frac{a_i}{b_i} \right) $$
(5)
$$ \text{J-divergence measure: } JD = \sum\nolimits_i \left( a_i - b_i \right) \log\left( \frac{a_i}{b_i} \right) $$
(6)
$$ \text{Modified Euclidean divergence measure: } MED = \left( a_1 - b_1 \right)^2 $$
(7)
$$ \text{Modified J-divergence measure: } MJD = \left( a_1 - b_1 \right) \log\left( \frac{a_1}{b_1} \right) $$
(8)
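The measures in (3)-(8) translate directly into code. A sketch, where `a` and `b` are the Laplace-corrected vectors [P(Y = 1), P(Y = 0)] for the treatment and control node populations (our naming):

```python
import numpy as np

def ED(a, b):   # Euclidean divergence, Eq. (3)
    return np.sum((a - b) ** 2)

def HD(a, b):   # Hellinger divergence, Eq. (4)
    return np.sqrt(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))

def KL(a, b):   # Kullback-Leibler divergence, Eq. (5)
    return np.sum(a * np.log(a / b))

def JD(a, b):   # J-divergence, Eq. (6)
    return np.sum((a - b) * np.log(a / b))

def MED(a, b):  # modified Euclidean, Eq. (7): uses only P(Y = 1)
    return (a[0] - b[0]) ** 2

def MJD(a, b):  # modified J-divergence, Eq. (8): uses only P(Y = 1)
    return (a[0] - b[0]) * np.log(a[0] / b[0])
```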

For the modified Euclidean divergence and modified J-divergence, only P(Y = 1) is used in the divergence calculation. The normalization factors used to penalize uneven splits are given in (9) and (10).

$$ I(A) = H\left( \frac{N^T}{N}, \frac{N^C}{N} \right) \cdot KL\left( P^T(A) : P^C(A) \right) + \frac{N^T}{N} H\left( P^T(A) \right) + \frac{N^C}{N} H\left( P^C(A) \right) + 0.5 $$
(9)
$$ J(A) = G\left( \frac{N^T}{N}, \frac{N^C}{N} \right) \cdot D\left( P^T(A) : P^C(A) \right) + \frac{N^T}{N} G\left( P^T(A) \right) + \frac{N^C}{N} G\left( P^C(A) \right) + 0.5 $$
(10)

where H denotes the entropy function, \( H(A) = -\sum\nolimits_i a_i \log(a_i) \), and G the Gini index, \( G(A) = \sum\nolimits_i a_i (1 - a_i) \), for a probability distribution \( A = (a_1, \ldots, a_n) \); \( N^T \) and \( N^C \) are the treatment and control counts at the node, and \( P^T(A) \) and \( P^C(A) \) are the distributions of the outcomes of test A within treatment and control. Our final splitting criteria are given in (11). The label of a leaf node (its predicted posterior uplift probability) is given in (12).

$$ \frac{ED_{gain}}{J(A)}, \quad \frac{HD_{gain}}{J(A)}, \quad \frac{KL_{gain}}{I(A)}, \quad \frac{JD_{gain}}{I(A)}, \quad \frac{MED_{gain}}{J(A)} $$
(11)
$$ \text{Label value of leaf node} = P^T(Y = 1 \mid L) - P^C(Y = 1 \mid L) $$
(12)

where \( P^T(Y = 1 \mid L) \) and \( P^C(Y = 1 \mid L) \) denote the treatment and control response rates at leaf node L. Some variants based on changes to the normalization factors and to the gain computation were also tried, but they did not perform well. We found that the Euclidean divergence and modified Euclidean divergence measures performed best on the considered datasets.
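Combining (1), (2) and (10), the normalized Euclidean criterion for a single candidate split might be computed as follows (reusing the `ED` sketch above; the node dictionaries are an illustrative data layout we assume, not the paper's implementation):

```python
import numpy as np

def gini(p):
    # G(A) = sum_i a_i (1 - a_i) for a probability vector
    return np.sum(p * (1 - p))

def ed_split_criterion(node, left, right):
    """Normalized gain ED_gain / J(A) for one candidate binary split.

    Each argument is a dict holding Laplace-corrected class
    distributions "pt", "pc" (np.array [P(Y=1), P(Y=0)] for
    treatment / control) and raw treatment / control counts "nt", "nc".
    """
    n = node["nt"] + node["nc"]
    n_left = left["nt"] + left["nc"]
    n_right = right["nt"] + right["nc"]
    # Eq. (2): size-weighted divergence in the children, then Eq. (1)
    conditional = ((n_left / n) * ED(left["pt"], left["pc"])
                   + (n_right / n) * ED(right["pt"], right["pc"]))
    gain = conditional - ED(node["pt"], node["pc"])
    # P^T(A), P^C(A): how the split divides treatment and control
    pt_a = np.array([left["nt"], right["nt"]]) / node["nt"]
    pc_a = np.array([left["nc"], right["nc"]]) / node["nc"]
    # Eq. (10): Gini-based normalization factor J(A)
    j = (gini(np.array([node["nt"] / n, node["nc"] / n])) * ED(pt_a, pc_a)
         + (node["nt"] / n) * gini(pt_a)
         + (node["nc"] / n) * gini(pc_a) + 0.5)
    return gain / j
```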

For a test of a numeric variable X at split value v, the following quantities are defined from the treatment and control class-conditional distributions:

$$ a = P^T(X \le v \mid Y = 1) \cdot P^C(X \le v \mid Y = 0) $$
(13)
$$ b = P^T(X \le v \mid Y = 0) \cdot P^C(X \le v \mid Y = 1) $$
(14)
$$ c = P^T(X > v \mid Y = 1) \cdot P^C(X > v \mid Y = 0) $$
(15)
$$ d = P^T(X > v \mid Y = 0) \cdot P^C(X > v \mid Y = 1) $$
(16)

Using (13)-(16), the modified net weight of evidence (NWOE) and the modified net information value (MNIV) splitting criterion are defined as follows:

$$ \text{NWOE for the left node: } NWOE_{Left} = \log\left( \frac{a}{b} \right) $$
(17)
$$ \text{NWOE for the right node: } NWOE_{Right} = \log\left( \frac{c}{d} \right) $$
(18)
$$ MNIV = \left( a - b \right) \cdot NWOE_{Left} + \left( c - d \right) \cdot NWOE_{Right} $$
(19)
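A short sketch of the MNIV criterion, assuming the quantities a, b, c, d of (13)-(16) have already been estimated (with Laplace correction, so the logarithms stay finite):

```python
import numpy as np

def mniv(a, b, c, d):
    """Modified net information value of a split, Eqs. (17)-(19)."""
    nwoe_left = np.log(a / b)    # Eq. (17)
    nwoe_right = np.log(c / d)   # Eq. (18)
    return (a - b) * nwoe_left + (c - d) * nwoe_right  # Eq. (19)
```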

We also implemented a decision tree based on the modified net information value [10], but it could not outperform the Euclidean divergence based measures. Along with the two separate logistic regression models, we also tried a single linear logistic regression model based on the z-transformation described in [6], but it performed very poorly on the datasets considered.

5 Experimental Results

We used two real world marketing campaign datasets to evaluate the performance of the uplift decision trees proposed in [11] and of the proposed method. Dataset 1 has an overall uplift response rate of 4.2 % (population treatment response rate minus control response rate) and Dataset 2 of 4.1 %. Since we have treatment and control groups, evaluating model performance differs from the standard classification procedure. To evaluate the uplift models, we followed the same approach as [5], which is also adopted by many industry practitioners.

We first compute the model's predicted posterior uplift probability for each observation in the validation sample and evaluate the model with a decile report based on these predictions. Deciles are created by rank ordering all observations of the validation sample by predicted posterior uplift probability. In each decile, the actual uplift is calculated as the difference between the observed response rates of the treatment and control groups. We checked the correlation of the actual uplift (treatment response minus control response) with the predicted posterior uplift, as well as the rank ordering of the actual uplift. Since the application is direct marketing, we compare methods using the lift in uplift response in the top decile (top decile uplift response rate divided by the overall uplift response rate).
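The evaluation can be sketched as a decile report in pandas (the column names `pred_uplift`, `F`, and `Y` are our assumptions; the returned top-decile figure is the lift in uplift used for comparison):

```python
import pandas as pd

def decile_uplift_report(df):
    """Actual uplift by decile of predicted posterior uplift."""
    # rank-order validation observations, highest predicted uplift first
    df = df.sort_values("pred_uplift", ascending=False).reset_index(drop=True)
    df["decile"] = pd.qcut(df.index, 10, labels=range(1, 11))

    def actual_uplift(g):
        # observed treatment response rate minus control response rate
        return g.loc[g["F"] == 1, "Y"].mean() - g.loc[g["F"] == 0, "Y"].mean()

    report = df.groupby("decile", observed=True).apply(actual_uplift)
    # lift in uplift response: top-decile uplift / overall uplift
    top_decile_lift = report.loc[1] / actual_uplift(df)
    return report, top_decile_lift
```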

Table 1 compares the different uplift models. Uplift decision trees based on the modified Euclidean divergence perform better than those based on the Euclidean divergence, but the proposed two-step method performs better than the one-step uplift decision trees.

Table 1. Comparison of different uplift models.

The validation decile reports for the proposed two-step method (decision tree using Euclidean divergence + logistic regression) are shown in Figs. 4 and 5. The proposed method clearly outperforms the other uplift decision trees.

Fig. 4. Response rate by decile using the two-step method (decision tree using Euclidean divergence + logistic regression) on validation dataset 1.

Fig. 5. Response rate by decile using the two-step method (decision tree using Euclidean divergence + logistic regression) on validation dataset 2.

6 Conclusion

Incremental response modeling, or true lift modeling, has received little attention in the literature but has been gaining importance among marketing practitioners in recent years. We implemented decision tree based classifiers for incremental response modeling on real world marketing campaign data. Of the various divergence measures used as splitting criteria, those based on the Euclidean measure gave the best results. We proposed a model that constructs a decision tree and then applies the two-model logistic regression approach at each terminal node. In experiments on real world marketing campaign data, the two-step model outperformed all decision tree based classifiers, and tree based classifiers have in turn been shown in the literature to outperform various other uplift models. Hence, the proposed model holds significant value in this area.

7 Future Work

The datasets at the terminal nodes can be viewed as supervised clusters created from the data. Since we applied logistic regression (the two-model approach) to these clusters, future work could explore other machine learning classification models for uplift modeling at these nodes.