1 Introduction

Over the past few decades, telecommunication, and especially mobile telecommunication, has become a dominant communication medium. Many countries report over 100 % saturation of the telecom market, indicating the importance customers assign to this form of communication. Public regulators and standardization bodies have made the telecom market highly competitive, by enabling customers to easily move from one telecommunication provider to another. Such transitions are known as churn. Throughout this paper, we refer to the transition from one carrier to another as churn.

Churn is one of the most costly items affecting a telecom carrier’s bottom line as it decreases revenue. Additionally, a carrier’s investment in winning a new customer is far greater than the cost of preserving an existing one (Hadden et al. 2007). In many cases, churn can be lessened by offering customers incentives to remain customers of the existing carrier. Therefore, salient reasons exist for predicting which customers are likely to churn in the near future as a means to try and prevent their churn.

In addition to identify likely churners, a good churn prediction system should provide a sufficiently long horizon forecast in its predictions. First, a long forecast horizon is required to enable the customer care department to approach a customer and make a retention offer. Second, a long forecast horizon is advantageous in that the further away a customer is from actually making the churn decision, the easier it is to prevent that decision at a significantly lower cost.

Most churn prediction systems (see for example, Coussement and Van den Poel 2008; Datta et al. 2000; Gopal and Meher 2008; Pendharkar 2009; Song et al. 2006; Neslin et al. 2006; Idris et al. 2012; Huang et al. 2012; Lu et al. 2012; Jadhav and Pawar 2011) consider each customer individually. The goal of these systems is to predict which customers are likely to churn in the immediate future, usually defined as between 1 and 3 months. Such systems rely on hundreds of complex key performance indicators (KPIs) that are generated for each customer from their call attributes, financial attributes, and service information-all evaluated over long periods (many months and years). These KPIs then serve as input to a statistical regression model, usually a logistic regression variant, that outputs a churn score. This approach has one major drawback in that it relies on the assumption that a churning customer changes some attributes (calling patterns or otherwise) prior to switching carriers. While this may be true in some cases, certainly many scenarios occur in which these assumptions are violated. For example, customers who have come to believe that they have found a better deal with a competitor may churn immediately.

The dominant approach to churn prediction focuses exclusively on the individual customer without taking social influence into account. Clearly, churn entails many social aspects, as witnessed in other consumer areas (Doyle and Barn 2007; Yang et al. 2007). A dominant example is when a churning customer influences other customers to churn as well. Thus, developing churn prediction systems that take social aspects into account poses an emerging theoretical challenge with potentially great practical implications.

Several works (e.g., Kawale et al. 2009; Birke 2006; Dasgupta et al. 2008; Peres et al. 2010; Dierkes et al. 2011) have started exploring the possibility of using social ties to predict churn. The dominant approach in this direction is known as diffusion. The underlying assumptions of diffusion are that recent churners are known and that they are likely to affect the churning decisions of their social neighborhoods. To predict churn, the network of subscribers is modeled as a weighted directed graph on which nodes represent the customers, and weights on the edges correspond to the strength of the social connections between them. Next, a diffusion process is used to model the flow of information from recent churners to their social environment. Specifically, each node in the network that corresponds to a recent churner is assigned an initial numerical value, termed energy. A decaying diffusion process propagates this energy across the global network until convergence. At that point, each subscriber in the network has some associated energy corresponding to the amount of churning information or influence that has been assigned to him. The individual churn scores are then derived directly from these energies (Dasgupta et al. 2008).

In addition to churn prediction, diffusion processes play central roles in the study of various fields in marketing and social network analysis. These include the study of the spread of innovation, viral marketing, and more (Ma et al. 2008; Wang et al. 2010; Karnstedt et al. 2010; Iyengar et al. 2011; Gruhl et al. 2004). A recent survey of this literature can be found at Peres et al. (2010) and a pioneering study of graph theoretic problems that stem from some of these applications can be found in Kempe et al. (2003). We believe that understanding the properties of diffusion-based models can enhance the research on these topics as well.

As noted above, the basic assumption underlying diffusion is that friends who churn affect ones’ churn propensity. Recently, Nitzan and Libai (2010) showed that exposure to a churning customer increases the likelihood of churn by 80 % (after controlling for homophily, i.e., user similarity). Interestingly, this study also demonstrated that two-thirds of customers who churn do not have an immediate churning acquaintance, defined as a person called by the churner. This means that diffusion can explain the churn of at least 33 % of the churning population, which is a significant social effect.

Two technical factors influence the performance of diffusion: the way in which the strength of a connection between customers is estimated, and the specific algorithm used to diffuse energy across the graph.

Most papers to date, which we are aware of, modeled the connection strength between customers using the number of minutes of calls between subscribers, normalized by the total number of calls made by a customer (Dasgupta et al. 2008; Kawale et al. 2009; Peres et al. 2010; Birke 2006). This has an obvious advantage in that it is easy to measure and is easy to justify as a measure of the strength of relationships. Adamic and Adar (2001) suggested that a more informative measure is to count the number of shared friends every pair of customers have between them. Richter et al. (2010) modified this measure by normalizing it to the total number of subscribers called, through the use of point-wise mutual information. However, to the best of our knowledge, no study has shown a rigorous comparison of the effect of these measures on the accuracy of the diffusion process.

Our goal in this paper is to investigate three important aspects of diffusion-based algorithms for churn prediction. First, we study the effective predictive horizon of diffusion algorithms, which is of major importance to the applicability of these algorithms to realistic business problems. We demonstrate that diffusion algorithms can provide impressive churn prediction capabilities, though only for a limited horizon. For longer prediction horizons, we show that a significant deterioration in performance occurs. Second, we compare different methods of estimating the connection strength between users, and show that some methods have significantly better performance than others. Finally, we show that combining results from diffusion processes based on different types of connection strength measures improves the accuracy of prediction. Specifically, in one of the carriers, this method yielded an average of 50 % improvement compared to the standard baseline measure. These findings can guide telecommunication carriers who wish to implement diffusion algorithms, and have important insights into human behavior.

2 Diffusion processes for churn prediction

2.1 Description of the diffusion process

The diffusion process we studied is based on Dasgupta et al. (2008). We describe it in detail in this section (see also Algorithm 1). The diffusion process starts with the construction of a call graph and a seed. The call graph is a directed graph in which each node corresponds to a subscriber in the network and the weight on each directed edge reflects the strength of connection between the caller (head of edge) and the callee (tail of edge). The weight associated with each edge is based on call data, such as the total number of calls or the total duration of calls over a period of time. The seed is a list of subscribers who are known to have churned during a predefined period of time, typically a subset of the time period that was used to construct the graph, e.g., the last 2 weeks, etc. Each such churner is assigned with an initial positive energy and all other subscribers are assigned with zero energy. Finally, a diffusion-like process is initiated in the graph, where at each iteration, nodes transfer a fraction of their energy to their outgoing neighbors in the graph. The exact value depends linearly on the weight associated with the edge and on a spreading coefficient \(d \in (0,1)\) that determines the fraction of energy that can be given away. After the stopping condition is met, each subscriber is associated with a certain amount of energy, where higher values are considered higher probability candidates for churning. In the remainder of this paper, we use the implementation of the diffusion algorithm as described in Dasgupta et al. (2008) and according to additional details obtained from personal communication with A. Nanavati. Note that many variants of the above process are possible, e.g., a non-uniform initialization of the churn energies.

figure a

2.2 Additional aspects of the diffusion process

The diffusion process described in Sect. 2.1 gives rise to several natural questions that we attempt to investigate in this work. These include the effect of the prediction horizon, the effect of the used performance measure, and the usage of enhancement techniques. We describe them in detail in the next sections.

2.2.1 Predictive horizon

The predictive horizon is the period of time in the future for which an algorithm can provide predictions. Churn prediction is typically used as a means for customer preservation. Since resources are limited, only a small fraction of the population may be contacted by customer care service, with the main goal being to reach the customers that are most likely to churn, as early as possible. Therefore, the time horizon plays a role in the efficiency of the obtained results. An optimal predictor would provide long-term prediction so as to be most effectively used to prevent churn by customers at risk, a task which naturally requires time and effort on the part of the telecommunication company. Previous work (Nitzan and Libai 2010) demonstrated a degradation in prediction accuracy over time when using a “word-of-mouth” scenario, for a specific similarity measure and a simple diffusion. Our goal is to ascertain whether the sharp degradation is a general characterizer of diffusion processes for churn prediction and whether it is affected by the choice of relationship measure, addressed in the next section.

2.2.2 Relationship measure

Most previous works used a social affinity measure that is based on direct correspondence between two subscribers. Several recent works demonstrated the usefulness of alternative measures for various applications (for example, Adamic and Adar 2001; Sarkar et al. 2010; Richter et al. 2010). Therefore, our second goal is to determine the effect of the relationship measure (i.e., the corresponding edge weight) on the accuracy of the algorithm with respect to churn prediction. In this work, we compare four types of measures:

  • Calls measure The weight is proportional to the count of calls between the caller and the callee. This is the most common measure used in the context of diffusion, and it provides similar results to measures that are based on the number of minutes (normalized). Let \({\bf {v}}_i\) denote the calls vector of the \(i\)th caller, where \({\bf {v}}_i[j]\) is the number of times caller \(i\) called \(j\). In the calls measure, \(w_{ij} = {\bf {v}}_i[j]\) (where \(w_{ij}\) denotes the weight of the edge from \(i\) to \(j\)).

  • Weighted number of shared neighbors measure: The weight is proportional to the number of shared outgoing neighbors (Adamic and Adar 2001) in the call graph, weighted by the number of calls made to those neighbors. The weighting was added to emphasize the effect of frequently called subscribers, with whom a caller is assumed to have stronger connections. Let \({\bf {v}}_i\) denote the calls vector of the \(i\)’th caller, where \({\bf {v}}_i[k]\) is the number of times caller \(i\) called \(k\). In the shared measure, \(w_{ij} = \sum _{k}({\bf {v}}_i[k]\cdot {\bf {v}}_j[k])\). Note that contribution to the sum is not zero only for callers \(k\) to whom both \(i\) and \(j\) called.

  • Social measure: This measure, defined in Richter et al. (2010), measures the point-wise mutual information between caller and callee, according to the number of shared versus unshared outgoing neighbors, if and only if the caller made at least one call to the callee. We refer to this measure as social for two reasons. First, it takes into account the fact that two subscribers talked to each other. Second, when estimating connection strength among linked subscribers, the social environment of the caller and callee is weighted such that a much higher weight is given to cases where two subscribers share a large proportion of their friends. None of the other three measures shares both attributes. Let \({\bf {v}}_i\) denote the calls vector of the \(i\)’th caller, where \({\bf {v}}_i[k]\) is the number of times caller \(i\) called \(k\). In the social measure, \(w_{ij} = 1[{\bf {v}}_i[j] > 0]\cdot MI({\bf {v}}_i,{\bf {v}}_j)\) where \(1[\cdot ]\) equals 1 when the condition in the brackets holds, and \(MI({\bf {v}}_i,{\bf {v}}_j) = -P_{i,j}\log_2P_{i,j} -P_{i,\wedge j}\log_2P_{i,\wedge j} -P_{\wedge i,j}\log_2P_{\wedge i,j} - P_{\wedge i,\wedge j}\log_2P_{\wedge i,\wedge j}\). Here, \(P_{i,j} = sum_k(1[{\bf {v}}_i[k] > 0,{\bf {v}}_j[k] > 0])\), \(P_{i,\wedge j} = sum_k(1[{\bf {v}}_i[k] > 0,{\bf {v}}_j[k] = 0])\), \(P_{\wedge i,j} = sum_k(1[{\bf {v}}_i[k] = 0,{\bf {v}}_j[k] > 0])\) and \(P_{\wedge i,\wedge j} = sum_k(1[{\bf {v}}_i[k] = 0,{\bf {v}}_j[k] = 0])\).

  • Cosine of the angle between call vectors measure The weight is proportional to the cosine between the two vectors representing the calls that the caller and the callee performed to all other nodes in the calls graph. Cosine similarity is frequently used as a measure of similarity between sparse entities, for example, in information retrieval, and more recently in social networks Crandall et al. (2008); Kossinets and Watts (2009). Let \({\bf {v}}_i\) denote the calls vector of the \(i\)th caller, where \({\bf {v}}_i[k]\) is the number of times caller \(i\) called \(k\). In the cosine measure, \(w_{ij} =1[{\bf {v}}_i[j] > 0]\frac{{\bf {v}}_i\cdot {\bf {v}}_j}{\left\| {\bf {v}}_i\right\| \cdot \left\| {\bf {v}}_j\right\| }\) where \(1[\cdot ]\) equals 1 when the condition in the brackets holds and 0 otherwise , \({\bf {v}}_i\cdot {\bf {v}}_j\) is the dot product of \({\bf {v}}_i,{\bf {v}}_j\), and \(\left\| {\bf {v}}_k\right\| \) denotes the Euclidean norm of \({\bf {v}}_k\).

Naturally, each measure will result in a different network structure, such as in cases in which the measures are asymmetric. Section 2.3 describes these measures in more detail and addresses some of the differences in their resulting network structure.

The four similarity measures require different computational complexities. Whereas the calls measure can be built using a simple sparse matrix, computing the other three measures requires second-order computations, and are thus more expensive computationally.

2.2.3 Methods for performance enhancement

In many applications, using more features or sources of information can increase model accuracy. In the current study, a natural question is whether the accuracy of diffusion processes can be enhanced by combining scores from several sources. This question may also have implications on other applications of marketing and social networks. In this paper, we focus on the possibility of combining different diffusion processes and show that indeed such a combination can significantly increase the performance of churn prediction methods.

2.3 The interplay between the relationship measure and the coverage

It is important to note that the graphs obtained by each of the measures described above are different in structure (i.e., which nodes are connected through non-zero weighted edges) as well as in the edge weights.

The call graph represents all the users who made or received at least one call during the corresponding period. Hence, it has the largest coverage among the studied relationship measures. Other measures invoke smaller graphs, as we demonstrate using the following simple example: consider a network of 5 nodes, numbered \(1,2,3,4,5\). Let \({\bf {v}}_i[j]\) denote the number of times caller \(i\) called \(j\) and consider the following five vectors: \({\bf {v}}_1 = [0, 100, 0, 0, 1]\), \({\bf {v}}_2 = [0, 0, 100, 0, 1]\), \({\bf {v}}_3 = [0, 0, 0, 100, 0]\) and both \({\bf {v}}_4\) and \({\bf {v}}_5\) are zero vectors. In this example, callers 4 and 5 made no outgoing calls, caller 1 called caller 2 100 times and caller 5 once. Caller 2 called caller 3 100 times and caller 5 once, and caller 3 called caller 4 100 times. The call graph that corresponds to this data is shown at the top of Fig. 1. By definition, in the calls measure, the weights are \(w_{ij} = {\bf {v}}_i[j]\). In this figure, large weights are denoted by solid lines and small weights by dashed lines.

Let us now consider each of the measures that we study on this call graph.

When constructing the social graph, we took into account the social environment as well as the immediate connections. The mutual information between two nodes (A and B) takes into account the number of nodes that both A and B called, the number of nodes that A called and B did not, the number of nodes that B called and A did not, and the number of nodes that neither called. The resulting network appears in the middle of Fig. 1. Again, solid lines represent large weights and dashed lines represent small weights. The strength of the connection between nodes 1 and 2 is high because their social environment is non-empty (node 5 is an outgoing neighbor of both). The other weights are smaller, and although we keep a large fraction of the edges, low valued edges are pruned in our implementation. Therefore, depending on the actual settings of the algorithms, the resulting network may not retain connections to nodes 3, 4, 5, or any subset of those.

Consider the network that is constructed from this data using the weighted number of shared neighbors measure. Formally, this measure is defined by the dot product between call vectors above: \(w_{ij} = {\bf {v}}_i\cdot {\bf {v}}_j\), if \({\bf {v}}_i[j] > 0\) and zero otherwise. The resulting network is shown at the bottom of Fig. 1. As shown, it contains only one edge -the edge from node 1 to node 2, because these are the only two nodes that have shared outgoing neighbors and are connected.

Using a similar consideration, we can show that the cosine network has the same structure as the shared one, with different weights: \(w_{ij} = {\bf {v}}_i\cdot {\bf {v}}_j / (\left\| {\bf {v}}_i\right\| \left\| {\bf {v}}_j\right\| )\). Specifically, with the shared number of connections, the link between node 1 and node 2 would be rated as having a strength of 1, which is the minimal strength value between friends who share a connection. This is in compared to cosine, where the strength is 0.0001 (on a range of [0,1]). As will be demonstrated, even though the cosine and shared networks have the same structure, the results in terms of converged energy are different.

Table 1 summarizes the weights that corresponds to each of the measures.

This small example demonstrates that even before running the diffusion algorithm, the choice of relationship measure has a major effect on the network structure, and in particular on the node coverage. While the calls measure guarantees that all callers and callees are part of the networks, the social measure prunes a fraction of the smallest edges resulting in a smaller network. The shared and cosine measures networks might be significantly smaller in terms of nodes and edges. As will be demonstrated this indeed happens and should be taken into account when choosing a measure. Various modification of these two measures may be attempted (e.g., adding some constant to the actual number of mutual neighbors). We leave the study of such modifications and their effects on the diffusion algorithm to future work.

Fig. 1
figure 1

The networks resulting from various relationship measures. From top to bottom calls, social, shared neighbors, and cosine

Table 1 Values of weights obtained by the various measures for the networks of Fig. 1 \(W_{ij}\) corresponds to the weights of the edge from \(i\) to \(j\).

3 Experiments

This section describes the experiments that we conducted to check the hypotheses raised in the paper. We describe in detail the data sets that were used as well as the methods for results evaluation. We present the results of these experiments in Sect. 4.

3.1 Data sets

We used call data from two telecommunication providers located in different countries. Each data set contains the call data records (CDRs) of the provider, which is a list of the calls made by subscribers of the provider. Each record contains the number of the caller, number of the callee, a time stamp of the call and its duration and sometimes additional information. We only used the tuple caller, callee, time stamp. For each carrier, we had a list of churners and their corresponding churn dates. We divided the data into 2-week periods, with an overlap of 1 week between each pair of consecutive periods. The seed (list of churners − nodes with positive energy) is the set of churners who churned during the 2-week period according to the churn dates provided by the operator. When predicting, day 1 refers to the first day after the 2-week data. Unless otherwise noted, the reported results are averaged over these time periods. Note that each experiment utilizes 2 weeks of data to predict from day 1 onwards, and is performed independently of other experiments.

Following the discussion of Sect. 2.3, Table 2 summarizes the average number of nodes in each graph (averaged over time periods). Figure 2 shows a logarithmic histogram of the undirected node degrees. The social graph contains about 55 % of the nodes of one carrier and 79 % of the second, and the shared and cosine graphs contain about 20 % of the nodes of one carrier and 54 % of the second. As explained earlier, an option exists to disregard edges below a certain threshold for the social graph (as in Richter et al. 2010), which implies a tradeoff between the size and accuracy that can result from different choices of this threshold. A higher threshold (i.e., retaining only stronger edges) decreases the size of the graph and leaves only edges that represent stronger connections. We did not explore the effects of the threshold parameter. The data summarized in the table demonstrates that indeed different measures yield different networks in terms of nodes and edges.

Table 3 describes the fraction of the edges \(a \rightarrow b\) in which \(b\) churned in a small time period after \(a\). The fraction is taken over all the edges of the graph. We present the fraction for each metric and for two time periods—up to 1 week and between 8–14 days.

In all relationship measures, we see that more nodes with a small number of connections exist as compared to nodes with a large number of connections. The decrease in all relationship measures is rapid, and closely follow an exponential decay fit (\(R^2 \ge 0.9\) for both carriers on all).

Each carrier defines a subset of the subscribers that are considered relevant for churn prediction. In the induced network we refer to them as the set of targeted nodes. These can be different segments of the population (private, business, etc.) or based on customer value. Table 2 also summarizes the average number of nodes that are targeted by the carrier in each graph. Interestingly, the number of targeted nodes is less affected by the choice of measure. Finally, churn rates are 1 % per month for the first carrier, and 0.4 % for the second carrier, which are considered relatively low rates.

Table 2 Nodes statistics for the two datasets, for the four similarity measures
Fig. 2
figure 2

Histogram of the number of edges for each carrier and each relationship measure. Isolated nodes (churners that made no calls in the relevant time period, for example) are not taken into account. Note that the vertical axis is logarithmic

Table 3 Fraction of edges that connect churners who churned within certain time differences

3.2 Evaluation of the results

Lift (Lima et al. 2009) is one of the most commonly used performance measures for this type of application. For a given fraction \(P\), where \(0<P\le 1\), the lift is defined as the ratio between the number of churners among the fraction of \(P\) subscribers that are ranked highest by the proposed system, and the expected number of churners in a random sample from the general subscribers pool of equal size. For example, a lift of \(k\) at a fraction \(P=0.05\) means that among the 5 % of the subscribers that are the highest ranked by the algorithm, \(k\) times as many churners exist than in a random sample of the population.

The lift curve characterizes the performance of a given churn prediction system. This curve plots as a function of fraction of the population (\(0<P\le 1\)) to the lift value obtained for this fraction. In general, it is a monotonically decreasing function, because the larger the fraction, the more difficult it is to provide meaningful lift. By definition for \(P=1\), the lift is 1.

Typically, since carriers can only invest targeted efforts in a small fraction of the population, the lift that corresponds to the small fractions is more important to the carriers. In practice, common working points range from 0.1 to 10 %.

4 Experimental results

We implemented the diffusion algorithm of Dasgupta et al. (2008) using the IBM Parallel Machine Learning toolbox (Toledano et al. 2008). The output of each run of the diffusion algorithm is a list of nodes, along with their corresponding level of energy (which we refer to as scores). Our experiments compare the performance of the algorithm over the entire population. In cases in which nodes did not receive a score by a measure, we set this score to zero.

4.1 Effects of the predictive horizon

In this set of experiments, we studied the effect of the predictive horizon on the quality of the results. We calculated the lift for every period, for prediction horizons varying between 5 and 90 days (at intervals of 5 days). In all the experiments we conducted, on both carriers and on all four measures, we see a clear deterioration of the predictive accuracy as the prediction horizon is increased. The effect is more evident in smaller fractions of the population. After approximately 30 days, the decrease is significant (30–40 % is some cases) and beyond 60 days, the lift values are almost constant.

Figures 3 and 4 plot the lift values of the first carrier for working points of 1 and 5 % correspondingly. Figures 5 and 6 plot the lift values along with the best exponential fit (solid line) of the form \(f(x) = Ae^{-\alpha \cdot x} + B\), in which \(A,B,\alpha \in \mathfrak {R}\). These curves represent the lift on the entire population. Similar phenomena can be observed in both carriers regardless of the fraction of the population. Table 4 presents the values obtained in the exponential fit for the various measures. In this fit, the resulted decay parameters are in the range [0.025, 0.047] and the fit is extremely high (average \(R^2\) is 0.98). In the limit of infinite time horizon (\(x\rightarrow \infty \)), the theoretical value of the lift approaches 1. Therefore, we expect the fit to also approach 1 for large horizons. Probably due to the limited maximal horizon of 90 days, we did not reach that value, but we did reach a close one (the mean value of \(B\) is 1.23). Results on the second carrier are similar. We tested the difference in performance among the four similarity measures for statistical significance using the Fisher paired test (Demšar 2006). The social measure was found to be statistically significantly better than the other three measures. The null hypothesis, namely, that the other 3 measures (calls, cosine, and shared) have indistinguishable performance could not be rejected at P = 0.05.

Table 4 Fitted parameters of the exponential fit

In Nitzan and Libai (2010), the authors showed that the hazardous effect that churners have on their neighbors decreases exponentially over time. Our results are consistent with their findings and indicate that diffusion is especially suited for short to medium prediction horizons.

Fig. 3
figure 3

Lift as function of time horizon for working point of 1 %

Fig. 4
figure 4

Lift as function of time horizon for working point of 5 %

Fig. 5
figure 5

Lift as function of time horizon for working point of 1 %, with best exponential fit (solid line). This figure is best viewed in color

Fig. 6
figure 6

Lift as function of time horizon for working point of 5 %, with best exponential fit (solid line). This figure is best viewed in color

4.2 The effect of different relationship measures

As mentioned above, we tested several relationship measures on the same populations, time frames, and predictive horizons. This allowed us to estimate the contribution of the measures directly, given that all other parameters were equal.

Figure 7 shows the lift at the 1 % fraction of the population as a function of the prediction horizon, for the first carrier. As this figure demonstrates, up to horizon of 40 days, the calls measure is outperformed by the social, shared, and cosine measures, as well as some combined measures to be discussed at the next section. Table 6 quantifies these improvements, averaged over several prediction horizons, and shows that the differences between the individual measures are often significant. Additionally significant, the social measure outperforms the calls measure in both carriers. Therefore, we deduce that the social measure is a good candidate for being the default measure of choice in diffusion, while taking into account its larger computational complexity. However, individual carriers may find that other measures offer (sometimes greatly) superior performance for their specific population

Fig. 7
figure 7

Lift as functions of time horizon, on all basic measures and combinations in working point of 1 %. This figure is best viewed in color

4.3 Combining multiple diffusion scores

The previous section demonstrated that similarity measures differ in performance. Therefore, we hypothesize that each measure may be better at identifying a different set of churners. Hence, in this section, we demonstrate that combining multiple diffusion measures can improve the quality of the results for churn prediction.

First, the Spearman correlation (Spearman 1904) between the scores resulting from the different relationship measures is shown in Table 5, for different fractions of the population. The top 1 % refers to 1 % of the population which received the highest averaged score across all measures. Interestingly, for most pairs of measures, the correlation coefficients that correspond to these top 1 % of scores are small in their absolute values and differ in their signs, while the correlation for the entire population is somewhat higher and always positive. This suggests a large disagreement between measures on the ranking of the most likely churning subscribers, and a general agreement on the scoring of less likely churn candidates. The only pairs that have a high correlation are shared and the cosine. This is not surprising, given that these measures are highly related by definition, and they form the same network structure (for a more detailed discussion, see Sect. 2.3).

Table 5 Spearman correlation coefficient values

We tested several classifiers for combining the four similarity measures. The inputs to the classifiers were the diffusion scores, and the classifiers tested were logistic regression and regression tree. Since the regression tree maps large fractions of the populations to the same score, we also tried to smoothen the scores by adding to each tree score the average over all measures. We term this the soft decision tree. The classifiers were constructed for each horizon using the first (training) period, and applied to the test periods. For comparison, we also provide the lift obtained by a simple averaging of the four diffusion scores.

The results of this analysis are shown in Fig. 7, which depicts the lift values over time, and in Table 6, which shows the average lift values, using the calls measure as the baseline. As shown, the decision and regression trees offer the best improvements in lift, obtaining results that are far superior than the ones obtained for each measure separately, with an improvement over the baseline of over \(50\,\%\) for the first carrier and around \(18\,\%\) for the second. This lends additional evidence to the finding that different measures identify different churners, and that these can be exploited through a learning algorithm. Our finding may also hint that different measures identify subscribers who churn for different social influences. This, however, requires additional study.

Figure 8 shows the structure of the regression tree for the working point of 20 days for the first carrier. The tree was pruned to a depth of five levels to avoid over fitting. All measures we tested appear in the tree. Interestingly, the most indicative feature is the social measure, in which a low social score indicates a lower propensity to churn (which is further partitioned using the calls measure). Conversely, a high social score, combined with a high score in cosine and shared is indicative of an extremely high churn likelihood.

Fig. 8
figure 8

Decision tree structure for horizon of 20 days (Carrier 1). Values in leaf nodes indicate the likelihood of churn for subscribers mapped to these nodes

Table 6 Average percentage improvement in diffusion performance using different similarity measures and combination classifiers, compared to the calls measure

5 Conclusions and future work

Diffusion processes play a central role in applications of viral marketing and social networks. A deeper understanding of the properties of such processes can lead to theoretical and practical insights on these applications. In this context, churn is a useful ground truth because it is relatively easy to define and measure.

In this paper, we addressed a number of phenomena related to the use of diffusion-based algorithms for churn prediction in Telco networks. We demonstrated three main phenomena: the fact that accuracy deteriorates sharply with the prediction horizon, the effect of the social affinity measure used, and the usefulness of combining social affinity measures for enhancing the performance of churn prediction algorithms.

During this study, we fixed several parameters of the diffusion algorithm such as the initial energy and the spreading coefficient \(d\). The values were set as suggested in Dasgupta et al. (2008), which also provide a discussion on the effect of changing the value of \(d\). We leave for future work to check whether the optimal value of \(d\) should depend on the chosen social affinity measure, and whether the initial energy has a major effect on the results.

Several research directions stem from this work: our paper showed that significant gains can be achieved from both the use of different social affinity measures and their combination via ensemble methods. We believe that these gains could be further improved by both introducing additional measures and by additional ways of combining them. Additionally, as is the case in many such applications, each carrier may find that a different set of measures is best for its specific population, though the social measure is the best single measure among those we tested.

The deterioration of prediction quality as a function of the time horizon has implications for marketing applications. A theoretical explanation of this phenomenon may lead into novel algorithms and deeper insights. One possible conjecture is that this phenomenon stems from the expander-like structure of such networks. Yet, deducing a generic theoretical explanation of this deterioration would be challenging.

The fact that different measures identify different churners is interesting from a social perspective, as it may be due to different modes of churn or to differences in the churning subscribers. Future work should investigate this phenomena in more detail.