1 Introduction

In recent times, Online Social Networks have emerged as an additional dimension of human life. Among many one of the important phenomena of online social networks is the diffusion of information using which information propagates from one part of the network to the other [5]. How we can make use of this influence in the context of viral marketing such that the profit can be maximized remains an active area of research [4, 12].

Problem Background. Commercial houses use online social networks for viral marketing purposes [3]. The goal here is to identify a small set of influential users to activate initially that leads to maximum influence (in turn profit). These initially active nodes are called seed nodes. In particular, given a social network and positive integer k the problem is to choose a subset of k nodes to maximize the influence in the network [6]. This problem remains an active area of research in the domain of social network analysis [1].

Motivation. The traditional seed set selection methodologies for the influence and profit maximization problem consider that all the seed nodes will be deployed at one go before the diffusion process starts. However, some recent studies show that instead of one go if we split the budget into two or more parts and conduct the diffusion process in two or more rounds, then the number of influenced nodes increases [2, 13]. Naturally, a question arises, does the same thing happen even for the Profit Maximization Problem. In this paper, we elaborately investigate this question.

Our Contributions. Our goal is to choose optimal seed sets for both the phases such that the aggregated profit is maximized. In particular, we make the following contributions in this paper:

  • We study the Profit Maximization Problem using social networks in the two-phase setting. To the best of our knowledge, this is the first study on profit maximization in this direction.

  • We develop a mathematical model for this problem that captures the expected profit at the end of the diffusion process. Subsequently, we show that selecting an optimal seed set for the first phase even considering the optimal seed set for the second phase can be selected efficiently is an \(\textsf {NP}\)-Hard Problem.

  • We propose two algorithms, namely the single greedy and the double greedy approach along with their detailed analysis and both of them work based on marginal profit gain computation.

  • Finally, we conduct an extensive set of experiments with real-world datasets to show that the proposed methodologies can lead to more amount of profit than the baseline methods.

Organization of the Paper. The rest of the paper is organized as follows. In Sect. 2, we describe the background and define the problem formally. Section 3 contains the proposed solution methodologies. Section 4 contains the experimental evaluation of the solution approaches. Finally, Sect. 5 concludes our study.

2 Background and Problem Definition

We represent the input social network by a simple (un)directed, and edge-weighted graph denoted by \(G(V, E, \mathcal {P})\). Here, \(V(G)=\{u_1, u_2, \ldots , u_n\}\) are the set of n users and \(E(G)=\{e_1, e_2, \ldots , e_m\}\) are the set of m social ties. \(\mathcal {P}\) denotes the edge weight function that maps each edge to its corresponding influence probability; i.e.; \(\mathcal {P}:E(G) \longrightarrow (0,1]\). For any edge \((uv) \in E(G)\), \(\mathcal {P}_{uv}\) denotes the influence probability of the user u on v. If \((uv) \notin E(G)\) then \(\mathcal {P}_{uv}=0\). Each user of the network is associated with a cost and benefit value that are characterized by the cost and the benefit function denoted as C and b. Hence, \(C:V(G) \longrightarrow \mathbb {Z}^{+}\) and \(b:V(G) \longrightarrow \mathbb {Z}^{+}\). For any \(u \in V(G)\), let C(u) and b(u) denote the cost and benefit associated with the user u.

To conduct the diffusion process in the network, a subset of the users are chosen as a seed user and they are considered to be influenced at time step \(t=0\). Now, the information is diffused in the network based on some rules. In this paper, we consider that the information in the network is diffused by the rule of the Independent Cascade Model (ICM). The node is either ‘uninfluenced’ or ‘influenced’. Every influenced node activates its inactive neighbor in discrete time step t and an influenced node can’t change to its previous uninfluenced state. The diffusion process ends when no more node activation is possible. The diffusion process can be expressed as a live graph and they are \(2^{m}\) many where m denotes the number of edges in G. We denote these graphs as \(L(G)=\{ \mathcal {G}_{1}, \mathcal {G}_{2}, \ldots , \mathcal {G}_{2^{m}}\}\). More about live graphs can be found in [6]. Next, we state the influence of a seed set in Definition 1.

Definition 1 (Influence of a Seed Set)

Assume that \(\mathcal {S} \subseteq V(G)\) is the seed set. Now, at the end of the diffusion process starting from \(\mathcal {S}\), the number of nodes that are influenced is called the influence of \(\mathcal {S}\). We denote this by \(\sigma (\mathcal {S})\) where \(\sigma ()\) is the social influence function that maps each subset of the user to their corresponding influence value, i.e., \(\sigma : 2^{V(G)} \longrightarrow \mathbb {R}_{0}\) with the condition \(\sigma (\emptyset )=0\).

Definition 2 (Social Influence Maximization Problem)

Given a social network \(G(V, E, \mathcal {P})\), and a positive integer k the goal of the social influence maximization problem is to choose a subset of k nodes \(\mathcal {S} \subseteq V(G)\) such that their initial activation leads to maximum number of influenced nodes. Mathematically, this problem can be stated as follows:

$$\begin{aligned} \mathcal {S}^{OPT} = \underset{\mathcal {S} \subseteq V(G) \text { and }|\mathcal {S}| \le k}{argmax} \ \sigma (\mathcal {S}) \end{aligned}$$
(1)

Here, \(\mathcal {S}^{OPT}\) denotes an optimal seed set of size k.

Definition 3 (Benefit Earned by a Seed Set)

Given a social network \(G(V, E, \mathcal {P})\) and a seed set \(\mathcal {S}\) we denote the benefit obtained by \(\mathcal {S}\) as \(\beta (\mathcal {S})\) and defined in terms of expectation over the set of all possible live graphs. Mathematically, this can be defined using Eq. 2.

$$\begin{aligned} \beta (\mathcal {S})= \mathbb {E} \ [\underset{v \in I_{\mathcal {G}_{i}}(\mathcal {S}) \ \cap \ V(G)}{\sum } \ b(v) \ ] \end{aligned}$$
(2)

Here, \(I_{\mathcal {G}_{i}}(\mathcal {S})\) denotes the set of influenced nodes from the seed set \(\mathcal {S}\) in the i-th live graph. \(C(\mathcal {S})\) denotes the total cost of the seed set, i.e., \(C(\mathcal {S})= \underset{v \in \mathcal {S}}{\sum } C(v)\).

The expectation mentioned in Definition 2 is taken over the probability distribution of the benefit values in different live graphs. For a seed set \(\mathcal {S}\), we denote its profit by \(\phi (\mathcal {S})\) and can be defined using Eq. 3.

$$\begin{aligned} \phi (\mathcal {S})= \beta (\mathcal {S}) \ - \ C(\mathcal {S}) \end{aligned}$$
(3)

Definition 4 (Profit Maximization Problem)

Given a social network \(G(V,E, \mathcal {P})\) where users of the network are associated with cost and benefit value and a fixed amount of budget \(\mathcal {B}\) is given. The goal is to choose a subset of the nodes \(\mathcal {S} \subseteq V(G)\) such that the profit is maximized. Mathematically, this problem can be stated as follows:

$$\begin{aligned} \mathcal {S}^{OPT}=\underset{\mathcal {S} \subseteq V(G) \text { and }C(\mathcal {S}) \le \mathcal {B}}{argmax} \ \phi (\mathcal {S}) \end{aligned}$$
(4)

Definition 5 (Two-Phase Profit Maximization Problem)

Given a social network \(G(V,E, \mathcal {P})\) where users of the network are associated with cost and benefit value and budget for two phases \(\mathcal {B}_{1}\) and \(\mathcal {B}_{2}\), our goal is to choose seed nodes \(\mathcal {S}_{1}\) and \(\mathcal {S}_{2}\) to maximize the profit at the end of second phase such that \(C(\mathcal {S}_{1}) \le \mathcal {B}_{1}\) and \(C(\mathcal {S}_{2}) \le \mathcal {B}_{2}\).

3 Mathematical Model and Solution Methodologies

Mathematical Model. Let, a live graph \(\mathcal {G} \in L(G)\) with its generation probability \(P(\mathcal {G})\) is destined to occur and \(\mathcal {S}_{1}\) be the seed set for the Phase I. The diffusion process starts on a live graph \(\mathcal {G}\) with seed set \(\mathcal {S}_{1}\) by the rule of IC Model and observe this diffusion process till time step d. Then, we will have the information regarding which nodes are influenced and which are not. We call this as the partial observation till time step d and denoted as Y. So, at the end of time step d, we have already activated nodes denoted by \(A_{Y}\) and newly activated nodes (at time step d) \(R_{Y}\). These two sets \(A_{Y}\) and \(R_{Y}\) are determined from the partial observation Y.

Now, as we have the partial observation Y, for a subset of the edges of the live graph \(\mathcal {G}\) we are sure whether they have appeared or not and based on that we can update the generation probability \(P(\frac{\mathcal {G}}{Y})\). Now, the second phase needs to begin, and assume that \(\mathcal {S}^{OPT(Y,\mathcal {B}_2)}_{2}\) denotes the optimal seed set for Phase II when the partial observation Y and the budget is \(\mathcal {B}_2\). At the time step d, we deploy the nodes in the set \(\mathcal {S}^{OPT(Y,\mathcal {B}_2)}_{2}\) along with the nodes in \(R_{Y}\) both of them together will act as seed set for Phase II. Now, our goal is to calculate the expected profit that can be earned in Phase II. Now, it is important to observe that in Phase II , the nodes from which the profit can be earned will be the subset from \(V(G) \setminus A_{Y}\). So, for the given partial observation Y (hence, newly activated nodes \(R_Y\)) along with an optimal seed set for Phase II, i.e., \(\mathcal {S}^{OPT(Y,\mathcal {B}_2)}_{2}\) will be equals to \(\displaystyle {\sum _{\mathcal {G} \in L(G)}P(\frac{\mathcal {G}}{Y}})[\phi ^{V(\mathcal {G})\setminus A_{Y}}(R_{Y} \cup S_{2}^{OPT(Y,B_{2})})]\). Here, \(\phi ^{V(\mathcal {G})\setminus A_{Y}}(\mathcal {S})\) denotes the profit earned by the seed set \(\mathcal {S}\) from the graph \(V(G) \setminus A_{Y}\).

We can observe that given a live graph \(\mathcal {G}\), seed set of Phase I; i.e.; \(\mathcal {S}_{1}\), and the time step d, we can get the partial observation Y. Hence, \(\mathcal {S}^{OPT(Y,\mathcal {B}_2)}_{2}\) can be written as \(S_{2}^{OPT(X, S_{1},d,\mathcal {B}_{2})}\). Now, to develop an objective function where the decision variable will be the seed set for Phase I, we assume that given the partial observation Y, we will select an optimal seed set for Phase II. It is important to observe that at the starting of Phase I, the partial observation Y is not known. Let our objective function be \(\mathbb {F}(S_1, d, \mathcal {B}_2 )\) as the expected profit with respect to all possible occurrences Y. Assuming that d and \(\mathcal {B}_2\) are already given, so we can write \(\mathbb {F}(\mathcal {S}_{1} , d, \mathcal {B}_{2} )\) as \(f(\mathcal {S}_{1})\). In the following derivation, by \(\mathcal {S}^{'}_{2}\) we denote the set \(R_{Y} \cup S_{2}^{OPT(Y,B_{2})}\).

$$\begin{aligned} \begin{array}{c} f(S_{1}) = \displaystyle {\sum _{Y}P({Y}})\Bigg \{{\Big \{ \displaystyle {\sum _{\mathcal {G}}P(\frac{\mathcal {G}}{Y}})\phi (A_{Y}) + \displaystyle {\sum _{\mathcal {G}}P(\frac{\mathcal {G}}{Y}})[\phi ^{V(\mathcal {G})\setminus A_{Y}}(\mathcal {S}^{'}_{2})]}\Big \}\Bigg \}\\ = \displaystyle {\sum _{Y}P({Y}})\displaystyle {\sum _{\mathcal {G}}P(\frac{\mathcal {G}}{Y}})\Bigg \{{\Big \{\phi (A_{Y}) + [\phi ^{V(\mathcal {G})\setminus A_{Y}}( S_{2}^{OPT(X,S_{1},d,\mathcal {B}_{2})})]}\Big \}\Bigg \}\\ = \displaystyle {\sum _{Y}P({Y}})\displaystyle {\sum _{\mathcal {G}}P(\frac{\mathcal {G}}{Y}})\Bigg \{{\phi ^{\mathcal {G}}(S_{1} \cup S_{2}^{OPT(X,S_{1},d,\mathcal {B}_{2})})}\Bigg \}\\ \Bigg \{ \because \displaystyle {\sum _{Y}P({Y}})\displaystyle {\sum _{\mathcal {G}}P(\frac{\mathcal {G}}{Y}}) = \displaystyle {\sum _{Y}}\displaystyle {\sum _{\mathcal {G}}} {\frac{P(\mathcal {G},Y)}{P(Y)}} P(Y)\\ = \displaystyle {\sum _{Y}}\displaystyle {\sum _{\mathcal {G}}} P(\mathcal {G},{Y}) \ = \displaystyle {\sum _{\mathcal {G}}}\displaystyle {\sum _{Y}} P(\mathcal {G},{Y})\ = \displaystyle {\sum _{\mathcal {G}}} P(\mathcal {G})\Bigg \}\\ \end{array} \end{aligned}$$

The objective function formulated of our Two-Phase Profit Maximization Problem is as follows:

$$\begin{aligned} \therefore f(\mathcal {S}_{1})= \displaystyle {\sum _{\mathcal {G}}}P(\mathcal {G}){\phi ^{\mathcal {G}}(\mathcal {S}_{1} \cup \mathcal {S}_{2}^{OPT(X,\mathcal {S}_{1},d,\mathcal {B}_{2})})} \end{aligned}$$
(5)

Now, it is important to observe that the developed model considers the optimal seed set selection in Phase II, which is itself an NP-Hard problem. So, Theorem 1 holds.

Theorem 1

Finding the optimal seed set \(\mathcal {S}_{1}\) that maximizes \(f(\mathcal {S}_{1})\) as mentioned in Eq. 5 is NP-Hard.

Solution Methodologies. In this section, we describe two solution methodologies, namely Single Greedy and Double Greedy for the Two-Phase Profit Maximization Problem and both of them are based on Marginal Profit Gain which is stated in Definition 6.

Definition 6 (Marginal Profit Gain)

Given a social network \(G(V, E, \mathcal {P})\), a seed set S, and a node \(u \in V(G) \setminus S\) we denote the Marginal Profit Gain for the node u with respect to the seed set S as \(\phi _u(S)\) and it is defined as the difference of profit earned when u is added to S and when u is not in S. The profit function \(\phi ()\) may be non-monotone as well. As we are considering the ‘gain’, for any seed set S and node \(u \in V(G) \setminus S\), \(\phi _u(S)\) is defined only when \(\phi _u(S) > \phi (S)\). Mathematically, this can be defined using Eq. 6.

$$\begin{aligned} \phi _u(S) = \phi (S \cup \{u\}) - \phi (S) \quad such \; that \; \phi _u(S) > \phi (S) \end{aligned}$$
(6)

Single Greedy Approach. Now, we describe the single greedy algorithm in two-phase setting. We have the following inputs: the social network \(G(V, E, \mathcal {P})\), Budgets for both the phases \(\mathcal {B}_{1}\) and \(\mathcal {B}_{2}\), and the duration of Phase I which is d. Now, in the first phase, until the budget \(\mathcal {B}_1\) is exhausted, we iteratively select seed node based on the marginal profit gain. In each iteration, for every non seed node u, we compute the marginal profit gain to its cost ratio and the node maximizes this quantity is found out. If the marginal profit gain of this node is strictly positive then it is included in the seed set of Phase I. It may so happen that the allocated budget for the first phase has not been exhausted totally. If so, the remaining budget of Phase I is added to Phase II. Now, we conduct diffusion process based on the IC Model starting from the seed set \(\mathcal {S}_1\) till d-th time step. Thus, at the end of first Phase we have \(A_Y\) and \(R_Y\) as the influence of \(\mathcal {S}_1\) at time step d. Now, we begin Phase II with the updated budget. The process of seed set selection is quite similar to the first phase with one difference. During the first phase we are dealing with the entire social network, and hence, while computing the marginal profit gain we consider the whole network. However, for the second phase we deal with the network obtained by deleting the already activated nodes from the original network. Accordingly, during the second phase while computing the marginal profit gain we consider the remaining network. Algorithm 1 shows the pseudocode of the proposed approach.

figure a

Next, we analyze this algorithm to understand its time and space requirement. Let, \(\mathcal {C}_{min}\) denotes the minimum cost among all the nodes; i.e.; \(\mathcal {C}_{min}=\underset{u \in V(G)}{min}\ \ C(u)\). So it is easy to observe that the maximum number of nodes that can be selected as seed in Phase I will be of \(\mathcal {O}(\frac{\mathcal {B}_1}{\mathcal {C}_{min}})\) and also these many iterations are required. In each iteration, the main computation involved is the marginal profit gain. It is easy to observe that in every iteration the number of nodes for which the marginal profit gain needs to be computed is of \(\mathcal {O}(n)\). Now, for a given seed node computing the marginal gain is equivalent to traversing the graph and this takes \(\mathcal {O}(m+n)\) time. In the worst case, the size of \(\mathcal {S}_1\) can be of \(\mathcal {O}(n)\). Hence, for one marginal gain computation time requirement is of \(\mathcal {O}(n \cdot (m+n))\). As, there are \(\mathcal {O}(n)\) many marginal profit gain computation, hence time requirement for this purpose is of \(\mathcal {O}(n^{2} \cdot (m+n))\). Now, choosing the node that causes the maximum marginal profit gain that takes \(\mathcal {O}(n)\) time. So, the time requirement for the Phase I is of \(\mathcal {O}(\frac{\mathcal {B}_1}{\mathcal {C}_{min}} \cdot n^{2} \cdot (m+n))\). Now, before starting Phase II, we need to delete the already activated nodes in Phase I from the Graph. Now, the number of already activated nodes are of \(\mathcal {O}(n)\). In the worst case, they may be incident with \(\mathcal {O}(n^{2})\) many edges. So, deleting these vertices from the graph requires \(\mathcal {O}(n^{2})\) time. The analysis of the second phase will remain the same except one difference. As in Phase II, the graph has been reduced by deleting the already activated nodes so \(\mathcal {C}_{min}\) may not be minimum cost of the nodes in Phase II. Let, it be \(\mathcal {C}^{'}_{min}\) and this means \(\mathcal {C}^{'}_{min}=\underset{u \in V(G) \setminus A_{Y}}{min} \ C(u)\). However, the remaining computations remains the same. So, the time requirement for Phase II will be of \(\mathcal {O}(\frac{\mathcal {B}_2}{\mathcal {C}^{'}_{min}} \cdot n^{2} \cdot (m+n))\). Now, summing everything up, the total time requirement of the single greedy approach will be of \(\mathcal {O}((\frac{\mathcal {B}_1}{\mathcal {C}_{min}}+\frac{\mathcal {B}_2}{\mathcal {C}^{'}_{min}}) \cdot n^{2} \cdot (m+n))\). Now, the extra space consumed by this method is to store the seed sets for both the phases which can be of \(\mathcal {O}(n)\). Hence, Theorem 1 holds.

Theorem 2

The time and space requirement of Single Greedy Approach is of \(\mathcal {O}((\frac{\mathcal {B}_1}{\mathcal {C}_{min}}+\frac{\mathcal {B}_2}{\mathcal {C}^{'}_{min}}) \cdot n^{2} \cdot (m+n))\) and \(\mathcal {O}(n)\), respectively.

It is easy to convince that two-phase setting is a generalization of single-phase setting and it has been mentioned in [14] that the single greedy approach in one-phase setting does not lead to any constant factor approximation guarantee. Next, we describe the double greedy approach.

Double Greedy Approach. First phase of this method goes like this. We initialize two sets \(\mathcal {S}_{1}\) and \(\mathcal {T}_{1}\). The first one is with \(\emptyset \) and the second one is with V(G). Now, for every node \(u \in V(G)\), we compute two measures \(r^{+}_{u}\) and \(r^{-}_{u}\) which are mentioned in Eq. 7 and 8, respectively.

figure b

Now, if \(r^{+}_{u} \ge r^{-}_{u}\) and the cost of the current nodes is less than the available budget, then the set \(\mathcal {S}_{1}\) is updated as \(\mathcal {S}_{1} \cup \{u\}\) and the budget \(\mathcal {B}_{1}\) is updated as \(\mathcal {B}_{1}- \mathcal {C}(u)\), though \(\mathcal {T}_{1}\) remains the same. If the budget is not sufficient or if \(r^{+}_{u} < r^{-}_{u}\), \(\mathcal {T}_{1}\) is reduced by deleting the current node and in that case \(\mathcal {S}_{1}\) remains the same. After repeating these steps we obtain the seed set for Phase I; i.e.; \(\mathcal {S}_{1}\). Now, we conduct the diffusion process and observe upto the time step d and this obtain the the already activated nodes and recently activated nodes. If there is any unutilized budget of Phase I, that has been added to the budget of Phase II. For the seed set selection of Phase II, we repeat the same process however on the reduced graph; i.e.; the graph obtained by deleting the already activated nodes in the first phase. Now, we proceed to describe the analysis of the double greedy approach.

figure c

The analysis is quite similar to the single greedy approach in two-phase setting. As stated previously, computing one marginal profit gain computation requires \(\mathcal {O}(m+n)\) time. It is important to observe that in this method we are performing two marginal profit gain computations per node. Hence, the time requirement for the seed set selection of the first phase is of \(\mathcal {O}(n \cdot (m+n))\). Other than the marginal gain computations, all the remaining statements from Line 7 to 17 will take \(\mathcal {O}(1)\) time. Now, in the worst case the size of \(\mathcal {S}_{1}\), \(R_{Y}\), and \(A_{Y}\) can be of \(\mathcal {O}(n)\). So, there can be \(\mathcal {O}(n^{2})\) many edges associated with the vertices of \(A_{Y}\). Hence, deleting the set \(A_{Y}\) leads to the modification of the \(\mathcal {O}(n^{2})\) many adjacency matrix entries of the input social network. Thus, performing the deletion step after Phase I requires \(\mathcal {O}(n^{2})\) time. Like Phase I, it is easy to observe that the time requirement for seed set selection in Phase II will be of \(\mathcal {O}(n \cdot (m+n))\). Hence, the total time requirement for the double greedy approach in two-phase setting will be of \(\mathcal {O}(n \cdot (m+n) + n^{2})= \mathcal {O}(n(m+n))\). The extra space consumed by this algorithm is to store the sets \(S_{1}\), \(T_{1}\), \(S_{2}\), \(T_{2}\), and S. In the worst case, all of them will consume \(\mathcal {O}(n)\) space. Hence, Theorem 3 holds.

Theorem 3

Running time and the space requirement of the double greedy approach in the two-phase setting is of \( \mathcal {O}(n(m+n))\) and \( \mathcal {O}(n)\), respectively.

4 Experimental Evaluation

In this section, we describe the experimental evaluation of the proposed solution approaches. First, we mention the datasets that we have used.

Datasets. We have used three datasets for our experiments namely, email-Eu-core [11, 15], soc-sign-bitcoin-alpha [7, 8] and wiki-Vote [9, 10]. They have been downloaded from https://snap.stanford.edu/data/ and listed in Table 1.

Table 1. Basic Statistics of the Datasets

Experimental Setup. In our experimental setup Independent Cascade Model is used for diffusion process. The influence probability \(\mathcal {P}_{uv}\) is set to 0.01. The cost setting of a \(u \in V(G)\) is \(C(u) \longrightarrow [50,100]\) and benefit setting is \(b(u) \longrightarrow [800,1000]\). Hence, a cost and benefit is associated for every node u in the graph. These settings are followed for all these three datasets specified above. We have evaluated datasets for Budget = \(\{500, 1000, 1500, 2000, 2500\}\). Each budget is split into a ratio of \(60\%\) at time step \(d = 3\). The budgets \(\mathcal {B}_{1}\) and \(\mathcal {B}_{2}\) are consumed in first phase and second phase, respectively. The first phase of experiment is run for 100 iterations which generates 100 sets of recently active nodes. Now, the second phase runs for 100 times for all 100 sets of recently active nodes. The maximum of the profits from the 100 recently active nodes is the profit earned from the second phase. We compare the performance of the proposed solution approaches with the following methods, namely Random, High Degree, Clustering Coefficient and Single Discount baseline algorithms are compared as per the experimental setup elaborated above.

Results and Discussions. Now, we describe the experimental results. Figure 1 shows the budget vs. cardinality of the seed set plots selected by proposed as well as baseline methods over different budget values for different datasets. From this figure, we observe that in most of the problem instances the number of seed nodes selected by the single greedy algorithm is more than the other methods. As an example, for the Email-Eu-Core Dataset when the budget value is 2500, among the baseline methods the cardinality of the seed sets selected by all of them is 33. However, in the same setting the number of seed nodes selected by the single greedy approach is 36. This observation is consistent even for the other two datasets. For the Soc-Sign-Bitcoin-Alpha Dataset when the budget value is 2000, among the baseline methods the maximum number of seed nodes selected by both Random and Single Discount Heuristics and the number is 29. However, the same for the single greedy approach is 31. For the Wiki-Vote Dataset, for the budget value 2500, among the baseline methods the maximum number of seed nodes selected by Random and the number is 34 whereas the same by single greedy algorithm is 39.

Fig. 1.
figure 1

Budget Vs. Cardinality of the Seed Set Plots in two-phase setting for all the datasets

Fig. 2.
figure 2

Budget Vs. Differences in Profit in single and two-phase for all the datasets

Figure 2 shows the budget vs. difference between profit in two-phase and one-phase for different datasets. From the figure, we observe that for most of the budget values the difference in the earned profit remains positive where as for most of the baseline methods this difference is negative. As an example, for the ‘email-Eu-core’ dataset when the budget value is 2500, for the single greedy algorithm the earned profit in single-phase and two-phase is 56354.35 and 69676.21, respectively. Hence, the gain in two-phase is 13321.86 which is approximately \(24 \%\). In general, we observe that when the budget value increases (upto 2000) this gain increases. However, if we increase the budget further the gain decreases. This is due to the fact when the budget exceeds a threshold, then the earned benefit does not increase much. However, the cost of the seed set increases as the budget increases. Consequently, the profit does not increase even if the budget increases.

Another important observation, we make from Fig. 2 is that the performance of the single greedy and double greedy algorithm is complementary, meaning for the budget values when the single greedy approach does not perform well in those budget values the double greedy approach. As an example, for the Email-Eu-Core dataset for the budget value 500, the gain in case of the single greedy approach is negative where as the same for the double greedy approach is positive. For the other budget values the scenario is just the reverse.

5 Conclusion and Future Direction

In this paper, we have studied the problem of profit maximization in two-phase setting using online social networks. For this problem, first we develop a mathematical model for this problem based on IC Model of diffusion that captures the profit in an expected sense. Subsequently, we propose two solution methodologies namely single greedy and double greedy. Experimental analysis with real-world datasets show that effectiveness of the proposed solution methodologies. Now, in this study we have not taken care of how the total budget \(\mathcal {B}\) is divided into \(\mathcal {B}_{1}\) and \(\mathcal {B}_{2}\) which we pose for future study.