1 Introduction

Owing to their rapid development, Online Social Networks (OSNs) have become powerful media for spreading innovations, and the market for OSN advertising is growing explosively. A managing partner of a consulting firm claimed in [1] that the e-commerce industry in the U.S. was worth approximately $500 billion in 2018 and had been one of the fastest growing areas of the economy.

In OSNs, users can share their opinions about an event or a promoted product with other users, and this kind of information dissemination is flourishing as never before. In the influence maximization (IM) problem, a positive integer k is given and the aim is to find a set of k seeds that maximizes the total number or expected number of active nodes in a social network. In [2], Kempe et al. first consider the issue of choosing influential sets of individuals as a discrete optimization problem, prove that the problem is NP-hard under both the Independent Cascade (IC) model and the Linear Threshold (LT) model, and propose a greedy algorithm that yields a (1–1/e)-approximation owing to the submodularity and monotonicity of the influence spread. As shown in [3], computing exact influence in general networks under the LT model is #P-hard. Many follow-up studies, such as [4,5,6,7,8,9], further study the IM problem in terms of implementing algorithms more efficiently, extending the maximization from a practical point of view, and so on.

When an OSN provider is hired to conduct a viral marketing campaign, it not only receives commission from the advertiser but also pays for the information propagation. Therefore, the OSN provider needs to account for both the benefit and the cost of influence spread to maximize its profit. As an extension of IM, Tang J. et al. propose in [11, 12] the profit maximization (PM) problem, whose aim is to find a set of nodes that maximizes the profit in online social networks, and define a general profit metric that can be expressed as the difference between a benefit function and a cost function. Since the benefit function is the total benefit brought by all the activated nodes, it is a submodular function. The cost function is also submodular when it represents the total cost incurred by all the active nodes, which differs from the budget constraint studied in previous works. Therefore, the profit metric, being the difference of two submodular functions, is in general neither submodular nor monotone.

Many studies, such as [13,14,15], focus on submodular optimization, while nonsubmodular optimization has been attracting growing attention for many years. As summarized in [16], there are many approaches to solving nonsubmodular optimization problems. One of them is the DS decomposition mentioned in [17,18,19,20]. As shown in [17, 18], every set function \(f:2^X\rightarrow R\) can be expressed as the difference of two monotone nondecreasing submodular functions g and h, i.e., \(f=g-h\), where X is a finite set. Based on this theorem, many algorithms, such as the modular-modular algorithm and the iterated sandwich algorithm, have been proposed.

We observe some real marketing processes and come to the conclusion that although the vast majority of companies select some users to help them promote products, the number of selected users is severely limited. The reason may be that excessive marketing not only brings profit but also incurs greater cost. In this paper, we formulate the profit maximization problem with a cardinality constraint and describe the profit function precisely from the marginal increment perspective. The constrained profit maximization (CPM) problem aims to select at most k nodes such that the total profit generated by the activated nodes is maximized. Unlike existing common methods, we obtain the marginal gain of profit from the perspective of marginal increment. Generally speaking, the profit spread function is neither submodular nor monotone. Hence the CPM problem is a nonsubmodular optimization problem. By its definition, the profit function can be naturally expressed as the difference between two submodular functions. Therefore, we design a Marginal increment-based Prune and Search (MPS) algorithm. In the pruning phase, an algorithm that reduces the search space is devised; in the search phase, two algorithms for selecting seed nodes are designed, inspired by the classic greedy algorithm and the DS decomposition method, respectively.

The rest of this paper is organized as follows. In Sect. 2, we present our formulation of the nonsubmodular CPM problem. In Sect. 3, we derive the MPS algorithm, a two-phase algorithm based on the marginal increment method. Our experiment settings are introduced in Sect. 4, and the results confirm the effectiveness of the MPS algorithm. Concluding remarks and suggestions for future work are given in Sect. 5.

2 Problem Formulation

2.1 Constrained Profit Maximization

In this section, we define the CPM problem. It should be emphasized that the marginal increment method and the CPM problem are not restricted to a particular diffusion model; we take the IC model as an example. Given a directed graph \(G=(V,E,P)\), a constant k, and a benefit \(b_v\) and cost \(c_v\) for each node \(v\in V\), the CPM problem aims to find a seed set S containing at most k nodes that maximizes the resulting profit.

We formulate the profit function from the perspective of marginal increment and describe the profit function value as accurately as possible. For the given directed graph \(G=(V,E,P)\), nodes in V represent users and edges in E represent the connections among users. Each node v is associated with a benefit \(b_v\) and a cost \(c_v\). For any directed edge \(<u,v> \in E\), we refer to v as a neighbor of u and to u as an inverse neighbor of v. Denote \(N_u=\{v:v\in V,<u,v>\in E\}\). Let \(p_{u,v}\), associated with each edge \(<u,v>\in E\), represent the activation probability from node u to node v. The diffusion process starts with a given set \(S\subseteq V\) which includes all the initially active nodes. When a node u first becomes active, it has a single chance to activate each node \( v\in N_u\) and succeeds with probability \(p_{u,v}\). The diffusion process ends when no more nodes can be activated. Denote by \(\phi \left( S\right) \) the profit generated by a seed node set S. Then \(\phi \left( S\right) =\beta \left( S\right) -\gamma \left( S\right) \), where \(\beta \left( S\right) \) is the benefit of the influence spread generated by the seed set S and \(\gamma \left( S\right) \) is the total spread cost incurred by all the nodes activated by S. The goal of the constrained profit maximization problem is to find a seed set S satisfying \(|S|\le k\) that maximizes the profit \(\phi (S)\). We discuss \(\phi (S)\) in more detail below.
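The diffusion process and the profit \(\phi (S)=\beta (S)-\gamma (S)\) described above can be illustrated by a minimal Monte Carlo sketch. It is not the marginal increment computation of this paper; the function names, the dictionary-based graph representation and the number of runs are assumptions made only for illustration.

```python
import random


def simulate_ic(graph, seeds, rng):
    """Run one Independent Cascade diffusion and return the set of active nodes.

    graph maps each node u to a dict {v: p_uv} over its neighbors N_u
    (a hypothetical representation chosen for this sketch).
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p_uv in graph.get(u, {}).items():
                # u has a single chance to activate each still-inactive neighbor v
                if v not in active and rng.random() < p_uv:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def estimate_profit(graph, benefit, cost, seeds, runs=1000, seed=0):
    """Estimate phi(S) as the average of sum(b_v - c_v) over activated nodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        active = simulate_ic(graph, seeds, rng)
        total += sum(benefit[v] - cost[v] for v in active)
    return total / runs
```

Averaging over more runs reduces the variance of the estimate at the price of a longer running time.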

Property 1

\(\phi (S)=\beta (S)-\gamma (S)\) is a non-submodular and non-monotone function in general.
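Property 1 can be checked on a toy instance. The following sketch assumes a hypothetical three-node graph in which every edge has activation probability 1, so that the profit is deterministic; the benefit and cost values are chosen only to exhibit the two violations.

```python
def reachable(graph, seeds):
    """Nodes activated from seeds when every edge succeeds with probability 1."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def profit(graph, benefit, cost, seeds):
    """phi(S) = beta(S) - gamma(S) over all activated nodes."""
    active = reachable(graph, seeds)
    return sum(benefit[v] for v in active) - sum(cost[v] for v in active)


# Hypothetical instance: a and b both activate w with certainty.
graph = {"a": ["w"], "b": ["w"]}
benefit = {"a": 1, "b": 1, "w": 1}
cost = {"a": 1, "b": 1, "w": 2}

phi_empty = profit(graph, benefit, cost, set())
phi_a = profit(graph, benefit, cost, {"a"})
phi_b = profit(graph, benefit, cost, {"b"})
phi_ab = profit(graph, benefit, cost, {"a", "b"})

# Non-monotone: adding a seed lowers the profit.
non_monotone = phi_a < phi_empty
# Non-submodular: phi(A) + phi(B) < phi(A | B) + phi(A & B) for A={a}, B={b}.
non_submodular = phi_a + phi_b < phi_ab + phi_empty
```

In this instance both \(\beta \) and \(\gamma \) are submodular, yet their difference violates both monotonicity and the submodular inequality.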

2.2 Analysis of Profit Function

In this section, we analyze the properties of the benefit function and the cost function, respectively, from the perspective of marginal increment. This analysis helps us reinterpret the profit function.

Marginal Increment. Before discussing more details about benefit function and cost function, we recall some definitions about marginal increment provided in [21, 22] as follows.

Definition 1

Suppose that \(f:2^V\rightarrow R\) is a non-negative set function, where V is the ground set. For any subset A of V, \( \bigtriangleup _{v}f(A)=f(A\cup \{v\})-f(A)\) is called the marginal gain of \(v\in V\backslash A\) at A. In addition, \(\bigtriangleup _B f(A)=f(A\cup B)-f(A) \) is defined as the marginal gain of \(B\subseteq V\backslash A\) at A.

It is well known that a non-negative set function \(f:2^V\rightarrow R^{+}\) defined on the ground set V is monotone if \(f(A)\le f(B)\) for any \(A\subseteq B\subseteq V\), and submodular if \( f(A)+f(B)\ge f(A\cup B)+f(A\cap B)\) for any \(A,B\subseteq V\). Two properties follow easily from the above definitions. First, for a given set \(S\subseteq V \) and any subset \(A\subseteq S\), \(f(S)=f(S\backslash A)+\bigtriangleup _A f(S\backslash A)=f(A)+\bigtriangleup _{S\backslash A} f(A)\) for any set function f. Second, when f is monotone, \(\bigtriangleup _A f(S\backslash A)\le \bigtriangleup _B f(S\backslash B)\) and \(\bigtriangleup _{S \backslash A} f(A)\ge \bigtriangleup _{S \backslash B} f(B)\) for any subsets \(A\subseteq B\subseteq S\).
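The definitions above can be checked by brute force on a small ground set. The sketch below uses a hypothetical coverage function as the set function f; the helper names are assumptions, and the exhaustive enumeration is only practical for very small V.

```python
from itertools import combinations

# Hypothetical coverage instance: node v covers the items in COVER[v].
COVER = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"z"}}


def coverage(S):
    """f(S) = number of items covered by the nodes in S."""
    covered = set()
    for v in S:
        covered |= COVER[v]
    return len(covered)


def subsets(ground):
    """Yield every subset of the ground set as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def marginal_gain(f, A, B):
    """Delta_B f(A) = f(A | B) - f(A)."""
    A, B = frozenset(A), frozenset(B)
    return f(A | B) - f(A)


def is_monotone(f, ground):
    sets = list(subsets(ground))
    return all(f(A) <= f(B) for A in sets for B in sets if A <= B)


def is_submodular(f, ground):
    sets = list(subsets(ground))
    return all(f(A) + f(B) >= f(A | B) + f(A & B) for A in sets for B in sets)
```

For example, with \(S=\{1,2,3\}\) and \(A=\{1\}\), `coverage(S - A) + marginal_gain(coverage, S - A, A)` recovers `coverage(S)`, which is the first of the two properties stated above.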

Benefit Spread Function. We first formulate the benefit spread function from the perspective of marginal increment. Following Tang J. et al. [11, 12], the benefit spread function \(\beta (S)\) of the influence spread generated by a seed set S is the benefit brought by all the activated nodes. Let \( b_{v}^{X} \in \left[ 0,b_{v}\right] \) be the benefit spread of \(v\in V\) with seed node set X, with \(b_{v}^{\emptyset } =0\) for any \(v\in V\). Denote by \(\bigtriangleup b_{v}^{X} \left( u\right) =b_{v}^{X\cup \{u\}}-b_{v}^{X}\) the marginal gain of the benefit on node v when a new node u is selected as a seed node. The following formulas are proposed to calculate \(\bigtriangleup b_{v}^{X}\left( u\right) \), the marginal gain of the benefit spread function at node v.

Property 2

For any \(v\in N_{u}\), the marginal gain is \(\bigtriangleup b_{v}^{X} \left( u\right) =\left( b_{v}-b_{v}^{X}\right) p_{u,v}b_{u}^{X}\), and for any \(w\in N_{v}\), we have

$$\begin{aligned} \bigtriangleup b_{w}^{X} \left( u\right) =\frac{b_{v}- b_{w}^{X}}{1-p_{u,v} b_{v}^{X} } p_{v,w} \bigtriangleup b_{v}^{X}\left( u\right) . \end{aligned}$$
(1)

Furthermore, if a node is reachable from node u, we can update its marginal gain recursively following the topological order, and we define \(\bigtriangleup b_{w}^{X} \left( u\right) =0\) for any node w that is unreachable from node u.
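The idea of propagating values along the topological order can be sketched as follows. This sketch does not implement the closed form of Property 2; it substitutes the standard independence approximation for activation probabilities on an acyclic graph, which is exact when the activation events of the inverse neighbors of a node are independent (for example on a tree). The function name and data layout are assumptions, `benefit` is assumed to contain every node, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def benefit_spread_dag(graph, benefit, seeds):
    """Return {v: b_v^X} on an acyclic graph, processing nodes in topological order.

    graph maps u to {v: p_uv}. The activation probability of a non-seed node v is
    1 - prod(1 - ap(u) * p_uv) over its inverse neighbors u (independence
    assumption, used here as a stand-in for Property 2).
    """
    preds = {v: {} for v in benefit}
    for u, nbrs in graph.items():
        for v, p_uv in nbrs.items():
            preds[v][u] = p_uv

    order = TopologicalSorter({v: set(ps) for v, ps in preds.items()}).static_order()
    ap = {}
    for v in order:
        if v in seeds:
            ap[v] = 1.0
        else:
            stay_inactive = 1.0
            for u, p_uv in preds[v].items():
                stay_inactive *= 1.0 - ap[u] * p_uv
            ap[v] = 1.0 - stay_inactive
    return {v: benefit[v] * ap[v] for v in benefit}


def marginal_benefit(graph, benefit, seeds, u):
    """Delta b_v^X(u) for every node v, obtained as b_v^{X + u} - b_v^X."""
    before = benefit_spread_dag(graph, benefit, set(seeds))
    after = benefit_spread_dag(graph, benefit, set(seeds) | {u})
    return {v: after[v] - before[v] for v in benefit}
```

Nodes that are unreachable from u obtain a marginal gain of 0, in line with the convention stated above.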

Then, we can conclude that the objective function of benefit spread can be expressed as

$$\begin{aligned} \beta \left( X\right) =\sum _{v\in V} b_v^X =\sum _{v \in V}\sum _{u \in X} \bigtriangleup b_v^{X^u} \left( u\right) , \end{aligned}$$
(2)

where \(X^u\) is the set of nodes that have already been activated before node u. For convenience, denote by \(X=\{v_1,v_2,\ldots ,v_{\widehat{k}}\}\) a node set containing all the nodes that can be selected as seeds, with \(\widehat{k} =\left| X\right| \), \(X^k=\{v_1,v_2,\ldots ,v_k \}\) for \(k=1,2,\ldots ,\widehat{k}\), and \(X^0=\emptyset \). Then \(\beta \left( X\right) \) can be rewritten as \( \beta (X)=\sum _{k=1}^{\widehat{k}}\bigtriangleup _k \beta \left( X^{k-1}\right) \), where \(\bigtriangleup _k \beta \left( X^{k-1}\right) =\sum _{v\in V}\bigtriangleup b_v^{X^{k-1}}\left( v_k\right) \). The function \(\beta (X)\) also has the following property.

Property 3

\(\beta (X)=\sum _{v\in V}\sum _{u\in X} \bigtriangleup b_v^{X^u}\left( u\right) =\sum _{k=1}^{\widehat{k}} \bigtriangleup _k \beta \left( X^{k-1}\right) \) is submodular and monotonically decreasing in \(b_{v}^{X^{k-1}}\), for any \(v\in V\) and \(k=1,2,\ldots ,\widehat{k}\).
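The telescoping form \( \beta (X)=\sum _{k=1}^{\widehat{k}}\bigtriangleup _k \beta \left( X^{k-1}\right) \) holds for any ordering of the seeds, which the following sketch illustrates. The toy benefit function (unit benefits, deterministic reach sets) and all names are assumptions made for illustration.

```python
# Hypothetical deterministic instance: REACH[v] is the set of nodes activated by v.
REACH = {
    "v1": {"v1", "a", "b"},
    "v2": {"v2", "b"},
    "v3": {"v3", "c"},
}


def beta(X):
    """Total benefit of the nodes activated by X, with unit benefit per node."""
    activated = set()
    for v in X:
        activated |= REACH[v]
    return len(activated)


def increments(f, order):
    """Return [Delta_k f(X^{k-1}) for k = 1, ..., k_hat] for the ordered seeds."""
    gains, prefix = [], frozenset()
    for v in order:
        extended = prefix | {v}
        gains.append(f(extended) - f(prefix))
        prefix = extended
    return gains
```

Here `increments(beta, ["v1", "v2", "v3"])` and `increments(beta, ["v2", "v1", "v3"])` differ term by term but have the same sum, and the increment of `v2` shrinks once `v1` has been selected, as expected of a submodular function.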

Cost Function. Having formulated and analyzed the benefit spread function, we now turn to the cost spread function. Generally speaking, the cost function changes during the spread process. The cost of influence propagation induced by a seed set X is the total cost incurred by all the activated nodes. Therefore, from the marginal increment perspective, let \(c_v^X\) be the cost spread of \(v\in V\) with seed node set X. Denote by \(\bigtriangleup c_v^X \left( u\right) =c_v^{X\cup \{u\}}-c_v^X\) the marginal gain of the cost on node v when a new node u is selected as a seed node, and for any \(v\in V\), if \(v\notin X\), \(c_v^{\emptyset } =0\). We formulate the cost spread function as

$$\begin{aligned} \gamma \left( X\right) =\sum _{v\in V}c_v^X=\sum _{v \in V}\sum _{u \in X}\bigtriangleup c_v^{X^u}\left( u\right) , \end{aligned}$$
(3)

where \(X^u\) is the set of nodes that have already been activated before node u. Let \(X=\{v_1,v_2,\ldots ,v_{\widehat{k}}\}\) represent the node set that includes all the candidate nodes, with \(\widehat{k} =\left| X\right| \), \(X^k=\{v_1,v_2,\ldots ,v_k \}\) for \(k=1,2,\ldots ,\widehat{k}\), and \(X^0=\emptyset \). Then \(\gamma \left( X\right) =\sum _{k=1}^{\widehat{k}} \bigtriangleup _{k} \gamma \left( X^{k-1}\right) \), where \(\bigtriangleup _{k} \gamma \left( X^{k-1}\right) =\sum _{v \in V}\bigtriangleup c_v^{X^{k-1}}\left( v_k\right) \). The cost function \(\gamma \left( X\right) \) has the following analogous property.

Property 4

\(\gamma (X)=\sum _{k=1}^{\widehat{k}}\bigtriangleup _k \gamma \left( X^{k-1}\right) \) is submodular and monotonically decreasing in \(c_v^{X^{k-1}}\), for any \(v\in V\) and \(k=1,2,\ldots ,\widehat{k}\).

3 Algorithm for Constrained Profit Maximization Problem

We devise MPS, a marginal increment-based two-phase algorithm for the CPM problem. One of the highlights of the MPS algorithm is that, thanks to the marginal increment method, we can compute the benefit spread function and the cost spread function more accurately. In the first phase of MPS, we use the Modified Iterative Prune (MIP) algorithm to reduce the search space. Then, we select seed nodes by one of the two algorithms proposed in Sect. 3.3.

3.1 Marginal Increment Computation

First of all, we consider \(\phi \left( X\right) =\beta \left( X\right) -\gamma \left( X\right) \), where \(\beta (X)\) and \(\gamma \left( X\right) \) are the total benefit and the total cost of all the activated nodes, respectively. From the perspective of marginal increment, for any \(X\subseteq V\) containing only seed nodes, the benefit and cost can be expressed as \(\beta \left( X\right) =\sum _{v \in V} b_v^X =\sum _{v \in V}\sum _{u \in X} \bigtriangleup b_v^{X^u}\left( u\right) \), \( \gamma \left( X\right) =\sum _{v \in V}c_v^X =\sum _{v \in V}\sum _{u \in X} \bigtriangleup c_v^{X^u} \left( u\right) \). We design two algorithms to compute \(\beta (X)\) and \(\bigtriangleup \beta ^{X}\left( u\right) \), respectively. The method for computing \(\gamma (X)\) and \(\bigtriangleup \gamma ^{X}\left( u\right) \) is the same as that for computing \(\beta (X)\) and \(\bigtriangleup \beta ^{X}\left( u\right) \). Existing studies of nonsubmodular optimization suggest several ways to deal with \(\phi \left( X\right) \). It should be emphasized that this profit spread metric can be viewed as the difference between two submodular functions. Therefore, algorithms based on DS decomposition can be applied naturally. We propose a two-phase framework to solve the CPM problem, which consists of a pruning phase and a search phase.

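[Algorithms 1 and 2: computation of \(\beta (X)\) and \(\bigtriangleup \beta ^{X}\left( u\right) \); figures not reproduced]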

3.2 Pruning Phase

[Algorithm 3: Modified Iterative Prune (MIP); figure not reproduced]

Considering the constraints of the PM problem mentioned in [12], the marginal profit gain is bounded below by the smallest benefit gain less the largest cost gain and bounded above by the largest benefit gain less the smallest cost gain. For any node \(v\in V\), if the lower bound of its marginal profit gain is positive, v can be included in an optimal solution of PM. Similarly, a node v cannot be selected in an optimal solution when its marginal profit gain is bounded above by a negative number. Starting with \(A_{0}=\emptyset \), \(B_{0}=V\), \(t=0\), the algorithm in [12] applies this idea iteratively and returns \(A_{t}\) as \(A^*\) and \(B_{t}\) as \(B^{*}\) once they converge. It is proved that for any global maximizer \(S^*\) of the unconstrained profit maximization problem, \(A_{t}\subseteq A_{t+1}\subseteq A^*\subseteq S^*\subseteq B^* \subseteq B_{t+1} \subseteq B_t\) holds for any \(t\ge 0\). Therefore, only the nodes in \(B^*\backslash A^*\) need to be further examined for seed selection. However, when a cardinality constraint is added, the problem becomes more difficult. To make the computation easier, we remove some nodes from V to obtain a node set \(B\subset V\) such that the size of \(S^{*} \backslash B\) is as small as possible. Based on this set, we design the Modified Iterative Prune (MIP) algorithm. Choosing an appropriate start baseline B requires skill, and many of the methods mentioned in [23] deserve further study. In the experiments, we adopt two different strategies and compare their performance.
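The bounds described above can be turned into a short iterative pruning sketch. It covers only the unconstrained idea attributed to [12], with the lower and upper bounds evaluated at \(A_t\) and \(B_t\) under the assumption that \(\beta \) and \(\gamma \) are submodular; it is not the MIP algorithm itself, and the start baseline B and the cardinality constraint are not modeled. All names and the toy instance are hypothetical.

```python
def iterative_prune(ground, beta, gamma):
    """Return (A*, B*) by iterating the lower/upper bound tests until convergence.

    beta and gamma are set functions on frozensets, assumed submodular, so the
    gain of u is smallest at the largest set B \\ {u} and largest at the smallest set A.
    """
    A, B = frozenset(), frozenset(ground)
    while True:
        keep, drop = set(), set()
        for u in B - A:
            rest = B - {u}
            # smallest benefit gain minus largest cost gain
            lower = (beta(B) - beta(rest)) - (gamma(A | {u}) - gamma(A))
            # largest benefit gain minus smallest cost gain
            upper = (beta(A | {u}) - beta(A)) - (gamma(B) - gamma(rest))
            if lower > 0:
                keep.add(u)
            elif upper < 0:
                drop.add(u)
        if not keep and not drop:
            return A, B
        A, B = A | keep, B - drop


# Toy instance: coverage benefit and additive (modular) cost.
COVER = {0: {"a", "b", "c"}, 1: {"a"}, 2: {"d"}}
COST = {0: 1.0, 1: 2.0, 2: 0.5}


def toy_beta(S):
    covered = set()
    for v in S:
        covered |= COVER[v]
    return float(len(covered))


def toy_gamma(S):
    return sum(COST[v] for v in S)
```

On the toy instance, `iterative_prune(COVER, toy_beta, toy_gamma)` converges after one round with \(A^*=B^*=\{0,2\}\), so no node is left for the search phase.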

3.3 Search Phase

Considering the natural DS decomposition form of profit function, we propose two algorithms in the search phase. One practical algorithm is the Marginal Increment Greedy Algorithm shown as Algorithm 4. The other available method is the Improved Modular-Modular algorithm given as Algorithm 5.

Marginal Increment Greedy Algorithm. Given a directed graph \(G=(V,E)\), where V represents the set of users and E the set of their relations, we first run the MIP algorithm to obtain \(A^*\) and \(B^*\). As the above discussion reveals, every node in \(A^{*}\) has a non-negative marginal profit gain and may be selected as a seed node for the CPM problem, while no node \(v\in V\backslash B^{*}\) can be contained in any optimal seed node set. If \(\left| A^{*}\right| \le k\), the seed node set X is initialized as \(A^{*}\); otherwise \(X=\emptyset \). In each iteration, the MIGA algorithm selects the node u with the largest marginal profit gain \(\bigtriangleup \phi ^{X}(u)=\bigtriangleup \beta ^{X}\left( u\right) -\bigtriangleup \gamma ^{X}\left( u\right) \) and adds it to X. The process is repeated until no remaining node has a positive marginal gain or \(\left| X\right| =k\). We also propose another algorithm, which is based on the ModMod procedure and can likewise be combined with the MIP algorithm to solve the CPM problem.
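The greedy selection described above can be sketched as follows, assuming a profit oracle `phi` on frozensets and sortable node identifiers. The function name and the toy profit function are hypothetical; the marginal gain is obtained by direct evaluation of `phi`, not by the marginal increment formulas of Sect. 2.2.

```python
def marginal_increment_greedy(phi, A_star, B_star, k):
    """Greedily select at most k seeds from B* by largest positive marginal profit."""
    X = set(A_star) if len(A_star) <= k else set()
    while len(X) < k:
        base = phi(frozenset(X))
        best_node, best_gain = None, 0.0
        # sorted() only makes tie-breaking deterministic
        for u in sorted(set(B_star) - X):
            gain = phi(frozenset(X | {u})) - base
            if gain > best_gain:
                best_node, best_gain = u, gain
        if best_node is None:  # no remaining node has a positive marginal gain
            break
        X.add(best_node)
    return X


# Toy profit: coverage benefit minus additive cost.
COVER = {0: {"a", "b", "c"}, 1: {"a"}, 2: {"d"}}
COST = {0: 1.0, 1: 2.0, 2: 0.5}


def toy_phi(S):
    covered = set()
    for v in S:
        covered |= COVER[v]
    return len(covered) - sum(COST[v] for v in S)
```

With \(A^*=\emptyset \) and \(B^*=\{0,1,2\}\), the sketch returns \(\{0\}\) for \(k=1\) and stops at \(\{0,2\}\) for \(k=3\), because node 1 has a negative marginal gain.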

[Algorithm 4: Marginal Increment Greedy Algorithm (MIGA); figure not reproduced]

Improved Modular-Modular Algorithm. The ModMod procedure introduced by Iyer and Bilmes in [17] aims to optimize a set function expressed as the difference between two submodular functions. Given that only nodes in \(B^*\) need to be considered for CPM, we can obtain two modular upper bounds of \(\gamma \left( Y\right) \), both tight at a given set \(X^{t}\), for any \(Y\subseteq V\) as

$$\begin{aligned} \gamma \left( Y\right)\le & {} m_{X^{t}}^{1}\left( Y\right) =\gamma \left( X^{t}\right) -\sum _{u \in X^{t}\backslash Y}\bigtriangleup \gamma ^{B^{*}\backslash \{u\}}\left( u\right) +\sum _{u \in Y\backslash X^{t}} \bigtriangleup \gamma ^{X^{t}}\left( u\right) ,\end{aligned}$$
(4)
$$\begin{aligned} \gamma \left( Y\right)\le & {} m_{X^{t}}^{2}\left( Y\right) =\gamma \left( X^{t}\right) -\sum _{u \in X^{t}\backslash Y}\bigtriangleup \gamma ^{X^{t}\backslash \{u\}}\left( u\right) +\sum _{u \in Y\backslash X^{t}} \bigtriangleup \gamma ^{\emptyset }\left( u\right) . \end{aligned}$$
(5)

For convenience, we let \(m_{X^{t}} \left( Y\right) \) denote either of the above two tight upper bounds, i.e., \(m_{X^{t}} \left( Y\right) \) stands for \(m_{X^{t}}^{1} \left( Y\right) \) or \(m_{X^{t}}^{2} \left( Y\right) \). Let \(\pi \) be any permutation of V that places all the nodes in \(X^{t}\subseteq V\) before the nodes in \(V\backslash X^{t}\). Let \(S_i^{\pi }=\{\pi \left( 1\right) ,\pi \left( 2\right) ,\ldots ,\pi \left( i\right) \}\) be a chain formed by the permutation, where \(S_0^{\pi }=\emptyset \) and \(S_{|X^{t}|}^{\pi }=X^{t}\). Define \(h_{X^{t}}^{\pi } \left( \pi \left( i\right) \right) =\beta \left( S_{i}^{\pi } \right) -\beta \left( S_{i-1}^{\pi } \right) \). Then, \(h_{X^{t}}^{\pi } \left( Y\right) =\sum _{v \in Y}h_{X^{t}}^{\pi }\left( v\right) \) is a lower bound of \(\beta \left( Y\right) \) for any \(Y\subseteq V\), and it is tight at \(X^{t}\).
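The two modular upper bounds of Eqs. (4) and (5) and the permutation-based lower bound can be verified numerically on a small instance. The sketch below assumes a generic submodular set function f with \(f(\emptyset )=0\), uses the whole ground set in place of \(B^*\), and picks one particular permutation (sorted order within each part); the names and the coverage instance are hypothetical.

```python
from itertools import combinations


def gain(f, S, u):
    """Marginal gain of adding u to S."""
    S = frozenset(S)
    return f(S | {u}) - f(S)


def modular_upper_1(f, ground, X, Y):
    """Bound of Eq. (4): removal gains at ground \\ {u}, addition gains at X."""
    X, Y, ground = frozenset(X), frozenset(Y), frozenset(ground)
    removed = sum(gain(f, ground - {u}, u) for u in X - Y)
    added = sum(gain(f, X, u) for u in Y - X)
    return f(X) - removed + added


def modular_upper_2(f, X, Y):
    """Bound of Eq. (5): removal gains at X \\ {u}, addition gains at the empty set."""
    X, Y = frozenset(X), frozenset(Y)
    removed = sum(gain(f, X - {u}, u) for u in X - Y)
    added = sum(gain(f, frozenset(), u) for u in Y - X)
    return f(X) - removed + added


def modular_lower(f, ground, X, Y):
    """Permutation-based lower bound with the nodes of X placed first."""
    X, ground = frozenset(X), frozenset(ground)
    perm = sorted(X) + sorted(ground - X)
    weight, prefix = {}, frozenset()
    for v in perm:
        weight[v] = gain(f, prefix, v)
        prefix = prefix | {v}
    return sum(weight[v] for v in Y)


COVER = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"z"}, 4: {"w", "x"}}


def coverage(S):
    covered = set()
    for v in S:
        covered |= COVER[v]
    return len(covered)


def all_subsets(ground):
    items = sorted(ground)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]
```

Enumerating `all_subsets(COVER)` confirms on this instance that the three modular functions bound `coverage` from above and below and coincide with it at \(X^{t}\).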

[Algorithm 5: Improved Modular-Modular (IMM) algorithm; figure not reproduced]

4 Experiments

In this section, we conduct experiments on three data sets to test the effectiveness of the MPS algorithm for optimizing \(\phi \), and compare it with other algorithms. When the cost function represents the total cost of the seed nodes only, i.e. \(\phi \left( X\right) =\sum _{v \in V} b_{v}^{X}-\sum _{u \in X} c\left( u\right) \), we can view it as a special case of \(\phi \left( X\right) \).

4.1 Experiment Setup

A synthetic graph and two real-world social graphs are used in our experiments; they are described below. These social graphs represent a wide variety of relationships.

Synthetic: This is a relatively small acyclic directed graph randomly generated with 2708 nodes and 5278 edges.

Facebook: The Facebook data set consists of 4039 users and 88234 edges collected from survey participants and has been anonymized.

Wikipedia: The Wikipedia data set is generated by a voting activity and contains 7115 nodes and 103689 edges. Nodes in the network represent Wikipedia users, and a directed edge from node i to node j represents that user i voted on user j.

All the data sets come from the Stanford Large Network Dataset Collection and contain only the relationships between pairs of nodes. We use the Independent Cascade propagation model. For ease of comparison, we make the following assumptions. For the propagation probability, we use the trivalency model in [21], selecting a value from \(\{0.1, 0.01, 0.001\}\) at random. The profit of each node is set to 1. The strategies we use in the experiments include:

Random: It randomly selects k nodes. We run the algorithm 10 times and take their average value as the expected profit.

MaxDegree: We select top k nodes according to their degree.

MIGA: We conduct the MIGA algorithm where \(A^{*}=\emptyset \) and \(B^{*}=V\).

MPS with \(B^{1}\): It includes two phases. The MIP algorithm is conducted first and MIGA is carried out subsequently. In the pruning phase, all the nodes are sorted according to their degree and the top 500 nodes are included in B.

MPS with \(B^{2}\): This algorithm is similar to MPS with \(B^{1}\) above, except that each node in the start baseline B has at least two incident edges.
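The trivalency assignment and the degree-based choices used by the strategies above can be sketched as follows. The paper does not state whether degree means in-degree, out-degree or total degree; the sketch assumes total degree, and the function names, the edge-list representation and the tie-breaking rule are also assumptions.

```python
import random

TRIVALENCY = (0.1, 0.01, 0.001)


def assign_trivalency(edges, seed=0):
    """Assign each directed edge a probability drawn uniformly from TRIVALENCY."""
    rng = random.Random(seed)
    return {(u, v): rng.choice(TRIVALENCY) for (u, v) in edges}


def total_degree(edges):
    """Total (in + out) degree of every node that appears in an edge."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def top_degree_nodes(edges, m):
    """Top-m nodes by degree (MaxDegree with m = k, baseline B^1 with m = 500)."""
    deg = total_degree(edges)
    return sorted(deg, key=lambda v: (-deg[v], v))[:m]


def min_degree_nodes(edges, d=2):
    """Nodes with at least d incident edges (baseline B^2 with d = 2)."""
    return {v for v, n in total_degree(edges).items() if n >= d}


def random_seeds(nodes, k, rng):
    """Random strategy: k distinct nodes chosen uniformly at random."""
    return rng.sample(sorted(nodes), k)
```

For the Random strategy, `random_seeds` would be called once per repetition and the resulting profits averaged, as described above.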

The algorithms are all implemented in MATLAB and the experiments are carried out on a machine with an Intel Core i5-6200U 2.30 GHz CPU and 8 GB memory. The running time of the MPS algorithm is highly sensitive to the choice of B, and under the above setup the MPS algorithm takes several hours on average.

4.2 Experiment Results

In this section, we evaluate the effectiveness of MPS. First, we compare the effect of two different start baselines. Then, we show the relation between cost and profit spread value. Finally, we compare MPS with other seed node selection algorithms.

Fig. 1. Profit spread value versus different prune start baselines

Fig. 2. Profit spread value versus different costs

Fig. 3. Comparison with other methods

In Fig. 1, we can see that different start baselines affect the effectiveness of the MPS algorithm differently. Owing to the different definitions of \(B^{1}\) and \(B^{2}\), more potential seed nodes are inevitably excluded from \(B^{1}\), resulting in a smaller profit spread value on Facebook and Wikipedia. In Fig. 2, the rate of change of the profit spread value decreases as the cost increases. Judging from this result, the constraint on the size of the seed node set is important when the cost is not small. Fig. 3 illustrates that MPS outperforms the other algorithms except MIGA, and the difference between MIGA and MPS with \(B^{2}\) is small. It should be noted that the difference is smaller in the Wikipedia network than in the Facebook network. Therefore, MPS may perform better in a network that contains more nodes and edges. We can also conclude that MPS will perform better with a better start baseline \(B_{0}\).

5 Conclusion and Future Works

In this paper, we have studied the CPM problem and formulated it from the perspective of marginal increment. Given that the objective function of the CPM problem lacks submodularity in general, we design the MPS algorithm to optimize the profit function. In the first phase, the MIP algorithm is used to reduce the search space. The MIGA and IMM algorithms are devised in the second phase to select at most k seed nodes. Based on the marginal increment method, these algorithms calculate the profit function as accurately as possible. Experimental results show that our MPS algorithm substantially outperforms some other algorithms, and that it also performs well with a submodular profit metric after a few adjustments.

Several research directions for the CPM problem deserve further study. The first and most important problem is how to select the baseline of the MIP algorithm. The set B in the MIP algorithm directly affects the algorithm's efficiency and performance, and the MPS algorithm is feasible in large-scale social networks given an ideal B. Therefore, the properties of an ideal baseline B need to be studied. Beyond that, it would be significant to investigate whether the MPS algorithm can achieve a constant approximation ratio.