1 Introduction

In social network analysis (SNA) (Pinheiro 2011), the social relationships are researched by network theory. Nodes are used to represented individual actors. Connections or links are used to represented relationships between the individuals, such as friendship, kinship, organizational position, sexual relationships, etc. Compared with traditional social network, modern social network has more form, like websites networks, blog networks, citation networks, collaboration networks, product co-purchasing networks, road networks, e-mail networks, etc. Although SNA has been studied a lot, mining the underlying relationship of the network is still difficult.

In recent years, social marketing techniques has been used to increase brands or products awareness (Hartline et al. 2008; Kempe et al. 2003; Domingos and Richardson 2001; Leskovec et al. 2007; Richardson and Domingos 2002). Comparing with traditional marketing techniques, word-of-mouth marketing is becoming more and more useful in recent year (Brown and Reinegen 1987; Goldenberg et al. 2001; Mahajan et al. 1999), which also needs to understand the social network more clearly.

In this paper, we try to model the message transmitting process in social network as wave propagation process. In a social network, each node indicates a candidate (person, animal, website etc.), and each edge is defined as the relationship between them. We let the earliest message sources act as the wave sources and model the message transmitting process as wave propagation process. The wave propagation process provides our work with the following contributions: (1) As the wave propagation process has time-dependent property, our model can simulate the message transmitting process step by step. (2) Message broadcasting algorithm can choose suitable persons to broadcast a message all over the network. (3) For there are many rumors in the world, we give a rumor transmitting algorithm to find the initial rumor makers.

The paper is organized as follow. In Sect. 2 we talk about some related work in mining social networks. In Sect. 3 we discuss the wave propagation on networks more clearly. Section 4 denotes two kinds of applications of our model, including message broadcasting problem and rumor transmitting problem. In Sect. 5 we discuss the complexity and stability of our two algorithms. Section 6 gives the empirical results of the two. Finally, Sect. 7 gives our conclusion.

2 Related work

In Rogers’s book “Diffusion of Innovations” Rogers (2003), the concept of diffusion in the social network was first proposed. With successive grouping of consumers adopting the new technology, Rogers makes his well-known S curve, as early adopters adopt an innovation first, then early majority, late majority, and the laggards adopt it last.

With the rapid development of the Internet industry, the great need of searching give birth to PageRank algorithm (Brin and Page 1998), which can measure the relative importance for the element of a hyperlinked set of documents. Though PageRank has been used successfully, it is something complex.

In some recently studied, heat diffusion process (Ma et al. 2008) has been used in the study of mining social networks. They use heat diffusion process to simulate the diffusion of the message and can model the diffusion of messages well. But physicians have proved that the inverse process of heat diffusion process is unstable, it limits the use of heat diffusion method in mining social networks.

3 Wave propagation on social networks

3.1 Propagation on undirected social networks

An undirected social network can be described as a graph G=(V,E), where V is the vertex set representing persons in the network, and V={v 1,v 2,…,v n }. E is the set of edges representing the connections between persons, and E={(v i ,v j )|an edge from v i to v j exists}.

Here we makes an analogy, the message transmitting process is similar to the wave propagation process. Then the message sensitivity of each person in the social network can be viewed as an amplitude function f i (t) associated to each node in the graph. The value f i (t) describes the amplitude of node v i at time t, which begins at time zero with the initial distribution of amplitudes given by f i (0). The F(t)=[f 1(t),f 2(t),…,f n (t)]T denotes the amplitude vector function with all the f i (t) as its constituents.

By using the analogy above, the message transmitting process can be viewed as an wave propagation process in some special area, i.e. the network. We build the model as follow. At time t, each node i receives an amount of amplitude from all its neighbor j during a time step Δt. To simplify the problem, we notice the received amplitude of node i should be proportional to the time step Δt and the amplitude differences between it and all its neighbors f j (t)−f i (t). Based on this simplification, we assume the receiving amplitude of node i from node j at time t is c 2(f j (t)−f i (t))Δt, where c is the wave velocity and Δt is the time step. As the total amount of amplitude received by node i is from all its neighbors, we can write it as:

$$ \frac{\partial^2f_i(x,t)}{\partial t^2}=c^2\sum_{j:(v_i,v_j)\in E} \bigl(f_j(t)-f_i(t)\bigr) $$
(1)

where f i (t) is the amplitude of node i at time t, and E is the set of edges.

To find a closed solution of (1), we express it in a matrix form:

$$ \frac{\partial^2F(t)}{\partial t^2}=c^2HF(t) $$
(2)

where F(t)=[f 1(t),f 2(t),…,f n (t)]T and H is the discrete form of Laplace-Beltrami operator:

$$ H_{ij}= \left\{ \begin{array}{@{}l@{\quad}l} 1{:} & (v_i,v_j)\in E\ \mbox{or}\ (v_j,v_i) \in E \\ -d_i{:} & i=j \\ 0: & \mathrm{else} \end{array} \right. $$
(3)

While (2) is a simple form, it is hard to calculate. We here make an approximation method and derive an iterative form.

For the second derivative of function f i (t), we can make an approximation:

$$ \frac{\partial^2f_i(x,t)}{\partial t^2}=\frac{f_i(t+\Delta t)-2f_i(t)+f_i(t-\Delta t)}{\Delta t^2}+O \bigl(\Delta t^2\bigr) $$
(4)

Substitute (4) into (1) and get:

$$ f_i(t+\Delta t)-2f_i(t)+f_i(t- \Delta t)=c^2\Delta t^2\sum_{j:(v_i,v_j)\in E} \bigl(f_j(t)-f_i(t)\bigr) $$
(5)

Using the approximation above, we derive an iterative form:

$$ F(t+\Delta t)=c^2\Delta t^2HF(t)+2F(t)-F(t-\Delta t) $$
(6)

where d i denotes the degree of node v i .

In order to describe the wave propagation process more clearly, we set an example here. The graph in the example includes five nodes (see Fig. 1). Initially, suppose node 1 is the source with amplitude 1, then the vector F(0)=(1,0,0,0,0)T. The matrix H is:

$$H=\left( \begin{array}{@{}c@{\quad}c@{\quad}c@{\quad}c@{\quad}c@{}} -1 & 1 & 0 & 0 & 0 \\ 1 & -4 & 1 & 1 & 1 \\ 0 & 1 & -2 & 0 & 1 \\ 0 & 1 & 0 & -2 & 1 \\ 0 & 1 & 1 & 1 & -3 \end{array} \right) $$

The graph is shown in Fig. 1.

Fig. 1
figure 1

Nodes connections

Without loss of generality, we set the wave velocity c=1, and the time step Δt to be 0.01. Then the amplitude of each node is shown in Fig. 2.

Fig. 2
figure 2

Amplitude of all nodes

Now we depict Fig. 1 and Fig. 2 in the view of message transmitting process in social network. Node 1 is the source, i.e. the message publisher, node 2 is the acquaintance of node 1 and other nodes are all acquaintances of node 2. At first, the amplitude of node 2 increases quickly and that of node 1 decreases, indicating node 1 sends a message to 2. Then 2 send it to others. The figure shows the trend of message sensitivity with time among these persons.

3.2 Propagation on directed social networks

In many real social networks, the status between two persons is not equivalent. For example, in army, lieutenant can command solder, while the opposite is impossible, i.e. the connection is single-directed. Similar to the analysis at Sect. 3.1, we can adopt the wave equation to discuss the propagation on directed social networks.

For a directed graph G(V,E), any edge (v i ,v j ) presents a single-directed connection from node v i to node v j . At time t, each node v i can receive amplitude from all its neighbors who have a connection to it. Supposing this process has three characters: (1) the receiving should be proportional to the time step Δt; (2) the receiving should be proportion to the amplitude of its neighbors; (3) if there’s no neighbor, the receiving is 0. According to the assumptions, the receiving is \(c^{2}\sum_{j:(v_{j},v_{i})\in E}{\theta_{j}f_{j}(t)\Delta t}\), where θ j is the transmitting percentage coefficient of node v j .

For the sending, we also suppose it satisfies the following assumptions: (1) the sending should be proportional to the time step Δt; (2) the sending should be proportional to the current amplitude of node v i ; (3) each node has the same ability to transmit which denotes \(\theta_{j}=\frac{1}{d_{j}}\). According to the assumptions, the sending is c 2 f i (tt, and each of its neighbor receives \(\frac{c^{2}f_{i}(t)\Delta t}{d_{i}}\). Thus we get:

$$ \frac{\partial^2f_i(x,t)}{\partial t^2}=c^2\biggl(-\beta_{i}f_i(t)+ \sum_{j:(v_j,v_i)\in E}{\frac{f_j(t)\Delta t}{d_j}}\biggr) $$
(7)

where β i is a sign function to judge whether node v i has any out-links: if has, β i =1, else β i =0.

Similar to the analysis of undirected social network in Sect. 3.1, we can derive the iterative form:

$$ F(t+\Delta t)=c^2\Delta t^2HF(t)+2F(t)-F(t-\Delta t) $$
(8)

where H satisfies:

$$H_{ij}=\left\{ \begin{array}{@{}l@{\quad}l} \frac{1}{d_j} & (v_j,v_i)\in E \\[0.5mm] -\beta_i & i=j \\[0.5mm] 0 & \mathrm{else} \end{array} \right. $$

4 Underlying relationships finding

In every social network, the underlying relationships are usually hidden but important. For example, in an e-mail network, there may be some key persons who always send e-mails to others. In some aspects, if we can know who these persons are, we can make use of them to send e-mails and influence other company members more efficiently. Though we have mastered all the connections between each pairs of persons, we don’t know who the key persons are. To finding these key persons, we must know all the underlying relationships in the network first, i.e. the underlying influencing scope of all the persons.

Seeing all the persons as nodes and the connections between each other as edges, we can build the social network. By associating influencing node sets to each node, the underlying influencing scope of persons indicates all these sets. Then our task is to find these sets, and that can be done as follow: (1) set an initial amplitude on node v i and others 0; (2) simulate the wave propagation process in the network; (3) after a long time, statistic the number of nodes which v i successfully influences. Where successfully influence for a node indicates the amplitude of it is higher than a given adoption threshold γ during the simulation.

4.1 Message broadcasting

The Message Broadcasting problem often comes from real situations. For example, in a company, an important decision must be broadcasted to all the members as soon as possible. This thing is usually done as follow: the CEO inform some directors, and the directors inform their managers, and then the managers inform others. For the CEO, what he should do is just selecting k well-chosen directors to broadcast his decision. And how to choose these directors is the problem.

This can be seen as an application of underlying relationship finding. First, we find the influencing node sets of all the nodes. In this process, we set an adoption threshold γ. If at some time t, the amplitude of node j is greater than γ, then we called node j has been successfully influenced. Then we choose k nodes whose influencing node sets is the biggest, and they’re our “well-chose”. Note that the social network we study here has finite members, so no boundary condition is necessary. We use greedy algorithm (Fig. 3) to deal with this problem.

Fig. 3
figure 3

Flowchart of Message Broadcasting problem

4.2 Rumor spreading

The Rumor Spreading problem is arising more and more attention this decade (Fountoulakis and Panagiotou 2010), especially since the tragedy of 9.11. After 9.11, more and more rumors came up and disturb our normal life, which should be controlled. When a rumor come up, who made it at first is always important. That is, we want to know who is the rumor maker.

After modeling the message transmitting process as wave propagation process (see Sect. 3), this problem can be described as: given current state of amplitude in the network, we want to know what the initial state is. We can use the inverse process of wave propagation process to study this problem.

For a undirected graph, according to (6), we can solve the inverse process by:

$$ F(t-\Delta t)=c^2\Delta t^2HF(t)+2F(t)-F(t+\Delta t) $$
(9)

Given the current state of amplitude and then this iterative form can solve the inverse process. Here boundary condition is also unnecessary. But one problem is that we don’t know how long the wave has already propagated. In the view of calculation, we don’t know when to terminate.

To decide the termination criterion, we make a natural assumption that when a rumor came at first, the number of the persons who knew it is the least. And in the iterative process, we can do this by adding an source-like threshold σ to judge the source node. If at some time t the amplitude of a node i is greater than σ, then it may be a source. Then what we should do is to see when the number of source-like nodes is least and terminate our iteration.

5 Complexity and stability

Supposing a social network has N nodes and M edges, we discuss the complexity of each algorithm:

  1. 1.

    Message Broadcasting

    For this problem, our algorithm has two part. First, we simulate the wave propagation process and find the influencing node sets of each node, which costs O(N 2 T), where T is the simulation time. Then we use the greedy algorithm to choose k nodes. In greedy algorithm, this part just cost linear time O(kN). So the total complexity is O(N 2 T+kN).

  2. 2.

    Rumor Spreading

    The Rumor Spreading problem needs us to solve the inverse problem of wave propagation. For each time step, our iterative form costs O(N 2) and the termination criterion judgement costs O(N). In general, the total complexity is O(N 2 T+NT), where T is also the simulation time.

In our model, we mainly use wave propagation process and its inverse process. For the wave propagation process, the iterative form (scheme (6)) is always stable. For its inverse, we use scheme (9) to calculate. In scheme (9), function F defined on nodes of network is approximated by linear interpolation functions. Consulting the definition about accuracy of finite volume method, we can also say that scheme (9) has first order temporal accuracy. Since the length of edges equals 1, so the stable condition Courant-Friedrichs-Lewy condition (Subramani and Rajagopalan 2003) is cΔt<1. By setting our simulation time step Δt<1, our iterative form can get a stable solution.

6 Empirical analysis

Lots of data sets can be used in our model, and here we use an e-mails dataset of a company. In the empirical analysis, we discuss several paraments first, and then set them in our algorithm to solve the message broadcasting problem and rumor transmitting problem.

6.1 Data set

Here we use the network of Enron (Leskovec et al. 2009; Klimmt and Yang 2004). The network has 36692 nodes and 183831 undirected edges. Nodes of the network are email addresses and if an address i sent at least one email to address j, the graph contains an undirected edge from i to j. Note that non-Enron email addresses act as sinks and sources in the network as we only observe their communication with the Enron email addresses.

6.2 Results of message broadcasting problem

Given Enron network, we have to decide several parameters in our algorithm first. In our simulation, we let the wave velocity c=1 and the adoption threshold γ=0.01 while the initial amplitude of a source node is f s (0)=1. As we have discussed above, if at some time step t s node’s amplitude is greater than γ, we think it has been successfully influenced. Besides, we use Δt=0.01 as the simulation time step and we use 3 different simulation time T=0.1,0.2,0.5. The empirical results are shown in Table 1 and Fig. 4.

Fig. 4
figure 4

Coverage with nodes number in different time period

Table 1 Nodes ID and coverage in different time period

Where coverage indicates the number of influenced nodes at time T.

In Table 1 we notice at early time, the message transmits quite slowly, but the transmiting speed goes up explosively with time. Take 7 nodes for example, when T=0.1, it can just cover 6680 nodes (%18.21), while T=0.5, it goes up to 25298 nodes (%68.94). Fig. 4 shows the coverage size at different time with different number of source nodes.

6.3 Results of rumor transmitting problem

In our algorithm, current state of all nodes must be given first. We give it by assuming some nodes to be the source (here we choose node 28521 and 28522), and then simulate the wave propagation process for some time period and let the current state of amplitude to be the input. As for the source-like threshold σ, we let σ=0.001. If at some time step, one node’s amplitude is greater than σ, then it may be a source. Other parameters like c and Δt are the same as in Sect. 6.2. When the number of source-like nodes is least, we terminate our algorithm.

Our empirical results are shown in Fig. 5. For clarifying, we omit the edges in all graphs.

Fig. 5
figure 5

Empirical results of Rumor Transmitting problem

Where red nodes indicate the source-like nodes while grey not. The upper left indicates the current state (29901 red nodes), the upper right (18741 red nodes) and lower left (6017 red nodes) indicate two middle states, and the lower right (2 red nodes) indicates the terminated state. We also plot the tiny structure (red box in terminated state) in Fig. 6. Using our algorithm, we successfully find the rumor makers(node 28521 and 28522).

Fig. 6
figure 6

Tiny structure of terminated state

7 Conclusion

In this paper, we try to build a framework of mining social networks using Wave Propagation models. We mainly discuss two Wave Propagation models and two applications of our framework, including attempts to solving message broadcasting problem and rumor spreading problem. The complexity analysis shows our framework is scalable for large networks, and we can guarantee the stability of our algorithm by suitable setting some parameters. And the empirical analysis of Enron e-mail dataset shows that our work is promising.