1 Introduction

With the rapid development of online social networks, they have become the carrier of social relationships and message propagation. More and more people participate in information propagation and information access through online social networks. Social software has therefore become an important tool in online social networks: as more and more people use it to chat with each other and express their individual viewpoints, the number of users and of exchanged messages increases sharply. For example, social software such as Facebook, Twitter and Weibo provides many opportunities to exchange individual information (node attributes), including friend records, individual hobbies and location information. With its huge scale, complex structure and huge amount of information, the online social network has become a research hot spot. Many scholars focus on the relationship among online social networks, off-line society and economics. For example, by analyzing hierarchical structure, socially related attributes and message propagation methods, we may learn how social networks influence people’s living conditions, lifestyle and social relationships, for instance through public opinion propagation, false message propagation and network crime.

Online social networks currently provide many service applications and can deliver different information contents to users in a timely manner. Because personal attributes and information content are related, the interactions among users dynamically form, change and influence the structure of the social network. An online social network consists of a number of online social organizations and individuals. An online community can be seen as one such organization, and the more nodes (users) a community contains, the more social and active we may consider it to be. Online communities are gradually being integrated into people’s daily life and play an important role: they provide a fixed social circle for information propagation and sharing, in which users express personal requirements and relationships. Research on overlapping communities and community detection is therefore of great significance to the analysis of the structure, function and characteristics of online social networks (Newman and Clauset 2015). The related research is mainly carried out in different directions, such as member influence, user interest degree and structure models. In addition, with the rapid development of the Internet, the ways of information propagation have changed greatly. Information propagation can be seen as an interactive behavior among people, characterized by many-to-many relationships, real-time delivery and speed. In online social networks, information propagation is influenced by a variety of factors, such as network structure, node characteristics and information content. Research on information propagation is thus of great significance for locating target groups, protecting user privacy, monitoring Internet public opinion and so on (Li et al. 2015).

In online social networks, node attributes and node wishes (or personal willingness) are important factors that influence community structure and information propagation. Personal willingness describes the subjective initiative of a node (user) to exchange information with the outside world, and it takes full account of the related node attributes. The greater the personal willingness, the more willing the corresponding user is to communicate with the outside world, and the more likely the user is to join the corresponding community. Taking personal willingness into account may therefore reduce the probability of generating large-scale communities, which improves the accuracy and reliability of community detection and increases the stability of the community structure.

2 Our contributions

In this paper, we propose a social community detection and message propagation scheme based on personal willingness in social network. In the proposed scheme, the social community detection algorithm uses modularity degree to detect social communities with interest degree and personal willingness, and the message propagation method introduces edge feature and node feature and then constructs willingness vector. The main contributions of this paper are as follows:

(1):

The social community detection algorithm extracts node attributes and then uses modularity degree, interest degree and personal willingness to detect social communities. The algorithm detects community structure based on community willingness and personal willingness. Because it gives priority to the initiative of nodes during community detection, it reduces the probability of generating large-scale communities.

(2):

The message propagation method is based on the exponential model, which constructs feature vector by edge feature and node feature, willingness vector by personal willingness and community willingness, and related basic relationship by propagation probability and propagation delay. Because the model constructs willingness vector related to the initiative of nodes, the model is conducive to the stability and continuity of information propagation.

(3):

We carry out experiments on both community detection and message propagation to demonstrate the effectiveness of incorporating personal willingness: (a) in community detection, the experiments show that the greater the personal willingness, the more willing the corresponding user is to communicate with the outside world and the more likely the user is to join the corresponding community; thus, the proposed algorithm may reduce the probability of generating large-scale communities, which improves the accuracy and reliability of community detection and increases the stability of the community structure; (b) in message propagation, the experiments show that personal willingness influences the construction of the message propagation model: the greater the personal willingness, the more willing the corresponding node is to propagate messages.

3 Related work

At present, with the development and evolution of social networks, they have become a research hot spot. Related research on social networks mainly includes overlapping community discovery (Devi and Poovammal 2016), community detection (Shang et al. 2016), privacy protection (Buccafurri et al. 2016), message propagation (Liu and Li 2016) and social networking (Guo et al. 2016). In this paper, we focus on community detection and message propagation.

3.1 Community detection

Currently, the community detection algorithms (Shen et al. 2009; Blondel et al. 2008; Ahn et al. 2010; Shi et al. 2013; Von Luxburg 2007; Whang et al. 2013; Lancichinetti et al. 2009; Pons and Latapy 2006; Raghavan et al. 2007; Zhao et al. 2016) mainly use different methods to divide the online social network structure, which include (1) graph-based partitioning algorithms (Shen et al. 2009), (2) module degree algorithms (Blondel et al. 2008), (3) edge clustering algorithms (Ahn et al. 2010; Shi et al. 2013), (4) hierarchical clustering algorithms (Von Luxburg 2007), (5) seed dispersal methods (Whang et al. 2013; Lancichinetti et al. 2009), (6) random walk algorithms (Pons and Latapy 2006) and (7) label propagation methods (Raghavan et al. 2007; Zhao et al. 2016). Also, as social network research has deepened, many scholars consider that node attributes should be added to community detection. Steinhaeuser and Chawla (2010) proposed a community detection algorithm to classify communities, which is based on weighted edges and the similarity of node attributes. They recommended implementing their method with the random walk approach. Dang and Viennet (2012) proposed a community detection method based on the Louvain algorithm, which uses a weighted sum of module degree and the similarity of node attributes. Kewalramani (2011) used the similarity of metadata (based on correlation) to detect communities in Twitter by means of clustering. Deitrick and Hu (2013) performed sentiment analysis on the Twitter content that users send over a period of time, and they improved the efficiency of the related community detection method. Xu et al. (2016) analyzed community detection and friendship prediction in mobile social networks and proposed a method of constructing community structure based on combination entropy. Guo et al. (2016) proposed a relation-weight-clustering model to construct a network of Twitter users, where their model takes the users’ “@” and “RT@” behaviors into account. 
Tagarelli et al. (2017) proposed a novel modularity-driven ensemble-based approach to multilayer community detection, where it may find consensus community structures that not only capture prototypical community memberships of nodes, but also preserve the multilayer topology information and optimize the edge connectivity in the consensus via modularity analysis. Amelio and Pizzuti (2016) proposed a framework for community discovery in temporal multiplex networks by extending the evolutionary clustering approach to encompass both time and multiple dimensions. In their extended framework, the problem of finding community structures for time-evolving networks with multiple types of ties is reformulated by adding the concept of dimensional smoothness. Sun and Lin (2013) proposed a probabilistic generative model to detect latent topical communities among users, where social tags and resource contents are leveraged to model user interest in two similar and correlated ways. Their primary goal is to capture user tagging behavior and interest and to discover the emergent topical community structure. Jaho et al. (2011) proposed a framework for interest similarity-based community detection in social networks, where nodes are clustered according to common interests. Their proposed framework detects communities over weighted graphs, where graph edge weights are defined based on measures of similarity between nodes’ interests in certain thematic areas. Hutair et al. (2017) proposed a novel algorithm that clusters the nodes in social networks into communities based on their geodesic location and the similarity between their interests. Yang et al. (2011) proposed a node interest similarity method-based P2P trust model, which takes both node interest bias and reputations in each interest domain into consideration and uses interest domain reputation vector to maintain the behaviors of node in specific interest domain. 
Their proposed method uses interest similarity between nodes to weight domain local trust recommendation. The above-mentioned papers use only node interest as a measure to construct algorithms or models for detecting communities. Because these works do not analyze user behaviors to divide the behavior attributes of users at a finer granularity, personal willingness is not considered in the existing related works. Compared with node interest, personal willingness describes the subjective initiative of a user to exchange information with the outside world: the greater the personal willingness, the more willing the corresponding user is to communicate with the outside world. In this paper, we therefore focus on how personal willingness influences community detection and message propagation.

3.2 Message propagation

Research on message propagation mainly focuses on modeling online social networks. The early models include the independent cascade model (Goldenberg et al. 2001) and the linear threshold model (Young 2000). Based on the independent cascade model, Kempe et al. (2003) proposed a decreasing cascade model. Centola (2010) analyzed the impact of the healthy network structure on message propagation and concluded that information spreads faster and farther in well-clustered networks. Liu and Li (2016) proposed a novel data propagation scheme to maximize the data propagating rate under a limited overhead of propagating messages. In their scheme, a time-consistent Markov model is designed to analyze the interest transformation of each neighbor node in order to decide when to forward the message and which node to select for forwarding it further. Also, two utility functions are studied to evaluate the service ability of nodes with different interests in messages. Yang and Counts (2010) constructed another model to capture three main characteristics of information propagation: speed, scale and scope. They found that the factors affecting the three characteristics are the user’s related attributes and information propagation labels, and they also gave a quantitative measure of the three characteristics. Lagnier et al. (2013) proposed a linear threshold model based on decaying reinforced user-centric. The model can predict how information spreads in online social networks, and its influence factors include the interest of users, the influence of adjacent users and the propagation willingness of users. Based on the asynchronous independent cascade (AsIC) model, Saito et al. (2011) constructed the target function on propagation probability and the attribute vector of neighboring nodes and then built a maximum likelihood model to solve for the propagation rate and delay parameter. Spiro et al. 
(2012) proposed a time-based model on the Twitter platform, which gives a statistical analysis of message propagation. They considered that the factors influencing delay are: (1) relevant attributes of the user, such as the user’s authority and activity level; (2) relevant attributes of the information, such as message label, URL and trending search events. Ouadrhiri et al. (2017) proposed a message propagation control model under the epidemic routing protocol in delay-tolerant networks. They model message propagation under the epidemic routing protocol by an ordinary differential equation, and they derive the optimal retention of a message by a node while taking into account that all nodes are infected. Their simulation results show that the proposed model can reach the same performance as epidemic routing while minimizing resource consumption. Kim and Yoo (2012) examined the role of sentiment in information propagation. They make use of political communication in the Twitter space and relate emotion expressions in a message to the degree of response generated by the message. They also compare differences between user reply and retweet behavior with respect to sentiment variables. Itakura and Sonehara (2013) highlighted the importance of Twitter’s mention function as another method of message propagation. In their work, the graphs constructed from Twitter’s retweet, mention and reply functions show structural differences, which suggest that the mention function is the most efficient method of reaching a mass audience.

4 Preliminaries

Let a weighted and directed graph G(V, E) represent the relationships of users in an online social network, where V is the set of nodes and E is the set of edges, which denotes the messages transferred between users. The number of nodes is \(k=|V(G)|\) and the number of edges is \(n=|E(G)|\). A social network with k nodes may then be represented by an adjacency matrix \(B_{k\times k}\), where \(B_{i,j}\) denotes

$$\begin{aligned} B_{i,j}=\left\{ \begin{array}{rcl} 1, &{} &{} {\mathrm{if}~\mathrm{the}~\mathrm{node}~i~\mathrm{and}~\mathrm{the}~\mathrm{node}~j~\mathrm{are}~\mathrm{connected}}\\ 0, &{} &{} {\mathrm{otherwise}} \end{array} \right. \end{aligned}$$

and the number of edges is

$$\begin{aligned} n=\sum \limits _{i=1}^{k}\sum \limits _{j=1,j\ne i}^{k}B_{i,j}. \end{aligned}$$

Definition 1

Node degree (user degree). It denotes the influence range of the user i in the social network; namely, the node degree is the number of all edges associated with the node (user) i, represented as follows:

$$\begin{aligned} m_i^\mathrm{in}=\sum \limits _{j=1}^{k_\mathrm{in}}B_{j,i},\,\, m_i^\mathrm{out}=\sum \limits _{j=1}^{k_\mathrm{out}}B_{i,j} \end{aligned}$$

where \(k_\mathrm{in}\) is the number of edges pointing to the node i and \(k_\mathrm{out}\) is the number of edges leaving the node i. So, \(m_i^\mathrm{in}\) is the in-degree of the node i and \(m_i^\mathrm{out}\) is the out-degree of the node i.
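
The adjacency matrix, the edge count and the node degrees can be checked on a small example. The following Python sketch is illustrative only: the helper names `edge_count`, `in_degree` and `out_degree` and the toy matrix `B` are assumptions introduced here, and the sums are taken over all k nodes, so that zero entries contribute nothing.

```python
def edge_count(B):
    """Number of directed edges n: sum of B[i][j] over all pairs with i != j."""
    k = len(B)
    return sum(B[i][j] for i in range(k) for j in range(k) if i != j)


def in_degree(B, i):
    """In-degree of node i: number of edges pointing to i (column sum)."""
    return sum(B[j][i] for j in range(len(B)) if j != i)


def out_degree(B, i):
    """Out-degree of node i: number of edges leaving i (row sum)."""
    return sum(B[i][j] for j in range(len(B)) if j != i)


# Hypothetical toy network with three users and edges 0->1, 0->2, 1->2.
B = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]

n = edge_count(B)
degrees = [(in_degree(B, i), out_degree(B, i)) for i in range(len(B))]
```

Summing the in-degrees (or the out-degrees) over all nodes gives back n, since every directed edge is counted once at its head and once at its tail.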

Definition 2

Edge weight \(w_{i,j}\). The adjacency matrix \(W_{k,k}\) is used to represent the edge weights in the social network, where the edge weight between the node i and the node j is the matrix element \(w_{i,j}\in W_{k,k}\). The more closely the node i and the node j are connected, that is, the more frequently messages are transferred between them, the greater the possibility of message propagation and the greater the value of \(w_{i,j}\); conversely, if the connection between the node i and the node j is sparse, the possibility of message propagation is smaller.

Definition 3

Module degree D. It is the ratio of the edge density within a community to the edge density among the associated communities, and its formula (Newman and Girvan 2004) is as follows:

$$\begin{aligned} D={\frac{1}{n}\cdot \sum \limits _{i=1}^{z}\sum \limits _{j=1}^{z}\left[ B_{i,j}-\frac{m_i^\mathrm{out}\cdot m_j^\mathrm{in}}{n}\right] \cdot \sigma (c_i,c_j)} \end{aligned}$$

where n is the number of edges, z is the number of nodes, \(c_i\) denotes the community that the node i belongs to, \(c_j\) denotes the community that the node j belongs to and \(\sigma \) is the function used to compute the max value of edge number between \(c_i\) and \(c_j\). From the above formula, we can know that when the module degree of network is the minimum value of 0, the initialization of each node is independent to form a single community; when the module degree of network is the maximum value of 1, all nodes are detected into a community. The process of community detection and classification is based on the modularity maximization principles: (1) the terms of negative value should be excluded; (2) the nodes with greater interest similarity should be partitioned to the same communities.
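
The formula of Definition 3 can be evaluated directly for a given assignment of nodes to communities. The sketch below is a minimal illustration under a stated assumption: \(\sigma (c_i,c_j)\) is treated as an indicator that equals 1 when the nodes i and j belong to the same community and 0 otherwise, as in the standard modularity of Newman and Girvan (2004). The function name `module_degree` and the toy graph are hypothetical.

```python
def module_degree(B, community):
    """Module degree D of a directed 0/1 adjacency matrix B.

    community[i] is the community label of node i. Assumption: sigma is the
    same-community indicator (1 if the labels match, 0 otherwise).
    """
    k = len(B)
    n = sum(B[i][j] for i in range(k) for j in range(k) if i != j)
    if n == 0:
        return 0.0  # no edges: nothing to measure
    m_out = [sum(B[i][j] for j in range(k) if j != i) for i in range(k)]
    m_in = [sum(B[j][i] for j in range(k) if j != i) for i in range(k)]
    total = 0.0
    for i in range(k):
        for j in range(k):
            if community[i] == community[j]:
                total += B[i][j] - m_out[i] * m_in[j] / n
    return total / n


# Hypothetical toy graph: two directed triangles joined by the edge 2->3.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
B6 = [[0] * 6 for _ in range(6)]
for a, b in edges:
    B6[a][b] = 1

D_split = module_degree(B6, [0, 0, 0, 1, 1, 1])
```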

Definition 4

Personal willingness \(\varepsilon _u\) of user (node). It is the willingness degree chosen by a user when the user joins a community, where \(\varepsilon _u\in [0,1]\) and the user may choose his or her own willingness value according to his or her requirements. When \(\varepsilon _u=1\), the user is completely open to the outside world, and information from the outside world is accepted unconditionally; when \(\varepsilon _u=0\), the user is completely hidden from the outside world; namely, the user does not accept any information from the outside world.

Definition 5

Node willingness \(\varepsilon _{i,j}\). It is a measure that controls the number of connections (or the edge intimacy) between the node i and the node j, where \(\varepsilon _{i,j}\in [0,1]\). Namely, node willingness controls the number of messages transferred between the node i and the node j, and it is influenced by the personal willingness of both nodes. Node willingness \(\varepsilon _{i,j}\) is proportional to the number of messages transferred between the node i and the node j and may be adjusted dynamically. When \(\varepsilon _{i,j}=0\), the communication willingness between the node i and the node j is 0; namely, the two nodes are not willing to communicate with each other. When \(\varepsilon _{i,j}=1\), the node i and the node j are the most willing to communicate with each other.

Definition 6

Community willingness \(\varepsilon _{c}\). It is a measure that sets the degree to which a community communicates with other communities, where \(\varepsilon _{c}\in [0,1]\). Each community may set its community willingness according to its own requirements, and every member of the community is required to satisfy the condition of community willingness.

Definition 7

Edge intimacy degree \(l_{i,j}^\varepsilon \). It indicates how frequently the node i and the node j interact or transmit messages, and it is influenced by node willingness \(\varepsilon _{i,j}\). Edge intimacy degree \(l_{i,j}^\varepsilon \) is computed as follows:

$$\begin{aligned} r_{i,j}^\varepsilon= & {} r_{i,j}\cdot \varepsilon _{i,j},\\ r_{i}^\varepsilon= & {} \sum \limits _{x=1}^{m^\mathrm{out}} r_{i,x}^\varepsilon ,\,\, r_{j}^\varepsilon =\sum \limits _{x=1}^{m^\mathrm{in}} r_{x,j}^\varepsilon ,\\ l_{i,j}^\varepsilon= & {} \frac{r_{i,j}^\varepsilon }{\sqrt{r_{i}^\varepsilon \cdot r_{j}^\varepsilon }} \end{aligned}$$

where we set that \(r_{i,j}\) is the original number of message transmission from the node i to the node j without node willingness \(\varepsilon _{i,j}\), \(r_{i,j}^\varepsilon \) is the number of message transmission controlled by node willingness \(\varepsilon _{i,j}\), \(m^\mathrm{out}\) is the out-degree of the node i and \(m^\mathrm{in}\) is the in-degree of the node j.
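
As an illustration of Definition 7, the sketch below computes the edge intimacy degree from a matrix of raw message counts and a matrix of node willingness values. The function name `edge_intimacy`, the matrices `r` and `eps` and the convention of returning 0 when a denominator vanishes are assumptions made for this example.

```python
import math


def edge_intimacy(r, eps, i, j):
    """Edge intimacy degree between node i and node j.

    r[a][b]   : original number of messages sent from a to b
    eps[a][b] : node willingness between a and b, in [0, 1]
    """
    k = len(r)

    def r_eps(a, b):
        # Message count after it is scaled by node willingness.
        return r[a][b] * eps[a][b]

    r_i = sum(r_eps(i, x) for x in range(k))  # weighted messages sent by i
    r_j = sum(r_eps(x, j) for x in range(k))  # weighted messages received by j
    if r_i == 0 or r_j == 0:
        return 0.0  # assumption: no traffic means zero intimacy
    return r_eps(i, j) / math.sqrt(r_i * r_j)


# Hypothetical message counts among three users and a uniform willingness of 0.5.
r = [
    [0, 8, 2],
    [0, 0, 0],
    [0, 6, 0],
]
eps = [[0.5] * 3 for _ in range(3)]

l_01 = edge_intimacy(r, eps, 0, 1)
```

Here the weighted count from node 0 to node 1 is 4, node 0 sends 5 weighted messages in total and node 1 receives 7, so the intimacy degree is \(4/\sqrt{35}\).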

Definition 8

Information content feature. It is defined as \(\mathrm{Cnt}=\{\mathrm{cnt}^1,\mathrm{cnt}^2,\ldots ,\mathrm{cnt}^N\}\), where \(\mathrm{cnt}^k\) is the feature of the kth piece of message content.

Definition 9

User feature set. It is defined as \(U=\{u^1,u^2,\ldots ,u^N\}\), where \(u^k\) is the attribute feature of the kth user.

Definition 10

Manhattan distance. It is the sum of the distances along each of the dimensions, and its formula is as follows:

$$\begin{aligned} \mathrm{dist}(X,Y)=\sum \limits _{i=1}^n|x_i-y_i| \end{aligned}$$

where \(x_i\in X\) and \(y_i\in Y\).
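
For concreteness, a minimal Python sketch of Definition 10 follows. The function name `manhattan` is an assumption introduced here, and the two vectors are assumed to have the same number of dimensions.

```python
def manhattan(X, Y):
    """Manhattan distance: sum of |x_i - y_i| over all dimensions."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(X, Y))


# Hypothetical feature vectors of two users.
d = manhattan([1, 2, 3], [4, 0, 3])  # |1-4| + |2-0| + |3-3| = 5
```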

5 Community detection

In this section, we propose a community detection scheme based on the attributes of nodes and edges, including edge intimacy degree, node willingness and node interest degree. The scheme assigns weights to the edges of the social network according to node degree and edge intimacy degree and then constructs a module degree detection method that incorporates node interest degree and node willingness. The scheme uses a hierarchical approach to detect communities.

5.1 Edge weight

According to the related definitions, the computation of edge weight is based on edge intimacy degree and node degree (Yao et al. 2015). As node willingness \(\varepsilon _{i,j}\) is added to the computation of \(l_{i,j}^\varepsilon \), the computation of edge weight is also related to node willingness. Edge weight is defined as:

$$\begin{aligned} w_{i,j}=\eta \cdot l_{i,j}^\varepsilon +(1-\eta )\cdot \sqrt{\frac{B_{i,j}}{m_i^\mathrm{out}\cdot m_j^\mathrm{in}}} \end{aligned}$$

where \(\eta \in [0,1]\) is the impact factor. In online social networks, because the attributes of nodes differ, the factors that affect community division also differ. The greater \(\eta \) is, the greater the influence of edge intimacy degree on community division; conversely, the smaller \(\eta \) is, the smaller this influence. When \(\eta =0\), community division is based only on topological structure or community structure. When \(\eta =1\), edge weight equals edge intimacy degree, which indicates how frequently two nodes interact or transmit messages. The greater the edge intimacy degree, the higher the possibility of information transmission. The impact factor \(\eta \) is therefore introduced into the above formula to balance, and thereby limit, the influence of the attributes of nodes.
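
The effect of the impact factor can be seen by evaluating the edge weight formula at its two extremes. The sketch below is illustrative: the function name `edge_weight` and its argument names are assumptions, and the edge intimacy degree is passed in as an already computed value.

```python
import math


def edge_weight(l_ij, b_ij, m_out_i, m_in_j, eta):
    """Edge weight: eta * intimacy + (1 - eta) * structural term.

    l_ij    : edge intimacy degree between node i and node j
    b_ij    : adjacency entry B[i][j] (0 or 1)
    m_out_i : out-degree of node i
    m_in_j  : in-degree of node j
    eta     : impact factor in [0, 1]
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    # If the edge exists, both degrees are at least 1, so the division is safe.
    structural = math.sqrt(b_ij / (m_out_i * m_in_j)) if b_ij else 0.0
    return eta * l_ij + (1.0 - eta) * structural


# Hypothetical edge with intimacy 0.8 whose endpoints both have degree 2.
only_topology = edge_weight(0.8, 1, 2, 2, eta=0.0)  # structural term only
only_intimacy = edge_weight(0.8, 1, 2, 2, eta=1.0)  # intimacy only
balanced = edge_weight(0.8, 1, 2, 2, eta=0.5)
```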

5.2 Weighted module degree

Module degree is the most commonly used method for community partition. It is also a general method to measure the quality of community structure. Based on module degree, the community structure may be clearly detected, and the partitioned community structure is more stable. Therefore, our proposed scheme uses module degree to detect community structure with interest degree and personal willingness, which can increase the stability of community structure and reduce the probability of generating large-scale communities. The work of Yao et al. (2015) proposed a novel measure to compute weighted module degree and its increment. According to Definition 3 and the formula of edge weight, weighted module degree is defined as follows:

$$\begin{aligned} D^*={\frac{1}{w}\cdot \sum \limits _{i=1}^{z}\sum \limits _{j=1}^{z}\left[ w_{i,j}-\frac{w_i^\mathrm{out}\cdot w_j^\mathrm{in}}{w}\right] \cdot \sigma (c_i,c_j)} \end{aligned}$$

where \(w_{i,j}\) is the edge weight, \(w_i^\mathrm{out}\) is the out-weight of the node i, \(w_j^\mathrm{in}\) is the in-weight of the node j and w is the sum of all the edge weights. Based on the above formula, when the node i joins the community \(c_j\) that the adjacent node j belongs to, the increment \(\Delta D^*\) of module degree is defined as follows:

$$\begin{aligned} \Delta D^*= & {} \left[ \frac{w_{c_j}+w_{i,c_j}}{w}-\frac{(w_{c_j}^\mathrm{in}+w_{i}^\mathrm{in})(w_{c_j}^\mathrm{out}+w_{i}^\mathrm{out})}{w^2}\right] \\&-\left[ \frac{w_{c_j}}{w}-\frac{w_{c_j}^\mathrm{out}\cdot w_{c_j}^\mathrm{in}}{w^2}-\frac{w_{i}^\mathrm{out}\cdot w_{i}^\mathrm{in}}{w^2}\right] \end{aligned}$$

where \(w_{c_j}\) is the sum of the internal edge weights of the community \(c_j\), \(w_{i,c_j}\) is the sum of the edge weights of the node i and other associated nodes from the community \(c_j\), \(w_{c_j}^\mathrm{in}\) is the sum of the in-weights of the community \(c_j\), \(w_{c_j}^\mathrm{out}\) is the sum of the out-weights of the community \(c_j\), \(w_{i}^\mathrm{in}\) is the in-weight of the node i and \(w_{i}^\mathrm{out}\) is the out-weight of the node i.
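
Expanding the product in the first bracket shows that the terms \(w_{c_j}/w\), \(w_{c_j}^\mathrm{out}\cdot w_{c_j}^\mathrm{in}/w^2\) and \(w_{i}^\mathrm{out}\cdot w_{i}^\mathrm{in}/w^2\) cancel. As a worked simplification that follows by elementary algebra from the formula above, the increment reduces to:

$$\begin{aligned} \Delta D^*=\frac{w_{i,c_j}}{w}-\frac{w_{c_j}^\mathrm{in}\cdot w_{i}^\mathrm{out}+w_{i}^\mathrm{in}\cdot w_{c_j}^\mathrm{out}}{w^2} \end{aligned}$$

In this form, joining \(c_j\) increases the weighted module degree only when the weight connecting the node i to \(c_j\), scaled by w, exceeds the cross terms of the in-weights and out-weights.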

5.3 Node interest degree

Users who share a common interest are more likely to join the same community, and they will share messages of mutual interest. When users join a community, they can define labels of their own interests in the form \(\langle keyword, weight\rangle \) to represent their interests. Therefore, a clustering approach applied to the message vectors of nodes is used to extract the interest similarity of nodes (Rao and Raju 2016).

For a message \(\mathrm{cnt}_i\), it can be expressed as

$$\begin{aligned} \mathrm{cnt}_i=\{(n_1,w_1);(n_2,w_2);\ldots (n_N,w_N)\} \end{aligned}$$

where the kth keyword in the message \(\mathrm{cnt}_i\) is represented by \(n_k\), the weight of the kth keyword is represented by \(w_k\) (\(k=1,2,\ldots ,N\)), and the pairs are ranked in descending order.

Therefore, based on the structure of message \(\mathrm{cnt}_i\), the interest similarity between the node i and the node j can be defined as:

$$\begin{aligned} \mathrm{Int}(i,j)=\mathrm{sim}(i,j)=\frac{\sum \limits _{n=1}^m w_{i,n}\cdot w_{j,n} }{\sqrt{\sum \limits _{n=1}^m w_{i,n}^2}\cdot \sqrt{\sum \limits _{n=1}^m w_{j,n}^2}}\cdot \varepsilon _{i,j} \end{aligned}$$

where m denotes the number of common keywords, that is, keywords shared by both messages, \(w_{i,n}\) denotes the weight of the nth common keyword of the message from the node i, \(w_{j,n}\) denotes the weight of the nth common keyword of the message from the node j and \(\varepsilon _{i,j}\) is node willingness. If the value \(\mathrm{Int}(i,j)\) is greater than a preset threshold T, then the probability that the node i and the node j are detected into the same community is greater; otherwise, the probability is smaller.
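
The interest similarity is a cosine similarity over the shared keywords, scaled by node willingness. The following sketch illustrates the computation under assumptions introduced here: each message is represented as a Python dict from keyword to weight, the function is named `interest_similarity`, and the similarity is taken to be 0 when the two messages share no keyword.

```python
import math


def interest_similarity(weights_i, weights_j, eps_ij):
    """Interest similarity Int(i, j) over the keywords shared by both messages.

    weights_i, weights_j : dicts mapping keyword -> weight
    eps_ij               : node willingness between node i and node j
    """
    shared = sorted(set(weights_i) & set(weights_j))
    if not shared:
        return 0.0  # assumption: no shared keyword means no similarity
    dot = sum(weights_i[kw] * weights_j[kw] for kw in shared)
    norm_i = math.sqrt(sum(weights_i[kw] ** 2 for kw in shared))
    norm_j = math.sqrt(sum(weights_j[kw] ** 2 for kw in shared))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j) * eps_ij


# Hypothetical keyword weights of two users; "sports" and "music" are shared.
user_i = {"sports": 3, "music": 4, "travel": 1}
user_j = {"sports": 3, "music": 4, "food": 2}

sim = interest_similarity(user_i, user_j, eps_ij=0.5)
```

In this example the shared keyword weights are identical, so the cosine term equals 1 and the result is determined entirely by the node willingness of 0.5.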

5.4 The proposed detection algorithm

In this section, we propose a community detection algorithm based on module degree detection (Yang and Counts 2010), combined with interest degree and personal willingness. Compared with the previous module measure, the proposed algorithm introduces node interest degree and node willingness to guarantee the structural stability of the community members. The proposed algorithm not only provides hierarchical community structure detection, but can also adjust node willingness after a period of time according to the realistic requirements of the community, so as to further screen the community members, make the community more stable and make message propagation more fluent. The proposed algorithm (shown in Fig. 1) consists of three subalgorithms:

  • Step 1 input the adjacent network G(V, E) and the total number k of nodes, then measure the increment of module degree, and output a set \(C_1\) of communities (see Algorithm 1 for details).

  • Step 2 input the set \(C_1\) of communities and a given threshold T, compute the interest degrees of nodes, then repeatedly screen the set \(C_1\) according to the interest degrees, where the interest degrees are compared with the given threshold T, and finally output a set \(C_2\) of communities and a set \(\Phi \) of community willingness (see Algorithm 2 for details).

  • Step 3 input the set \(C_2\) of communities and the set \(\Phi \) of community willingness, then screen the set \(C_2\) according to the condition whether the corresponding node willingness can satisfy a given requirement or not, and finally output a set C of communities and the total number of communities (see Algorithm 3 for details).

Fig. 1 Community detection

1. The description of Algorithm 1

Algorithm 1 filters the communities according to module degree: (1) in the adjacent social network G(V, E) with the total number k of nodes, the algorithm initializes each node willingness as the mean value of the personal willingness of the corresponding nodes, then computes the new weight of each edge by the formula of edge weight and initializes each node as its own community, so that the total number of communities is k; (2) for each community i (\(i\in G\)), the algorithm calculates all the increments \(\Delta D^*\) of module degree, assuming that the community i tries to join each of its adjacent communities; (3) the algorithm finds the adjacent community with the maximum increment (\(\Delta D^*>0\)), and the community i joins that adjacent community; (4) as long as the values of \(\Delta D^*\) keep changing, the merging process continues iteratively until the communities cannot be merged into communities of a higher level; (5) when the partition ends, the algorithm returns a set \(C_1\) of communities.

Algorithm 1 (pseudocode figure)
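
The merging loop of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration rather than the authors' listing: it takes an already weighted matrix `W` as input (the initialization of node willingness and edge weights is omitted), treats two communities as adjacent when at least one weighted edge connects them, and breaks ties by the smallest community label. All function names are hypothetical.

```python
def delta_module_degree(w_c, w_ic, w_c_in, w_c_out, w_i_in, w_i_out, w):
    """Increment of weighted module degree when unit i joins community c."""
    merged = (w_c + w_ic) / w - (w_c_in + w_i_in) * (w_c_out + w_i_out) / w ** 2
    separate = w_c / w - w_c_out * w_c_in / w ** 2 - w_i_out * w_i_in / w ** 2
    return merged - separate


def totals(W, members):
    """Internal weight, in-weight and out-weight of a group of nodes."""
    k = len(W)
    internal = sum(W[a][b] for a in members for b in members)
    w_in = sum(W[a][v] for v in members for a in range(k))
    w_out = sum(W[v][b] for v in members for b in range(k))
    return internal, w_in, w_out


def merge_by_module_degree(W):
    """Greedy merging: returns one community label per node."""
    k = len(W)
    w = sum(W[a][b] for a in range(k) for b in range(k))
    label = list(range(k))  # every node starts as its own community
    if w == 0:
        return label
    changed = True
    while changed:
        changed = False
        for c in sorted(set(label)):
            members = [v for v in range(k) if label[v] == c]
            if not members:
                continue  # c was merged away earlier in this pass
            _, i_in, i_out = totals(W, members)
            best_gain, best_target = 0.0, None
            for d in sorted(set(label) - {c}):
                others = [v for v in range(k) if label[v] == d]
                w_ic = sum(W[a][b] + W[b][a] for a in members for b in others)
                if w_ic == 0:
                    continue  # not adjacent
                c_int, c_in, c_out = totals(W, others)
                gain = delta_module_degree(c_int, w_ic, c_in, c_out, i_in, i_out, w)
                if gain > best_gain:
                    best_gain, best_target = gain, d
            if best_target is not None:
                for v in members:
                    label[v] = best_target
                changed = True
    return label


# Hypothetical weighted graph: two directed triangles joined by the edge 2->3.
W = [[0.0] * 6 for _ in range(6)]
for a, b in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    W[a][b] = 1.0

labels = merge_by_module_degree(W)
```

Each merge reduces the number of communities by one, so the loop stops after at most k - 1 merges. When a whole community is merged, its own internal weight appears in both the merged and the separate state and cancels, which is why `delta_module_degree` does not need it.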

2. The description of Algorithm 2

Algorithm 2 filters the communities returned from Algorithm 1 again, this time according to node interest degree: (1) based on the set \(C_1\) of communities and the given interest threshold T, for every node \(i_k\) from the community i (\(i\in C_1\)), the algorithm calculates the interest degrees between the node \(i_k\) and all the associated nodes from the adjacent community j according to Sect. 5.3; if all the interest degrees of the node \(i_k\) are greater than T, then the node \(i_k\) is deleted from the community i and joins the adjacent community j (shown in Fig. 1); (2) similarly, as long as the values of \(\mathrm{Int}(i_k,j_k)\) keep changing, the filtering process continues iteratively; (3) when the filtering ends, the algorithm obtains a set \(C_2\) of communities; (4) based on the set \(C_2\) of communities, the algorithm calculates the community willingness of each community, which is the mean value of the personal willingness of all nodes in that community, and saves it to the set \(\Phi \); (5) the algorithm returns the sets \(C_2\) and \(\Phi \).

Algorithm 2 (pseudocode figure)
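
A single screening pass of Algorithm 2, followed by the computation of community willingness, might look as follows in Python. This is a sketch under assumptions made here, not the authors' listing: "associated nodes" are read as the neighbours of a node that lie in another community, a node with several qualifying communities moves to the one with the highest mean interest degree, and only one pass is performed instead of iterating until the values stop changing. All names are hypothetical.

```python
def screen_by_interest(label, neighbours, interest, T):
    """One screening pass: move a node to an adjacent community when all its
    interest degrees with its neighbours in that community exceed T.

    label      : dict node -> community id
    neighbours : dict node -> set of adjacent nodes
    interest   : function (u, v) -> Int(u, v)
    """
    new_label = dict(label)
    for u in sorted(label):
        by_comm = {}
        for v in neighbours.get(u, ()):
            if new_label[v] != new_label[u]:
                by_comm.setdefault(new_label[v], []).append(interest(u, v))
        candidates = {c: vals for c, vals in by_comm.items()
                      if all(x > T for x in vals)}
        if candidates:
            # Assumption: prefer the community with the highest mean interest.
            new_label[u] = max(sorted(candidates),
                               key=lambda c: sum(candidates[c]) / len(candidates[c]))
    return new_label


def community_willingness(label, personal):
    """Community willingness: mean personal willingness of the members."""
    groups = {}
    for u, c in label.items():
        groups.setdefault(c, []).append(personal[u])
    return {c: sum(vals) / len(vals) for c, vals in groups.items()}


# Hypothetical input: nodes 0, 1 in community 0 and nodes 2, 3 in community 1.
label = {0: 0, 1: 0, 2: 1, 3: 1}
neighbours = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
table = {frozenset((0, 1)): 0.1, frozenset((1, 2)): 0.9,
         frozenset((1, 3)): 0.8, frozenset((2, 3)): 0.9}


def interest(u, v):
    return table.get(frozenset((u, v)), 0.0)


C2 = screen_by_interest(label, neighbours, interest, T=0.5)
Phi = community_willingness(C2, {0: 0.2, 1: 0.6, 2: 0.8, 3: 1.0})
```

In this example node 1 has interest degrees 0.9 and 0.8 with its neighbours in community 1, both above the threshold, so it moves there.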

3. The description of Algorithm 3

Algorithm 3 adjusts the detection of the communities according to willingness: (1) based on the set \(C_2\) of communities, for every community \(c_i\) from \(C_2\), the algorithm looks up the corresponding community willingness \(\varepsilon _{c_{i}}\) from the set \(\Phi \) (\(\varepsilon _{c_{i}}\in \Phi \)); (2) for every node from \(c_i\) with \(i\in \{1,2,\ldots ,|C_2|\}\), the algorithm computes the mean value of all the node willingness between the node and its adjacent nodes; if the mean value is less than \(\frac{\varepsilon _{c_{i}}}{\varpi }\) (\(\varpi \) is the willingness adjustment parameter, whose value may be set according to the requirements), then the communication willingness of the node is lower than that of the other nodes in the same community \(c_i\), and the node is removed from \(c_i\) (shown in Fig. 1); (3) the algorithm finally returns a set C of communities and the number of all the communities.

figure c
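The willingness test of Algorithm 3 may be sketched in the same spirit. This is a minimal sketch under stated assumptions: the names `adjust_by_willingness`, `pair_willingness` and `neighbors` are hypothetical, the node willingness between two nodes is looked up from a dictionary keyed by ordered pairs, and a node without adjacent nodes is treated as having mean willingness 0.

```python
from statistics import mean


def adjust_by_willingness(communities, community_willingness, pair_willingness,
                          neighbors, varpi):
    """Sketch of Algorithm 3: remove nodes whose mean willingness is below eps_c / varpi.

    communities:           list of sets of node ids (the set C_2)
    community_willingness: list of eps_c values, aligned with `communities` (the set Phi)
    pair_willingness:      dict (u, v) -> node willingness between u and v (hypothetical format)
    neighbors:             dict node -> list of adjacent nodes
    varpi:                 willingness-adjustment parameter
    """
    adjusted = []
    for members, eps_c in zip(communities, community_willingness):
        kept = set()
        for node in members:
            adj = neighbors.get(node, [])
            # assumption: an isolated node has mean willingness 0
            avg = mean(pair_willingness[(node, nb)] for nb in adj) if adj else 0.0
            if avg >= eps_c / varpi:
                kept.add(node)
        adjusted.append(kept)
    return adjusted, len(adjusted)
```

For example, with \(\varepsilon _{c_{i}}=0.6\) and \(\varpi =3\) the threshold is 0.2, so a node whose mean willingness toward its adjacent nodes is 0.1 would be removed.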

6 Message propagation model

In this section, we present a message propagation model based on personal willingness for social networks. In an actual message propagation process, message transfer is not necessarily synchronous, and propagation delays may occur. Therefore, a message propagation model is commonly built on the propagation probability and the propagation delay, where messages are transferred asynchronously and the propagation delays differ. The asynchronous independent cascade (AsIC) model (Saito et al. 2009) is a message diffusion model based on continuous time delay, which can handle and monitor the asynchronous, cascaded propagation of messages according to the propagation probability and delay. So, based on the AsIC model, we introduce the features of node attributes and message content into the proposed model and build the model with the exponential mechanism. The proposed model uses the propagation probability \(p(i,j,\mathrm{cnt},\varepsilon )\) and the propagation delay \(\tau (i,j,\mathrm{cnt},\varepsilon )\) as its basic evaluation functions. In the proposed model, we introduce and quantify most of the factors that influence message propagation, so that the proposed model can guarantee the security and reliability of message propagation.

6.1 Feature extraction

We first extract features from the social networkFootnote 2. In this paper, the extracted features are divided into node features and edge features. Node features include the characteristics of communication subjects, the characteristics of communication objects and the message characteristics. Edge features include the relationship characteristics between communication subjects and communication objects, and the relationship characteristics between communication objects and propagated information contents. Because the extracted features describe the related attributes, each attribute corresponds to a relevant feature (Zhou et al. 2015).

(1) Node Feature

(1) Characteristics \(\Psi _s\) of communication subjects

(a) Node influence: it is the number of nodes that a node associates with over a period of time, which we set to the node’s out-degree \(m_i^\mathrm{out}\); thus,

$$\begin{aligned} \mathrm{Inf}(i)=m_i^\mathrm{out}={\sum \limits _{j=1}^{k_\mathrm{out}}B_{i,j}}, \end{aligned}$$

\(\mathrm{Inf}(i)\) denotes the node influence of i.

(b) Node authority: it is the difference between the in-degree and the out-degree of a node. The greater the node authority is, the more likely messages are to be transmitted; thus,

$$\begin{aligned} \mathrm{Aut}(i)=\left\{ \begin{array}{rcl} x\in [0,1],&{}&{}\quad {\mathrm{if}~m^\mathrm{in}-m^\mathrm{out}<0 }\\ m^\mathrm{in}-m^\mathrm{out},&{}&{}\quad \mathrm{otherwise} \end{array} \right. \end{aligned}$$

\(\mathrm{Aut}(i)\) denotes the node authority of i and x is randomly picked.

(c) Node activity: it is the average number of messages that a node releases per day. Messages propagated through highly active nodes are more likely to be released and are more easily received or forwarded by other nodes (users); thus,

$$\begin{aligned} Act(i)=r_{i}^\varepsilon =\sum \limits _{x=1}^{m^\mathrm{out}} r_{i,x}^\varepsilon , \end{aligned}$$

Act(i) denotes the node activity of i.

(2) Characteristics \(\Psi _r\) of communication objects

(a) Node propagation willingness: it is the user’s willingness to propagate a received message, defined by the following formula. In the definition, we take the logarithm of the ratio of the number \(m_\mathrm{transmit}\) of messages forwarded by a user to the number \(m_\mathrm{original}\) of original messages (plus 1), and then multiply the logarithm by the user’s willingness \(\varepsilon _u\). The stronger the user’s willingness to propagate messages is, the greater the value is.

$$\begin{aligned} \mathrm{Wil}(j)=\log \left( \frac{m_{\mathrm{transmit}}}{m_\mathrm{original}}+1\right) \cdot \varepsilon _u, \end{aligned}$$

\(\mathrm{Wil}(j)\) denotes the node propagation willingness.

(b) Node propagation characteristic: it is defined as the product of the propagation characteristic of the messages from the node j to the node i and the node willingness \(\varepsilon _{i,j}\), as shown in the following formula.

$$\begin{aligned} \zeta _{i\leftarrow j}=\frac{\log (1+\mathrm{Inf}(j))}{\log \left( (1+\mathrm{Inf}(i))\cdot (1+\mathrm{Inf}(j))\right) }\cdot \varepsilon _{i,j}, \end{aligned}$$

\(\zeta _{i\leftarrow j}\) is the node propagation characteristic. The above formula shows that the node propagation characteristic depends on the influences of the node i and the node j, where we remark that (1) normally \(\zeta _{i\leftarrow j}\ne \zeta _{j\leftarrow i}\); (2) if \(\mathrm{Inf}(i)\ll \mathrm{Inf}(j)\), then \(\zeta _{i\leftarrow j}\approx \varepsilon _{i,j}\), which shows that the node i readily accepts the messages transmitted by the node j, so the node j is more influential; (3) conversely, if \(\mathrm{Inf}(i)\gg \mathrm{Inf}(j)\), then \(\zeta _{i\leftarrow j}\approx 0\), which shows that the node i rarely accepts the messages transmitted by the node j, so the node i is more influential.

(3) Characteristics \(\Psi _{\mathrm{cnt}}\) of propagated messages

(a) Whether the message contains a URL linkFootnote 3 is expressed as follows:

$$\begin{aligned} \mathrm{Url}(\mathrm{cnt})=\left\{ \begin{array}{rcl} 1,&{}&{}\quad {\mathrm{if}~\mathrm{URL}\subset \mathrm{cnt}}\\ 0,&{}&{}\quad \mathrm{otherwise} \end{array} \right. \end{aligned}$$

(b) Whether the message contains a label (such as the symbol #label content#)Footnote 4 is expressed as follows:

$$\begin{aligned} \mathrm{Lab}(\mathrm{cnt})=\left\{ \begin{array}{rcl} 1,&{}&{}\quad {\mathrm{if}~\#\mathrm{content}\#\subset \mathrm{cnt}}\\ 0,&{}&{}\quad \mathrm{otherwise} \end{array} \right. \end{aligned}$$

(2) Edge Feature

(1) Characteristic relation \(\Psi _{s,r}\) between the communication subject and the communication object

(a) Interest similarity: users with the same or similar interests are more likely to propagate the same kind of messages. This feature is derived from the node interest extracted in community detection. The complete formula is the formula of interest similarity from Sect. 5.3, where

$$\begin{aligned} \mathrm{Int}(i,j)=\mathrm{sim}(i,j). \end{aligned}$$

(b) Directed propagation: if the message sender propagates related messages directly to a receiver by the receiver’s ID (such as “receiver’s ID”), then it indicates a close connection between the two users and a greater intimacy degree.

$$\begin{aligned} \mathrm{Dit}(i,j)=\left\{ \begin{array}{rcl} 1,&{}&{}\quad {\mathrm{if}~i~\mathrm{and}~j~\mathrm{were}~\mathrm{associated}}\\ 0,&{}&{}\quad \mathrm{otherwise} \end{array} \right. \end{aligned}$$

(2) Characteristic relation \(\Psi _{r,\mathrm{cnt}}\) between the communication object and the communication content

The characteristic relation \(\Psi _{r,\mathrm{cnt}}\) is represented by the propagation interest. The propagation interest measures the difference between the user’s interest and the propagated content, and it evaluates whether the user is interested in the content. So, it is computed as the Manhattan distance between the communication object and the communication content, and the formula is as follows:

$$\begin{aligned} \mathrm{Spr}(U,C)=\mathrm{dist}(U,C)=\sum \limits _{k=1}^n |u_k-c_k|, \end{aligned}$$

where \(U=(u_1,u_2,\ldots u_n)\) is the issued document vector of a user and \(C=(c_1,c_2,\ldots c_n)\) is the propagated document vector of the userFootnote 5.
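The node and edge features defined above may be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function names and the list-of-lists adjacency matrix `B` are assumptions, the natural logarithm is assumed because the base of the logarithm is not stated, and returning 0 when both influences are 0 is a choice made only to avoid a division by zero.

```python
import math
import random


def influence(B, i):
    """Inf(i): out-degree of node i, i.e. the row sum of the adjacency matrix B."""
    return sum(B[i])


def authority(B, i, rng=random.random):
    """Aut(i): in-degree minus out-degree, or a random x in [0, 1) if that is negative."""
    m_out = sum(B[i])
    m_in = sum(row[i] for row in B)
    diff = m_in - m_out
    return diff if diff >= 0 else rng()


def propagation_willingness(m_transmit, m_original, eps_u):
    """Wil(j) = log(m_transmit / m_original + 1) * eps_u (requires m_original > 0)."""
    return math.log(m_transmit / m_original + 1) * eps_u


def propagation_characteristic(inf_i, inf_j, eps_ij):
    """zeta_{i<-j} = log(1+Inf(j)) / log((1+Inf(i)) * (1+Inf(j))) * eps_ij."""
    den = math.log((1 + inf_i) * (1 + inf_j))
    if den == 0:  # assumption: both influences are 0, return 0 instead of dividing
        return 0.0
    return math.log(1 + inf_j) / den * eps_ij


def propagation_interest(U, C):
    """Spr(U, C): Manhattan distance between the two document vectors."""
    return sum(abs(u - c) for u, c in zip(U, C))
```

For instance, with \(\mathrm{Inf}(i)=\mathrm{Inf}(j)=3\) and \(\varepsilon _{i,j}=0.8\), the ratio of logarithms is \(\log 4/\log 16=0.5\), so \(\zeta _{i\leftarrow j}=0.4\).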

To extract the features, we use the min-max standardization method to process some of the features, so that their values are standardized and mapped to the interval [0, 1]. For example, suppose the out-degrees of a node over a period of time are {10,30,5,25,15,35,7}; by the corresponding definition, its node influences are also {10,30,5,25,15,35,7}; then, we may standardize the value 25 by the min-max standardization method as follows:

$$\begin{aligned} \frac{25-5}{35-5}=\frac{20}{30}\approx 0.667. \end{aligned}$$
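The worked example above may be reproduced with a few lines of Python. The helper name `min_max` is an assumption of this sketch, and mapping a constant list to zeros is a choice made here, since the text does not cover that case.

```python
def min_max(values):
    """Map each value v to (v - min) / (max - min), i.e. into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # assumption: a constant feature is mapped to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


influences = [10, 30, 5, 25, 15, 35, 7]
scaled = min_max(influences)
print(round(scaled[3], 3))  # standardized value of 25
```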

6.2 Model construction of message propagation

The model construction of message propagation (Saito et al. 2011) mainly involves the propagation probability function and the propagation delay function, and it needs to establish the relationship between them according to the feature set extracted in Sect. 6.1. In this paper, we use the node features and the edge features to build the feature vectors, where the vectors’ dimensions are related to the number k of node features, the number n of edge features and the number r of resources (contents). We also build a personal willingness vector \(\Psi _{\varepsilon }\). All the vectors are constructed as follows.

$$\begin{aligned} \Psi _k= & {} \left\{ \begin{array}{rcl} \Psi _s\\ \Psi _r\\ \Psi _{\mathrm{cnt}}\\ \end{array}\right\} ,\\ \Psi _n= & {} \left\{ \begin{array}{rcl} \Psi _{s,r}\\ \Psi _{r,\mathrm{cnt}}\\ \end{array}\right\} ,\\ \Psi _\varepsilon= & {} \left\{ \begin{array}{rcl} \varepsilon _u\\ \varepsilon _{i,j}\\ \varepsilon _c\\ \end{array}\right\} . \end{aligned}$$

In message propagation, the vector \(\Psi _\varepsilon \) takes full account of the willingness value \(\varepsilon _u\) of a single node, the willingness value \(\varepsilon _{i,j}\) between nodes and the community willingness value \(\varepsilon _c\). The vector thus captures how effectively personal willingness influences message propagation.

In this paper, a basic function \(f(i,j,\mathrm{cnt},\varepsilon )\) is used to linearly represent the correlation characteristics \(\Psi _k\), \(\Psi _n\) and \(\Psi _\varepsilon \), thus

$$\begin{aligned} f(i,j,\mathrm{cnt},\varepsilon )=\alpha _0+\alpha _1^T\cdot \Psi _k+\alpha _2^T\cdot \Psi _n+\alpha _3^T\cdot \Psi _\varepsilon , \end{aligned}$$

where i and j are the indexes of the nodes, \(\mathrm{cnt}\) is the feature of the message content, \(\varepsilon \) is the personal willingness, \(\alpha _0\) is a constant, \(\alpha _1\) is the weight of the node features, \(\alpha _2\) is the weight of the edge features and \(\alpha _3\) is the weight of personal willingness. So, the greater a weight is, the greater its impact on the propagation probability is. Also, the greater the value of personal willingness is, the faster the flow of messages is. Then, the propagation probability function \(p(i,j,\mathrm{cnt},\varepsilon )\) is expressed by the Bayesian logistic function as follows:

$$\begin{aligned} p(i,j,\mathrm{cnt},\varepsilon )=\frac{1}{1+\exp \{-f(i,j,\mathrm{cnt},\varepsilon )\}}. \end{aligned}$$

Also, the propagation delay function \(\tau (i,j,\mathrm{cnt},\varepsilon )\) is represented by the linear combination of \(\Psi _k\), \(\Psi _n\) and \(\Psi _\varepsilon \), thus

$$\begin{aligned} \tau (i,j,\mathrm{cnt},\varepsilon )=\beta _0+\beta _1^T\cdot \Psi _k+\beta _2^T\cdot \Psi _n+\beta _3^T\cdot \Psi _\varepsilon , \end{aligned}$$

where \(\beta _0\) represents a constant value, \(\beta _1\) represents the weight of node feature, \(\beta _2\) represents the weight of edge feature and \(\beta _3\) represents the weight of personal willingness. Similarly, the greater the weight is, the greater the impact on the propagation delay is. Also, the greater the value of personal willingness is, the faster the flow of messages is.
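The two functions above may be sketched as follows, with the vectors \(\Psi _k\), \(\Psi _n\) and \(\Psi _\varepsilon \) concatenated into a single feature list and the weights concatenated in the same order. The function names and this flat layout are assumptions of the sketch. Note that the linear form does not by itself keep \(\tau \) positive, which a rate parameter would require; how this is ensured is not stated in the text.

```python
import math


def linear(bias, weights, features):
    """bias + weights^T . features, the common form of f and tau."""
    return bias + sum(w * x for w, x in zip(weights, features))


def propagation_probability(alpha0, alpha, features):
    """p = 1 / (1 + exp(-f)), written in a numerically stable way."""
    f = linear(alpha0, alpha, features)
    if f >= 0:
        return 1.0 / (1.0 + math.exp(-f))
    e = math.exp(f)  # avoids overflow of exp(-f) for very negative f
    return e / (1.0 + e)


def propagation_delay(beta0, beta, features):
    """tau = beta0 + beta^T . features."""
    return linear(beta0, beta, features)
```

With all weights equal to 0, `propagation_probability` returns 0.5, and increasing a feature that carries a positive weight raises the probability, in line with the remark on the weights above.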

Then, we build the message propagation model from the propagation probability function and the propagation delay function, and we introduce a time attenuation factor into the model. Goyal et al. (2010), Liu et al. (2017) and Fang et al. (2018) pointed out that the message propagation process is asynchronous and random, and that the ability and influence of message propagation between nodes decay as the time interval increases, which is consistent with the exponential decay rule. To verify this conclusion, these studies collected and analyzed a large number of cascaded messages distributed in social networks (such as the Flickr data set) and found that, as the time interval changes, the message propagation process is closest to the exponential distribution. Therefore, we choose the exponential model to establish the message propagation mechanism in this paper. The propagation probability density is defined as follows:

  1) when \( t_j > t_i \),

    $$\begin{aligned}&y\left( {\left( {j,t_j } \right) \mathrm{{|}}\left( {i,t_i } \right) , \alpha ,\beta } \right) \\&\quad =p(i,j,\mathrm{cnt},\varepsilon )\cdot \tau \left( {i,j,\mathrm{cnt},\varepsilon } \right) \\&\qquad \cdot \exp \left\{ { - \tau \left( {i,j,\mathrm{cnt},\varepsilon } \right) \cdot \left( {t_j - t_i } \right) } \right\} , \end{aligned}$$

  2) when \( t_j \le t_i \),

    $$\begin{aligned} y\left( {\left( {j,t_j } \right) \mathrm{{|}}\left( {i,t_i } \right) ,\alpha ,\beta } \right) =0, \end{aligned}$$

where \(y\left( {\left( {j,t_j } \right) \mathrm{{|}}\left( {i,t_i } \right) , \alpha ,\beta } \right) \) is the propagation probability density, which represents the probability that the node i will pass the message to the node j during this time between \(t_i\) and \(t_j\), \(\alpha = (\alpha _0 ,\alpha _1^T ,\alpha _2^T ,\alpha _3^T )\) and \(\beta = (\beta _0 ,\beta _1^T ,\beta _2^T ,\beta _3^T )\).

In message propagation, we introduce the personal willingness vector \(\Psi _\varepsilon \) into the message propagation model, whose values lie in the interval [0, 1]. So, for the propagation probability density \(y\left( {\left( {j,t_j } \right) \mathrm{{|}}\left( {i,t_i } \right) , \alpha ,\beta } \right) \) with which the node i passes the message \(\mathrm{cnt}\) to the node j between \(t_i\) and \(t_j\), the propagation probability decays as the time interval \(\Delta t=t_j-t_i\) increases, which is consistent with the exponential decay rule. Then, we may integrate the propagation probability density function over the time between \(t_i\) and \(t_j\) as follows:

$$\begin{aligned}&Y\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) \\&\quad = \int _{t_i }^{t_j } {y\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) } dt, \end{aligned}$$

where \(Y\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) \) is the cumulative function of the propagation probability during the time between \(t_i\) and \(t_j\). So, the survival probability is

$$\begin{aligned} E\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) =1-Y\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) , \end{aligned}$$

where \(E\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) \) denotes the probability that the node j does not receive the message \(\mathrm{cnt}\) from its adjacent node i until the time \(t_j\). Then, the probability that the node j does not receive the message \(\mathrm{cnt}\) from its adjacent nodes before the time \(t_j\) except for its adjacent node \(\omega \) is described as \(\prod \nolimits _{i\in R\_Node}^{i\ne \omega } E\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) \), where \(R\_Node\) denotes a set whose elements are the adjacent nodes that received the message \(\mathrm{cnt}\) before the time \(t_j\). Therefore, the probability that the node j only receives the message \(\mathrm{cnt}\) from the adjacent node \(\omega \) at the time \(t_j\) is described as

$$\begin{aligned} y\left( {\left( {j,t_j } \right) \mathrm{{|}}\left( {\omega ,t_{\omega } } \right) , \alpha ,\beta } \right) \cdot \prod \limits _{i\in R\_Node}^{i\ne \omega } E\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) . \end{aligned}$$
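The density, its integral and the survival probability may be sketched as follows. This is a sketch under assumptions: the exponent is written in terms of the elapsed time \(\Delta t=t_j-t_i\) so that the density decays as described in the text, the integral is taken over the running reception time and evaluated in closed form as \(p\cdot (1-\exp \{-\tau \Delta t\})\), and the function names and the `parents` dictionary are hypothetical.

```python
import math


def density(p, tau, t_i, t_j):
    """y((j, t_j) | (i, t_i)): p * tau * exp(-tau * (t_j - t_i)) for t_j > t_i, else 0."""
    if t_j <= t_i:
        return 0.0
    return p * tau * math.exp(-tau * (t_j - t_i))


def cumulative(p, tau, t_i, t_j):
    """Y: integral of the density over the elapsed time from t_i to t_j."""
    if t_j <= t_i:
        return 0.0
    return p * (1.0 - math.exp(-tau * (t_j - t_i)))


def survival(p, tau, t_i, t_j):
    """E = 1 - Y: node j has not received the message from node i until t_j."""
    return 1.0 - cumulative(p, tau, t_i, t_j)


def single_source_probability(omega, parents, t_j):
    """y for the actual sender omega times E for every other active adjacent node.

    parents: dict node -> (p, tau, t_i) for the adjacent nodes in R_Node
    """
    p_w, tau_w, t_w = parents[omega]
    value = density(p_w, tau_w, t_w, t_j)
    for node, (p_i, tau_i, t_i) in parents.items():
        if node != omega:
            value *= survival(p_i, tau_i, t_i, t_j)
    return value
```

As \(\Delta t\) grows, `cumulative` approaches p, so the survival probability approaches \(1-p\): an adjacent node that never succeeds in passing the message is accounted for.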

For a message content set \(C=\{\mathrm{cnt}_1,\mathrm{cnt}_2,\ldots \mathrm{cnt}_N\}\), where \(\mathrm{cnt}_l\) is the lth message content, we assume that the message set is propagated in a social network graph G(V, E) (the number of nodes is represented by \(k=|V(G)|\)). Then, all the nodes that received the lth message content \(\mathrm{cnt}_l\) and their corresponding times may be denoted as a set \(S_l=\{(v_{l,1},t_{l,1}),(v_{l,2},t_{l,2})\ldots (v_{l,k},t_{l,k})\}\), where \(v_{l,x}\) is the index of a node and \(t_{l,x}\) is the corresponding time with \(x\in \{1,2,\ldots k\}\). So, for the lth message content \(\mathrm{cnt}_l\), we may get its propagation probability in G(V, E) as follows:

$$\begin{aligned} \hbar (\mathrm{cnt}_l|\alpha ,\beta )= & {} \prod \limits _{(j,t_j)\in S_l}[ y\left( {\left( {j,t_j } \right) \mathrm{{|}}\left( {\omega ,t_{\omega } } \right) , \alpha ,\beta } \right) \cdot \\&\prod \limits _{i\in R\_Node}^{i\ne \omega } E\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) ]. \end{aligned}$$

So, for the message content set C, we may get its propagation probability in G(V, E) as follows:

$$\begin{aligned} H(C|\alpha ,\beta )=\prod \limits _{\mathrm{cnt}_l\in C}\hbar (\mathrm{cnt}_l|\alpha ,\beta ). \end{aligned}$$

Therefore, we may solve for the maximum likelihood estimates of \(\alpha \) and \(\beta \) according to the works of Saito et al. (2011) and Zhou et al. (2015), which are the solutions of the constructed message propagation model; namely, \((\hat{\alpha },\hat{\beta })\) must minimize the function

$$\begin{aligned} \mathbf{min }\{-\mathbf{lg } H(C|\alpha ,\beta )\}. \end{aligned}$$
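Because \(H(C|\alpha ,\beta )\) is a product over messages and receiving events, its negative logarithm is a sum, which may be sketched as follows. This is a sketch under assumptions: each receiving event is supplied as a hypothetical pair consisting of the density value for the actual sender \(\omega \) and the list of survival values for the other adjacent nodes, and the natural logarithm is used for \(\mathbf{lg}\) (another base would only rescale the objective).

```python
import math


def neg_log_likelihood(cascades):
    """-log H(C | alpha, beta) computed as a sum of logarithms.

    cascades: one list per message content cnt_l; each entry is a pair
              (y_omega, survivals) for one receiving event (j, t_j) in S_l
    """
    total = 0.0
    for events in cascades:
        for y_omega, survivals in events:
            total += math.log(y_omega)
            total += sum(math.log(e) for e in survivals)
    return -total
```

Summing logarithms instead of multiplying many probabilities also avoids numerical underflow when the cascades are long.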

Further, to solve the values of \(\alpha \) and \(\beta \), we first set the objective function \(F(C|\alpha ,\beta )=-\mathbf{lg } H(C|\alpha ,\beta )\) and solve the partial derivatives \(\frac{\partial F(C|\alpha ,\beta )}{\partial \alpha }\) and \(\frac{\partial F(C|\alpha ,\beta )}{\partial \beta }\)Footnote 6, and then, we can get that

$$\begin{aligned}&\frac{\partial F(C|\alpha ,\beta )}{\partial \alpha }=-\frac{\partial \mathbf{lg } H(C|\alpha ,\beta )}{\partial \alpha }\\&\quad =-\frac{\partial \mathbf{lg } \prod \limits _{\mathrm{cnt}_l\in C}\hbar (\mathrm{cnt}_l|\alpha ,\beta )}{\partial \alpha }\\&\quad =-\sum \limits _{\mathrm{cnt}_l\in C}\frac{\partial \mathbf{lg } \hbar (\mathrm{cnt}_l|\alpha ,\beta )}{\partial \alpha }\\&\quad =-\sum \limits _{\mathrm{cnt}_l\in C}\\&\qquad \frac{\partial \mathbf{lg } \prod _{(j,t_j)\in S_l}[ y\left( {\left( {j,t_j } \right) \mathrm{{|}}\left( {\omega ,t_{\omega } } \right) , \alpha ,\beta } \right) \cdot \prod _{i\in R\_\mathrm{Node}}^{i\ne \omega } E\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) ]}{\partial \alpha }\\&\quad =-\sum \limits _{\mathrm{cnt}_l\in C}\sum \limits _{(j,t_j)\in S_l}\\&\qquad \frac{\partial \mathbf{lg }[ y\left( {\left( {j,t_j } \right) \mathrm{{|}}\left( {\omega ,t_{\omega } } \right) , \alpha ,\beta } \right) \cdot \prod _{i\in R\_\mathrm{Node}}^{i\ne \omega } E\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) ]}{\partial \alpha }\\&\quad =-\sum \limits _{\mathrm{cnt}_l\in C}\sum \limits _{(j,t_j)\in S_l}\\&\qquad \frac{\partial \mathbf{lg }[ y\left( {\left( {j,t_j } \right) \mathrm{{|}}\left( {\omega ,t_{\omega } } \right) , \alpha ,\beta } \right) \cdot \prod _{i\in R\_\mathrm{Node}}^{i\ne \omega }(1-\int _{t_i }^{t_j } {y\left( {\left( {j,t_j } \right) |\left( {i,t_i } \right) , \alpha ,\beta } \right) } dt)]}{\partial \alpha }, \end{aligned}$$

where

$$\begin{aligned}&y\left( {\left( {j,t_j } \right) \mathrm{{|}}\left( {i,t_i } \right) , \alpha ,\beta } \right) \\&\quad =p(i,j,\mathrm{cnt},\varepsilon )\cdot \tau \left( {i,j,\mathrm{cnt},\varepsilon } \right) \\&\qquad \cdot \exp \left\{ { - \tau \left( {i,j,\mathrm{cnt},\varepsilon } \right) \cdot \left( {t_j - t_i } \right) } \right\} . \end{aligned}$$

Similarly, we may solve the partial derivative \(\frac{\partial F(C|\alpha ,\beta )}{\partial \beta }\).

Then, we may use the stochastic gradient descent algorithm (Vorontsov 1998; Roux et al. 2013; Johnson and Zhang 2013) to solve for the maximum likelihood estimates \((\hat{\alpha },\hat{\beta })\). The main idea of the algorithm is that (1) it starts training on the objective data set from the initial values and, at every iteration, moves one step (generally set to a small value) down the gradient of the objective function and updates the parameters; (2) it continues until the objective function converges, at which point it has found a solution at a global or local optimum. So, according to the stochastic gradient descent algorithm, we may solve for the optimal values of \(\alpha \) and \(\beta \). For example, we may compute the optimal value of \(\alpha \) by the following formulas:

$$\begin{aligned} \left\{ \begin{aligned} \alpha ^{(t)}&=\alpha ^{(t-1)}-\lambda \cdot \frac{\partial F(C|\alpha ^{(t-1)},\beta ^{(t-1)})}{\partial \alpha ^{(t-1)}}\\&\quad \ldots \\ \alpha ^{(2)}&=\alpha ^{(1)}-\lambda \cdot \frac{\partial F(C|\alpha ^{(1)},\beta ^{(1)})}{\partial \alpha ^{(1)}}\\ \alpha ^{(1)}&=\alpha ^{(0)}-\lambda \cdot \frac{\partial F(C|\alpha ^{(0)},\beta ^{(0)})}{\partial \alpha ^{(0)}} \end{aligned} \right. \end{aligned}$$
(1)

where \(\alpha ^{(0)}\) is the initial value of \(\alpha \) and \(\lambda \) is a step value. Similarly, we may solve the optimal value of \(\beta \). Finally, based on \(\alpha \) and \(\beta \), we may get a message propagation model with regard to the features mentioned in Sect. 6.1.
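The update rule in Eq. (1) may be illustrated with the following sketch. It is deliberately simplified and is not the procedure used in the experiments: it performs plain (full-batch) gradient descent rather than the stochastic variant, and it approximates the partial derivatives by central finite differences instead of the analytic derivatives derived above. The function names, the step value and the stopping tolerance are assumptions.

```python
def numerical_gradient(F, theta, h=1e-6):
    """Central finite-difference approximation of the gradient of F at theta."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += h
        down[k] -= h
        grad.append((F(up) - F(down)) / (2 * h))
    return grad


def gradient_descent(F, theta0, step=0.1, iters=500, tol=1e-10):
    """Repeat theta <- theta - step * grad F(theta) until F stops changing."""
    theta = list(theta0)
    for _ in range(iters):
        grad = numerical_gradient(F, theta)
        new_theta = [t - step * g for t, g in zip(theta, grad)]
        if abs(F(new_theta) - F(theta)) < tol:
            return new_theta
        theta = new_theta
    return theta


# Toy objective with its minimum at (3, -1), standing in for F(C | alpha, beta).
toy = lambda th: (th[0] - 3.0) ** 2 + (th[1] + 1.0) ** 2
print(gradient_descent(toy, [0.0, 0.0]))
```

In the model itself, `theta` would hold the concatenated entries of \(\alpha \) and \(\beta \), and `F` would be the objective function \(F(C|\alpha ,\beta )\).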

7 Experiments and analysis of the proposed scheme

7.1 Data set

Using the open API of Weibo, the social networking software from the Sina companyFootnote 7, we use a crawler tool to collect data sets from Weibo and then process them into our experimental data sets. Additionally, we download other data sets to test our proposed scheme, such as YouTube (Dang and Viennet 2012) and Digg (Lin et al. 2011).

  (1) We process the data sets in advance according to the years of data accumulation and then construct networks of different scales, as shown in Table 1. In Table 1, the size of the 4 networks increases steadily over time; thus, the efficiency of the proposed community detection algorithm can be tested, where the algorithm reduces its randomness and contingency by introducing the interest similarity of network nodes.

  (2) We randomly choose some network nodes as the initial nodes and collect the users’ personal information and the corresponding contents over a period of time, including the number of original posts, the number of forwarded posts, the total number of posts, the number of followed users, the number of mutual followers, the number of fans, the information content (including tags and links) and hot topics. Then, according to the follow relationships of the Weibo users, we extract the corresponding data from the original data sets, and the preprocessed data are used as the experimental data set, shown in Table 2.

  (3) To test the performance of the community detection algorithm on different data sets, we download the YouTube and Digg data sets, where the preprocessed data are used as the experimental data sets, shown in Table 3.

7.2 Community detection experiment

In the experiment, the network data shown in Tables 1 and 3 are processed and divided into different communities; thus, we obtain the size distribution of the communities. To show the compared results clearly, we vary the internal factor value \(\eta \), the threshold value T and the control parameter \(\varpi \) and observe how each parameter influences the results of community detection, where T and \(\varpi \) are changed at the same time. When we set the personal willingness to 0, 1/2 and 1 (maximum), the numbers of communities and of the corresponding community members detected by the proposed algorithm are shown in Figs. 2 and 3.

Table 1 Different scale network data
Table 2 Preprocessed network data
Table 3 Preprocessed network data
Fig. 2 The number of detected communities

Fig. 3 The number of members of the detected maximal community

As Fig. 2a shows, the greater the personal willingness is, the fewer communities are detected. This indicates that personal willingness influences the detection results and that the detected communities are more stable. Comparing Fig. 2a and d, when the value \(\eta \) becomes larger, the number of detected communities becomes slightly smaller, so \(\eta \) has only a slight influence on the results. As shown in Fig. 2a–c, when the values T and \(\varpi \) are changed, the number of detected communities becomes smaller, which indicates that T and \(\varpi \) have an obvious influence on the results. Therefore, node interest and personal willingness are both important to the results of community detection. Taking Fig. 2 as a whole, the parameters influence the results to different degrees: when the values \(\eta \), T and \(\varpi \) are changed, the number of detected communities also changes. This indicates that communities are strictly detected and that the detected communities are more stable. Additionally, for the YouTube and Digg data sets, we obtain results similar to those for Weibo, and the numbers of communities detected on YouTube and Digg are similar to that of the I4 data set.

As shown in Fig. 3g, the greater the personal willingness is, the fewer members the detected max-size community has. This indicates that when personal willingness becomes larger, large communities are less likely to be generated. Comparing Fig. 3g and j, changing the value \(\eta \) makes the number of members of the detected max-size community slightly smaller, so \(\eta \) has only a slight influence on that number. As shown in Fig. 3g–i, the values \(\eta \), T and \(\varpi \) all influence the number of members of the detected max-size community: when they are changed, that number becomes smaller, and large communities are less likely to be generated. Similarly, for the YouTube and Digg data sets, we obtain results similar to those for Weibo. Therefore, the community detection algorithm can make community detection more comprehensive and reasonable by introducing node interest and personal willingness. The proposed algorithm may reduce the probability of generating large-scale communities so as to improve the accuracy and reliability of community detection and increase the stability of community structure.

In the experiment, we also compare the GN (Girvan and Newman 2002) and NM (Newman 2004) algorithms with our proposed community detection algorithm in terms of the number of detected communities and the number of members of the detected maximal community. The GN algorithm is a community structure discovery algorithm based on edge betweenness. The NM algorithm is a greedy community discovery algorithm based on network modularity.

Fig. 4 The performance of different algorithms

As shown in Fig. 4, because our proposed algorithm strictly screens the community members to form the communities, the numbers of communities and members of the maximal community detected by our proposed algorithm are both less than those of the GN and NM algorithms. So, compared with the other two algorithms, our proposed algorithm may reduce the probability of generating large-scale communities and increase the stability of community structure.

According to Table 1, we run experiments on the 4 network data sets of different scales. Given \(\eta =0.5\), \(T=0.5\) and \(\varpi =3\), we set the personal willingness to 0, 1/2 and 1 and measure the efficiency of the proposed algorithm in each case, where the average degree of the networks is 6 and the number of edges increases continuously from 5000 to 30,000. As shown in Fig. 5, as the number of edges increases, the greater the personal willingness is, the shorter the execution time of the proposed algorithm is. Likewise, for the same number of edges, a greater personal willingness gives a shorter execution time. On the curve for a personal willingness of 1/2, the execution time increases relatively quickly at first as the number of edges grows, and once the number of edges reaches a certain value (\(2.5\times 10^4\)), the increase becomes relatively slow. Thus, when the personal willingness is increased, the increment of the running time of the proposed algorithm decreases as the number of edges increases. So, the greater the personal willingness is, the higher the efficiency of the proposed algorithm is.

Fig. 5 The computation time of the proposed algorithm with different personal willingness values

In summary, when we run the same experiments on the Weibo, YouTube and Digg data sets, the experimental results show that the community detection algorithm can make community detection more comprehensive and reasonable, because the proposed algorithm introduces node interest and personal willingness to strictly detect community members. Also, compared with the other algorithms (such as the GN and NM algorithms), the numbers of communities and of members of the maximal community detected by the proposed algorithm are the smallest. Therefore, our proposed algorithm may reduce the probability of generating large-scale communities so as to improve the accuracy and reliability of community detection and increase the stability of community structure.

7.3 Message propagation experiment

In the experiment, the message propagation model is tested on the network data from Table 2, where we first need to solve for the values of \(\alpha \) and \(\beta \) by the stochastic gradient descent algorithm. Additionally, because the original data set lacks the attribute of personal willingness, we preprocess the data set by randomly assigning values of personal willingness to it. To show the contrast more clearly, the weights of the extracted attributes obtained from the experiment are standardized and shown in Fig. 6. From Fig. 6, we can see that the attributes with the largest weight proportions are node propagation characteristic, interest similarity, node propagation willingness and node activity. In the social network of Weibo, the user’s interest degree, activity degree and propagation willingness largely determine whether the user continues to spread the messages. Also, the propagation characteristics describe the characteristics of the propagated message, which further shows that users with common interests and larger propagation willingness are more likely to spread the related messages.

Fig. 6 The comparison of attribute weights in message propagation model

Figure 7 shows the values of \(\alpha \) and \(\beta \) solved by the stochastic gradient descent algorithm when the personal willingness is set to different values, where \(\alpha = (\alpha _0 ,\alpha _1^T ,\alpha _2^T ,\alpha _3^T )\) and \(\beta = (\beta _0 ,\beta _1^T ,\beta _2^T ,\beta _3^T )\). In Fig. 7o, the values of \(\alpha \) increase with the values of personal willingness, which shows that personal willingness may influence the solutions of the message propagation model (the greater the personal willingness is, the faster the objective function of the stochastic gradient descent algorithm converges). Additionally, compared with \(\alpha _0\), \(\alpha _1\) and \(\alpha _2\), the value of \(\alpha _3\), which is related to personal willingness, changes faster. Figure 7p shows similar results. So, from Fig. 7, we can see that personal willingness can influence the construction of the message propagation model.

Fig. 7 The changing solutions of \(\alpha \) and \(\beta \)
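To illustrate how a parameter vector such as \(\alpha \) can be solved by stochastic gradient descent, the sketch below fits a logistic model on synthetic data. It is only an assumption-laden example: the objective function of the paper is not restated in this section, so a logistic log-loss is swapped in; each of \(\alpha _1\), \(\alpha _2\), \(\alpha _3\) is reduced to a single scalar; and the data, learning rate and function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(samples, labels, lr=0.1, epochs=200, seed=0):
    """Fit theta = (theta_0, theta_1, ...) by SGD on the logistic log-loss.

    Assumption: the log-loss is a stand-in for the paper's objective function.
    """
    rng = random.Random(seed)
    theta = [0.0] * (len(samples[0]) + 1)  # theta[0] is the intercept
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for k in order:
            x, y = samples[k], labels[k]
            score = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
            err = sigmoid(score) - y  # gradient of the log-loss w.r.t. the score
            theta[0] -= lr * err
            for j, v in enumerate(x):
                theta[j + 1] -= lr * err * v
    return theta


# Synthetic data: features are (user attribute, content attribute, personal
# willingness); a message is propagated only when willingness exceeds 0.5.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [1 if x[2] > 0.5 else 0 for x in X]
theta = sgd_logistic(X, y)
```

On this toy data the coefficient of the willingness feature, `theta[3]`, dominates the others, which mirrors in a simplified way the observation that the component tied to personal willingness changes fastest.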

Fig. 8 Node activities with different personal willingness values

Fig. 9 Node propagation willingness with different personal willingness

Fig. 10 Node propagation characteristics with different personal willingness

A certain number of nodes are also selected from the tested network data for the experiment. We randomly select three groups of related nodes (of sizes 5, 10 and 20, respectively) and index them as 1–5, 1–10 and 1–20. According to the relationship between the attributes and the nodes, we extract the attributes of the corresponding nodes and then analyze node activity, node propagation willingness and node propagation characteristic in Figs. 8, 9 and 10. In addition, we randomly select 6 related nodes to analyze the interest similarity between any two nodes in Fig. 11.
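The group selection can be sketched as below. This is a hedged illustration only: it samples uniformly from a list of node identifiers and ignores the requirement that the nodes be related, which the source does not define here. The function name and the node identifiers are hypothetical, and the three groups are drawn independently, so they may share nodes.

```python
import random


def select_groups(node_ids, sizes=(5, 10, 20), seed=0):
    """Randomly pick one group per size and index its members from 1.

    Returns a dict mapping each group size to {index: node_id}.
    """
    rng = random.Random(seed)
    groups = {}
    for size in sizes:
        chosen = rng.sample(node_ids, size)  # sampling without replacement
        groups[size] = {index: node for index, node in enumerate(chosen, start=1)}
    return groups


# Hypothetical node identifiers 0..999 standing in for the tested network.
groups = select_groups(list(range(1000)))
```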

(1) Node activity. As shown in Fig. 8, the black bar for a personal willingness value of 0 does not appear in the diagram, because a node is inactive when its personal willingness is 0. Overall, for all three group sizes (5, 10 and 20), the greater the personal willingness, the larger the corresponding bar. This shows that personal willingness affects node activity: the greater the personal willingness, the more active the node. Personal willingness is therefore influential in the message propagation model.

(2) Node propagation willingness. As shown in Fig. 9, the black bar for a personal willingness value of 0 does not appear in the diagram, because the node refuses to communicate. Overall, for all three group sizes (5, 10 and 20), the greater the personal willingness, the larger the corresponding bar. This shows that personal willingness affects node propagation willingness: the greater the personal willingness, the more willing the node is to propagate messages.

(3) Node propagation characteristic. As shown in Fig. 10, the black bar for a personal willingness value of 0 does not appear, as in the previous figures. Overall, for all three group sizes (5, 10 and 20), the greater the personal willingness, the larger the corresponding bar. This shows that personal willingness affects the node propagation characteristic: the greater the personal willingness, the greater the value of the node propagation characteristic. A node with a greater value is therefore more willing to propagate messages.

(4) Interest similarity. By definition, \(\mathrm{Int}(i,j)\) denotes the interest similarity between node i and node j. In Fig. 11, the black bar for a personal willingness value of 0 does not appear, as in the previous figures. Because two nodes with a personal willingness value of 0 are both in a state that rejects outside interests, the interest similarity between them has no practical significance. Overall, the greater the personal willingness, the larger the corresponding bar. This shows that personal willingness affects interest similarity: the greater the personal willingness, the greater the interest similarity. Nodes with greater values are therefore more willing to propagate similar messages.

Fig. 11 Interest similarities with different personal willingness

8 Conclusions

Mining community structure in complex social networks, studying message propagation models, exploring their correlation with personal willingness and analyzing the related impact factors in depth are important for preventing network crime and monitoring public opinion online. We therefore propose in this paper a social network community detection and message propagation scheme based on personal willingness. (1) We present a community detection algorithm that uses module degree together with interest degree and personal willingness to detect communities. In addition, based on personal willingness, we use edge intimacy degree and node interest degree as the reference relationships for dividing the community structure, so as to improve the quality of community detection and reduce the size of large communities. The experiment shows that the proposed algorithm based on personal willingness can improve the stability and reliability of online social communities. (2) We present a message propagation model based on personal willingness, in which we extract the characteristics of user attributes and propagated message content to build the model according to propagation probability and propagation delay. The proposed model takes full account of the initiative and effectiveness of users and ensures the quality of message propagation. The experiment shows that personal willingness may influence the construction of the message propagation model: the greater the personal willingness, the faster the objective function of the stochastic gradient descent algorithm converges.