1 Introduction

Social networks in the real world can be modeled through a complex network (Ghoshal et al. 2014; Rossi and Ahmed 2019; Vasudevan and Deo 2012). In ordinary complex networks, the links between nodes represent a certain connection between individuals in social networks. Then, in the signed network, the links between nodes imply the positive and negative attribute relations among individuals in the social network. If there is a positive connection between nodes, it is shown as positive links, and vice versa. And in the signed network, there is a certain community structure; that is, the nodes within the community are mostly with positive links, and the nodes among different communities are mostly with negative links (Davis 1967). The above theory shows that in social network, individuals with a certain cooperative relationship are in the same cluster, and individuals in different clusters have a certain competitive relationship. Moreover, if the nodes in the signed network are positively or negatively connected with the other two nodes, then the two nodes are more inclined to produce positive links. Generally, in a signed network constructed by an online social network, positive links indicate “support”, “like” or “cooperation”, while negative links indicate “opposing”, “dislike” or “competition”. For example, users on the Slashdot website (Leskovec et al. 2010) can mark other users as friends or enemies based on other users’ comments, and in consumer review site Epinions.com (Leskovec et al. 2010) which is a who-trusted-whom online social network, members of the site can decide whether to “trust” each other. Compared with the common complex network, the signed network with positive and negative attributes can contain more information when it represents the social network, so the analysis of the signed network has attracted more and more attention in the field of social network analysis.

Community detection and link prediction are two basic issues in signed network analysis. In a signed network, community detection is to find community structures that are represented as dense positive links in the same community and intensive negative links in different communities (Yang et al. 2007), and link prediction is to predict the states of unknown links like positive or negative (Li et al. 2018a). Community detection can detect the community structure in social networks, which is helpful to analyze the grouping of individuals in social networks. Link prediction can predict the connection status of individuals in the social network in the next stage and can be used in the recommendation system. Therefore, the community detection algorithm and link prediction algorithm of the signed network are very helpful for the analysis of the social network.

Although some algorithms for community detection and link prediction in signed networks have been proposed in recent years, their development is still immature and not proven or still being developed. For example, some algorithms (Li et al. 2014; Anchuri and Magdon-Ismail 2012) based on optimization objective functions and heuristic-based algorithms (Yang et al. 2017; Zhao et al. 2017) have high computational complexity. Some model-based algorithms (Yang et al. 2007; Jiang 2015) have low accuracy in performance or need probabilistic statistical inference methods to select models, such as EM algorithm, resulting in a large computational burden. Some algorithms are based on deep learning (Wang et al. 2017, 2018) with high computational performance but poor interpretability. And most of the above algorithms can only be used for community detection or link prediction. Faced with these challenges, we propose a new RC-NMF model for community detection and link prediction.

In this paper, we introduce a graph regularization based on the convex nonnegative matrix factorization (Convex-NMF) algorithm. Convex-NMF is an improvement of the semi-NMF algorithm, which constrains the base matrix in the semi-NMF by adding a weight matrix. The introduced graph regularization can simultaneously constrain the nodes with positive links to enter the same community and the nodes with negative links to enter different communities. In addition to being used for community detection, our proposed RC-NMF can also be used for link prediction. We have compared experiments with other current advanced algorithms on artificially generated signed network datasets and real large-scale signed networks and proved the validity and accuracy of our proposed RC-NMF algorithms.

The structure of the paper is as follows: In Sect. 2, we introduce the related work on signed network community detection and link prediction algorithms in recent years. In Sect. 3, we introduce the RC-NMF algorithm we proposed in detail. Section 4 shows the comparison of other state-of-the-art algorithms on artificial signed network datasets and large-scale real-signed network datasets, which verifies the validity and accuracy of our proposed algorithm. Section 5 summarizes our contributions.

2 Related work

In recent years, a large number of algorithms have emerged for signed network community detection and link prediction. These algorithms can be broadly divided into the following categories: modularity optimization-based, balance theory-based, model-based and deep learning-based.

Modularity optimization-based methods The standard modularity is developed for unsigned networks, and it measures how far the real positive connections deviate from the expected random connections. And standard modularity optimization is essentially a discrete combination problem (Newman 2016). The communities in the network can be detected by optimizing the modularity objective function (Newman 2016). But this standard modularity optimization method was initially only applicable to unsigned networks. Li et al. (2014) defined signed modularity by improving standard modularity in the unsigned network and made it capable of handling negative links. Signed modularity balances the trend of users with positive links to forming community and the trend of users with negative links to destroying community by adding weights on positive and negative components in signed networks. Based on the above, some heuristics algorithms based on signed modularity optimization have been proposed. For example, Anchuri and Magdon-Ismail (2012) generalized spectral partitioning (SpePart) approach with iterative optimization to explore the community in the signed network, which is an extension about standard modularity optimization in the unsigned network.

Balance theory-based methods In the 1940s, Heider (1946) introduced the balance theory that the two positively related individuals had the same attitude toward the third person by studying perceptions and attitudes of individuals, which generally implies that “the friend of my friend is my friend” and “the enemy of my enemy is my friend.” In the 1950s, Cartwright and Harary (1956) further developed the theory in the graph theoretical at the group level and validated the Harary and Frank (1953) that a signed graph is balanced if and only if nodes can be divided into two mutually exclusive clusters such that intra-links are positive and inter-links are negative. Therefore, the theory was developed in more than two clusters (Kulakowski et al. 2019), which introduced that a weakly balanced graph exists a partition of the nodes into k clusters just as nodes with positive links are in the same cluster and nodes with negative links are between different clusters. Based on this theory, we can find the community structure of the signed social network by cutting off all negative links. However, the signed social networks in the real world have been normally unbalanced since the existence of frustration that presents itself as the positive inter-links and the negative intra-links. To address this challenge, many algorithms for signed network analysis based on structural balance theory are proposed. For example, Chiang et al. (2014) extended the applicability of the balance theory from the local features to the global features in the signed network. Amelio and Pizzuti (2016) developed a correlation clustering method (CC) that maximizes positive links in a community and negative links between communities or minimizes frustration to detect community in signed networks. Li et al. (2018a) presented a novel framework including two implicit features and two latent features for predicting link, one of which is obtained by balance theory. Derr et al. (2020) used the theory of structural balance among individuals to predict link and interaction polarity in signed networks.

Model-based methods This type of methods focuses on modeling the generated mechanism which tends to apply to the network. For example, Yang et al. (2007) proposed an agent-based random walk model framework (FEC), which is the two-stage approach. First, FC (finding community) is conducted on the positive component of network based on a random walk model, and then, EC ( extracting community ) is conducted by minimizing predefined signed cut criteria according to the links of nodes obtained in the first stage (Yang et al. 2007). The algorithm is capable of giving nearly optimal solutions in linear time concerning the size of a network, but its performance is poor. Chen et al. (2014) proposed a novel approach named signed probabilistic mixture (SPM) model for overlapping community detection. Some of the above methods are based on optimization objectives or heuristic to detect a community structure in the signed network and do not care about the generation of the network. Jiang (2015) proposed a generalized stochastic block model that is the signed stochastic block model (SSBM) to explore the mesoscopic structures in signed networks from a node perspective where each node is assigned to a block or community and links are independently generated for pairs of nodes. Yang et al. (2017) adopted the variational Bayes EM algorithm to estimate the parameters and select model by approximate Bayesian model evidence based on the signed stochastic blockmodel (SSBM) that was proposed to characterize and generate the block structures of signed networks by explicitly formulating the link density and sign based on a stochastic perspective. Zhao et al. (2017) presented a statistical inference method in signed networks (SISN) for community detection, in which a probabilistic model is presented to model signed networks and an expectation–maximization (EM)-based parameter estimation method is deduced to find communities in signed networks. Li et al. (2018b) proposed a regularized semi-nonnegative matrix tri-factorization (Res-NMTF), which splits the matrix in the traditional semi-nonnegative matrix factorization into three terms and adds regularization on this basis to constrain nodes with negative links into different communities. However, this algorithm does not consider the nodes with positive links, so the accuracy is relatively low.

Deep learning-based methods With the rise of deep learning, some algorithms based on machine learning to discover community structure and link prediction emerged. Wang et al. (2017) proposed a novel framework SNEA (social network embedding with attributes), which exploits the network structure and user attributes simultaneously for network representation learning. Although the performance of deep learning is better than some traditional algorithms, the interpretation of these models is weak. Wang et al. (2018) proposed a novel and flexible end-to-end signed heterogeneous information network embedding (SHINE) framework to predict the sign of unobserved links. The SHINE framework gets the implicit low-dimensional vectors of nodes in the network through deep autoencoders and then does the similarity analysis of the nodes on this basis.

3 Our work

3.1 Convex Nonnegative Matrix Factorization (NMF)

Convex nonnegative matrix factorization (Convex-NMF) (Jordan 2009) is the improvement of semi-nonnegative matrix factorization (semi-NMF). Semi-NMF is one of the most popular methods for community detection in signed network, formulated as the following model:

$$\begin{aligned}&L=\left\| A^{\pm }-F^{\pm }H^T\right\| _F^2\nonumber \\&s.t \quad F \in R_{\pm }^{N\times C}, \quad H \in R_{+}^{N \times C} \end{aligned}$$
(1)

where A is the adjacency matrix of the signed network G with N nodes, F is the basis matrix, H is the community indicators matrix where element \(h_{jc}\) is the propensity of node j in community c, and C is the community amount.

Convex nonnegative matrix factorization (Convex-NMF) constrains the basis matrix F in semi-NMF based on the definition that F lies within the column space of adjacency matrix A, formulated as the following model:

$$\begin{aligned}&L=\left\| A^{\pm }-A^{\pm }W^{+ }H^T\right\| _F^2\nonumber \\&s.t \quad F=A W, \quad W \in R_{+}^{N \times C}, \quad H \in R_{+}^{N \times C} \end{aligned}$$
(2)

where W is the weight matrix.

3.2 Regularized Convex-NMF model(RC-NMF)

Because of the influence of negative links in signed networks, we introduce graph regularization (Zheng and Skillicorn 2015) to minimize the positive ratio cut and negative ratio association simultaneously. The regularization that constrains the nodes connected with negative links to be distributed into different communities and the nodes connected with positive links which are in the same communities simultaneously is defined as follows:

$$\begin{aligned} \min \sum _{j=1}^{k} \frac{h_{j}^{\top }\left( D^{p}-A\right) h_{j}}{h_{j}^{\top } h_{j}} \end{aligned}$$
(3)

where A is the adjacency matrix of the signed network, \({D}^{p}\) is a diagonal matrix with \(D_{i i}^{p}=\sum _{i} {A}_{i j}^{p}\), and \(h_{j}\) is jth vector of the community indicator matrix H.

In addition, we can observe that \({\text {tr}}(H^{\top }\left( {D}^{p}-A\right) H)=\sum _{j=1}^{k} \frac{h_{j}^{\top }\left( D^{p}-A\right) h_{j}}{h_{j}^{\top } h_{j}}\), and then, the regularization term optimization problem can be obtained by addressing the following optimization problem:

$$\begin{aligned} \min _{H \in R^{n \times k}} {\text {tr}}\left( H^{T} (D^{p}-A) H\right) \text{ s.t. } H_{i j} \ge 0, \forall i, j \end{aligned}$$
(4)

Furthermore, to make the node try its best to belong to only one community, we use \(\left\| H\right\| _1^2\) to control the sparsity of the node probability matrix.

Combining the regularization term into Convex-NMF, the final RC-NMF model objective function is:

$$\begin{aligned} \begin{aligned} L=&\left\| A-A W H^T\right\| _F^2+\lambda {\text {tr}}(H^{\top }(D^{p}-A)H)+\gamma \left\| H\right\| _1^2 \end{aligned} \end{aligned}$$
(5)

3.3 Update rules for RC-NMF model

We design multiplicative update rules to solve (5). The object function can be written as:

$$\begin{aligned} \begin{aligned} L&=\left\| A-A W H^{T}\right\| _{F}^{2}+\lambda {\text {tr}}\left( H^{T} (D^{p}-A) H\right) +\gamma \Vert H\Vert _{1}^{2}\\&= {\text {tr}}\left( \left( A-A W H^{T}\right) \left( A-A W H^{T}\right) ^{T}\right) \\&\quad +\lambda {\text {tr}}\left( H^{T} (D^{p}-A) H\right) +\gamma {\text {tr}}\left( H N H^{T}\right) \\&={\text {tr}}\left( A^{T}A-2 H^{T} A^{T} A W+W^{T} A^{T} X W H^{T} H\right) \\&\quad +\lambda {\text {tr}}\left( H^{T} (D^{p}-A) H\right) +\gamma {\text {tr}}\left( H N H^{T}\right) \\&\quad s.t \quad W \ge 0 \quad \& \quad H \ge 0 \end{aligned} \end{aligned}$$
(6)

where N is a matrix of size \(k\times k\) whose elements are 1. In the face of constrained optimization problems, we construct Lagrange functions of W and H,

$$\begin{aligned} J(W)= & {} \left( -2 H^{T} A^{T} A W+W^{T} A^{T} X W H^{T} H\right) \nonumber \\&-\,\beta W \end{aligned}$$
(7)
$$\begin{aligned} J(H)= & {\text {tr}}\left( -2 H^{T} A^{T} A W+W^{T} A^{T} X W H^{T} H\right) \nonumber \\&+\,\lambda {\text {tr}}\left( H^{T} (D^{p}-A) H\right) +\gamma {\text {tr}}\left( H N H^{T}\right) \nonumber \\&-\,\beta H \end{aligned}$$
(8)

where \(\beta\) is Lagrange multiplier for \(W \ge 0\) and \(H \ge 0\). First, update W(fixing H), take the derivative of W’s Lagrange function J(W) and set it at zero, and we get the following:

$$\begin{aligned} \frac{\partial J}{\partial W}=-2 A^{T} A H+2 A^{T} A W H^{T} H - \beta = 0 \end{aligned}$$
(9)

because of \(W \ge 0\) and using the KKT conditions \(\beta W = 0\), we can get the following:

$$\begin{aligned}&\left( -2 A^{T} A H+2 A^{T} A W H^{T} H\right) W = \beta W = 0 \end{aligned}$$
(10)
$$\begin{aligned}&\left( - A^{T} A H+ A^{T} A W H^{T} H\right) W^{2} = \beta W^{2} = 0 \end{aligned}$$
(11)

where \(A^{T}A=(A^{T}A)^{+}-(A^{T}A)^{-}\), and we can get the following:

$$\begin{aligned}&\left[ (-((A^{T}A)^{+}H-(A^{T}A)^{-}H))+((A^{T}A)^{+}WH^{T}H\right. \nonumber \\&\qquad \left. -(A^{T}A)^{-}WH^{T}H)\right] W^{2}=0 \end{aligned}$$
(12)
$$\begin{aligned}&\left[ (A^{T}A)^{-}H+(A^{T}A)^{+}WH^{T}H\right] W^{2} \nonumber \\&\quad =\left[ (A^{T}A)^{+}H+(A^{T}A)^{-}WH^{T}H\right] W^{2} \end{aligned}$$
(13)
$$\begin{aligned}&W_{ik}^{2} = W_{ik}^{2} \frac{(A^{T}A)^{+}H+(A^{T}A)^{-}WH^{T}H}{(A^{T}A)^{-}H+(A^{T}A)^{+}WH^{T}H} \end{aligned}$$
(14)

Similarly, we update H(fixing W), take the derivative of H’s Lagrange function J(H) and set it at zero, and we get the following:

$$\begin{aligned} \frac{\partial L}{\partial H}= & -2 A^{T} A W+2 H W^{T} A^{T} A W\nonumber \\&+\,2\lambda (D^{p}-A)H+2\gamma H N - \beta = 0 \end{aligned}$$
(15)

because of \(H \ge 0\) and using the KKT conditions \(\beta H = 0\), we can get the following:

$$\begin{aligned}&\left( -2 A^{T} A W+2 H W^{T} A^{T} A W +\,2\lambda (D^{p}-A)H+2\gamma H N\right) \nonumber \\&\quad H_{ik} = \beta H_{ik} = 0 \end{aligned}$$
(16)
$$\begin{aligned}&\left( - A^{T} A W+ H W^{T} A^{T} A W+\lambda (D^{p}-A)H +\gamma H N\right) \nonumber \\&\quad H_{ik}^{2} = \beta H_{ik}^{2} = 0 \end{aligned}$$
(17)

where \(A^{T}A=(A^{T}A)^{+}-(A^{T}A)^{-}\) and \(D^{p}-A = ((D^{p}-A)^{+}-(D^{p}-A)^{-})\), and we can get the following:

$$\begin{aligned}&\left[ -((A^{T}A)^{+}W-(A^{T}A)^{-}W)+(HW^{T}(A^{T}A)^{+}W\right. \nonumber \\&\qquad -\,HW^{T}(A^{T}A)^{-}W)\nonumber \\&\qquad +\,(\lambda (D^{p}-A)^{+}H- \lambda (D^{p}-A)^{-}H)\nonumber \\&\qquad \left. +\, \gamma HN \right] H_{ik}^{2} = 0 \end{aligned}$$
(18)
$$\begin{aligned}&\left[ (A^{T}A)^{-}W)+HW^{T}(A^{T}A)^{+}W \right. \nonumber \\&\qquad \left. +\,\lambda (D^{p}-A)^{+}H + \gamma HN\right] H_{ik}^{2}\nonumber \\&\quad =\left[ (A^{T}A)^{+}W)+HW^{T}(A^{T}A)^{-}W \right. \nonumber \\&\qquad \left. +\,\lambda (D^{p}-A)H^{-}\right] H_{ik}^{2} \end{aligned}$$
(19)
$$\begin{aligned}&H_{ik}^{2} =H_{ik}^{2} \nonumber \\&\quad \frac{(A^{T}A)^{+}W+HW^{T}(A^{T}A)^{-}W +\lambda (D^{p}-A)^{-}H}{(A^{T}A)^{-}W+HW^{T}(A^{T}A)^{+}W +\lambda (D^{p}-A)^{+}H + \gamma HN} \end{aligned}$$
(20)

Finally, we can get the updating rules of W and H as follows by formula (14) and formula (20):

$$\begin{aligned}&W_{i k} \leftarrow W_{i k} \sqrt{\frac{\left[ \left( A^{T} A\right) ^{+} H\right] _{i k}+\left[ \varPsi _{2} H^{T} H\right] _{i k}}{\left[ \left( A^{T} A\right) ^{-} H\right] _{i k}+\left[ \varPsi _{1} H^{T} H\right] _{i k}}} \end{aligned}$$
(21)
$$\begin{aligned}&H_{i k} \leftarrow H_{i k} \sqrt{\frac{\left[ \varPsi _{1}\right] _{i k}+\left[ H W^{T}\varPsi _{2}\right] _{i k}+\left[ \lambda \varPsi _{3}^{-} H\right] _{i k} }{\left[ \varPsi _{2}\right] _{i k}+\left[ H W^{T}\varPsi _{1}\right] _{i k}+\left[ \lambda \varPsi _{3}^{+} H\right] _{i k}+\left[ \gamma H N\right] }} \end{aligned}$$
(22)

where \(\varPsi _{1}=\left( A^{T} A\right) ^{+} W\), \(\varPsi _{2}=\left( A^{T} A\right) ^{-} W\) and \(\varPsi _{3}=\left( D^{p}-A\right)\). And the iterative update strategy for the model is shown in Algorithm 1.

figure a

3.4 Computational complexity

The computational complexity of updating W of the proposed RC-NMF algorithm is \(O(niter(N^{2}K+NK^{2}))\), and that of updating H is \(O(niter(N^{2}K+NK^{2}))\), where niter is the number of iterations and K is the number of community. It is worth noting that in the real world, the signed network is sparse, so \(N^{2}\) can be represented as the number of links M in the network. Moreover, the number of community K is far less than the number of nodes N, so \(K^{2}\) can be ignored. Therefore, the computer complexity of the optimization algorithm for the proposed RC-NMF can degrade to \(O(niter(M + N))\).

4 Experiments

In this section, we designed a series of experiments in synthetic data and real-world signed networks to validate our model including the convergence of our algorithm. In order to ensure that our RC-NMF algorithm yields the best results, we made relevant parameter sensitivity experiments in Sect. 4.4 and determined the optimal parameter as: \(\lambda =3\) and \(\gamma =7\).

4.1 Experiments on synthetic signed networks

In this section, we design a series of control experiments of our RC-NMF model with other algorithms on the artificial signed network dataset and the real large-scale signed network dataset to verify the performance of our model in community detection and link prediction. Finally, we analyze the convergence of our RC-NMF algorithm.

4.1.1 Validation of community detection

Synthetic signed networks SG benchmark network (Yang et al. 2017): The SG benchmark network (Yang et al. 2017) is evolved from the GN benchmark network (Yang et al. 2007). The GN benchmark network includes four parameters c, n, k and \(p_{in}\), which can only generate ordinary complex networks. Where c is the community number, n is the number of nodes in each community, k is the average degree in the network, and \(p_{in}\) denotes the probability of internal links. On this basis, the SG benchmark network can generate signed network, which adds two noise-level parameters \(p_{+}\) and \(p_{-}\) that represent the prior probability of positive inter-links and negative intra-links, respectively. As \(p_{in}\) decrease or noise level increases, the community structures will become less clear and more difficult to be detected, we set the parameters as follows: \(\hbox {c} = 4\); \(\hbox {n} = 32\); \(\hbox {k} = 16\) and generate two kinds of SG networks:

Type I Weakly balanced signed network(SG-BN)

There is no noise in SG-BN, i.e., \(p_{+}=0\) and \(p_{-}=0\). The parameter \(p_{in}\) is from 0.1 to 0.9. The larger the \(p_{in}\) value is, the clearer the community structure of the signed network tends to be.

Type II Unbalanced signed network(SG-UN)

Noise is added with different levels, i.e., \(p_{+}\) from 0 to 0.5 in 0.05 steps and \(p_{-}\) from 0 to 0.5 in 0.05 steps, the parameter \(p_{in}\) is set to 0.8. We set the threshold of noise level to 0.5 because if the noise is too large, the generated network model does not satisfy the characteristics of the signed network that the links in the same community are mostly positive and the links in different communities are mostly negative.

SLFR benchmark network (Yang et al. 2017): The signed Lancichinetti–Fortunato–Radicchi (SLFR) benchmark network (Yang et al. 2017) is derived from Lancichinetti–Fortunato–Radicchi (LFR) (Lancichinetti et al. 2008). Compared with GN benchmark network above, LFR benchmark network considers the non-uniform distribution of node degree and community number and has eight parameters: n, \(k_{avg}\), \(k_{max}\), \(\lambda _{1}\), \(\lambda _{2}\), \(c_{min}\), \(c_{max}\) and \(\mu\), where n is number of nodes, \(k_{avg}\) and \(k_{max}\) represent average degree and maximum degree, respectively, \(\lambda _{1}\) and \(\lambda _{2}\) mean minus exponent for the degree sequence and the community size distribution, respectively, \(c_{min}\) and \(c_{max}\) mean the minimum and maximum of community number, respectively, and \(\mu\) is the mixing parameter that indicates the fraction of edges connecting the neighbors in other communities. On this basis, the SLFR benchmark network can generate signed network, which adds two noise-level parameters \(p+\) and \(p_{-}\) that represent the fraction of positive external links and negative internal links, respectively. As \(p_{in}\) decrease or noise level increases, the community structures will become less clear and more difficult to be detected. In this paper, we set the parameters as follows: \(n=128\), \(k_{avg}=16\), \(k_{max}=20\), \(\lambda _{1}=2\), \(\lambda _{2}=1\), \(c_{min}=20\), \(c_{max}=40\), and generate two kinds of SLFR networks::

Type I Weakly balanced signed network(SLFR-BN)

There is no noise in SLFR-BN, i.e., \(p_{+}=0\) and \(p_{-}=0\). The parameter \(\mu\) is from 0.1 to 0.9. The \(\mu\) value represents the degree of confusion of the community in the signed network. The larger the \(\mu\) value is, the more likely the signed network tends to be in a state of no community structure.

Type II Unbalanced signed network(SLFR-UN)

Noise is added with different levels, i.e., the parameter \(\mu\) is set to 0.2, and \(p_{+}\in [0, 0.5]\) in 0.05 steps and \(p_{-}\in [0, 0.5]\) in 0.05 steps.

Validation metrics As for the accuracy measurement of community detection, we use normalized mutual information(NMI) (Jiao et al. 2018), which is widely used to measure the degree of similarity between predicted community structure and real community structure. The larger the NMI value is, the higher the accuracy of the community division gets. The NMI can be expressed as follows:

$$\begin{aligned} NMI(C,C')=\frac{\sum _{i=1}^{k}\sum _{j=1}^{k}n_{ij}log \frac{n_{ij}n}{n_{i}^{(1)}n_{j}^{(2)}}}{\sqrt{\sum _{i=1}^{k}n_{i}^{(1)}log \frac{n_{i}^{(1)}}{n}\sum _{j=1}^{k}n_{j}^{(2)}log\frac{n_{j}^{(2)}}{n}}} \end{aligned}$$
(23)

where C and \(C^{'}\) denote ground truth and detected community partition by algorithm, respectively, k is the number of the communities, n is the number of nodes, \(n_{ij}\) denotes the number of nodes in ground community i that are assigned in community j in detected community partition, \(n_{i}^{(1)}\) is the number of nodes in true community i, and \(n_{j}^{(2)}\) is the number of nodes in detected community j.

Comparison methods The RC-NMF performance was first tested with respect to the community detection with signed networks. Comparisons were made using three state-of-the-art methods: FEC (Yang et al. 2007), SISN (Zhao et al. 2017) and Res-NMTF (Li et al. 2018b).

Analysis of experimental results In order to more accurately measure the performance of community detection, the comparison experiments were designed in the four different types of signed networks as above 4.1.1. In the two noise-free signed network datasets, SG-BN and SLFR-BN, the x-axis represents the internal connection ratio of the signed network and the confusion of the community structure. In the two unbalanced signed network datasets SG-UN and SLFR-UN, the x- and y-axes represent the internal negative links and external positive links of the signed network, respectively, which are also represented as noise in the signed network.

Fig. 1
figure 1

NMI of community detection in SG-BN and SLFR-BN

Fig. 2
figure 2

NMI of community detection in SG-UN

Fig. 3
figure 3

NMI of community detection in SLFR-BN

The community detection comparison results obtained in the two noise-free signed network datasets of SG-BN and SLFR-BN are shown in Fig. 1. We can observe that in SG-BN 1(a), as the \(p_{in}\) increases, the community structure of the signed network tends to be clearer, and the NMI value also increases. The RC-NMF algorithm we proposed is represented in red. When \(p_{in}>0.1\), the NMI value is always 1, which means that the community divided by the algorithm is the same as the real community. The performance of the other three algorithms is inferior to the RC-NMF algorithm. The SISN algorithm in blue indicates that when \(p_{in}>0.2\), the NMI value is always 1, and the Res-NMTF algorithm is represented in cyan, when \(p_{in}=0.4\) and \(p_{in}=0.5\), the NMI value decreases, indicating that the performance of the algorithm is poor when the number of links in the community and between the communities tends to be equal. The FEC algorithm is represented by green. Although the NMI value shows an upward trend as the \(p_{in}\) increases, the overall NMI has always been at a low value, and its algorithm performance is poor. In SLFR-BN 1(b), as the \(\mu\) value increases, the community structure of the signed network tends to be blurred, and the NMI value also decreases. Before \(\mu\) = 0.9, the NMI value obtained by our RC-NMF algorithm is always 1 showing the best accuracy. The other three algorithms show a downward trend in the process of increasing \(\mu\), but the accuracy is lower than our algorithm. And the community detection comparison results obtained on the unbalanced signed network dataset of SG-UN are shown in Fig. 2. We can observe that with the increase in external positive link \(p_{+}\) and internal negative link \(p_{-}\), which represents the increase in noise level in the signed network, the performance in community detection of RC-NMF algorithm and Res-NMTF algorithm has decreased to some extent. The accuracy of the RC-NMF algorithm we proposed decreased significantly when the internal negative link \(p_{-}\) was increased, and the external positive link \(p_{+}\) had no significant impact on the performance of the algorithm. Moreover, when the internal negative link \(p_{-}\) was less than 0.25, our algorithm result obtained that the NMI was always 1, and its community detection performance was better than other algorithms. Moreover, in the real-world social network, there are fewer negative relationships among individuals in the same cluster, which, as shown in the signed network, there are fewer internal negative edges, and our algorithm has better performance in this case, so it can be well applied to social network analysis. However, in the case of high noise, the performance of our RC-NMF algorithm is worse than that of FEC and SISN, which reflects the poor robustness of our algorithm. Finally, the community detection comparison results obtained on the unbalanced signed network dataset of SLFR-UN are shown in Fig. 3. We can observe that with the increase in external positive link \(p_{+}\) and internal negative link \(p_{-}\), which represents the increase in noise level in signed network, the increase in the detection performance of all algorithms has been reduced to varying degrees. The accuracy of our RC-NMF algorithm is greatly affected by the internal negative links \(p_{-}\), when the internal negative link \(p_{-}<0.2\), the NMI value is 1, and the performance of the algorithm is better than all other algorithms. With internal negative link \(p_{-}>0.2\), the accuracy of the algorithm decreases, which is inferior to the accuracy of the FEC algorithm and the SISN algorithm. And the accuracy of the Res-NMTF algorithm decreased significantly with the increase in internal negative link \(p_{-}\).

4.1.2 Validation of link prediction

Validation metrics To measure the performance of algorithms about link prediction in signed network, we use GAUC (generalized AUC over +1, 0 and − 1) (Song and Meyer 2015), which is an extension of AUC measurement index and can be used to measure the accuracy of the three states of positive links, negative links and un-links. It is very suitable for measuring the accuracy of signed network link prediction, formulated as:

$$\begin{aligned}&\frac{1}{|P|+|N|}\bigg ( \frac{1}{|U|+|N|} \sum \limits _{a_{i} \in P} \sum \limits _{a_{j} \in U \cup N} I\big (L\left( a_{i}\right) >L\left( a_{j}\right) \big ) \nonumber \\&\quad +\,\frac{1}{|U|+|P|} \sum \limits _{a_{i} \in N} \sum \limits _{a_{j} \in U \cup P} I\big (L\left( a_{i}\right) <L\left( a_{j}\right) \big ) \bigg ) \end{aligned}$$
(24)

where |P|, |N| and |U| represent the number of positive links, negative links and un-links in signed networks, respectively. L(\(\cdot\)) is the link score function, and I(\(\cdot\)) is the 0/1 indicator function that if the condition in (\(\cdot\)) comes true, we get 0 loss, otherwise 1 loss. The larger the GAUC, the higher the accuracy of the link prediction algorithm.

Comparison methods Because the above-mentioned algorithm can only be used to detect the community and cannot perform link prediction experiments, we have selected several node similarity indicators and the Res-NMTF algorithm as the comparison algorithm of our RC-NMF algorithm about measuring the performance of link prediction. Several indicators include: CN, Jaccard and Salton.

Analysis of experimental results In order to make the experimental results more accurate and comprehensive, we used the standard fivefold cross-validation for training and testing. Figure 4 shows the performance of our RC-NMF algorithm with the Res-NMTF algorithm and several indicators on the four network models, respectively. In the SG-BN signed network dataset shown in Fig. 4a, the x-axis represents the proportional of the internal links \(p_{in}\). In the SLFR-BN dataset shown in Fig. 4c, the x-axis represents the degree of confusion of the community structure. In the SG-UN data set and the SLFR-UN data set shown in Fig. 4b, d, respectively, the x-axis represents the ratio of the internal negative links and the external positive link, which denote the noise level of the signed network. It can be observed that in SG-BN, as the internal links \(p_{in}\) increase, the community structure tends to be obvious, and the performance of the link prediction algorithms is improved. In SLFR-BN, as the degree of confusion in the community structure \(\mu\) increases, the performance of the link prediction algorithm decreases to varying degrees. In SG-UN and SLFR-UN, the performance of the link prediction algorithm decreases with the increase in noise. The accuracy of our RC-NMF algorithm is always the second best or the best among the comparison results of various algorithms, which indicates that our algorithm is superior in link prediction.

Fig. 4
figure 4

GAUC of link prediction in different synthetic datasets

4.2 Experiments on real-world signed networks

In this section, we compared RC-NMF with the other above-approaches in real-world signed networks to validate the accuracy and effectiveness of our proposed algorithm in community detection and link prediction.

4.2.1 Validation of community detection

Slovene parliamentary party network The network is about the relations among the ten parties in Slovene parliament, 1994, which has two communities (Ferligoj and Kramberger 1996). In the community detection, we only retain the sign of link in the network and ignore the weight of links. Figure 5a shows the connection state between nodes in the Slovene parliamentary party network, the solid edges represent the positive relationship, and the dash-dot edges represent the negative relationship. Figure 5b shows the community partition made by our RC-NMF algorithm, and the result is the same as the real situation, which is divided into two communities: (SKD, ZDSS, ZS, SLS, SPS) and (ZLSD, LDS, ZW-ESS, DS, SNS).

Fig. 5
figure 5

Slovene parliamentary party network

Gahuku-Gama subtribes network The network is about the culture of New Guinea Highland (Read 1954). There are 16 subtribes in this network falling into three communities. Figure 6a shows the connection state between nodes in the Gahuku-Gama subtribes network, where solid edges represent the political alliance relationship and dash-dot edges represent the enmities relationship, respectively. Figure 6b shows the community partition made by our RC-NMF algorithm, and the result is the same as the real situation, which is divided into three communities: (UKUNZ, GEHAM, MASIL, OVE, ASARO, ALIKA), (SEUVE, UHETO, NAGAM, NOTOH, KOHIK) and (KOTUN, GAMA, NAGAM, GAVEV).

Fig. 6
figure 6

Gahuku-Gama subtribes network

4.2.2 Validation of link prediction

In order to detect the accuracy of link prediction in real-world signed network, we used four large-scale real network datasets in experiments of link prediction, i.e., Slashdot (Leskovec et al. 2010), Wiki (Maniu et al. 2011), Epinions (Leskovec et al. 2010) and Bitcoinotc (Kumar et al. 2018). And in the real world, a person has an average of 40 friends offline and 338 friends online. Therefore, it is more realistic to check users with a high degree (Li et al. 2018a). In the contrast experiment of link prediction in the large-scale real-signed network, we select nodes that the degree threshold is set at 50, and the network statistics after setting are shown in Table 1 where we used ‘name@degree’ to represent a specific dataset, e.g., Epinions@50 is the dataset about Epinions with \(d \ge 50\). Table 2 shows the comparison results of our algorithm with other methods. Each number represents the GAUC value obtained by the link prediction experiment of the corresponding algorithm on the corresponding large-scale signed network dataset, where the experimental result value of our algorithm is shown in bold. We can observe that compared to other algorithms, our algorithm performs better than other methods in real-scale large-scale signed networks.

Table 1 Large-scale signed network dataset statistics
Table 2 Comparison experiment results of four algorithms in link prediction

4.3 Algorithm convergence

To test the convergence of the algorithm, we perform experiments on four kinds of synthetic datasets and deriving networks from each kind of network model. As shown in Fig. 7, when the number of iterations is greater than 50, the value of our objective function tends to be stable and will not change, which indicates that the proposed algorithm satisfies the convergence condition.

Fig. 7
figure 7

Convergence of our gradient descent update rules

4.4 Parameter sensitivity

In this section, we study the parameter sensitivity in our proposed algorithm. We selected one of the SLFR-UN datasets, in which \(p_{+}=0.25\) and \(p_{-}=0.25\) represent the signed network with certain noise but not the maximum, which is similar to the signed network in the real world. The experimental results of parameter sensitivity are shown in Fig. 8. The parameters have some influence on the accuracy of the algorithm, but not very much. When parameter \(\lambda\) is greater than 5, the blue color increases, and the accuracy of the algorithm decreases to a certain extent. Moreover, parameter \(\gamma\) has little influence on the accuracy of the algorithm, and the better results are concentrated between 4 and 9. Therefore, we select parameter values: \(\lambda =3\) and \(\gamma =7\).

Fig. 8
figure 8

Experimental results of parameter sensitivity

5 Conclusion and future work

Community detection and link prediction are the basic tasks of signed network analysis. Many of the previous algorithms rely on pre-defined optimization objective functions or heuristic algorithms with high computational complexity, and most of the algorithms cannot simultaneously perform community detection and link prediction. In response to these challenges, we propose the RC-NMF method that converges in a reasonable number of times and can be used to complete community detection and link prediction at the same time. In this paper, in order to constrain the influence of negative links, we introduce a method of graph regularization to constrain nodes with positive links being assigned to the same community and nodes with negative links being assigned to different communities. Then, in order to constrain the situation of overlapping communities, we added sparse constraints to our model. After that, we conducted a series of the experiments for community detection and link prediction in synthetic data and real-world signed network to validate the effectiveness and accuracy of our RC-NMF algorithm.

In the future, we will work on using our proposed algorithm in real social network analysis, using crawler technology to crawl the text data of comments between users on social media such as Facebook or Weibo. Then, we use some emotion analysis methods to analyze the emotional tendency in the text, such as friendliness or hostility, and build a signed network based on this. Furthermore, we use the algorithm we proposed to find some community structures and special societies in the social network such as fraud groups. Finally, we use it to predict the emotional tendency of users, such as positive or negative signed network, which can serve as a foundation for the friend recommendation system.