Keywords

1 Introduction

Networks have become effective tools for processing complex systems in the real world such as social networks, biological networks and technological networks [10]. In a network, nodes represent the entities, and edges stand for the connections between entities. The traditional single-layer network can only represent a single relationship. However, the relationship of entities is diversified. Therefore, multi-layer networks, which can address multi-relationship more effectively, have gradually been a research hot spot [11].

Communities are one of the most essential attributes of networks, which can reveal the potential structure characteristics of networks [10]. Generally speaking, a community (module or cluster) is a series of nodes with closer connections or more similar attributes [6]. Up to now, single-layer network based community detection algorithms have made great achievements, for example, Gao et al. propose a physarum-based framework [5] and Yang et al. propose a semi-supervised framework [17]. However, because of the particular structure of multi-layer networks, traditional single-layer based methods cannot address multi-layer networks effectively. In the past decades, some multi-layer community detection algorithms have been proposed, among which multi-objective optimization algorithm (MOOP) is one of the most effective methods. Due to the local and global search capabilities of the evolutionary algorithm, the multi-objective evolutionary algorithm (MOEA), one of the most classical MOOP methods, has attracted great attention [9]. MOEAs achieve good performance because they can balance the information of each layer by optimizing different objective functions. However, this kind of algorithms pays much attention to the topological information rather than taking prior information into full consideration.

Yang et al. suggest that only considering the topological information is not adequate for algorithms [17]. The prior information can further improve the accuracy and robustness of the algorithm by dividing the preknown similar nodes into the same community despite the weak relationship of the topological information. For example, the prior information can be added to the non-negative matrix factorization-based (NMF) or spectral clustering-based (SC) algorithm as a graph regularization [17]. However, there are few researches for incorporating the prior information into the MOEA. This is because it is difficult to extract the consensus prior information due to the complex structure of multi-layer networks. Besides, unlike the NMF and SC methods, MOEAs find the optimal community partition by imitating the population evolution, which brings difficulties in applying the guidance of prior information.

To address the problem mentioned above, this paper proposes a novel Semi-Supervised Multi-Objective evolutionary algorithm for Multi-Layer network community detection (named SS-MOML). The main idea of SS-MOML is to incorporate the prior information into the MOEA from beginning to end. More specifically, it aims to construct the prior layer, initialization and prior information based genetic operation, which incorporates the prior information into networks and optimization process, respectively. The prior layer, high-quality initialization and modified genetic operation can guide the evolution of the population by dividing similar nodes into the same community as far as possible. The contributions are listed as follows.

  • Instead of only paying attention to the topological information like the traditional MOEAs, the SS-MOML neatly combines the idea of semi-supervised methods and MOEA methods on different aspects, so as to further improve the accuracy and robustness.

  • In order to extract the prior information for multi-layer networks with various structures, this paper proposes DeepWalk-based method and \(S\phi rensen-Dice's\) similarity-based method, where a novel density-based aggregation strategy is added into the DeepWalk to adapt to various network structures.

  • In order to get the utmost of the prior information and MOEA, this paper proposes a SS-MOML algorithm, which utilizes the prior information to generate the prior layer, initial population and guide the population evolution.

The paper is organized as follows. Sect. 2 introduces the problem definition of multi-layer network community detection. The proposed SS-MOML algorithm is presented in detail in Sect. 3. The extensive experiments are elucidated in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Problem Definition

A multi-layer network can be represented as a graph \(G=(V,\,E_{l}\,(l=1,\ldots ,L))\), which consists of L layers. Each layer has the same node set \(V=\{v_{1}, v_{2},\ldots ,v_{N}\}\) which satisfies \(|V|=N\) and N represents the number of nodes. \(E_{l}\) stands for the edge set of the \(l_{th}\) layer which represents a kind of relationship. Existing approaches formulate the multi-layer network as a set of matrices, \(A^{(l)} \in \mathbb {R}_{+}^{N \times N}(l=1, \cdots , L)\), which is used to encode the information of each layer.

Fig. 1.
figure 1

An example of multi-layer networks. The network consists of 3 layers, and different layers represent different relationships (i.e., friends, colleagues, and classmates). Nodes pertaining to different communities are colored differently.

A multi-layer network has multiple layers, where each layer represents one relationship, which is shown in Fig. 1. The complex structure of multi-layer networks leads to the diversity of the community structure in each layer. The ultimate goal of the multi-layer network community detection algorithm is to find a consensus community partition that best adapts to each layer of the multi-layer network.

3 Proposed Method

The proposed SS-MOML algorithm is introduced in this section. The flowchart of the proposed algorithm is shown in Fig. 2. The algorithm mainly consists of two components, namely, the prior information extraction and the optimization process. In the former part, two prior information extraction methods are proposed, that is, DeepWalk-based method (shown in (a)) and \(S\phi rensen-Dice's\)-similarity-based method (shown in (b)). The optimization process is based on the NSGA-II, and the prior information is incorporated into the optimization section in each step. The main components of the optimization process are shown in (c), and each of them is introduced as follows.

Fig. 2.
figure 2

The flowchart of the SS-MOML algorithm. (a) and (b) represent the DeepWalk-based and the \(S\phi rensen-Dice's\)-based prior information extraction process, respectively. (c) is the optimization process. As shown in the figure, the DeepWalk-based information is utilized to construct the prior layer and the \(S\phi rensen-Dice's\)-based information to guide the genetic operation.

3.1 Prior Information Extraction

As mentioned above, the prior information is used to guide the algorithm to divide the similar nodes into the same community. Therefore, the prior information plays a key role in algorithm since high accurate prior information can provide a correct guidance. For example, Ma et al. believe that the subgraph composed of closely connected nodes should be the prior information and they find such prior information through a greedy search method [10]. In this paper, we extract the prior information by applying DeepWalk (DW for short) [14] and \(S\phi rensen-Dice's\) (Dice for short) similarity [16] because the DW can extract the relatively high-order information and Dice can eliminate the effect of node degree. Firstly, the DeepWalk algorithm is applied in each network layer to acquire the low-dimensional representation vectors of each one. Due to the particularity of multi-layer network structure (i.e., each network layer represents a certain kind of relationship), the low-dimensional vectors need to be merged into a consensus vector as the consensus prior information of the network. The most intuitive way is to compute the average of each vector. However, the quantity of information and the importance of each layer are different, so calculating the average of vectors may lead to low accuracy. To overcome the problem, we propose the density-based aggregation strategy, which quantizes the amount of information in each layer using density, which functions as the weight to merge vectors from different layers [8]. Since the increase of density means the decrease of information quantity on the contrary, it is inconvenient to calculate the weight. To address this problem, the equation is converted to Eq. (1).

$$\begin{aligned} density = \frac{M_{c}}{N(N-1)/2} \end{aligned}$$
(1)

where \(M_{c}\) represents the number of connections within communities, and N is the number of nodes. The density means the proportion of connections and nodes within a community. We regard a network layer as a community, so that the density can be used to quantize the information quantity of a network layer.

The second method is to extract the prior information by applying Dice similarity. The equation is shown in Eq. (2).

$$\begin{aligned} \mathbf {S}\left( v_{i}, v_{j}\right) =\frac{2 { ComNeighbours }\left( v_{i}, v_{j}\right) }{ { Length }\left( v_{i}\right) + { Length }\left( v_{j}\right) } \end{aligned}$$
(2)

where \(v_{i}\) and \(v_{j}\) represent the nodes, \(ComNeighbours(v_{i},v_{j})\) means the number of common neighbours of nodes \(v_{i}\) and \(v_{j}\). \(Length(v_{i})\) is the degree of the node \(v_{i}\). To preserve the structure of the multi-layer network, the neighbours of all layers are considered when calculating the number of neighbours of nodes, which ensures the connection information of all layers are considered.

After that, the prior information contains the similarity between every two nodes. To construct the prior layer, we retain a part of node pairs with the highest DW-based cosine similarity as virtual edges according to a predefined threshold \(\theta \). Finally, the collection of virtual edges constitutes the prior layer. The prior layer contains accurate information of community structure because the connected nodes are more likely to belong to the same community (they are more similar). In addition, the Dice-based prior information is incorporated into the genetic operation, which is described in Sect. 3.4.

3.2 Objective Functions

The MOEA can be represented as \(F=\left\{ F_{1}, F_{2}, \ldots , F_{t}\right\} \), where \(F_{i}\) is the \(i_{th}\) objective function, and t means the number of functions. The MOEA optimizes each objective function to find the optimal solution. In this paper, we choose two widely-used objective functions, Modularity (Q) [12] and Normalized Cut (Nc) [3], which are shown in Eq. (3) and Eq. (4).

$$\begin{aligned} Q=\frac{1}{2 M} \sum _{i, j}^{N}\left( A_{i j}-\frac{d_{i} \times d_{j}}{2 M}\right) \delta \left( C^{i}, C^{j}\right) \end{aligned}$$
(3)

where M is the number of edges, A represents the corresponding adjacent matrix and \(d_{i}\) means the degree of node i. C stands for the community and \(C^{i}\) represents the node i that belongs to C. When \(C^{i}=C^{j}\), \(\delta \left( C^{i}, C^{j}\right) =1\), and 0 otherwise. In this paper, the first objective function is the average of Q for each layer.

$$\begin{aligned} Nc\left( \left\{ \mathcal {C}_{k}\right\} _{k=1}^{K}\right) =\frac{1}{K} \sum _{k=1}^{K} Nc_{k} \end{aligned}$$
(4)

where \(Nc_{k}=\frac{W_{k}^{o}}{2 \cdot W_{k}^{i}+W_{k}^{o}}+\frac{W_{k}^{o}}{2 \cdot \left( W_{k}^{a}-W_{k}^{i}\right) +W_{k}^{o}}\), and \(C_{k}\) is the \(k_{th}\) community. \(W_{k}^{i}\), \(W_{k}^{o}\) and \(W_{k}^{a}\) are the sum of edge weights within communities, between communities, and total edges of community k, respectively. To adapt to the structure of multi-layer networks, the \(W_{k}^{i}\), \(W_{k}^{o}\) and \(W_{k}^{a}\) are calculated for all layers.

3.3 Encoding Scheme and Initialization

The encoding scheme will affect the computational cost of the algorithm. Label-based and locus-based methods are two widely-used encoding approaches. However, the label-based method may cause redundancy, which leads to space consuming. For example, the encoding sequence \(\{1\ 1\ 1\ 2\ 2\ 2\ 2\}\) and \(\{2\ 2\ 2\ 1\ 1\ 1\ 1\}\) represent the same community partition. Therefore, the locus-based encoding scheme is selected in the proposed algorithm.

To generate the high-quality initial population, we apply the k-means algorithm to the consensus low-dimensional vector (calculated by DW-based method). Then the clustering result is transformed to the locus-based encoding, which is the initial population. The flowchart of initialization is also shown in Fig. 2.

3.4 Genetic Operators

Genetic operation is one of the most critical components, which is helpful to break the local optimality and find the optimal solution. In this paper, the crossover and mutation operations are used to increase the population diversity.

The crossover operation can generate various offspring chromosomes. The uniform crossover is selected in this paper because it is more random. Firstly, a binary mask, having the same length as the chromosome, is generated randomly. The offspring is generated by choosing the gene of parents according to the value of the binary mask. More specifically, at a gene, when the value of the mask is 0, the offspring chooses the first parent, otherwise another parent.

To make full use of the guiding ability of Dice-based prior information (calculated in Sect. 3.1), we propose the Dice-based mutation strategy. Simultaneously, to ensure the utilization of the local information (i.e., neighbours of nodes), we design a mutation strategy consisting of two parts as shown in Fig. 3.

Fig. 3.
figure 3

The diagram of mutation operation. The mutation consists of two components, which are neighbour-based (red dotted line) and Dice-based strategy (blue dotted line). (Color figure online)

There are two mutation strategies, and \(50\%\) of parent chromosome conducts the neighbour-based mutation strategy and others perform the Dice-based mutation strategy to take full advantage of the local information and prior information. For neighbour-based strategy, a gene of a parent chromosome is selected randomly according to a predefined probability. Then, the value of the selected gene mutates to the value of one of its neighbours. As for the Dice-based strategy, some genes are selected randomly with a predefined probability and the value of genes is selected from the Dice similarity table randomly. Similar to DW similarity mentioned in Sect. 3.1, the Dice similarity table is constructed by a part of node pairs with the highest Dice similarity according to a predefined reservation threshold \(\epsilon \). Figure 3 illustrates the processes of genetic operation.

3.5 Optimal Selection Strategy

As mentioned above, MOEAs can find a collection of solutions. Finding an appropriate way to extract the optimal solution is important for the algorithm. The knee point strategy is a useful method [1]. The idea of the knee point refers to a little improvement of one objective function leading to a significant reduction of other objective functions. For a two-fitness-function MOEA, an angle-based method is applied to find the knee point of the solution set [1].

Fig. 4.
figure 4

An example of knee point strategy. The red point represents the target node and green points are four neighbours of the target node. \(\alpha \), \(\beta \), \(\gamma \) and \(\eta \) are four angles. (Color figure online)

The angle-based method estimates the trades-offs of two objective functions by measuring the slopes of two lines, where the line is constructed by passing through the node and its neighbours. In the angle-based measure, first, we select four neighbors, which are the closest or the second closest on either side of the target solution. Two neighbors on different sides are randomly chosen as a pair to calculate the angle with the target solution. The four angles are shown in Fig. 4. The point with at least one maximal angle can be designated as the optimal solution.

Although the knee point strategy can find the optimal solution, the total iteration times keep unchanged, which will lead to the fluctuation of the Pareto set. Therefore, a modularity-based selection strategy is used as a supplement. The strategy selects the solutions by finding the maximum average of modularity.

4 Experiment

4.1 Datasets and Metrics

Various synthetic and real-world datasets are used to validate the performance of the proposed algorithm. The basic information of datasets is shown in Table 1. The synthetic datasets are generated by the multi-layer LFR benchmark (mLFR) [2]. The mLFR benchmark controls the structure of the network by changing the mixing parameter (\(\mu \)) and the degree change chance (Dc). Both the \(\mu \) and Dc range in (0, 1). With the increase of \(\mu \) and Dc, the community structure of the multi-layer network becomes more complicated. The real-world networks with different types and scales (i.e., SND [7], MPD [7], WTN [7], CoRA [7], Citeseer [7]) are used to ensure the comprehensiveness of the experiment.

Table 1. The summary of datasets. The first five are real-world datasets and the last two are generated by the mFLR benchmark.

Besides the datasets, metrics are also significant for experiments. In this paper, two widely-used metrics are used to evaluate the accuracy of the detected communities, namely, Normalized Mutual Information (NMI) [7] and Adjusted Rand Index (ARI) [7]. Both two metrics evaluate the similarity between partition C and \(\varOmega \), where \(\varOmega \) represents the ground truth and C stands for the partition calculated by the proposed algorithm. The range of NMI and ARI is [0,1]. The larger the both metrics, the more accurate the community partition.

4.2 Baselines and Parameter Settings

Table 2. The experimental results on various datasets. The \(\mathcal {A}_{1}-\mathcal {A}_{8}\) represent SS-MOML (our proposed algorithm), MOEA-MultiNet, SC-ML, MIMOSA, S2-jNMF, COMCLUS, CSNMF and GMC algorithm, respectively. The ‘-’ represents that the algorithm cannot run successfully on this network.

In SS-MOML, the parameter settings of DeepWalk are the same as the source paper. The number of iterations and populations are both 500. Besides, \(\theta \) and \(\epsilon \), which are used to control the proportion of retaining prior information, are set as 0.3 and 0.1, respectively. The analysis of parameters is shown in Sect. 4.4.

To verify the performance of our algorithm, various kinds of algorithms are selected as the comparison algorithms, namely, MOEA based method (MOEA-MultiNet [9]), spectral clustering based methods (SC-ML [4] and MIMOSA [3]), matrix factorization based approaches (S2-jNMF [10], COMCLUS [13] and CSNMF [7]) and multi-view clustering based method (GMC [15]). The parameter settings of comparison algorithms are consistent with the source paper.

4.3 Experimental Results

The results of comparison experiments are shown in Table 2. The experiment consists of seven comparison algorithms mentioned above running on five real-world datasets and two synthetic datasets (Syn1 and Syn2).

The results show that the proposed SS-MOML performs better than other state-of-the-art methods on most of the datasets. Since the prior information guides the algorithm to divide similar nodes into the same community, the proposed algorithm can keep a high-level performance on Citeseer, WTN and MPD, which have a complicated community structure. Moreover, we find that the SS-MOML has a higher accuracy than another classical MOEA-based method (MOEA-MultiNet) on most networks due to the utilization of prior information. On Syn1 and Syn2 (large scale synthetic network), the accuracy of SS-MOML is slightly lower than some comparison algorithms because the structure of synthetic networks is simpler compared with real-world networks and the objective function values are not linearly related to the accuracy. In general, the performance of SS-MOML surpasses other algorithms on various kinds of multi-layer networks.

4.4 Parameter Analysis

In this section, some experiments of threshold \(\theta \) and \(\epsilon \) are implemented to verify the effects of two parameters on our algorithm. The results are shown in Fig. 5.

Fig. 5.
figure 5

The experimental results of parameter analysis. The x-axis and y-axis represent the threshold of Dice (\(\epsilon \)) and cosine similarity (\(\theta \)), which controls the reservation proportion of prior information. To ensure the comprehensiveness of experiments, a real-world network (MPD) and synthetic network (128-0.5-0.2) are chosen as experimental datasets. To avoid the fluctuation of knee point selection strategy, the experiments take the second optimal selection strategy (maximize the average modularity).

The experimental results indicate that the NMI value drops to the minimum on both two datasets when \(\theta =0\) and \(\epsilon =0\), which means there is no prior information incorporated into the algorithm. With \(\theta \) and \(\epsilon \) increase, the NMI value rises because the prior information guides the algorithm to find better solutions. As shown in (a), when the values of \(\theta \) and \(\epsilon \) surpass 0.5, the reserved quantity of prior information grows up, leading to the decrease of the accuracy for prior information (the similarity between nodes declines). The experiments prove the positive impact of prior information on the algorithm.

Fig. 6.
figure 6

The analysis of robustness. (a)–(c) are designed to validate the stability of algorithms for various network structures. (d)–(f) represent the experimental results running on networks with different layers.

4.5 Robustness Analysis

As mentioned above, the prior information can improve the robustness of the algorithm. This section validates the robustness of the proposed SS-MOML algorithm. The experiment, as shown in Fig. 6, consists of two sub-experiments, which change the network structure and number of layers respectively.

The results show that the proposed SS-MOML performs better than other state-of-the-art methods in terms of accuracy and stability on various synthetic networks with different structure and number of layers. Compared with MOEA-MultiNet, the proposed algorithm owns a remarkable advantage. In addition, we can find that the NMI has less volatility than the other state-of-the-art methods with high robustness. It is because that the prior information participates in the initialization and each step of the MOEA, which can ensure the nodes with high similarity are distributed into the same community.

5 Conclusion

This paper proposes a novel semi-supervised multi-objective evolutionary algorithm for multi-layer network community detection, called SS-MOML. SS-MOML first extracts the prior information using DW-based strategy and Dice-based strategy. Then, it constructs the prior layer based on the DW-based strategy and the optimization process begins. In each iteration step, the Dice-based mutation strategy will guide the population evolution based on the Dice-based prior information. Moreover, the prior information can evaluate the similarity between nodes, and guide them into the same community despite the weak relationship on topological information. A series of extensive experiments prove our SS-MOML performs superiorly to other state-of-the-art algorithms on most datasets.