1 Introduction

1.1 Background

Community detection aims to partition a network into several components, called communities, each of which contains densely connected nodes with some structural similarities (Fortunato 2010; Bouguessa et al. 2010). As an important task in network science, community detection finds many real-world applications in diverse domains such as biology, chemistry, transportation, and sociology (Garcia et al. 2018; Chang et al. 2016; Liu and Wang 2022; Magalingam et al. 2015). For example, community detection in biological networks can be used to analyze the interactions of brain regions and their influence on brain functions (Garcia et al. 2018), and community detection in social networks can help identify crime organizations (Magalingam et al. 2015).

Given these wide applications, many algorithms have been developed for community detection, including traditional heuristic algorithms and modern deep learning ones. Heuristic algorithms, such as GN (Newman and Girvan 2004) and Louvain (Blondel et al. 2008), usually optimize an objective function iteratively to improve the quality of detected communities. Modern deep learning-based algorithms, such as CommunityGAN (Jia et al. 2019), DANMF (Ye et al. 2018), and MRFasGCN (Jin et al. 2019), encode graph structural information to learn node representations that are then used to compute community divisions.

Although much progress has been made in the field (Souravlas et al. 2021; Su et al. 2022), most algorithms for community detection are developed in the context of single-layer networks. A single-layer network, called a monoplex network in this paper, is characterized by a unique type of edge connecting two nodes. However, in many practical scenarios multiple types of relations exist between two entities. For example, two people in a social network can be connected by a friend relation, a coworker relation, or both. To better represent different relations, it has recently been proposed to use a multiplex network consisting of multiple network layers, each representing only one relation type. Multiplex networks have been applied in many scenarios such as multi-behavior recommendation (Xia et al. 2021) and community detection (Tang et al. 2009; Magnani et al. 2021).

A multiplex network is composed of L network layers, \(G_{l}(V,E_{l})\), \(l=1,2,...,L\). Figure 1 illustrates a 3-layer network for community detection. The number of nodes in each layer is the same, denoted as N. A multiplex network can also be represented as multiple adjacency matrices, \(\textbf{A}_{l} \in \mathbb {R}^{N \times N}\), for \(l=1,2,...,L\). Similar to monoplex networks, community detection in multiplex networks aims to partition the node set V into M communities, \(\{C_{1},C_{2},...,C_{M}\}\). The detected communities can be seen as the comprehensive associations of nodes under diverse types of relations.
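As a minimal illustration (not part of the original formulation; the node count, layer count, and random edges below are placeholders), a multiplex network can be held as one adjacency matrix per layer over a shared node set:

```python
# A minimal sketch of the multiplex-network representation used in this paper:
# one N x N adjacency matrix per layer, over a shared node set V of size N.
import numpy as np

N, L = 6, 3                          # illustrative values, not from the paper
rng = np.random.default_rng(0)

A = []                               # A[l] is the adjacency matrix of layer l+1
for _ in range(L):
    upper = np.triu(rng.integers(0, 2, size=(N, N)), k=1)
    A.append(upper + upper.T)        # symmetric, no self-loops

# Community detection seeks one partition of V into M communities that is
# consistent across all L layers.
print([a.shape for a in A])          # [(6, 6), (6, 6), (6, 6)]
```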

In recent years, some solutions have been proposed to address community detection in multiplex networks. A few of them propose to first convert this problem into the classical setting of community detection in a single-layer network (Berlingerio et al. 2011; Suthers et al. 2013; Shao et al. 2022). For example, Berlingerio et al. (2011) first reduce a multiplex network into a single-layer one by judging whether the nodes are connected in at least one network layer and then apply the random walk method to detect communities from this single-layer network. Some algorithms directly apply a greedy modularity-maximization strategy on a multiplex network (Mucha et al. 2010; Tagarelli et al. 2017; Pramanik et al. 2017; Paul and Chen 2022). For example, Tagarelli et al. (2017) propose a multilayer modularity function to find consensus community structures by a greedy search approach. Some other algorithms apply matrix decomposition to extract features from a multiplex network (Ma et al. 2018; Gligorijević et al. 2019; Chen et al. 2019). For example, Ma et al. (2018) propose a nonnegative matrix factorization (NMF) algorithm called s2-jNMF, which applies a joint NMF to each network layer to obtain multiplex basis matrices for community discovery.

1.2 Motivation

The limitations of existing approaches to community detection in multiplex networks mainly lie in the following two aspects. On the one hand, many algorithms (Berlingerio et al. 2011; Boutemine and Bouguessa 2017; Pramanik et al. 2017; Interdonato et al. 2017) extend traditional heuristic ones designed for single-layer networks to multiplex networks, which can lose some inter-layer relational information. For example, Boutemine and Bouguessa (2017) flatten a multiplex network into a single one and then utilize the Label Propagation Algorithm (LPA) (Raghavan et al. 2007). However, a flattened edge cannot describe all latent relations between two nodes in a multiplex network. On the other hand, some representation learning algorithms for multiplex networks (Park et al. 2020; Jing et al. 2021) can be applied to community detection as a downstream task, but they ignore the critical factors for building multilayer community structures. For example, Jing et al. (2021) propose to maximize the mutual information between local node embeddings and the global summary for node classification and community detection, but they mainly focus on how to train a general node embedding. Besides, such methods normally rely on prior knowledge of node attributes for representation learning. How to detect communities from only the topological structure of a multiplex network, without extra node information, is a challenging task for most such node representation learning models.

To overcome such limitations, we consider the following two challenging issues: (a) What kind of structural or topological characteristics are suitable for a community? (b) How to extract and fuse features from each single layer to represent a community in a multiplex network? We note that two kinds of characteristics are important for a community, i.e., path-aware topological characteristics and neighbor-aware structural characteristics. The former can be learned by sampling short node sequences via random walks and maximizing nodes' co-occurrence probability in the sequences. The latter can be learned by aggregating the embeddings of a node's neighbors into its own (Liu et al. 2022). Besides, neighbors of different scales can be used for embedding learning. We also note that intra-layer and inter-layer information not only can be extracted but also can be fused to learn node representations for community detection via neural network models.

Motivated by such considerations, this paper learns not only neighbor-aware intra-layer structural information but also semantics-aware inter-layer relational information to build node representations for community detection. In many cases, little prior knowledge is available about which type of information, intra-layer or inter-layer, is more important for community detection in multiplex networks. For example, each layer of a temporal multiplex network represents a time slice and cannot provide extra information from the semantic perspective. We further argue that the two kinds of information should be jointly exploited for node representation learning in an interwoven manner, rather than separately. In response to these arguments, we design a graph convolutional fusion model (GCFM) for community detection in multiplex networks.

1.3 Contribution

The novelty of our proposed GCFM framework lies in encoding and fusing intra-layer structural information with inter-layer relational information to learn node representations in an interwoven fashion. Specifically, GCFM makes three key contributions: (1) How to learn node topological characteristics in each single layer? In GCFM, we first employ a graph embedding technique to obtain node initial embeddings for each network layer and then design a graph convolutional auto-encoder (GCA) module, executed on a per-network-layer basis, to encode neighbor-aware intra-layer structural information under different convolution scales. (2) How to better learn inter-layer semantics for community detection? In GCFM, we design a multiscale fusion network (MFN) module to fuse nodes' encodings at different layers and different scales for learning a holistic version of nodes' representations. (3) How to detect communities in the node feature space? In GCFM, we use a self-training mechanism to train our model and output community divisions for a multiplex network.

Compared with traditional heuristic studies (Tang et al. 2009; Boutemine and Bouguessa 2017; Gligorijević et al. 2019), our GCFM uses random walks to generate node initial embeddings for subsequent encoding and fusing, instead of optimizing a certain indicator such as modularity, which avoids falling into local optima. Compared with deep learning-based methods that only use an auto-encoder (Song and Thiagarajan 2019), our GCFM designs the GCA and MFN modules that consider the influence of different convolution scales. Compared with node embedding-based methods that train a general node embedding for downstream tasks such as node classification and community detection (Park et al. 2020; Jing et al. 2021), our GCFM applies a community-specific training loss and pays attention to community characteristics when learning node representations. Extensive experiments are conducted on both synthetic and real-world datasets. The results validate that our GCFM achieves higher precision and detects communities with better modularity than peer competitors in most cases.

Fig. 1 A 3-layer network with two communities

The main contributions are summarized as follows:

  • Propose a graph convolutional fusion model (GCFM) to encode and fuse intra-layer and inter-layer information for representation learning in an interwoven fashion for community detection in multiplex networks.

  • Develop a graph convolutional auto-encoder (GCA) to encode neighbor-aware intra-layer structural information at different scales.

  • Design a multiscale fusion network (MFN) to fuse and encode intra-layer structural information and inter-layer relational information at different layers and different scales.

  • Experiment on both synthetic and real-world datasets to validate the superiority of our proposed model.

The rest of the paper is organized as follows. Section 2 reviews the related work. The proposed GCFM is presented in Sect. 3 and evaluated in Sects. 4 and 5. Section 6 concludes the paper.

2 Related work

2.1 Multiplex network community detection

Many traditional heuristic algorithms for community detection in monoplex networks have been extended into multiplex ones, which can be generally categorized into three types: flattening, aggregation, and direct methods (Huang et al. 2021).

Flattening methods (Berlingerio et al. 2011; Gao et al. 2019) first convert (flatten) a multiplex network into a monoplex one, for example by setting new edge weights, and then use any monoplex algorithm to find communities. For example, Gao et al. (2019) propose a modified particle competition model, which constructs an extended adjacency matrix to represent a multiplex network for community detection.

Aggregation methods (Tang et al. 2009; Tagarelli et al. 2017; Ali et al. 2019) capture structural information from each single layer and aggregate it into a comprehensive node feature for community detection. For example, Ali et al. (2019) utilize a variational Bayes method to extract community structures from each network layer and aggregate them to determine the final community divisions.

Direct methods (Ma et al. 2018; Chen et al. 2019; Mercorio et al. 2019) work directly on a multiplex network to detect communities. For example, Ma et al. (2018) propose an s2-jNMF algorithm, which simultaneously factorizes the matrices from each network layer to obtain community divisions.

2.2 Deep learning community detection

Many deep learning techniques have been applied to community detection in monoplex networks. Compared to most traditional methods, the advantage of deep learning lies in that it can automatically learn node representations by, for example, encoding graph structural information (Liu et al. 2020).

Deep neural networks, like convolutional neural networks (Sperlí 2019), auto-encoders (Cao et al. 2018), and generative adversarial networks (Jia et al. 2019), can help capture structural relations between nodes. For example, Jia et al. (2019) propose a GAN-based (Generative Adversarial Network) model to encode the membership strength of nodes to communities for community detection.

Deep graph embedding-based models convert nodes into a low-dimensional vector space to preserve graph structural information, such as deep non-negative matrix factorization (Ye et al. 2018), deep sparse filtering (Xie et al. 2018), and community embedding (Tu et al. 2018). For example, Ye et al. (2018) propose a DANMF model, which learns the hierarchical mappings between the original network and the community distribution by using a deep auto-encoder.

Graph neural network-based models can extract community structures from raw graph data by fusing graph mining and deep learning techniques (Jin et al. 2019; Wang et al. 2019; Bo et al. 2020). For example, Jin et al. (2019) propose an end-to-end model for semi-supervised community detection, which integrates a graph convolutional network with the Markov Random Field technique.

3 Graph convolutional fusion model

Figure 2 presents the framework of our graph convolutional fusion model (GCFM), which contains four modules: (1) node initial embedding, (2) graph convolutional auto-encoder (GCA), (3) multiscale fusion network (MFN), and (4) self-training community detection.

Fig. 2 The overall framework of GCFM. \( X_{1} \) is the node initial embedding learned, for example, by DeepWalk in the 1st network layer. Different colors represent different network layers. The lower part shows the graph convolutional auto-encoder. The top part is the multiscale fusion network. The soft distribution Q and the target distribution P indicate the self-training community detection mechanism

3.1 Node initial embedding

We employ a graph embedding technique to obtain node initial embeddings per network layer. Since nodes have no attributes in our problem setting, we choose DeepWalk (Perozzi et al. 2014) to capture latent path-aware topological features for initializing node embeddings. We note that other techniques can also be used, and we compare a few of them in our experiments.

For the l-th network layer, we use \(\textbf{X}_{l} \in \mathbb {R}^{N \times d_{0}}\) to denote the node initial embedding matrix obtained from DeepWalk, where \(d_{0}\) is the dimension of the initial embedding. Although the initial embeddings could be directly used for community detection, such path-aware topological embeddings may not well describe the latent neighborhood information of a node, which, however, is often the key to the community detection task. To capture such local association characteristics, we next design the GCA module to encode neighbor-aware structural information.
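For concreteness, the following is a hedged sketch of this initialization step for one layer: uniform random walks on the layer's adjacency matrix are fed to a skip-gram model (here gensim's Word2Vec, an assumption on our part rather than the authors' implementation); the walk parameters mirror those listed in Sect. 4.4.

```python
# Illustrative DeepWalk-style initial embedding for one network layer.
# Assumes gensim (>=4) is available; parameter names follow Sect. 4.4.
import numpy as np
from gensim.models import Word2Vec

def deepwalk_embedding(A_l, d0=128, num_walks=20, walk_len=40, window=10, seed=0):
    rng = np.random.default_rng(seed)
    N = A_l.shape[0]
    walks = []
    for _ in range(num_walks):
        for start in range(N):
            walk, cur = [start], start
            for _ in range(walk_len - 1):
                nbrs = np.flatnonzero(A_l[cur])
                if len(nbrs) == 0:           # isolated node: stop the walk
                    break
                cur = rng.choice(nbrs)
                walk.append(cur)
            walks.append([str(v) for v in walk])
    # Skip-gram (sg=1) maximizes co-occurrence probability within the window.
    model = Word2Vec(walks, vector_size=d0, window=window, min_count=1, sg=1)
    # X_l[i] is the d0-dimensional initial embedding of node i in this layer.
    return np.vstack([model.wv[str(i)] for i in range(N)])
```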

3.2 Graph convolutional auto-encoder

The GCA module executes aggregation operations to update a node's embedding from its neighbors', by which local structural information can be encoded into node embeddings. Furthermore, by applying multiple scales of the aggregation operation, not only one-hop neighbors but also multi-hop neighbors, as well as their associated information through the edges between them, can be encoded to represent a node's local structure at different scales.

The GCA aggregation operations are executed on each network layer independently. Take the l-th network layer for example. Let \(\textbf{A}_{l} \in \mathbb {R}^{N \times N}\) denote the adjacency matrix of the l-th network layer. For the first scale of GCA, the input includes \(\textbf{A}_{l}\) and \(\textbf{X}_l\), and the output is given by

$$\begin{aligned} \textbf{H}_{l,1} = ReLU(\widetilde{\textbf{D}}^{-\frac{1}{2}}_{l} \widetilde{\textbf{A}}_{l} \widetilde{\textbf{D}}_{l}^{-\frac{1}{2}} \textbf{X}_{l} \textbf{W}_{l,0}), \end{aligned}$$
(1)

where \(\widetilde{\textbf{D}}^{-\frac{1}{2}}_{l} \widetilde{\textbf{A}}_{l} \widetilde{\textbf{D}}_{l}^{-\frac{1}{2}}\) is the symmetrically normalized adjacency matrix with self-loops. \(\widetilde{\textbf{A}}_{l} = \textbf{A}_{l} + \textbf{I}\) is the adjacency matrix plus the node self-connection matrix and \(\widetilde{\textbf{D}}_{l(ii)} = \sum _{j} \widetilde{\textbf{A}}_{l(ij)}\) is the corresponding degree matrix. \(\textbf{W}_{l,0} \in \mathbb {R}^{d_{0} \times d_{1}}\) is a learnable weight matrix. We choose \(ReLU(\cdot )\) as the activation function.

The forward encoding process in the k-th GCA scale is as follows:

$$\begin{aligned} \textbf{H}_{l,k} = ReLU(\widetilde{\textbf{D}}^{-\frac{1}{2}}_{l} \widetilde{\textbf{A}}_{l} \widetilde{\textbf{D}}_{l}^{-\frac{1}{2}} \textbf{H}_{l,k-1} \textbf{W}_{l,k-1}), \end{aligned}$$
(2)

where \(\textbf{H}_{l,k} \in \mathbb {R}^{N \times d_{k}}\) is the l-th layer node encoding after the k-th GCA scale. It is obtained from the node encoding \(\textbf{H}_{l,k-1} \in \mathbb {R}^{N \times d_{k-1}}\) of the previous scale. \(\textbf{W}_{l,k-1} \in \mathbb {R}^{d_{k-1} \times d_{k}}\) is a learnable weight matrix.
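As an illustrative sketch only (module and variable names are ours, not released code), the K-scale GCA encoder for one layer can be written in PyTorch as follows, directly following Eqs. (1)-(2):

```python
# Sketch of the K-scale GCA encoder for one network layer l:
# H_{l,k} = ReLU(D^{-1/2} (A+I) D^{-1/2} H_{l,k-1} W_{l,k-1}).
import torch
import torch.nn as nn

def normalize_adj(A):
    """Symmetrically normalized adjacency matrix with self-loops."""
    A_tilde = A + torch.eye(A.size(0))
    d = A_tilde.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

class GCAEncoder(nn.Module):
    def __init__(self, dims):          # e.g. dims = [128, 512, 1024, 2048, 10]
        super().__init__()
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.empty(dims[k], dims[k + 1]))
             for k in range(len(dims) - 1)])
        for W in self.weights:
            nn.init.xavier_uniform_(W)

    def forward(self, A, X):
        A_hat = normalize_adj(A)
        H, scales = X, []
        for W in self.weights:          # one iteration per GCA scale
            H = torch.relu(A_hat @ H @ W)
            scales.append(H)            # keep every scale for the MFN module
        return scales                   # [H_{l,1}, ..., H_{l,K}]
```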

Fig. 3 Illustration of using different scales of neighborhood to encode local structural information for the white node in a single network layer. Different numbers of neighbors, possibly from different communities, are included for updating its representation at different scales

The output of each GCA scale can be regarded as determining how large a neighborhood of a node is included in its local structural encoding. Figure 3 illustrates how different GCA scales include different neighborhoods of a node when encoding its local structural information, which is analogous to encoding a node-centric max-k-neighbor subgraph. The GCA module encodes multiscale local structures for each node; such structural encodings can be exploited to discriminate neighborhood associations at different scales, which are key components for the community detection task.

In the decoding process, we reconstruct the edges in between nodes in each single layer by using an inner product decoder:

$$\begin{aligned} \widehat{\textbf{A}}_{l} = sigmoid({\textbf{H}_{l,K}} \textbf{H}_{l,K}^{\textrm{T}}), \end{aligned}$$
(3)

where \(\widehat{\textbf{A}}_{l}\) is the reconstructed adjacency matrix of the l-th layer. \(\textbf{H}_{l,K} \in \mathbb {R}^{N \times d}\) is the K-scale GCA node encoding.

We apply a binary cross entropy loss to minimize the difference between \(\widehat{\textbf{A}}_{l}\) and \(\textbf{A}_{l}\) for the l-th layer:

$$\begin{aligned} L_{rl} = -\frac{1}{N} \sum _{i=1}^{N} \sum _{j=1}^{N} [\textbf{A}_{l(ij)} \log (\widehat{\textbf{A}}_{l(ij)})+(1- \textbf{A}_{l(ij)}) \log (1-\widehat{\textbf{A}}_{l(ij)})]. \end{aligned}$$
(4)

The total reconstruction loss is the sum of the reconstruction errors of all layers:

$$\begin{aligned} L_{r} = \sum _{l=1}^{L} L_{rl}. \end{aligned}$$
(5)
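A short sketch of this decoding side, under the same naming assumptions as above, reconstructs each layer's adjacency from its K-scale encoding and accumulates the per-layer binary cross-entropy losses of Eqs. (3)-(5):

```python
# Sketch of the inner-product decoder and reconstruction loss (Eqs. (3)-(5)).
import torch

def reconstruction_loss(A_list, H_K_list, eps=1e-8):
    """A_list[l]: (N,N) adjacency; H_K_list[l]: (N,d) K-scale encoding of layer l."""
    L_r = 0.0
    for A_l, H_lK in zip(A_list, H_K_list):
        A_hat = torch.sigmoid(H_lK @ H_lK.T)            # Eq. (3)
        bce = -(A_l * torch.log(A_hat + eps)
                + (1 - A_l) * torch.log(1 - A_hat + eps))
        L_r = L_r + bce.sum() / A_l.size(0)             # Eq. (4), averaged over N
    return L_r                                          # Eq. (5): sum over layers
```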

3.3 Multiscale fusion network

The GCA module is executed on a per-network-layer basis, that is, its node encodings at different neighborhood scales are computed separately for different network layers. As the edges in each layer express a specific kind of semantic relation among nodes, we need to take all kinds of semantic relations into account to output a comprehensive community division for a multiplex network. To this end, we propose the MFN module to fuse node encodings not only from different GCA scales, but also from different network layers.

As we have no prior knowledge to decide which layer in a multiplex network is more important, we first design the multilayer fusion of node structural encodings as a simple sum operation for the k-th GCA scale:

$$\begin{aligned} \textbf{H}_{k} = \sum _{l=1}^{L} \textbf{H}_{l,k}. \end{aligned}$$
(6)

\(\textbf{H}_{k} \in \mathbb {R}^{N \times d_{k}}\) is called the k-th scale node multilayer encoding.

We next use a fully connected neural network to output different scales of node representation \(\textbf{Z}_k\). The initial scale of MFN fusion transforms \(\textbf{H}_{1}\) to the hidden state in the next scale:

$$\begin{aligned} \textbf{Z}_{1}= & {} \textbf{H}_{1}, \end{aligned}$$
(7)
$$\begin{aligned} \textbf{Z}_{2}= & {} ReLU(\textbf{W}_{1}\textbf{Z}_{1}^{\textrm{T}} + \textbf{b}_{1})^{\textrm{T}}, \end{aligned}$$
(8)

where \(\textbf{Z}_{1} \in \mathbb {R}^{N \times d_{1}}\) is the initial scale of node representation, and \(\textbf{Z}_{2} \in \mathbb {R}^{N \times d_{2}}\) is the second scale of node representation. \(\textbf{W}_{1} \in \mathbb {R}^{d_{2} \times d_{1}}\) and \(\textbf{b}_{1} \in \mathbb {R}^{d_{2} \times N}\) are learnable parameters.

In the fully connected units, each node exchanges its encodings with those of the other nodes. We emphasize the topological information by adding the aggregated multilayer node encoding at each MFN scale. For the k-th (\(k>1\)) MFN scale, we update the node representation \(\textbf{Z}_{k}\) by inputting the multilayer node encoding \(\textbf{H}_{k-1}\) and the multiscale node representation \(\textbf{Z}_{k-1}\) into a fully connected unit. The forward process can be written as:

$$\begin{aligned} \textbf{Z}_{k} = ReLU(\textbf{W}_{k-1}(\textbf{Z}_{k-1} + \textbf{H}_{k-1})^{\textrm{T}} + \textbf{b}_{k-1})^{\textrm{T}}, \end{aligned}$$
(9)

where \(\textbf{Z}_{k} \in \mathbb {R}^{N \times d_{k}}\) and \(\textbf{Z}_{k-1}, \textbf{H}_{k-1} \in \mathbb {R}^{N \times d_{k-1}}\). The weight matrix \(\textbf{W}_{k-1} \in \mathbb {R}^{d_{k} \times d_{k-1}}\) and bias \(\textbf{b}_{k-1} \in \mathbb {R}^{d_{k} \times N}\) are learnable parameters.

The final MFN scale fuses \(\textbf{Z}_{K} \) and \(\textbf{H}_{K}\) to output the final node representation \(\textbf{Z} \in \mathbb {R}^{N \times d_{K}}\)

$$\begin{aligned} \textbf{Z} = \textbf{Z}_{K} + \textbf{H}_{K}, \end{aligned}$$
(10)

which will be input to the next self-training module for community detection.
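The following is a hedged PyTorch sketch of the MFN module under our naming assumptions, covering the layer-wise sums of Eq. (6) and the fully connected fusion of Eqs. (7)-(10):

```python
# Sketch of the multiscale fusion network (Eqs. (6)-(10)).
import torch
import torch.nn as nn

class MFN(nn.Module):
    def __init__(self, dims):                  # dims = [d_1, ..., d_K]
        super().__init__()
        self.fcs = nn.ModuleList(
            [nn.Linear(dims[k], dims[k + 1]) for k in range(len(dims) - 1)])

    def forward(self, scales_per_layer):
        # scales_per_layer[l][k] = H_{l,k+1}; fuse layers by summation (Eq. (6))
        K = len(scales_per_layer[0])
        H = [sum(layer[k] for layer in scales_per_layer) for k in range(K)]
        Z = H[0]                               # Eq. (7): Z_1 = H_1
        for k, fc in enumerate(self.fcs):      # Eq. (8) for k=0, Eq. (9) for k>0
            Z = torch.relu(fc(Z + H[k])) if k > 0 else torch.relu(fc(Z))
        return Z + H[-1]                       # Eq. (10): final representation Z
```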

3.4 Self-training community detection

Community detection is often an unsupervised problem in the real world. Inspired by Xie et al. (2016), we adapt a self-training mechanism to train our GCFM.

For the i-th node, its final node representation \(\textbf{z}_{i}\), the i-th row of the node representation \(\textbf{Z}\), can be seen as an embedding in a feature space \(\mathbb {Z}\). Let \({\varvec{\nu }}_{j}\) denote the representation of the j-th community center, which is also an embedding in the space \(\mathbb {Z}\). All the community center embeddings are initialized by using the K-means algorithm before training.

Assume the pre-specified number of communities is M. The community centers are computed as follows. We first randomly select M community centers and then repeat the following process until convergence: (a) For each node i, we find the class label \(c_{i}\) that minimizes the distance between node i and the community centers \({\varvec{\nu }}_{j}\), \(j=1,2,...,M\):

$$\begin{aligned} c_{i}=\mathop {\arg \min }\limits _{j}(\Vert \textbf{z}_{i} - {\varvec{\nu }}_{j} \Vert ^{2}). \end{aligned}$$
(11)

(b) We next update each community center by recalculating the centroid of the nodes assigned to it:

$$\begin{aligned} {\varvec{\nu }}_{j}=\frac{1}{\vert C_{j} \vert } \sum \limits _{\textbf{z}_{i} \in C_{j}} \textbf{z}_{i}, \end{aligned}$$
(12)

where \(C_{j}\) contains all node representations with class label \(c_{i}=j\).
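For illustration, a plain version of this center-initialization loop (Eqs. (11)-(12)) on the node representations \(\textbf{Z}\) can be sketched as follows; the function name and the iteration cap are our own choices.

```python
# Sketch of K-means center initialization on node representations Z (Eqs. (11)-(12)).
import torch

def kmeans_centers(Z, M, iters=100, seed=0):
    g = torch.Generator().manual_seed(seed)
    centers = Z[torch.randperm(Z.size(0), generator=g)[:M]].clone()  # random init
    labels = torch.full((Z.size(0),), -1)
    for _ in range(iters):
        dist2 = torch.cdist(Z, centers).pow(2)
        new_labels = dist2.argmin(dim=1)              # Eq. (11): nearest center
        if torch.equal(new_labels, labels):           # stop when labels stabilize
            break
        labels = new_labels
        for j in range(M):                            # Eq. (12): recompute centroids
            members = Z[labels == j]
            if members.numel() > 0:
                centers[j] = members.mean(dim=0)
    return centers, labels
```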

We next use the Student’s t-distribution (Van der Maaten and Hinton 2008) as a kernel to measure the similarity between each node and each community center. The similarity \(\textbf{q}_{ij}\) between a node i and a community j is computed by

$$\begin{aligned} \textbf{q}_{ij} = \frac{(1 + \Vert \textbf{z}_{i} - {\varvec{\nu }}_{j} \Vert ^{2})^{-1}}{\sum _{j'} (1 + \Vert \textbf{z}_{i} - {\varvec{\nu }}_{j'} \Vert ^{2})^{-1}}. \end{aligned}$$
(13)

It can be interpreted as the probability of assigning node i to community j. As such, we treat \(\textbf{Q}=[\textbf{q}_{ij}] \in \mathbb {R}^{N \times M}\) as the distribution of the soft assignments of all nodes.

In order to detect cohesive communities, an objective is to move each node closer to its community center in terms of their similarity in the feature space \(\mathbb {Z}\). This can be translated into obtaining a high-confidence and trustworthy distribution of soft assignments, denoted as the target distribution \(\textbf{P} \in \mathbb {R}^{N \times M}\), where an element \(\textbf{p}_{ij} \in \textbf{P}\) is computed by raising \(\textbf{q}_{ij}\) to the second power and then normalizing by frequency per community:

$$\begin{aligned} \textbf{p}_{ij} = \frac{\textbf{q}_{ij}^{2}/\textbf{f}_{j}}{\sum _{j'} \textbf{q}_{ij'}^{2}/\textbf{f}_{j'} }, \end{aligned}$$
(14)

where \(\textbf{f}_{j} = \sum _{i} \textbf{q}_{ij}\) are soft community frequencies.
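A small sketch of these two distributions, with names of our own choosing, follows; it computes the Student's t soft assignments of Eq. (13) and the sharpened targets of Eq. (14).

```python
# Sketch of the soft assignment Q (Eq. (13)) and target distribution P (Eq. (14)).
import torch

def soft_assignment(Z, centers):
    """q_ij from a Student's t kernel with one degree of freedom."""
    dist2 = torch.cdist(Z, centers).pow(2)         # (N, M) squared distances
    q = 1.0 / (1.0 + dist2)
    return q / q.sum(dim=1, keepdim=True)          # normalize over communities

def target_distribution(q):
    """Square q, normalize by soft community frequency f_j, then renormalize rows."""
    weight = q.pow(2) / q.sum(dim=0)               # f_j = sum_i q_ij
    return weight / weight.sum(dim=1, keepdim=True)
```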

After acquiring soft assignment distribution \(\textbf{Q}\) and target distribution \(\textbf{P}\), we use the Kullback–Leibler divergence loss to measure the difference between the two community distributions.

$$\begin{aligned} L_{c} = KL(\textbf{P} \Vert \textbf{Q}) =\sum _{i} \sum _{j} \textbf{p}_{ij} \log \frac{\textbf{p}_{ij}}{\textbf{q}_{ij}}. \end{aligned}$$
(15)

By minimizing the KL divergence loss, our model improves the distribution \(\textbf{Q}\) under the supervision of the target distribution \(\textbf{P}\), which is called self-training community detection. Note that \(\textbf{P}\) should have the following properties: (a) a high accuracy of community division; (b) community assignments with high confidence; and (c) a normalized loss contribution from each centroid. Therefore, we choose \(\textbf{Q}\) to construct \(\textbf{P}\) because \(\textbf{Q}\) is natural and flexible in its use of softer probabilistic targets.

For model training, we define the final loss of GCFM by

$$\begin{aligned} L = L_{r} + \lambda L_{c}, \end{aligned}$$
(16)

where \(\lambda \) is a coefficient to balance two losses of network reconstruction and community detection.

At last, we obtain final community divisions from the optimized distribution \(\textbf{Q}\). The community label of each node is assigned by:

$$\begin{aligned} g_{i} = \mathop {\arg \max }\limits _{j}(\textbf{q}_{ij}), \end{aligned}$$
(17)

where \(g_{i}\) is the community label of node i.
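Putting the pieces together, one training step and the final label readout (Eqs. (15)-(17)) can be sketched as follows; the helper names and the use of an external optimizer are assumptions, not the authors' implementation.

```python
# Sketch of one training step (Eqs. (15)-(16)) and label assignment (Eq. (17)).
import torch
import torch.nn.functional as F

def train_step(q, p, L_r, optimizer, lam=10.0):
    # Eq. (15): L_c = KL(P || Q); P is treated as a fixed target here.
    L_c = F.kl_div(q.log(), p.detach(), reduction='sum')
    loss = L_r + lam * L_c                 # Eq. (16): L = L_r + lambda * L_c
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def community_labels(q):
    # Eq. (17): assign node i to the community with the highest q_ij.
    return q.argmax(dim=1)
```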

3.5 Complexity analysis

Let N denote the number of nodes, L the number of layers in a multiplex network, K the depth of GCA, and M the number of detected communities. We consider the computational complexity of each module. The node initial embedding is generated by DeepWalk, which involves training a Skip-gram model. Given the number of random walks \(\rho \), walk length t, window size \(\omega \), and initial embedding size \(d_{0}\), its computational complexity is in the order of \(\mathcal {O}(L\rho Nt\omega (d_{0}+d_{0}logN))\) according to Chen et al. (2018). GCA is a GCN-based model whose time complexity is linear in the number of edges. Therefore, we focus on the number of edges in each layer \(\vert E_{l} \vert , l=1,2,...,L\) and the encoding dimension of each GCA scale \(d_{k}, k=1,2,...,K\). The complexity of GCA is in the order of \(\mathcal {O}(\sum _{l=1}^{L} \vert E_{l} \vert d_{0}d_{1}d_{2}...d_{K})\). The MFN consists of multiple fully connected units and its initial input is the first scale of GCA. The time complexity of MFN is in the order of \(\mathcal {O}(LNd_{1}^{2}d_{2}^{2}...d_{K}^{2})\). For the self-training module, the computation time mainly depends on Eq. (13). According to Xie et al. (2016), its computational complexity is \(\mathcal {O}(NM+NlogN)\). To sum up, the total time complexity of GCFM is in the order of \(\mathcal {O}(L\rho Nt\omega (d_{0}+d_{0}logN)+ \sum _{l=1}^{L} \vert E_{l} \vert d_{0}d_{1}d_{2}...d_{K} +LNd_{1}^{2}d_{2}^{2}...d_{K}^{2} +NM+NlogN)\), which is linear in the number of layers and the number of edges in each layer.

4 Experiment setup

4.1 Datasets

We conduct experiments on synthetic datasets with ground-truth community labels and on real-world datasets without such labels. For synthetic datasets, we use mLFR (Bródka 2016), which is controlled by the mixing parameter \(\mu \); a smaller \( \mu \) suggests more obvious community structures, and we vary it in our experiments. For real-world datasets, we use AUCs (Magnani et al. 2013), RM (Eagle and Pentland 2006), C.elegans (Chen et al. 2006), London (De Domenico et al. 2014), Vickers (Zhang et al. 2017), FFTWYT (Magnani and Rossi 2011), CKM (Coleman et al. 1957), Plasmodium (Stark et al. 2006), HumanHIV1 (Stark et al. 2006), and FriendFeed (Magnani and Rossi 2011). The details of the datasets are as follows.

\(\bullet \) mLFR (Bródka 2016): It is a benchmark tool for generating synthetic multiplex networks. The main parameter of mLFR is the mixing parameter \(\mu \) which is the probability of a node connecting to another node in a different community. A smaller value of \( \mu \) suggests that a multiplex network contains more obvious community structures. Table 1 presents the parameter settings for the mLFR datasets in our experiment.

The following real-world multiplex networks are from different domains, and Table 2 summarizes their statistics.

\(\bullet \) AUCs (Magnani et al. 2013): It is a 5-layer social network consisting of 61 nodes as employees in the Aarhus University and 620 edges as their social relations, including work together, lunch together, friendship on Facebook, off-line friendship, and co-authorship.

\(\bullet \) RM (Eagle and Pentland 2006): It is a 3-layer social network from the MIT Reality Mining project, which consists of 94 nodes as project participants and 1,385 edges as their social relations, including friendship, average proximity at work, and average proximity outside the lab.

\(\bullet \) C.elegans (Chen et al. 2006): It is a 3-layer neuronal network of the nematode "Caenorhabditis elegans", consisting of 279 nonpharyngeal neurons and 5,863 synaptic connections of three types, namely, electric connection, chemical monadic connection, and chemical polyadic connection.

\(\bullet \) London (De Domenico et al. 2014): It is a 3-layer traffic transportation network consisting of 369 nodes and 441 edges, where nodes are stations and three types of edges are transportation lines, including the underground lines, overground lines, and DLR.

\(\bullet \) Vickers (Zhang et al. 2017): It is a 3-layer offline social network of 29 nodes as seventh grade students in a school and 740 edges for their social relations, including affinity in the class, best friends and working together.

\(\bullet \) FFTWYT (Magnani and Rossi 2011): It is a 3-layer online social network, consisting of 6,407 users and 74,862 edges. The three layers correspond to three online social platforms, including Friendfeed, Twitter, and YouTube.

\(\bullet \) CKM (Coleman et al. 1957): It is a 3-layer social network, consisting of 246 nodes and 1,551 edges. The data is collected from physicians in four towns in Illinois (Peoria, Bloomington, Quincy, and Galesburg) and records the physicians' adoption of a new drug.

\(\bullet \) Plasmodium (Stark et al. 2006): It is a 3-layer biological network, consisting of 1,203 nodes and 2,521 edges, which includes three types of interactions between genes and proteins, i.e. direct interaction, physical association and association.

\(\bullet \) HumanHIV1 (Stark et al. 2006): It is a 5-layer biological network, consisting of 1,005 nodes and 1,355 edges. It considers different types of genetic interactions about HIV type 1 in the Biological General Repository for Interaction Datasets, i.e. physical association, direct interaction, colocalization, association, and suppressive genetic interaction.

\(\bullet \) FriendFeed (Magnani and Rossi 2011): It is a 3-layer online social network, consisting of 21,006 nodes and 573,600 edges. It mainly contains interactions among users in Friendfeed collected over two months, including commenting, liking, and following.

Fig. 4 The topological structure of each layer in the AUCs multiplex network

Table 1 The basic parameter setting of mLFR dataset
Table 2 The statistics of real datasets

4.2 Competitors

We compare our GCFM with the following competitors:

\(\bullet \) PMM (Tang et al. 2009) is based on modularity optimization and matrix decomposition, which consists of structural feature extraction and cross-layer integration.

\(\bullet \) MDLPA (Boutemine and Bouguessa 2017) introduces a constrained label propagation mechanism for multiplex networks.

\(\bullet \) CSNMF (Gligorijević et al. 2019) introduces a collective factorization framework which uses symmetric NMF (Kuang et al. 2012) for each network layer and then generates a common feature representation by fusing layers to extract community structures.

\(\bullet \) DH-Louvain (Shao et al. 2022) is a multi-greedy algorithm for community detection in multiplex networks which optimizes a weighted modularity density.

\(\bullet \) M-DeepWalk (Song and Thiagarajan 2019) first constructs a supra graph for a multiplex network and then uses the DeepWalk (Perozzi et al. 2014) to obtain node representations that are input into an auto-encoder to extract cohesive structures in the latent space.

\(\bullet \) DMGI (Park et al. 2020) extends Deep Graph Infomax (Velickovic et al. 2019) to multiplex networks and jointly integrates the nodes' embeddings through a consensus regularization framework.

\(\bullet \) HDMI (Jing et al. 2021) designs a joint supervision signal that includes both extrinsic and intrinsic mutual information to optimize node representations in multiplex networks.

4.3 Metrics

For synthetic datasets, as they have ground truth labels, we use the commonly used evaluation metrics for supervised learning, including NMI (Normalized Mutual Information), ARI (Adjusted Rand Index), and Purity. For real-world datasets, as they do not contain ground truth labels, we adopt the widely used modularity metric (Mucha et al. 2010) (also reported for synthetic datasets) with two parameters, the resolution parameter \(\gamma _{s}\) and the coupling parameter \(\mathscr {C}_{jsr}\). The details of the metrics are as follows.

NMI trades off the quality of communities against the number of communities:

$$\begin{aligned} NMI(\Omega ; C) = \frac{I(\Omega ; C)}{[H(\Omega )+H(C)]/2}, \end{aligned}$$
(18)

where \(\Omega \) is the community division and C is the ground truth. \(I(\Omega ; C)\) is the mutual information between \(\Omega \) and C. \(H(\cdot )\) is the Shannon information entropy.

Rand Index is the percentage of true decisions, defined by

$$\begin{aligned} RI(\Omega ; C) = \frac{TP + TN}{TP + FP + FN +TN}, \end{aligned}$$
(19)

where true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are decisions between \(\Omega \) and C. ARI adjusts the Rand Index for chance agreement.

Purity is the percentage of nodes which are classified correctly:

$$\begin{aligned} Purity(\Omega ; C) = \frac{1}{N}\sum _{m} \mathop {max}\limits _{j} \vert \omega _{m} \cap c_{j} \vert , \end{aligned}$$
(20)

where \(\omega _{m}\) and \(c_{j}\) mean the m-th community in \(\Omega \) and the j-th community in C respectively.
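As a quick reference (an illustrative sketch using scikit-learn for NMI and ARI and a direct implementation of purity, not evaluation code from this paper), these supervised metrics can be computed as follows.

```python
# Sketch of the supervised metrics (Eqs. (18)-(20)) on toy label vectors.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def purity(pred, truth):
    """Fraction of nodes whose detected community matches the majority true label."""
    pred, truth = np.asarray(pred), np.asarray(truth)   # non-negative int labels
    total = 0
    for c in np.unique(pred):
        members = truth[pred == c]
        total += np.bincount(members).max()             # size of dominant true class
    return total / len(truth)

pred  = [0, 0, 1, 1, 1, 2]        # detected community labels (toy example)
truth = [0, 0, 1, 1, 2, 2]        # ground-truth labels
print(normalized_mutual_info_score(truth, pred),
      adjusted_rand_score(truth, pred),
      purity(pred, truth))
```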

The multilayer modularity evaluation metric is defined by:

$$\begin{aligned} Q_{m} = \frac{1}{2\eta }\sum \limits _{ijsr} [(\textbf{A}_{ijs} - \gamma _{s} \frac{d_{is} d_{jr}}{2m_{s}} ) \delta (s,r) + \delta (i,j) \mathscr {C}_{jsr}] \delta \left( g_{is}, g_{jr}\right) , \end{aligned}$$
(21)

where \(\textbf{A}_{ijs}\) is the intra-layer edge strength and \(\mathscr {C}_{jsr}\) is the inter-layer edge strength, also called the coupling parameter (subscripts i, j index nodes and s, r index layers). \(\gamma _{s}\) is the resolution parameter of the s-th layer. \( \eta =\left( \sum _{ijs} \textbf{A}_{ijs} + \sum _{jsr} \mathscr {C}_{jsr} \right) /2 \) is the total multilayer edge strength. \(m_{s} = \left( \sum _{ij} \textbf{A}_{ijs} \right) /2\) is the total intra-layer edge strength of the s-th layer. \(d_{is}\) is the degree of node i in the s-th layer. \( g_{is} \) is the community label of node i in the s-th layer. \(\delta (\cdot )\) is the Kronecker function. In our experiments, the resolution parameter \(\gamma _{s}\) and the coupling parameter \(\mathscr {C}_{jsr}\) are set to 1. The multilayer modularity is scaled in the range of [0, 1]. A higher modularity indicates a better community detection result.

4.4 Parameters

In our GCFM algorithm, we use DeepWalk to generate a 128-dimensional initial embedding for each node. For DeepWalk, we set the window size to 10, the walk length to 40, and the number of walks started per node to 20 for all datasets. The depth of GCA is set to 4. The dimensions of the GCA scales are set to 128-512-1024-2048-10 for synthetic datasets and 128-1024-2048-2048-10 for real-world datasets. The learning rate r is set to 0.001. Considering that the reconstruction loss \(L_{r}\) is much larger than the community detection loss \(L_{c}\), we set the balance coefficient \(\lambda =10\).

For the competitor M-DeepWalk, as it assigns a community label to a node in each layer, we use majority voting to determine the node's final community label. MDLPA can automatically determine the number of communities without a priori assumption and does not need any hyper-parameters. PMM has a single hyper-parameter, the number of structural features extracted from each layer; we vary this number from 5 to 20 with a step of one and choose the best result. CSNMF has no external parameters except a pre-specified community number; we set it to the target community number in synthetic datasets and tune it on real-world datasets. DH-Louvain is also parameter-free and can determine the number of communities by itself. For M-DeepWalk, we set the threshold for generating a supra graph to 0.1 for the synthetic datasets and 0.2 for the real-world datasets; the parameters of DeepWalk are the same as ours, and the learning rate is set to 0.001. For DMGI and HDMI, we use the initial embedding in Sect. 3.1 as 128-dimensional node features in each layer, and the learning rate is also set to 0.001. For the coefficients of the module loss and regularization in DMGI, we follow the original paper, select from [0.0001, 0.001, 0.01, 0.1], and tune them to obtain the best result. For HDMI, we set the layer and fusion coefficients to 1 for all layers in synthetic datasets and tune them by grid search on the real-world datasets. For all the competitors and our model, we run each 10 times and report the average result.

Fig. 5 The results of NMI, ARI, purity and modularity on mLFR datasets. The upper four subfigures show the varying of a NMI, b ARI, c purity, d modularity in small mLFR datasets with 500 nodes, and the lower four subfigures show the varying of e NMI, f ARI, g purity, h modularity in larger mLFR datasets with 2000 nodes

5 Results and analysis

5.1 Performance on synthetic datasets

Figure 5 compares the community detection performance on mLFR synthetic networks. We observe that our GCFM performs the best in almost all cases. Furthermore, the performance improvements over the competitors are more significant in the cases of \(\mu \ge 0.25\). We also notice that the much better modularity of our GCFM over that of the competitors suggests stronger modular structures in its detected communities.

These results validate the effectiveness of our GCFM algorithm: The GCA module encodes neighbor-aware structural information for each node to explore locally associated structures at different scales in each network layer. The MFN module fuses structural encodings first for multiple network layers and then for different convolution scales, so as to learn a node representation featuring both intra-layer locally associated structures and inter-layer semantic relations. Finally, the self-training mechanism extracts cohesive communities from the viewpoint of community distribution via the soft assignment, which can iteratively optimize each node representation close to its community center.

Some further analysis of the results follows. We observe that the performance of all algorithms degrades as the mixing parameter \(\mu \) increases. For the mLFR synthetic datasets, a larger value of \(\mu \) introduces more connections between nodes belonging to different communities, which makes the community detection task more difficult. We also observe that M-DeepWalk, based on node representation learning, performs well in many cases. M-DeepWalk extends DeepWalk (Perozzi et al. 2014) by enabling cross-layer random walks for learning node representations. Even given its simple learning technique, the results indicate that learning node representations can be a powerful approach for the community detection task. DMGI and HDMI are both based on mutual information maximization, which can narrow the difference between network layers, and they achieve relatively good performance. PMM achieves average performance by optimizing the principal modularity. CSNMF, which utilizes collective matrix decomposition, is sensitive to the mixing parameter \(\mu \): when \(\mu \ge 0.25\), its performance drops sharply. Though MDLPA can determine the community number automatically, the loss of multiplex information when flattening multiplex networks leads to the worst performance on most mLFR datasets. DH-Louvain has two phases, node merging and community merging, to optimize a weighted modularity density, and its performance is average among all competitors.

Table 3 Modularity on real-world datasets
Fig. 6 Modularity with different pre-specified numbers of communities

5.2 Performance on real-world datasets

Table 3 presents the modularity results on real-world datasets. The numbers in brackets are the numbers of communities detected by these algorithms when achieving their respective best modularity. We note that the PMM, CSNMF, M-DeepWalk, DMGI, HDMI, and GCFM algorithms need the number of communities to be pre-specified, while the MDLPA and DH-Louvain algorithms do not and automatically determine the number of communities upon finishing execution. Figure 6 presents the modularity results with different pre-specified community numbers on three different types of multiplex networks. We observe that our GCFM outperforms the others, demonstrating its superiority and robustness.

We again observe that our GCFM achieves the best modularity on most real-world datasets, and the third place on the London dataset. The specialty of the London dataset lies in its sparsity: it is a 3-layer network containing 369 nodes yet only 441 edges. Indeed, it contains many isolated nodes, each of which does not link to any other node in one or more network layers. Generally speaking, a community division with high modularity should assign each isolated node an independent community label. MDLPA directly regards isolated nodes as independent communities, which can be seen from its much larger community numbers in Table 3. On the other hand, algorithms needing pre-specified community numbers do not discriminate isolated nodes and still cluster them into communities, which leads to a lower modularity. This phenomenon could easily be addressed by first removing isolated nodes before community division. From Fig. 6, it can be observed that our GCFM achieves consistently good performance, while some competitors such as CSNMF and PMM show performance variations as the number of communities increases. Both CSNMF and PMM are based on adjacency matrix decomposition. Varying the community number impacts the eigenvectors during matrix decomposition, which in turn impacts the obtained community structures when two medium-sized communities merge into one or a huge community is divided into multiple communities. In our GCFM, the community number only impacts the self-training module and the KL loss.

An interesting observation is that M-DeepWalk often performs the worst on these real-world datasets, which is contrary to the results on the synthetic datasets. This discrepancy comes from the differences between the two types of datasets. In the synthetic datasets, the topological structure of each layer is similar, that is, each node has similar neighborhoods in different layers, which makes it easier to find more inter-layer edges. In real-world datasets, the topological structures of different network layers can differ greatly, which leads to the loss of structural information when generating a supra graph. Our GCFM can decrease such inconsistency of nodes across layers.

5.3 Ablation study

5.3.1 Node initial embedding

To investigate the impact of node initial embedding, we compare DeepWalk with three other methods, i.e., one-hot encoding, node2vec (Grover and Leskovec 2016), and node centrality. For node2vec, we set its return parameter \(p = 1\) and in-out parameter \(q = 0.5\). For the centrality-based method, we choose four common centrality metrics as the node initial embedding, including degree centrality, betweenness centrality, closeness centrality, and load centrality. Figure 7 presents the modularity results with different node initial embeddings. The two graph embedding algorithms, namely, DeepWalk and node2vec, achieve better performance than the other two. This is not unexpected, as they can pre-train node initial embeddings with some topological information.

Fig. 7 Modularity with different ways of generating node initial embedding

5.3.2 Fusion depth analysis

For a given four-scale GCA module, we conduct experiments to analyze the impact of fusion depths. The outputs of the four-scale GCA are denoted by \(\textbf{H}_{1}\), \(\textbf{H}_{2}\), \(\textbf{H}_{3}\), and \(\textbf{H}_{4}\). Each time we decrease one output scale and input the others into the MFN module. For example, GCFM-3 means that there are three output scales, i.e. \(\textbf{H}_{2}\), \(\textbf{H}_{3}\), and \(\textbf{H}_{4}\), which are input to the MFN module. Besides, GCFM-1 means that only \(\textbf{H}_{4}\) is used for self-training community detection. Table 4 presents the modularity results with fusion under different scales. We can see that the output of each GCA scale contributes to the final community division and GCFM-4 outperforms the others, which suggests the necessity of multiscale fusion.

Table 4 Modularity with different fusion depths

5.3.3 Model depth analysis

We also conduct experiments to investigate the effect of model depth, i.e., the number of GCA scales. Figure 8 plots the modularity results against different model depths. We can observe that the best modularity results are usually obtained with a 4-scale model. When more than four scales are used, the modularity does not keep increasing and may even start dropping. Although multiscale fusion is desirable, a larger model depth, i.e., more GCA scales, might introduce the so-called over-smoothing problem: after more neighbors are included in the aggregation operation, the neighbor-aware structural encodings start becoming similar across topologically close nodes, reducing their capability of discriminating local structures.

Fig. 8 Modularity with different model depths

5.3.4 Parameter analysis

In order to investigate the effect of the hyper-parameters, i.e., the loss coefficient \(\lambda \) and the learning rate r, we test the modularity while increasing \(\lambda \) from 0 to 100 under different learning rates, i.e., \(r \in \{0.0001, 0.001, 0.01, 0.1\}\). Figure 9 shows the modularity results with varying \(\lambda \) and r on the C.elegans dataset. We can see that the best modularity is obtained under the setting of \(\lambda =10\) and \(r=0.001\). When \(\lambda =0\), our model cannot identify highly modular structures, which demonstrates the effectiveness of the community detection loss \(L_{c}\).

Fig. 9 Modularity with different \(\lambda \) and r on the C.elegans dataset

Fig. 10 Network visualization for an instance of a 3-layer mLFR network with \(\mu = 0.35\). a The 3-layer mLFR network consists of 50 nodes and 84, 83, 84 edges in the three layers, respectively. The vertical lines connect the same node across the three layers. b The ground truth of community divisions, as well as the communities detected by the experimented algorithms. Nodes with the same color belong to the same community

Fig. 11 Embedding visualization of the node representation \(\textbf{Z}\) in an mLFR network with \(\mu = 0.4\), consisting of 500 nodes and in total 16,445 edges. a 2D visualization of the nodes' representations during the training process of the GCFM algorithm. Epoch 0 indicates that the representation is only trained with the graph convolutional auto-encoder, and the following epochs include the self-training component in the representation training. b Visualization of the feature similarity matrix \(\mathbf {ZZ^{T}}\). For the node representation \(\textbf{Z}\) at epoch 25, we normalize the matrix \(\mathbf {ZZ^{T}}\) and visualize it as a gray-scale image. The main diagonal blocks indicate the community distribution

5.4 Visualization

5.4.1 Network visualization

Figure 10 visualizes an instance of a 3-layer mLFR network with \(\mu = 0.35\) and the communities detected by the experimented algorithms. We can observe that the result of our GCFM is closest to the ground truth.

5.4.2 Embedding visualization

Figure 11 visualizes the node representations and the feature similarity matrix learned by our GCFM for an instance of a 4-layer mLFR network with \(\mu =0.4\). We use t-SNE (Van der Maaten and Hinton 2008) to visualize the node representation \(\textbf{Z}\) at several training epochs. We can see that not only do the node representations become more evident as the epochs increase, but the relations between nodes also become clearer, that is, the community divisions become more separated in the projected space. On the right, the main diagonal blocks of the feature similarity matrix \(\mathbf {ZZ^{T}}\) show the community distribution corresponding to the result of the 25th training epoch.

6 Discussion and conclusion

In this paper, we have proposed GCFM for the task of community detection in multiplex networks. It contains a graph convolutional auto-encoder for encoding neighbor-aware intra-layer structural information, a multiscale fusion network for learning a holistic version of nodes' representations by fusing nodes' encodings at different layers and different scales, and a self-training mechanism to train the model and detect communities. Experiments on both synthetic and real-world datasets have validated the superiority of our GCFM over state-of-the-art algorithms. A notable feature of our model lies in its full consideration of different scales of local information.

We also observe some limitations that need further investigation. Due to the lack of prior knowledge, we treat each layer equally, while each network layer may contribute differently to the final detected communities in the real world. Therefore, how to weigh each layer in a multiplex network is important, and one possible approach is to introduce node attributes. Also, GCFM is mainly designed for undirected multiplex networks, while some real-world applications involve directed edges. In our future work, we would like to investigate community detection in multiplex networks with node attributes and edge directions.