1 Introduction

Graphs are flexible data structures that model complex relationships among entities, i.e., data points are represented as nodes and the relations between them as edges. One important task in graph analysis is community detection, where the objective is to cluster nodes into multiple groups (communities). Each community is a set of densely connected nodes. Communities can be overlapping or non-overlapping, depending on whether they share nodes. Several algorithmic [1, 5] and probabilistic [9, 20, 34, 38] approaches to community detection have been proposed. Another fundamental task in graph analysis is learning node embeddings, which can then be used for downstream tasks like graph visualization [8, 28, 33, 34] and classification [3, 27].

In the literature, these tasks are usually treated separately. Although standard graph embedding methods capture basic connectivity, the node embeddings are learned independently of community detection. For instance, a simple approach is to obtain node embeddings via DeepWalk [23] and then derive community assignments for each node using k-means or a Gaussian mixture model. From the other perspective, methods like Bigclam [36], which focus on finding the community structure in the dataset, perform poorly on node-representation tasks, e.g. node classification. This motivates us to study approaches that jointly learn community-aware node embeddings.

Recently, several approaches, such as CNRL [30], ComE [4] and vGraph [26], have been proposed to learn node embeddings and detect communities simultaneously in a unified framework. Several studies have shown that community detection is improved by incorporating node representations in the learning process [3, 18]. The intuition is that the global structure of the graph learned during community detection provides useful context for the node embeddings and vice versa.

The joint learning methods (CNRL, ComE and vGraph) learn two embeddings for each node. One embedding is used for the node representation task; the second is the “context” embedding of the node, which aids in community detection. As CNRL and ComE are based on Skip-Gram [22] and DeepWalk [23], they inherit the “context” embedding from these methods for learning the neighbourhood information of a node. vGraph also requires two node embeddings for parameterizing two different distributions. In contrast, we propose learning a single community-aware node representation which is directly used for both tasks.

In this paper, we propose an efficient generative model called J-ENC for jointly learning community detection and node representation. The underlying intuition behind J-ENC is that every node can be a member of one or more communities, but the node embeddings should be learned in such a way that connected nodes are “closer” to each other than unconnected nodes. Moreover, connected nodes should have similar community assignments. Formally, we assume that for the i-th node, the node embedding \({\boldsymbol{z}}_i\) is generated from a prior distribution \(p({\boldsymbol{z}})\). Given \({\boldsymbol{z}}_i\), the community assignment \(c_i\) is sampled from \(p(c_i | {\boldsymbol{z}}_i)\), which is parameterized by node and community embeddings. In order to generate an edge (i, j), we sample another node embedding \({\boldsymbol{z}}_j\) from \(p({\boldsymbol{z}})\) and the respective community assignment \(c_j\) from \(p(c_j | {\boldsymbol{z}}_j)\). Afterwards, the node embeddings and the respective community assignments of the node pair are fed to a decoder. The decoder ensures that the node embeddings, as well as the community embeddings, of connected nodes share high similarity. This enables learning node embeddings that are useful for both community detection and node representation tasks.

We validate the effectiveness of our approach on several real-world graph datasets. In Sect. 4, we show empirically that J-ENC outperforms the baseline methods, including its direct competitors, on all three tasks, i.e. node classification, overlapping community detection and non-overlapping community detection. Furthermore, we compare the computational cost of training the different algorithms: J-ENC is up to 40x more time-efficient than its competitors. We also conduct a hyperparameter sensitivity analysis, which demonstrates the robustness of our approach. Our main contributions are summarized below:

  • We propose an efficient generative model called J-ENC for joint community detection and node representation learning.

  • We adopt a novel approach and argue that a single node embedding is sufficient for learning both the representation of the node itself and its context.

  • Training J-ENC is extremely time-efficient in comparison to its competitors.

2 Related Work

2.1 Community Detection

Early community detection algorithms are inspired by clustering algorithms [35]. For instance, spectral clustering [29] is applied to the graph Laplacian matrix to extract the communities. Similarly, several matrix factorization based methods have been proposed to tackle the community detection problem. For example, Bigclam [36] treats the problem as a non-negative matrix factorization (NMF) task. Another method, CESNA [38], extends Bigclam by modelling the interaction between the network structure and the node attributes. Some generative models, like vGraph [26] and Circles [20], have also been proposed to detect communities in a graph.

2.2 Node Representation Learning

Many successful algorithms that learn node representations in an unsupervised way are based on random walk objectives [10, 11, 23]. Some known issues with random-walk based methods (e.g. DeepWalk, node2vec) are: (1) they sacrifice the structural information of the graph by putting over-emphasis on the proximity information [24], and (2) their performance depends heavily on hyperparameters (walk length, number of hops, etc.) [10, 23]. Some interesting GCN-based approaches include graph autoencoders, e.g. GAE and VGAE [17], and DGI [32].

2.3 Joint Community Detection and Node Representation Learning

In the literature, several attempts have been made to tackle both of these tasks in a single framework. Most of these methods propose an alternating optimization process, i.e. learn node embeddings, improve community assignments with them, and vice versa [4, 30]. Some approaches (CNRL [30], ComE [4]) are inspired by random walks and thus inherit the issues discussed above. Others, like GEMSEC [25], are limited to the detection of non-overlapping communities. Some generative models like CommunityGAN [13] and vGraph [26] also jointly learn community assignments and node embeddings. CNRL, ComE and vGraph require learning two embeddings per node to simultaneously tackle the two tasks. Unlike them, J-ENC learns a single community-aware node representation which is directly used for both tasks.

It is pertinent to highlight that although both vGraph and J-ENC adopt a variational approach, the underlying models are quite different. vGraph assumes that each node can be represented as a mixture of multiple communities and is described by a multinomial distribution over communities, whereas J-ENC models the node embedding by a single distribution. For a given node, vGraph first draws a community assignment and then generates a connected neighbor node based on that assignment. J-ENC, in contrast, draws the node embedding from the prior distribution, and the community assignment is conditioned on that single node only. In simple terms, vGraph needs edge information in the generative process whereas J-ENC does not. J-ENC relies on the decoder to ensure that the embeddings of connected nodes and their communities share high similarity with each other.

3 Methodology

3.1 Problem Formulation

Consider an undirected graph \({\mathcal {G}}= ({\mathcal {V}}, {\mathcal {E}})\) with the adjacency matrix \({\boldsymbol{A}}\in {\mathbb {R}}^{N\times N}\) and a matrix \({\boldsymbol{X}}\in {\mathbb {R}}^{N \times F}\) of F-dimensional node features, N being the number of nodes. Given the number of communities K, we aim to jointly learn the node embeddings and the community embeddings following a variational approach such that:

  • One or more communities can be assigned to every node.

  • The node embeddings can be used for both community detection and node classification.

3.2 Variational Model

Generative Model: Let us denote the latent node embedding and community assignment for i-th node by the random variables \({\boldsymbol{z}}_{i} \in \mathbb {R}^d\) and \(c_{i}\) respectively. The generative model is given by:

$$\begin{aligned} p({\boldsymbol{A}}) = \int \sum \limits _{{\boldsymbol{c}}} p({\boldsymbol{Z}}, {\boldsymbol{c}}, {\boldsymbol{A}}) d{\boldsymbol{Z}}, \end{aligned}$$
(1)

where \({\boldsymbol{c}}= [c_{1}, c_{2}, \cdots , c_{N}]\) and the matrix \({\boldsymbol{Z}}= [{\boldsymbol{z}}_{1}, {\boldsymbol{z}}_{2}, \cdots , {\boldsymbol{z}}_{N}]\) stacks the node embeddings. The joint distribution in (1) is mathematically expressed as

$$\begin{aligned} p({\boldsymbol{Z}}, {\boldsymbol{c}}, {\boldsymbol{A}}) = p({\boldsymbol{Z}}) \ p_{\theta }({\boldsymbol{c}}| {\boldsymbol{Z}}) \ p_{\theta }({\boldsymbol{A}}| {\boldsymbol{c}}, {\boldsymbol{Z}}), \end{aligned}$$
(2)

where \(\theta \) denotes the model parameters. Let us denote the elements of \({\boldsymbol{A}}\) by \(a_{ij}\). Following existing approaches [14, 17], we consider the \({\boldsymbol{z}}_i\) to be i.i.d. random variables. Furthermore, assuming the \(c_i | {\boldsymbol{z}}_i\) to be i.i.d. random variables, the joint distribution in (2) can be factorized as

$$\begin{aligned} p({\boldsymbol{Z}})&= \prod \limits _{i = 1}^{N} p({\boldsymbol{z}}_{i}) \end{aligned}$$
(3)
$$\begin{aligned} p_{\theta }({\boldsymbol{c}}| {\boldsymbol{Z}})&= \prod \limits _{i = 1}^{N} p_{\theta }(c_{i} | {\boldsymbol{z}}_{i}) \end{aligned}$$
(4)
$$\begin{aligned} p_{\theta }({\boldsymbol{A}}| {\boldsymbol{c}}, {\boldsymbol{Z}})&= \prod \limits _{i, j} p_{\theta }(a_{ij} | c_{i}, c_{j}, {\boldsymbol{z}}_{i}, {\boldsymbol{z}}_{j}), \end{aligned}$$
(5)

where Eq. (5) assumes that the edge decoder \(p_{\theta }(a_{ij} | c_{i}, c_{j}, {\boldsymbol{z}}_{i}, {\boldsymbol{z}}_{j})\) depends only on \( c_{i}, c_{j}, {\boldsymbol{z}}_{i}\) and \({\boldsymbol{z}}_{j}\).

Inference Model: We aim to learn the model parameters \(\theta \) such that \(\text {log} (p_{\theta }({\boldsymbol{A}}))\) is maximized. In order to ensure computational tractability, we introduce the approximate posterior

$$\begin{aligned} q_{\phi }({\boldsymbol{Z}}, {\boldsymbol{c}}| {\mathcal {I}})= & {} \prod \limits _i q_{\phi }({\boldsymbol{z}}_{i}, c_{i} | {\mathcal {I}}) \end{aligned}$$
(6)
$$\begin{aligned}= & {} \prod \limits _i q_{\phi } ({\boldsymbol{z}}_{i} | {\mathcal {I}}) q_{\phi } (c_{i} | {\boldsymbol{z}}_{i}, {\mathcal {I}}), \end{aligned}$$
(7)

where \({\mathcal {I}}= ({\boldsymbol{A}}, {\boldsymbol{X}})\) if node features are available, otherwise \({\mathcal {I}}= {\boldsymbol{A}}\). We maximize the corresponding ELBO bound (for derivation, refer to the supplementary material), given by

$$\begin{aligned} \mathcal {L}_{\mathrm {ELBO}} \approx&- \sum \limits _{i=1}^{N} D_{KL} \Big (q_{\phi }({\boldsymbol{z}}_{i} | {\mathcal {I}}) \ || \ p({\boldsymbol{z}}_{i})\Big ) \nonumber \\ -&\sum \limits _{i=1}^{N} \frac{1}{M} \sum \limits _{m = 1}^{M} D_{KL} \Big (q_{\phi }(c_{i} | {\boldsymbol{z}}_{i}^{(m)}, {\mathcal {I}}) \ ||\ p_{\theta }(c_{i} | {\boldsymbol{z}}_{i}^{(m)})\Big ) \nonumber \\ +&\sum \limits _{(i, j) \in {\mathcal {E}}} \mathbb {E}_{({\boldsymbol{z}}_{i}, {\boldsymbol{z}}_{j}, c_{i}, c_{j}) \sim q_{\phi }({\boldsymbol{z}}_{i}, {\boldsymbol{z}}_{j}, c_{i}, c_{j} | {\mathcal {I}})}\Big \{\text {log}\Big (p_{\theta }(a_{ij} | c_{i}, c_{j}, {\boldsymbol{z}}_{i}, {\boldsymbol{z}}_{j})\Big )\Big \}, \end{aligned}$$
(8)

where \(D_{KL}(.||.)\) represents the KL-divergence between two distributions. The distribution \(q_{\phi }({\boldsymbol{z}}_{i}, {\boldsymbol{z}}_{j}, c_{i}, c_{j} | {\mathcal {I}})\) in the third term of Eq. (8) is factorized into two conditionally independent distributions i.e.

$$\begin{aligned} q_{\phi }({\boldsymbol{z}}_{i}, {\boldsymbol{z}}_{j}, c_{i}, c_{j} | {\mathcal {I}}) = q_{\phi }({\boldsymbol{z}}_{i}, c_{i} | {\mathcal {I}}) q_{\phi }({\boldsymbol{z}}_{j}, c_{j} | {\mathcal {I}}). \end{aligned}$$
(9)
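For completeness, the bound in Eq. (8) follows from the standard variational argument. A condensed sketch of the derivation (the full version is in the supplementary material): multiplying and dividing by the approximate posterior and applying Jensen's inequality gives

$$\begin{aligned} \text {log}\, p_{\theta }({\boldsymbol{A}}) = \text {log} \int \sum \limits _{{\boldsymbol{c}}} q_{\phi }({\boldsymbol{Z}}, {\boldsymbol{c}}| {\mathcal {I}}) \, \frac{p({\boldsymbol{Z}}, {\boldsymbol{c}}, {\boldsymbol{A}})}{q_{\phi }({\boldsymbol{Z}}, {\boldsymbol{c}}| {\mathcal {I}})} \, d{\boldsymbol{Z}} \ \ge \ \mathbb {E}_{q_{\phi }({\boldsymbol{Z}}, {\boldsymbol{c}}| {\mathcal {I}})}\Big \{\text {log}\, \frac{p({\boldsymbol{Z}}) \, p_{\theta }({\boldsymbol{c}}| {\boldsymbol{Z}}) \, p_{\theta }({\boldsymbol{A}}| {\boldsymbol{c}}, {\boldsymbol{Z}})}{q_{\phi }({\boldsymbol{Z}}, {\boldsymbol{c}}| {\mathcal {I}})}\Big \}. \end{aligned}$$

Substituting the factorizations (3)–(7) and (9), and estimating the expectations over \({\boldsymbol{z}}_{i}\) with \(M\) Monte Carlo samples \({\boldsymbol{z}}_{i}^{(m)}\), yields the two KL terms and the reconstruction term of Eq. (8).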

3.3 Design Choices

In Eq. (3), \(p({\boldsymbol{z}}_{i})\) is chosen to be the standard Gaussian distribution for all i. The corresponding approximate posterior \(q_{\phi } ({\boldsymbol{z}}_{i} | {\mathcal {I}})\) in Eq. (7), used as the node embedding encoder, is given by

$$\begin{aligned} q_{\phi } ({\boldsymbol{z}}_{i} | {\mathcal {I}}) = {\mathcal {N}}\big (\boldsymbol{\mu }_{i}({\mathcal {I}}), \text {diag}({\boldsymbol{\sigma }^2}_{i}({\mathcal {I}}))\big ). \end{aligned}$$
(10)

The parameters of \(q_{\phi } ({\boldsymbol{z}}_{i} | {\mathcal {I}})\) can be learnt by any encoder network, e.g. a graph convolutional network [16], a graph attention network [31], GraphSAGE [11], or even two matrices that directly learn \(\boldsymbol{\mu }_{i}({\mathcal {I}})\) and \(\text {diag}({\boldsymbol{\sigma }^2}_{i}({\mathcal {I}}))\). Samples are then generated using the reparametrization trick [6].
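As an illustration, a minimal encoder of this form (a single GCN-style propagation step producing mean and log-variance, followed by reparametrized sampling) could look as follows in PyTorch; the layer structure and normalization are our own assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn

class NodeEncoder(nn.Module):
    """Sketch of q_phi(z_i | I) from Eq. (10): outputs mu_i and log(sigma_i^2)."""

    def __init__(self, in_dim: int, d: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, d)        # learns mu_i(I)
        self.logvar = nn.Linear(in_dim, d)    # learns log sigma_i^2(I)

    def forward(self, A_hat: torch.Tensor, X: torch.Tensor):
        # A_hat: normalized adjacency with self-loops; X: node features.
        h = A_hat @ X                         # one GCN-style propagation step
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparametrization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```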

For parameterizing \(p_{\theta }(c_{i} | {\boldsymbol{z}}_{i})\) in Eq. (4), we introduce community embeddings \(\{{\boldsymbol{g}}_1, \cdots , {\boldsymbol{g}}_K\}; \ {\boldsymbol{g}}_k \in \mathbb {R}^d\). The distribution \(p_{\theta }(c_{i} | {\boldsymbol{z}}_{i})\) is then modelled as the softmax of dot products of \({\boldsymbol{z}}_i\) with \({\boldsymbol{g}}_k\), i.e.

$$\begin{aligned} p_{\theta }(c_{i}&= k| {\boldsymbol{z}}_{i}) = \frac{\text {exp}(<{\boldsymbol{z}}_{i}, {\boldsymbol{g}}_{k}>)}{\sum \limits _{\ell = 1}^{K}\text {exp}(<{\boldsymbol{z}}_{i}, {\boldsymbol{g}}_{\ell }>)}. \end{aligned}$$
(11)

The corresponding approximate posterior \(q_{\phi } (c_{i} = k | {\boldsymbol{z}}_{i}, {\mathcal {I}})\) in Eq. (7) is affected by the node embedding \({\boldsymbol{z}}_{i}\) as well as by the neighborhood. To design it, our intuition is to consider the similarity of \({\boldsymbol{g}}_{k}\) with the embedding \({\boldsymbol{z}}_{i}\) as well as with the embeddings of the neighbors of the i-th node. The overall similarity with the neighbors is formulated as the average of the dot products of their embeddings. A hyperparameter \(\alpha \) then controls the trade-off between the effect of \({\boldsymbol{z}}_{i}\) and that of the set \({\mathcal {N}}_i\) of neighbors of the i-th node. Finally, a softmax is applied, i.e.

$$\begin{aligned} q_{\phi }(c_{i} = k| {\boldsymbol{z}}_{i}, {\mathcal {I}})&= \mathrm {softmax}\Big (\alpha<{\boldsymbol{z}}_{i}, {\boldsymbol{g}}_{k}> \nonumber \\&+\,(1 - \alpha ) \frac{1}{|{\mathcal {N}}_i|} \sum \limits _{j \in {\mathcal {N}}_i}<{\boldsymbol{z}}_{j}, {\boldsymbol{g}}_{k}>\Big ). \end{aligned}$$
(12)
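As a sketch, Eqs. (11) and (12) translate almost directly into code; here `Z` holds the node embeddings, `G` the community embeddings and `A` the binary adjacency matrix (the dense-matrix formulation is our simplification):

```python
import torch
import torch.nn.functional as F

def prior_communities(Z: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """p_theta(c_i = k | z_i), Eq. (11): softmax over <z_i, g_k>."""
    return F.softmax(Z @ G.t(), dim=-1)                   # [N, K]

def posterior_communities(Z, G, A, alpha: float) -> torch.Tensor:
    """q_phi(c_i = k | z_i, I), Eq. (12): the node's own logits mixed with
    the average logits of its neighbors, weighted by alpha."""
    logits = Z @ G.t()                                    # <z_i, g_k>, [N, K]
    deg = A.sum(dim=1, keepdim=True).clamp(min=1)         # |N_i|
    neigh = (A @ logits) / deg                            # mean over neighbors
    return F.softmax(alpha * logits + (1 - alpha) * neigh, dim=-1)
```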

Hence, Eq. (12) ensures that graph structure information is employed to learn community assignments, instead of relying on an extraneous node embedding as done in [4, 26]. Finally, the choice of edge decoder in Eq. (5) is motivated by the intuition that nodes connected by an edge have a high probability of belonging to the same community and vice versa. Therefore, we model the edge decoder as:

$$\begin{aligned} p_{\theta }(a_{ij} | c_{i}&= \ell , c_{j} = m, {\boldsymbol{z}}_{i}, {\boldsymbol{z}}_{j}) = \frac{\sigma (<{\boldsymbol{z}}_{i}, {\boldsymbol{g}}_{m}>) + \sigma (<{\boldsymbol{z}}_{j}, {\boldsymbol{g}}_{\ell }>)}{2}. \end{aligned}$$
(13)
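A direct reading of Eq. (13) as code (with `g_ci`, `g_cj` denoting the embeddings of the sampled communities of nodes i and j; a sketch, not the authors' code):

```python
import torch

def edge_decoder(z_i, z_j, g_ci, g_cj):
    """p_theta(a_ij | c_i, c_j, z_i, z_j), Eq. (13): each node's embedding is
    scored against the community embedding of the *other* node."""
    return 0.5 * (torch.sigmoid((z_i * g_cj).sum(-1)) +
                  torch.sigmoid((z_j * g_ci).sum(-1)))
```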

To better reconstruct the edges, Eq. (13) makes use of the community embeddings, the node embeddings and the community assignment information simultaneously. This helps in learning better node representations by leveraging global information about the graph structure via community detection. On the other hand, it also forces the community assignments to exploit the local graph structure via node embeddings and edge information.

3.4 Practical Aspects

The third term in Eq. (8) is estimated in practice using the samples generated by the approximate posterior. This term is equivalent to the negative of the binary cross-entropy (BCE) loss between observed edges and reconstructed edges. Since the community assignment follows a categorical distribution, we use Gumbel-softmax [12] for backpropagation of the gradients. As for the second term of Eq. (8), it suffices to set \(M = 1\), i.e. to use only one sample per input node.
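Putting these pieces together, one possible training step (with \(M = 1\) and hard community samples) might look like the sketch below; it reuses the hypothetical helpers from the previous sketches and, for brevity, reconstructs only the observed edges:

```python
import torch
import torch.nn.functional as F

def negative_elbo(encoder, A_hat, X, A, edges, G, alpha):
    """Negative of Eq. (8) with M = 1; edges is an [E, 2] tensor of node pairs."""
    z, mu, logvar = encoder(A_hat, X)
    # KL(q(z|I) || N(0, I)), closed form for diagonal Gaussians.
    kl_z = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # KL(q(c|z, I) || p(c|z)) between two categorical distributions.
    q_c = posterior_communities(z, G, A, alpha)
    p_c = prior_communities(z, G)
    kl_c = torch.sum(q_c * (torch.log(q_c + 1e-10) - torch.log(p_c + 1e-10)))
    # Reconstruction: sample hard assignments via Gumbel-softmax so that
    # gradients flow through the categorical sampling step.
    c_hard = F.gumbel_softmax(torch.log(q_c + 1e-10), tau=1.0, hard=True)
    g_c = c_hard @ G                       # community embedding per sampled c_i
    i, j = edges[:, 0], edges[:, 1]
    recon = F.binary_cross_entropy(edge_decoder(z[i], z[j], g_c[i], g_c[j]),
                                   torch.ones(edges.shape[0]))
    return kl_z + kl_c + recon
```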

For inference, the non-overlapping community assignment of the i-th node can be obtained as

$$\begin{aligned} {\mathcal {C}}_i = \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{k \in \{1, \cdots , K\}}\ q_{\phi }(c_{i} = k | {\boldsymbol{z}}_{i}, {\mathcal {I}}). \end{aligned}$$
(14)

To get overlapping community assignments for the i-th node, we can threshold its weighted probability vector at \(\epsilon \), a hyperparameter, as follows

$$\begin{aligned} {\mathcal {C}}_i = \Big \{k \ \bigg |\ \frac{q_{\phi }(c_{i} = k | {\boldsymbol{z}}_{i}, {\mathcal {I}})}{\max \limits _{\ell } q_{\phi }(c_{i} = \ell | {\boldsymbol{z}}_{i}, {\mathcal {I}})} \ge \epsilon \Big \}, \quad \epsilon \in [0, 1]. \end{aligned}$$
(15)
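Both inference rules can be expressed compactly; a sketch (assuming `q_c` is the [N, K] matrix of posterior probabilities):

```python
import torch

def assign_communities(q_c: torch.Tensor, eps: float = None):
    """Eq. (14) when eps is None (argmax), Eq. (15) otherwise (threshold on
    the probability relative to each row's maximum)."""
    if eps is None:
        return q_c.argmax(dim=-1)                          # one community per node
    ratio = q_c / q_c.max(dim=-1, keepdim=True).values     # values in (0, 1]
    return [torch.nonzero(row >= eps).flatten() for row in ratio]
```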

3.5 Complexity

Computing the dot products for all combinations of node and community embeddings takes \({\mathcal {O}}(NKd)\) time. Solving Eq. (12) further requires the mean of the dot products over the neighborhood of every node, which takes \({\mathcal {O}}(|{\mathcal {E}}|K)\) computations overall as we traverse every edge for every community. Finally, we need a softmax over all communities for every node in Eq. (11) and Eq. (12), which takes \({\mathcal {O}}(NK)\) time. Equation (13) takes \({\mathcal {O}}(|{\mathcal {E}}|)\) time for all edges, as the dot products have already been computed. As a result, the overall complexity is \({\mathcal {O}}(|{\mathcal {E}}|K + NKd)\), which is quite low compared to other algorithms designed to achieve similar goals [4, 39].

4 Experiments

4.1 Synthetic Example

We start with a synthetic dataset consisting of 3 communities with 5 points per community. This dataset is a random partition graph generated with the Python package networkx. The encoder simply consists of two matrices that give \(\boldsymbol{\mu }_{i}({\mathcal {I}})\) and \(\text {diag}({\boldsymbol{\sigma }^2}_{i}({\mathcal {I}}))\). The community assignments discovered by J-ENC are shown in Fig. 1, where the node sizes are inversely proportional to the confidence of J-ENC in the community assignments. We choose 3 communities for demonstration because the probabilistic community assignments can then be interpreted as RGB values for coloring the nodes. It can be seen that J-ENC discovers the correct community structure. However, the two bigger nodes in the center can be assigned to more than one community, as J-ENC is not very confident for these nodes. This is evident from their colors, which are a mix of red, green and blue. We now proceed to the experiments on real-world datasets.
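Such a graph can be reproduced along the following lines; the intra- and inter-community edge probabilities below are illustrative guesses, since the paper does not state them:

```python
import networkx as nx

# Random partition graph: 3 communities of 5 nodes each.
G = nx.random_partition_graph([5, 5, 5], p_in=0.8, p_out=0.1, seed=42)
ground_truth = G.graph["partition"]    # list of node sets, one per community
A = nx.to_numpy_array(G)               # adjacency matrix fed to the encoder
```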

4.2 Datasets

We have selected 18 different datasets ranging from 270 to 126,842 edges. For non-overlapping community detection and node classification, we use the 5 citation datasets [2, 40]. The remaining datasets [20, 37], used for overlapping community detection, are taken from the SNAP repository [19]. Following [26], we take the 5 biggest ground-truth communities for youtube, amazon and dblp. Moreover, we also analyse the case of a large number of communities. For this purpose, we prepare two subsets of the amazon dataset by randomly selecting 500 and 1000 communities from the 2000 smallest communities in the dataset (Table 1).

Fig. 1. Visualization of community assignments discovered by J-ENC in the synthetic dataset of 15 points divided into three communities.

Table 1. Every dataset has \(|{\mathcal {V}}|\) nodes, \(|{\mathcal {E}}|\) edges, K communities and |F| features. \(|F| = \text {N/A}\) means that either the features were missing or not used.

4.3 Baselines

For overlapping community detection, we compare with the following competitive baselines: MNMF [34] learns the community membership distribution by using joint non-negative matrix factorization with modularity-based regularization. BIGCLAM [36] also formulates community detection as a non-negative matrix factorization (NMF) task. It simultaneously optimizes the model likelihood of observed links and learns the latent factors which represent the community affiliations of nodes. CESNA [38] extends BIGCLAM by statistically modelling the interaction between the network structure and the node attributes. Circles [20] introduces a generative model for community detection in ego-networks by learning node similarity metrics for every community. SVI [9] formulates the membership of nodes in multiple communities by a Bayesian model of networks. vGraph [26] simultaneously learns node embeddings and community assignments by modelling the nodes as being generated from a mixture of communities; vGraph+, a variant, further incorporates regularization to weigh local connectivity. ComE [4] jointly learns community and node embeddings by using a Gaussian mixture model formulation. CNRL [30] enhances the random walk sequences (generated by DeepWalk, node2vec etc.) to jointly learn community and node embeddings. CommunityGAN (ComGAN) [13] is a generative adversarial model for learning node embeddings such that the entries of the embedding vector of each node refer to the membership strength of the node to different communities. Lastly, we compare with the communities obtained by applying k-means to the learned embeddings of DGI [32].

For non-overlapping community detection and node classification, in addition to MNMF, DGI, CNRL, CommunityGAN, vGraph and ComE, we compare J-ENC with the following baselines: DeepWalk [23] makes use of SkipGram [22] and truncated random walks on the network to learn node embeddings. LINE [27] learns node embeddings while attempting to preserve the first- and second-order proximities of nodes. Node2Vec [10] learns the embeddings using biased random walks while aiming to preserve network neighborhoods of nodes. Graph Autoencoder (GAE) [17] extends the idea of autoencoders to graph datasets; we also include its variational counterpart, VGAE. GEMSEC [25] is a sequence sampling-based learning model which aims to jointly learn node embeddings and clustering assignments.

4.4 Settings

For overlapping community detection, we learn mean and log-variance matrices of 16-dimensional node embeddings. We set \(\alpha = 0.9\) and \(\epsilon = 0.3\) in all our experiments. Following [17], we first pre-train a variational graph autoencoder. We perform gradient descent with Adam optimizer [15] and learning rate \(= 0.01\). Community assignments are obtained using Eq. (15). For the baselines, we employ the results reported by [26]. For evaluating the performance, we use F1-score and Jaccard similarity.

For non-overlapping community detection, since the default implementations of most of the baselines use 128-dimensional embeddings, we use \(d = 128\) for a fair comparison. Equation (14) is used for community assignments. For vGraph, we use the code provided by the authors. We employ normalized mutual information (NMI) and adjusted Rand index (ARI) as evaluation metrics.
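For reference, both metrics are available in scikit-learn; a minimal usage sketch with dummy labels:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]   # ground-truth communities (dummy values)
y_pred = [0, 0, 1, 2, 2, 2]   # argmax assignments from Eq. (14)
print(normalized_mutual_info_score(y_true, y_pred))
print(adjusted_rand_score(y_true, y_pred))
```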

For node classification, we follow the training split used in various previous works [16, 32, 40], i.e. 20 nodes per class for training. We train logistic regression with the LIBLINEAR [7] solver as our classifier and report the evaluation results on the rest of the nodes. For the algorithms that do not use node features, we train the classifier on the concatenation of the raw node features and the learnt embeddings. For evaluation, we use F1-macro and F1-micro scores.
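A sketch of this evaluation protocol (here `Z`, `X`, `y`, `train_idx` and `test_idx` are assumed to be given; they are not defined in the paper's text):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# For methods that do not use node features, append the raw features
# to the learnt embeddings before training the classifier.
feats = np.hstack([Z, X])
clf = LogisticRegression(solver="liblinear")
clf.fit(feats[train_idx], y[train_idx])
pred = clf.predict(feats[test_idx])
print("F1-micro:", f1_score(y[test_idx], pred, average="micro"))
print("F1-macro:", f1_score(y[test_idx], pred, average="macro"))
```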

All the reported results are the average over five runs. Further implementation details can be found in: https://github.com/RayyanRiaz/gnn_comm_det.

Table 2. F1 scores \((\%)\) for overlapping communities. Best and second best values are bold and underlined respectively.
Table 3. Jaccard scores \((\%)\) for overlapping communities. Best and second best values are bold and underlined respectively.
Table 4. Non-overlapping community detection results. Best and second best values are bold and underlined respectively.
Table 5. Node classification results. Best and second best values are bold and underlined respectively.

4.5 Discussion of Results

Tables 2 and 3 summarize the results of the performance comparison for the overlapping community detection tasks.

First, we note that our proposed method J-ENC outperforms the competitors on all datasets in terms of both Jaccard score and F1-score, with fb0 being the only exception, where J-ENC is the second best. These results demonstrate the capability of J-ENC to learn multiple community assignments well and reinforce our intuition behind the design of Eq. (12).

Second, we observe that there is no consistently well-performing algorithm among the competitive methods. That is, excluding J-ENC, the best performance in terms of F1-score is achieved by vGraph/vGraph+ on 5, ComGAN on 4 and ComE on 3 out of 13 datasets. A similar trend can be seen in Jaccard similarity. It is worth noting that all the methods which achieve the second-best performance solve the tasks of community detection and node representation learning jointly.

Third, we observe that the vGraph+ results are generally better than those of vGraph. This is because vGraph+ incorporates a regularization term in the loss function which is based on the Jaccard coefficients of connected nodes as edge weights. However, it should be noted that this preprocessing step is computationally expensive for densely connected graphs.

Table 4 shows the results on non-overlapping community detection. First, we observe that MNMF, DeepWalk, LINE and Node2Vec provide a good baseline for the task. However, these methods do not achieve performance comparable to the frameworks that treat the two tasks jointly on any dataset. Second, J-ENC consistently outperforms all the competitors in the NMI and ARI metrics, except for CiteSeer where it achieves the second-best ARI. Third, we observe that the GCN-based models, i.e. GAE, VGAE and DGI, show competitive performance: they achieve the second-best performance on all datasets except CiteSeer. In particular, DGI achieves the second-best NMI on 3 out of 5 datasets and the second-best ARI on 2 out of 5. Nonetheless, DGI's results in Table 2 and Table 3 are not very competitive, showing that while DGI can be a good choice for learning node embeddings for attributed graphs with non-overlapping communities, it is not the best option for non-attributed graphs or overlapping communities.

The results for node classification are presented in Table 5. J-ENC achieves the best F1-micro and F1-macro scores on 4 out of 5 datasets. We also observe that the GCN-based models, i.e. GAE, VGAE and DGI, show competitive performance, following the trend in the results of Table 4. Furthermore, we note that the node classification results of CommunityGAN (ComGAN) are quite poor. We think a potential reason is that its node embeddings are constrained to have the same dimensionality as the number of communities, so the components of the learned node embeddings simply represent the membership strengths of nodes for different communities. Linear classifiers may find it difficult to separate such vectors.

4.6 Hyperparameter Sensitivity

We study the dependence of J-ENC on \(\epsilon \) and \(\alpha \) by evaluating on four datasets of different sizes: fb698 (\(N = 61\)), fb1912 (\(N = 747\)), amazon1000 (\(N = 1540\)) and youtube (\(N = 5346\)).

Effect of \(\epsilon \): We sweep over \(\epsilon \in \{0.1, 0.2, \cdots , 0.9\}\). To demonstrate the effect of \(\alpha \), we fix \(\epsilon = 0.3\) and sweep over \(\alpha \in \{0.1, 0.2, \cdots , 0.9\}\). The average results of five runs for \(\epsilon \) and \(\alpha \) are given in Fig. 2a and Fig. 2b respectively. Overall, J-ENC is quite robust to changes in the values of \(\epsilon \) and \(\alpha \). In the case of \(\epsilon \), we see a general decrease in performance when the threshold \(\epsilon \) is set quite high, e.g. \(\epsilon > 0.7\). This is because the datasets contain overlapping communities, and a very high \(\epsilon \) causes the algorithm to output only the most probable community assignment instead of potentially assigning multiple communities per node. However, for a large part of the sweep space, the results are almost consistent.

Effect of \(\alpha \): When \(\epsilon \) is fixed and \(\alpha \) is varied, the results are mostly consistent except when \(\alpha \) is set to a low value. Equation (12) shows that in such a case the node itself is almost neglected and J-ENC tends to assign communities based upon the neighborhood only, which may decrease performance. This effect is most visible in the amazon1000 dataset because it has only 1.54 points per community on average, which implies a decent chance that the neighbours of a node belong to different communities. Thus, sole dependence on the neighbors will most likely yield poor results.

Fig. 2. Effect of hyperparameters on the performance. F1 and Jaccard scores are shown as solid and dashed lines respectively.

Fig. 3. Comparison of running times of different algorithms. J-ENC outperforms its direct competitors. The time on the y-axis is in log scale.

4.7 Training Time

Now we compare the training times of the different algorithms in Fig. 3. As some of the baselines are more resource-intensive than others, we select the AWS instance type g4dn.4xlarge for a fair comparison of training times. We train vGraph for 1000 iterations and J-ENC for 1500 iterations. For all other algorithms, we use the default parameters as in Sect. 4.4. We observe that the methods that simply output node embeddings take relatively less time than the algorithms that jointly learn node representations and community assignments, e.g. J-ENC, vGraph and CNRL. Among the latter, J-ENC is the most time-efficient: it consistently trains in less time than its direct competitors. For instance, it is about 12 times faster than ComE on CiteSeer-full and about 40 times faster than vGraph on the Cora-full dataset. This supports the low computational complexity of J-ENC derived in Sect. 3.5.

4.8 Visualization

Our experiments demonstrate that a single community-aware node embedding is sufficient for both the node representation and community assignment tasks. This is also qualitatively demonstrated by the graph visualizations of node embeddings (obtained via t-SNE [21]) and inferred communities for two datasets, fb107 and fb3437, presented in Fig. 4.
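Such visualizations can be produced in a few lines; a sketch assuming `Z` holds the learned embeddings and `labels` the inferred communities:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

xy = TSNE(n_components=2).fit_transform(Z)   # project embeddings to 2-D
plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10", s=20)
plt.axis("off")
plt.show()
```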

Fig. 4. Graph visualization with community assignments (better viewed in color).

5 Conclusion

We propose a scalable generative method, J-ENC, to simultaneously perform community detection and node representation learning. Our novel approach learns a single community-aware node embedding for both the representation of the node and its context. J-ENC is scalable due to its low complexity, i.e. \({\mathcal {O}}(|{\mathcal {E}}|K + NKd)\). Experiments on several graph datasets show that J-ENC consistently outperforms all the competitive baselines on node classification, overlapping community detection and non-overlapping community detection tasks. Moreover, training J-ENC is considerably more time-efficient than training its competitors.