
1 Introduction

Network embedding (NE) aims to learn latent low-dimensional representations of the nodes in a network while preserving the intrinsic essence of the network [8, 15, 19]. Such representations support more precise and efficient practical applications, such as targeted detection and personalized recommendation [22, 28]. Driven by these demands, NE has attracted growing research interest in recent years.

In a network consisting of nodes and edges, nodes represent objects and edges describe the interactive relationships amongst nodes. For example, in a citation network, nodes represent papers and edges describe the citation relationships amongst papers. In general, the interactive relationships amongst nodes are referred to as the network topology, which plays a vital role in network analysis tasks. Network topology, typically in the form of a node adjacency matrix, is the most common form of network representation. An important goal of NE is to preserve the neighborhood relationships of nodes in the network topology. To this end, various NE methods have been proposed. For example, DeepWalk [15] used sampling-strategy-based random walks to convert a general graph structure into a large collection of linear sequences, and then utilized the skip-gram model [13] to learn low-dimensional node representations from such sequences. This is an effective way to express graph structural information, because the sampled node sequences characterize the connections amongst nodes in a graph. However, the procedure involves a slow sampling process, and the hyper-parameters (such as the walk length and the total number of walks) are not easy to determine, especially for large graphs. Furthermore, because the sampled sequences have finite lengths, it is difficult to capture the correct contextual information for nodes that appear at the boundaries of the sampled sequences, so some relationships amongst nodes cannot be captured accurately and completely. To make up for the shortcomings of random walks, DNGR [4] adopts a random surfing model to capture graph structural information directly, instead of using a sampling-based method to generate linear sequences. The random surfing model first randomly orders the nodes in a graph, and then directly yields a probabilistic co-occurrence (PCO) matrix that captures the transition probabilities amongst different nodes. From the PCO matrix, the positive pointwise mutual information (PPMI) matrix can be computed, which avoids the expensive sampling process. As an explicit representation of a graph, the PPMI matrix effectively maintains the structural characteristics of the graph and contains high-order similarity information of the nodes [4], so the PPMI representations of nodes can more accurately capture potentially complex, non-linear relations amongst different nodes. However, DNGR uses network topology alone and does not take the attribute features affiliated to the nodes into consideration.

The attribute features affiliated to nodes, such as the authors, research themes and keywords associated with papers in a citation network, describe the individual profiles of nodes from a micro perspective. This information often carries knowledge that is orthogonal and complementary to node connectivity and network topology, so incorporating such semantic information is expected to significantly enhance NE based on network topology alone. A network whose nodes are associated with attribute features is referred to as an attributed network [2]. Attributed network embedding (ANE) aims to learn latent low-dimensional representations of nodes while preserving both the neighborhood relationships of nodes in the network topology and the semantics of the attribute features. This is not a trivial task, because network topology and attribute features are two heterogeneous types of information, although they describe the same network from two different perspectives [5, 9]. How to integrate these two heterogeneous types of information and simultaneously preserve the intrinsic essence contained in network topology and attribute features is a key issue in ANE. Some existing approaches, such as TADW [24], ASNE [12] and CANE [20], first convert the network topology into feature representations, which are then embedded into a low-dimensional space; meanwhile, the attribute features are used to derive a low-dimensional embedding of the node semantics, and the two low-dimensional representations are concatenated to jointly learn the final embedding. However, converting a network topology into a feature representation may lose, or may not faithfully represent, the non-linear relationships amongst the nodes [11], and an individual feature vector only contains individual information without inter-individual association relationships, so simply combining the topological feature vector and the attribute feature vector may be unsatisfactory for exploring and exploiting the complementary relationship between these two types of information. Hence, UWMNE [11] maintains the network topology in graph form, builds an attribute graph to represent the semantic information, and then uses deep neural networks to integrate the topological and semantic information in these graphs to learn a unified embedding representation.

Inspired by DNGR [4] and UWMNE [11], in this paper we propose a deep model based on the PPMI for ANE, referred to as DANEP. Specifically, we first transform the attribute features into an attribute graph, which is homogeneous with the topology graph, so the two can be handled in the same way. Next, we carry out random surfing on the attribute/topology graph to generate an attribute/topology probabilistic co-occurrence (PCO) matrix, and then calculate the attribute/topology PPMI matrix from the corresponding PCO matrix. After that, a shared Auto-Encoder is used to learn low-dimensional node representations. The advantages of DANEP lie in the following: the attribute graph more clearly describes the geometry of the potential non-linear manifold underlying the attribute features, and the unified graph representation of attribute features and network topology helps to integrate the complementary relationship between the two types of information; the random surfing captures the graph structural information of the attribute/topology graph, and the PPMI matrices calculated from the attribute/topology PCO matrices effectively maintain both the structural characteristics and the high-order proximity information of the attribute/topology graph; and the shared Auto-Encoder learns high-level abstractions from low-level features and captures the highly non-linear information conveyed by the graphs via non-linear projections. Besides, a local pairwise constraint is further designed in the shared Auto-Encoder to improve the quality of the node representations. We also conduct extensive experiments on four real-world networks and compare our approach with 10 baselines. The experimental results demonstrate the superiority of our approach.

It should be noted that our DANEP model differs from both DNGR [4] and UWMNE [11]. DANEP applies deep learning to the PPMI matrices of both the attribute features and the network topology, whereas DNGR applies it only to the PPMI matrix of the network topology, and UWMNE directly uses the network topology and the attribute graph as the input of an Auto-Encoder.

The rest of the paper is arranged as follows. Section 2 offers a brief overview of related work. The details of DANEP are presented in Sect. 3. Section 4 provides extensive experiments and results, and in Sect. 5, conclusions are given.

2 Related Work

2.1 Network Embedding

Many network embedding approaches utilize only the network topology to learn latent low-dimensional representations. DeepWalk [15] first employed truncated random walks to capture local information and then learned the latent embedding from that local information. Node2vec [8] proposed a biased random walk method to explore diverse neighborhoods. LINE [19] preserved the first-order and second-order proximities of the network topology in the learned embedding representation. SDNE [21] proposed a semi-supervised model to jointly preserve the first-order and second-order similarities of the network topology. Struc2vec [17] utilized a weighted random walk to obtain similar node sequences and conceived a hierarchical strategy to capture node proximity at different scales. GraRep [3] integrated global structural information learned from different models into the embedding representation. DNGR [4] first adopted a random surfing model to capture graph structural information, and then used a stacked denoising Auto-Encoder to learn low-dimensional vertex representations.

2.2 Attributed Network Embedding

In recent years, many researchers have learned node representations by integrating network topology and node attribute features, which brings new opportunities for embedding learning. In detail, AANE [10] incorporated the proximity of the attribute features into embedding learning and adopted a distributed manner to accelerate the learning process. TADW [24] proposed the text-associated DeepWalk model to integrate nodes' text features into embedding learning through matrix factorization. ASNE [12] adopted a deep neural network to model the complex interrelations between attribute features and network topology. DANE [6] employed two symmetrical Auto-Encoders, which are allowed to interact with each other, to capture the consistent and complementary information between attribute features and network topology. ANRL [27] utilized a neighbor enhancement Auto-Encoder with an attribute-aware skip-gram to extract the correlations between attribute features and network topology. NANE [14] considered the local and global information in the embedding process through a pairwise constraint. Based on the observation that nodes with similar topology may be dissimilar in their attribute features and vice versa, which is referred to as partial correlation, PRRE [29] took the partial correlation of nodes into account in the learning process.

3 The Proposed Model

In this section, we first present the definition of ANE and then develop a deep attributed network embedding model based on the positive pointwise mutual information.

3.1 Problem Definition

Given an attributed network \(G=(V,E,\mathbf{A} )\) with n nodes, wherein \(V=\{{{v}_{1}},\cdots ,{{v}_{n}}\}\) and \( E=\{{e}_{ij}\}_{i,j=1}^{n} \) denote the sets of nodes and edges, respectively, and \(\mathbf{A} \in {{R}^{n\times m}}\) denotes the attribute matrix affiliated to the nodes, whose row vector \({\mathbf{a }_{i}}\in {{R}^{m}}\) corresponds to the m-dimensional attribute features of node \({{v}_{i}}\). Let \(\mathbf{S} \in {{R}^{n\times n}}\) be the adjacency matrix affiliated to the edges, whose element \({{s}_{ij}}\) encodes the relationship between nodes \({{v}_{i}}\) and \({{v}_{j}}\), i.e., \({{s}_{ij}}=1\) indicates that there exists an edge linking \({{v}_{i}}\) to \({{v}_{j}}\), and \({{s}_{ij}}=0\) indicates that the edge is nonexistent. The goal of ANE is to find a mapping function \(f(\mathbf{A} ,\mathbf{S} )\rightarrow \mathbf{H} \) that maps the attribute features \(\mathbf{A} \) and the network topology \(\mathbf{S} \) into a unified low-dimensional representation \(\mathbf{H} \in {{R}^{n\times d}}(d\ll n,d\ll m)\) while preserving the proximities that exist in both the node attributes and the network topology. More precisely, nodes with similar attributes and topology in the original network should be closer in the embedding space.

3.2 The Architecture of Proposed Model

The architecture of DANEP is shown in Fig. 1. DANEP first constructs an attribute graph based on the attribute features \(\mathbf{A} \), such that the attribute graph and the topology graph are homogeneous. Based on these homogeneous graph representations, random surfing is conducted to obtain the attribute/topology probabilistic co-occurrence (PCO) matrix, and then the PPMI matrices of the attribute graph and the topology graph are calculated, denoted \(\mathbf{PPMI}\_AF \) and \(\mathbf{PPMI}\_NT \), respectively. The i-th row vectors of \(\mathbf{PPMI}\_AF \) and \(\mathbf{PPMI}\_NT \) depict the profile and the neighborhood relationships of node \({{v}_{i}}\) with respect to the attribute features and the network topology, respectively. After that, a shared Auto-Encoder equipped with a local graph-regularization enhancement is applied to learn a unified low-dimensional representation for each node from the two PPMI matrices.

Fig. 1. The architecture of DANEP.

The Construction of the Attribute Graph. In this subsection, we construct an attribute graph based on the attribute features \(\mathbf{A} \). Let \(\mathbf{B} \in {{R}^{n\times n}}\) be the attribute similarity matrix, whose element \({{b}_{ij}}\in \mathbf{B} \) is measured by the similarity of the attribute vectors \({\mathbf{a }_{i}}\in \mathbf{A} \) and \({\mathbf{a }_{j}}\in \mathbf{A} \); for example, the cosine similarity can be calculated by Eq. (1), where "\(\cdot \)" signifies the dot product of two vectors, "\(||\cdot ||\)" denotes the L2 norm, and "\(\times \)" indicates the product of two scalars.

$$\begin{aligned} {{b}_{ij}}=\frac{{\mathbf{a }_{i}}\cdot {\mathbf{a }_{j}}}{\left\| {\mathbf{a }_{i}} \right\| \times \left\| {\mathbf{a }_{j}} \right\| } \end{aligned}$$
(1)

Intuitively, the closer the distance between two nodes, the more intimate the relationship they should have. Therefore, we apply the k-nearest neighbor method [11, 18] to \(\mathbf{B} \) to construct an attribute graph with n nodes, where each node \({{v}_{i}}\) is connected to the k nodes with the top-k similarities in \({\mathbf{b }_{i}}\). Let \({\mathbf{B }^{new}}\in {{R}^{n\times n}}\) be the adjacency matrix of the constructed attribute graph; then the element \(b_{ij}^{new}=1\) indicates that there exists an edge linking \({{v}_{i}}\) to \({{v}_{j}}\), and \(b_{ij}^{new}=0\) indicates that the edge is nonexistent.
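The following Python sketch illustrates this construction under simple assumptions: a dense attribute matrix, cosine similarity as in Eq. (1), and a plain top-k selection per row. Function and variable names (e.g., build_attribute_graph) are illustrative and are not taken from the paper's implementation.

```python
# A minimal sketch of the attribute-graph construction (Eq. (1) plus k-NN),
# assuming a dense attribute matrix A of shape (n, m).
import numpy as np

def build_attribute_graph(A, k=10):
    """Return the adjacency matrix B_new of the k-nearest-neighbor attribute graph."""
    norms = np.linalg.norm(A, axis=1, keepdims=True) + 1e-12    # avoid division by zero
    A_unit = A / norms
    B = A_unit @ A_unit.T                                       # cosine similarity, Eq. (1)
    np.fill_diagonal(B, -np.inf)                                # exclude self-loops from the top-k
    B_new = np.zeros_like(B)
    topk = np.argsort(-B, axis=1)[:, :k]                        # indices of the k most similar nodes per row
    rows = np.repeat(np.arange(A.shape[0]), k)
    B_new[rows, topk.ravel()] = 1.0                             # connect each node to its top-k neighbors
    return B_new
```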

The Calculation of PPMIs. Motivated by DNGR [4], we adopt the random surfing model on the topology graph \(\mathbf {S}\) and the attribute graph \({{\mathbf {B}}^{new}}\) to obtain the topology and attribute probabilistic co-occurrence (PCO) matrices through k-step iterations. The iteration can be represented by Eq. (2), where \(\mathbf {T}\) is the row-normalized transition matrix of the corresponding graph, \({{\mathbf {p}}_{0}}\) is the initial one-hot vector whose i-th value is 1 and whose other values are 0, and the coefficients \(\alpha \) and \(1-\alpha \) represent the probabilities that the surfer jumps to the next node and returns to the original vertex (restart), respectively.

$$\begin{aligned} {{\mathbf {p}}_{k}}=\alpha \cdot {{\mathbf {p}}_{k-1}}\mathbf {T}+(1-\alpha ){{\mathbf {p}}_{0}} \end{aligned}$$
(2)

Based on the attribute/topology PCO matrix, the pointwise mutual information (PMI) can be calculated by Eq. (3), where \(p({{v}_{i}},{{v}_{j}})\) represents the number of co-occurrences of nodes \({{v}_{i}}\) and \({{v}_{j}}\) in the same context, \(\left| D \right| =\sum \nolimits _{{{v}_{i}}}{\sum \nolimits _{{{v}_{j}}}{p({{v}_{i}},{{v}_{j}})}}\), and \(p({{v}_{i}})\) and \(p({{v}_{j}})\) represent the numbers of occurrences of nodes \({{v}_{i}}\) and \({{v}_{j}}\), respectively.

$$\begin{aligned} \mathbf {PMI}_{{{v}_{i},{v}_{j}}}=\log (\frac{p({{v}_{i}},{{v}_{j}})\cdot \left| D \right| }{p({{v}_{i}})\cdot p({{v}_{j}})}) \end{aligned}$$
(3)

Then, the PPMI can be calculated by Eq. (4) [23], which means that negative values in the attribute/topology PMI are set to zero.

$$\begin{aligned} \mathbf {PPM}{{\mathbf {I}}_{{{v}_{i}},{{v}_{j}}}}=\max (\mathbf {PMI}_{{{v}_{i},{v}_{j}}},0) \end{aligned}$$
(4)
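A minimal Python sketch of the random surfing and PPMI computation is given below. It assumes a dense adjacency matrix (either \(\mathbf {S}\) or \({{\mathbf {B}}^{new}}\)) and accumulates the step-wise probabilities into the PCO matrix; the step count and restart probability are illustrative choices, not the paper's tuned values.

```python
# A minimal sketch of random surfing (Eq. (2)) and PPMI computation (Eqs. (3)-(4)).
import numpy as np

def random_surfing(adj, steps=10, alpha=0.98):
    """Accumulate transition probabilities into a probabilistic co-occurrence (PCO) matrix."""
    n = adj.shape[0]
    T = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)  # row-normalized transition matrix
    p0 = np.eye(n)                 # one one-hot restart vector per node (stacked as rows)
    p = p0.copy()
    pco = np.zeros((n, n))
    for _ in range(steps):
        p = alpha * (p @ T) + (1 - alpha) * p0   # Eq. (2): jump with prob. alpha, restart otherwise
        pco += p
    return pco

def ppmi(pco):
    """Positive pointwise mutual information of a co-occurrence matrix, Eqs. (3)-(4)."""
    total = pco.sum()
    row = pco.sum(axis=1, keepdims=True)
    col = pco.sum(axis=0, keepdims=True)
    pmi = np.log(np.maximum(pco * total, 1e-12) / np.maximum(row * col, 1e-12))
    return np.maximum(pmi, 0.0)    # clip negative PMI values to zero
```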

The Design of the Shared Auto-Encoder. In general, an Auto-Encoder consists of an encoder and a decoder, which can extract the inherent essence and non-linear information of a network. In DANEP, we design a shared Auto-Encoder with \(2K-1\) layers to incorporate the attribute features and the network topology. The input of the Auto-Encoder is the concatenation of the row vectors of \(\mathbf {PPMI\_AF}\) and \(\mathbf {PPMI\_NT}\), i.e., \({\mathbf {c}_{i}}=({\mathbf {f}_{i}},{\mathbf {t}_{i}})=({{f}_{i1}},\cdots ,{{f}_{in}},{{t}_{i1}},\cdots ,{{t}_{in}})\), where \(\mathbf {C}=[\mathbf {F},\mathbf {T}]\in {{R}^{n\times 2n}}\), and \({\mathbf {c}_{i}}\), \({\mathbf {f}_{i}}\) and \({\mathbf {t}_{i}}\) are the i-th row vectors of \(\mathbf {C}\), \(\mathbf {F}\) and \(\mathbf {T}\), respectively. Let \({\mathbf {y}_{i,k}}(k=1,\cdots ,K)\) and \({\mathbf {\hat{y}}_{i,k}}(k=1,\cdots ,K)\) denote the layer-wise representations of the encoder and the decoder, respectively, where \({\mathbf {y}_{i}}={\mathbf {y}_{i,K}}\) is the desired embedding representation and \({\mathbf {\hat{y}}_{i,K}}\) is the reconstructed representation; they are computed by Eqs. (5)–(9).

$$\begin{aligned} {\mathbf {y}_{i,1}}=f({\mathbf {W}_{1}}{\mathbf {c}_{i}}+{\mathbf {b}_{1}}) \end{aligned}$$
(5)
$$\begin{aligned} {\mathbf {y}_{i,k}}\,{=}\,f({\mathbf {W}_{k}}{\mathbf {y}_{i,k-1}}+{\mathbf {b}_{k}})(k=2,\cdots ,K-1) \end{aligned}$$
(6)
$$\begin{aligned} {\mathbf {y}_{i}}={\mathbf {\hat{y}}_{i,1}}={\mathbf {y}_{i,K}}=f({\mathbf {W}_{K}}{\mathbf {y}_{i,K-1}}+{\mathbf {b}_{K}}) \end{aligned}$$
(7)
$$\begin{aligned} {\mathbf {\hat{y}}_{i,k}}\,{=}\,f(\mathbf {W}_{K+k{-}1}^{{}}\mathbf {\hat{y}}_{i,k{-}1}^{{}}+\mathbf {b}_{K+k{-}1}^{{}})(k=2,\cdots ,K-1) \end{aligned}$$
(8)
$$\begin{aligned} {\mathbf {\hat{y}}_{i,K}}=f(\mathbf {W}_{2K-1}^{{}}\mathbf {\hat{y}}_{i,K-1}^{{}}+\mathbf {b}_{2K-1}^{{}}) \end{aligned}$$
(9)

where \(f(\cdot )\) represents the non-linear activation function, and \(\theta =\{\mathbf {W}_{k},\mathbf {b}_{k}\}(k=1,\cdots ,2K-1)\) are the weight and bias parameters of the shared Auto-Encoder.

Let \(\mathbf {\hat{C}}\) be the output of the decoder, whose i-th row is \({\mathbf {\hat{c}}_{i}}={\mathbf {\hat{y}}_{i,K}}\). The goal of the Auto-Encoder is to minimize the reconstruction error between \(\mathbf {C}\) and \(\mathbf {\hat{C}}\), so the reconstruction loss is defined as:

$$\begin{aligned} {\mathcal {L}_{rec}}=\sum \limits _{i=1}^{n}{||{{\mathbf {\hat{c}}}_{i}}-{\mathbf {c}_{i}}||_{2}^{2}} \end{aligned}$$
(10)
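The sketch below shows one possible PyTorch realization of the shared Auto-Encoder and the reconstruction loss of Eq. (10). The layer sizes, activation function, and use of a two-layer encoder/decoder are illustrative assumptions rather than the exact architecture used in the paper.

```python
# A hedged PyTorch sketch of the shared Auto-Encoder (Eqs. (5)-(10)); the input c_i is the
# concatenation of the i-th rows of PPMI_AF and PPMI_NT (dimension 2n).
import torch
import torch.nn as nn

class SharedAutoEncoder(nn.Module):
    def __init__(self, in_dim, hidden_dim=512, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(            # Eqs. (5)-(7): c_i -> y_i
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(            # Eqs. (8)-(9): y_i -> c_hat_i
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
        )

    def forward(self, c):
        y = self.encoder(c)                      # unified node embeddings Y
        c_hat = self.decoder(y)                  # reconstruction C_hat
        return y, c_hat

def reconstruction_loss(c, c_hat):
    """Squared-error reconstruction loss over all nodes, Eq. (10)."""
    return ((c_hat - c) ** 2).sum(dim=1).sum()
```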

To further improve the quality of the node representations learned by the shared Auto-Encoder, we design a local pairwise constraint, which reinforces the consistent and complementary information contained in the attribute features and the network topology. Given the adjacency matrices \(\mathbf {S}\) and \({\mathbf {B}^{new}}\) of the topology graph and the attribute graph, respectively, the local pairwise constraint is defined as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}_{local}}=\frac{1}{2}\sum \nolimits _{i=1}^{n}{\sum \nolimits _{j=1}^{n}{{{s}_{ij}}||{{\mathbf {y}}_{i}}-{{\mathbf {y}}_{j}}||_{2}^{2}}}+\frac{1}{2}\sum \nolimits _{i=1}^{n}{\sum \nolimits _{j=1}^{n}{b_{ij}^{new}||{{\mathbf {y}}_{i}}-{{\mathbf {y}}_{j}}||_{2}^{2}}} \\ =tr({{({{\mathbf {Y}}^{C}})}^{T}}{\mathbf {L}_{1}}{{\mathbf {Y}}^{C}})+tr({{({{\mathbf {Y}}^{C}})}^{T}}{\mathbf {L}_{2}}{{\mathbf {Y}}^{C}}) \end{aligned} \end{aligned}$$
(11)

where \({\mathbf {L}_{1}}={\mathbf {D}'}-\mathbf {S}\), \({\mathbf {L}_{2}}={\mathbf {D}''}-{\mathbf {B}^{new}}\), both \({\mathbf {D}'}\in {{R}^{n\times n}}\) and \({\mathbf {D}''}\in {{R}^{n\times n}}\) are diagonal matrices with \({{{D}'}_{ii}}=\sum \nolimits _{j=1}^{n}{{{s}_{ij}}}\) and \({{{D}''}_{ii}}=\sum \nolimits _{j=1}^{n}{b_{ij}^{new}}\), and \({{\mathbf {Y}}^{C}}\) is the embedding matrix whose i-th row is \({{\mathbf {y}}_{i}}\).

Thus, the objective function of the DANEP is defined as:

$$\begin{aligned} \mathcal {L}=\alpha {\mathcal {L}_{local}}+\beta {\mathcal {L}_{rec}} \end{aligned}$$
(12)

where \(\alpha \) and \(\beta \) are hyper-parameters that balance the weights of the two losses.

4 Experiments and Results

In this section, we conduct extensive experiments on four real-world networks using three widely used tasks, i.e., node classification, node clustering, and network visualization, to evaluate the effectiveness of our proposed method DANEP.

4.1 Datasets

In the experiments, four publicly available networks with class labels are used, i.e., the Cora, Citeseer, BlogCatalog and Flickr networks, where the first two are academic paper citation networks and the last two are social networks. In the Cora/Citeseer networks, nodes and edges represent academic papers and the citation relationships amongst those papers, respectively; each paper is represented as a bag-of-words vector with 1433/3703 dimensions, and the papers are divided into 7/6 categories, such as Genetic Algorithms, Neural Networks and Reinforcement Learning. In the BlogCatalog/Flickr networks, nodes and edges represent users and the relationships amongst those users, respectively; each user is represented as a bag-of-words vector with 8189/12047 dimensions, and the users are divided into 6/9 categories based on social preferences. The statistics of each network are summarized in Table 1.

Table 1. The statistics of networks.

4.2 Baselines

To verify the effectiveness of the DANEP model, we select 10 approaches as baselines, including 4 "Topology-only" algorithms, i.e., DeepWalk [15], Node2Vec [8], GraRep [3] and DNGR [4], and 6 "Topology+Attribute" algorithms, i.e., AANE [10], TADW [24], ASNE [12], DANE [6], ANRL [27] and NANE [14]. The details of these baselines are described as follows:

“Topology-Only” Algorithms: DeepWalk [15]: It employed the truncated random walks to capture the local topology information, and then learned the latent embedding representation by making full use of the captured local information.

Node2Vec [8]: It proposed biased random walks to project nodes into a low-dimensional space while preserving the network essence by exploring and preserving the network neighborhoods of nodes.

GraRep [3]: It developed a model to learn the node representation for the weighted graph by integrating global structural similarity in the learning process.

DNGR [4]: It adopted a random surfing model to capture topology information, and then utilized the stacked denoising Auto-Encoder to extract meaningful information into the low-dimensional vector representation.

Table 2. The performance evaluation of node classification.
Table 3. The performance evaluation of node clustering.

“Topology +Attribute” Algorithms: AANE [10]: AANE considered and integrated the proximity of attribute features into the embedding learning and adopted a distributed manner to accelerate the learning process.

TADW [24]: It employed a matrix factorization method based on DeepWalk to learn low-dimensional representations of text and network topology, and then concatenated them to form the final representation.

DANE [6]: DANE allowed neighborhood topology obtained by random walks and attribute features to interact with each other to preserve the consistent and complementary information during the learning process.

ANRL [27]: It designed a neighbor enhancement Auto-Encoder model with an attribute-aware skip-gram to integrate the attribute features and network topology proximities in the learning process simultaneously.

ASNE [12]: ASNE integrated the adjacency matrix of the network topology and the attribute matrix at the input layer, and allowed them to interact with each other to capture the complex relationships and more useful information.

NANE [14]: It cascaded the adjacency matrix of the network topology and the cosine similarity of the attribute features into a unified representation to capture the local information and non-linear correlations in the network.

4.3 Parameter Settings

For a fair comparison, we set the embedding dimension d to 128 for all methods on all datasets. For DeepWalk and Node2Vec, we set the window size to 10, the walk length to 80, and the number of walks per node to 10. For GraRep, the maximum transition step is set to 5. For TADW, we set the regularization parameter to 0.2. The default values of the other parameters of these methods are the same as in the open-source code released by the original authors.

4.4 Node Classification

In this subsection, we randomly select 10%, 30% and 50% of the nodes as the training set and the remaining nodes as the testing set, apply a linear SVM as the classifier, and use 5-fold cross-validation to train the classifier. This process is repeated 10 times, and the average performance in terms of both Macro-F1 and Micro-F1 [25] is reported as the classification result (a protocol sketch is given after the observations below). The detailed results are shown in Table 2, where the bold numbers indicate the best results. From Table 2, we have the following observations and analyses:

(1) DANEP obtains the best performance in terms of Micro-F1 and Macro-F1 on the Cora, Flickr and BlogCatalog datasets at training rates of 10%, 30% and 50%. The improvements in Micro-F1 and Macro-F1 are significant on different datasets; for example, on the Flickr dataset DANEP outperforms the best baseline DANE by 20.04%, 19.92%, 14.25%, 14.21%, 12.95% and 12.9% at training rates of 10%, 30% and 50%, respectively. These results demonstrate the superiority of DANEP with its random surfing and PPMI schemes.

(2) ANRL and TADW achieve the highest values on the Citeseer dataset when the training rates are 10% and 50%, respectively, which indicates that the neighbor enhancement mechanism and text-associated matrix factorization have some ability to capture the essence of the network, but their overall performance is still clearly inferior to that of DANEP.
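For concreteness, the following sketch outlines the evaluation protocol described above with scikit-learn; the embedding matrix H, the label array and the default LinearSVC settings are assumptions for illustration, not the exact pipeline used in the experiments.

```python
# A rough sketch of the node-classification protocol: random train/test splits,
# a linear SVM, and averaged Micro-/Macro-F1 over repeated runs.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def evaluate_classification(H, labels, train_rate=0.1, repeats=10, seed=0):
    micro, macro = [], []
    for r in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            H, labels, train_size=train_rate, random_state=seed + r, stratify=labels)
        clf = LinearSVC().fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        micro.append(f1_score(y_te, pred, average="micro"))
        macro.append(f1_score(y_te, pred, average="macro"))
    return np.mean(micro), np.mean(macro)
```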

Fig. 2. The visualization results of different methods on the BlogCatalog dataset.

Fig. 3. The sensitivity of DANEP w.r.t. different \(\alpha \) and \(\beta \) for node classification.

4.5 Node Clustering

Node clustering is an unsupervised downstream task of network analysis based on the learned node representations. In this study, we use k-means [1] as the clustering algorithm, and accuracy (ACC) [6] and normalized mutual information (NMI) [14] as the metrics to evaluate clustering performance. Similarly, this process is repeated 10 times, and the average performance in terms of both ACC and NMI is reported as the clustering result (an evaluation sketch is given after the observations below). The final results for each method are shown in Table 3. From Table 3, we have the following observations and analyses:

(1) DANEP achieves the best clustering performance on the Citeseer, BlogCatalog and Flickr datasets against all baselines. The performance gains are significant on different datasets; for example, DANEP outperforms the best baseline TADW by 32.46% and 41.69% on the Flickr dataset. Besides, DANEP ranks second on the Cora dataset, being only slightly inferior to DANE in ACC (−0.008) and to TADW in NMI (−0.005). These results indicate that DANEP, based on the graph representation and PPMI, achieves better clustering performance than all baselines.

(2) From the perspective of average performance, TADW obtains better clustering results than the other baselines, but it is still clearly inferior to DANEP. In detail, DANEP improves upon TADW by 10.87% in ACC and 13.11% in NMI on average, which demonstrates that the attribute graph and the PPMI matrices provide powerful assistance for node clustering.
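A rough sketch of this clustering evaluation is given below. It assumes integer-coded labels in the range 0..n_clusters−1 and uses the Hungarian algorithm to align predicted clusters with true labels before computing ACC, which is one common convention rather than necessarily the paper's exact procedure.

```python
# A rough sketch of the clustering evaluation: k-means on the embeddings H,
# NMI from scikit-learn, and ACC after cluster-label alignment.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from scipy.optimize import linear_sum_assignment

def evaluate_clustering(H, labels, n_clusters, seed=0):
    pred = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(H)
    nmi = normalized_mutual_info_score(labels, pred)
    # Align predicted cluster ids with true labels (assumed to be 0..n_clusters-1) before ACC.
    cost = np.zeros((n_clusters, n_clusters))
    for p, t in zip(pred, labels):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)   # maximize the number of matched nodes
    acc = cost[row, col].sum() / len(labels)
    return acc, nmi
```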

4.6 Network Visualization

To verify whether the learned node representations have discriminative essence features, we use t-SNE [16] to project the learned embedding representation of each node into a 2D space. The color of a point indicates its class label. A desirable embedding layout is one in which nodes with the same color (label) are close to each other while nodes with different colors (labels) are distant from each other, with obvious boundaries. Due to space limitations, we only show the visualization results on the BlogCatalog dataset in Fig. 2; the visualization results on the other datasets are similar.

From Fig. 2, we can see that DANEP, i.e., sub-figure (k), achieves the best result: the nodes of the same color are close to each other, and the boundaries amongst different colors are discernible. Besides, DANE, sub-figure (e), achieves the suboptimal result, whose separation of boundaries is inferior to that of DANEP. Nevertheless, the visualization results of DeepWalk, Node2Vec, GraRep, DNGR, ANRL, AANE, TADW, NANE and ASNE, i.e., sub-figures (a), (b), (c), (d), (f), (g), (h), (i) and (j), are mixed with nodes of different colors.
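A minimal sketch of this visualization step is shown below; the perplexity and plotting details are illustrative choices rather than the settings used to produce Fig. 2.

```python
# A minimal sketch of the t-SNE visualization: project the embeddings H to 2D
# and color the points by class label.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize(H, labels, perplexity=30, seed=0):
    Z = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(H)
    plt.scatter(Z[:, 0], Z[:, 1], c=labels, s=5, cmap="tab10")
    plt.axis("off")
    plt.show()
```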

4.7 Sensitivity Analysis of Parameters

The hyper-parameters \(\alpha \) and \(\beta \) balance the weights between the pairwise constraint loss and the reconstruction loss of DANEP. In this subsection, we analyze the sensitivity of these hyper-parameters via the node classification and node clustering tasks. The experimental results in terms of Micro-F1 for node classification and ACC for node clustering are presented in Fig. 3 and Fig. 4, respectively. The trends of Macro-F1 and NMI with respect to \(\alpha \) and \(\beta \) are similar to those of Micro-F1 and ACC, so we do not present them due to space limitations.

Fig. 4. The sensitivity of DANEP w.r.t. different \(\alpha \) and \(\beta \) for node clustering.

From Fig. 3, we can observe that the Micro-F1 values of node classification remain stable under different hyper-parameters and different datasets, which indicates that DANEP has stable performance for node classification. In Fig. 4, the ACC of node clustering fluctuates more with the hyper-parameter \(\beta \) than with the hyper-parameter \(\alpha \), which indicates that the reconstruction loss plays a vital role in the node clustering process.

5 Conclusion

In this study, we developed the DANEP model to integrate attribute features and network topology into a unified graph format and encode each node into a low-dimensional embedding representation. In our model, the k-nearest neighbor graph can reveal the potential non-linear manifold underlying the attribute features, the random surfing model and PPMI can capture the structural characteristics and high-order proximity information of the attribute/topology graph, and the pairwise constraint can improve the quality of the node representations. Experimental results on four real-world datasets for node classification, node clustering and visualization tasks indicate that DANEP outperforms 10 representative baselines, including "Topology-only" and "Topology+Attribute" algorithms.

DANEP is designed to handle homogeneous networks with single-typed nodes and edges. However, real-world networks usually contain multiple types of nodes and edges, which carry richer semantic information and more complex network topology for network representation learning [7, 26]. Therefore, extending DANEP to heterogeneous networks and improving the stability of clustering are our future work.