
1 Introduction

Social networks are ubiquitous in our daily lives, ranging from online social-networking sites such as Facebook and Weibo to biological gene-disease networks [6, 18] and citation networks [14, 16]. Network embedding, which maps each node into a latent space, has become an effective tool for social network data mining. As a result, various data mining tasks, e.g., group clustering [27], link prediction [21], anomaly detection [3], and node classification [2, 21], can be conducted directly in the latent space.

Early works (e.g., DeepWalk [24], LINE [28], node2vec [9]) explore the effect of local and global structural similarity [8, 12, 17, 31] on network embedding. Local structural similarity indicates that a pair of nodes connected by an edge is more likely to be similar. Global structural similarity, on the other hand, reveals the status of a node in the network; for instance, a pair of nodes with similar structural neighborhoods tends to be closer. However, these methods fall short because they ignore the rich attribute information attached to nodes. Recently, several algorithms have turned to attributed network embedding, leveraging both structural and attribute information. LANE [10] only takes local attribute similarity into consideration, i.e., nodes with similar attributes are more likely to belong to the same community. UPP-SNE [35] is more inclined to capture global structural and attribute similarity under the framework of DeepWalk. How to jointly extract the local and global information in both structure and attribute remains an open problem.

In this paper, we propose a Neural-based Attributed Network Embedding framework named NANE to address the aforementioned problem. Our approach comprehensively considers various sources of information, including the local and global information in both structure and attribute. We employ an autoencoder model to encode the global attribute and structural information into a latent space, with a pairwise constraint to preserve the local information. Specifically, we regard the attribute similarity of two nodes as an uncertain link, where the similarity indicates the probability that the two nodes are connected. On top of that, an affinity matrix is introduced to represent the global attribute proximity.

In summary, the main contributions of this paper are as follows:

  • We propose a generic neural-based framework, NANE, to represent attributed social networks, capturing both structural and attribute non-linear similarity. To the best of our knowledge, our model is the first attempt to jointly capture the local and global similarity in both structure and attribute. Specifically, we introduce an affinity matrix to measure the global attribute similarity among nodes.

  • We empirically conduct experiments on three real-world datasets with multi-class node classification and node clustering tasks. Comparing with cutting-edge network embedding algorithms, we demonstrate the effectiveness of NANE.

The rest of this paper is organized as follows. We discuss related work in Sect. 2 and give the definitions used in our work in Sect. 3. We then propose the NANE algorithm in Sect. 4, followed by experiments in Sect. 5. Finally, Sect. 6 concludes the paper and outlines future work.

2 Related Work

In this section, we briefly summarize the development of network embedding methods, which we divide into two categories. The first, structure-based network embedding, focuses on structural information only. The second, content-aware network embedding, combines additional information with the network structure to obtain better graph representations.

2.1 Structure-Based Network Embedding

Some earlier works utilize manifold learning to capture structural proximity [1, 25, 29]. More recently, inspired by Skip-Gram [19, 20], DeepWalk [24] employs random walks to generate sequences of nodes and feeds them into Skip-Gram to capture the similarity of nodes. However, DeepWalk is prone to preserving global structural proximity owing to the limitations of random walks. Building on it, node2vec [9] introduces biased random walks to balance global and local structural information. LINE [28] designs an explicit objective function to capture both global and local structural information, but it lacks a deeper fusion of the two.

On the other hand, some researchers introduce neural networks to capture the highly non-linear properties of nodes. DNGR [5] adopts a random surfing model to capture the network structure and compresses it into a low-dimensional representation with an SDAE model. Similarly, SDNE [33] exploits a deep neural network with a supervised component to extract structural information.

All the methods mentioned above rely on network structure only, without considering attribute information, which is highly beneficial to graph representation.

2.2 Content-Aware Network Embedding

Some recent efforts have been devoted to leveraging additional information for a better performance.

TADW [34], a text-associated DeepWalk model, creatively explores the contribution of node contents to network embedding. It employs matrix factorization to encode text features into network representation learning. TriDNR [22] exploits DeepWalk and Doc2Vec [13] to extract structural information and content information respectively. Because it relies on a late fusion that is only a series of weighted sums, it ignores the correlation between structure and contents. Moreover, these two models, which consider node contents only, cannot handle noisy attribute information.

LANE [10] is the first attempt to utilize label information together with node attributes. It uses three matrices to describe a network: the network adjacency matrix, the node content-level similarity matrix and the node label-level similarity matrix, and then maps them into the same vector space by dimensionality reduction. LANE learns node embeddings with matrix factorization, which cannot reflect the non-linear correlation between nodes. UPP-SNE [35] is the first attempt to take user profiles into consideration. The basic idea is to extract the similarity of nodes by random walks and to embed user profile information via a non-linear mapping into a low-dimensional latent space. However, UPP-SNE is more inclined to extract the global proximity and ignores the local information.

3 Problem Definition

We first summarize the notations used in this paper. We denote scalars by lowercase letters (e.g., n), vectors by boldface lowercase letters (e.g., s), and matrices by boldface uppercase letters (e.g., S). s\(_i\) is the i\(^{th}\) row of a matrix S, and its (i,j)\(^{th}\) element is denoted by s\(_{\textit{ij}}\). We list the main notations in Table 1.

We regard a social network as a homogeneous attributed network, which means that there is only one type of relationship between nodes, each edge is undirected, and some nodes carry attributes. Under these assumptions, we denote a social graph as \(\mathcal {G}\) = (\(\mathcal {V}\), \(\mathcal {E}\), U), where \(\mathcal {V}\) is a set of n nodes, \(\mathcal {E}\) is a set of edges and U is the node attribute matrix. A is the adjacency matrix, and a\(_{ij}\) is defined as 1 if there is an edge between node i and node j.

The aim of this paper is to project the nodes into a low-dimensional vector space while preserving structural and attribute information jointly. It is therefore vital to map the matrices U and A into the same latent space. To this end, we propose a neural-based attributed network embedding model to learn a comprehensive representation.

Table 1. Main notations and descriptions

4 Methodology

In this section, we first elaborate the framework of our model NANE, which preserves structural and attribute information jointly. We then give the explicit loss function and its optimization in detail.

4.1 Overall Architecture

In this paper, we propose a neural-based attributed network embedding model to capture structural and attribute information jointly. As shown in Fig. 1, NANE consists of three major steps. Specifically, we first convert structural and attribute information into sets of binary features, and then leverage a self-feedforward layer to measure the weight of each feature. We then concatenate the outputs of the self-feedforward layer and feed them into a neural network. This early fusion makes it possible to optimize all parameters simultaneously. To capture the highly non-linear correlation between nodes, we utilize a deep autoencoder [26] to reconstruct the overall information, and design a specific loss function to preserve both local and global similarity. The details of each step are elaborated below.

Fig. 1. The framework of NANE

4.2 Global Information Encoding

To encode the global attribute and structural information, we introduce an affinity matrix S and an adjacency matrix A respectively. In this paper, we take a row of the adjacency matrix to represent structural information; for example, a\(_i\) is the structural information of vertex i. Furthermore, we use cosine similarity to represent the global attribute information: we take a row of the similarity matrix S as the attribute information, so s\(_i\) is the attribute similarity information of vertex i.

To generate the attribute affinity matrix S, we first need to define the user attribute information matrix U. Attribute information in social networks is rich and diverse, but it is inevitably incomplete and noisy due to its heterogeneity and the fact that it is often filled in manually. To tackle this problem, we first convert all discrete attributes, e.g., user demographics [4] and user interests [23], into a set of binary features by one-hot encoding. For instance, the marital status attribute has four values \(\left\{ \textit{married},\textit{single},\textit{divorced},\textit{widowed} \right\} \), so we can encode a married user as the vector v = \(\left\{ 1,0,0,0 \right\} \), where the first binary feature of value 1 represents married. For continuous attributes, e.g., age, we apply max-min normalization to reduce the impact of scale. For missing attribute values, we set the corresponding feature vector to all zeros. We aggregate all the feature vectors together to obtain the user attribute information matrix U, and we obtain \(s_{ij}\) by computing the cosine similarity of u\(_i\) and u\(_j\).
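As a concrete illustration, the following is a minimal sketch of this preprocessing step, assuming pandas and scikit-learn are available; the function and column names are hypothetical and not part of our implementation.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def build_affinity_matrix(profiles, discrete_cols, continuous_cols):
    """Return (U, S): the encoded attribute matrix and its cosine affinity matrix."""
    blocks = []
    for col in discrete_cols:
        # One-hot encoding; a missing (NaN) value yields an all-zero row.
        blocks.append(pd.get_dummies(profiles[col], dummy_na=False).to_numpy(dtype=float))
    for col in continuous_cols:
        # Max-min normalization of continuous attributes to [0, 1].
        v = profiles[col].to_numpy(dtype=float)
        rng = v.max() - v.min()
        scaled = (v - v.min()) / rng if rng > 0 else np.zeros_like(v)
        blocks.append(scaled.reshape(-1, 1))
    U = np.hstack(blocks)        # n x d attribute matrix
    S = cosine_similarity(U)     # s_ij = cosine similarity of u_i and u_j
    return U, S
```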

4.3 Self-Feedforward Layer

To weight the features in structure and attribute, we design a self-feedforward layer for both the structure and attribute embeddings. As shown in Fig. 1, the self-feedforward layer consists of two fully connected layers, which convert the structural and attribute features into two dense vectors respectively. We apply dropout, residual connections and layer normalization on the self-feedforward layer to prevent overfitting and improve training efficiency. The weights of the features are denoted as W\(_{struc}\) and W\(_{attr}\) respectively. The final output of the self-feedforward layer for vertex i is denoted as h\(_i^{(0)}\) and can be expressed as follows:

$$\begin{aligned} \mathbf h _{i}^{(0)}= \begin{bmatrix} Dropout(\sigma (\mathbf W _{struc}\cdot \mathbf a _{i})),\lambda Dropout(\sigma (\mathbf W _{attr}\cdot \mathbf s _{i})) \end{bmatrix} \end{aligned}$$
(1)

where \(\mathbf h _{i}^{(0)}\) denotes the input layer of the autoencoder for vertex i, \(\lambda \) adjusts the impact of attributes, \(\sigma \) denotes the activation function, and W\(_{struc}\) and W\(_{attr}\) are the parameters to be learned, which measure the weights of structural and attribute information respectively. For better convergence and to prevent overfitting, we add an \(\mathcal {L}\)2-norm regularizer in this part, which is defined as follows:

$$\begin{aligned} \mathcal {L}_{F-reg}=\left\| \mathbf W _{struc} \right\| _F^2 + \left\| \mathbf W _{attr} \right\| _F^2 \end{aligned}$$
(2)

where \(\left\| \cdot \right\| _F^2\) denotes the squared Frobenius norm.
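For illustration, the sketch below shows one possible PyTorch realization of the self-feedforward layer and its regularizer; the sigmoid activation and dropout rate are assumptions rather than our exact settings, and the residual connections mentioned above are omitted for brevity.

```python
import torch
import torch.nn as nn

class SelfFeedforward(nn.Module):
    """Two dense projections as in Eq. (1), with dropout and layer normalization."""
    def __init__(self, n_nodes, t, lam=1.0, p_drop=0.2):
        super().__init__()
        self.w_struc = nn.Linear(n_nodes, t)   # W_struc, applied to a_i (a row of A)
        self.w_attr = nn.Linear(n_nodes, t)    # W_attr, applied to s_i (a row of S)
        self.norm = nn.LayerNorm(t)
        self.drop = nn.Dropout(p_drop)
        self.lam = lam                         # lambda in Eq. (1)

    def forward(self, a_i, s_i):
        h_struc = self.drop(self.norm(torch.sigmoid(self.w_struc(a_i))))
        h_attr = self.drop(self.norm(torch.sigmoid(self.w_attr(s_i))))
        # h_i^(0): concatenation of the two dense vectors, attribute part scaled by lambda
        return torch.cat([h_struc, self.lam * h_attr], dim=-1)

    def l2_reg(self):
        # Eq. (2): squared Frobenius norms of the two weight matrices
        return self.w_struc.weight.pow(2).sum() + self.w_attr.weight.pow(2).sum()
```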

4.4 Reconstructing Procedure

To capture the local and global proximity in structure and attribute jointly, we extend a deep autoencoder model to measure the interplay between structure and attribute and encode them into a latent space. A deep autoencoder consists of two parts, i.e., the encoder and the decoder. At the encoding step, a multilayer non-linear function \(f(\cdot )\) maps the input data into a low-dimensional vector space, also called the embedding space. Conversely, a mirrored multilayer non-linear function \(g(\cdot )\) maps the embedding space back to the reconstruction space. In this paper, the input of the deep autoencoder is the output of the self-feedforward layer \(\mathbf h _{i}^{(0)}\), and the representation of the hidden layers can be denoted as follows:

$$\begin{aligned} \mathbf h _{i}^{(k+1)}=\sigma (\mathbf W ^{(k)} \cdot \mathbf h _{i}^{(k)} + \mathbf b ^{(k)}) \end{aligned}$$
(3)

where \(W^{(k)}\) and \(b^{(k)}\) are the weight matrix and biases of the k\(^{th}\) hidden layer.
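The following sketch shows one way to build such a mirrored encoder/decoder pair in PyTorch; the layer sizes here are placeholders and not the configurations used in our experiments.

```python
import torch.nn as nn

def build_autoencoder(in_dim, hidden_dims=(512, 256)):
    """Mirrored encoder/decoder stacks; hidden_dims[-1] is the embedding dimension."""
    dims = [in_dim, *hidden_dims]
    enc, dec = [], []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        enc += [nn.Linear(d_in, d_out), nn.Sigmoid()]      # encoder layers, Eq. (3)
    for d_in, d_out in zip(reversed(dims[1:]), reversed(dims[:-1])):
        dec += [nn.Linear(d_in, d_out), nn.Sigmoid()]      # mirrored decoder layers
    return nn.Sequential(*enc), nn.Sequential(*dec)
```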

The output of the reconstruction layer is denoted as \(\widehat{\mathbf{R }}\). Our goal in this step is to minimize the reconstruction error between output and input. The input data R to be reconstructed is defined as follows:

$$\begin{aligned} \mathbf R =\begin{bmatrix} \mathbf A , \lambda \mathbf S \end{bmatrix} \end{aligned}$$
(4)

The corresponding loss function is denoted as follows:

$$\begin{aligned} \mathcal {L}=\left\| \widehat{\mathbf{R }} -\mathbf R \right\| _F^2 \end{aligned}$$
(5)

However, due to the sparsity of the input data, the number of zero elements in R is much larger than the number of non-zero ones, so the model is more prone to reconstructing the zero elements rather than the non-zero ones, which is contrary to our intention. To address this problem, we introduce an offset coefficient matrix to reweight the elements, and the redesigned loss function is denoted as follows:

$$\begin{aligned} \mathcal {L}_{global}=\left\| (\widehat{\mathbf{R }}-\mathbf R )\odot \mathbf B \right\| _F^2 \end{aligned}$$
(6)

where B is the offset coefficient matrix with b\(_{ij}\)=1 when r\(_{ij}\)=0 and b\(_{ij}=\beta >1\) otherwise, and \(\odot \) denotes the Hadamard product. With the help of the deep autoencoder, vertices with similar features obtain similar representations. However, it is not only necessary to capture the global similarity, but also vital to preserve the local similarity: a pair of nodes connected by an edge should also have similar embeddings, i.e., the local structure similarity. Therefore, we add a loss function for local structure similarity, expressed as follows:

$$\begin{aligned} \mathcal {L}_{struc-local} = \sum _{i,j=1}^{n}a_{ij}\left\| \mathbf x _i-\mathbf x _j \right\| _2^2 \end{aligned}$$
(7)

where \(a_{ij}\) is an element of the adjacency matrix, with \(a_{ij}=1\) when there is a link between node i and node j, and \(x_{i}\) denotes the final embedding of node i. The local structure loss encourages the embeddings of two connected nodes to be close.

Similarly to the local structure similarity, we also design a loss function to constrain the local attribute similarity: a pair of nodes with similar attributes should also have similar embeddings. The loss function for local attribute similarity is denoted as follows:

$$\begin{aligned} \mathcal {L}_{attr-local} = \sum _{i,j=1}^{n}s_{ij}\left\| \mathbf x _i-\mathbf x _j \right\| _2^2 \end{aligned}$$
(8)

where \(s_{ij}\) is an element of the attribute similarity matrix; the greater the value of \(s_{ij}\), the more similar the attributes of node i and node j. Akin to the local structure loss, the local attribute loss encourages the embeddings of two nodes with similar attributes to be close.
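For clarity, the loss terms of Eqs. (6)-(8) can be written compactly as below; this is an illustrative PyTorch sketch, and the default value of \(\beta \) here is a placeholder rather than the tuned value reported in Sect. 5.3.

```python
import torch

def global_loss(r_hat, r, beta=10.0):
    # Offset matrix B (Eq. 6): beta > 1 on the non-zero entries of R, 1 elsewhere.
    b = torch.where(r != 0, torch.full_like(r, beta), torch.ones_like(r))
    return ((r_hat - r) * b).pow(2).sum()

def local_loss(w, x):
    # sum_{i,j} w_ij * ||x_i - x_j||_2^2, with w = A for Eq. (7) or w = S for Eq. (8)
    return (w * torch.cdist(x, x).pow(2)).sum()
```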

4.5 Loss Functions and Optimization

To preserve the global and local similarity simultaneously, we combine the aforementioned loss functions in an integrated framework and propose a unified structure and attribute preserving model, NANE. The joint loss function is denoted as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}&=\alpha \mathcal {L}_{global}+\gamma \mathcal {L}_{struc-local}+\theta \mathcal {L}_{attr-local}+\eta \mathcal {L}_{reg}\\&=\alpha \left\| (\widehat{\mathbf{R }}-\mathbf R )\odot \mathbf B \right\| _F^2+\gamma \sum _{i,j=1}^{n}a_{ij}\left\| \mathbf x _i-\mathbf x _j \right\| _2^2\\ {}&+\theta \sum _{i,j=1}^{n}s_{ij}\left\| \mathbf x _i-\mathbf x _j \right\| _2^2+\eta \mathcal {L}_{reg} \end{aligned} \end{aligned}$$
(9)

where \(\alpha \), \(\gamma \), \(\theta \) and \(\eta \) are four hyper-parameters that adjust the weight of each part. Moreover, \(\mathcal {L}_{reg}\) is the total regularization, which consists of two parts: the regularization of the self-feedforward layer \(\mathcal {L}_{F-reg}\) and the regularization of the deep autoencoder \(\mathcal {L}_{ae-reg}\). We define \(\mathcal {L}_{reg}\) as

$$\begin{aligned} \begin{aligned} \mathcal {L}_{reg}&=\mathcal {L}_{F-reg}+\mathcal {L}_{ae-reg}\\&=\left\| \mathbf W _{struc} \right\| _F^2 + \left\| \mathbf W _{attr} \right\| _F^2 + \sum _{k=1}^{K}(\left\| \mathbf W ^{(k)} \right\| _F^2+\left\| \mathbf b ^{(k)} \right\| _F^2) \end{aligned} \end{aligned}$$
(10)

To optimize the aforementioned framework, we apply RMSProp to minimize the objective in Eq. (9), which adjusts the learning rate for each parameter individually. We feed mini-batches into the model to accelerate training. Besides, in order to avoid falling into a poor local optimum, it is essential to find a good set of initial parameters. Therefore, we adopt a Restricted Boltzmann Machine to pre-train the parameters, a classic method of parameter initialization for neural networks [7]. The integrated algorithm is presented in Algorithm 1.

Algorithm 1
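To complement Algorithm 1, the following is a minimal training-loop sketch built on the earlier code sketches; it assumes a `model` combining the self-feedforward layer and autoencoder that returns embeddings and reconstructions, and reuses the `global_loss`/`local_loss` helpers. The RBM pre-training step is omitted, the defaults for \(\alpha \), \(\gamma \), \(\theta \) and \(\eta \) follow Sect. 5.3, and the learning rate, epoch count and batch size are placeholders.

```python
import torch

def train_nane(model, A, S, alpha=1000.0, gamma=1.0, theta=1.0, eta=1.0,
               lr=1e-3, epochs=100, batch_size=600):
    """model(a, s) is assumed to return (embeddings x, reconstructions r_hat)."""
    opt = torch.optim.RMSprop(model.parameters(), lr=lr)
    n = A.shape[0]
    for _ in range(epochs):
        for idx in torch.randperm(n).split(batch_size):
            a, s = A[idx], S[idx]
            x, r_hat = model(a, s)
            r = torch.cat([a, model.lam * s], dim=1)         # Eq. (4), restricted to the batch
            loss = (alpha * global_loss(r_hat, r)            # Eq. (6)
                    + gamma * local_loss(A[idx][:, idx], x)  # Eq. (7), within-batch pairs
                    + theta * local_loss(S[idx][:, idx], x)  # Eq. (8), within-batch pairs
                    + eta * model.l2_reg())                  # regularization, Eq. (10)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```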

5 Experiments

In this section, we evaluate our method by performing experiments on three real-world network datasets and compare it with several state-of-the-art baseline algorithms.

5.1 Datasets

We conduct experiments on three real-world networks: Facebook, Hamilton and Rochester. The statistics of the three datasets are summarized in Table 2.

Ego-Facebook is an ego-network collected from survey participants using the Facebook app. The dataset contains 4039 nodes, 88234 edges and 1403-dimensional node features. Education type is used as the group label [15].

Hamilton and Rochester are two datasets collected by Adam D'Angelo of Facebook, drawn from the Facebook networks of 100 American institutions [30]. Each node has 7 attributes: student/faculty status flag, gender, major, second major/minor, dorm/house, year, and high school, described by 144-dimensional and 235-dimensional feature vectors respectively; the student/faculty status flag is used as the class label. The two datasets contain 2314 nodes with 192788 edges, and 4563 nodes with 322808 edges, respectively.

Table 2. The statistics of the three datasets

5.2 Baseline Methods

We compare NANE with several state-of-the-art network embedding methods, which are divided into two categories.

Structure-Based NRL Methods

  • DeepWalk [24] generates node sequences by random walks and feeds them into the Skip-Gram model to learn network embeddings.

  • node2vec [9] introduces biased random walks into DeepWalk, aiming to capture local and global structure jointly.

  • LINE [28] uses two separate objective functions to capture both first-order and second-order proximity.

  • SDNE [33] leverages a deep neural network, a non-linear mapping, to exploit structural information.

Attribute-Aware NRL Methods

  • LANE [10] fuses structural, attribute and label information together to preserve node similarity. Here, we only use the version without utilizing label information.

  • UPP-SNE [35] generates random walks to capture node pairwise similarity and embeds user profile information into a low-dimensional vector space.

5.3 Parameter Settings

For a fair comparison, we set the embedding dimension to 256 for all methods. In DeepWalk, node2vec and UPP-SNE, we set the window size t to 10 and the walk length l to 80. In node2vec, we empirically set the return hyperparameter p to 2.0 and the in-out hyperparameter q to 0.5. In LINE, we set both the first-order and second-order vector dimensions to 128. In SDNE, we set the weights \(\gamma ,\alpha ,\beta \) to 1, 500 and 100 respectively, and the learning rate to 0.01. In LANE, we tune the hyper-parameters \(\beta _1,\beta _2\) by grid search. In UPP-SNE, we set the number of iterations to 20. For our method, we use a grid search to find the best parameters: we set \(\alpha \) to 1000, \(\beta \) to 100, and \(\gamma \), \(\theta \) and \(\eta \) to 1. The remaining parameters are given in Table 3.

Table 3. Parameter settings of the three datasets

5.4 Node Classification

We first evaluate the effectiveness of NANE by multi-class node classification. For fairness, we vary the training size from 15% to 75% in steps of 10%, and apply an rbf-kernel SVM classifier with \(\gamma =100\) to all of the generated node representations. For each training size, we split the train and test sets randomly 10 times, conduct 5-fold cross-validation, and report the average micro F1-score of node classification on the three datasets.
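A sketch of this evaluation protocol, assuming scikit-learn, is given below; the 5-fold cross-validation performed within each split is omitted for brevity, and the function name is ours.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def micro_f1(X, y, train_ratio, runs=10, seed=0):
    """Average micro F1 over `runs` random splits at a given training ratio."""
    scores = []
    for r in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_ratio, random_state=seed + r, stratify=y)
        clf = SVC(kernel="rbf", gamma=100).fit(X_tr, y_tr)
        scores.append(f1_score(y_te, clf.predict(X_te), average="micro"))
    return float(np.mean(scores))
```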

Table 4. Node classification F1-score(%) on Facebook
Table 5. Node classification F1-score(%) on Hamilton
Table 6. Node classification F1-score(%) on Rochester

Tables 4, 5 and 6 show the average classification accuracy of all methods on Facebook, Hamilton and Rochester, with the best results in bold. Our method consistently yields the best classification results among all baselines. The clear gain over the structure-based NRL algorithms demonstrates that our method fuses attribute information well. On Facebook in particular, NANE achieves more than 30% improvement over all structure-based NRL baselines, pointing to its significant performance on node classification. Moreover, NANE also outperforms all attribute-aware NRL baselines, demonstrating its effectiveness in capturing local and global similarity in structure and attribute. Our method performs an early fusion of structural and attribute information, which enables it to exploit the interplay between structure and attribute. On top of that, it encodes the global information with a pairwise constraint, leading to remarkable results.

5.5 Parameter Sensitivity

In this section, we explore the parameter sensitivity of our method. We select three crucial parameters, the batch size, \(\lambda \) and t, and investigate how they affect the results. The training ratio is set to 55%. We report the node classification accuracy on the different datasets under different parameter settings, studying the effect of one parameter in turn while fixing the rest. The results are shown in Fig. 2.

We first show how the batch size affects performance in Fig. 2(a). The performance of our method on Hamilton and Rochester is stable across different values, whereas the performance on Facebook fluctuates considerably; the best batch size on Facebook is 600. We then show how the weight of attribute information \(\lambda \) influences the result in Fig. 2(b). When \(\lambda \) approaches 1, meaning structural information is as important as attribute information, the performance is much better. However, the performance of our method drops sharply when \(\lambda \) is larger than 10, which indicates that an overly large influence of attributes can have the opposite effect. Finally, Fig. 2(c) shows how the dimension t of the output of the self-feedforward layer affects the result: when t is close to the embedding dimension, the performance is better in most cases.

Fig. 2. The impact of parameters batch size, \(\lambda \), and t

5.6 Node Clustering

For the node clustering task, we apply K-Means++ to the network embeddings on the three datasets and use ARI [11] and NMI [32] to evaluate the clustering quality. We set the number of clusters to the number of groups and compute the indexes 10 times to reduce the effect of randomness. The average results are shown in Fig. 3.
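A sketch of this clustering evaluation, assuming scikit-learn, is shown below; the function and argument names are ours.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def cluster_scores(X, y, n_clusters, runs=10):
    """Average ARI and NMI of K-Means++ clusterings over `runs` random seeds."""
    ari = nmi = 0.0
    for r in range(runs):
        pred = KMeans(n_clusters=n_clusters, init="k-means++",
                      n_init=10, random_state=r).fit_predict(X)
        ari += adjusted_rand_score(y, pred) / runs
        nmi += normalized_mutual_info_score(y, pred) / runs
    return ari, nmi
```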

As we can see, NANE achieves the best performance among almost all the cutting-edge baselines. Similar clusterings have a positive ARI, whereas negative values indicate poor performance, and an NMI of exactly 0 corresponds to purely independent label assignments. On Facebook, the ARI values of LINE and SDNE are below zero, indicating that their representations yield clusterings no better than independent labelings. Specifically, NANE achieves a 0.125 improvement over the second best result. On Hamilton, NANE also yields the best clustering results, with lifts of 0.127 and 0.178 on ARI and NMI respectively. This firmly demonstrates the superior performance of NANE on unsupervised learning tasks.

Fig. 3. The results of node clustering on three datasets

6 Conclusion and Future Work

Attributed networks, due to their inherent heterogeneity and sparsity, present new challenges for many tasks. In this paper, we propose a Neural-based Attributed Network Embedding method to capture structural and attribute information jointly. We argue that exploring the global and local similarity simultaneously is crucial for a better representation. We first employ fully connected layers with dropout to extract the relevance within attributes and structure respectively, and then leverage a deep autoencoder, a non-linear mapping, to reconstruct the global information under a local constraint. Experiments on three real-world datasets demonstrate the remarkable performance of our method compared with cutting-edge algorithms. Our future work will focus on the following directions: we will extend NANE to a multi-task learning algorithm and make it suitable for large-scale industrial data, and we are also interested in exploring how to capture user behaviors over time as well as user images and map them into the embedding space.