Keywords

1 Introduction

Large-scale information networks are common information carriers in real world. Mining knowledge from complex information networks can help people to understand network structure [1] or information dissemination patterns [2]. Network representation learning (NRL) [3] is a basic issue in network mining area which mainly studies how to map features of vertex in a network to a low-dimensional, continuous real-valued embedding, and the process of mapping is not only try to preserve the structural feature s, but also try to preserve the properties of the vertex. Embedding learned by NRL can be used as input feature for machine learning methods, and has important applications in the real world, such as network visualization [5], network reconfiguration [4, 6], community detection [7], link prediction [8], etc.

The traditional NRL method is similar to dimensionality reduction, such as Graph Factorization (GF) [9], Local Linear Embedding (LLE) [10] Large-scale Information Network Embedding (LINE) [11] and HOPE algorithm [12], etc. These methods use single matrix to present the similarity graph structure of networks, and obtain low-dimensional embedding for networks by factorize this matrix. Matrix Factorization based Methods is unstable and incomplete because they have strong dependence on constructing single feature matrix. Our task selects multiple features from networks, and design an unsupervised fusion algorithm based on deep neural networks.

2 Related Work

With the advent of the era of big data, deep learning (DL) technology are developing rapidly. DL can discover complex structures in big data via multiple processing layers. DL brings significant results in many areas, such as computer vision, language modeling, etc. In recent years, scholars have done a lot of research on applying DL models to represent graphs or networks. Deepwalk [13] and node2vec [14] use random walk to generate sequence of nodes and adopt an unsupervised neural language model (SkipGram) [15] for networks embedding. SDNE [5] uses an unsupervised deep self-encoder to model the second-order proximity, the hidden layer of the deep self-encoder is the embedding of networks. In addition, convolutional neural networks (CNN) and its variants have been widely adopted in representation learning. PATCHY-SAN [16] selects fixed-length node sequence to assemble the neighborhood of nodes and directly use the original CNN model designed for Euclidean domains. GCN [17] defines the convolution in the spectral domain, and constructs a semi-supervised model for node classification task. However, networks usually contains multiple types of information, such as node attribute information, structure information, text information, etc. Existing method is incomplete because it only learns single types of information. In addition, existing method lack universality because each representation learning model is designed with a specific optimization goals.

Unlike previous approaches, we propose an unsupervised learning algorithm named multi-view of network embedding, also known as MVNE. MVNE uses multiple vertex attribute (text information, geographic location, user tags, etc.), network global topology, and local topology features as input features, and they are treated as network views. The views express the characteristics of different aspects of the network. We consider multiple localized first-order approximation spectral graph convolutions to extract features from views, and fuse features by analyzing correlation between them. The model can be applied to various network tasks because it learns representations in a fully unsupervised setting.

3 Multi-View Learning of Network Embedding

3.1 A Subsection Sample

We define a network \( {\text{G}} = \left( {V,E} \right) \), \( V = \left\{ {v_{1} , \ldots ,v_{i} , \ldots ,v_{N} } \right\} \) is the collection of network vertices, where \( N \) is the number of vertices. \( E \) is the collection of network edges. \( e_{ij} = \left( {v_{i} ,v_{j} } \right) \in E \) represents an edge between \( v_{i} \) and \( v_{j} \). \( {\text{A}} \) is the adjacency matrix. If there is an edge between \( v_{i} \) and \( v_{j} \), then \( A_{ij} = 1 \), otherwise \( A_{ij} = 0 \). The vertices feature matrix \( X \) corresponding to \( G \) is a highly sparse matrix. Dimension of \( X \) is usually expressed as \( \left| {\text{V}} \right| \times {\rm{m}} \), where \( {\text{m}} \) is the feature space size of the attribute. Vertices usually have multiple attributes, such as geographic location, age, hobbies, etc. Let \( Attr = \left\{ {X_{1} , \ldots ,X_{p} } \right\} \) denote the feature matrices set of network \( G \) which are treated as views of networks. In this paper, we assume that the input to our algorithm is an undirected network \( G \) and its feature matrices \( Attr \). The goal of our algorithm is mapping each vertex to a low-dimensional vector \( {\text{z}} \in {\mathbb{R}}^{d} \) by fusing the information contained in A and \( Attr \), where \( d \ll \left| V \right| \).

3.2 Feature Extraction Based on Graph Convolution

Convolutional neural networks (CNN) has achieved good results in areas. CNN can process Euclidean data (e.g. image data) efficiently. However, network data belongs to Graph-structured Data. In order to learn the features in Graph-structured Data, this paper use a variant of CNN which called spectral convolution to extract feature map from views in networks. The definition of spectral convolution is as shown in Eq. (1).

$$ g_{\theta } *X = Ug_{\theta } U^{T} X $$
(1)

\( {\text{X}} \in {\mathbb{R}}^{{\left| {\rm{V}} \right| \times {\text{m}}}} \) is a feature matrix, \( {\text{g}}_{\uptheta} = {\rm{diag}}\left(\uptheta \right) \) is a filter. Spectral convolution \( g_{\theta } \) is generated by decomposing the normalized graph Laplacian matrix shown as Eq. (2).

$$ L = I_{N} - D^{{ - \frac{1}{2}}} AD^{{ - \frac{1}{2}}} = U^{T}\Lambda U $$
(2)

\( D \) is the degree matrix, \( U \) is the matrix of eigenvectors of \( L \), \( \varLambda \) is the diagonal matrix of eigenvalues of \( {\text{L}} \). According to Eqs. (1) and (2), it can be seen that filter \( g_{\theta } \) is a function of eigenvalues \( \Lambda \). We can obtain the filter \( g_{\theta } \) via the eigenvalue decomposition of \( L \). However, in large-scale networks, eigenvalue decomposition of L is computationally expensive. So we use \( K^{th} \)-order Chebyshev polynomial to approximate \( g_{\theta } \left( \varLambda \right) \) in Eq. (3).

$$ g_{{\theta^{\prime } }} *X = Ug_{{\theta^{\prime } }} U^{T} X = \sum\nolimits_{k = 0}^{K} {\theta_{k}^{\prime } T_{k} \left( {U{\tilde{\Lambda }}U^{T} } \right)X} = \sum\nolimits_{k = 0}^{K} {\theta_{k}^{\prime } T_{k} \left( {{\tilde{\text{L}}}} \right)X} $$
(3)

In Eq. (3), \( {\tilde{\Lambda }} = \frac{2}{{\lambda_{ \hbox{max} } }}\Lambda - {\text{I}}_{\rm{N}} \), \( \lambda_{ \hbox{max} } \) is the largest eigenvalue of \( {\text{L}} \). \( \uptheta^{\prime } \in {\mathbb{R}}^{\text{K}} \) is a vector of Chebyshev coefficients. \( {\text{T}}_{\rm{k}} \left( {{\tilde{\text{L}}}} \right) = 2{\tilde{\rm{L}}\text{T}}_{{{\rm{k}} - 1}} \left( {{\tilde{\text{L}}}} \right) - {\rm{T}}_{{{\text{k}} - 2}} \left( {{\tilde{\rm{L}}}} \right) \), with \( {\text{T}}_{0} \left( {{\tilde{\rm{L}}}} \right) = 1 \) and \( {\text{T}}_{1} \left( {{\tilde{\rm{L}}}} \right) = {\tilde{\text{L}}} \). If \( {\text{K}} = 2 \) and \( \lambda_{ \hbox{max} } \approx 2 \), we can obtain

$$ g_{{\theta^{\prime } }} *X \approx \theta_{0}^{\prime } X + \theta_{1}^{\prime } \left( {L - I_{N} } \right)X \approx \theta \left( {\tilde{D}^{{ - \frac{1}{2}}} \tilde{A}\tilde{D}^{{ - \frac{1}{2}}} } \right)X $$
(4)

where \( {\tilde{\text{A}}} = {\rm{A}} + {\text{I}}_{\rm{N}} \), \( {\tilde{\text{D}}} \) is the degree matrix of \( {\tilde{\text{A}}} \). The equation has parameter \( \uptheta =\uptheta_{0}^{\prime } = -\uptheta_{1}^{\prime } \), and it is a matrix of filter parameters in graph convolution network. As an example, we use two-layer convolution network with different \( {\text{W}} \) to learn multiple views in networks. The graph convolution network can be expressed as follows:

$$ {\text{z}}\left(\uptheta \right) = {\rm{ReLU}}\left( {{\hat{\text{A}}\rm{ReLU}}\left( {{\hat{\text{A}}\rm{XW}}^{0} } \right){\text{W}}^{1} } \right) $$
(5)

In Eq. (5), \( \uptheta \) to denote the vector of all filter parameters \( {\text{W}} \). \( {\hat{\text{A}}} = {\tilde{\rm{D}}}^{{ - \frac{1}{2}}} {{\tilde{\text{A}}{\tilde{\rm{D}}}}}^{{ - \frac{1}{2}}} \) is a symmetric and sparse adjacency matrix which converges the weight information of the nodes in the first-order domain of the target node. \( {\text{z}}\left(\uptheta \right) \) is the feature map learned from feature \( {\text{X}} \). We will obtain multiple feature maps by multiple convolution operations. This process is shown in Fig. 1. We consider three views in network.

Fig. 1.
figure 1

A schematic of MVNE

3.3 MVNE Algorithm

Canonical Correlation Analysis (CCA) is used to mine complex relation mappings between two views \( ({\text{X}}_{1} ,{\rm{X}}_{2} ) \in {\mathbb{R}}^{{{\text{n}}_{1} }} \times \) \( {\mathbb{R}}^{{{\text{n}}_{2} }} \) by finding pairs of projections \( {\text{w}}_{1} ,{\rm{w}}_{2} \) that are maximize the correlation between views. The goal of CCA is shown in Eq. (6), where \( \Sigma _{11} \) and \( \Sigma _{22} \) are covariance, \( \Sigma _{12} \) is cross-covariance.

$$ \left( {{\text{w}}_{1}^{ *} ,{\rm{w}}_{2}^{ *} } \right) = {}_{{{\text{w}}_{1} ,{\rm{w}}_{2} }}^{\text{argmax}} {\rm{corr}}\left( {{\text{w}}_{1}^{\prime } {\rm{X}}_{1} ,{\text{w}}_{2}^{\prime } {\rm{X}}_{2} } \right) = \begin{array}{*{20}c} {\text{argmax}} \\ {{\rm{w}}_{1} ,{\text{w}}_{2} } \\ \end{array} \frac{{{\rm{w}}_{1}^{\prime }\Sigma _{12} {\text{w}}_{2} }}{{\sqrt { {\rm{w}}_{1}^{\prime } \Sigma _{11} {\text{w}}_{1} {\rm{w}}_{2}^{\prime } \Sigma _{22} {\text{w}}_{2} } }} $$
(6)

Inspired by CCA, we fuse multi-view by finding a canonical coordinate space that maximizes correlations between the projections of views. We use \( {\text{X}} = \left( {{\rm{x}}_{1} ,{\text{x}}_{2} , \ldots ,{\rm{x}}_{\text{n}} } \right) \in {\mathbb{R}}^{{{\text{N}} \times {\rm{m}}_{\text{x}} }} \) and \( {\text{Y}} = \left( {{\rm{y}}_{1} ,{\text{y}}_{2} , \ldots ,{\rm{y}}_{\text{n}} } \right) \in {\mathbb{R}}^{{{\rm{N}} \times {\text{m}}_{\rm{y}} }} \) obtained from network as an example to explain the principle of view fusion. \( {\text{m}}_{\rm{x}} \) and \( {\text{m}}_{\rm{y}} \) are dimensions of views. The goal of our task is learning the parameters \( \uptheta \) in Eq. (7) for every network views, this is express as

$$ {}_{{\theta_{1} ,\theta_{2} }}^{argmax} corr\left( {z\left( {X;\theta_{1} } \right),z\left( {Y;\theta_{2} } \right)} \right) $$
(7)

Let \( {\text{Z}}_{\rm{X}} = {\text{z}}\left( {{\rm{X}};\uptheta_{1} } \right) \) and \( {\text{Z}}_{\rm{y}} = {\text{z}}\left( {{\rm{Y}};\uptheta_{2} } \right) \) be the matrix produced by the graph convolutional layer on two views. \( \Sigma _{\text{xx}} = {\rm{Z}}_{\text{X}} {\rm{Z}}_{\text{X}}^{\prime } + {\rm{r}}_{1} {\text{I}} \), \( \Sigma _{\text{yy}} = {\rm{Z}}_{\text{Y}} {\rm{Z}}_{\text{Y}}^{\prime } + {\rm{r}}_{2} {\text{I}} \) are covariance matrices of \( \left( {{\text{Z}}_{\rm{X}} ,\,{\text{Z}}_{\rm{y}} } \right) \), and \( \Sigma _{\text{xy}} =\Sigma _{\rm{yx}}^{\prime } = {\text{Z}}_{\rm{X}} {\text{Z}}_{\rm{Y}}^{\prime } \) is the cross-covariance matrices of \( \left( {{\text{Z}}_{\rm{X}} ,\,{\text{Z}}_{\rm{y}} } \right) \). \( {\text{r}}_{1} ,{\rm{r}}_{2} > 0 \) is the regularization constant to reduce over-fitting in training data. We define \( {\text{O}} =\Sigma _{\rm{xx}}^{{ - \frac{1}{2}}}\Sigma _{\text{xy}}\Sigma _{\rm{yy}}^{{ - \frac{1}{2}}} \) according to the objection in Eq. (6), then we use the traces of \( {\text{O}} \) to simplify the calculation of the objective function in Eq. (7)

$$ \begin{array}{*{20}c} {arg{ \hbox{max} }} \\ {\theta_{1} ,\theta_{2} } \\ \end{array} corr\left( {Z_{X} ,Z_{Y} } \right) = max\,tr\left( {O^{\prime } O} \right)^{{{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 2}}\right.\kern-0pt} \!\lower0.7ex\hbox{$2$}}}} $$
(8)

In order to find \( \uptheta_{1} ,\uptheta_{2} \) such that Eq. (8) is as high as possible, we calculate the gradient of \( {\text{corr}}\left( {{\rm{Z}}_{\text{X}} ,{\rm{Z}}_{\text{Y}} } \right) \) with respect to \( \uptheta_{1} ,\uptheta_{2} \), then use back propagation. Algorithm describes the multi-view fusing and embedding generation process.

figure a

4 Experiments

4.1 Dataset

In order to evaluate the effectiveness of MVNE, we use community detection, network reconstruction and network visualization to evaluate different embedding generated by different methods. Table 1 gives a properties list of all real network in our experiment. We construct random walk matrix \( X_{R} \in R^{\left| V \right| \times \left| V \right|} \) based on network to preserve local topology structure, and \( X_{{R_{ij} }} \) denote the frequency of node \( v_{j} \) appearing in random walk sequence of node \( v_{i} \). \( X_{R} \) can preserve node centrality and higher-order proximity between nodes. In addition, some networks in Table 1 contain rich information, such as region, hobbies, etc. We construct feature matrices base on attributes by one-hot coding, and use 5 layers MVNE to generate embedding.

Table 1. Experimental network basic properties.

4.2 Community Detection

Community detection is a basic task of network analysis. The interaction between nodes in same community is more frequent than the interaction among other communities in networks. We use K-means to classify and assign community label for every node base on embedding. Modularity can be used to evaluate the quality of community detection. Embedding with high modularity is high-quality. We report performance of eight models including our model. The result is shown in Table 2.

Table 2. Modularity result of community detection.

We can see that MVNE has higher modularity than other benchmark algorithms. The results show that the embedding learned by MVNE is more effective because it contains multiple network information.

4.3 Network Reconstruction

The purpose of network reconstruction is to rebuild the links between nodes based on similarity of node pairs. The similarity between nodes is evaluated by the distance of their embedding. The experiment randomly selects 20% node pairs from whole pairs as a sub-network sample. We take k pairs of nodes with the highest similarity as predicted links and calculate actual link ratio to evaluate the accuracy of network reconstruction. If node embedding is effective, the accuracy will be high.

Figure 2 shows the mean and standard deviation of corresponding accuracy. Accuracy of network reconstruction decreases with increasing of k value. MVNE achieves better network reconstruction with different k values. The reconstruction precision can reach about 80% while k = 2.

Fig. 2.
figure 2

Reconstruction accuracy of networks

4.4 Visualization

We can visualize embedding to make better understanding for topology and characteristics of networks intuitively. Embedding learned by different methods are different in the ability of visualization and interpretation. We compare visualization ability of embedding learned by different algorithms in Football network. Each algorithm learns 64-D embedding for nodes, and use t-SNE [4] to reduce dimension to 2-D. We color nodes to observe the basic community structure of networks. The results are shown in Fig. 3. Some nodes belonging to different communities are mixed up in HOPE, LINE and GF. The embedding generated by DeepWalk and Node2Vec can represent clear community structure, but there are still a few nodes belonging to different communities mixed. MVNE is more effective than other benchmark algorithms because nodes belonging to same communities are separated clearly.

Fig. 3.
figure 3

Visualization of Football network

5 Future Work

In this paper, we propose a Multi-View Learning algorithm to generate embedding of networks. It fuses multiple features of networks by an unsupervised learning process. In the future, how to improve the learning efficiency of MVNE on large-scale network is a very important problem. In addition, introducing dynamic interaction information as a feature into NRL process is also worthwhile to study.