1 Introduction

Learning vector representations for nodes in large networks has proved extremely useful in various complex network analysis tasks [7, 9]. The main idea of network representation learning is to use dimensionality reduction techniques to map the high-dimensional information about a node's neighbors or other inherent properties into a low-dimensional space [7, 9, 12, 22]. Since machine learning techniques cannot be applied to the network directly, the learned node representations are essential to machine learning based graph mining tasks such as node classification, link prediction and community detection [5, 7, 9].

Real-world networks can be classified into three categories based on their structure: naive networks, attributed networks and multi-layer networks. A naive network has only the basic elements of a network, namely nodes and edges [7, 18]. An attributed network consists of nodes and edges along with node attributes and edge attributes [7, 12, 21]. The multi-layer network, also called a heterogeneous network, is the most complex one [22]. The nodes and edges of a multi-layer network are of different types, each type of node defines a layer, and the interactions between nodes can be either intra- or inter-layer [5, 19, 22]. Although the embedding of attributed and multi-layer networks has received increasing interest, it is not always possible to obtain such additional information in real applications. Embedding naive networks is therefore the more general problem, and this work focuses on it.

Measuring the similarity of nodes is the most straightforward way to provide evidence for network embedding [15]. Nodes that are more similar in a network should remain more similar when the network is mapped from a high-dimensional space to a low-dimensional space [16]. However, different similarity measurements have their own strengths and weaknesses [15]. For example, common neighbor based methods can easily capture local topological similarity, but they fail for nodes that are similar to each other yet share few common neighbors. On the other hand, some global similarity measurements have advantages in measuring similarity in the global structure, but the precision of local similarity calculation cannot be guaranteed.

The majority of existing network embedding models use only one kind of node similarity measurement. For example, LINE [23] captures the first- and second-order relationships between nodes, but it fails to reflect third- or higher-order relationships. Random walk based methods such as DeepWalk [17] and Node2vec [8] obtain similarity information from the co-occurrence probability of nodes. Theoretically, random walk [18] methods can only capture partial structural similarities. Worse, due to the link sparseness problem that exists extensively in real networks, it is difficult to obtain global similarities at the whole-graph level.

Since no single similarity measurement can capture all the structural information of a network, it is natural to utilize multiple similarity measurements and combine them in the embedding procedure. Although some works [16, 23] try to utilize several node similarities to preserve more network information, they fail to consider the interrelationships between the latent spaces preserving different node similarities. This causes both insufficiency and redundancy of the preserved network information.

To solve the problems of existing works, we propose a novel multi-view network embedding model to capture the structural information derived from different node similarity measurements. Latent spaces are first generated to preserve different node similarity sets. These latent spaces are regarded as the views for multi-view network embedding. The multiple views are fused to generate an overall latent space representing the network structural information. The overall latent space merges the common representation of the multiple views and the view-specific representation of each view: the common representation is learned via a Canonical Correlation Analysis [1, 11] based method, and the view-specific representations [2] are learned via a neural network based method. Experiments on real-world datasets with the node classification task verify that the proposed model achieves better node classification performance than network embedding models preserving a single node similarity or a simple combination of node similarities. In addition, since the proposed model preserves a node similarity ensemble, it can describe the overall network structure with a limited number of training samples, which leads to its superior performance at very low training ratios.

2 Problem formulation

Node similarity is the most popular measure to represent the network structural information [15, 25]:

  • Definition 1 (Node Similarity) For a given network G = (V, E), where V is the node set and E is the edge set, a node similarity function maps each node pair into a real value, denoted as \( S:V\times V\to \mathbb{R} \). S(vi, vj) denotes the node similarity between nodes vi and vj.

Network embedding maps the high-dimensional network information into a low-dimensional latent space [7, 9]. Node similarity is commonly used to guide the network embedding [26], i.e., node similarity information should be preserved in the embedding. Network embedding that preserves a single set of node similarities is defined as single-view network embedding:

  • Definition 2 (Single-view Network Embedding) For a network G = (V, E) with m similarity measures \( \mathcal{S}=\left\{{\mathbf{S}}_1,{\mathbf{S}}_2,\dots, {\mathbf{S}}_m\right\} \), single-view network embedding maps each vertex v ∈ V into a low-dimensional vector space \( {\mathbb{R}}^d \) by preserving a non-empty subset of \( \mathcal{S} \). The mapping function of single-view network embedding is represented as \( f:V\to {\mathbb{R}}^d \).

Each node similarity represents the network structural information from a one-sided point of view; none of them can describe the whole network. For example, common neighbor based node similarities represent the local information of the network, measuring relationships with nodes that are directly connected to the active node or within 2 hops of it [10, 15]. They cannot measure global network structural information. Random walk based node similarities represent the global information of the network, measuring relationships with nodes that are randomly selected from the network [10, 15, 18]. They cannot measure local network structural information [15].

As shown in Definition 2, single-view network embedding preserves one set of node similarities. However, this is still not enough to represent the overall network structural information; a combination of node similarity sets should be preserved instead.

The latent space generated by single-view network embedding preserving one set of node similarities is defined as a view of the latent space that preserves the overall network structural information. Multiple views are fused into the low-dimensional latent representation of the overall network. This is defined as multi-view network embedding:

  • Definition 3 (Multi-view Network Embedding) Multi-view network embedding maps multiple latent vector spaces, generated by multiple single-view network embeddings preserving different node similarity sets, into a single low-dimensional latent space representing the network. For a given network G = (V, E), the latent space candidates used for multi-view network embedding are represented as \( \mathcal{M}=\left\{{f}_1(V),{f}_2(V),\dots, {f}_n(V)\right\} \), where fi(V), i = 1, 2, …, n, is the latent space generated by single-view network embedding. Multi-view network embedding is a mapping \( F:\mathcal{M}\to \mathbf{V}\in {\mathbb{R}}^d \), where V is the multi-view latent space of G.

3 The proposed method

The proposed model generates low-dimensional latent spaces for high-dimensional network information via multi-view network embedding. Multiple latent spaces are first generated by single-view network embedding, preserving various node similarity ensembles. Common and view-specific representations are then learned from these latent spaces and merged as the latent space of the multi-view network embedding to represent the network. The details of the proposed method are as follows.

3.1 Single-view network embedding

Single-view network embedding preserves the node similarities of a non-empty subset of \( \mathcal{S}=\left\{{\mathbf{S}}_1,{\mathbf{S}}_2,\dots, {\mathbf{S}}_m\right\} \), where Si is the similarity matrix between the nodes of V under the ith node similarity measurement, and m is the total number of node similarities.

To preserve network information from the node similarity point of view, single-view network embedding should minimize the difference between the node similarities of the original network \( {\mathcal{S}}_i \) and the node similarities of the latent space \( {\mathbf{S}}_i^{\prime } \). The sigmoid function [18] is used in this work to measure node similarity in the latent representation space:

$$ {s}^{\prime}\left({v}_i,{v}_j\right)=\frac{1}{1+{e}^{-{\mathbf{v}}_i^T{\mathbf{v}}_j}} $$
(1)

where vi and vj denote the representation vectors of nodes vi and vj in the representation vector space. The similarity matrix in the representation space is denoted as S′, where S′[i, j] = s′(vi, vj).

This work focuses on directed networks, and the representation of each node preserves two kinds of similarities: the similarities with nodes for which the active node acts as the source node of links, and the similarities with nodes for which the active node acts as the target node of links. Let the source node representation Xi and the target node representation Yi be the representation matrices preserving these two kinds of node similarities. The similarity matrix in the representation space is:

$$ {\mathbf{S}}_i^{\prime }= Sigmoid\left({\mathbf{X}}_i\cdotp {\mathbf{Y}}_i^T\right) $$
(2)

where \( {\mathbf{S}}_i^{\prime}\left[j,k\right]=1/\left(1+{e}^{-{\mathbf{x}}_j^T{\mathbf{y}}_k}\right) \).
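To make (1) and (2) concrete, here is a minimal numpy sketch of the latent-space similarity matrix; the array names X and Y (whose rows are the source- and target-role vectors) are ours, not from the paper.

```python
import numpy as np

def latent_similarity(X, Y):
    """Eq. (2): S' = sigmoid(X . Y^T).

    X[j] is the source-role vector x_j and Y[k] the target-role vector y_k,
    so S'[j, k] = 1 / (1 + exp(-x_j^T y_k)).
    """
    return 1.0 / (1.0 + np.exp(-X @ Y.T))
```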

The similarity difference function is defined based on KL Divergence [2, 3]:

$$ L=\sum \limits_{{\mathbf{S}}_j\in {\mathcal{S}}_i}\sum \limits_{k=1}^{\left|V\right|}\sum \limits_{l=1}^{\left|V\right|}{\mathbf{S}}_j\left[k,l\right]\log\ \frac{{\mathbf{S}}_j\left[k,l\right]}{{\mathbf{S}}_i^{\prime}\left[k,l\right]}. $$
(3)

Since \( \log \frac{{\mathbf{S}}_j\left[k,l\right]}{{\mathbf{S}}_i^{\prime}\left[k,l\right]} \) may be negative, \( \log \left(1+\frac{{\mathbf{S}}_j\left[k,l\right]}{{\mathbf{S}}_i^{\prime}\left[k,l\right]}\right) \) is used in its place in (3). Let \( {\mathbf{X}}_i^{\ast } \) and \( {\mathbf{Y}}_i^{\ast } \) be the optimal representations of Xi and Yi. They are calculated as:

$$ \left({\mathbf{X}}_i^{\ast },{\mathbf{Y}}_i^{\ast}\right)=\underset{{\mathbf{X}}_i,{\mathbf{Y}}_i}{\arg \min}\sum \limits_{{\mathbf{S}}_j\in {\mathcal{S}}_i}\sum \limits_{k=1}^{\left|V\right|}\sum \limits_{l=1}^{\left|V\right|}{\mathbf{S}}_j\left[k,l\right]\log \left(1+\frac{{\mathbf{S}}_j\left[k,l\right]}{{\mathbf{S}}_i^{\prime}\left[k,l\right]}\right). $$
(4)

The latent vector representation obtained by single-view embedding preserving the ith node similarity set is denoted as fi(V), where fi(·) is the corresponding embedding mapping function. fi(V) is the combination of \( {X}_i^{\ast } \) and \( {Y}_i^{\ast } \):

$$ {f}_i(V)=\frac{X_i^{\ast }+{Y}_i^{\ast }}{2}, $$
(5)

where \( {X}_i^{\ast } \) and \( {Y}_i^{\ast } \) are calculated by (4).
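As an illustration, here is a minimal PyTorch sketch of objective (4) and combination (5) optimized by gradient descent; the toy dimensions and the use of Adam for this step are our assumptions, not details reported by the paper.

```python
import torch

def single_view_loss(X, Y, similarity_matrices, eps=1e-8):
    # S' = sigmoid(X Y^T), eq. (2); eps guards the division below
    S_prime = torch.sigmoid(X @ Y.T)
    loss = X.new_zeros(())
    for S in similarity_matrices:  # matrices of one node similarity set
        loss = loss + (S * torch.log(1.0 + S / (S_prime + eps))).sum()
    return loss

# hypothetical toy setup: 100 nodes, d = 16, one similarity matrix
n, d = 100, 16
X = torch.randn(n, d, requires_grad=True)  # source-role representations
Y = torch.randn(n, d, requires_grad=True)  # target-role representations
S_list = [torch.rand(n, n)]
optimizer = torch.optim.Adam([X, Y], lr=0.01)
for _ in range(200):
    optimizer.zero_grad()
    single_view_loss(X, Y, S_list).backward()
    optimizer.step()
f_V = (X.detach() + Y.detach()) / 2  # eq. (5): the view's latent space
```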

3.2 Common representation with latent space alignment

By preserving the network structural information of n node similarity sets with single-view network embedding, the candidates for multi-view network embedding are obtained: \( \mathcal{M}=\left\{{f}_1(V),{f}_2(V),\cdots, {f}_n(V)\right\} \), where fi(V), i = 1, 2, ..., n, is the latent space preserving the node similarity information of the ith node similarity set. To merge these n latent spaces, their common representation is first extracted.

Let \( \mathfrak{V} \) be an underlying space which is observed through the latent spaces of \( \mathcal{M} \). A mapping function \( {H}_i:\mathfrak{V}\to {\mathbb{R}}^{d^{\prime}} \) (i = 1, 2, ..., n) represents the projection from \( \mathfrak{V} \) to the ith latent space. For a node vi ∈ V, let \( {\mathfrak{v}}_i\in \mathfrak{V} \) be its representation in \( \mathfrak{V} \), and let f1(vi), f2(vi), ..., fn(vi) be its latent vector representations generated by single-view network embedding; then \( {H}_j\left({\mathfrak{v}}_i\right)={f}_j\left({v}_i\right) \), j = 1, 2, ..., n [27]. We can therefore write:

$$ {\mathfrak{v}}_i={H}_1^{-1}\left({f}_1\left({v}_i\right)\right)=...={H}_n^{-1}\left({f}_n\left({v}_i\right)\right). $$
(6)

\( {H}_j^{-1} \) (j = 1, 2, ..., n) should satisfy (6) to achieve the optimal solution. However, since (6) cannot be solved directly, motivated by Canonical Correlation Analysis (CCA) [1, 11, 20], a series of mapping functions \( {H}_j^{-1} \) (j = 1, 2, ..., n) that maximize the correlation of pairs of projected vector spaces are used as an approximate solution of (6). Therefore, (6) is relaxed to the following optimization problem:

$$ {\displaystyle \begin{array}{ll}& \left({H}_1^{-1\ast },{H}_2^{-1\ast },...,{H}_n^{-1\ast}\right)=\\ {}& \underset{H_1^{-1},{H}_2^{-1},...,{H}_n^{-1}}{\max}\sum \limits_{i=1}^n\sum \limits_{j=i+1}^n\rho \left({H}_i^{-1}\left({f}_i(V)\right),{H}_j^{-1}\left({f}_j(V)\right)\right)\end{array}} $$
(7)

where \( \rho \left({H}_i^{-1}\left({f}_i(V)\right),{H}_j^{-1}\left({f}_j(V)\right)\right) \) is the correlation of two latent vector spaces’ projections \( {H}_i^{-1}\left({f}_i(V)\right) \) and \( {H}_j^{-1}\left({f}_j(V)\right) \).

To simplify the calculation of (7), a linear transformation \( {\mathbf{W}}_i\in {\mathbb{R}}^{d\times {d}^{\prime}} \) is used to approximate \( {H}_i^{-1} \) [1], and (7) is rewritten as:

$$ {\displaystyle \begin{array}{l}\left({\mathbf{W}}_1^{\ast },{\mathbf{W}}_2^{\ast },...,{\mathbf{W}}_n^{\ast}\right)=\\ {}\underset{{\mathbf{W}}_1,{\mathbf{W}}_2,...,{\mathbf{W}}_n}{\max}\sum \limits_{i=1}^n\sum \limits_{j=i+1}^n\rho \left({\mathbf{W}}_i^T\cdotp {f}_i(V),{\mathbf{W}}_j^T\cdotp {f}_j(V)\right)\end{array}} $$
(8)

Since W1, W2, ..., Wn are mutually constrained, it is difficult to solve the optimization problem in (8) directly [20]. We therefore design a CCA based common representation learning model to fuse all candidate latent spaces. Two latent spaces from the multi-view network embedding candidates are combined to generate a new latent space, and this new latent space is added back to the candidates in place of the original two. This process is executed iteratively until only one latent space remains in the candidate set; that latent space is the common representation of the latent space alignment. The details of the CCA based common representation learning model are given in Algorithm 1. \( {\mathcal{M}}_0 \) is the set of latent space candidates to be transformed, and \( {\mathcal{M}}_1 \) denotes the set of transformed latent spaces. In each iteration, two latent spaces V1 and V2 of \( {\mathcal{M}}_0 \) are fused into a new latent space V*, the average of the projections of V1 and V2. V* is added to \( {\mathcal{M}}_1 \) for the next iteration of latent space fusion, and the process terminates when the size of \( {\mathcal{M}}_0 \) is one. The final latent space is the common representation of \( \mathcal{M}=\left\{{f}_1(V),{f}_2(V),...,{f}_n(V)\right\} \), represented as Vcommon \( \in {\mathbb{R}}^{\left|V\right|\times {d}^{\prime}} \).
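Below is a compact sketch of this pairwise fusion, under one possible reading of Algorithm 1 and with scikit-learn's CCA standing in for a custom solver; the function name fuse_views and the single work-list bookkeeping are our simplifications.

```python
from sklearn.cross_decomposition import CCA

def fuse_views(views, d_prime):
    """Repeatedly project two views onto their maximally correlated
    d'-dimensional subspaces and replace them with the averaged projection,
    until a single latent space (V_common) remains."""
    views = list(views)
    while len(views) > 1:
        V1, V2 = views.pop(0), views.pop(0)
        cca = CCA(n_components=d_prime)
        P1, P2 = cca.fit_transform(V1, V2)  # W1^T f1(V), W2^T f2(V)
        views.append((P1 + P2) / 2.0)       # fused latent space V*
    return views[0]

# hypothetical usage: three |V| x d view embeddings fused into |V| x 8
# V_common = fuse_views([f1_V, f2_V, f3_V], d_prime=8)
```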

3.3 View-specific representation learning

For the multiple views generated by single-view network embedding, view-specific representations are extracted for view fusion. This is achieved by representing nodes' characteristics as Gaussian distributions [2]: each node's characteristics are described as a full distribution rather than a single vector, and the view-specific representation is learned by comparing the Gaussian distributions of nodes while preserving node similarities.

For a network G = (V, E), the latent spaces generated by single-view network embedding are \( \mathcal{M}=\left\{{f}_1(V),{f}_2(V),...,{f}_n(V)\right\} \). To simplify the illustration, we denote the concatenation of \( \mathcal{M} \) as \( \mathbf{M}=\left[{f}_1(V),{f}_2(V),...,{f}_n(V)\right]\in {\mathbb{R}}^{\left|V\right|\times (nd)} \). In this part, we aim to find a lower-dimensional Gaussian distribution embedding \( {v}_{specific}^i=\mathcal{N}\left({\mu}_i,{\Sigma}_i\right) \), \( {\mu}_i\in {\mathbb{R}}^{d^{\prime}} \), \( {\Sigma}_i\in {\mathbb{R}}^{d^{\prime}\times {d}^{\prime}} \), for every node vi [2].

For a node vi ∈ V, we define the k-hop neighborhood [25] as the set of nodes that are exactly k hops away from vi: Nik = {vj ∈ V | vi ≠ vj, sp(vi, vj) = k}, where sp(vi, vj) is the length of the shortest path from vi to vj. Nodes in different neighborhoods should satisfy [2]:

$$ d\left({\mathcal{N}}_i,{\mathcal{N}}_j\right)<d\left({\mathcal{N}}_i,{\mathcal{N}}_{j^{\prime}}\right),\forall {v}_i\in V,{v}_j\in {N}_{ik},{v}_{j^{\prime}}\in {N}_{i{k}^{\prime}},k<{k}^{\prime}, $$
(9)

where \( {\mathcal{N}}_i \), \( {\mathcal{N}}_j \) and \( {\mathcal{N}}_{j^{\prime}} \) are the corresponding Gaussian distributions of nodes vi, vj and \( {v}_{j^{\prime}} \) respectively, and Nik and \( {N}_{i{k}^{\prime}} \) are the k-hop and k′-hop neighborhoods of vi.
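The k-hop neighborhoods used in (9) can be computed by breadth-first search; a short networkx sketch (the helper name k_hop_neighborhoods is ours):

```python
import networkx as nx

def k_hop_neighborhoods(G, v, K):
    """N_{vk}: nodes whose shortest-path distance from v is exactly k."""
    dist = nx.single_source_shortest_path_length(G, v, cutoff=K)
    return {k: {u for u, d in dist.items() if d == k} for k in range(1, K + 1)}
```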

Algorithm 1 CCA based common representation learning

To solve (9), the asymmetric KL divergence is used to measure the difference between two normal distributions [2, 3]. Given the latent Gaussian distribution representations of two nodes vi and vj, we define:

$$ d\left({\mathcal{N}}_i,{\mathcal{N}}_j\right)={D}_{KL}\left({\mathcal{N}}_j\parallel {\mathcal{N}}_i\right)=\frac{1}{2}\left[\mathrm{tr}\left({\Sigma}_i^{-1}{\Sigma}_j\right)+{\left({\mu}_i-{\mu}_j\right)}^T{\Sigma}_i^{-1}\left({\mu}_i-{\mu}_j\right)-{d}^{\prime}-\log \frac{\det \left({\Sigma}_j\right)}{\det \left({\Sigma}_i\right)}\right], $$
(10)

where μi and Σi are the parameters of the normal distribution \( {\mathcal{N}}_i \), μj and Σj are the parameters of \( {\mathcal{N}}_j \), d′ is the dimension of the Gaussian distributions \( {\mathcal{N}}_i \) and \( {\mathcal{N}}_j \), tr(·) denotes the trace of a matrix, and det(·) denotes the determinant of a matrix.
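Because diagonal covariance matrices are used (see (13) below), the KL divergence of (10) reduces to elementwise operations. A minimal numpy sketch, with sig_i and sig_j holding the diagonal entries of Σi and Σj:

```python
import numpy as np

def kl_divergence(mu_i, sig_i, mu_j, sig_j):
    """d(N_i, N_j) = D_KL(N_j || N_i) of eq. (10) for diagonal covariances."""
    d_prime = mu_i.shape[0]
    diff = mu_i - mu_j
    return 0.5 * (np.sum(sig_j / sig_i)                     # tr(Sigma_i^-1 Sigma_j)
                  + np.sum(diff * diff / sig_i)             # quadratic term
                  - d_prime
                  - np.sum(np.log(sig_j) - np.log(sig_i)))  # log det ratio
```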

A neural network with two hidden layers is then used to calculate the parameters of the Gaussian distributions, learning them from the concatenation \( \mathbf{M}=\left[{f}_1(V),{f}_2(V),\cdots, {f}_n(V)\right]\in {\mathbb{R}}^{\left|V\right|\times (nd)} \) [2]. The first layer of the neural network is defined as:

$$ {h}_1\left({\mathbf{m}}_i\right)= Relu\left({\Theta}_1\cdotp {\mathbf{m}}_i+{\mathbf{b}}_1\right), $$
(11)

where mi is the representation of node vi in M, Θ1 and b1 are the weight and bias of the first layer, and Relu(·) is the activation function [6]. Diagonal covariance matrices are used to represent the covariance of the Gaussian distributions [2]. The parameters of node vi's Gaussian distribution are defined as:

$$ {\mu}_i= Relu\left({\Theta}_{\mu}\cdotp {h}_1\left({\mathbf{m}}_i\right)+{\mathbf{b}}_{\mu}\right), $$
(12)
$$ {\Sigma}_i= Relu\left({\Theta}_{\Sigma}\cdotp {h}_1\left({\mathbf{m}}_i\right)+{\mathbf{b}}_{\Sigma}\right), $$
(13)

where Θμ and bμ are the weight and bias of the output layer calculating μ, and ΘΣ and bΣ are the weight and bias of the output layer calculating Σ.
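A PyTorch sketch of (11)-(13) follows; the small epsilon added to the covariance head is our addition to keep Σi strictly positive (plain ReLU can output exact zeros), not part of the paper's formulation.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Shared hidden layer (11) plus ReLU output heads for mu (12) and the
    diagonal entries of Sigma (13)."""
    def __init__(self, in_dim, hidden_dim, d_prime, eps=1e-6):
        super().__init__()
        self.h1 = nn.Linear(in_dim, hidden_dim)          # Theta_1, b_1
        self.head_mu = nn.Linear(hidden_dim, d_prime)    # Theta_mu, b_mu
        self.head_sigma = nn.Linear(hidden_dim, d_prime) # Theta_Sigma, b_Sigma
        self.eps = eps  # keeps Sigma strictly positive (our addition)

    def forward(self, m):
        h = torch.relu(self.h1(m))                          # eq. (11)
        mu = torch.relu(self.head_mu(h))                    # eq. (12)
        sigma = torch.relu(self.head_sigma(h)) + self.eps   # eq. (13)
        return mu, sigma
```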

To satisfy the constraints of (9) with (10), (11), (12) and (13), an objective function is defined to penalize ranking errors according to the energy of node pairs. The KL divergence between a pair of nodes vi and vj is defined as their energy: \( {\mathcal{E}}_{ij}={D}_{KL}\left({\mathcal{N}}_j\parallel {\mathcal{N}}_i\right) \) [2]. The loss function is therefore defined as:

$$ {\displaystyle \begin{array}{ll}L& =\sum \limits_{v_i\in V}\sum \limits_{k<{k}^{\prime}}\sum \limits_{v_j\in {N}_{ik},{v}_{j^{\prime}}\in {N}_{i{k}^{\prime}}}\left({\mathcal{E}}_{ij}^2+{e}^{-{\mathcal{E}}_{i{j}^{\prime}}}\right)\\ {}& =\sum \limits_{\left({v}_i,{v}_j,{v}_{j^{\prime}}\right)\in \mathcal{T}}\left({\mathcal{E}}_{ij}^2+{e}^{-{\mathcal{E}}_{i{j}^{\prime}}}\right),\end{array}} $$
(14)

where \( \mathcal{T}=\left\{\left({v}_i,{v}_j,{v}_{j^{\prime}}\right)\ |\ sp\left({v}_i,{v}_j\right)< sp\left({v}_i,{v}_{j^{\prime}}\right)\right\} \) is the set of all valid triplets. \( {\mathcal{E}}_{ij} \) is the energy of the positive examples, which should be lower than the energy \( {\mathcal{E}}_{i{j}^{\prime}} \) of the negative examples. For a given target node vi, the energy \( {\mathcal{E}}_{ij} \) should be lowest for nodes vj in its 1-hop neighborhood, followed by a higher energy for nodes in its 2-hop neighborhood, and so on [2].
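A PyTorch sketch of the energy and the ranking loss (14) over a batch of sampled (i, j, j′) triplets; the diagonal-covariance energy follows (10), and the batching convention is our assumption.

```python
import torch

def energy(mu_i, sig_i, mu_j, sig_j):
    """E_ij = D_KL(N_j || N_i), eq. (10), diagonal covariances, batched."""
    diff = mu_i - mu_j
    return 0.5 * ((sig_j / sig_i).sum(-1)
                  + (diff * diff / sig_i).sum(-1)
                  - mu_i.shape[-1]
                  - (torch.log(sig_j) - torch.log(sig_i)).sum(-1))

def ranking_loss(mu, sigma, triplets):
    """Eq. (14) over a (B, 3) LongTensor of (i, j, j') index triples."""
    i, j, jp = triplets.unbind(dim=1)
    e_pos = energy(mu[i], sigma[i], mu[j], sigma[j])
    e_neg = energy(mu[i], sigma[i], mu[jp], sigma[jp])
    return (e_pos ** 2 + torch.exp(-e_neg)).sum()
```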

The optimal solution of (14) is the view-specific representation for the candidate latent spaces of \( \mathcal{M}=\left\{{f}_1(V),{f}_2(V),...,{f}_n(V)\right\} \).

The view-specific representation is extracted via the Gaussian distribution \( {v}_{specific}^i=\mathcal{N}\left({\mu}_i,{\Sigma}_i\right) \). \( {\mu}_i\in {\mathbb{R}}^{d^{\prime}} \) is a vector, while \( {\Sigma}_i\in {\mathbb{R}}^{d^{\prime}\times {d}^{\prime}} \) is a diagonal matrix. μi and the diagonal elements of Σi are concatenated into a 2d′-dimensional vector. \( {\mathbf{v}}_{specific}^i=\left[{\mu}_i,{\Sigma}_i^{\prime}\right] \) denotes the view-specific representation of node vi, where \( {\Sigma}_i^{\prime } \) is the vector of diagonal elements of Σi. Vspecific denotes the view-specific embedding space of the candidate latent spaces.

3.4 Joint representation with common and view-specific representations

With the latent spaces generated by single-view network embedding, the common representation is extracted via latent space alignment, and view-specific representations are extracted to reveal the characteristics of each latent space. The latent vector representation of multi-view network embedding is the joint representation of the common representation and the view-specific representations. The output of multi-view network embedding is V = [Vcommon, Vspecific].

4 Model optimization

4.1 Node Similarity Ensemble Strategy

Single-view network embedding generates a single view of the latent network representation by preserving a set of node similarities. Suppose there are m node similarities, i.e., \( \mathcal{S}=\left\{{\mathbf{S}}_1,{\mathbf{S}}_2,...,{\mathbf{S}}_m\right\} \); the number of possible node similarity sets is then \( {2}^m-1 \). If latent spaces are generated for all possible node similarity sets, \( {2}^m-1 \) views need to be fused for the final latent representation of the multi-view network embedding. If m is large, the computational complexity of merging these views is high. In addition, it is not necessary to generate views for all possible node similarity sets: since some subsets are contained in other subsets of \( \mathcal{S} \), there is great information redundancy among the \( {2}^m-1 \) non-empty subsets of \( \mathcal{S} \).

To reduce the computational complexity and the information redundancy, the most valuable node similarity sets should be selected for single-view network embedding. The latent spaces preserving these sets maximize the preserved network structural information while minimizing information redundancy. These views are then merged as the latent representation of the proposed multi-view network embedding model.

The node similarity ensemble strategy is defined as follows: for a node similarity set \( \mathcal{S}=\left\{{\mathbf{S}}_1,{\mathbf{S}}_2,...,{\mathbf{S}}_m\right\} \), the optimal node similarity ensemble set \( {\mathcal{S}}_{candidate} \) should satisfy the following constraints:

  1. \( \forall {\mathcal{S}}_i\in {\mathcal{S}}_{candidate} \): \( {\mathcal{S}}_i\subseteq \mathcal{S} \) and \( {\mathcal{S}}_i\ne \varnothing \);

  2. \( \forall {\mathcal{S}}_i,{\mathcal{S}}_j\in {\mathcal{S}}_{candidate} \): \( {\mathcal{S}}_i\not\subseteq {\mathcal{S}}_j \) and \( {\mathcal{S}}_j\not\subseteq {\mathcal{S}}_i \);

  3. \( \left|{\mathcal{S}}_{candidate}\right| \) should be as large as possible.

The node similarity ensemble strategy is a k-combination problem over \( \mathcal{S} \). If \( k=\left\lfloor \frac{m}{2}\right\rfloor \), where ⌊·⌋ is the floor function, \( {\mathcal{S}}_{candidate} \) satisfies the above constraints: by Sperner's theorem, the subsets of size \( \left\lfloor \frac{m}{2}\right\rfloor \) form a largest family in which no set contains another. The optimal node similarity ensemble set is therefore represented as \( {\mathcal{S}}_{candidate}=\left\{{\mathcal{S}}_i\ \left|\ {\mathcal{S}}_i\subseteq \mathcal{S},\right|{\mathcal{S}}_i|=\left\lfloor \frac{m}{2}\right\rfloor \right\} \).
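Enumerating this candidate set is a one-liner with itertools; a sketch (the measure names in the example are placeholders):

```python
from itertools import combinations

def similarity_ensembles(measures):
    """All size-floor(m/2) subsets of the m similarity measures: no chosen
    set contains another, and the family is as large as possible."""
    m = len(measures)
    return [set(c) for c in combinations(measures, m // 2)]

# e.g. m = 4 -> C(4, 2) = 6 candidate node similarity sets
print(similarity_ensembles(["CN", "JC", "AA", "PA"]))
```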

4.2 Triplet Sampling Strategy

For a large-scale network G, it is hard to compute the complete loss in (14) since the number of triplets in \( \mathcal{T} \) is extremely large. A naive approach to this problem is to sample triplets from \( \mathcal{T} \) uniformly [2], i.e., to replace \( {\sum}_{\left({v}_i,{v}_j,{v}_{j^{\prime}}\right)\in \mathcal{T}} \) with \( {\mathbb{E}}_{\left({v}_i,{v}_j,{v}_{j^{\prime}}\right)\sim \mathcal{T}} \) in (14). However, with this naive sampling, low-degree nodes are less likely to be sampled into triplets than high-degree nodes, so the representations of low-degree nodes are rarely updated. Since the degree distributions of complex networks follow a power law [24], the majority of nodes have low degree, and the naive approach is thus ineffective for most nodes of the network.

The triplet sampling strategy used in this work is defined as follows [2]: for a node vi, one node is randomly sampled from each of vi's neighborhoods (Ni1, Ni2, etc.), and triplets are then generated from these sampled nodes. The constraints for vi are \( {\mathcal{E}}_{ik}<{\mathcal{E}}_{i{k}^{\prime}} \) for all \( k<{k}^{\prime} \), i.e., \( {\mathcal{E}}_{i1}<{\mathcal{E}}_{i2}<\cdots <{\mathcal{E}}_{iK} \), where K is the radius of the network.
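A sketch of this sampling step for one target node, using networkx BFS; the helper name sample_triplets is ours.

```python
import random
import networkx as nx

def sample_triplets(G, v, K):
    """Pick one node from each k-hop neighborhood of v (k = 1..K), then emit
    every (v, u_k, u_k') triple with k < k'."""
    dist = nx.single_source_shortest_path_length(G, v, cutoff=K)
    picks = {}
    for k in range(1, K + 1):
        ring = [u for u, d in dist.items() if d == k]
        if ring:
            picks[k] = random.choice(ring)
    ks = sorted(picks)
    return [(v, picks[k], picks[kp])
            for a, k in enumerate(ks) for kp in ks[a + 1:]]
```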

Using the triplet sampling strategy, the parameters of view-specific representation learning can be optimized. The parameters (Θ1, b1, Θμ, bμ, ΘΣ, bΣ) of the neural network model are optimized such that the loss L in (14) is minimized and the sampled triplet set \( \mathcal{T} \) satisfies the constraints. The parameters are optimized using Adam [13] with a fixed learning rate of 0.001.

5 Experimental results

5.1 Experiment Setup

Datasets

Experiments are conducted on three real-world datasets to evaluate the performance of the proposed model: Citeseer, Cora and Pubmed-Diabetes. Detailed information about these datasets is given in Table 1: nodes of Citeseer and Cora have 6 labels, and nodes of Pubmed-Diabetes have 3 labels. Compared with Cora (average degree 4.01) and Pubmed-Diabetes (average degree 4.50), Citeseer (average degree 2.86) is sparser. The three datasets are pre-processed into connected networks. Since the scale of Pubmed-Diabetes (19,717 nodes) is much larger than that of the other two datasets, we sample it down to a network with around 2,000 nodes and an average degree of around 4.5.

Table 1 Statistics of the real-world networks

Node similarity measurements

Eleven node similarity measurements are used in this work to represent the network structural information; detailed information is given in Table 2. The common neighbors method (CN) [25] is a typical local similarity approach which measures two nodes' similarity by the number of their one-hop common neighbors. The Adamic-Adar index (AA) [25], the resource allocation index (RA) [15] and resource allocation based on common neighbor interactions (RA-CNI) [15] improve CN by differentiating the influences of the nodes' one-hop neighbors. The Jaccard index (JC) [25] improves CN by measuring the ratio of two nodes' one-hop common neighbors over their complete one-hop neighbor set. The Salton index (SA) [15], also known as cosine similarity, and the Sorensen index (SO) [15] improve JC by using different normalizations of each node's one-hop neighbor set. In practice, the node similarity measured by SA is approximately twice the value of JC, and SO is less sensitive to outliers than JC. The hub promoted index (HPI) [15] measures node similarity as the ratio of two nodes' one-hop common neighbors over the minimum of the two nodes' one-hop neighborhood sizes, while the hub depressed index (HDI) [15] uses the maximum instead. HPI discourages link formation between hub nodes and encourages link formation between low-degree nodes and hubs; HDI does the opposite. The local Leicht-Holme-Newman index (LLHN) [15] is defined as the ratio of the actual number of length-two paths between two nodes to a value proportional to their expected number. The preferential attachment index (PA) [15] is based on the phenomenon that node degrees in many real networks follow a power-law distribution, so two nodes with large degrees are more likely to form a link.

Table 2 Node similarity measurements used in this work
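Several of these indices are available directly in networkx; a short sketch on a stand-in graph (the paper's experiments use the directed citation networks of Section 5.1, not this toy graph):

```python
import networkx as nx

G = nx.karate_club_graph()  # stand-in undirected toy graph
u, v = 0, 33

cn = len(list(nx.common_neighbors(G, u, v)))              # CN
jc = next(nx.jaccard_coefficient(G, [(u, v)]))[2]         # JC
aa = next(nx.adamic_adar_index(G, [(u, v)]))[2]           # AA
ra = next(nx.resource_allocation_index(G, [(u, v)]))[2]   # RA
pa = next(nx.preferential_attachment(G, [(u, v)]))[2]     # PA
print(cn, jc, aa, ra, pa)
```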

Baseline methods

Four of the most popular network embedding models are used as the baselines of this work:

  • DeepWalk: DeepWalk [17, 18] uses random walks to obtain node sequences, and the Skip-Gram algorithm is then used to learn the node representation vectors.

  • Node2Vec: Node2Vec [8, 18] uses a biased random walk to obtain node sequences and then uses the Skip-Gram algorithm to learn the node representation vectors.

  • LINE: LINE [18, 23] is a popular network embedding method which preserves first- and second-order proximity information.

  • GraRep: GraRep [4, 18] performs matrix factorization based on random walks of different step sizes.

Parameter settings

The dimension of the latent representation is set to 16 for all datasets, i.e., d = 16. The baseline methods use their default parameter settings. For the proposed model, the dimension of each single view is 16, the dimension of the common representation is \( \frac{16}{2}=8 \), and the dimension of the Gaussian distribution in the view-specific representation is \( \frac{16}{4}=4 \). For view-specific representation learning, we use a neural network with two hidden layers of sizes {3000, 600}.

5.2 Performance Evaluation

Network embedding performance is verified on the node classification task, in which a Support Vector Machine [14] is used to classify nodes. Macro-F1 and Micro-F1 scores are used to evaluate node classification performance. For each dataset, different ratios of the labelled nodes are randomly selected as training data, and the remaining nodes are used as test data. The results are averaged over 50 runs with different sampled training data.
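A sketch of this evaluation protocol with scikit-learn; the default SVC kernel and the seeding scheme are our choices, not reported by the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def evaluate_embedding(V, labels, train_ratio, runs=50):
    """Average Micro-/Macro-F1 of an SVM over repeated random splits."""
    micro, macro = [], []
    for run in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            V, labels, train_size=train_ratio, random_state=run)
        y_pred = SVC().fit(X_tr, y_tr).predict(X_te)
        micro.append(f1_score(y_te, y_pred, average="micro"))
        macro.append(f1_score(y_te, y_pred, average="macro"))
    return np.mean(micro), np.mean(macro)
```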

Let MvNR denote the proposed multi-view network embedding model; node classification performances with MvNR are shown in Figure 1. The results show that: (1) Node classification performances with MvNR are superior to those of the baseline methods: both Micro-F1 and Macro-F1 scores with MvNR are higher than with the baselines on all three experimental datasets. (2) When the training ratio is low, node classification performances with MvNR are significantly higher than those of the baselines. For example, at a training ratio of 0.1, compared with the best Micro-F1 and Macro-F1 scores achieved by the baselines, the scores with MvNR are about 15% and 25% higher on Citeseer, about 30% and 40% higher on Cora, and about 30% and 40% higher on Pubmed-Diabetes. (3) Node classification performances with MvNR are less sensitive to the training ratio. The baselines, especially the random walk based methods DeepWalk and Node2Vec, are sensitive to the training ratio: their performance decreases significantly as the training ratio decreases. With MvNR, performance changes only slightly as the training ratio decreases.

Figure 1 Node classification performances with the proposed multi-view network embedding model

DeepWalk and Node2Vec preserve global node similarities between the active node and nodes selected via the co-occurrence probability of random walks over the network. LINE and GraRep preserve local node similarities: LINE measures node similarity between the active node and nodes at first- and second-order proximity, while GraRep measures similarity up to higher-order proximities. When the training ratio is low, the node similarities used in the baselines can hardly describe the overall network structure, which limits their node classification performance. Since MvNR preserves network information with a node similarity ensemble, the involved node similarities reflect the network structural information from different aspects, so it can describe the overall network structure with a limited number of training samples. This explains its superior node classification performance at low training ratios.

Figure 2 compares the node classification performances of MvNR with the average performances obtained by preserving each individual node similarity listed in Table 2. Node classification performances with MvNR are significantly better than those obtained by preserving a single node similarity: Micro-F1 and Macro-F1 scores with MvNR are about 6% and 10% higher on Citeseer, about 8% and 12% higher on Cora, and about 4% and 8% higher on Pubmed-Diabetes. Table 3 gives the detailed node classification performances obtained by preserving each single node similarity on the Pubmed-Diabetes dataset. With a single node similarity, the highest Micro-F1 and Macro-F1 scores are 0.765 and 0.716 respectively, while the lowest are 0.607 and 0.469. The selected node similarities yield reasonable node classification performances compared with the baselines in Figure 1. By preserving the node similarity ensemble with a common latent space and view-specific latent spaces, MvNR achieves better node classification performances than preserving any single node similarity, and also better than the average performance over all single node similarities.

Figure 2 Comparison of node classification performances with the proposed model and the average node classification performances preserving each single node similarity

Table 3 Node classification performances preserving single node similarity and node similarity ensemble on Pubmed-Diabetes dataset

To sum up, by preserving a node similarity ensemble, the proposed multi-view network embedding model achieves better node classification performances than network embedding models preserving a single node similarity or a simple combination of node similarities. It is only slightly influenced by the training ratio and delivers strong node classification performance even with very low training ratios.

6 Conclusions

This work proposes a novel multi-view network embedding model with node similarity ensemble. Instead of preserving a single node similarity in the network embedding, the proposed method generates multiple views of latent spaces preserving various node similarity sets. To maximize the preserved network information while minimizing information redundancy, a common representation is extracted from the multiple views and merged with multiple view-specific representations. The common representation is extracted with a CCA based model, and the view-specific representations are extracted with a neural network based model analyzing Gaussian distributions of nodes. Node classification performances with the proposed method are superior to those of existing network embedding models, especially when only a very limited number of training samples is available.