1 Introduction

With the development of the Internet, information networks are ubiquitous in the real world (e.g., academic networks [21, 22] and movie recommendation networks [18]). These networks have attracted considerable research attention, and one of the most popular analysis methods is network embedding, i.e., network representation learning. Network embedding maps a network into a low-dimensional space in which neighborhood similarities between nodes can be measured directly, and it has proved highly effective for traditional tasks such as node classification [2], clustering [3], and link prediction [1].

Traditional network embedding approaches (e.g., DeepWalk [15], node2vec [6], LINE [24]) mainly focus on representation learning for networks with a single type of relationship. Later, researchers proposed convolutional neural network embedding methods (e.g., GCN [8], GraphSAGE [7]) that encode both network structure and node features. However, most practical information networks are in fact multi-view networks involving multiple types of relationships and node features. Figure 1 shows an example of a multi-view network. In this academic network, relationships between authors include the co-author relationship, which indicates whether two authors have collaborated on a paper, and the citing relationship, which indicates whether one author has cited the other's papers. Author features may include research interests and the titles of the papers they write.

Fig. 1. An example of a multi-view network, which contains multiple types of relationships between nodes and different node features in each single view.

Recently, the multi-view network embedding problem has received increasing attention, and various methods for learning representations from multiple views perform well on many applications [4]. Nevertheless, they suffer from specific limitations. Methods based on matrix factorization [5, 10, 20] incur expensive computational costs and are thus unsuitable for large-scale data. Clustering methods [12, 27, 28] neglect the different importance of different views, lacking any weight learning. More recent deep learning methods [16, 19] do not utilize node features, which can promote embedding performance beyond the network structure alone.

In this paper, we propose Intra-view and Inter-view attention for Multi-view Network Embedding (I2MNE), a novel method that overcomes the limitations mentioned above. The intra-view attention specifies the different importance of neighbors when aggregating neighbor features of each node within a single view. Similarly, the inter-view attention assigns different significance to views when integrating representations across views. The attention weights and the representations can be efficiently trained through the back-propagation algorithm.

We conduct experiments on two real-world datasets from different domains and evaluate the effectiveness of our approach on the classification task. The experimental results show that I2MNE outperforms other state-of-the-art methods. To summarize, we make the following contributions:

  • We propose I2MNE to study multi-view network embedding, which aims to learn node representations by leveraging structure and feature information from multiple views.

  • We introduce intra-view attention when aggregating the node features from neighbors and inter-view attention when integrating representations across different views to learn robust node representations.

  • We conduct experiments on two real-world multi-view networks. Experimental results demonstrate the effectiveness and efficiency of our proposed approach over many competitive methods.

The rest of this paper is organized as follows. In Sect. 2, we describe the problem definition. In Sect. 3, we present the I2MNE algorithm for multi-view network embedding in detail. In Sect. 4, we analyze the learned node representations and compare the proposed model with existing network embedding approaches on two real-world datasets. Conclusions are given at the end.

2 Problem Definition

In this section, we formally define the problem of network embedding in multi-view network. Firstly, the multi-view network is defined as follows:

Definition 1 (Multi-view Network)

A multi-view network is defined as \(G=(V,E,C)\), where V is the set of nodes representing objects; \(E=\cup _{k=1}^K E_k\), where K is the number of views and \(E_k \subseteq V \times V\) is the set of edges representing relationships between nodes in view k; and \(C = \cup _{k=1}^K C_k\), where \(C_k=\{{\varvec{f}}_i, \forall n_i \in V\}\) denotes the features of objects in view k, \({\varvec{f}}_i \in \mathbb {R}^F\) denotes the features of node \(n_i\), and F is the dimension of the features.

Our goal is to learn the node representations in the multi-view network. We define this problem as follows:

Definition 2 (Multi-view Network Embedding)

Given a multi-view network, denoted as \(G=(V,E,C)\), the aim of multi-view network embedding is to learn low-dimensional representations \(O \in \mathbb {R}^{|V|\times d}\), where \(d \ll |V|\) is the number of embedding dimensions.
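
To make the definitions concrete, the following is a minimal sketch of how a multi-view network \(G=(V,E,C)\) could be stored in memory; the class and method names are illustrative assumptions, not part of the paper.

```python
import numpy as np

class MultiViewNetwork:
    """A multi-view network G = (V, E, C) as in Definition 1 (illustrative sketch)."""

    def __init__(self, num_nodes, num_views, feat_dim):
        self.num_nodes = num_nodes        # |V|
        self.num_views = num_views        # K
        # E_k: one adjacency structure per view
        self.edges = [dict() for _ in range(num_views)]
        # C_k: one feature matrix per view, each row is f_i in R^F
        self.features = [np.zeros((num_nodes, feat_dim)) for _ in range(num_views)]

    def add_edge(self, view, i, j):
        """Add an undirected edge (i, j) to E_view."""
        self.edges[view].setdefault(i, set()).add(j)
        self.edges[view].setdefault(j, set()).add(i)

    def neighbors(self, view, i):
        """Return N_i, the neighbors of node i in the given view."""
        return self.edges[view].get(i, set())
```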

Fig. 2. The intra-view attention, which specifies different importance to the neighbors (\(n_2,n_3,n_4 \in \mathcal {N}_1\) in this example) of each node (\(n_1\) in this example).

3 Our Approach

In this section, we present our approach, I2MNE, which embeds the nodes of a multi-view network, together with their features, into a common space. We first aggregate the features of each node's neighbors within each single view to encode both node proximities and node features. Then we integrate the node representations across the different views. Inspired by recent progress on the attention mechanism [11], we introduce the intra-view attention (shown in Fig. 2) to automatically assign different weights to a node's neighbors and the inter-view attention (shown in Fig. 3) to assign different weights to the views.

3.1 Embedding Generation

Intra-view Attention. In each single view, for each node \(n_i\) with feature \({\varvec{f}}_i\) and neighbor features \({\varvec{F}}_i \in \mathbb {R}^{|\mathcal {N}_i|\times F}\), where \(\mathcal {N}_i\) denotes the neighbors of \(n_i\), we first introduce a content-based score function as follows:

$$\begin{aligned} {\varvec{s}}_i = \mathrm {score}({\varvec{f}}_i, {\varvec{F}}_i)={\varvec{f}}_i W_a {\varvec{F}}_i^T \end{aligned}$$
(1)

where \(W_a \in \mathbb {R}^{F \times F}\) is the intra-view attention weight matrix. This function indicates the importance of each neighbor's features to node \(n_i\). To make the scores comparable across different nodes, we normalize them with a softmax function as follows:

$$\begin{aligned}{}[{\varvec{a}}_i]_j=\mathrm {softmax}_j({\varvec{s}}_i) = \frac{\exp ([{\varvec{s}}_i]_j)}{\sum _{j'=1}^{|\mathcal {N}_i|}\exp ([{\varvec{s}}_i]_{j'})} \end{aligned}$$
(2)

where \([\cdot ]_j\) means the j-th value of the vector and \(j \in \{1,\cdots ,|\mathcal {N}_i|\}\).
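
As an illustration, the following NumPy sketch computes Eqs. (1) and (2) for a single node; the function name and the max-subtraction (a standard numerical-stability trick) are our assumptions, not the paper's code.

```python
import numpy as np

def intra_view_attention(f_i, F_i, W_a):
    """f_i: (F,) node features; F_i: (|N_i|, F) neighbor features; W_a: (F, F)."""
    s_i = f_i @ W_a @ F_i.T          # Eq. (1): content-based scores, shape (|N_i|,)
    e = np.exp(s_i - s_i.max())      # subtract the max for numerical stability
    return e / e.sum()               # Eq. (2): softmax-normalized attention a_i
```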

Then, we introduce the context vector \({{\varvec{c}}}_i\), which captures the relevance between the features of \(n_i\) and those of its neighbors \(\mathcal {N}_i\), using the normalized attention scores as follows:

$$\begin{aligned} {{\varvec{c}}}_i = {{\varvec{a}}}_i {{\varvec{F}}}_i \end{aligned}$$
(3)

Next, we use a weight matrix \(W \in \mathbb {R}^{3F \times F'}\), where \(F'\) is the dimension of the hidden layer, to obtain the aggregated vector as follows:

$$\begin{aligned} {{\varvec{f}}}'{}_i = \sigma (W[{{\varvec{f}}}_i\oplus {{\varvec{f}}}_{\mathcal {N}_i}\oplus {{\varvec{c}}}_i]) \end{aligned}$$
(4)

where \({{\varvec{f}}}_{\mathcal {N}_i}\) is the feature vector aggregated from the neighbors of \(n_i\), \(\sigma \) is the activation function, and \(\oplus \) is the concatenation operation. The aggregation strategy for \({{\varvec{f}}}_{\mathcal {N}_i}\) can be chosen as in [7] (e.g., the mean aggregator).

Finally, we normalize the aggregated vector to obtain the hidden representation \({{\varvec{z}}}_i\), sampling neighbors up to depth P as in [7]. The details are given in steps 1–12 of Algorithm 1.
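
Putting Eqs. (3) and (4) together, a sketch of one intra-view aggregation step might look as follows; the mean aggregator for \({{\varvec{f}}}_{\mathcal {N}_i}\) and the ReLU activation are assumed choices, with [7] allowing others.

```python
import numpy as np

def aggregate_step(f_i, F_i, a_i, W):
    """f_i: (F,); F_i: (|N_i|, F); a_i: attention weights from Eq. (2); W: (3F, F')."""
    c_i = a_i @ F_i                             # Eq. (3): context vector, shape (F,)
    f_neigh = F_i.mean(axis=0)                  # f_{N_i}: mean aggregator, one choice from [7]
    h = np.concatenate([f_i, f_neigh, c_i])     # concatenation, shape (3F,)
    f_prime = np.maximum(0.0, h @ W)            # Eq. (4) with ReLU as sigma
    return f_prime / (np.linalg.norm(f_prime) + 1e-12)  # normalize to get z_i
```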

Fig. 3. The inter-view attention, which assigns different importance to the hidden representations from different views.

Inter-view Attention. For each hidden representation \({{\varvec{z}}}_i^k\) of view k, where \(k \in \{1,2,\cdots ,K\}\), we introduce the inter-view attention as follows:

$$\begin{aligned} {\varvec{s}}'^{k}_i=\mathrm {score}'({\varvec{z}}_i^k,\bar{{\varvec{Z}}}_i^k)= {\varvec{z}}_i^k W_{a'}^k (\bar{{\varvec{Z}}}_i^k)^T \end{aligned}$$
(5)
$$\begin{aligned}{}[{\varvec{a}}'_{k}]_j = \mathrm {softmax}_j({\varvec{s}}'^{k}_i)=\frac{\exp ([{\varvec{s}}'^{k}_i]_j)}{\sum _{j'=1}^{K-1}\exp ([{\varvec{s}}'^{k}_i]_{j'})} \end{aligned}$$
(6)

where \(\bar{{\varvec{Z}}}_i^k\) is the matrix of all hidden representations of node \(n_i\) except that of view k, \(j \in \{1,\cdots ,K-1\}\), and \(W_{a'}^k \in \mathbb {R}^{F' \times F'}\) is the inter-view attention weight matrix.

Then, the view-context vector \({{\varvec{v}}}_i^k\) is defined to integrate the hidden representations from the other views as follows:

$$\begin{aligned} {\varvec{v}}_i^k = {\varvec{a}}'_k \bar{{\varvec{Z}}}_i^k \end{aligned}$$
(7)

Finally, the node representation is defined as follows:

$$\begin{aligned} {{\varvec{o}}}_i = \sigma (W'[Z_i \oplus V_i]) \end{aligned}$$
(8)

where \(Z_i=[{{\varvec{z}}}_i^1 \oplus {{\varvec{z}}}_i^2 \oplus \cdots \oplus {{\varvec{z}}}_i^K]\), \(V_i=[{{\varvec{v}}}_i^1 \oplus {{\varvec{v}}}_i^2 \oplus \cdots \oplus {{\varvec{v}}}_i^K]\), and \(W' \in \mathbb {R}^{2KF' \times d}\) is the weight matrix. The details are given in steps 13–14 of Algorithm 1.
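
For completeness, here is a sketch of Eqs. (5)–(8) for one node, assuming its K hidden representations are stacked row-wise in a matrix Z; the variable names are ours, not the paper's.

```python
import numpy as np

def inter_view_attention(Z, W_a2, W_out):
    """Z: (K, F') hidden representations z_i^k; W_a2: list of K (F', F') matrices;
    W_out: (2*K*F', d). Returns the final representation o_i of shape (d,)."""
    K, _ = Z.shape
    view_contexts = []
    for k in range(K):
        Z_bar = np.delete(Z, k, axis=0)      # all views except view k
        s = Z[k] @ W_a2[k] @ Z_bar.T         # Eq. (5): scores, shape (K-1,)
        e = np.exp(s - s.max())
        a = e / e.sum()                      # Eq. (6): softmax over the other views
        view_contexts.append(a @ Z_bar)      # Eq. (7): view-context vector v_i^k
    V = np.concatenate(view_contexts)        # V_i, shape (K*F',)
    ZV = np.concatenate([Z.reshape(-1), V])  # Z_i concatenated with V_i, shape (2*K*F',)
    return np.maximum(0.0, ZV @ W_out)       # Eq. (8) with ReLU as sigma
```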

Algorithm 1. Embedding generation of I2MNE (steps 1–12: intra-view aggregation; steps 13–14: inter-view integration).

3.2 Parameters Learning of I2MNE

For each single view, we apply a graph-based loss function with negative sampling [13, 14] to \({{\varvec{z}}}_i, \forall n_i \in V\), as follows:

$$\begin{aligned} L_s({\varvec{z}}_i)=-\log \big (\sigma ({\varvec{z}}_i^T{\varvec{z}}_j)\big )-\sum _{n=1}^N \mathbb {E}_{v_n\sim P_n(n_j)}\log \big (\sigma (-{\varvec{z}}_i^T{\varvec{z}}_{v_n})\big ) \end{aligned}$$
(9)

where \(n_j\) is a node that co-occurs near \(n_i\) on a fixed-length random walk, \(v_n\) is a negative sample, \(\sigma (x)=1/(1+\exp (-x))\) is the sigmoid function, N is the number of negative samples, and \(P_n\) denotes the negative sampling distribution. This loss encourages the embeddings of connected nodes to be similar to each other while enforcing that the embeddings of disparate nodes are highly distinct.
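
A sketch of Eq. (9) for one positive pair is given below; uniform negative sampling stands in for the distribution \(P_n\), which the paper does not restrict to this choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_loss(z_i, z_j, Z_all, num_neg=5, seed=0):
    """z_i, z_j: embeddings of a co-occurring pair; Z_all: (|V|, d') all embeddings."""
    rng = np.random.default_rng(seed)
    loss = -np.log(sigmoid(z_i @ z_j) + 1e-12)            # pull co-occurring nodes together
    for v in rng.integers(0, Z_all.shape[0], size=num_neg):
        loss -= np.log(sigmoid(-z_i @ Z_all[v]) + 1e-12)  # push negative samples apart
    return loss
```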

For the multi-view integration, we apply a cross-entropy loss for the classification task as follows:

$$\begin{aligned} L_{m} = \sum _{n_i\in S}L({\varvec{o}}_i,y_i) \end{aligned}$$
(10)

where S is the set of labeled nodes, \(y_i\) is the label of node \(n_i\), and L is the cross-entropy loss function.

With the above definitions, the overall loss function is defined as follows:

$$\begin{aligned} L = \sum _{n_i\in V}L_s({{\varvec{z}}}_i) + L_m \end{aligned}$$
(11)

Our objective is to minimize the overall loss, which can be efficiently optimized with the back-propagation algorithm [17]. Following the suggestion of [16], in each iteration we first optimize the graph-based loss on each single view, learning the hidden representations and tuning the parameters of the intra-view attention; we then update the parameters of the inter-view attention on the labeled data by optimizing the cross-entropy loss.

4 Experiment

4.1 Datasets

The detailed statistics of the datasets are shown in Table 1.

Table 1. Statistics of datasets

AMiner. We use the AMiner dataset [25]Footnote 1 to analyze the research fields of authors. The multi-view author network contains two views: the co-author view and the citing view. The former connects authors who have published a paper together, and the latter connects authors who cite one another's papers. In the co-author view, the node features are the authors' research interests; we learn these textual features with word2vec [13] pre-trained on English Wikipedia with Skip-Gram.Footnote 2 In the citing view, the node features are the titles of all the papers published by each author; we learn these textual features with doc2vec [9] pre-trained on English Wikipedia with DBOW. We choose all the papers from the most popular venuesFootnote 3 in the eight research fields defined by [23] and select all the authors who publish these papers. The filtered dataset contains 16,604 labeled authors.

Flickr. We use the Flickr dataset [26]Footnote 4 to analyze the community memberships of users. The multi-view user network includes the friendship view and the tag-similarity view. For the latter, we first compute user similarity from their tags using TF-IDF and then connect each user to its 100 nearest neighbors. The textual features are the users' tags, learned by word2vec.
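
The tag-similarity view could be built as in the following scikit-learn sketch; the toy tag strings are placeholders, and since the paper does not specify the distance metric, cosine distance is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

user_tag_docs = ["sunset beach travel", "beach surf travel", "food recipe baking"]  # toy stand-in
k = min(100, len(user_tag_docs) - 1)        # 100 nearest neighbors in the paper

X = TfidfVectorizer().fit_transform(user_tag_docs)               # users x tag vocabulary
nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)
_, idx = nn.kneighbors(X)                                        # row u: u itself + its k nearest
edges = {(u, v) for u in range(X.shape[0]) for v in idx[u][1:]}  # tag-similarity edges
```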

4.2 Compared Algorithms

In this section, our proposed approach is compared with the following methods for performance analysis:

Single-view Methods:

  • LINE [24]: A network embedding method without node features.

  • DeepWalk/node2vec [6, 15]: We find no significant difference between their results, so we use node2vec with \(p=1\) and \(q=1\) [6] for comparison.

  • GraphSAGE [7]: A network embedding method with node features.

  • I2MNE-Intra: A variant of our proposed method that only uses the intra-view attention on single views.

Multi-view Methods:

  • *-concat: We concatenate the embeddings learned from all single views by node2vec, GraphSAGE, and I2MNE, respectively.

  • *-mean: We average the embeddings learned from all single views by node2vec and GraphSAGE, respectively.

  • MVE [16]: A multi-view network embedding method with an attention mechanism. It can also be applied to our problem, but it cannot utilize node features and its attention differs from ours.

  • I2MNE-NoInter: A variant of our proposed method without the inter-view attention; it averages the embeddings of all single views.

  • I2MNE: Our proposed method for multi-view network embedding with node features.

4.3 Parameter Settings

For all methods except the *-concat ones, the embedding dimension is set to 128 by default; for the concatenation methods, it is 128K, where K is the number of views. The feature dimension is set to 300, the number of negative samples to 5, and the learning rate to 0.001. For node2vec, we set the walk length to 40 and the window size to 10. All embedding vectors are finally normalized.
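
For reference, the settings above can be collected in one place; this is purely a convenience sketch, not the paper's code.

```python
config = {
    "embed_dim": 128,          # 128 * K for the *-concat methods
    "feat_dim": 300,
    "num_negatives": 5,
    "learning_rate": 0.001,
    "walk_length": 40,         # node2vec
    "window_size": 10,         # node2vec
    "normalize_embeddings": True,
}
```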

4.4 Results

We evaluate the network embeddings on the classification task, using a logistic regression classifier fed with the embeddings of all labeled nodes. We use \(75\%\) of the nodes as training data and the rest for testing. Each classification experiment is repeated independently 10 times, and the averaged Macro-F1 and Micro-F1 measures are reported in Table 2. Note that, for single-view methods, the best result over all single views is reported.
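
A scikit-learn sketch of this evaluation protocol is given below; the toy embeddings and labels are placeholders for the learned matrix O and the node labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
O = rng.normal(size=(200, 128))     # toy embeddings; replace with the learned matrix O
y = rng.integers(0, 8, size=200)    # toy labels (e.g., the eight research fields)

macro, micro = [], []
for seed in range(10):              # 10 independent repetitions
    Xtr, Xte, ytr, yte = train_test_split(O, y, train_size=0.75, random_state=seed)
    pred = LogisticRegression(max_iter=1000).fit(Xtr, ytr).predict(Xte)
    macro.append(f1_score(yte, pred, average="macro"))
    micro.append(f1_score(yte, pred, average="micro"))
print(np.mean(macro), np.mean(micro))
```

From this table, we have the following observations: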

  1. For the single view, I2MNE-Intra outperforms the other single-view methods. In addition, I2MNE-concat outperforms the other concatenation methods on the multi-view network. These results indicate that the intra-view attention captures the impact of neighbors' features on each node and improves performance.

  2. On the Flickr dataset, I2MNE achieves significant improvements over all other methods on both measures. I2MNE-NoInter already outperforms all baselines in terms of Macro-F1, owing to the intra-view attention, and I2MNE gains a further 0.0534 in Macro-F1 over I2MNE-NoInter. This shows that the inter-view attention plays an important role in our method.

  3. On the AMiner dataset in particular, I2MNE achieves significant improvements over the other methods, with gains of \(0.0768 \sim 0.1125\) in Macro-F1. In addition, we observe that I2MNE achieves the best performance in terms of Macro-F1 while I2MNE-NoInter obtains the best result in terms of Micro-F1. This is perhaps because the classes in the AMiner dataset are imbalanced, so Macro-F1, which treats all classes equally, is more informative than Micro-F1, which treats all instances equally. The result also suggests that the inter-view attention preserves more distinction between classes than the intra-view attention alone.

Table 2. Results of classification on both datasets.

5 Conclusion

In this paper, we propose I2MNE, a multi-view network embedding method for networks with node features that relies on two effective attention mechanisms. We introduce the intra-view attention to leverage the feature information from neighbors and the inter-view attention to make full use of the information from different views. Experiments on two real-world datasets demonstrate that our model is effective and efficient for multi-view network embedding. In the future, we plan to apply our model to more tasks (e.g., link prediction) and to investigate the embedding of networks with edge features.