
1 Introduction

Network data are ubiquitous in the real world, ranging from social networks like WeChat and Facebook, marketing networks and airline transportation networks to academic citation networks, to name a few. These networks conceal abundant knowledge that can benefit real-world network analysis and applications. For instance, in social networks, link prediction could lower the cost and difficulty for users seeking friends online, and offer service providers a chance to improve the user experience. As the size of networks grows, opportunities come with challenges. On the one hand, larger networks enrich the available data and provide ample material for network researchers. On the other hand, the increasingly complex relationships coupled within these networks dramatically raise the difficulty of analysis tasks.

Recently, network embedding has been proposed as a novel dimensionality reduction technique for analyzing large-scale networks, and it has attracted a surge of attention in fields ranging from data mining and machine learning to mathematics. The main goal of network embedding is to preserve as much information from the network as possible in a low-dimensional representation of each node. To achieve this goal, multiple approaches have been proposed, such as GraRep [2], DeepWalk [14], LINE [17] and SDNE [20]. More importantly, many real-world applications have demonstrated their value in downstream learning tasks such as node classification, link prediction and data visualization.

Despite these improvements, current work on network embedding mostly concentrates on preserving the structure of pure networks. In the real world, nodes in a network are usually accompanied by rich side information, such as attributes and labels. Attribute homophily theories [9, 10] show a strong connection between node attributes and topological structure: they depend on and influence each other in the network. For instance, articles in Wikipedia not only cite or are cited by other related articles, but also contain a detailed explanation of a specific object, which helps link prediction tasks precisely provide editors with highly related articles. Moreover, labels such as group or community categories also provide useful information to assist network learning; even a limited number of labeled nodes can guide a discriminative embedding. Taking WeChat as an example, users in the same group chat tend to share posts or links on related themes, which is informative for precise advertisement targeting. Thus, the importance of side information is self-evident, while network embeddings that ignore side information not only weaken expressive power but also blur the representations.

However, it is not easy for pure network embedding methods like DeepWalk to incorporate additional information during their random walks on the original network, since heterogeneity and incompleteness complicate the situation. Thus, applying pure network embedding methods directly is problematic. In contrast to pure network embedding, side information network embedding aims to leverage the discrepancy between heterogeneous data sources and distill the complementary information. Moreover, attributes and labels may be sparse, noisy and incomplete. Hence, it is nontrivial to fuse labels and attributes into the network structure and learn discriminative representations for network nodes. Some recent works have scratched the surface of this topic, yet various problems remain: they either lack careful, specific treatment of side information or are trapped in time-consuming learning models. A detailed discussion is given in Sect. 2.

In this paper, we investigate side information network embedding in depth. Inspired by the groundbreaking work DeepWalk and its constructive follow-up Node2vec on pure networks, we propose an innovative random walk scheme to integrate multiple sources of knowledge on side information networks. We aim to answer the following questions: (1) How can topological structure, attribute information and node labels be incorporated into a unified representation while tackling the accompanying incompleteness, sparsity and noise? (2) How much does this random walk scheme contribute to downstream learning tasks like node classification?

Our main contribution is a flexible framework for learning latent representations for the attributed network with a limited number of labels, called Side Information Network Embedding (SINE). The key ideas of SINE are:

  • Measure each node's relationships with others in terms of attribute information, and then evaluate the importance of attributes and geometric structure for each node individually. In contrast to treating the information of every node uniformly, learning discriminatively over the data makes a delicate embedding possible.

  • Establish label hubs and label hyperlinks so that labeled nodes can communicate with each other explicitly, and design a label biased random walk scheme to integrate label information implicitly.

  • Generate sampled contexts (neighborhoods) for nodes, which contain immediate geometric neighbors, nodes similar in attributes, and nodes explicitly or latently sharing the same label. In such an all-sided neighborhood built from heterogeneous relationships, nodes can be modeled with more precise representations: the more frequently two nodes appear in similar neighborhoods, the more likely they possess similar information.

The rest of this paper is organized as follows: first, a brief overview of pure network and side information network embedding is provided, followed by the proposed SINE framework; then, extensive experiments are presented; finally, conclusions and future work are discussed.

2 Related Work

Network embedding can be traced back to manifold learning, which aims to analyze the structure of a manifold and map it into a low-dimensional Euclidean space to facilitate machine learning algorithms. However, such methods, including IsoMap [18], LLE [16], LE [1] and LPP [5], are trapped in time-consuming eigen-decompositions and are not applicable to large-scale network embedding.

Recently, inspired by Skip-Gram [11], which learns word representations from their contexts, DeepWalk [14] generates node neighborhoods with truncated random walks to simulate the relationship between words and sentences, and has brought prosperity to the embedding community. In Node2vec [4], a follow-up to DeepWalk, the authors propose a biased random walk that explores neighborhoods under the control of extra parameters. To preserve structural similarity, Struc2vec [15] generates node contexts on a new graph constructed from structural similarity. Along another line of pure network embedding, a variety of methods [2, 15, 17, 20] have been proposed. For example, to preserve first- and second-order proximity, LINE [17] proposes a joint probability and conditional probability model, while SDNE [20] adopts an autoencoder model. However, losing sight of labels and attributes may limit the performance of all these topological structure based methods.

Some recent efforts have explored the possibility of integrating node side information to learn better representations. TADW [21] employs inductive matrix factorization to integrate attributes. SNE [8] proposes a multi-layer perceptron that models the reconstruction error with concatenated attribute records as input. However, neither models the attribute affinity, which is essential for network analysis. TriDNR [13] learns three kinds of relations (node-attribute, inter-node and attribute-label) in a coupled deep model, but label information is not used for inter-node relationship modeling, which may weaken its representational power. LANE [7] learns a smooth representation from three individual representations of structure, attribute and label. AANE [6] accelerates the joint learning of attributes and network structure. However, both treat the effect of attribute information on each node equally, which is too coarse for learning representations. MMDW [19] only integrates label information, via a semi-supervised model that jointly optimizes a matrix factorization of the adjacency matrix and a max-margin SVM classifier. DANE [3] learns a consistent representation from structure and attribute representations, capturing the nonlinearity encoded by two autoencoders; however, it suffers from high computational cost. All in all, existing methods exhibit various deficiencies. To overcome them, we propose a new model that addresses these problems directly.

3 Framework of SINE

In this section, the problem formulation is first given. Then we present the feature learning framework of our method. Next, we introduce the attribute embedding module, followed by the label embedding module.

3.1 Problem Formulation

We consider the problem of learning node representations in three aspects: structure, attributes and labels. Let \(G=\{V,E,W\}\) be a pure network, where V represents the nodes of the network, \(E\subseteq (V \times V)\) are the topological connections and W are the edge weights (all ones for an unweighted network). With side information, the network is further denoted as \(G_S=\{V,E,W,X,Y\}\), with multiple attributes \(X=\{X^{(i)}\}_{i=1}^m\), \(X^{(i)}\in \mathbb {R}^{|V |\times s_i}\), where m is the number of attributes and \(s_i\) is the size of the \(i^{th}\) feature space, and \(Y\in \mathbb {R}^{|V |\times |\mathcal {Y} |}\), where \(\mathcal {Y}\) is the set of labels. We define a function \(\mathcal {L}:V\rightarrow \mathcal {Y}\), with \(\mathcal {L}(u) = i\) if node u is labeled with i. Formally, we aim to learn a low-dimensional representation \(H\in \mathbb {R}^{|V |\times d}\) that incorporates information from all three data sources. As a result, H can achieve better performance in downstream tasks such as node classification. We denote by \(h_u\), a column of \(H^T\), the representation of node u.

3.2 Feature Learning Framework

We extend the Skip-Gram architecture [11] to the side information network. Formally, in network \(G_S\) we maximize the log-probability of observing a network neighborhood \(N_S(u)\) for node u conditioned on its representation \(h_u\):

$$\begin{aligned} \max _H \sum _{u\in V} \log Pr(N_S(u)\vert h_u). \end{aligned}$$
(1)

With the assumption of conditional independence and symmetry effects from neighbors, Eq. 1 simplifies to:

$$\begin{aligned} \max _H \sum _{u\in V} \Big [-\log \lambda _u + \sum _{v\in N_S(u)} h_v^T \cdot h_u\Big ], \end{aligned}$$
(2)

where \(\lambda _u=\sum _{v\in V}\exp (h_v^T \cdot h_u)\). We can see that nodes with similar neighborhoods will obtain similar representations, and a semantically rich neighborhood can more precisely describe the intrinsic correlations of a node. In the following subsections, we propose our method for integrating side information into the neighborhood. To deal with the expensive computation of \(\lambda _u\) in Eq. 2, negative sampling [12] is adopted. We optimize Eq. 2 using stochastic gradient ascent over the model parameters defining the features h.
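To make this concrete, below is a minimal numpy sketch of a single negative-sampling update for Eq. 2, following standard Skip-Gram negative sampling [12]. The function name, the learning rate, and the use of a single embedding matrix H (instead of separate input and output embeddings) are simplifying assumptions for illustration, not details fixed by our method.

```python
import numpy as np

def sgns_update(H, u, v, neg_samples, lr=0.025):
    """One stochastic gradient ascent step on the negative-sampling
    surrogate of Eq. 2 for a single (center u, context v) pair.

    H           : |V| x d embedding matrix, updated in place
    u, v        : node ids; v is one neighbor drawn from N_S(u)
    neg_samples : node ids drawn from a noise distribution
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # positive pair: increase h_v^T h_u
    g = 1.0 - sigmoid(H[v] @ H[u])
    grad_u = g * H[v]
    H[v] += lr * g * H[u]

    # negative pairs: decrease h_n^T h_u
    for n in neg_samples:
        g = -sigmoid(H[n] @ H[u])
        grad_u = grad_u + g * H[n]
        H[n] += lr * g * H[u]

    H[u] += lr * grad_u
```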

3.3 Measure the Attribute Importance

In contrast to TADW and SNE, which assume the attribute information of different nodes to be independent, we measure the correlation between nodes with respect to attributes. In detail, a kernel method \(\mathcal {K}\) is adopted to measure the attribute affinity between any pair of nodes: \(\mathcal {K}(u,v)=\phi (X_u)\cdot \phi (X_v)\). We construct the attribute network A that encodes the affinity between nodes; the edge weight between two nodes u and v is then given by:

$$\begin{aligned} A_{uv} = \mathcal {K}(u,v), \forall u,v \in V. \end{aligned}$$
(3)

In \(G_S\), we now have a stack of information networks: one topological network G and m weighted networks \(\{A^{(i)}\}_{i=1}^m\) built from the diverse attributes.
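As a concrete illustration, the following sketch builds one attribute affinity network with a cosine kernel (the kernel we use in Sect. 4.3), keeping only each node's top-k most similar neighbors as in our experiments. The dense-matrix form and the function name are illustrative assumptions; a sparse implementation would be preferable for large networks.

```python
import numpy as np

def attribute_affinity(X, top_k=20):
    """Attribute affinity network A of Eq. 3 under a cosine kernel,
    sparsified to each node's top_k strongest edges (cf. Sect. 4.3).

    X : |V| x s attribute matrix, one row per node
    """
    Xn = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
    A = Xn @ Xn.T                      # K(u, v) = cos(X_u, X_v), Eq. 3
    np.fill_diagonal(A, 0.0)           # drop self-affinity

    # keep only the top_k largest weights in every row
    if top_k < A.shape[0] - 1:
        thresh = np.sort(A, axis=1)[:, -top_k][:, None]
        A[A < thresh] = 0.0
    return A
```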

Since attribute information and topological knowledge are concealed in different networks, we ought to learn the representation from each of them. A straightforward way is to build a separate neighborhood \(N_S^{(i)}(u)\) (with \(N_S^{(0)}(u)\) denoting the neighborhood of node u on G) for each node u on each network of \(G\cup \{A^{(i)}\}_{i=1}^m\), and then concatenate the representations learned from the respective Skip-Gram models into the final representation. Although, from an information-preserving viewpoint, concatenating different representations maintains the characteristics of the diverse networks, it neither alleviates the effect of noise nor distills the information hidden across representations. Furthermore, incomplete attribute information, which is quite common in real-world datasets, can break this scheme for nodes unobserved in one of the networks. Another way is to combine the neighborhoods \(\bigcup _i N_S^{(i)}(u)\) extracted from the different networks and then learn the representation with a single Skip-Gram model. Yet a combination that treats all side information equally, without discrimination for individual nodes, is too crude for the analysis, not to mention the expensive computational cost of building multiple neighborhoods for each node. All in all, what matters is how to learn the side information discriminatively and sample node neighborhoods efficiently.

Supported by the analysis above, we first propose a measurement on the neighborhoods of the attribute affinity networks to evaluate their local properties. Intuitively, the more similar a neighbor is, the more attention should be paid to exploring it, and the same principle applies to exploring a neighborhood. To this end, we define the local cohesion of node u on \(A^{(i)}\) as follows:

$$\begin{aligned} \rho _u^{(i)} = \frac{\bar{a}_u^{(i)}}{\bar{a}^{(i)}} = \frac{avg_t (A_{ut}^{(i)})}{avg_{s,t} (A_{st}^{(i)})}, \end{aligned}$$
(4)

where \(\bar{a}^{(i)}\) is the average edge weight of the \(i^{th}\) complete affinity network \(A^{(i)}\) and \(\bar{a}^{(i)}_u\) is the average weight of the edges incident to node u in \(A^{(i)}\). Thus, the larger \(\rho _u^{(i)}\) is, the more informative the immediate neighbors are. Then, by comparing the strength of a node's local cohesion across the different networks, we can distinguish the importance of different attributes for that specific node. In other words, if neighbors are more similar to node u in a certain network than in others, this network should take more responsibility for exploring neighbors. The importance of topology can be calculated in the same way: we denote \(A^{(0)}=G\) and \(A^{(0)}_{uv}=W_{uv}\), with \(\rho _u^{(0)}=1\) for an unweighted network, which is also covered by Eq. 4.
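A direct computation of Eq. 4 can be sketched as follows; the dense-matrix layout and the function name are our own illustrative choices, assuming A is a complete affinity network.

```python
import numpy as np

def local_cohesion(A):
    """Local cohesion rho_u (Eq. 4) for every node of one complete
    affinity network A: the node's average incident edge weight
    divided by the network-wide average edge weight."""
    n = A.shape[0]
    node_avg = A.sum(axis=1) / (n - 1)       # avg_t(A_ut)
    global_avg = A.sum() / (n * (n - 1))     # avg_{s,t}(A_st)
    return node_avg / global_avg
```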

Then we propose a multi-network random walk strategy to generate the node neighborhoods \(N_S(u)\). Walking across \(\{A^{(i)}\}_{i=0}^m\) generates a semantically rich node sequence that incorporates diverse node relationships (or similarities) from the different networks; from it, we can construct a neighborhood with multi-relation neighbors for each node. In the proposed random walk scheme, we first decide "which network should be traversed" according to \(\rho _u^{(i)}\), i.e., we choose the more important data source for node u. The probability is proportional to the importance; in particular:

$$\begin{aligned} P(u, A^{(i)}) = \frac{\rho _u^{(i)}}{\sum _{i=0}^m \rho _u^{(i)}}. \end{aligned}$$
(5)

We then carry out a weighted random walk in the chosen network (e.g. \(A^{(i)}\)) with the following probability:

$$\begin{aligned} P_A(u,v) = \frac{A_{uv}^{(i)}}{\sum _{x\in V} A_{ux}^{(i)}}. \end{aligned}$$
(6)
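Putting Eqs. 5 and 6 together, one step of the multi-network walk can be sketched as below; the dense-matrix representation, the argument layout and the function name are assumptions for illustration.

```python
import numpy as np

def multi_network_step(u, networks, rho, rng=np.random.default_rng()):
    """One step of the multi-network random walk from node u.

    networks : list [A^(0) = G, A^(1), ..., A^(m)] of |V| x |V| weight matrices
    rho      : (m+1) x |V| array with rho[i, u] = local cohesion rho_u^(i)
    """
    # Eq. 5: choose a network with probability proportional to its importance
    p_net = rho[:, u] / rho[:, u].sum()
    i = rng.choice(len(networks), p=p_net)

    # Eq. 6: take a weighted step within the chosen network A^(i)
    w = networks[i][u]
    return rng.choice(len(w), p=w / w.sum())
```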

3.4 Fuse the Label Information

Modeling label information is entirely different from modeling attributes, the other kind of side information, since labels are much more refined and scarce in networks. Owing to this sparsity, if we treated labels like attributes and constructed an independent network, two problems would arise: first, the many nodes without labels would be absent from this network; second, the links connecting nodes that share the same label would form isolated information islands, offering no assistance to other nodes. In network analysis, it is commonly assumed that a node's label is highly correlated with the topological structure and can be affected by its labeled neighbors according to their similarity. We propose two ways, one explicit and one implicit, to fuse label information into the topological neighborhood, as shown in Fig. 1.

Fig. 1.

Illustration of the label incorporation scheme. Nodes colored in cyan share the same label; the labels of the other nodes are unknown. The explicit way: the label hub \(c_i\) is linked by label hyperlinks (e.g. \((v,c_i)\) and \((x_3,c_i)\)), shown as dashed lines. Nodes sharing the same label can walk to each other via their common label hub. The implicit way: the example shows the influence of the node sequence, restricted to \(2^{nd}\) order, which is compatible with Node2vec. The walk has just transitioned from labeled node v to unlabeled node u and is now evaluating its next step out of node u. Edge notations indicate the search bias \(\alpha \) for SINE and \(\beta \) for Node2vec; \(\alpha \) comprises biases from topology and label.

To take advantage of the labels' guidance in gathering nodes together, we first introduce the notions of label hyperlink and label hub, which help to explicitly learn the label information during the random walk procedure. By building an imaginary label hub \(c_i, i\in \mathcal {Y}\), for each label on network G, nodes with the same label can connect to each other through label hyperlinks to the corresponding label hub. In particular, the unnormalized probability of walking through a label hyperlink is defined as follows:

$$\begin{aligned} \tau (u,c_i) = \left\{ \begin{array}{ll} \gamma , &{} \quad \text {if }\,\,\mathcal {L}(u) = i,\\ 0, &{} \quad \text {otherwise,} \end{array} \right. \end{aligned}$$
(7)

where \(\gamma \) is a hyperparameter. This explicit method directly bridges the gap between labeled nodes that are not close to each other in the topological network. It also respects the intuition that nodes with the same label are much closer than nodes with different labels. In this way, nodes whose neighborhoods contain the same label hub are more likely to have similar representations than those whose neighborhoods do not.
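A minimal sketch of this construction, assuming an adjacency-dict representation of G; the hub naming and the dict layout are our own illustrative choices.

```python
def add_label_hubs(adj, labels, gamma=0.5):
    """Add one imaginary hub node per label, plus label hyperlinks of
    weight gamma (Eq. 7), so that nodes sharing a label can reach each
    other through their common hub during the walk.

    adj    : {node: {neighbor: weight}} adjacency of G, modified in place
    labels : {node: label} for the labeled subset of V
    """
    hubs = {}
    for u, lab in labels.items():
        c = hubs.setdefault(lab, "hub::%s" % lab)   # label hub c_i
        adj.setdefault(c, {})
        adj[u][c] = gamma                           # tau(u, c_i) = gamma
        adj[c][u] = gamma
    return hubs
```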

However, the explicit label hub method restricts the influence of labels to the labeled nodes themselves; it cannot spread the labeled nodes' information to their unlabeled neighbors. Thus, we resort to the random walk sequence itself. Intuitively, a node sequence is more likely to walk back toward the related labeled community it came from. Since nodes in the same community are similar in both topology and label, and the node sequence can be regarded as a sample of the corresponding community, candidate nodes that are either immediate neighbors of the sequence or share a label appearing in the sequence are more attractive. To measure the attraction of candidate nodes, we define a biased random walk with two additive parts, a topological part and a label part. Consider a random walk that has traversed a node sequence \(T=\{u_i\}_{i=1}^n\) of length n and now resides at node \(u_n\) (Fig. 1). The walk defines the unnormalized transition probability to a neighbor x as \(\tau (u_n,x)=\alpha \cdot w_{u_n,x}\), where \(\alpha =\alpha _{topo}+\alpha _{label}\),

$$\begin{aligned} \alpha _{topo}(x) = \left\{ \begin{array}{ll} 1/p, &{} \quad \text {if }\,\,d(T,x) = 0,\\ 1/q, &{} \quad \text {if }\,\,1 \le d(T,x) \le m,\\ 1, &{} \quad \text {otherwise,} \end{array} \right. \end{aligned}$$
(8)
$$\begin{aligned} \alpha _{label}(x) = \sum _{i=1}^{m} a_i \cdot I\big (d(\{v\in T:\mathcal {L}(v)=\mathcal {L}(x)\},x) = i\big ), \end{aligned}$$
(9)

and d(U, x) denotes the shortest distance between node x and the nodes in set U, I is the indicator function, m is the range of influence of T, and \(\{a_i\}_{i=1}^m\) controls the label influence at different distances. To be clear, \(\alpha _{topo}\) biases the sequence to revisit T with weight 1/p and to walk around T with weight 1/q, while \(\alpha _{label}\) controls the probability of traversing neighbors whose labels have already been visited in T. We perform the label biased random walk with the following probability:

$$\begin{aligned} P_G(u,v) = \frac{\tau (u,v)}{\sum _{x\in V\cup \{c_i\}_{i=1}^{|\mathcal {Y} |}} \tau (u,x)}. \end{aligned}$$
(10)

Note that when we restrict the influence of T to the last two nodes (i.e. \(m=1\)) when computing \(\alpha _{topo}\), the walk is similar to the Node2vecWalk defined in [4] with the parameters 1 and 1/q exchanged. We denote that bias as \(\beta \) and explain it in Fig. 1. The pseudocode of the SINE walk is given in Algorithm 1; its time complexity is analyzed in the experiment section.

To sum up, by generating node sequences over the newly designed networks with the proposed random walk scheme, we incorporate topology, attribute and label information into each node's neighborhood \(N_S(u)\). We then learn the node representation \(h_u\) by solving Eq. 2 with the stochastic gradient ascent method.

Algorithm 1. SINE Walk
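As a concrete sketch of the label biased walk of Algorithm 1, the following Python fragment performs one walk on the topological network. For brevity it folds the distance-dependent weights \(a_i\) of Eq. 9 into a single bonus a1 for neighbors whose label already occurred in the recent history, so it is a simplification under our own naming, not a faithful transcription of Algorithm 1.

```python
import numpy as np

def sine_walk(adj, labels, start, length, p=4.0, q=4.0, a1=4.0, m=1,
              rng=np.random.default_rng()):
    """One label biased walk (simplified Algorithm 1).

    adj    : {node: {neighbor: weight}}, already augmented with label hubs
    labels : {node: label} for labeled nodes; hub nodes carry no label
    """
    walk = [start]
    for _ in range(length - 1):
        u = walk[-1]
        T = walk[-(m + 1):]                      # recent history, range m
        seen = {labels[t] for t in T if t in labels}
        cands, weights = [], []
        for x, w in adj[u].items():
            if x in T:                           # d(T, x) = 0: revisit bias 1/p
                alpha = 1.0 / p
            elif any(x in adj[t] for t in T):    # d(T, x) = 1: stay near T, bias 1/q
                alpha = 1.0 / q
            else:
                alpha = 1.0
            if labels.get(x) in seen:            # label pull (cf. Eq. 9)
                alpha += a1
            cands.append(x)
            weights.append(alpha * w)
        weights = np.asarray(weights)
        walk.append(cands[rng.choice(len(cands), p=weights / weights.sum())])
    return walk
```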

4 Experiments

In this section, we conduct experiments to evaluate the effectiveness of our proposed framework SINE. In particular, we want to answer the following questions.

  1. What is the impact of attribute information on network embedding, and how effective is the multi-network random walk strategy at incorporating attributes?

  2. How effective is the guidance of labels in the label biased random walk scheme?

  3. How effective are the node representations learned by SINE, compared with other state-of-the-art methods, in downstream tasks?

Table 1. Statistics of the dataset

4.1 Datasets

In our experiments, we employ 5 real-world datasets: BlogCatalog (BC), Flickr, Cora, Citeseer and Wiki. All of them are publicly available, and in particular the first two have been used in [7]. BlogCatalog and Flickr are social media networks; each node is a user, and links are the interactions between them. We take user descriptions as attributes and the groups or categories they joined as labels. Cora, Citeseer and Wiki are citation networks; each node is a publication, and links are the citation relationships between them. The attribute of each node is the bag-of-words representation of the corresponding paper. Statistics of the datasets are summarized in Table 1. Note that all these datasets provide only one attribute feature.

4.2 Baseline Methods

We compare our method with 7 baseline methods. To evaluate the contribution of side information, two pure network embedding methods, four attributed network embedding methods and one labeled attributed network embedding method are used for comparison. The first category contains DeepWalk [14] and Node2vec [4]. The second category includes AANE [6], TADW [21], SNE [8] and DANE [3]. The last category contains LANE [7].

4.3 Metric and Parameter Settings

We perform a multi-class node classification task to evaluate the quality of the node representations learned by the different methods. Specifically, we randomly select a portion of the nodes as the training set and use the remainder as the test set. We train a one-vs-rest SVM classifier on the training set and evaluate it on the test set. For each training ratio, we repeat the trial 10 times and report the average results. To measure the classification results, we employ the Micro-F1 and Macro-F1 scores as metrics.

In SINE, we compute the attribute affinity network with \(\mathcal {K}(\cdot , \cdot )\) defined as the cosine similarity of attributes. In the experiments, we preserve only the top 20 similar neighbors for each node, randomly sample 20 neighbors when performing the label biased random walk, and restrict the attraction of the sequence nodes to within two steps with \(a_1=r, a_2=s\), as a trade-off between computational cost and accuracy. The default parameters of SINE are set as follows: window size \(k = 5\), walks per node \(t=20\), walk length \(l=20\), label biased random walk parameters \(p=4\), \(q=4\), \(r=s=4\), and label hub weight \(\gamma =0.5\). The label ratio used for embedding is \(10\%\). For fairness of comparison, the dimension of the embedding vectors d is set to 100 for all methods. The parameters of DeepWalk and Node2vec are kept the same as in SINE. The remaining parameters of the other algorithms are set following the suggestions in their original papers or source code.
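For reference, the evaluation protocol above can be sketched with scikit-learn as follows. The helper name and the choice of LinearSVC are our assumptions, since the text only specifies a one-vs-rest SVM.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def evaluate(H, y, train_ratio, trials=10, seed=0):
    """Multi-class node classification on embeddings H with labels y:
    one-vs-rest SVM, Micro-/Macro-F1 averaged over random splits."""
    micro, macro = [], []
    for t in range(trials):
        Xtr, Xte, ytr, yte = train_test_split(
            H, y, train_size=train_ratio, random_state=seed + t)
        pred = OneVsRestClassifier(LinearSVC()).fit(Xtr, ytr).predict(Xte)
        micro.append(f1_score(yte, pred, average="micro"))
        macro.append(f1_score(yte, pred, average="macro"))
    return float(np.mean(micro)), float(np.mean(macro))
```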

Table 2. Micro-F1 score of classification

4.4 Performance Evaluation

In this section, we answer the questions posed at the beginning of Sect. 4 one by one.

Effectiveness of Multi-network Random Walk Strategy. To answer the first question, we evaluate the proposed multi-network random walk strategy, which performs random walks across multiple networks (the topological network and the attribute affinity networks), through a series of experiments. We first perform random walks on the attribute affinity network (Attribute) and the topological network (Structure) separately, and feed the node sequences to the Skip-Gram model to obtain embeddings for each network. Then we mix the node sequences generated on these two networks and use this mixed corpus to produce embeddings in the same way (Combine). Finally, we perform the proposed multi-network random walk strategy without labels. The classification results of these four methods on the BlogCatalog dataset with different training ratios are shown in Table 3.

Table 3. F1-score of classification on BlogCatalog

The results in Table 3 illustrate the improvement brought by our multi-network random walk strategy. Specifically, compared with the first two methods, which utilize only attribute information or only network structure, the Combine and SINE methods always achieve significantly better performance, showing that attribute information is valuable for network embedding. More importantly, our method outperforms the others in all situations, which demonstrates that the proposed multi-network random walk strategy is effective. In contrast to treating attributes and structure separately, we capture the correlation and interaction between them with a unified random walk sequence, effectively incorporating attribute information and leading to much better node representations.

Effect of Label Information. To answer the second question, we investigate the guidance effect of labels by varying the ratio of labeled nodes from 10% to 90% when performing the label biased random walk. The training ratio of the SVM classifier is fixed to 50%. The results are presented in Fig. 2.

Fig. 2.

Classification results of different label ratios

From Fig. 2, we can see that with the increase of label ratio, both metrics are rising, which validates the guidance effect of label on embedding.

Effectiveness of SINE. To study the effectiveness of our SINE framework, as raised in the third question, we compare its performance with all baseline methods while varying the training ratio from 10% to 50%. The classification results of the eight methods on the five datasets are shown in Table 2. Due to space limitations, we only show the Micro-F1 scores; the Macro-F1 results are similar. From Table 2, we find that our method achieves better performance in most situations, with the following observations.

  • First, by incorporating the attribute information, most attributed network embedding methods achieve significant improvement compared to the pure network embedding methods.

  • Second, with the proposed framework, SINE outperforms other baseline methods in most situations. This is because SINE can effectively integrate the side information and get much more valuable node representations, resulting in better classification results.

  • Third, our method performs fairly well even when the training ratio is quite small, while the other baseline methods degrade quickly as the training ratio decreases because their representations are noisy and inconsistent between the training and test sets. Compared with the other algorithms, SINE learns node representations from three data sources, namely network structure, attribute information and node labels, which makes the representations more consistent and less noisy.

Fig. 3.

Classification results of different dimensions

Fig. 4.

Results of parameter analysis and scalability

4.5 Parameter Analysis

In this section, we investigate the effects of the parameters, including the embedding dimension d, the label hub weight \(\gamma \), and the label biases r and s. We fix the training ratio to 50% and test the classification F1 scores under different parameters. For the dimension d, we vary it from 10 to 100 and conduct experiments on the five datasets. Figure 3 shows the variation of the classification results with different d. The results suggest that our method is stable when d is within a reasonable range. As for the label related parameters \(\gamma \), r, and s, we set the other two parameters to zero when analyzing one of them. We vary each of them over a range on the Citeseer dataset, and the results are presented in Fig. 4. We find that the influences of these three parameters are similar: as they increase, the performance improves due to the guidance effect of labels. However, once they exceed a threshold, the random walk always moves to labeled nodes instead of their topological neighbors, losing the information of the topological neighborhoods and reducing the quality of the node representations.

4.6 Scalability

The time complexity of our random walk scheme is \(O(tle\cdot \vert V \vert )\), where e is the average number of edges per node. In practice, we set \(e=20\) as mentioned in the parameter settings, so it can be regarded as a constant. The time complexity of Skip-Gram is \(O(k(d + d \cdot \log \vert V \vert ))\), where the window size k and the embedding dimension d are constants, so the total complexity is still \(O(\vert V \vert )\). To test scalability, we learn node representations with SINE using the default parameter values on Erdos-Renyi graphs whose node counts increase from \(10^2\) to \(10^{5}\). We report the average running time over 10 independent executions. The running times (in log scale) are shown in Fig. 4d. We observe that SINE scales linearly with the number of nodes, which is acceptable in practice. Thus, SINE can be applied to large-scale networks.

5 Conclusion

In this paper, we propose a novel network embedding framework, SINE, which learns high-quality node representations for networks with side information, including attributes and labels. We design a flexible random walk scheme to generate semantically rich neighborhoods for nodes, capturing proximity in topological structure, node attributes and node labels. Extensive experiments on 5 real-world datasets validate its effectiveness and efficiency.