1 Introduction

The recent success of neural graph embeddings such as LINE [18], DeepWalk [14] and node2vec [7] has opened a new path for analyzing networks. Although these embeddings outperform spectral ones in tasks such as link prediction and multi-label node classification, Spectral Graph Theory [3] is still a key tool for understanding and characterizing neural embeddings [16].

In this paper, we contribute empirical evidence showing that neural embeddings (Sect. 2) can boost their performance in multi-label classification by embedding edges instead of nodes. We conjecture that this is due to the spectral properties of line graphs, whose nodes are the edges of the original graphs (Sect. 3). However, since general line graphs have not yet been fully characterized, we can only correlate our empirical findings (Sect. 4) with some of the well-known properties of line graphs and the spectral characterization of neural embeddings.

2 Classic vs Neural Embeddings

2.1 Classic Embeddings

Let \(G=(V,E,\mathbf {A})\) be a graph/network with \(n=|V|\) nodes, \(m=|E|\) edges, where \(E\subseteq V\times V\), and adjacency matrix \(\mathbf {A}\). Node embedding then consists of finding a mapping \(f:V\rightarrow \mathbb {R}^d\) (with \(d\ll n\)) so that the resulting d-dimensional vectors capture the structural properties of each vertex. As a result, we have \(||f(i)-f(j)||^2\rightarrow 0\) if nodes i and j are structurally similar within the graph G. Traditionally, nodal structural similarity was associated with the reachability of node j from node i (and vice versa) through random walks [10]. This characterization led to the definition of both hitting times \(H_{ij}\) (the expected number of steps taken by a random walk to reach j from i) and commute times \(CT_{ij}=H_{ij}+ H_{ji}\) (which also include the expected number of steps needed to return to i from j). Since random walks are encoded by transition matrices of the form \(\mathbf {P}=\mathbf {D}^{-1}\mathbf {A}\), where \(\mathbf {D}=diag(d_1,\ldots ,d_n)\) is the diagonal matrix of node degrees, the spectral analysis of \(\mathbf {P}\) is a natural way of understanding both hitting and commute times. More precisely, let \(\lambda _1=1\ge \lambda _2\ge \ldots \ge \lambda _{n}\ge -1\) be the spectrum of the transition matrix. It is well known that hitting and commute times are highly conditioned by the spectral gap \(\lambda =1-\max \{\lambda _2,|\lambda _{n}|\}\). When several communities are encoded by a connected graph G, \(H_{ij}\) and \(CT_{ij}\) are only meaningful when \(\lambda \rightarrow 0\) (small bottlenecks between communities); otherwise, these quantities rely only on the local densities (degrees) of nodes i and j, and one cannot discriminate whether two nodes belong to the same community or not [11]. Consequently, the applicability of commute-times node embeddings to clustering is quite limited (see representative examples of image segmentation and tracking in [15]). In this regard, recent research focuses on simultaneously minimizing the spectral gap and shrinking (whenever possible) inter-community commute distances via graph densification [4, 5] before embedding the nodes.

Therefore, once G is processed (or rewired), the commute-times embedding amounts to learning two matrices \(\mathbf {X},\mathbf {Y}\in \mathbb {R}^{n\times d}\), whose rows are denoted by \(\mathbf {x}_i\) and \(\mathbf {y}_i\) respectively, where \(\mathbf {x}_i\) is the embedding of node i. Following [15], the commute-times embedding matrix \(\mathbf {X}\) results from factorizing

$$\begin{aligned} vol(G)\mathcal{G}=\mathbf {X}\mathbf {Y}^T, \end{aligned}$$
(1)

where \(vol(G)=\sum _{i=1}^nd_i\) is the volume of the graph and \(\mathcal{G}\) is its Green’s function, i.e. the pseudo-inverse of the normalized graph Laplacian \(\mathcal{L} = \mathbf {I}-\mathbf {D}^{-1/2}\mathbf {A}\mathbf {D}^{-1/2}\), whose spectrum is \(0=1-\lambda _1\le 1-\lambda _2\le \ldots \le 1-\lambda _n\le 2\): if \(\lambda _i\) is an eigenvalue of \(\mathbf {P}\), then \(1-\lambda _i\) is an eigenvalue of \(\mathcal{L}\).
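For illustration, a minimal sketch of this embedding follows (Python with numpy/networkx; an illustration of Eq. (1), not the implementation evaluated in Sect. 4):

    import numpy as np
    import networkx as nx

    def commute_times_embedding(G, d=2):
        # CTE via the Green's function of the normalized Laplacian, cf. Eq. (1);
        # assumes a connected, undirected graph.
        A = nx.to_numpy_array(G)
        deg = A.sum(axis=1)
        vol = deg.sum()                                  # vol(G) = sum_i d_i
        Disq = np.diag(1.0 / np.sqrt(deg))
        L = np.eye(len(deg)) - Disq @ A @ Disq           # normalized Laplacian
        w, U = np.linalg.eigh(L)                         # 0 = w[0] <= ... <= w[-1] <= 2
        # Pseudo-inverse: drop the zero eigenvalue and keep the d smallest
        # non-zero ones, which dominate commute times.
        return np.sqrt(vol) * Disq @ U[:, 1:d + 1] / np.sqrt(w[1:d + 1])

    X = commute_times_embedding(nx.karate_club_graph())  # ||x_i - x_j||^2 = CT_ij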

2.2 Neural Embeddings

Neural embeddings such as LINE [18], DeepWalk [14] and node2vec [7] exploit random walks in a different way. Namely, they simulate a fixed number N of random walks of fixed length L emanating from the nodes of G and then capture co-occurrence statistics of pairs of nodes. The first node of the i-th path, \(w_i\), assimilated to a word in a textual corpus (skip-gram model), is sampled from a prior distribution \(P(w_i)\). The context of \(w_i\) is given by the nodes/words surrounding it in a T-sized window \(w_{i-T},\ldots ,w_{i-1},w_{i+1},\ldots ,w_{i+T}\), generated according to the transition matrix \(\mathbf {P}\). Then, the node-context pairs \((w,c)\) are given by \((w_{i-r},w_i)\) and \((w_i, w_{i+r})\) for \(r=1,\ldots ,T\). All these pairs are added to the multiset \(\mathcal{D}\) used for learning with negative sampling. Negative sampling implies considering not only likely node-context pairs \((w,c)\) but also b unlikely ones \((w,c')\): the negative samples \(c'\) are nodes drawn from the steady-state probability distribution of the random walk, i.e. \(P_N(i)=d_i/vol(G)\). This process is called Skip-Gram with Negative Sampling (SGNS) and leads to the following factorization [9]:

$$\begin{aligned} \mathbf {M} = \mathbf {X}\mathbf {Y}^T,\;\text {with}\; \mathbf {M}_{ij}=\log \left( \frac{\#(w_i,c_j)|\mathcal{D}|}{\#(w_i)\#(c_j)}\right) - \log b, \end{aligned}$$
(2)

where \(\#(w_i,c_j)\) is the number of times the corresponding node-context pair is observed, \(\#(w_i)\) is the number of times node i is observed as a word (and similarly \(\#(c_j)\) as a context); finally, \(\log (\cdot )\) is the element-wise logarithm and b is the number of negative samples.
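A compact sketch of this sampling and vectorization (helper names are ours; an explicit PMI matrix in place of actual SGNS training, to which it is equivalent by Eq. (2)):

    import random
    import numpy as np
    import networkx as nx

    def walk_corpus(G, N=10, L=40, seed=0):
        # Simulate N first-order random walks of length L from every node.
        random.seed(seed)
        walks = []
        for _ in range(N):
            for v in G:
                walk = [v]
                for _ in range(L - 1):
                    walk.append(random.choice(list(G[walk[-1]])))
                walks.append(walk)
        return walks

    def shifted_pmi_matrix(G, walks, T=10, b=5):
        # M_ij = log( #(w_i, c_j) |D| / (#(w_i) #(c_j)) ) - log b, cf. Eq. (2).
        idx = {v: k for k, v in enumerate(G)}
        M = np.zeros((len(idx), len(idx)))
        for walk in walks:
            for i, w in enumerate(walk):
                for c in walk[max(0, i - T):i] + walk[i + 1:i + T + 1]:
                    M[idx[w], idx[c]] += 1       # co-occurrence counts #(w, c)
        with np.errstate(divide='ignore'):
            return np.log(M * M.sum() / np.outer(M.sum(1), M.sum(0))) - np.log(b)

    G = nx.karate_club_graph()
    M = shifted_pmi_matrix(G, walk_corpus(G))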

2.3 LINE and DeepWalk vs node2vec Factorizations

These strategies differ in the way they sample (and thus vectorize) the graph for SGNS. LINE and DeepWalk rely on first-order random walks whereas node2vec is driven by second-order random walks.

LINE and DeepWalk. LINE’s factorization follows directly from the cost function associated with SGNS. In particular, the latent representations of both the word/node \(\mathbf {x}_i\) and the context \(\mathbf {y}_j\) are assumed to be correlated with the existence of an edge between nodes i and j, i.e. \(\mathbf {A}_{ij}\log g(\mathbf {x}^T_i\mathbf {y}_j)\) is maximized, where \(g(\cdot )\) is the sigmoid function. Following [16], this leads to

$$\begin{aligned} \mathbf {x}^T_i\mathbf {y}_j=\log \left( \frac{vol(G)\mathbf {A}_{ij}}{d_id_j}\right) - \log b\;\Rightarrow \log \left( vol(G)\mathbf {D}^{-1}\mathbf {A}\mathbf {D}^{-1}\right) - \log b = \mathbf {X}\mathbf {Y}^T. \end{aligned}$$
(3)

DeepWalk, on the other hand, leads to a more complex factorization. Assuming that the first node of each random walk is drawn from the steady state distribution, we have that, when \(L\rightarrow \infty \),

$$\begin{aligned} \frac{\#(w_i,c_j)|\mathcal{D}|}{\#(w_i)\#(c_j)}\overset{p}{\rightarrow } \frac{vol(G)}{2T}\left( \frac{1}{d_j}\sum _{r=1}^T\mathbf {P}^r_{ij}+ \frac{1}{d_i}\sum _{r=1}^T\mathbf {P}^r_{ji}\right) \end{aligned}$$
(4)

where \(\overset{p}{\rightarrow }\) denotes convergence in probability. This yields

$$\begin{aligned} \log \left( \frac{vol(G)}{T}\left( \sum _{r=1}^T\mathbf {P}^r\right) \mathbf {D}^{-1}\right) -\log b = \mathbf {X}\mathbf {Y}^T, \end{aligned}$$
(5)

which is equivalent to LINE for \(T=1\).
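For small graphs, the matrix factorized by DeepWalk can be formed explicitly; a minimal sketch (Python with numpy/networkx; setting \(T=1\) recovers the LINE matrix of Eq. (3)):

    import numpy as np
    import networkx as nx

    def deepwalk_matrix(G, T=10, b=1):
        # log( vol(G)/T (sum_{r=1}^T P^r) D^{-1} ) - log b, cf. Eq. (5).
        A = nx.to_numpy_array(G)
        deg = A.sum(axis=1)
        P = A / deg[:, None]                     # P = D^{-1} A
        S, Pr = np.zeros_like(P), np.eye(len(deg))
        for _ in range(T):
            Pr = Pr @ P                          # successive powers of P
            S += Pr
        M = (deg.sum() / T) * S / deg[None, :]   # right multiplication by D^{-1}
        with np.errstate(divide='ignore'):
            return np.log(M) - np.log(b)         # element-wise log (-inf where M_ij = 0)

    M = deepwalk_matrix(nx.karate_club_graph(), T=10)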

node2vec. The underlying idea of this embedding is to add more flexibility to the random walk. This is done by defining two parameters p and q that control, respectively, the likelihood of immediately revisiting a node in the walk and that of keeping the walk local. To that end, node2vec needs to evaluate the probability of the next node given the preceding one in the walk, i.e. we have a 2nd-order random walk. This walk is characterized by the hypermatrix \(\mathbf {\underline{P}}\), where \(\mathbf {\underline{P}}_{i(jk)}\) denotes the probability of reaching i from j given that the node preceding j is k. Thus, the 2nd-order random walk can be reduced to a 1st-order one on the edges of the graph [1], as is done in the implementation of node2vec. The stationary distribution \(\mathbf {X}_{ik}\) of this type of random walk satisfies \(\sum _{k}\mathbf {\underline{P}}_{i(jk)}\mathbf {X}_{jk}=\mathbf {X}_{ij}\). Qiu et al. [16] have found that

$$\begin{aligned} \frac{\#(w_i,c_j)|\mathcal{D}|}{\#(w_i)\#(c_j)}\overset{p}{\rightarrow } \frac{\frac{1}{2T}\sum _{r=1}^T\left( \sum _{k}\mathbf {X}_{ik}\mathbf {\underline{P}}^r_{j(ik)}+\sum _{k}\mathbf {X}_{jk}\mathbf {\underline{P}}^r_{i(jk)}\right) }{\left( \sum _{k}\mathbf {X}_{ik}\right) \left( \sum _{k}\mathbf {X}_{jk}\right) } \end{aligned}$$
(6)

and, although the matrix expression for the factorization is more elusive, the final factorization differs significantly from those of DeepWalk and LINE.
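For reference, one step of this 2nd-order walk can be sketched as follows (the helper name is ours; prev is the node visited immediately before cur):

    import numpy as np
    import networkx as nx

    def node2vec_step(G, prev, cur, p=1.0, q=1.0, rng=np.random.default_rng(0)):
        # Unnormalized weights: 1/p for returning to prev, 1 for common
        # neighbours of prev and cur, 1/q for moving farther away.
        neigh = list(G[cur])
        w = np.array([1.0 / p if x == prev
                      else 1.0 if G.has_edge(x, prev)
                      else 1.0 / q for x in neigh])
        return neigh[rng.choice(len(neigh), p=w / w.sum())]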

3 Node vs Edge Embeddings

3.1 The Line Graph

In this paper, we are mainly concerned with the impact of embedding the edges of G instead of its nodes. This means that a word \(w_i\) in the previous expressions is no longer associated with a node of G but with a node of its line graph \(L_G\). The nodes of \(L_G\) are the edges of G, and two of them are linked in the line graph if the corresponding edges of G share a node. More formally, given the \(n\times m\) incidence matrix \(\mathbf {B}\), where \(\mathbf {B}_{i\alpha }\) is 1 if edge \(\alpha \) is incident to node i and 0 otherwise, the \(m\times m\) adjacency matrix \(\mathbf {C}=\mathbf {B}^T\mathbf {B}-2\mathbf {I}\) has elements \(C_{\alpha \beta }=\sum _{i=1}^n\mathbf {B}_{i\alpha }\mathbf {B}_{i\beta }(1-\delta _{\alpha \beta })\).
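As a sanity check, the construction of \(\mathbf {C}\) can be cross-validated against the line-graph routine of networkx; a minimal sketch:

    import numpy as np
    import networkx as nx

    def line_graph_adjacency(G):
        # C = B^T B - 2I: the diagonal of B^T B equals 2 (every edge is
        # incident to exactly two nodes) and has to be removed.
        B = nx.incidence_matrix(G).toarray()         # n x m incidence matrix B
        return B.T @ B - 2 * np.eye(B.shape[1])

    G = nx.barbell_graph(5, 0)                       # two cliques joined by an edge
    C = line_graph_adjacency(G)
    assert C.sum() / 2 == nx.line_graph(G).number_of_edges()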

3.2 Spectral Analysis

Some interesting properties of line graphs \(L_G\) compared with G (numerically illustrated in the sketch after this list):

  • Boosted edge density. A single node i in G leads to a clique with \(d_i(d_i-1)/2\) edges in \(L_G\) (see Fig. 1). Although this gives high prominence to notable nodes of G, it makes community detection more flexible [6]. In addition, the steady-state distribution of a random walk in \(L_G\) is \(P_N(\alpha _{(i,j)})=d_{\alpha }/vol(L_G)\), where \(d_{\alpha }=d_i+d_j-2\) and \(vol(L_G)=\sum _{\alpha ,\beta }C_{\alpha \beta }=\sum _{i=1}^n d_i(d_i-1)\).

  • Redundant spectrum for \(m>n\). Let \(\lambda _1(L_G)\ge \lambda _2(L_G)\ge \ldots \ge \lambda _m(L_G)\) be the spectrum of \(\mathbf {C}\). Then, for \(m>n\), \(\lambda _{n+1}=\ldots =\lambda _m=-2\). As a result, \(\lambda _i(\mathbf {L}(L_G))\ge 4\) for the largest \(m-n\) eigenvalues of \(\mathbf {L}(L_G)\), the unnormalized Laplacian matrix of \(L_G\) [19]. This significantly increases the medium-large eigenvalues of \(\mathcal{L}(L_G)\) with respect to those of \(\mathcal{L}(G)\) (see Fig. 2).

  • Majorization of the spectrum of G. This is really a conjecture, derived from the bound \(\lambda _2(L_G)\le \frac{m}{2}-1\) in comparison with that for G: \(\lambda _2(G)\le \frac{n}{2}-1\). Empirical data show that the lowest part of the spectrum of \(\mathcal{L}-\mathbf {I}\) in the line graph majorizes that of G (blue lines in Fig. 2). Since the spectrum driving DeepWalk is (approximately) of the form \(\frac{1}{T}\sum _{r=1}^T\lambda _i^r\), this leads (in general) to smaller spectral gaps for line graphs, and thus slower mixing times of the random walks (more randomness). Green lines show the real spectra driving random walks in DeepWalk. In all cases, \(T=10\).
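The following sketch (Python with numpy/networkx; the barbell graph of Fig. 1 as a toy example) numerically checks the steady-state distribution on \(L_G\) and the redundant eigenvalues at \(-2\):

    import numpy as np
    import networkx as nx

    G = nx.barbell_graph(5, 0)                       # two cliques joined by an edge
    n, m = G.number_of_nodes(), G.number_of_edges()
    LG = nx.line_graph(G)
    ev = np.sort(np.linalg.eigvalsh(nx.to_numpy_array(LG)))
    print(np.allclose(ev[:m - n], -2))               # the m-n smallest eigenvalues are -2

    deg = dict(G.degree())
    d_alpha = np.array([deg[i] + deg[j] - 2 for i, j in LG.nodes()])
    vol_LG = sum(d * (d - 1) for d in deg.values())  # vol(L_G) = sum_i d_i (d_i - 1)
    print(np.allclose(d_alpha.sum(), vol_LG))        # degrees of L_G sum to vol(L_G)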

Fig. 1. Barbell graph linking two cliques (left) and its line graph (right)

Fig. 2. Eigenvalues of the original (top) and line graph (bottom), for Cora (left) and CiteSeer (right) databases (Color figure online)

4 Experiments and Discussion

4.1 Datasets (Networks)

  • CiteSeer for Document Classification [17]. Citation network containing 3312 scientific publications with 4676 links between them. Each publication is classified into one of 6 categories. Statistics of all the networks are summarized in the table below.

    Network  | Nodes | Edges | Line graph edges | Gap      | Con. comps | Labels | Multi-label
    wiki     | 2405  | 12761 | 355644           | 0.000000 | 45         | 19     | No
    cora     | 2708  | 5278  | 52301            | 0.000000 | 78         | 7      | No
    citeseer | 3327  | 4676  | 27174            | 0.000000 | 438        | 6      | No
    ppi      | 3890  | 38739 | 3018220          | 0.000000 | 35         | 50     | Yes
    pos      | 4777  | 92517 | 49568882         | 0.576132 | 1          | 40     | Yes
    facebook | 4039  | 88234 | 9314849          | 0.000837 | 1          | 10     | Yes

  • Cora [17]. Citation network containing 2708 scientific publications with 5278 links between them. Each publication is classified into one of 7 categories.

  • Wiki. Contains a network of 2405 web pages with 17981 links between them. Each page is classified into one of 19 categories.

  • Facebook social circles [13].

  • Wikipedia Part-of-Speech (POS) [12]. Co-occurrence network of the words appearing in the first million bytes of the Wikipedia dump. The categories correspond to the Part-of-Speech (POS) tags inferred by the Stanford POS-Tagger. Contains 4777 nodes and 92517 undirected links. Each node may have several labels; there are 40 labels (categories).

  • Protein-Protein Interactions (PPI) [2]. We use a subgraph of the PPIs associated with Homo sapiens. The network has 3890 nodes and 76584 links. Each node may have several labels corresponding to the 50 possible categories.

Facebook, PPI and POS have been retrieved from SNAP [8]. CiteSeer and Cora have been retrieved from LINQS.

All the networks are treated as undirected graphs. Originally single-label networks become multi-label networks when their line graph is computed (or sampled, for the sake of efficiency): an edge inherits the labels of its two endpoints. Nodes of the line graph with more than one label are border nodes between two categories (inter-class), whereas nodes holding a single label are intra-class nodes. The resulting proportions are shown in the table below.
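A sketch of this relabelling (node labels assumed in a dict; names are ours):

    import networkx as nx

    def edge_labels(G, node_label):
        # A node of L_G, i.e. an edge (i, j) of G, inherits the union of the
        # labels of its endpoints: one label = intra-class, two = inter-class.
        labels = {e: {node_label[e[0]], node_label[e[1]]} for e in G.edges()}
        inter = [e for e, ls in labels.items() if len(ls) > 1]
        intra = [e for e, ls in labels.items() if len(ls) == 1]
        return labels, inter, intra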

    Network  | Inter-class nodes | Intra-class nodes
    wiki     | 4526 (35%)        | 8235 (65%)
    cora     | 1003 (19%)        | 4275 (81%)
    citeseer | 1190 (25%)        | 3486 (75%)

We have used the implementations of node2vec and DeepWalk included in the OpenNE framework. The default values for p and q in node2vec are \(p=q=1\). After optimizing p and q over the range \(\{0.25,0.5,1,2,4\}\), the maximum improvement of node2vec over DeepWalk in the classification score is 0.014 (micro and macro). Regarding spectral embeddings, CTE and LLE have only been tested on the networks with a single connected component (pos and facebook). In particular, commute times (CTE) perform poorly in multi-label classification because their factorization relies on the Green’s function, which weights each eigenvalue only by its inverse, whereas DeepWalk is controlled by a polynomial of each eigenvalue.
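The classification protocol behind the tables below and Fig. 3 can be sketched as follows (one-vs-rest logistic regression over the learned vectors with scikit-learn; the specific classifier here is illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier

    def evaluate(X, Y, train_fraction=0.5, seed=0):
        # X: embeddings (of nodes or of line-graph nodes); Y: binary label matrix.
        Xtr, Xte, Ytr, Yte = train_test_split(
            X, Y, train_size=train_fraction, random_state=seed)
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(Xtr, Ytr)
        Yp = clf.predict(Xte)
        return (f1_score(Yte, Yp, average='micro'),
                f1_score(Yte, Yp, average='macro'))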

Micro-F1:

    Network  | node2vec (Nodes) | node2vec (Edges) | DeepWalk (Nodes) | DeepWalk (Edges) | cte      | lle
    citeseer | 0.591071         | 0.768626         | 0.595833         | 0.780930         | –        | –
    cora     | 0.808715         | 0.903557         | 0.817578         | 0.920538         | –        | –
    wiki     | 0.663342         | 0.840709         | 0.692436         | 0.859129         | –        | –
    pos      | 0.447845         | 0.697572         | 0.471620         | 0.696640         | 0.375037 | 0.393165
    ppi      | 0.197435         | 0.590731         | 0.205786         | 0.609937         | –        | –
    facebook | 0.911516         | 0.999900         | 0.911516         | 0.999900         | 0.239217 | 0.231214

Macro-F1:

    Network  | node2vec (Nodes) | node2vec (Edges) | DeepWalk (Nodes) | DeepWalk (Edges) | cte      | lle
    citeseer | 0.544490         | 0.730930         | 0.545774         | 0.745995         | –        | –
    cora     | 0.798782         | 0.898863         | 0.804928         | 0.917838         | –        | –
    wiki     | 0.528603         | 0.764205         | 0.597948         | 0.787771         | –        | –
    pos      | 0.084183         | 0.773035         | 0.094148         | 0.774001         | 0.041637 | 0.033617
    ppi      | 0.168237         | 0.566912         | 0.178405         | 0.587934         | –        | –
    facebook | 0.821899         | 0.999377         | 0.822243         | 0.999407         | 0.108742 | 0.045155

Fig. 3. Evolution of the performance as a function of the fraction of known labels in the training set. Micro-F1 (top) and Macro-F1 (bottom)

Fig. 4. t-SNE embeddings. Original graph (top) and line graph (bottom), for Facebook (left), Cora (center) and CiteSeer (right) databases.

In Fig. 3, we show the performance in multi-label classification as a function of the percentage of nodes with known labels. The line-graph versions of node2vec and DeepWalk clearly outperform their nodal counterparts. The similar performance of node2vec and DeepWalk is due to the fact that the 2nd-order random walk of node2vec is not applied at the level of edges (it is infeasible for large networks). Finally, in Fig. 4 we show the t-SNE embeddings. Edge embeddings clearly produce denser communities.
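A minimal sketch of this visualization (scikit-learn t-SNE and matplotlib; colours encode one label per node):

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_tsne(X, labels):
        # Project the d-dimensional embeddings to 2-D for visual inspection.
        Z = TSNE(n_components=2, random_state=0).fit_transform(X)
        plt.scatter(Z[:, 0], Z[:, 1], c=labels, s=5, cmap='tab10')
        plt.show()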

5 Conclusions

In this paper, we have contributed empirical evidence showing that embedding edges clearly outperforms node-based embeddings in neural SGNS strategies. We conjecture that this is due to the slower mixing times of random walks in line graphs. Future work includes a detailed verification of this conjecture, as well as more efficient (in time and space) strategies for designing walkers on line graphs.