Graph data involves rich and complex potential relationships and plays an important role in many real-world applications, being used extensively in areas such as social networks, recommendation systems, science, and NLP. As AI continues to gain popularity, a growing number of machine learning tasks need to analyze and process graph data. One effective method for graph analysis is to map a graph’s elements to a low-dimensional vector space while retaining the graph’s structure and property information. This low-dimensional vector is called a graph vector (or “graph embedding”), which is described below.

8.1 Graph Vector Overview

A graph is a data structure comprising a set of vertices, which are interconnected by lines called edges, and relationships between the vertices, as shown in Fig. 8.1. The graph is called a directed graph if each edge has a direction (in this case, the edges are similarly known as directed edges). Conversely, the graph is called an undirected graph if the edges have no direction. Two graphs are isomorphic if they have the same number of vertices and edges and if the second graph can be obtained by permuting all vertices in the first graph one by one to the names of the vertices in the second graph. For example, a pentagon with five vertices and a five-pointed star with five vertices are considered isomorphic. The number of edges associated with the vertex represents the degree of a vertex. A common storage representation of a graph is the adjacency matrix, which can be represented by the vertex set V and the edge matrix E, as shown in Formula (8.1).

$$\user2{V} = \left( {v_{1} ,v_{2} ,v_{3} ,v_{4} ,v_{5} } \right)\user2{E} = \left[ {\begin{array}{*{20}c} {\text{0}} & {\text{1}} & {\text{0}} & {\text{0}} & {\text{1}} \\ {\text{1}} & {\text{0}} & {\text{1}} & {\text{0}} & {\text{1}} \\ {\text{0}} & {\text{1}} & {\text{0}} & {\text{1}} & {\text{0}} \\ {\text{0}} & 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 1 & 0 \\ \end{array} } \right]$$
(8.1)
Fig. 8.1
figure 1

Common storage representation of a graph

Graphs are used in all sorts of real-world applications (e.g., in communication networks, social networks, e-commerce networks, and traffic networks). Because they contain rich information, comprising potentially billions of vertices and relationships between the vertices (edges), graph analysis is of particular importance.

However, these graphs are usually high-dimensional ones that contain massive volumes of information, making it difficult to directly process them. An important method for analyzing and processing such graphs is the graph embedding method (GEM), which uses a low dimension, dense vector to represent a graph’s vertex and reflect its structure information. The following key features [1] are paramount in a good GEM [1].

  1. 1.

    Neighborhood awareness: The distance between hidden vectors on vertices reflects the distance between the vertices on the graph.

  2. 2.

    Low dimension: This feature is necessary to facilitate subsequent calculations.

  3. 3.

    Adaptation: Adding a vertex (or edge) should not cause all calculation processes to be repeated.

  4. 4.

    Continuity: Continuous representations have smooth decision boundaries and enable refined representations of the graph members.

Depending on the application scenario, the GEM can be divided into vertex embedding, edge embedding, mixed embedding, and whole graph embedding. In the first category, vertex embedding, algorithms such as the classical DeepWalk and Node2Vec, and graph-based neural network ones such as graph convolutional networks (GCNs) and graph attention networks (GATs), are used.

The classical GEMs have two disadvantages: First, during the learning embedding process, parameters are not shared between vertices, and calculation efficiency is low. Second, because learning is directly performed on a particular structure graph, there is a lack of generalization ability, and new or dynamic graphs cannot be processed. Although CNNs are well known for processing Euclidean data, non-Euclidean data is difficult to process. CNNs and GEMs have promoted the development of the graph neural network (GNN) model, which captures the dependence of graphs through message transfer between vertices of graphs. Despite the original GNN being difficult to train and offering suboptimal results, researchers have made significant improvements in its network architecture, optimization methods, and parallel computing, enabling it to achieve good learning capabilities. Over the last few years, GNN has become a popular graph analysis method [2] due to advantages such as excellent performance and high interpretability. In addition, the algorithms represented by GCNs and graph attention networks are gaining significant attention.

This chapter explores the vertex embedding algorithm, with each section centering on the following topics:

  1. 1.

    Section 8.2 focuses on the classical graph embedding method DeepWalk.

  2. 2.

    Section 8.3 examines the classical graph embedding method Large-scale Information Network Embedding (LINE).

  3. 3.

    Section 8.4 discusses the classical graph embedding method Node2Vec.

  4. 4.

    Sections 8.5 and 8.6 cover the algorithms based on graph neural networks, including GCN and GAT.

  5. 5.

    Section 8.7 delves into the application of graph neural networks in the recommendation system.

8.2 DeepWalk Algorithm

The sparsity of graph representation data (such as the adjacency matrix) makes it a challenging task to design algorithms. We need to eliminate the adverse impacts of data sparsity during network application (such as network classification, recommendation, and anomaly detection) in order to develop high-quality machine learning algorithms. Establishing a method to map the complex and high-dimensional sparse graph data to the low-dimensional dense vector is therefore of the utmost importance. Because machine learning cannot directly deal with natural languages, we must convert words into vectors composed of numeric values so that we can subsequently establish models for analysis. In the field of NLP, one of the most prominent algorithms is Word2Vec, which was inspirational in Bryan Perozzi’s proposal of the DeepWalk algorithm [1] in 2014. DeepWalk, a classical unsupervised learning algorithm in graph embedding, performs well in the absence of information and is readily usable by statistical models.

8.2.1 Principles of the DeepWalk Algorithm

The DeepWalk algorithm learns the low-dimensional vector representation of a vertex in a graph by truncating the local information of the random walk (which is often used as a similarity measure in content recommendation and community discovery). In the field of natural language, we can consider the vertex as a word, and the sequence of the vertices obtained through the random walk is like a sentence. The input of the DeepWalk algorithm is a connected graph (either directed or undirected), and the output is a vector representation of all vertices in the graph. Figure 8.2 provides an example of low-dimensional vector representation of the DeepWalk learning vertex, where (a) shows that the input is a graph and (b) shows that the output is a two-dimensional vector representation of each vertex in the input graph. The vector dimension is determined to be 2 because the two-dimensional vector is easy to visualize. In the figure, the vertices shown in the same color are similar to each other. The more vertices that two vertices have in common, the shorter the distance between the two-dimensional vectors corresponding to the two vertices.

Fig. 8.2
figure 2

Example of low-dimensional vector representation of the DeepWalk learning vertex

In order to perform model learning using the natural language modeling algorithm, datasets are required, which are the corpus of several sentences and the vocabulary of several words. Conversely, in the DeepWalk algorithm, the corpus is a set of random walk vertex sequences with limited length, and the vocabulary is the vertex of the graph.

The DeepWalk algorithm is divided into two parts—the input (graph) and the output (vector representation)—as shown in the following figure.

  1. 1.

    Generating a vertex sequence through random walk

We define a random walk \(\user2{W}_{{v_{i} }}\) with vertex vi as its root vertex. For graph G, we first perform an even, random sampling on a vertex vi, which is the root vertex of the random walk \(\user2{W}_{{v_{i} }}\). We then perform uniform sampling on the neighbors of the currently sampled vertex until the number of vertices in the random walk reaches the maximum length t. Note that the lengths of vertex sequences in the random walk can be different. Random walk not only captures community information, but also has the following two advantages:

  1. 1.

    Parallel local exploration for the vertex is easy to implement.

  2. 2.

    Global recalculation is not required when minor changes occur locally, thereby facilitating online learning.

  1. 2.

    Skip-Gram

Skip-Gram is a Word2Vec algorithm in NLP [3] and can learn the random walk \(\user2{W}_{{v_{i} }}\) to obtain a vector representation. Given a keyword, Skip-Gram calculates the probability of maximizing the occurrence of surrounding words; that is, it predicts the context. This is explained in more detail in Sect. 7.1. Skip-Gram traverses all possible collocations appearing in the random walk window w. For each collocation, the occurrence probability of neighbor vertices is maximized by each vertex vi and its representation vector Φ(vj) ∈ ℝd. The label dimension is equal to the number |V| of vertices (similar to the one-hot vector), and the number of vertices is generally large. Using Softmax to calculate this probability directly would consume a large amount of computing resources for learning, so instead we can use the hierarchical Softmax [4, 5], which approximates the probability to accelerate the training. Hierarchical Softmax takes the prediction problem and, by assigning vertices to leaf nodes of a binary tree, converts it into maximizing the probability of a path. If we assume that the path to the vertex uk is a sequence (b0, b1, …, \(b_{{\left\lceil {{\text{log}}|V|} \right\rceil }}\)) regarding the tree node, we can obtain the following:

$$P(u_{k} |\user2{\Phi }\left( {\nu _{j} } \right)) = \mathop \prod \limits_{{l = 1}}^{{\left\lceil {{\text{log}}|V|} \right\rceil }} P(b_{l} {\text{~}}|\user2{\Phi }\left( {\nu _{j} } \right))$$
(8.2)

where

  • P(uk|Φ(vj)) is the probability that the vertex uk is the context of the vertex vj;

  • Φ(vj) is the vector representation of the vertex vj; and

  • P(bl|Φ(vj)) is the probability that the lth node in the path of the vertex uk is selected along the binary tree starting from the vertex vj.

For P(uk|Φ(vj)), the calculation of time complexity decreases from O(|V|) to O(log|V|).

Figure 8.3 and the subsequent description provide details about the process used in the DeepWalk algorithm.

Fig. 8.3
figure 3

Process of the DeepWalk algorithm

  1. 1.

    In Fig. 8.3a, a random walk sequence \(\user2{W}_{{v_{{\text{4}}} }}\) with v4 as the root vertex is obtained.

  2. 2.

    In Fig. 8.3b, a sample is generated on a sequence \(\user2{W}_{{v_{{\text{4}}} }}\) by continuously sliding the window (with a length of 2w + 1). If we assume that the vertex in the window is [1, 3, 5] and the sample is {(1, 3), (1, 5)}, we can conclude that the center vertex v1 is mapped to its vector representation Φ(v1).

  3. 3.

    In Fig. 8.3c, Hierarchical Softmax decomposes P(v3|Φ(v1)) and P(v5|Φ(v1)) into probability distribution that corresponds to the path from the root to v3 and v5. It maximizes the two probabilities by updating Φ, which is the vertex representation matrix that needs to be calculated.

8.2.2 Implementation of the DeepWalk Algorithm

This section builds on the theoretical information provided earlier by outlining the pseudocode necessary to implement DeepWalk and Skip-Gram.

Algorithm 8.1

Pseudocode for Implementing DeepWalk

figure a

Algorithm 8.2

Pseudocode for Implementing Skip-Gram

figure b

8.3 LINE Algorithm

DeepWalk performs well on many datasets because it is a graph embedding method based on random walk. However, the DeepWalk algorithm considers the similarities between points based on only the explicit connections between such points (e.g., points 6 and 7 in Fig. 8.4); it ignores the possibility that similarities may exist between points that are not connected in the information network. In Fig. 8.4, for example, there is no direct connection between points 5 and 6. But there are similarities between them, because they share points 1, 2, 3, and 4.

Fig. 8.4
figure 4

Information networkFootnote

Source: http://www.www2015.it/documents/proceedings/proceedings/p1067.pdf.

We can use an analogy here: If two people have many mutual friends, we can assume that those two people probably have common hobbies and habits. By carefully designing the loss function and considering the similarities between points 6 and 7 as well as those between points 5 and 6, we can ensure that the vector representation obtained by the LINE algorithm [6] retains information about both the local and global network architectures.

The LINE algorithm has strong universality and can be used for both directed and undirected graphs. Furthermore, it can be used for both weighted and unweighted graphs. The following sections provide a brief overview of the LINE algorithm and pseudocode for its implementation.

8.3.1 Principles of the LINE Algorithm

Directly connected points always exhibit similarities between them. If there is a direct connection between two vertices, the weight wij of the edge connecting the two vertices represents a first-order similarity, which is a direct similarity between the pairs of vertices. Conversely, if no direct connection exists between two vertices, the first-order similarity is 0. Figure 8.4 shows a first-order similarity, between points 6 and 7.

The first-order similarity applies only to undirected graphs. For each undirected edge (i, j) in an information network, the joint probability between the vertices vi and vj is defined as follows:

$$p_{1} (v_{i} ,{\text{ }}v_{j} ) = \frac{1}{{1 + \exp ( - \user2{\Phi }(v_{i} )^{{\text{T}}} \bullet \user2{\Phi }(v_{j} ))}}$$
(8.3)

where

  • p1 (vi, vj) is the joint probability between the vertices vi and vj;

  • Φ (vi) is the low-dimensional vector representation of the vertex vi, and Φ (vi)∈ℝd.

This formula defines probability distribution p \((\bullet, \bullet)\) in |V| × |V|, where V denotes the number of vertices. \(\hat{p}_{1} = \frac{{w_{{ij}} }}{\user2{W}}\) is the empirical probability, where \(W = \sum\nolimits_{{(i,j) \in E}} {w_{{ij}} }\). To preserve the first-order similarity, we can use Kullback–Leibler (KL) divergence to define the following objective function:

$$D_{{{\text{KL}}}} \left( {\hat{p}_{1} ( \bullet , \bullet ),p_{1} ( \bullet , \bullet )} \right) = \sum\limits_{{(i,j) \in E}} { \hat{p}_{{\text{1}}} } (v_{i} ,v_{j} )({\text{log}}\hat{p}_{{\text{1}}} (v_{i} ,v_{j} ) - {\text{log}}p_{1} (v_{i} ,v_{j} ))$$
(8.4)

where

  • DKL(\(\hat{p}_{1}\) \((\bullet, \bullet)\), p1\((\bullet, \bullet)\)) is the KL divergence of the empirical joint probability distribution and the ground truth joint probability distribution;

  • \(\hat{p}_{1}\) \((\bullet, \bullet)\) is the empirical joint probability distribution between vertices; and

  • p1 \((\bullet, \bullet)\) is the ground truth joint probability distribution between vertices.

After removing constant terms from Formula (8.4), we can obtain the first-order similarity objective function:

$${\rm O}_{1} = - \sum\limits_{{(i,j) \in E}} {w_{{ij}} } {\text{log}}p_{1} (v_{i} ,v_{j} )$$
(8.5)

where

  • O1 is the first-order similarity objective function of the LINE algorithm;

  • wij is the edge weight between the vertices vi and vj; and

  • p1(vi, vj) is the ground truth joint probability between the vertices vi and vj.

As mentioned earlier, the first-order similarity represents the similarity between the points that have direct connections. However, the information network may contain many other points that have no direct connections. For such cases, the second-order similarity is defined to cover the similarity between the neighbor network architectures of the vertices u and v. If we use \(\user2{p}_{u} = {\text{ }}\left( {w_{{u,1}} ,w_{{u,2}} ,{\text{ }} \ldots ,w_{{u,|V}} } \right)\) to indicate the first-order similarity between the vertex u and all other vertices, we can conclude that the similarity between pu and pv is the second-order similarity between the vertices u and v. Similar to the first-order similarity, the second-order similarity between the vertices u and v is 0 if no vertex is connected to both u and v. Figure 8.4 shows a second-order similarity, between points 5 and 6.

For the second-order similarity, each vertex exists not only as itself but also as the context of other vertices, meaning that two additional vectors are required: Φ(vi) and Φ(vi). Φ(vi) is the vector representation of the vertex vi when it is regarded as itself, and Φ(vi)is the vector representation of the vertex vi when it is regarded as the context of other vertices. The second-order similarity can be used for both directed graphs and undirected graphs. In the information network, one undirected edge can be regarded as two directed edges, so for any directed edge (i, j), the probability that the vertex vjbecomes the context of vi is defined as:

$$p_{2} (v_{j} |v_{i} ) = \frac{{{\text{exp}}(\user2{\Phi }(v_{j} )^{{'{\text{ T}}}} \bullet \user2{\Phi }(v_{i} ))}}{{\sum\nolimits_{{k = 1}}^{{|V|}} {{\text{exp}}(\user2{\Phi }(v_{k} )^{{'{\text{ T}}}} \bullet \user2{\Phi }(v_{i} ))} }}$$
(8.6)

where

  • p2(vj | vi) is the probability that the vertex vj becomes the context of vi;

  • Φ(vi)T is the low-dimensional vector representation of vj as the context; and.

  • |V| is the number of vertices in the information network.

For each vertex vi, Formula (8.6) defines conditional distribution p2(\(\bullet\) | vi). Its empirical distribution \(\hat{p}_{2} \left( { \bullet {\text{ }}|v_{i} } \right)\) is defined as \(\hat{p}_{2}\) \(\left( {v_{j} |v_{i} } \right){\text{ }} = \frac{{w_{{ij}} }}{{d_{i} }}\), where \(d_{i} = \sum\nolimits_{{k \in N(i)}} {w_{{ik}} }\) (which is the out-degree of the vertex vi). To preserve the second-order similarity, we can use KL divergence to define the following objective function:

$$D_{{{\text{KL}}}} \left( {\hat{p}_{1} ( \bullet , \bullet ),p_{1} ( \bullet , \bullet )} \right) = \sum\limits_{{(i,j) \in E}} {d_{i} } \hat{p}_{2} (v_{j} |v_{i} )({\text{log}}\hat{p}_{2} (v_{j} |v_{i} ) - {\text{log}}p_{2} (v_{j} |v_{i} ))$$
(8.7)

where

  • DKL \({\text{(}}\hat{p}_{1} \left( { \bullet {\text{ }},{\text{ }} \bullet } \right),p_{{\text{1}}} \left( { \bullet {\text{ }},{\text{ }} \bullet } \right))\) is the KL divergence of the empirical joint probability distribution and the ground truth probability distribution between the vertices used as contexts;

  • \(\hat{p}_{2}\) (vj | vi) is the empirical probability that the vertex vj becomes the context of the vertex vi; and.

  • di is the out-degree of the vertex.

After removing constant terms from Formula (8.7), we can obtain the second-order similarity objective function:

$${\rm O}_{2} = - \sum\limits_{{(i,j) \in E}} {w_{{ij}} \log p_{2} (v_{j} |} v_{i} )$$
(8.8)

To preserve both the first- and second-order similarities, the LINE algorithm minimizes O1 and O2 and then concatenates the low-dimensional vectors obtained based on O1 and O2. This makes it possible to obtain the low-dimensional vector representation Φ(vi) of each vertex vi.

If O2 is minimized directly, we must calculate the sum of all vertices when calculating the conditional distribution p2(\(\bullet\) | vi), resulting in the time complexity of minimizing O2 reaching O(|V|2). Here, the objective function that defines negative sampling becomes:

$${\text{log}}\sigma (\user2{\Phi }(v_{j} )^{{'{\text{ T}}}} \bullet \user2{\Phi }(v_{i} )) + \sum\limits_{{n = 1}}^{K} {E_{{n\sim P_{{\text{n}}} {\text{(}}v{\text{)}}}} } [{\text{log}}\sigma ( - \user2{\Phi }(v_{n} )^{{'{\text{ T}}}} \bullet \user2{\Phi }(v_{i} ))]$$
(8.9)

where

  • σ is the sigmoid function, and \(\sigma \left( x \right){\text{ }} = \frac{1}{{1 + {\text{exp}}\left( x \right)}}\);

  • K is the number of negative samples in each data sampling; and

  • Pn(v) ∝ \(d_{v}^{{3/4}}\).

If we replace logp2(vj| vi) with the objective function for negative sampling, the objective function of the second-order similarity becomes:

$$O_{2} = - \sum\limits_{{(i,j) \in E}} {w_{{ij}} \left\{ {{\text{log}}\sigma (\user2{\Phi }(v_{j} )^{{'{\text{ T}}}} \bullet \user2{ \Phi }(v_{i} )) + \sum\limits_{{n = 1}}^{K} {E_{{v_{n} jP_{{^{{_{n} }} }} (v)}} } [{\text{log}}\sigma ( - \user2{\Phi }(v_{n} )^{{'{\text{ T}}}} \bullet \user2{\Phi }(v_{i} ))]} \right\}}$$
(8.10)

In addition, when O1 is minimized directly, uik = ∞, where i = 1, 2, …, |V|; and k = 1, 2, …, d. To avoid uik = ∞, we need to change the objective function by performing negative sampling:

$$O_{1} = - \sum\limits_{{(i,j) \in E}} {w_{{ij}} \left\{ {{\text{log}}\sigma (\user2{\Phi }(v_{j} )^{T} \bullet \user2{\Phi }(v_{i} )) + \sum\limits_{{n = 1}}^{K} {E_{{v_{n} \sim P_{{^{{_{n} }} }} (v)}} } [\log \sigma ( - \user2{\Phi }(v_{n} )^{T} \bullet \user2{\Phi }(v_{i} ))]} \right\}} {\text{ }}$$
(8.11)

Regardless of whether O1 or O2 is minimized, the objective function includes wij, which appears in the gradient when we use the gradient descent method for minimization. For different edges, wij may vary significantly, making it difficult to select an appropriate learning rate. If we select a higher learning rate, gradient explosion may occur on the edge with a larger wij. Conversely, if we select a lower learning rate, gradient disappearance may occur on the edge with a smaller wij. To overcome this conundrum, we therefore need to perform edge sampling for optimization, by using the Alias method to sample the original weighted edges. The probability of each sampled edge is proportional to the weight of the edge in the original graph, and the sampled edge weight is used as a binary edge (the weight is 0 or 1). This solves the problem of wij differing for different edges.

8.3.2 Implementation of the LINE Algorithm

This section builds on the theoretical information provided earlier by outlining the pseudocode necessary to implement the LINE algorithm.

Algorithm 8.3

Pseudocode for Implementing the LINE Algorithm

figure c

8.4 Node2Vec Algorithm

The DeepWalk and LINE algorithms described in Sects. 8.2 and 8.3, respectively, depend on the strict concept of network neighborhood and both are insensitive to the network-specific connection mode. The DeepWalk algorithm samples the vertex neighborhood by using depth-first search (DFS) random walk, whereas the LINE algorithm samples vertices by using breadth-first search (BFS). As shown in Fig. 8.5, the Node2Vec algorithm—a graph embedding method and an extension of the DeepWalk algorithm—integrates both DFS and BFS. In Fig. 8.5, the neighborhood for DFS is composed of vertices sampled in ascending order of distance to the source vertex u (e.g., s4, s5, and s6), whereas that for BFS is limited to the vertices adjacent to the source vertex u (e.g., s1, s2, and s3).

Fig. 8.5
figure 5

DFS and BFS policies from source vertex uFootnote

Source: https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf.

8.4.1 Principles of the Node2Vec Algorithm

The Node2Vec algorithm, proposed by Aditya Grover in 2016 [7], is used to learn the continuous vector representations of a graph’s vertices. Compared with both the DeepWalk and LINE algorithms described earlier, Node2Vec can effectively explore different neighborhoods (homogeneity and structural equivalency) by designing a biased random walk process for vertices, allowing it to learn more comprehensive representations of the vertices. The Node2Vec algorithm functions in a similar way to the DeepWalk algorithm and can be divided into two processes: biased random walk and learning vector representations.

  1. 1.

    Biased random walk

The biased random walk process is implemented by assigning different sampling probabilities to different vertices. If we assume that the source vertex is u, the random walk length is l, the ith vertex is ci, and the start vertex is c0 = u, we can use the following formula to calculate the sampling probability of the vertex ci:

$$P(c_{i} = x|c_{{i - 1}} = v) = \left\{ {\begin{array}{*{20}l} {\frac{{\pi _{{vx}} }}{Z},} \hfill & {(v,x) \in \user2{E}} \hfill \\ {0,} \hfill & {{\text{else}}} \hfill \\ \end{array} } \right.$$
(8.12)

where

  • πvx is the unnormalized transition probability between the vertex v and the vertex x; and

  • Z is a normalization constant.

To implement the biased random walk process, Node2Vec introduces p and q—two parameters that are used in calculating the transition probability in Formula (8.12). If we assume that the walk is performed through edge (t, v) to the vertex v, we can calculate the transition probability of the edge (t, v) on the basis of v, allowing us to determine the next vertex for the walk. We use the following unnormalized transition probability:

$$\pi _{{vx}} = \alpha _{{pq}} (t,x) \bullet w_{{vx}} \alpha _{{pq}} (t,x) = \left\{ {\begin{array}{*{20}c} {\frac{1}{p}} & {d_{{tx}} = 0} \\ {\text{1}} & {d_{{tx}} = 1} \\ {\frac{1}{q}} & {d_{{tx}} = 2} \\ \end{array} } \right.$$
(8.13)

where dtx is the shortest path distance from the vertex t to the vertex v.

dtx must be one of {0, 1, 2} so that the two parameters p and q can adequately control the walk process, as shown in Fig. 8.6. The return parameter p controls the probability of subsequently accessing the vertex in the previous step. If the current vertex is v, and p is greater than max(q, 1), the probability of accessing the vertex t in the previous step will decrease. Conversely, the probability will increase if p is less than min(q, 1). The “in–out” parameter q controls whether the walk process is more like BFS or DFS. A larger q indicates that the vertex for the random walk is closer to the vertex t, meaning that the walk process is more like BFS. Conversely, a smaller q indicates that the vertex is farther away, which is more like DFS.

Fig. 8.6
figure 6

Example of random walk

Each sampling step in biased random walk is based on the transition probability πvx. We can pre-calculate this probability by using the Alias sampling method [8, 9] and then use it directly in the sampling process. In this case, the sampling complexity is O(1), meaning that the walk process of the algorithm is faster.

  1. 2.

    Vector representations of vertices

The biased random walk process enables us to obtain a set of vertex sequences for random walk. Now we will introduce learning vector representations of vertices.

For a given graph G = (V, E), the mapping function from a vertex to a vector representation is f: V → ℝd, where d is the dimension of a representation vector. For each vertex u ∈ V, an algorithm is used to learn the vertex vector representation. This enables us to subsequently optimize the following formula of the objective function:

$$f^{*} = \mathop {{\text{argmax}}}\limits_{f} \sum\limits_{{u \in V}} {{\text{log}}P(N_{s} (u)|f(u))}$$
(8.14)

where

  • f(u) is the vector representation of the vertex u; and

  • NS(u) is the neighborhood of the vertex u under the neighborhood sampling strategy S.

However, due to the complexity involved in solving the optimization problem, we introduce the following two assumptions:

  1. 1.

    Conditional independence assumption:

    $$P(N_{S} (u)|f(u)) = \prod\limits_{{n_{i} \in N_{S} (u)}} {P(n_{i} |f(u))}$$
    (8.15)

    where ni is the vertex in the neighborhood of the vertex u under the neighborhood sampling strategy S.

  2. 2.

    Feature space symmetry assumption:

    $$P(n_{i} |f(u)) = \frac{{{\text{exp}}(f(n_{i} ) \bullet f(u))}}{{\sum\nolimits_{{v \in V}} {{\text{exp}}(f(v) \bullet f(u))} }}$$
    (8.16)

Given these two assumptions, we can simplify the objective function in Formula (8.14) as follows:

$$\begin{gathered} \frac{{\partial {\text{E}}}}{{\partial u_{j} }} = t_{j} - y_{j} = e_{j} f^{*} \\ = \mathop {{\text{argmax}}}\limits_{f} \sum\limits_{{u \in V}} {[ - {\text{log}}Z_{u} + \sum\limits_{{n_{i} \in N_{S} (u)}} {f(n_{i} )} \bullet f(u)]} \\ \end{gathered}$$
(8.17)

where Zu is the partition function of each vertex, and Zu = \(\sum_{{v \in V}} {{\text{exp}}(f(u) \bullet } (f(v)).\)

By using negative sampling, we are able to minimize the calculation costs of the partition function. And to optimize Formula (8.17), we can use SGD—similar to training a neural network—in order to continuously learn the parameters of mapping f to obtain the vector representation of each vertex.

8.4.2 Implementation of the Node2Vec Algorithm

Similar to the DeepWalk algorithm, the Node2Vec algorithm is mainly used to generate random walk sequences and learn vector representations. In this section, we describe how to implement the Node2Vec algorithm through pseudocode.

Algorithm 8.4

Pseudocode for Implementing Node2Vec

figure d

8.5 GCN Algorithm

In the field of computer vision, CNNs achieve good results because discrete convolution can effectively extract spatial features. For low-dimensional matrices (such as images or videos) where pixels are ordered, CNNs calculate weighted summation of center and adjacent pixels to extract spatial features. But when given high-dimensional graph data that lacks an ordered structure, CNNs find it difficult to process the data. In order to solve this problem, Bruna et al. proposed GCNs, which aggregate the vertex information of irregular graph data.

The approaches that apply convolution to graph domain can be divided into spectral and non-spectral. The GCN is a spectral approach that, by leveraging the spectral graph theory, implements convolution operations on topologies, and uses the Laplacian matrix to move convolution operations in the spatial domain to the spectral domain. By representing any vector on a graph as a linear combination of Laplacian eigenmatrices, the features of the graph can be extracted in the spectral domain. This results in the GCN being more effective than the non-spectral approach, which directly extracts the features in the spatial domain. Furthermore, because the GCN model can extract information about the entire graph in one go, and the parameters of the filter can be shared at all positions in the graph [10], there is no need to calculate the parameters of the filter for each vertex. This in turn significantly reduces the complexity of the model.

Building on the first generation of GCN, Defferranrd et al. proposed to replace the convolution kernel with the Chebyshev polynomial summation [11]. This method enables us to obtain a smooth filter in the frequency domain while reducing the model complexity. Subsequently, numerous approaches for replacing the convolution kernel with mathematical transformation have emerged. In the model described in the following section, Kipf and Welling limit the filter to run in the first-order neighborhood around each vertex, thereby reducing the calculation costs, increasing the network efficiency, and improving the model accuracy.

8.5.1 Principles of the GCN Algorithm

The formula of the two-layer GCN model selected by Kipf and Welling is as follows:

$$Z = f({\mathbf{X}},{\mathbf{A}}) = \text{Softmax} \left( {\user2{\hat{A}}{\text{ReLU}}\left( {\user2{\hat{A}XW}^{{(0)}} } \right)\user2{W}^{{(1)}} } \right)$$
(8.18)

where \(\hat{A} = \tilde{\user2{D}}^{{ - \frac{1}{2}}} \tilde{\user2{A}D}^{{ - \frac{1}{2}}} ,\tilde{\user2{D}}_{{ii}} = \sum {_{j} } \tilde{\user2{A}}i_{j} {\text{, and }}\tilde{\user2{A}} = \user2{A} + \user2{I}_{N}\);

  • A is the adjacency matrix of a graph;

  • Wl represents W(0) and W(1), which are weight parameters; and

  • X is the vertex eigenmatrix of the graph.

The following explains the origin of Formula (8.18), starting with the formula:

$$\user2{L} = \user2{I}_{{\text{N}}} - {\text{D}}^{{ - 1/2}} \user2{LD}^{{ - 1/2}}$$
(8.19)

We use this formula to define the symmetric normalized Laplacian matrix, where D is the vertex degree matrix.

The Laplacian matrix is then decomposed to obtain the following:

$$\user2{L} = \user2{U\Lambda U}^{{\text{T}}}$$
(8.20)

where

  • U is the normalized Laplacian eigenvector matrix (i.e., a spectral matrix), and

  • Λ is the corresponding eigenvalue matrix (a diagonal matrix).

The spectral convolution on the graph may be defined as the product of the signal x ∈ ℝ and the filter diag(θ) (θ ∈ ℝ) in the Fourier domain, that is:

$$q_{\vartheta } \star \user2{x} = {\text{ }}\user2{U}q_{\vartheta } {\text{ }}\user2{U}^{T} {\text{ }}\user2{x}$$
(8.21)

where

  • UTx is the graph Fourier transform of x; and

  • gθ is the function of the eigenvector \(\mathcal{L}\), that is, gθ(Λ).

Computing the Laplacian eigenmatrix involves substantial overheads if the graph data is large, so to reduce the calculation complexity, we can use the K-order truncation of Chebyshev polynomial and thereby approximate gθ(Λ).

$$g_{{\theta '}} (\Lambda ) \approx \sum\limits_{{k = 0}}^{K} {\user2{\theta }_{k}^{'} T_{K} (} \tilde{\user2{\Lambda }})$$
(8.22)

where

  • \(\tilde{\user2{\Lambda }}\) is the eigenvector matrix after scaling is performed based on the maximal eigenvalue λmax, of \(\mathcal{L}\), and \({\tilde{\mathbf{\Lambda }}} = {\text{ 2}}/\lambda _{{{\text{max}}}} \bullet {\mathbf{\Lambda }} - I_{{\text{N}}}\); and

  • θ is the Chebyshev parameter vector, where θ′ ∈ ℝK. The Chebyshev polynomial is defined by recursion: Tk(x) = 2xTk-1Tk-2(x), where T0(x) = 1 and T1 (x) = x.

By replacing gθ with gθ, we can obtain:

$$\begin{aligned} {\text{g}}_{{\theta ^{\prime } }} \star \user2{x} & {\mathbf{ \approx }}z\user2{U}\sum\limits_{{k = 0}}^{K} {\theta ^{'} T_{k} (\tilde{\user2{\Lambda }})\user2{U}^{T} \user2{x}} \\ & = \sum\limits_{{k = 0}}^{K} {\user2{\theta }^{'} \user2{U}T_{k} (\tilde{\user2{\Lambda }})\user2{U}^{T} \user2{x}} \\ \end{aligned}$$
(8.23)

Tk(\({\tilde{\mathbf{\Lambda }}}\)) is a k-order polynomial of Λ, and U \({\tilde{\mathbf{\Lambda }}}\) kUT = (U \({\tilde{\mathbf{\Lambda }}}\) UT) k =\(\tilde{\user2{L}}\) k, where \(\tilde{\user2{L}}\) = \(\frac{2}{{\lambda _{{{\text{max}}}} }}\) LIN. Formula (8.22) can therefore be expressed as follows:

$$g_{{\theta ^{\prime } }} \star \user2{x}{\mathbf{ \approx }}\sum\limits_{{k = 0}}^{K} {\user2{\theta }^{'} T_{k} (\tilde{\user2{L}})\user2{x}}$$
(8.24)

With the Chebyshev polynomial approximation, the spectral convolution is no longer dependent on the entire graph. Instead, it is related to only the k-order vertices (i.e., the kth-order neighborhood) of the center vertex.

After we perform the Chebyshev polynomial approximation, we can consider each convolution operation as aggregating the k-order neighbor information for each center vertex. Even so, the calculation amount remains high after approximation because the graph structure data is large. In order to reduce the calculation costs, we can further simplify the calculation by letting k = 1, meaning that the information of only first-order neighbors is aggregated at any given time. In this case, the spectral convolution can be approximated as a linear function of \(\tilde{\user2{L}}\). As mentioned earlier, only the dependence between the center vertex and the first-order neighbor is established. In order to solve this problem, a stacked GCN must be used to establish the dependence of k-order neighbors. We are able to obtain the first-order neighbor information of the second-order graph convolution after superimposing the first-order neighbor of the second-order graph convolution. This means that the center vertex will obtain the second-order neighbor information through the first-order neighbor of the second-order graph convolution, and so on. Furthermore, the Chebyshev polynomial does not limit the dependence of k-order neighbors when this dependence is established.

We can further simplify the calculation. In the linear model of the GCN, we can obtain the following first-order linear approximate expression of spectral convolution by defining λmax ≈ 2:

$$\begin{aligned} {\text{g}}_{{\theta ^{\prime } }} \star \user2{x} & {\mathbf{ \approx }}\user2{\theta }_{0}^{'} \user2{x} + \user2{\theta }_{1}^{'} (\user2{L} - \user2{I}_{N} )\user2{x} \\ & \user2{ = \theta }_{0}^{'} \user2{x} - \user2{\theta }_{1}^{'} \user2{D}^{{ - \frac{1}{2}}} \user2{AD}^{{ - \frac{1}{2}}} \user2{x} \\ \end{aligned}$$
(8.25)

Formula (8.25) includes only two parameters: \(\user2{\theta }_{0}^{'}\) and \(\user2{\theta }_{1}^{'}\). So to establish the dependence of k-order neighbors, we can use a k-layer filter.

We limit the number of parameters to avoid overfitting and minimize the matrix multiplication of each layer to reduce the calculation complexity. If we let θ = \(\user2{\theta }_{0}^{'}\) =  \(-\user2{\theta }_{1}^{'}\), we can express Formula (8.25) as follows:

$${\text{g}}_{\theta } \star \user2{x} \approx \user2{\vartheta }\left( {\user2{I}_{N} + \user2{D}^{{ - \frac{1}{2}}} A\user2{D}^{{ - \frac{1}{2}}} } \right)\user2{x}$$
(8.26)

The eigenvalue range of IN + \(\user2{I}_{N} + \user2{D}^{{ - \frac{1}{2}}} A\user2{D}^{{ - \frac{1}{2}}}\) is [0, 2], meaning that when the operation is repeated continuously (in very deep networks), gradient explosion or disappearance may occur. To avoid this problem, the renormalization trick is introduced:

$$\user2{I}_{N} + \user2{D}^{{ - \frac{1}{2}}} A\user2{D}^{{ - \frac{1}{2}}} \to \widetilde{\user2{D}}^{{ - \frac{1}{2}}} \widetilde{\user2{A}}\widetilde{\user2{D}}^{{ - \frac{1}{2}}}$$
(8.27)

where \({\tilde{\mathbf{A}}}\) = A + IN, \(\widetilde{\user2{D}}_{{ii{\text{ }} = }} \sum\nolimits_{j} {{\tilde{\mathbf{A}}}}\) When the representation of each vertex in a graph is not a separate scalar but is instead a vector of size C, we can use its variants for processing:

$$\user2{Z} = \widetilde{\user2{D}}^{{ - \frac{1}{2}}} \widetilde{\user2{A}}\widetilde{\user2{D}}^{{ - \frac{1}{2}}} \user2{X\vartheta }$$
(8.28)

where

  • θ is a parameter matrix, and θ ∈ ℝC × F; and

  • Z is the corresponding convolution result, and ∈ ℝN×F

In this case, the vertex representation of each vertex is updated to a new F-dimensional vector that includes the information of the corresponding first-order neighbor.

We are now able to obtain the layer-by-layer propagation expression of the graph CNN:

$$\user2{H}^{{(l + 1)}} = \sigma \left( {\widetilde{\user2{D}}^{{ - \frac{1}{2}}} {\text{ }}\widetilde{\user2{A}}\widetilde{\user2{D}}^{{ - \frac{1}{2}}} \user2{H}^{{(l)}} \user2{W}^{{(l)}} } \right)$$
(8.29)

where the input of the l-layer network is H(l), and H(l) ∈ ℝN×F (initial input is H(0) = X);

  • N is the number of vertices in the graph, and each vertex is represented by a d-dimensional eigenvector;

  • W(l) is the weight parameter that needs to be trained, and W(l∈ ℝd×d; and

  • σ is the activation function.

Through this derivation, we are able to obtain the GCN architecture:

$$\user2{Z} = f(\user2{X},\user2{A}) = \text{Softmax} \left( {\user2{\hat{A}}{\text{ReLU}}\left( {\user2{\hat{A}XW}^{{(0)}} } \right)\user2{W}^{{(1)}} } \right)$$
(8.30)

8.5.2 Implementation of the GCN Algorithm

This section describes how to implement the GCN algorithm through pseudocode.

Algorithm 8.5

Pseudocode for Implementing GCN

figure e

8.6 GAT Algorithm

For the spectral approaches represented by the GCN, each calculation relies on the Laplacian matrix eigenvector and graph structure. This makes it difficult to apply a GCN model on other graphs once it has been trained on a particular graph structure. Furthermore, because the GCN model lacks the inductive ability, it has a limited scope for application.

For the non-spectral approaches, convolution is defined directly on the graph to operate adjacent vertices in space. However, such approaches are problematic, in that it is challenging to define an operation that can handle neighbors of different sizes while also ensuring that CNN parameters can be shared. To address this challenge, researchers have made a series of improvements [12,13,14,15]. For example, in 2017, Hamilton et al. proposed a classical inductive learning algorithm called GraphSAGE. In GraphSAGE, sampling is performed based on the neighborhood of a fixed size, and each vertex is represented by an aggregate of its neighbors. This means that a vertex not present during the training can still be appropriately represented by its neighbor vertices if it subsequently appears at a later stage. GraphSAGE has shown promising results in several large-scale induction benchmark tests.

In practice, however, the impacts of neighbor vertices on the target vertices are different. The methods referred to earlier do not take into account that fact that different neighbors are of the same importance. In 2018, Petar et al., taking inspiration from the attention mechanism widely used in deep learning models, proposed the graph attention network (GAT) [16], a graph data vertex classification model based on the attention mechanism [16]. The attention mechanism imitates human intuition, focusing on salient parts helpful for the target task while ignoring other invalid information. Similarly, the GAT pays attention to its neighbors and determines the weights of the neighbor vertices through the self-attention strategy. Different neighbor vertices have different impacts on the target vertices, allowing the hidden representation of each target vertex to be calculated more effectively.

8.6.1 Principles of the GAT Algorithm

This section focuses on the graph attention layer, which is an important component of the graph attention network. Here we make the following assumptions:

  • The input of the current attention layer is a set of vertex features: h = {\(\vec{\user2{h}}\) 1, \(\vec{\user2{h}}\) 2, …, \(\vec{\user2{h}}\) N}, \(\vec{\user2{h}}\) i ∊ ℝF, where N is the number of vertices, and F is the number of features of each vertex.

  • The output of the attention layer is a new set of vertex features: h′ = {\(\vec{\user2{h}}_{1}^{'}\) 1, \(\vec{\user2{h}}_{2}^{'}\), …, \(\vec{\user2{h}}_{N}^{'}\)}, \(\vec{\user2{h}}_{i}^{'}\) ∈ ℝF.

Because we need at least one nonlinear transformation so that we can convert input features into higher-level features, we perform linear transformation on each vertex, and then use the self-attention mechanism a to calculate the attention correlation coefficient eij. This coefficient indicates the importance of the feature of vertex j to vertex i.

$$e_{{ij}} = a\left( {{\user2{W}\vec{\user2{h}}}_{i} ,{\user2{W}\vec{\user2{h}}}_{j} } \right)$$
(8.31)

where

  • W is the weight matrix, and W ∈ ℝF×F; and

  • a is the self-attention mechanism a: ℝF×ℝF→ℝ.

We then introduce the attention mechanism into the graph structure through masked attention: eij is calculated only for vertices of j ∈ Ni, where Ni is a neighbor vertex of vertex i (and includes vertex i itself). To normalize all neighbor vertices j of vertex i, making it easier to compare the coefficients of different vertices, we use the Softmax function:

$$a_{{ij}} = {\text{Softmax}}_{j} (e_{{ij}} ) = \frac{{{\text{exp}}(e_{{ij}} )}}{{\sum\nolimits_{{k \in N_{i} }} {{\text{exp}}(e_{{ik}} )} }}$$
(8.32)

The attention mechanism a may be a single-layer feedforward neural network, which is determined by \(\vec{a}\) ∈ ℝ2F and LeakyReLU nonlinear activation function (slope a = 0.2 when the input is negative). Figure 8.7a shows the network architecture of the attention mechanism.

Fig. 8.7
figure 7

Network architecture of the attention mechanismFootnote

Source: https://arxiv.org/pdf/1710.10903.pdf.

In order to calculate the attention coefficient, we use the following formula:

$$a_{{ij}} = \frac{{\exp ({\text{LeakyReLU}}(\vec{a} ^{T} [\user2{W}\vec{\user2{h}} _{i} ||\user2{W}\vec{\user2{h}} _{j} ]))}}{{\sum\nolimits_{{k \in N_{i} }} {\exp ({\text{LeakyReLU}}(\vec{a} ^{T} [\user2{W}\vec{\user2{h}} _{i} ||\user2{W}\vec{\user2{h}} _{k} ]))} }}$$
(8.33)

where \(\bullet\) T indicates the transpose operation, and || indicates the concatenation operation.

By performing the preceding calculation, we are able to obtain the normalized attention coefficient, which we can subsequently use to calculate the output features of each vertex.

$$\vec{\user2{h}} _{i}^{'} = \sigma \left( {\sum\limits_{{j \in N_{i} }} {a_{{ij}} \user2{W}\vec{\user2{h}} _{j} } } \right)$$
(8.34)

Similar to the Transformer model proposed by Vaswani et al., extending the self-attention mechanism to the multi-head attention mechanism improves the calculation stability. The calculation process involves performing K calculations separately based on the self-attention mechanism, and then concatenating the obtained features to obtain the final vertex representation:

$$\vec{\user2{h}} _{i}^{'} = \mathop {||}\nolimits_{{k = 1}}^{K} \sigma \left( {\sum\limits_{{j \in N_{i} }} {\mathop a\nolimits_{{_{{ij}} }}^{k} \user2{W}^{k} \vec{\user2{h}} _{j} } } \right)$$
(8.35)

where

  • \(\mathop a\nolimits_{{_{{ij}} }}^{k}\) is the normalized attention coefficient obtained through the Kth calculation based on the self-attention mechanism; and

  • Wk is the corresponding input weight matrix for linear transformation.

When the multi-head attention mechanism is used at the last layer of the network, concatenation is less effective and so is replaced with an averaging operation. After nonlinear activation, the vertex representation of the multi-head attention layer is obtained:

$$\vec{\user2{h}} _{i}^{'} = \sigma \left( {\frac{1}{K}\sum\limits_{{k = 1}}^{K} {\sum\limits_{{j \in N_{i} }} {\mathop a\nolimits_{{_{{ij}} }}^{k} } } \user2{W}^{k} \vec{\user2{h}} _{j} } \right)$$
(8.36)

Figure 8.7b illustrates the aggregation process of the multi-head graph attention layer (K = 3). Different arrow styles in the graph represent separate attention calculation processes. The representation of a target vertex in the graph attention network is the weighted sum of its first-order neighbor vertices including the target vertex, which is a calculation process on a local graph.

The GAT has a number of advantages, such as high efficiency, flexibility, and portability. To achieve high efficiency, the GAT implements parallel calculation for the local graph vertex neighbor pair. To realize flexibility, it assigns different weights to vertices of different degrees. And in terms of portability, the model can be extended to unknown graphs, representing the ability of inductive learning. The GAT considers the different importance of neighbor vertices to target vertices and has achieved good results in some practical scenarios.

8.6.2 Implementation of the GAT Algorithm

This section describes how to implement the GAT algorithm through pseudocode.

Algorithm 8.6

Pseudocode for Implementing the GAT Algorithm

figure f

8.7 Application: Recommendation System

We are currently in the midst of an information boom driven by the unprecedented popularity of the Internet and the near-ubiquitous use of mobile terminals. In today’s fast-paced world, people want information at their fingertips. But given the vast amounts of information now available, it has become critical to ensure that people can obtain what they want, when they need it. In order to meet such demands, the recommendation system is developed.

The existing recommendation systems employ collaborative filtering, indicating that similar users like the same items and, conversely, the same user likes similar items. This type of filtering is divided into memory-based methods and model-based methods.

Memory-based methods can be further divided into user-based and item-based collaborative filtering. User-based collaborative filtering recommends items to similar users, whereas item-based collaborative filtering recommends similar items to users. In order to implement these two methods, we need to define the similarity between items or users. Although these methods are simple, easy to understand, and easy to implement, a significant amount of time is required to calculate the similarities between each pair of items or users, and to find similar items or users, especially when there are a huge number of items or users.

For highly efficient recommendation systems, one of the most successful methods for implementing collaborative filtering is the model-based matrix factorization. In this model, a user and an item (a user-item pair) are modeled as implicit vectors in the same space based on interactions between the user and item. For an unknown user-item pair, the preferences are calculated based on the vectors of the corresponding user and item (usually through the vector inner product operation). Two popular models in the matrix factorization family are SVD and SVD + + . The SVD model, which is based on the general matrix factorization, improves the stability of the model training by introducing user bias, item bias, and global bias variables. SVD ++ , as an extension of SVD, improves the effectiveness by introducing an auxiliary feature: the interaction history between the user and the item.

8.7.1 Recommendation System in Industry Applications

The recommendation system dates back to as early as the twentieth century, but it wasn’t until the last decade when the industry began adopting it more widely. For example, it is estimated that Amazon sells more than 35% of its listed items through recommendation systems. In addition, by using recommendation systems, Google generated revenue of $43 billion in Internet advertising in 2014, and in 2015, Google Play and Apple’s App Store earned $10 billion and $21 billion, respectively. Huawei—ranked highly in the Fortune Global 500 list—also applies the recommendation system throughout its business operations. The recommendation system involves an extensive range of content, so extensive in fact that we could write an entire book dedicated exclusively to this topic. So, given the space limitations in this book, we will focus on the recommendation system only from the perspective of industry researchers.

Industry recommendation systems consist of three steps: candidate set generation, matching prediction, and sorting. The number of items that these systems recommend may be millions or more, but matching predictions and sorting on such large candidate sets cannot be performed within an acceptable timeframe. As a result, it is necessary to generate a smaller candidate set (typically ranging from hundreds to thousands) based on the current recommendation scenario, the features of the item, and even the user’s preferences. After the candidate set is generated, the matching prediction model predicts the current user’s preference for each item in the candidate set. Ultimately, the sorting step combines the results of the matching prediction model with business rules to generate the final sorting results.

An important part of industry recommendation systems is click through rate (CTR) prediction, which first appeared in online advertisement scenarios and belongs to the matching prediction step described earlier. In online advertisement scenarios using the cost per click (CPC) model, revenue is generated for the platform each time a user clicks on an advertisement. The amount of revenue is specified in a contract between the advertiser and platform. In most cases, the platform uses CTR × bid sorting rules for candidate advertisements, where CTR is an estimate of the current user’s CTR for the advertisement, and bid represents the amount of money the advertiser will pay to the platform if the user clicks on the advertisement. The sorting rules arrange candidate advertisements according to the expected benefits; that is, they are sorted based on the revenue they generate for the platform each time they are displayed. Such rules are also used for real-time bidding advertisements, and similar rules are used in game and video sorting scenarios. Game sorting is generally based on CTR × LTV, where life time value (LTV) is the average fee a user pays for the game. Video sorting generally uses CTR × WT, where watch time (WT) is the average time a user spends watching the video. Given its wide scope of application, CTR prediction is extremely important in the industry recommendation system.

The recommendation system models used by most enterprises have evolved from the wide model—using logistic regression (LR) or factorization machine (FM)—to the deep learning model, and then to the reinforcement learning model in addition to the deep learning model where the graph structure is considered.

To understand the graph neural network-based model, we first need to understand the input data form used in the recommendation system. This input data form differs significantly from that used in NLP and computer vision. The recommendation system covers many discrete features, such as gender, city, and day of the week. Because these features have no numerical meaning, they are typically represented by one-hot encoding. In this encoding method, all possible values are represented by a high-dimensional vector with a value of 0–1, where the corresponding bit is 1 and all other bits are 0. The dimension of the one-hot vector is the number of all possible values. For example, “Friday” can be represented as [0, 0, 0, 0, 1, 0, 0], the gender “male” can be represented as [0, 1], and the city “Shanghai” can be represented as [0, 0, 1, …, 0]. From the preceding information, we can see that the input data of the recommendation system is usually high-dimensional and sparse.

8.7.2 Graph Neural Network Model in a Recommendation System

Ultra-large recommendation systems face several challenges due to the high-dimensional and sparse nature of their input data:

  1. 1.

    Storage: The data is structured and all features are arranged in a certain order, with many of them duplicated. For example, if there are 10,000 male users, the system needs to store 10,000 male-represented vectors, such as [0, 1]. As the number of features, users, and items increases, the amount of duplicated data becomes larger.

  2. 2.

    Sparsity: For movie-recommending scenarios like MovieLens [17], the data is usually represented by a “user-item” score matrix. As the number of users and items increase, the dimensions and sparsity of the score matrix also increase. This is because most users do not score most items, and the collaborative filtering algorithm relies on the score matrix.

  3. 3.

    Scalability: The ability to process ever-increasing volumes of data and the exponential growth of collaborative filtering calculation make it extremely difficult for recommendation systems to scale easily [18].

The graph data structure makes it possible to address these challenges.

  1. 1.

    For repeated storage of features, male can be represented as a vertex on a graph, with all male users having an edge from the user’s vertex to the male vertex. This means that information about only the edge needs to be maintained. For higher-dimensional features, the effectiveness of graph structure storage is more pronounced.

  2. 2.

    In response to the sparsity challenge, graph structure storage is vertex-centric, and only the in-edge and out-edge are maintained.

  3. 3.

    To facilitate scalability, the graph structure makes it easy to add new vertices and edges, requiring the model to be updated only for the new additions.

PinSage [18], jointly published by Pinterest and Stanford University, is the industry’s first commercial end-to-end recommendation model based on the graph neural network. Pinterest, an image-based social networking site, displays images in the form of a waterfall stream, where new images are automatically loaded at the bottom of the page without needing users to change the current page. Users can pin images of interest on the pinboard and can save and share the images, while other users can follow and forward the images. The main items recommended on Pinterest are images (called Pins), which may include images of food, clothes, products, etc. Users group images they like into Boards. Pinterest data can be modeled to construct a bipartite graph, which includes two types of vertices (Pins and Boards). In the bipartite graph, shown in Fig. 8.8, there is no connection edge between vertices of the same type, and the vertex features include images and textual annotations (title and description).

Fig. 8.8
figure 8

Pinterest bipartite graph

In the traditional GCN, the entire graph is used for training. However, in industrial recommendation scenarios such as Pinterest, there are billions of vertices and tens of billions of edges, making it difficult to perform operations if the entire graph is used for training. To solve this problem, PinSage took inspiration from GraphSage to make improvements to the GCN. GraphSage can be considered as a GCN based on random walk and is an inductive variant of the GCN. It learns the vertex representation by sampling the neighbor information of the aggregated vertex in order to avoid operating the entire Laplacian matrix of the graph. This means that GraphSage can be generalized to an unknown vertex if one exists, and its neighbor information can be used to learn the representation of the vertex. The key improvements PinSage made to the GCN are as follows:

  1. 1.

    Local graph convolution is performed by dynamically constructing a new computational graph through random walk (short random walks) sampling of vertex neighbors. Because the importance of different neighbors to the target vertex is different, the neighbor will have an importance score during the information aggregation.

  2. 2.

    Distributed training is performed based on mini-batch. The CPU is used to sample vertex neighbors to obtain the features required for defining local convolution. Through tensor calculation and hardware acceleration, the distributed stochastic gradient descent calculation is performed for each pre-calculated small graph. The convolution operation can be performed separately, and the parameters of each convolutional layer are shared.

  3. 3.

    Repeated calculation of vertex neighbors is eliminated by using related technologies during inference.

The PinSage algorithm uses the local graph convolution to learn the vertex embedding of the web-level graph containing billions of objects, whereby high-quality vertex embedding facilitates subsequent recommendations. The PinSage algorithm can be summarized into two parts.

The first part is convolution, which is shown in Algorithm 8.7. The vertex embedding calculation, vertex neighbors, weights of the vertex neighbors, and aggregate function are used as the input. Through information aggregation, the neighbor embedding (line 1 of the pseudocode) is calculated. Then, the neighbor and vertex embeddings are used to update the current vertex embedding (line 2 of the pseudocode). Finally, the resulting vertex embedding is normalized (line 3 of the pseudocode). The method for sampling vertex neighbors during information aggregation has two advantages: (1) the number of neighbors is fixed and the memory used for calculation is controllable; and (2) different importance of the neighbors to the vertex is used for information aggregation. Each time the convolution operation (Algorithm 8.7) is used to obtain the new embedding of a vertex, more information about the local graph structure around the vertex can be obtained by superposing several such convolutions.

Algorithm 8.7

Convolution.

Input: Embedding zu of the current vertex u, neighbor embedding set {zv|vN (u)}, neighbor weight set α, and aggregate function γ(\(\bullet\))

Output: New embedding \({\mathbf{z}}_{{\text{u}}}^{{{\text{NEW}}}}\) of the vertex u.

(1) Aggregate neighbor information

$$\user2{n}_{u} \leftarrow \gamma \left( {\left\{ {{\text{ReLU}}\left( {\user2{Qh}_{{\text{v}}} + {\text{q}}} \right)|{\text{v}} \in N(u)} \right\},\alpha } \right)$$

(2) Calculate and update the vertex embedding

$$\user2{z}_{u}^{{NEW}} {\text{~}} \leftarrow {\text{~ReLU~}}\left( {\user2{W} \bullet {\text{Concat~}}\left( {\user2{z}_{u} ,{\text{~}}\user2{n}_{u} } \right){\text{~}} + \user2{w}} \right)$$

(3) Normalize the vertex embedding

$$\user2{z}_{u}^{{NEW}} {\text{~}} \leftarrow \user2{~z}_{u}^{{NEW}} {\text{~/}}\left\| {{\text{~}}\user2{z}_{u}^{{NEW}} } \right\|_{2}$$

The second part of the PinSage algorithm is mini-batch, shown in Algorithm 8.8, which stacks convolutions into a mini-batch of vertices M to generate the embedding. The mini-batch vertex neighbor sampling process is performed to obtain the neighbor of each vertex (lines 2–8 of the pseudocode). Then, K convolutions are used to iteratively generate K representations of the target vertex (lines 9–16 of the pseudocode). Finally, the vertex embedding is obtained through learning (based on the previously obtained embedding) by using a fully connected neural network (lines 17–19 of the pseudocode). G1, G2, and g are the parameters of the fully connected layer.

Algorithm 8.8

Mini-batch.

figure g

PinSage has achieved positive results and encouraged the use of the graph convolution algorithm in commercial recommendation systems. In the future, graph neural networks can be expanded to solve the learning problems of other large-scale graph representations and generate greater value in real-world scenarios.