1 Introduction

Graph analysis provides powerful insights that help unlock the value held in graph data. Owing to this power, techniques for analyzing graphs are becoming an increasingly popular topic of study in both academia and industry. To effectively and efficiently support important analytic tasks on graph data, such as node/graph classification, node clustering, community detection, node recommendation, link prediction and graph visualization, a variety of graph embedding techniques have been developed (see [5, 11] for comprehensive surveys). Graph data is mapped into a low-dimensional space such that the proximity relationships among graph nodes (i.e., objects) are preserved, and off-the-shelf machine learning methods, which are designed to handle vector representations, can be applied immediately.

The existing graph embedding techniques can be roughly classified into three broad categories: (1) random walk based embedding (e.g., DeepWalk [26] and node2vec [9]); (2) node similarity based embedding (e.g., LINE [38] and NetMF [30]); and (3) graph neural network (GNN) based embedding (e.g., GCN [15], GraphSage [10], GAT [40] and AS-GCN [12]). As reported by Leskovec et al. in their tutorial on graph embedding at WWW 2018, the first two categories of embedding techniques are only able to learn a "shallow" representation of the graph nodes due to the simplicity of the models. It is shown in [10, 15] that neural network based embedding methods significantly outperform the state-of-the-art techniques in the first two categories on the node classification task. Therefore, exploring how to use neural networks to create a "deep" representation more efficiently is a promising direction in graph representation learning. However, most existing graph neural network models suffer from scalability issues due to the high time and space costs of real-valued models.

Recently, there has been some research on learning binary graph embeddings (e.g., [20, 37, 46]), in which each node is represented by a binary vector (code) instead of a real-valued vector. It has been shown that binarized graph embeddings can achieve much better time and space efficiency.

Time efficiency.

It is well known that the distance computation between binary vectors (i.e., Hamming distance) is much more efficient than that between real-valued vectors (e.g., Euclidean distance). In addition to specifically tailored search algorithms (e.g., [29]), the dot product between binary vectors also enjoys hardware support (e.g., xnor and the built-in CPU instruction popcount).

As stressed in a recent work [18] from DeepMind, the pairwise dot product of vectors is used intensively by models for some specific tasks (e.g., graph similarity computation in [1]). Thus, binary vectors have been used in their graph matching network (GMN) to speed up the computation.

Space efficiency.

The binary embedding can represent the node in a compact way while well preserving the structure information. As shown in [20], INH-MF can achieve competitive graph node classification performance with 128 bits for each node compared to the conventional embedding approaches (e.g., DeepWalk) with 128 dimensions (i.e., 128 × 64 bits) per node. This will be a great advantage when we face a large-scale graph because the binarized embedding of a graph is more likely to be accommodated in the main memory.

Motivation and Challenges.

The existing GNN-based methods have demonstrated outstanding performance in various tasks such as classification [10, 12, 15, 40], link prediction [14, 48], graph similarity matching [1, 18] and graph clustering [41, 49]. However, they may suffer from memory and speed limitations due to the use of real-valued vectors for the node and graph representations and the model parameters.

Given the outstanding embedding quality and wide applicability of the GNN-based approaches, and the space and time efficiency of binarized representations, one may wonder whether we can design a binarized GNN-based graph embedding approach that achieves a good trade-off between embedding quality and time/space efficiency.

We notice that the existing binarized graph embedding methods [20, 37] rely on the discretization of matrix factorization, following the node-similarity based approaches. They cannot be extended to binarize GNN-based embeddings due to the inherently different natures of the two categories of approaches.

To the best of our knowledge, the only attempt to binarize a GNN is from DeepMind in their recent work [18]. Their binarization method converts each learned d-dimensional real-valued vector into a d-dimensional "nearly" binary vector by applying the well-known tanh function to approximate the Hamming distance for binarization and optimization. However, the output of tanh is not an exact binary value and cannot be accelerated by binary logic operations (e.g., xnor and popcount). As an alternative, one may consider Binarized Neural Networks (BNN) (e.g., [13]) for graph embedding so that the representation is naturally binarized. However, BNNs are not designed for graph data, and to the best of our knowledge, there is no existing graph embedding work based on BNNs.

These issues motivate us to develop a new binarized graph embedding technique that can be integrated into existing GNN-based models to binarize the parameters and produce high-quality binarized graph embeddings. The key challenge is how to generate effective compact embedding vectors with binary network parameters. To address this challenge, we design a binarized graph neural network framework that learns the binary parameters and representations efficiently and effectively.

Contributions.

Our principal contributions are summarized as follows:

  • To the best of our knowledge, this is the first study on binarized graph neural networks (GNNs) with binary parameters that generate binary graph representations. The proposed method, namely BGN, can be seamlessly integrated into existing GNNs.

  • An end-to-end binarized graph neural network framework is proposed with binary weights and activations. This binarized framework immediately reduces the memory consumption of the network; the bit-wise operations between binary vectors substantially speed up the inference time of the model; and the gradient estimator enables our model to effectively backpropagate through discrete parameters and activations.

  • Extensive experiments on multiple benchmark networks are conducted for the node classification task. The results demonstrate that our proposed method outperforms existing binarized embedding methods by a large margin. Compared to the real-valued GNNs, our BGN model achieves nearly state-of-the-art performance while consuming far fewer computational resources (as little as 1/28 of the parameter and embedding memory and 1/20 of the inference time).

  • Binarization approaches are employed on the GNN-based application GMN to show that, by applying our BGN techniques, the GMN model can dramatically reduce its time and space complexity while keeping competitive performance.

  • Experiments further show that our proposed BGN technique allows users to trade off space/time efficiency against embedding quality in a flexible way by tuning the level and setting of binarization on the parameters and activations.

2 Related works

Graph Embedding.

A key problem in machine learning on graphs is finding a way to incorporate information about the structure of the graph into the machine learning model. Graph embedding is one of the most promising approaches because it maps nodes into a low-dimensional space such that the structure of the graph is well preserved. Once accomplished, existing machine learning approaches (e.g., k-means clustering) can be used to analyze the graph in the embedded low-dimensional space. Loosely following the seminal graph embedding approach DeepWalk, three broad categories of embedding methods have appeared in the literature: (1) node similarity based embedding methods (e.g., LINE and NetMF), which rely on the proximity of the nodes w.r.t. various similarity metrics and typically use matrix factorization techniques to learn the node embeddings; (2) random walk based embedding methods (e.g., DeepWalk and node2vec), which encode the nodes by applying the Skip-Gram technique [25] to random walks; and (3) graph neural network (GNN) based embedding methods (e.g., GCN, GraphSage and GIN) [10, 12, 15, 40, 42, 45], which apply neural network techniques to graphs to learn node representations.

Most of the existing graph embedding studies use the real-valued vector to encode the graph nodes following the above three computing paradigms. Recently, three unsupervised approaches [20, 37, 46] have been proposed to learn the binary embedding of the graphs following the node-similarity based embedding methods. Particularly, INH-MF [20] and DNE [37] are independently developed for binarized graph embedding based on the discretization of the matrix factorization on proximity graphs. BANE proposed in [46] is a natural extension of DNE by considering both structure and attribute similarities on the attributed graphs.

Binary Hashing.

Binary hashing has been widely used to learn binary vectors (codes) for objects in many applications. The most popular application is approximate nearest neighbor search in high-dimensional spaces, where binary hashing methods encode high-dimensional objects (e.g., documents and images) as binary codes while preserving the similarity distances of the original space. Many learning-to-hash approaches have been proposed, including unsupervised methods (e.g., [22, 33]), supervised methods (e.g., [35]) and deep learning based methods (e.g., [21]); see [43] for a comprehensive survey. Recently, three approaches [20, 37, 46] have been proposed to learn binary graph embeddings following the node-similarity based embedding methods. To the best of our knowledge, there is no existing work on binarized graph embedding based on GNNs.

Binarized Neural Networks.

Binarized neural networks (see [27] for a comprehensive survey) were first proposed in BNN [4], and the binarization technique of [4] is used by most network binarization models. Among them, XNOR-Net [31] and DoReFa-Net [50] are the most popular, owing to their strong performance on the image classification task.

XNOR-Net achieves high classification accuracy on the ImageNet dataset while offering 58× faster convolutional operations and 32× memory savings. DoReFa-Net replaces binarization with quantization, which allows the model to change the bit size of the weights, activations and even the gradient calculations during backpropagation.

Recently, more binarized neural networks [3, 17, 23, 24, 28, 36, 47] and low-bitwidth neural networks [8, 51] have been proposed to further reduce inference time and to deploy these models on devices with limited computational resources.

However, these methods are all designed for computer vision tasks. Though they perform well on image datasets, they cannot be directly adapted to graph representation learning and graph analysis tasks.

Graph Neural Network Applications.

Several applications are built on GNNs, such as the Graph Matching Network [18] and SimGNN [1]. These models utilize GNNs and use the similarity (distance) between graph embeddings to approximate the graph edit distance and graph similarity.

The Graph Matching Network (GMN) is a novel GNN-based framework proposed by DeepMind to compute a similarity score between an input pair of graphs. Separate MLPs first map the input nodes of the graphs into a vector space. A propagation layer then aggregates the messages along the edges together with a cross-graph matching vector, using an MLP or GRU whose input is the concatenation of node representations and edge vectors. A matching function computes attention coefficients between the nodes of the two input graphs; it applies a softmax over node vectors and therefore requires a vector-space similarity, such as Euclidean distance, cosine similarity or dot product, between all pairs of node representations. Computing these cross-graph attention coefficients costs \(\mathcal{O}(|V_{1}||V_{2}|d)\), where \(|V_{1}|\) and \(|V_{2}|\) are the numbers of vertices of the two input graphs and d is the dimension of the node representations. The match vector \(\boldsymbol{\mu}_{j\rightarrow i}\) is concatenated with the message vector \(\textbf{m}_{j\rightarrow i}\) and the node representation \(\textbf{h}^{(t)}_{i}\), and the concatenation is fed into an MLP or a recurrent neural network core to produce the new node representations. Given the learned node representations, the aggregation module proposed in [19] is used to obtain the graph representations. A similarity score in vector space, such as Euclidean similarity, cosine similarity or approximate Hamming similarity, is then computed between the graph representations to approximate the similarity between the input graphs.
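To make the cost of this matching step concrete, the following NumPy sketch computes the cross-graph attention and the aggregated match vectors described above. It is an illustration under our assumptions, not GMN's reference implementation: the function name is ours and the dot product is chosen as the similarity, one of the options named in the text.

```python
import numpy as np

def cross_graph_match(H1, H2):
    """Cross-graph attention sketch. H1: |V1| x d, H2: |V2| x d.
    Returns the aggregated match vectors sum_j mu_{j->i} for graph 1."""
    scores = H1 @ H2.T                          # |V1| x |V2| pairwise similarities
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    a = e / e.sum(axis=1, keepdims=True)        # softmax over the nodes of graph 2
    # sum_j a_{j->i} (h_i - h_j) = h_i - (a @ H2)_i; materializing the
    # |V1| x |V2| score matrix is what makes this step cost O(|V1||V2|d).
    return H1 - a @ H2

mu = cross_graph_match(np.random.randn(5, 8), np.random.randn(7, 8))
```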

3 Background and preliminaries

Recent studies have revealed that graph neural networks perform excellently on node classification tasks. The existing GNN-based graph embedding approaches share the same computing paradigm: GNNs take the nodes' features and neighborhood information as input; during training, the representations of the nodes (real-valued vectors) at each layer are updated by aggregators and non-linear activation functions; the output representations are fed into a task-specific layer to calculate the loss of the model; and the model is then optimized through backpropagation. The main differences among these GNN-based graph embedding approaches lie in the design of the aggregator, which combines the context representations, and of the loss function tailored to the specific graph analytic task.

These models have real-valued parameters and learn a real-valued representation for each node in an end-to-end manner for graph node classification. However, real-valued parameters and representations are space-consuming to store and time-consuming to multiply, especially for large-scale graphs. To address these issues, in this paper we devise a novel binarized graph neural network, namely BGN, with binary parameters that learns binary representations for the node classification task.

The important notations used throughout the paper are summarized in Table 1.

Table 1 Summary of notations

4 Binarized graph neural network

As illustrated in Figure 1, we introduce a new graph neural network with binarized weights and activations. Our model BGN (Binarized Graph Neural Network) is based on the attention mechanism and can be easily adapted to other graph neural network frameworks. For a given graph, BGN takes the nodes and their contexts, including feature and neighborhood structure information, as input. A binarization function transforms the weights, activations and even the attention coefficients into binarized vectors to reduce the time and space complexity, while the attention mechanism enables the nodes to attend over their neighborhoods' features. We also apply the balance function to ensure that +1 and −1 appear in roughly equal numbers in the binarized vectors. Furthermore, a gradient estimator is used to backpropagate gradients through the discretization.

Fig. 1 The overall framework of the proposed model BGN. (a) All input node features are projected into a unified representation space by binary-valued weights. (b) Masked summation between a binary matrix and a real-valued matrix is employed to speed up the dot product. (c) Binary attention coefficients are produced based on the hidden representations. (d) The output of the layer is calculated via the multi-head attention mechanism. (e) xnor and popcount are employed to calculate the dot product between binary-valued matrices. (f) Loss calculation and end-to-end optimization for the node classification task

The following subsections present the listed key components of our model:

  • Section 4.1 introduces the framework of our work.

  • Section 4.2 introduces the binarization of our model in detail, including the forward propagation and backpropagation.

  • Section 4.3 describes the optimization objective of our model.

  • Section 4.4 introduces the techniques we used to reduce the time and space complexity and improve the performance.

  • Section 4.5 introduces the adaptation of our model to other GNN frameworks.

4.1 Framework

Algorithm 1 illustrates the framework of our model. We follow the attention mechanism introduced in [39, 40] to involve the importance of a node's neighborhoods in the graph representation learning process. Given a graph \(G(\mathcal{V},E)\), where \(\mathcal{V}\) and E denote the set of graph nodes and edges respectively, we use the node features \(\{ \textbf{x}_{v}, \forall v \in \mathcal{V}\}, \textbf{x}_{v} \in \mathbb{R}^{m}\) and the neighborhood information of the nodes \(\{ \mathbf{\eta}_{v}, \forall v \in \mathcal{V}\}\) as inputs. Balance(⋅) denotes the balance function introduced in Section 4.4.3. Our model first produces a binarized node representation \(\textbf{h}^{b}_{v} \in \{+1, -1\}^{d}\) for each node of the input graph. The binarized node embeddings are then fed into the output layer to compute the loss for a specific task such as node classification.

Algorithm 1

Attention Mechanism.

Our proposed framework is based on the graph attention mechanism. The attention layer is utilized in our model to learn the importance of every node to the other nodes. The key is to obtain the importance of one node's features to other nodes, that is, the attention coefficients of the input graph; afterwards, a node's features can attend over the other nodes. Inspired by [40], we perform masked attention to preserve the structural information of the input graph: only the attention coefficients between a node and its neighbors, i.e., \(\alpha_{ij}, v_{j} \in \eta_{i}\), are computed.

In order to obtain the attention coefficients, we use a shared binarized weight matrix \(\textbf{W} \in \{+1, -1\}^{m \times d^{\prime}}\) to apply a linear transformation to each node. The softmax function is used to normalize the coefficients; however, unlike the model proposed in [40], the LeakyReLU activation is not employed in our model, and a sign function is used to binarize the attention coefficients. With the following (1), we obtain a binarized attention coefficient matrix \(\mathcal{A} \in \{+1, 0, -1\}^{N \times N}\), where \(\alpha_{ij}\) is an element of \(\mathcal{A}\) (0 appears in the matrix because we only compute the attention coefficients between neighbors, so the matrix is sparse).

$$ \alpha_{ij} = \mathcal{B}^{\prime}(\mathrm{Softmax}_{j}(\textbf{W}\textbf{x}_{i}, \textbf{W}\textbf{x}_{j})) $$
(1)

where \(\mathcal{B}^{\prime}\) is the binarization function for attention coefficients, which maps 0 to 0, positive values to +1 and negative values to −1.

Once the attention coefficient matrix is obtained, it is used to compute the output of the attention layer: the attention coefficients are multiplied with the linearly transformed node features. We employ the multi-head attention mechanism to stabilize the learning process. The binarization function, which serves as an activation function, is applied to every attention head to binarize the pre-activations, and the concatenation of the outputs of the K independent attention heads forms the output of the attention layer. The output node representation is therefore:

$$ \textbf{h}_{i} = \mathbin{\Vert}^{K}_{k=1}\mathcal{B}(\sum\limits_{j \in \eta_{i}}\alpha^{k}_{ij}\textbf{W}^{k}\textbf{x}_{j}) $$
(2)

where ∥ denotes vector concatenation and \(\textbf{h}_{i} \in \{+1, -1\}^{d}\) is the output binarized node representation.

After several attention layers, the node representations are fed into the last layer to calculate the loss for the specific task, which is node classification in this paper. We introduce the learning objective in Section 4.3.
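Putting (1) and (2) together, the following is a minimal NumPy sketch of a single-head BGN attention layer. Since the exact score function inside the softmax of (1) is abbreviated in the text, a dot-product score between the transformed features is used here as a stand-in; all names are illustrative.

```python
import numpy as np

def b_det(x):                     # deterministic binarization (Eq. 3)
    return np.where(x >= 0, 1.0, -1.0)

def b_prime(x):                   # coefficient variant: keeps 0 at 0 (Eq. 1)
    return np.sign(x)

def bgn_attention_layer(X, adj, W_bin):
    """Single-head BGN attention layer sketch (Eqs. (1)-(2)).
    X: N x m node features; adj: N x N neighborhood mask;
    W_bin: m x d' weight matrix with entries in {-1, +1}.
    The multi-head layer of Eq. (2) concatenates K such heads."""
    Z = X @ W_bin                                 # shared linear transformation
    scores = Z @ Z.T                              # stand-in pairwise score
    scores = np.where(adj > 0, scores, -np.inf)   # masked attention: neighbors only
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = np.nan_to_num(e / e.sum(axis=1, keepdims=True))
    alpha = b_prime(alpha)                        # binarized coefficients (Eq. 1)
    return b_det(alpha @ Z)                       # binarized representations (Eq. 2)
```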

4.2 Binarization

In this section, we introduce how to obtain a graph neural network with binary parameters that can learn binary representations. Section 4.2.1 introduces the binarization function used to transform the real-valued parameters and pre-activations into binary space. Section 4.2.2 introduces the gradient estimators that enable the binarized model to be optimized by the off-the-shelf optimizers such as Adam and SGD.

4.2.1 Forward propagation

The binarization function is central to our model. During forward propagation, a binarization function is applied to binarize the weights and the activations, so that the resulting low-bit parameters and activations reduce the time and space complexity. In our case various binarization functions would work; the most straightforward example is the sign function. As mentioned in [4] and [31], deterministic and stochastic sign-based binarization functions are widely applied to the continuous pre-activations as well as the real-valued weights to obtain binarized activations and weights.

$$ \mathcal{B}_{det}(x) = \left\{\begin{array}{ll} +1 & x \geq 0 ,\\ -1 & else, \end{array}\right. $$
(3)

The above equation is the deterministic binarization function, where x is the real-valued variable. The stochastic binarization is the sign function with probability:

$$ \mathcal{B}_{stoch}(x) = \left\{\begin{array}{ll} +1 & \text{with probability } p = \sigma (x), \\ -1 & \text{with probability } 1 - p, \end{array}\right. $$
(4)

where σ denotes the sigmoid function, that is σ(x) = 1/(1 + exp(−x)). The stochastic binarization is more appealing but requires generating random bits, whereas the deterministic binarization is easier to compute. We apply the deterministic binarization function (i.e., (3)) for the binarization of weights and activations because the deterministic sign function provides more stable and reproducible results. Note that we use a variant of the deterministic sign function, which maps 0 to 0, to binarize the attention coefficients.
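As a quick illustration, (3) and (4) can be written as follows (a NumPy sketch; the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def b_det(x):
    """Deterministic binarization (Eq. 3): >= 0 maps to +1, else to -1."""
    return np.where(np.asarray(x) >= 0, 1.0, -1.0)

def b_stoch(x):
    """Stochastic binarization (Eq. 4): +1 with probability sigmoid(x)."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(x)))
    return np.where(rng.random(p.shape) < p, 1.0, -1.0)

x = np.array([-1.3, 0.0, 0.4, 2.1])
print(b_det(x))    # [-1.  1.  1.  1.]
print(b_stoch(x))  # random draw; large inputs are almost always +1
```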

Rather than directly binarizing the weights in the graph neural network, we follow the quantization process in [31] and introduce a scaling factor \(\gamma \in \mathbb{R}^{+}\) to estimate the real-valued weights such that \({\boldsymbol{W}} \approx \gamma {\boldsymbol{B}}\), thus achieving better performance. We can find the optimal quantizer by minimizing the quantization error:

$$ \min \mathcal{J}(\gamma {\boldsymbol{B}}) = \left \| {\boldsymbol{W}} - \gamma {\boldsymbol{B}} \right\|^{2} $$
(5)

According to results and analysis in [31], for each real-valued weight W, the optimized binary matrix B and scaling factor γ can be obtained by the following constrained optimization:

$$ {\boldsymbol{B}^{*}} = \underset{\boldsymbol{B}}{\arg\max}\ {\boldsymbol{W}^{T}}{\boldsymbol{B}} $$
(6)
$$ \gamma^{*} = \frac{{\boldsymbol{W}^{T}}{\boldsymbol{B}^{*}}}{n} = \frac{1}{n}\left \| {\boldsymbol{W}} \right\|_{\mathcal{L}1} $$
(7)

where \(\boldsymbol{B}\) is constrained to be a binary matrix and n is the number of elements of the weight \(\boldsymbol{W}\); if \(\boldsymbol{W} \in \mathbb{R}^{m \times d}\), then n = m × d.
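The closed-form solution of (6) and (7) is simple to implement. The sketch below (names ours) recovers \(\gamma\) and \(\boldsymbol{B}\) from a real-valued weight matrix and evaluates the quantization error of (5):

```python
import numpy as np

def xnor_quantize(W):
    """Closed-form 1-bit quantizer W ~ gamma * B from Eqs. (6)-(7):
    B* = sign(W) maximizes W^T B, and gamma* = ||W||_L1 / n."""
    B = np.where(W >= 0, 1.0, -1.0)
    gamma = np.abs(W).sum() / W.size
    return gamma, B

W = np.random.randn(4, 8)
gamma, B = xnor_quantize(W)
error = np.linalg.norm(W - gamma * B) ** 2   # quantization error of Eq. (5)
```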

Furthermore, we also adopt the Libra Parameter Binarization (LPB) introduced in IR-Net [28] to retain information and minimize the information loss in forward propagation by jointly considering both the quantization error and the information loss. LPB also quantizes the real-valued weight \(\boldsymbol{W}\) using a scaling factor such that \({\boldsymbol{W}} \approx \gamma {\boldsymbol{B}}\). Suppose each element in \(\boldsymbol{B}\) can be viewed as a sample of a random variable obeying the Bernoulli distribution shown in (4). The entropy of the quantization in the following (8) is then considered as a part of the loss function:

$$ \mathcal{H}(\gamma {\boldsymbol{B}}) = \mathcal{H}({\boldsymbol{B}}) = -p\ln(p) -(1-p)\ln(1-p) $$
(8)

Together with the quantization loss described in (5), the objective function of LPB is defined as:

$$ \min \mathcal{J}(\gamma {\boldsymbol{B}}) - \lambda \mathcal{H}(\gamma {\boldsymbol{B}}) $$
(9)

We further apply the standardization and balance described in [28]. As a result, the optimal quantization can be obtained by solving:

$$ {\gamma^{*} \boldsymbol{B}_{\boldsymbol{W}}^{*}} = \mathcal{B}_{det}(\widehat{\boldsymbol{W}}_{std})\ll \gg s^{*} $$
(10)

where ≪≫ denotes a left or right bit-shift, and \(s^{*}\) and \(\widehat{\boldsymbol{W}}_{std}\) can be calculated by:

$$ s^{*} = round(\log_{2}(\frac{\left \| \widehat{\boldsymbol{W}}_{std} \right\|_{\mathcal{L}1}}{n})) $$
(11)
$$ \widehat{\boldsymbol{W}}_{std} = \frac{\widehat{\boldsymbol{W}}}{\sigma(\widehat{\boldsymbol{W}})}, \widehat{\boldsymbol{W}} = {\boldsymbol{W}} - \bar{\boldsymbol{W}} $$
(12)

where σ(⋅) denotes the standard deviation and \(\bar{\boldsymbol{W}}\) is a matrix whose elements all equal the mean value of the weight \(\boldsymbol{W}\). LPB directly binarizes the representations using the deterministic binarization function, i.e., \({\boldsymbol{B}}_{\boldsymbol{x}} = \mathcal{B}_{det}({\boldsymbol{x}})\). Hence, the operation between the real-valued weights and vectors is reformulated as follows:

$$ {\boldsymbol{W}}{\boldsymbol{x}} = ({\boldsymbol{B}}_{\boldsymbol{W}} \odot {\boldsymbol{B}}_{\boldsymbol{x}})\ll \gg s $$
(13)

where ⊙ denotes the XNOR and popcount operation between binary codes.
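Under our reading of (10)-(12), the LPB pipeline can be sketched as follows. This is a minimal NumPy illustration with names of our choosing, not the reference implementation of IR-Net:

```python
import numpy as np

def libra_binarize(W):
    """Sketch of Libra Parameter Binarization (Eqs. (10)-(12)):
    standardize the weights, binarize with the deterministic sign,
    and fold the scaling factor into a power-of-two bit shift."""
    W_hat = W - W.mean()                    # Eq. (12): zero-center the weights
    W_std = W_hat / W_hat.std()             # Eq. (12): unit standard deviation
    s = int(np.round(np.log2(np.abs(W_std).sum() / W.size)))  # Eq. (11)
    B = np.where(W_std >= 0, 1.0, -1.0)     # Eq. (10): deterministic binarization
    return B, s                             # effective weight is B * 2**s

B, s = libra_binarize(np.random.randn(64, 16))
```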

4.2.2 Backpropagation

In this part, we describe how to backpropagate the gradients through the binarization function. We adapt the gradient estimator into our model for better optimization.

Propagation gradients through binarization function

It is obvious that the binarization function has a zero derivative almost everywhere, which leads to zero gradients of the loss function w.r.t. the pre-activations and weights. Trainable variables cannot be updated with zero gradients, so the model cannot be trained by plain backpropagation; instead, an estimate of the gradients must be used for optimization. Previous studies have investigated how to propagate gradients through stochastic discrete functions. Below we investigate two popular gradient estimators for the binarization function: the straight-through estimator and the REINFORCE estimator [44].

Straight through estimator

The straight-through estimator is a simple gradient estimator. It estimates the derivative of the binarization function \(\mathcal{B}(\textbf{h})\) of a pre-activation or weight h as 1 (a vector or matrix whose elements are all 1). Let \(\textbf{h}^{b}\) denote the binarized representation and h the pre-activation before binarization. The straight-through estimate of the gradient of the loss L w.r.t. the pre-activation h is thus:

$$ g_{h} = \frac{\partial L}{\partial \textbf{h}} =\frac{\partial L}{\partial \mathcal{B}({\mathbf h})} \cdot \frac{\partial \mathcal{B}(\textbf{h})}{\partial \textbf{h}} = \frac{\partial L}{\partial \textbf{h}^{b}}\mathbf{1}= g_{h^{b}}\mathbf{1} $$
(14)

This gradient will then be back-propagated to obtain the gradient of quantities (i.e., pre-activations or weights) that influence h.
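In an autograd framework, (14) amounts to a custom function whose backward pass forwards the incoming gradient unchanged. A minimal PyTorch sketch (the class name is ours):

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through backward pass (Eq. 14):
    the forward computes sign(h); the backward returns the incoming
    gradient unchanged, i.e. the derivative of B is treated as 1."""

    @staticmethod
    def forward(ctx, h):
        return torch.where(h >= 0, torch.ones_like(h), -torch.ones_like(h))

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output            # g_h = g_{h^b} * 1

h = torch.randn(5, requires_grad=True)
BinarizeSTE.apply(h).sum().backward()
print(h.grad)                         # all ones despite sign's zero derivative
```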

REINFORCE estimator

The REINFORCE estimator was proposed in [2] to estimate the expectation of the gradient \(\frac{\partial L}{\partial \textbf{h}}\) of the loss L with respect to the pre-activation vector or weight h. When the binarization function \(\mathcal{B}(\cdot)\) is stochastic with the probability given by the sigmoid, it has been proven that:

$$ \mathbb{E}(\frac{\partial L}{\partial {\mathbf h}}) = \mathbb{E}[(\mathcal{B}({\mathbf h})- \sigma(\textbf{h}))(L-c)] $$
(15)

where σ is the sigmoid function and c is a constant vector. To minimize the variance of the estimation, c can be chosen as:

$$ c = \frac{\mathbb{E}[(\mathcal{B}(\textbf{h})- \sigma(\textbf{h}))^{2}L]}{\mathbb{E}[(\mathcal{B}(\textbf{h})- \sigma(\textbf{h}))^{2}]} $$
(16)

The REINFORCE estimator works directly on the weights and pre-activations without actually computing the gradient; the estimate is obtained by monitoring the numerator and denominator of (16) during the training process.
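As a concrete illustration of (15) and (16), the following NumPy sketch estimates the gradient by Monte-Carlo sampling of the stochastic binarization, taking the two equations literally; in training one would maintain running averages of the numerator and denominator instead of drawing fresh samples, and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reinforce_grad(h, loss_fn, n_samples=1000):
    """Monte-Carlo sketch of Eqs. (15)-(16).
    h: pre-activation vector; loss_fn maps a {-1,+1} vector to a scalar."""
    p = sigmoid(h)
    bs = np.where(rng.random((n_samples, h.size)) < p, 1.0, -1.0)   # stochastic B(h)
    Ls = np.array([loss_fn(b) for b in bs])
    centered = bs - p                                               # B(h) - sigmoid(h)
    c = (centered**2 * Ls[:, None]).mean(0) / (centered**2).mean(0)  # Eq. (16)
    return (centered * (Ls[:, None] - c)).mean(0)                    # Eq. (15)

g = reinforce_grad(np.array([0.5, -1.0]),
                   lambda b: float((b * np.array([1.0, 2.0])).sum()))
```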

The REINFORCE estimator is more sophisticated and performs better in many applications. However, we observe that in our setting its performance is not superior to that of the straight-through estimator, while the straight-through estimator obtains the gradients faster owing to its simplicity. A performance comparison between the two gradient estimators is included in Section 5. In practice, we choose the straight-through estimator for our model in the experiments.

4.3 Optimization objectives

Existing GNN-based graph embedding approaches provide end-to-end models that focus on the node classification task; our model is therefore also trained for node classification. Below, we introduce the objective of BGN and the learning process that optimizes the parameters.

For node classification learning, we feed the binarized embedding \(\textbf{h}^{b}_{v}\) into the output layer to predict the class label of the node. The predicted probability of label \(\textbf{C}_{vk}\) is written as:

$$ p(\textbf{C}_{vk} \mid \textbf{h}^{b}_{v}) = Softmax^{k}_{\zeta}(\sum\limits_{u \in \eta_{v}}\alpha^{L}_{uv}\textbf{W}^{L}\textbf{h}^{b}_{u}) $$
(17)

where ζ denotes the number of class labels. After obtaining the classification result in (17), we calculate the cross-entropy as the loss for the node classification task.

$$ L_{class} =-\sum\limits_{v \in \mathcal{V}_{labeled}}\sum\limits_{k = 1}^{\zeta} \textbf{C}^{\mathcal{L}}_{vk}\log(\textbf{C}_{vk}) $$
(18)

where \(\mathcal{V}_{labeled}\) is the set of labeled nodes used in the training process and \(\textbf{C}^{\mathcal{L}}_{vk}\) is the multi-hot encoding of the ground-truth class labels.

During training, the gradients are backpropagated via the estimator and applied to the parameters by an off-the-shelf optimizer.

4.4 Techniques to improve the model

Several techniques are used in the binarized graph neural network model to reduce the time and space complexity and to improve performance. The logic operation XNOR between binary values, the built-in CPU instruction popcount and masked summation are used to replace the traditional dot product and thereby reduce time complexity; Figure 2 gives toy examples that illustrate the differences between these operations. The balance function balances the +1 and −1 entries in the embedding vectors, which raises the performance of the GMN. In addition, the binary network parameters and the binary node representations naturally reduce the space complexity.

Fig. 2 Toy examples of (a) dot product, (b) masked summation and (c) the XNOR and popcount instructions

4.4.1 XNOR and popcount

The logic operation XNOR and the built-in CPU instruction popcount are used to replace the dot product between binary matrices.

As shown in Table 2, XNOR produces a binary value from inputs of +1 and −1. The popcount instruction is then employed to count the number of bits that are set to 1. XNOR can be more than one order of magnitude faster than the dot product, which dramatically reduces the time complexity. As mentioned in [4], a 32-bit floating point multiplier costs about 200 Xilinx FPGA slices, whereas a 1-bit XNOR gate costs only 1 slice.

Table 2 XNOR calculation
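For intuition, a {−1,+1} vector can be packed into a machine word with +1 encoded as bit 1; xnor then marks the agreeing positions and popcount counts them, recovering the dot product as 2·popcount(xnor(a,b)) − d. A small Python sketch, using int.bit_count (available from Python 3.10) in place of the hardware popcount instruction:

```python
import numpy as np

def pack(v):
    """Encode a {-1,+1} vector as an integer bit string: +1 -> 1, -1 -> 0."""
    return int("".join("1" if x > 0 else "0" for x in v), 2)

def binary_dot(a_bits, b_bits, d):
    """Dot product of two {-1,+1}^d vectors from their packed bit strings:
    dot = agreements - disagreements = 2 * popcount(xnor) - d."""
    xnor = ~(a_bits ^ b_bits) & ((1 << d) - 1)   # 1 where the bits agree
    return 2 * xnor.bit_count() - d

a = np.array([+1, -1, +1, +1])
b = np.array([+1, +1, -1, +1])
assert binary_dot(pack(a), pack(b), 4) == int(a @ b)   # both evaluate to 0
```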

4.4.2 Masked summation

Masked summation replaces the dot product between a binary matrix and a real-valued matrix. The binary matrix is first transformed into a mask matrix of "True" and "False" entries. During the multiplication, each real-valued vector is masked by the corresponding mask vector, producing a positive and a negative masked vector containing only the elements whose positions are "True" and "False" in the mask vector respectively. The model computes the summations of the positive and negative masked vectors separately; the difference of these two summations equals the dot product of the given matrices.

Masked summation reduces the cost of the matrix dot product. The time complexity of the naive dot product between two real-valued matrices \(M_{1} \in \mathbb{R}^{m\times n}\) and \(M_{2} \in \mathbb{R}^{n\times d}\) is \(\mathcal{O}(mnd)\), while the time complexity of masked summation between a binary matrix \(M_{1} \in \{-1, +1\}^{m\times n}\) and a real-valued matrix \(M_{2} \in \mathbb{R}^{n\times d}\) is \(\mathcal{O}(nd)\). Theoretically and in practice, masked summation significantly reduces the time cost of our proposed binarized graph neural network.
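A direct NumPy rendering of masked summation follows (the function name is ours); the assertion checks it against the ordinary dot product. The version below materializes the masked tensors for clarity rather than speed; an optimized kernel would stream the two partial sums.

```python
import numpy as np

def masked_summation(B, M):
    """Dot product of a binary matrix B in {-1,+1}^(m x n) with a
    real-valued M in R^(n x d) using no multiplications: sum the rows
    of M selected by the "True" mask, sum the rest, and subtract."""
    mask = (B > 0)[:, :, None]                       # m x n x 1 boolean mask
    pos = np.where(mask, M[None], 0.0).sum(axis=1)   # "True"-masked summation
    neg = np.where(mask, 0.0, M[None]).sum(axis=1)   # "False"-masked summation
    return pos - neg

B = np.where(np.random.randn(3, 5) >= 0, 1.0, -1.0)
M = np.random.randn(5, 4)
assert np.allclose(masked_summation(B, M), B @ M)
```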

4.4.3 Balance function

The distribution of +1 and −1 in the representation vectors is sometimes unbalanced. For example, if most pre-activations h have positive elements, the output representation \(\textbf{h}^{b}\) of the binarization function will consist mainly of +1, and the dot product of two such vectors will be close to d, the dimension of the vectors. This unwanted situation should be avoided because it dramatically lowers the effectiveness of the proposed model, especially when BGN is applied to GMN, which requires a great number of dot products between representations. As a result, we apply the following balance function to the pre-activations before binarization in order to balance the distribution of their positive and negative elements:

$$ \textit{Balance}(\textbf{h}) = \textbf{h} - \overline{\textbf{h}} $$
(19)

where \(\overline{\textbf{h}}\) is the vector whose elements all equal the mean value of the pre-activation vector h. The balance function ensures that the pre-activation vectors contain roughly half positive and half negative elements, which leads to a balanced distribution of +1 and −1 after binarization.
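As a toy illustration of (19) with made-up numbers, centering the pre-activations prevents the degenerate all-(+1) output described above:

```python
import numpy as np

def balance(h):
    """Eq. (19): subtract the mean so the entries are centered at zero."""
    return h - h.mean(axis=-1, keepdims=True)

h = np.array([0.9, 0.1, 0.8, 0.2])    # all-positive pre-activations
print(np.sign(h))                      # [1. 1. 1. 1.]   degenerate after binarization
print(np.sign(balance(h)))             # [ 1. -1.  1. -1.]  balanced signs
```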

4.5 Adaptation to other GNN-based models

The proposed binarized graph neural network is a general framework that can be adapted to other graph neural network-based models to project the real-valued parameters and activations into the binary space and thereby reduce the space and time cost. We introduce how we binarize the state-of-the-art GNN-based model AS-GCN [12] and the graph matching network.

4.5.1 Binarization of AS-GCN

AS-GCN is a general framework designed for fast representation learning with graph neural networks such as GCN. Therefore, the binarization of AS-GCN is similar to that of our proposed BGN: we use the deterministic binarization function to binarize the parameters and pre-activations of AS-GCN, and the straight-through estimator is employed for backpropagation. The binarized model is denoted as BGN-ASGCN in our experiments.

4.5.2 Binarization of GMN

As mentioned above, the time cost of GMN comes mainly from the pairwise node similarity computation. We apply the deterministic binarization function (3) to the pre-activations and transform the node and graph representations into binary codes so that XNOR can replace the dot product. The straight-through estimator (14) is used for backpropagation. Furthermore, we noticed that the distribution of +1 and −1 is usually not symmetric, which dramatically lowers the performance; hence, the balance function (19) is employed on the graph representations.

5 Experiment

We conduct extensive experiments to evaluate the performance of our model on the node classification task on real-world network datasets. We thoroughly compare the time and space efficiency of the proposed model against the baseline models. A case study shows the effectiveness and efficiency brought by our framework to GNN-based applications such as GMN.

5.1 Dataset

To facilitate the comparison between our model and the relevant baselines, we conduct the classification experiments on three well-known citation network datasets: Cora, Citeseer and Pubmed [34]. Each dataset contains bag-of-words representations of documents and citation links between the documents. Graph G is constructed from the citation links. In the classification task, we only use 20 labeled instances per class for training. The test data contains 1000 nodes, as in GCN, GAT and AS-GCN. We also include other types of datasets for broader comparison: two social network datasets, Facebook and wiki-vote [16], and two air-traffic networks [32], Brazil and USA. For the social networks, we randomly select 10% and 20% of the nodes for training and validation respectively, and the remaining nodes are used as the test set. For the air-traffic networks, we randomly assign equal numbers of nodes to the training, validation and test sets.

The details of the datasets are summarized in Table 3.

Table 3 Citation datasets

5.2 Baseline methods

The following GNN-based and binary embedding methods are compared as baselines:

GCN:

(Graph Convolutional Network) [15] is a semi-supervised neural network method for node classification.

GAT:

(Graph Attention Network) [40] is a graph neural network model which first exploits the attention mechanism to solve the node classification task.

AS-GCN:

(Adaptive Sampling over GCN) [12] is a state-of-the-art method for node classification task. AS-GCN aims to increase the scalability of GCN using adaptive sampling. The experiments demonstrate that the application of BGN can further reduce the time and space complexity of AS-GCN.

GAT-binary and ASGCN-binary:

are models that directly apply the sign function to the node representations learned by the original versions of GAT and AS-GCN. The naively binarized representations are fed into the task-specific layer to learn the classification result.

GAT-tanh and ASGCN-tanh:

are models that employ the tanh binarization used in DeepMind's work [18]. The tanh function is used to binarize the parameters and embedding vectors of GAT and AS-GCN; we clip the values of the parameters and activations in both models to make sure that tanh produces "exact" binary codes.

INH-MF:

[20] is an MF-based information network hashing algorithm that learns binary codes as node embeddings that preserve high-order proximity.

BANE:

(Binarized Attributed Network Embedding) [46] is an extension of DNE [37] that uses a Weisfeiler-Lehman proximity matrix factorization learning function to produce binary node representations.

In addition, we compare variants of our proposed BGN that use the quantization methods introduced in Section 4.2.1:

BGN-x-GAT and BGN-x-ASGCN:

are the binarized models of graph attention network and ASGCN using (6) and (7).

BGN-lpb-GAT and BGN-lpb-ASGCN:

are the binarized models of graph attention network and ASGCN using Libra Parameter Binarization (LPB).

5.3 Experiment setup

For the performance experiments, we evaluate the models with representations of the same bit width. For the inference efficiency experiments, the embedding dimensions of our method and all baseline methods are set to 64. During training, the whole graph is visible, but only a few nodes are labeled while most nodes have no label information. We process all node information in one training phase because the graph attention coefficients must be computed over the whole graph.

For this classification task, we report the average accuracy of the evaluated GNN-based embedding approaches over ten independent runs, using the accuracy metric introduced in [15, 40]. Because INH-MF and BANE only produce binary embedding vectors and have no built-in classifier, we employ the one-vs-rest logistic regression implemented in Liblinear [7] to obtain their classification results, in which 90% of the nodes are labeled.

All the experiments were conducted on a server running RHEL 7.5 with 2x 2.4GHz Intel Xeon E5-2680 v4 (14 cores) CPUs, 256GB 2400MHz ECC DDR4 RAM and 2x NVIDIA Quadro P5000 16GB GPUs (2560 cores each). The time cost of our trained binarized model is evaluated on the CPUs using the XNOR and popcount instructions. The time cost of the other baseline GNN-based methods is evaluated on the GPUs.

5.4 Classification results

Because our model produces compact representations for vertices, we compare the performance of our model and the other baselines at the same bit width.

5.4.1 Comparison among binary embedding methods

We compare the classification results between our model and other binary-valued embedding methods.

As shown in Figure 3, under different embedding dimensions, BGN significantly outperforms all the other binary-valued embedding methods on all three datasets. With the help of the graph neural network, our model makes better use of the graph structure and feature information and is trained specifically for the node classification task; it therefore outperforms the other MF-based binarized graph embedding models by a significantly large margin. In comparison with the naively binarized GAT-binary and ASGCN-binary, our model accounts for the binary nature of the parameters and vectors during training and hence achieves better accuracy. As for GAT-tanh and ASGCN-tanh, the tanh function has a zero gradient when its output is close to +1 or −1 and produces real-valued outputs where its gradient is nonzero; this property makes tanh unsuitable for binarizing the neural network. When the input values are clipped so that tanh produces exact binary parameters and embeddings, the gradient becomes zero, which results in insufficient optimization and worse performance than BGN. Furthermore, the models binarized by BGN-x and BGN-lpb achieve better performance than the other compared methods.

Fig. 3 Classification results on the three citation network datasets among the binary-valued embedding methods with different embedding dimensions

5.4.2 Comparison among the GNN-based methods

We compare our model with the other GNN-based methods (GCN, GAT and AS-GCN). All baseline methods produce real-valued embedding vectors, each dimension of which is encoded by at least 32 bits, whereas each dimension of the embedding vectors learned by our model is encoded by a single bit. As a result, a real-valued 16-dimensional vector requires at least 512 bits while a binary vector only requires 16 bits. Figure 4 shows the performance of the models as the bit width of a single embedding vector varies.

Fig. 4 Classification results on the three citation network datasets among the GNN-based methods with varying bit widths for the embedding vectors

Our model significantly outperforms all the baseline methods at low bit widths. When more space is available for the learned representations, our model still achieves competitive classification results compared with the state-of-the-art graph neural network-based methods. In conclusion, the performance gap between our model and the baselines is acceptably small with large bit-width representations, while our model's performance is notably better with low bit-width representations.

5.5 Comparison of time and space efficiency

In this section, we report the inference time and space efficiency of our model. Inference is the process of producing the classification result once the model has been trained. Acceleration comes from the XNOR and popcount operations, with only a small sacrifice in classification performance. In this experiment, we train the binary parameters and activations of our model and then, during inference, replace the dot product between binarized matrices with XNOR and popcount and the dot product between a binary matrix and a real-valued matrix with masked summation.

Tables 4 and 5 report the experimental results. GAT-binary, ASGCN-binary, GAT-tanh and ASGCN-tanh require the same parameter size as their real-valued versions since these baselines only binarize the node representations while keeping the real-valued parameters of the model. Among them, GAT-binary and ASGCN-binary can be accelerated by applying masked summation during inference, while GAT-tanh and ASGCN-tanh have to perform conventional matrix multiplication since they cannot guarantee exact binary codes as node representations. Our model under the binarized framework is more than one order of magnitude faster than the baseline methods GAT and AS-GCN in terms of inference time. The proposed model is up to 29× faster and saves up to 28× space compared with the baseline methods.

Table 4 Comparison of performance, inference time and memory space required for the parameters between the real-valued and binarized models
Table 5 Comparison of performance, inference time and memory space required for the parameters between the tanh-based and BGN-based models

5.6 Analysis of binarization

In this section, we analyze the effect of the estimator and of the binarization level on space, time and performance. We compare the space, inference time and accuracy of BGN-GAT and GAT on the Cora dataset. We fix the dimension of the embedding vectors to 64 for both methods and change the settings of BGN to show the space and time savings compared with the baseline GAT.

The results are shown in Table 6, where BGNw, BGNe, BGNwe and BGNwec denote BGN with, respectively, the weights binarized; the embedding vectors binarized; both weights and embedding vectors binarized; and the weights, embedding vectors and attention coefficients binarized, all based on the graph attention mechanism. We can conclude from Table 6 that (1) when the weights, activations and attention coefficients are all binarized, BGN-GAT saves the most space for parameters and output vectors while maintaining acceptable classification accuracy; (2) the straight-through and REINFORCE estimators achieve similar accuracy on the node classification task, so we choose the STE for our model in the above experiments because of its simplicity and determinism; and (3) compared with the original GAT, BGN-GAT saves 28× space for model parameters, 19× space for activations and achieves a 19× speedup.

Table 6 Trade-off between time/space efficiency and classification accuracy of proposed BGN w.r.t the level and setting of binarization

5.7 Case study

In this section, we investigate how the binarized graph neural network improves the time efficiency of GNN-based applications such as GMN. Because GMN computes pairwise dot products between node and graph embedding vectors, its time consumption becomes extremely high as the number of nodes per graph grows. With binary representations, however, XNOR between binary vectors can replace the dot product, which alleviates the time complexity problem significantly. The following experiments compare the performance and time cost of GMN with binary node and graph representations against the original version. The graph similarity is then used for the graph matching task.

Experiment Setup.

We follow the experimental setting of [18] to test the performance of the binarized GMN. The training data is generated by sampling binomial graphs G1 with n nodes and edge probability p [6]. A positive example G2 is generated by randomly substituting kp edges of G1 with new edges, and a negative example G3 is generated by substituting kn edges of G1, where kp < kn. In the experiment, we set kp = 1, kn = 2 and p = 0.2. We use the Hamming similarity between vectors in the loss function, which is more suitable for binary-valued vectors. The model must predict a higher similarity score for the positive pair (G1,G2) than for the negative pair (G1,G3). The evaluation metrics remain the same: (1) pair AUC, the area under the ROC curve for classifying pairs of graphs as similar or not on a fixed set of 1000 pairs; and (2) triplet accuracy, the accuracy of correctly assigning a higher similarity to the positive pair in a triplet than to the negative pair on a fixed set of 1000 triplets.

Inference Time and Graph Matching Performance.

We report the graph matching accuracy and inference time of the binarized and original GMN with respect to the number of nodes per graph. The default setting in GMN is 20 nodes per graph, which is quite small for real-world networks. We vary the number of nodes per graph from 20 to 160 and keep the other settings as described above. The dimensions of the node and graph representations are set to 32 and 64 respectively.

As shown in Figure 5, the inference of BGN-GMN is significantly faster. This is because the similarity computation (pairwise dot products) between the node representations of the two graphs accounts for most of GMN's time complexity. With the same dimensions for node and graph embedding vectors, BGN-GMN is up to 21× faster than the baseline model in inference time, thanks to the replacement of the dot product with fast operations such as XNOR and popcount between binary vectors.

Fig. 5 Graph matching performance and inference time for GMN and BGN-GMN w.r.t. the number of nodes per graph

On the graph matching task, the original GMN performs better when the number of nodes per graph is small. However, as the number of nodes grows, both the pair AUC and the triplet accuracy decay. When the number of nodes exceeds 60, the real-valued representations can no longer distinguish the similarity between the graphs, so the model cannot learn different similarity scores for positive and negative pairs under the Hamming similarity metric. With the help of binarization and the balance function, the binary representations still deliver acceptable and more robust performance on the graph matching task. This is because the binarized model produces true binary representations for the Hamming loss calculation and is designed specifically for graph matching in Hamming space.

Parameter Sensitivity Analysis.

We compare the performance of the binarized and original GMN to show the effect of the dimensions of the node and graph embedding vectors. We set the number of nodes per graph to n = 30 for this comparison. We change the dimension of the graph embeddings produced by the two models so that they produce embedding vectors of the same bit width, keeping the other settings the same.

The results are shown in Figure 6a. The binary graph representations tend to perform better at low bit widths and achieve similar accuracy as the bit width grows. The binary representations are also more robust than the baseline model as the embedding dimension varies.

Fig. 6 Performance comparison on the graph matching task between the original GMN and BGN-GMN with (a) graph representations binarized and (b) node representations binarized

Binarizing the node representations matters more than binarizing the graph representations because the costly dot products are mainly computed between node representations. We therefore compare GMN and BGN-GMN under different bit widths for the node embedding vectors by varying their dimensions.

As shown in Figure 6b, the pair AUC is similar for the binary and real-valued node embedding vectors, but BGN-GMN performs better with low bit-width representations. For triplet accuracy, the binary embedding vectors achieve better performance at short code lengths and similar accuracy to the real-valued node embeddings at long code lengths. These results indicate that binary representations are much better for comparing two graphs at low bit widths. In line with the results for the binary graph embedding vectors, the binary node embedding vectors are also more robust than the real-valued node representations.

6 Conclusion

We present a model that addresses the challenging problem of learning binary network embeddings with a compact neural network structure. We proposed a novel binarized graph embedding method, namely BGN, that has binarized parameters and enables GNNs to learn discrete embeddings. The binarized neural network reduces the memory and time cost of the GNN and thus increases the scalability of GNNs. BGN can be naturally integrated into other GNN models, such as the graph matching network, to improve their inference time and space consumption. Further experiments also illustrate that BGN increases time efficiency while maintaining competitive accuracy.