
1 Introduction

Text classification drives many target applications [10], e.g., sentiment analysis, question answering, and natural language inference. It aims to process different kinds of text and assign them to pre-defined labelled categories. Short text semantic classification plays a role analogous to sentence pair classification in the Chinese semantic context. In sentence pair classification tasks, two text sequences are taken as input and a label or a scalar value indicating their relation is produced. Numerous tasks, including paraphrase identification [14] and answer selection [17], can be seen as specific forms of the text matching problem.

As a surge of interest and distinguished work [6, 15] has recently emerged in natural language processing (abbr., NLP), choosing suitable methods becomes both more practical and more challenging. Pre-trained language models (e.g., BERT or GPT and their variants) outperform traditional machine learning methods in almost all scenarios.

Of note, deep graph neural networks are increasingly utilized in text classification; they efficiently capture semantic connections between words, phrases, or sentences and have evolved into feasible representation methods. Nevertheless, most datasets used for text classification are available only in English. Both the availability of Chinese datasets and the question of how to migrate text classification methods to Chinese need to be addressed. Early work either uses Chinese characters as model input, or first segments each sentence into words and then takes these words as input tokens. Word-based models are more susceptible to data sparsity, suffer performance degradation from out-of-vocabulary words, and are thus more prone to overfitting [7]. Character-based models, however, cannot fully exploit explicit word information, which is not negligible in Chinese semantic classification.

In this paper, we propose a Graph Attention Leaping Connection Network (abbr., GLCN) that considers both semantic information and multi-granularity information, achieving sufficient information aggregation while alleviating over-smoothing. Our model builds a pair of word lattice graphs. To reduce noise and computation, only several segmentation paths are retained to form the lattice graph during construction. We obtain the initial word representation by aggregating features from character-level interaction. For node updating, we use an attention mechanism to weigh “important” neighbors more heavily. To obtain the final representation of each node, we introduce, for the first time, a leaping connection policy, which considers information from every node in the graph and can be generalized to new graphs by Max-Pooling.

There are four main aspects of our contribution:

  1. Our model makes full use of the multi-granularity information of characters and words.

  2. An attention mechanism is introduced to better aggregate information between words and characters.

  3. The leaping connection, constructed by adaptive Max-Pooling, aggregates node information without introducing additional learning parameters while avoiding over-smoothing.

  4. Experiments on three datasets demonstrate that our model outperforms state-of-the-art models.

2 Related Work

Deep Text Classification. Recently, pre-trained models (abbr., PTMs) like BERT [5] have shown their powerful ability to learn contextual word embeddings. For Chinese text classification, BERT takes a pair of short texts as input, with each character as a separate input token, thereby ignoring word information. To tackle this problem, some Chinese variants of the original BERT have been proposed, e.g., BERT-wwm [4], ERNIE [12] and its update ERNIE2.0 [11]. They take word information into consideration through a whole word masking mechanism during pre-training.

Graph Neural Networks. Graph neural networks derive from network embedding, which effectively maps nodes to low-dimensional representations and preserves the structure of the network. As a typical kind of non-Euclidean data, graph-structured data plays a crucial role in the field of deep neural networks [16, 18]. The deep neural network architectures designed for such data are known as Graph Neural Networks (abbr., GNNs), which learn meaningful representations for graph-structured data.

3 Graph Attentive Leaping Connection Model

3.1 Problem Definition

For ease of presentation, we define the notations and key data structures used in this paper.

Definition 1 (Chinese Text Classification). Given two Chinese short text sequences \(S^{a}=\left\{ s_{1}^{a},s_{2}^{a},\cdots ,s_{T_{a}}^{a}\right\} \) and \(S^{b}=\left\{ s_{1}^{b},s_{2}^{b},\cdots ,s_{T_{b}}^{b}\right\} \), the goal of our text classification model \(f\left( S^{a},S^{b}\right) \) is to predict whether \(S^{a}\) and \(S^{b}\) have the same semantics, where \(s_{i}^{a}\) and \(s_{j}^{b}\) represent the i-th and j-th Chinese character in the two texts respectively, and \(T_{a}\) and \(T_{b}\) denote the numbers of characters.

Definition 2 (Chinese Lattice Graph). A lattice graph combines the results of Chinese word segmentation with the original character sequence. Since keeping all possible segmentation paths would lead to excessive computation and noise, we retain only several paths by random selection, as in Fig. 2, to form a word lattice graph \(G=\left( \mathcal {V} ,\mathcal {E} \right) \). Each word and each character represents a node; \(\mathcal {V}\) is the set of nodes and \(\mathcal {E}\) is the set of edges. \(\mathcal {N}\left( v_{i} \right) \) denotes the set of all neighbor nodes of node \(v_{i}\), excluding itself.
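
The construction above can be summarized in a short sketch. The exact edge set is not specified here, so the snippet below assumes, for illustration, that edges connect units that are adjacent along any retained segmentation path, plus adjacent characters; `build_lattice`, its span-based node encoding, and the `keep` parameter are hypothetical names.

```python
# A minimal sketch of lattice-graph construction under the assumptions stated above.
import random
from collections import defaultdict

def build_lattice(chars, seg_paths, keep=2):
    """chars: list of characters; seg_paths: list of segmentations,
    each a list of (start, end) word spans over `chars`."""
    kept = random.sample(seg_paths, min(keep, len(seg_paths)))   # retain a few paths at random
    nodes = [(i, i + 1) for i in range(len(chars))]              # character nodes as spans
    for path in kept:                                            # add word nodes from kept paths
        for span in path:
            if span not in nodes:
                nodes.append(span)
    edges = defaultdict(set)
    for path in kept:                                            # adjacent words along a path
        spans = sorted(path)
        for a, b in zip(spans, spans[1:]):
            edges[a].add(b)
            edges[b].add(a)
    for i in range(len(chars) - 1):                              # adjacent characters
        edges[(i, i + 1)].add((i + 1, i + 2))
        edges[(i + 1, i + 2)].add((i, i + 1))
    return nodes, edges
```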

Fig. 1. The framework of the GLCN-BERT model.

3.2 Model Description

As shown in Fig. 1, our model consists of four components: a lattice embedding module, a neighborhood interaction-based attention module, a leaping connection module and a final semantic classifier.

Lattice Embedding Module. For each node \(v_{i}\) in the lattice graph, the initial representation of word \(w_{i}\) is the aggregation of contextual character representations. We first concatenate the two original character-level text sequences into one and feed it into the pre-trained BERT model to obtain the contextual representation of each character, \(C=\left\{ c^\mathrm {CLS},c_{1}^{a},\cdots ,c_{T_{a}}^{a},c^\mathrm {SEP}, c_{1}^{b},\cdots ,c_{T_{b}}^{b},c^\mathrm {SEP} \right\} \).

Next, we denote the characters contained in each word \(w_{i}\) of each graph as \(\left\{ s_{i},s_{i+1},\cdots ,s_{i+n_{i}-1}\right\} \), meaning that node \(v_{i}\) spans \(n_{i}\) consecutive character tokens and \(s_{i}\) is the index of the first character of \(v_{i}\) in text \(S^{a}\) or \(S^{b}\). Then, for each character \(c_{i+k}\left( 0\le k\le n_{i}-1 \right) \) in \(w_{i}\), we compute a feature-wise score vector \(u_{i+k}\) with a two-layer feed-forward network (abbr., FFN), as in [2], and normalize it with a feature-wise softmax, as shown in Fig. 2.

$$\begin{aligned} \mathrm {u}_{i+k}=\mathrm {softmax}\left( \mathrm {FFN}\left( c_{i+k} \right) \right) \end{aligned}$$
(1)

The corresponding character embedding \(c_{i+k}\) is weighted by the normalized scores \(u_{i+k}\) to obtain the initial node embedding \(\mathrm{v}_{i}= {\textstyle \sum _{k=0}^{n_{i}-1}\mathrm{u}_{i+k}\odot \mathrm{c}_{i+k}} \), where \(\odot \) denotes the element-wise product of two vectors.

At the end of this module, we get two lattice graph embedding sets \(G^{a}\) and \(G^{b}\), which consist of both character-level and word-level representations.
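
As a concrete illustration of Eq. (1) and the weighted sum above, the following PyTorch sketch scores each character of a word feature-wise and aggregates the weighted character embeddings into the initial node embedding; the layer sizes and module name are illustrative assumptions rather than the authors' implementation.

```python
# A minimal sketch of the lattice embedding step (Eq. 1 plus the weighted sum).
import torch
import torch.nn as nn

class WordEmbedder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # two-layer feed-forward network producing a feature-wise score vector
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, char_embs):
        """char_embs: (n_i, dim) contextual BERT embeddings of the characters in word w_i."""
        scores = self.ffn(char_embs)              # (n_i, dim)
        u = torch.softmax(scores, dim=0)          # feature-wise softmax over the characters
        return (u * char_embs).sum(dim=0)         # v_i = sum_k u_{i+k} ⊙ c_{i+k}
```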

Fig. 2. Contextual word embedding.

Neighborhood Interaction-Based Attention Module. The attention mechanism allows the learning process to focus on the parts of the graph that are more relevant to a specific task. As Fig. 2 shows, the graph attention classification module takes the contextual node embedding \(v_{i}\) as the initial representation \(h_{i}^{0}\) of each node \(v_{i}\) and then updates this representation from one layer to the next. We divide the update strategy into two steps:

(1) Message Propagation. At the l-th step, each node \(v_{i}\) in \(G^{a}\) (likewise for \(G^{b}\)) first aggregates messages from its neighbor nodes and then combines the result with its own representation from the previous iteration,

$$\begin{aligned} \mathbf {h}_{i}^{self}=\mathrm {GRU}\left( \mathbf {h}_{i}^{l-1}, \sigma \left( \sum _{v_{j} \in \mathcal {N}\left( v_{i} \right) }\alpha _{ij}\left( \mathbf {W}^{self}\mathbf {h}_{j}^{l-1} \right) \right) \right) \end{aligned}$$
(2)

In order to make full use of the information of \(G^{b}\), we also aggregate messages from all nodes in graph \(G^{b}\),

$$\begin{aligned} \mathbf {h} _{i}^{b}= \sigma \left( \sum _{v_{q} \in \mathcal {V}^{b} }\alpha _{iq}\left( \mathbf {W}^{b}\mathbf {h}_{q}^{l-1} \right) \right) \end{aligned}$$
(3)

Here, \(\sigma \) is a non-linear activation function, e.g., ReLU, and \(\alpha _{ij}\) and \(\alpha _{iq}\) are attention coefficients [13].

(2) Representation Updating. After message propagation, each node \(v_{i}\) updates its representation to \(\mathbf {h}_{i}^{l}={\text {GRU}}\left( \mathbf {h}_{i}^{\text{ self }}, \mathbf {h}_{i}^{\text{ b }}\right) \) with a gated recurrent unit (abbr., GRU) [3].

After updating node features for L steps, we obtain the graph-aware representation \(\mathbf {h}_{i}^{L}\) of each node \(v_{i}\).
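
A minimal PyTorch sketch of one update step (Eqs. (2) and (3) followed by the GRU update) is given below. The exact attention scoring function is not reproduced here; the sketch assumes an additive, GAT-style score in the spirit of [13], and all module and variable names are illustrative.

```python
# A sketch of one neighborhood-interaction update step under the stated assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphUpdateLayer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.w_self = nn.Linear(dim, dim, bias=False)
        self.w_cross = nn.Linear(dim, dim, bias=False)
        self.att_self = nn.Linear(2 * dim, 1, bias=False)     # assumed additive attention
        self.att_cross = nn.Linear(2 * dim, 1, bias=False)
        self.gru_self = nn.GRUCell(dim, dim)                   # GRU of Eq. (2)
        self.gru_update = nn.GRUCell(dim, dim)                 # GRU of the updating step

    def attend(self, query, keys, w, att):
        msgs = w(keys)                                                 # (m, dim)
        scores = att(torch.cat([query.expand_as(msgs), msgs], dim=-1)) # (m, 1)
        alpha = torch.softmax(scores, dim=0)                           # attention coefficients
        return F.relu((alpha * msgs).sum(dim=0))                       # sigma(sum alpha * W h)

    def forward(self, h_a, adj_a, h_b):
        """h_a: (n_a, dim) nodes of G^a; adj_a[i]: LongTensor of neighbor indices of node i;
        h_b: (n_b, dim) all nodes of G^b."""
        new_h = []
        for i, nbrs in enumerate(adj_a):
            m_self = self.attend(h_a[i], h_a[nbrs], self.w_self, self.att_self)      # Eq. (2) inner term
            h_self = self.gru_self(m_self.unsqueeze(0), h_a[i].unsqueeze(0))         # Eq. (2)
            m_cross = self.attend(h_a[i], h_b, self.w_cross, self.att_cross)         # Eq. (3)
            new_h.append(self.gru_update(m_cross.unsqueeze(0), h_self).squeeze(0))   # h_i^l
        return torch.stack(new_h)
```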

Leaping Connection Module. Without introducing any additional parameters, we adopt max-pooling as the core of the LC module, as shown in Fig. 3, which balances the trade-off between training cost and over-smoothing. This module yields the final representation \(\mathbf {h}_{v}^{final}=\mathrm {MaxPooling}\left( \mathbf {h}_{v}^{1},\mathbf {h}_{v}^{2},\cdot \cdot \cdot ,\mathbf {h}_{v}^{L} \right) \), where \(\left\{ \mathbf {h}_{v}^{1},\mathbf {h}_{v}^{2},\cdot \cdot \cdot ,\mathbf {h}_{v}^{L} \right\} \) are the representations of node v at each layer.
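
A minimal sketch of this pooling, assuming the per-layer node representations are stacked and max-pooled element-wise:

```python
# Leaping connection: element-wise max over per-layer states, no learned parameters.
import torch

def leaping_connection(layer_states):
    """layer_states: list of L tensors, each (num_nodes, dim), one per layer."""
    stacked = torch.stack(layer_states, dim=0)   # (L, num_nodes, dim)
    return stacked.max(dim=0).values             # h_v^final = MaxPooling(h_v^1, ..., h_v^L)
```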

Fig. 3. Leaping connection module.

For each text \(S^{a}\) or \(S^{b}\), the text representation vector \(\mathbf {r}^{a}\) or \(\mathbf {r}^{b}\) is obtained by attentive pooling, which aggregates the representations of all nodes in each graph.
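
The exact form of the attentive pooling is not detailed above; the sketch below assumes a single learned scoring layer followed by a softmax-weighted sum over the nodes of one graph.

```python
# A minimal attentive-pooling sketch under the stated assumption.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, node_states):
        """node_states: (num_nodes, dim) final representations of one graph."""
        alpha = torch.softmax(self.score(node_states), dim=0)  # (num_nodes, 1)
        return (alpha * node_states).sum(dim=0)                # text vector r^a or r^b
```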

Semantic Classifier. Given the two text vectors \(\mathbf {r}^{a}\) and \(\mathbf {r}^{b}\) and the vector \(\mathbf {c}^\mathrm {CLS}\) obtained from BERT, our model predicts the similarity of the two texts; the training objective is to minimize the binary cross-entropy loss.
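
How the three vectors are fused is not specified above; the sketch below assumes a simple concatenation followed by one linear layer and the binary cross-entropy objective.

```python
# A minimal classifier sketch; the concatenation-based fusion is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticClassifier(nn.Module):
    def __init__(self, dim=128, bert_dim=768):
        super().__init__()
        self.fc = nn.Linear(2 * dim + bert_dim, 1)

    def forward(self, r_a, r_b, c_cls, label=None):
        logit = self.fc(torch.cat([r_a, r_b, c_cls], dim=-1))   # fused text-pair score
        if label is None:
            return torch.sigmoid(logit)                         # predicted similarity
        # training objective: binary cross-entropy; label is a tensor of shape (1,)
        return F.binary_cross_entropy_with_logits(logit, label)
```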

4 Experiment

4.1 Experimental Setup

Dataset. We conduct experiments on three Chinese datasets for the Chinese short text semantic classification task: LCQMC [8], BQ [1] and ATEC. ATEC is the semantic similarity learning contest dataset provided by Ant Financial Services Group. Each sample in all datasets contains a pair of texts and a binary label indicating whether the two texts have the same meaning or share the same intention. The statistics of the datasets are shown in Table 1.

Table 1. Features of three datasets

Hyper-parameters. The number of neighborhood interaction graph updating layers L is 3 for all datasets. The dimensions of both the word representation and the hidden size are 128. The model is trained with AdamW using an initial learning rate of 0.0002 and a warmup rate of 0.1. The learning rate of the BERT layer is multiplied by an additional factor of 0.1. We use a batch size of 32 for all datasets. Dropout is applied after the word and character embedding layers with a keep rate of 0.3, and before the fully connected layers with a keep rate of 0.5. Moreover, the patience number is 4.
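
An illustrative optimizer setup matching the reported hyper-parameters is sketched below; the linear warmup schedule, the `model.bert` attribute, and the parameter grouping are assumptions about the implementation, not the authors' code.

```python
# A sketch of the optimizer configuration under the stated assumptions.
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, lr=2e-4, warmup_rate=0.1):
    params = [
        {"params": model.bert.parameters(), "lr": lr * 0.1},   # BERT layer: lr x 0.1
        {"params": [p for n, p in model.named_parameters() if not n.startswith("bert")]},
    ]
    optimizer = torch.optim.AdamW(params, lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, int(warmup_rate * total_steps), total_steps)
    return optimizer, scheduler
```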

Environment Settings. Our model is implemented in Python 3.7 with the PyTorch framework. All the following experiments are conducted on one CentOS server with two Intel Xeon 2.2 GHz CPUs, 128 GB RAM, and one RTX 2080Ti GPU. The input word lattice graphs are produced by combining the segmentation tools jieba and HanNLP.

4.2 Evaluation Metrics and Baseline

Evaluation Metrics. For each dataset, accuracy (abbr., ACC.) and F1 score are used as the evaluation metrics. ACC. is the percentage of correctly classified examples; the F1 score of matching is the harmonic mean of precision and recall.
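
For reference, a minimal sketch of the two metrics, assuming binary predictions and gold labels given as 0/1 lists:

```python
# Accuracy and F1 over binary predictions; standard definitions only.
def accuracy_f1(preds, golds):
    tp = sum(p == 1 and g == 1 for p, g in zip(preds, golds))
    fp = sum(p == 1 and g == 0 for p, g in zip(preds, golds))
    fn = sum(p == 0 and g == 1 for p, g in zip(preds, golds))
    acc = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1
```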

Baseline. We compare our model with several BERT-based models pre-trained on large-scale corpora. BERT-base [5] is the official Chinese BERT model released by Google; it discards the traditional RNN and CNN and reduces the distance between any two words to 1 through the attention mechanism. ERNIE [12] is designed to learn language representations enhanced by knowledge masking strategies, which include entity-level masking and phrase-level masking. BERT-wwm [4] is a Chinese BERT trained on the latest Chinese Wikipedia dump with whole word masking applied to Chinese text. BERT-wwm-ext [4] is a variant of BERT-wwm with more training data and training steps. ERNIE2.0 [11] is an upgraded version of ERNIE that introduces a mechanism for continual learning. RoBERTa [9] is an enhanced version of BERT that modifies key hyperparameters, removes the next-sentence pre-training objective, and trains with larger mini-batches and learning rates.

4.3 Result and Analysis

From Table 2, we find that the BERT variants all outperform the original model, which indicates that using word-level information in pre-training is crucial for Chinese text classification. Our model GLCN-BERT performs better than almost all of these BERT-based models, demonstrating that using word-level information and different fusion methods in the fine-tuning stage effectively boosts performance. It can even rival larger models trained on larger corpora with longer training time.

Table 2. Performance of various models on LCQMC, BQ and ATEC test datasets.
Fig. 4. Test accuracy.

Fig. 5. Early stopping epochs and average text length.

In addition, as shown in Fig. 4, the leaping connection method significantly improves the performance of the model on all three datasets, indicating that our model aggregates the information of each node and its neighbors well. This may be because short texts contain only short context sequences: as the depth increases, the expanding aggregation range causes each node to absorb too much global information, which can easily lead to overfitting. Our model takes this problem into account and avoids it effectively.

Finally, Fig. 5 shows the results when we set the early stopping value to 3 (training stops when the best result is not exceeded three times in a row). It shows that for short text pairs, a small number of epochs already tends to achieve a good result.

5 Conclusion and Future Work

In this work, we propose a Graph Attentive Leaping Connection Network (GLCN-BERT) for Chinese short text classification. Our model takes two word lattice graphs as input and utilizes a graph attention network structure to obtain information from each layer. The leaping connection method then aggregates this information flexibly while avoiding overfitting. The proposed approach is evaluated on three Chinese benchmark datasets and achieves the best performance. Extensive experiments also demonstrate that both semantic information and multi-granularity information are essential for text classification modeling.

In the future, we will further investigate the effect of network depth on text classification and introduce external knowledge, such as a paraphrase database, to help learn more accurate and robust text representations.