1 Introduction

As a fundamental task in NLP, text classification underpins many downstream applications and has received continuous attention from researchers due to its wide spectrum of uses, such as sentiment analysis [1], topic labeling [2], and disease diagnosis [3]. Early text classification was dominated by statistical models such as Naive Bayes (NB) [4], k-nearest neighbor (KNN) [5], and support vector machine (SVM) [6]. These traditional methods use sparse Bag-of-Words (BoW) representations of texts, which makes it straightforward to design models for binary, multi-class, and multi-label classification problems. Although traditional text classification methods can reach reasonable performance, they still suffer from issues such as sparse features and limited representation ability. As a result, they cannot fully capture the semantics of natural language or produce expressive features to represent it.

With the development of deep learning, these problems have gradually been relieved. Deep learning learns a set of nonlinear transformations that map raw input to output, effectively folding feature engineering into the model fitting process. For instance, the Convolutional Neural Network (CNN) [7] and the Recurrent Neural Network (RNN) [8] are both essential text classification methods, and many extended models have been built on them, such as TextCNN [7], TextRNN [9], TextRCNN [10], fastText [11], long short-term memory (LSTM), and Bi-LSTM. Compared with traditional models, these models achieve superior performance. The key is that deep learning methods produce better text representations, which yield significantly improved performance even with off-the-shelf linear classifiers.

However, these methods still have notable drawbacks. They only capture semantic information in local consecutive word sequences and miss long-distance [12] and non-consecutive word interactions. To relieve these problems, researchers have turned to graph neural networks (GNNs) [13]. A GNN operates on a rich relational structure and can preserve the global structure of a graph in its embeddings, so long-distance interactions between words can be captured to improve the final text classification performance. GNNs have also received wide attention [14] because of their superior performance, and many text classification models are built on them, such as Text-Level-GNN, TextGCN, TextING, and TensorGCN. However, these graph-based methods still have several problems. First, context-aware word relations are neglected and memory consumption is high. For example, TextGCN [15] shows how to convert text into a graph, but it consumes too much memory and does not consider text-level word interactions [16], so the model cannot understand the semantics of a text well. Second, they cannot exploit the rich relational information present among entities in texts. Specifically, TextING [17] simplifies the text graph and reduces memory consumption, but it neglects the representation of semantic features. Similarly, TensorGCN [18] ignores the update of nodes' semantic information, so it cannot sufficiently mine the semantics of the text.

We propose a new framework, named GText, which further mines semantic features and relationships within the text. First, we construct a text graph based on semantic features for each document [19], containing only words as nodes. We then obtain comprehensive contextual semantic relationships via the SIP (Semantic Information Passing) mechanism, and finally we obtain the text-level representation through a gate mechanism. Our highlights include: constructing a semantic features graph for each text, which simplifies the graph structure while capturing semantic relationships; using the SIP mechanism to collect and integrate text information and to enable non-consecutive word interactions; and capturing the text-level representation with a gate mechanism to improve the final classification performance. In short, compared with traditional models and the GNN-based models above, our model makes three contributions:

  • Our model builds a semantic features graph for each text, which simplifies the complexity of graph structure, and reduces memory consumption.

  • Our approach achieves semantic information interaction and integration between long-distance and non-consecutive words, and establishes a deeper relational representation between words for text-level representation.

  • Extensive experiments are conducted on several benchmark datasets to illustrate the effectiveness of GText for text classification.

The remainder of our paper is organized as follows: In Section 2, we describe the related work of text classification. In Section 3, the details of the proposed method are described. In Section 4, we present our experimental results and make the analysis. Finally, we briefly conclude the paper in Section 5.

2 Related work

Natural language processing has always been an important direction in the field of computer science and artificial intelligence, and text classification is a classic problem in natural language processing. In what follows, we briefly review existing studies on text classification methods.

2.1 Traditional text classification methods

Research on text classification started in the last century. In the early years, Naive Bayes [4], KNN [5], and SVM [6] were widely used text classification methods. Among them, the Naive Bayes classifier is a weak classifier; it is easy to build and suitable for large data sets, but it rests on the assumption that features are independent, which almost never holds in practice. KNN determines the category of a new document according to the similarity between document vectors. It is well suited when classification standards are uncertain, but it must compare the new text with all existing training documents when assigning a category, so its computational cost is very high. SVM is also a classical method with advantages on small samples, but its computational overhead is likewise relatively large. In short, these methods come at a cost in labor and efficiency, which limits their performance. With the arrival of the information era, the rapid development of the Internet, and the wide use of multimedia information, people have higher requirements for text classification, which has promoted deep learning methods.

2.2 Text classification based on deep learning

Over the last decade, a series of deep learning methods have been proposed to address these issues. Among them, neural networks such as RNNs and CNNs have been widely used in text classification. For example, the TextCNN [7] model uses kernels of multiple sizes to extract key information from sentences, which helps capture local correlations; however, its convolution and pooling operations lose word-order and position information. Similarly, the TextRNN [9] model applies a recurrent neural network to text classification, but it suffers from vanishing and exploding gradients, which makes it difficult to learn long-distance correlations in sequences. The fastText [11] model is another classical model: by introducing subword n-grams, it handles morphology, low-frequency words, and out-of-vocabulary words, and it achieves good results on tasks with many samples and many category labels. However, the large number of parameters to estimate can inflate the model and its memory footprint, which affects performance. In addition, many methods combine neural networks with attention or other mechanisms, such as ACT [20], MARTA [21], Knowledge-Aware Leap-LSTM [22], and SALNet [23], and there are also label-based methods, for instance AGN [24], LightXML [25], and HTTN [26].

2.3 Text classification with GNN

In recent years, researchers have begun to notice the distinct strengths of GNNs [27]. A GNN is a neural network that can operate directly on graph-structured data. Recently, more and more GNN-based models have been applied to text classification. For example, the TextGCN [15] model constructs a text graph for the whole corpus based on word co-occurrence and word-word semantic relationships, then learns a GCN over this graph to improve classification accuracy. However, because the graph is built for the entire corpus, memory consumption is too high, which hurts the model's practicality. Similarly, the TextING [17] model creates a text graph through word co-occurrence and classifies the text by aggregating the learned node features. However, a text graph built from word co-occurrence alone cannot represent the semantic relationships between nodes well, which affects classification performance.

Compared with the above models, our model relieves these problems well. Firstly, our model uses a graph structure, which addresses the long-distance learning problems of TextCNN, TextRNN, and other traditional models. Secondly, our model builds a graph for each text instead of for the whole corpus, and the graph contains only word nodes, which simplifies the graph structure and avoids excessive memory consumption. In addition, compared with the TextING model, our model builds the text graph from semantic similarity, which expresses the semantics of the text better. Finally, we use a gate mechanism, a form of attention, to mine the explicit keywords of the whole document, which strengthens the model's ability to find the text-level representation and thus improves classification performance.

3 Method

In this section, we introduce our model in detail. First, we create a text graph based on the semantic features of the text, so that each text has its own text-level graph representation. Then the SIP (Semantic Information Passing) mechanism ensures that contextual semantic features are not lost. Finally, an attention mechanism selects keywords for the text and classifies it according to the keyword information. The overall framework of our model is shown in Fig. 1.

Fig. 1 GText framework, illustrating the process of text classification by GText. First, we build a semantic features graph for every document; then we feed it into SIP (Semantic Information Passing, described in Section 3.2); finally, we obtain the text-level representation through the attention layers

3.1 Building semantic features graph

In this part, we create the semantic features graph; the construction procedure is sketched below. First, we extract and pre-process the text data and use embeddings to represent word semantic features. After obtaining the vector representations of the feature words, we build the semantic features graph.

Here, we take the unique words in the text as the vertices of the graph, recorded as v.

$$ v=\{v_{1},v_{2},v_{3},...\} $$
(1)

If the weight between two word nodes is greater than a set value, we consider there to be an association between them, which means there is an edge. An edge is marked as e, where i and j are two word nodes and ws is the sliding window size.

$$ e=\{e_{ij} \mid i\in ws,\ j\in ws\} $$
(2)

We use cosine similarity to calculate the semantic similarity between two word nodes, and the obtained similarity serves as the weight between the word nodes, indicating the degree of dependency between them.

$$ similarity=\frac{{\sum}_{i=1}^{n}{A_{i}\times B_{i}}}{\sqrt{{\sum}_{i=1}^{n}{(A_{i})^{2}}}\sqrt{{\sum}_{i=1}^{n}{(B_{i})^{2}}}} $$
(3)

The similarity is calculated within a set sliding window, whose size can be manually adjusted as needed.

Our model builds a semantic features graph for each text instead of for the whole corpus, which not only reduces unnecessary memory consumption but also improves the accuracy of semantic information transmission within the text. Moreover, the graph contains only word nodes, which reduces its complexity and improves the efficiency of node information propagation.

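To make the construction concrete, the following is a minimal sketch rather than the authors' released implementation; the function and parameter names (`build_semantic_graph`, `threshold`) and the default values are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Equation (3): cosine similarity between two word embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def build_semantic_graph(tokens, embeddings, window_size=3, threshold=0.5):
    """Build a semantic features graph for a single document.

    tokens:      list of (pre-processed) words in the document
    embeddings:  dict mapping each word to its embedding vector
    window_size: sliding window size ws in Equation (2)
    threshold:   minimum similarity for an edge to exist (assumed value)
    Returns the unique word nodes and a dict of weighted edges.
    """
    nodes = sorted(set(tokens))                        # unique words as vertices, Equation (1)
    edges = {}
    for start in range(max(len(tokens) - window_size + 1, 1)):
        window = tokens[start:start + window_size]
        for i in range(len(window)):
            for j in range(i + 1, len(window)):
                wi, wj = window[i], window[j]
                if wi == wj or (wi, wj) in edges or (wj, wi) in edges:
                    continue
                sim = cosine_similarity(embeddings[wi], embeddings[wj])
                if sim > threshold:                    # weight above the set value => edge
                    edges[(wi, wj)] = sim              # similarity used as the edge weight
    return nodes, edges
```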

3.2 Semantic information passing

After obtaining the semantic features graph, we set up the SIP semantic information passing mechanism to obtain more comprehensive and accurate semantic information. To ensure that each node in the graph keeps the most valuable semantic information and passes it on, each node interacts with its neighbors and gathers their information; therefore, no word node in the text graph exists in isolation.

$$ S=A_{n_{i}}{n_{i}^{t}}W_{s} $$
(4)

Here, S is the information of all neighbor nodes collected by node \(n_{i}\), and A is the adjacency matrix.

$$ \eta=sigmoid(W_{\eta}S+U_{\eta}{n_{i}^{t}}+b_{\eta}) $$
(5)
$$ a=sigmoid(W_{a}S+U_{a}{n_{i}^{t}}+b_{a}) $$
(6)

η and a are gate variables that determine the degree of information retention. They enable nodes to selectively retain the most valuable information, which supports the update and optimization of node information in the next step. ⊙ denotes the element-wise (Hadamard) product.

$$ \lambda=a\odot\eta $$
(7)
$$ n_{i}^{\prime}=tanh(W_{n_{i}^{\prime}}S+U_{n_{i}^{\prime}} \lambda+b_{n_{i}^{\prime}}) $$
(8)
$$ n_{i}^{t+1}=(1-\eta)\odot {n_{i}^{t}}+n_{i}^{\prime}\odot\eta $$
(9)

\(n_{i}^{t+1}\) is the node with accurate semantic information obtained by sufficiently updating node \({n_{i}^{t}}\). η determines the influence of neighbor nodes on node \({n_{i}^{t}}\) and the degree to which \({n_{i}^{t}}\) retains neighbor information. All U, W, and b are trainable parameters, continuously optimized during training to ensure the effective update of node information, thereby improving the semantic understanding of word nodes in the text and the subsequent text classification.
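As a concrete illustration, the following is a minimal NumPy sketch of one SIP step over all nodes in matrix form. It assumes the gates in (5), (6), and (8) take S and \(n_{i}^{t}\) as their arguments, so it is an interpretation of the equations rather than the authors' exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sip_step(H, A, W_s, W_eta, U_eta, b_eta, W_a, U_a, b_a, W_c, U_c, b_c):
    """One SIP update over all nodes (Equations (4)-(9)).

    H: (num_nodes, d) node features at step t, i.e. the n_i^t vectors stacked row-wise
    A: (num_nodes, num_nodes) adjacency matrix of the semantic features graph
    W_*/U_*: (d, d) trainable weight matrices; b_*: (d,) trainable biases
    """
    S = A @ H @ W_s                                   # Eq. (4): collect neighbor information
    eta = sigmoid(S @ W_eta + H @ U_eta + b_eta)      # Eq. (5): retention gate
    a = sigmoid(S @ W_a + H @ U_a + b_a)              # Eq. (6): second gate
    lam = a * eta                                     # Eq. (7): element-wise product of the gates
    H_cand = np.tanh(S @ W_c + lam @ U_c + b_c)       # Eq. (8): candidate node state
    return (1.0 - eta) * H + eta * H_cand             # Eq. (9): gated update to n_i^{t+1}
```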

3.3 Classification based on semantic

Through the previous two steps, the nodes in the semantic features graph have been fully updated, so each node carries more accurate text semantic information. We call the updated nodes the nodes at time t + 1, recorded as \(n_{i}^{t+1}\). We select the most semantically valuable nodes from these updated nodes with an attention mechanism, i.e., we select the important text-level representation of the text, and then make the final prediction for the text based on the selected keyword information. The functional expressions are defined as follows:

$$ W_{n}=MLP(n_{i}^{t+1}) $$
(10)
$$ h_{i}=\frac{1}{|v|}\sum\limits_{n\in v} W_{n}\, n^{t+1} +Max(n_{1}^{t+1},...,n_{|v|}^{t+1}) $$
(11)

where Wn is an attention weight that represents the significance of the word nodes. In addition, we apply a max-pooling function to the text representation and average the weighted word features, so that every word node has an impact on the final result while the keywords contribute more explicitly.

$$ y_{i}=softmax(h_{i} W_{n}+b) $$
(12)
$$ L=-\sum\limits_{i}{y_{label}}log(y_{i}) $$
(13)

Finally, the text-level representation is fed to the softmax layer for final label prediction, and the classification result is obtained. L in (13) is the cross-entropy loss used to train the model, where y_label is the ground-truth label.
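A minimal sketch of the readout and prediction step is given below; the use of a sigmoid layer for the MLP in (10) and a single linear classification layer are assumptions made for illustration, not the authors' exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def readout_and_classify(H_final, W_attn, b_attn, W_cls, b_cls):
    """Text-level readout and label prediction (Equations (10)-(12)).

    H_final: (num_nodes, d) updated node features n_i^{t+1}
    W_attn, b_attn: parameters of the attention MLP in Eq. (10)
    W_cls, b_cls:   parameters of the softmax classification layer in Eq. (12)
    """
    attn = sigmoid(H_final @ W_attn + b_attn)           # per-node attention weights W_n
    weighted = attn * H_final                            # weighted word features
    h = weighted.mean(axis=0) + H_final.max(axis=0)      # average + max-pooling readout, Eq. (11)
    return softmax(h @ W_cls + b_cls)                    # predicted label distribution, Eq. (12)
```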

4 Experiments

In this part, we evaluate the overall performance of our model GText on two benchmark datasets using test accuracy as the evaluation metric. To verify and analyze the model more comprehensively, we examine it from the aspects of experimental settings, result analysis, ablation experiments, and parameter sensitivity.

4.1 Datasets

Our experiments are conducted on two public and widely used standard datasets: the MR movie review dataset and the Ohsumed dataset.

MR dataset: MR is a classic movie review dataset used for binary sentiment classification and is widely adopted for evaluating text classification models. It divides movie reviews into positive and negative ones, with 5331 negative reviews and 5331 positive reviews. We split it into training and test sets.

Ohsumed dataset: The Ohsumed dataset comes from MEDLINE, a medical information database. It contains titles or abstracts from 270 medical journals, totalling 348,566 documents from 1987 to 1991. We use the 13,929 unique cardiovascular disease abstracts out of the first 20,000 documents of 1991; each document is labeled with one or more of 23 disease categories. For single-label classification, documents belonging to multiple categories are excluded, leaving 7400 documents with exactly one category, of which 3357 form the training set and 4043 form the test set.

Specific statistics of the datasets are shown in Table 1.

Table 1 Summary statistics of dataset

4.2 Baselines

To comprehensively evaluate our model, we compare GText with several well-recognized text classification models with good performance.

RNN [9]: RNN uses the last hidden state as the representation of the text. The recurrent neural network is applied to infer the label or label set of a given text (sentence, document, etc.), e.g., for sentiment analysis, news topic classification, and fake news detection.

CNN [7]: CNN performs convolution and max-pooling operations on word embeddings to obtain the representation of the text.

fastText [11]: fastText uses the average of word or n-gram embeddings as the document embedding. It combines successful concepts from natural language processing and machine learning, including bag-of-words and bag-of-n-grams representations of sentences, subword information, and the sharing of information among categories through hidden representations.

SWEM [28]: A simple word embedding model that applies simple pooling strategies over word embeddings.

TextGCN [15]: TextGCN converts the corpus into a single graph and learns a GCN over it for text classification.

TensorGCN [18]: TensorGCN builds text graphs from three aspects, namely semantics, word order, and syntax, and then integrates the three graphs, aiming to understand the semantics of the text accurately and improve text classification.

TextING [17]: TextING performs text classification over individual text graphs, giving it inductive learning ability and improving classification efficiency.

4.3 Settings

In this section, we introduce some details of our experiments. We use GloVe word embeddings, and all input embeddings are 300-dimensional. For each dataset we use the given training and test sets, and further divide the training set into an actual training set and a validation set at a ratio of 9:1. We set the learning rate to 0.001 and dropout to 0.5. For the baseline models, we use the default parameter values from the original papers or implementations. Our model is implemented in the TensorFlow framework, and test accuracy is used as the evaluation metric, under which our model achieves the best results compared with the other methods.
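For reference, the settings above can be summarized in a small configuration sketch; the optimizer, batch size, and epoch count are not stated in the text and are therefore omitted here.

```python
# Hyperparameters as described in Section 4.3; values not given in the paper are omitted.
config = {
    "word_embedding": "GloVe",   # pre-trained word vectors
    "embedding_dim": 300,
    "learning_rate": 0.001,
    "dropout": 0.5,
    "train_val_ratio": 0.9,      # actual training set : validation set = 9 : 1
    "framework": "TensorFlow",
    "metric": "test_accuracy",
}

def split_train_val(train_docs, ratio=0.9):
    """Split the provided training set into an actual training set and a validation set."""
    cut = int(len(train_docs) * ratio)
    return train_docs[:cut], train_docs[cut:]
```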

4.4 Experimental results and analysis

In this part, we present our experimental results and analyze them. As shown in Table 2, our model is almost always superior to the baseline models on the MR and Ohsumed datasets, and in most cases it outperforms even the strongest baseline. Figure 2 shows these differences more intuitively.

Table 2 Test accuracy comparison with baselines on benchmark datasets

Fig. 2 Test accuracy for different traditional models (left) and for different GNN-based models (right). The red brackets on each bar indicate the range of value changes

We find that, compared with our model, traditional neural network models such as RNN and CNN generally perform worse. This is because these models prioritize the order and local information of the text and ignore its global semantic information; moreover, they cannot propagate information over long distances. In GText, a node can have multiple neighbor nodes that are no longer limited to nearby positions and need not be adjacent in the text. Through information passing between neighbor nodes, semantic feature information is transmitted globally, and the text-level semantic information is expressed better. Neural network models such as CNN and RNN do not have such structural capabilities. In addition, the experimental results further show that our model is superior to the other GNN-based baselines in terms of test accuracy.

The TextGCN model successfully converts text into a graph, but the graph is built over the whole corpus, which consumes a large amount of storage space and cannot support online testing. Our model builds a text graph for every text; building a graph for each document not only avoids excessive resource consumption but also improves semantic understanding of the text.

The TensorGCN model takes semantics, word order, and syntax into account when constructing the graphs, but considering so many factors increases model complexity and latency and reduces efficiency. Since the ultimate purpose of text classification is to assign the text to its correct label, the model should focus on text semantics; our model concentrates on the understanding and transmission of text semantic information.

The TextING model builds the text graph using word co-occurrence, i.e., the frequency with which a group of words appears together is taken as their similarity. However, word co-occurrence is not always appropriate for text classification, because high-frequency word pairs do not necessarily represent the semantics of the text itself. Our model uses cosine similarity to identify all the keywords that may represent the theme of the text, and then uses the attention mechanism to select the most representative ones as the basis for classification, which helps select the text-level semantic representation and improves classification performance.

The figures also show that our model performs better than the other GNN-based baselines. This is because our model can understand and transmit text semantics better and has stronger semantic expression ability. In addition, we notice that the improvement of our model is larger on the Ohsumed dataset. This may be because the MR dataset consists of short texts with low-density text graphs, so our semantic features graph and SIP mechanism have less impact on it, whereas the complex long sentences and large vocabulary of Ohsumed allow the understanding and transmission capabilities of our model to come into full play, ultimately improving classification performance.

4.5 Ablation experiment

To further study the influence of each component of GText on the overall performance, we design several ablation experiments.

It can be seen from Fig. 3 that the ablated variants perform clearly worse than the full GText. The results show that the discarded modules have a significant impact on the performance of the model.

Fig. 3 Test accuracy of GText in the ablation experiments

After removing the semantic graph module, the model becomes w/o semantic. Figure 3 shows that the performance of w/o semantic declines on both the MR and Ohsumed datasets, which indicates that the semantic graph has an important impact on the performance of the model. A good semantic graph connects truly semantically related words and correctly defines their degree of correlation, so that the model can fully understand their relationships, which reduces the burden on subsequent modules and increases classification accuracy.

The w/o SIP model is formed by removing the SIP information passing mechanism from GText. The figure shows that the test accuracy of the model without the SIP module drops considerably on both MR and Ohsumed, which illustrates the importance of the SIP mechanism. SIP spreads messages flexibly through the text and retains them effectively according to the semantic understanding of the text, which reduces unnecessary waste of resources, increases the efficiency of the model, and improves classification accuracy.

After removing the attention mechanism module from GText, the w/o attention model is formed. Figure 3 shows that the performance declines to some extent after removing the attention module. The attention mechanism helps the model select the keyword semantic features in the text and improves its test accuracy.

4.6 Parameter sensitivity

The model performance on MR and Ohsumed with different parameters is reported in Fig. 4. For the number of graph layers, the best performance is achieved with 3 layers for MR and 4 layers for Ohsumed. This indicates that as the number of graph layers increases, nodes receive more neighbor information; however, beyond a certain depth the test accuracy starts to decline, meaning that the learning ability of the nodes is limited.

Fig. 4 Impact of the graph layers, window size, learning rate, and dropout on the MR and Ohsumed datasets. The values 1 and 2 on the dataset axis represent MR and Ohsumed, respectively. The main comparison is the influence of these parameters on test accuracy

Figure 4 also shows the performance of GText with different window sizes. As the window size increases, the test accuracy on MR and Ohsumed first increases; after reaching a peak, it begins to decline, showing that performance is also affected by the window size.

Figure 4 further shows the performance of GText with varying learning rates on MR and Ohsumed. As the learning rate increases, the GText model learns more semantic information; however, with a further increase the trend reverses, since the model converges prematurely. In addition, Fig. 4 illustrates the test accuracy of GText with varying dropout on MR and Ohsumed, which shows a trend similar to that of the learning rate as the dropout value increases.

5 Conclusion

For text classification, previous research focuses on the locality of words and ignores text-level word interactions. In this paper, inspired by how human beings understand a text and acquire knowledge, we fully mine semantic features and relationships from multiple perspectives, and experimental results show that our model is superior to the best baseline. We build a semantic features graph for each document separately to capture the semantic relationships between word nodes. Each node exchanges information with its neighbors to transmit semantic information, which better connects contextual information. Our highlights are as follows: first, we achieve fine-grained text-level word interaction; second, we obtain more comprehensive semantic information; third, experiments show that our model has clear advantages in contextual semantic transmission and semantic information selection. Trainable gate parameters determine the influence of a node's neighbor information, improving the accuracy of semantic information in the model, ensuring that key semantic information is retained, and allowing the most effective semantic information to play its role in text classification, thereby improving classification accuracy and model efficiency.