1 Introduction

With the rapid development of social media, massive amounts of text data are generated continuously. Therefore, obtaining adequate information from massive network data has become a research hotspot in academia. As one of the basic tasks of NLP, text classification plays an essential role in information extraction and has many applications, including question answering, spam detection, sentiment analysis, news categorization, user intent classification, etc. [16]. In recent years, many deep learning methods have been proposed to promote the development of text classification research [2, 11, 23, 28, 31]. However, most of these existing deep-learning methods face several challenges as the length of the input text increases.

Recently, long text classification has been nontrivial due to the following challenges. The first challenge is that it is difficult to preserve and extract useful information from long texts after preprocessing because of their rich and complex content. The most direct method to address this problem is to split the long text into multiple short pieces and process them separately, which includes the following two types of processing methods. The first type truncates the text to a specific character length in order. One of the most well-known methods is a transformer-based pre-trained model named Bert, which uses a masked language model and limits the input length to pre-train bidirectional transformers [3]. The second type selects specific paragraphs or sentences to represent the text. For example, Chen et al. [1] constructed a multi-task architecture, which jointly trains an Albert [12] model on key-sentence extraction with a distance square loss and on multi-label long text classification with a cross-entropy loss. To better capture the semantics of long texts, Du et al. [5] proposed a Knowledge-Aware Leap-LSTM that skips irrelevant words in the input to accelerate LSTM models by integrating prior human knowledge. However, this method of inputting the entire text sequence for processing cannot distinguish the noise information and therefore cannot accurately extract important information from the text. Although truncation and selection can condense long texts to some extent, they still inevitably discard part of the semantic and structural information, which leads to the loss of essential information and results in misjudgment of the model.

The second challenge is the complicated construction of training sets for long texts. Massive new text data in different styles are constantly generated, which requires a lot of newly labeled data for existing deep learning models to learn. The emergence of graph neural networks (GNN) provided a new direction for solving this problem [10]. They aggregate the information of neighbor nodes in the relational network constructed from different texts to achieve effects similar to other methods, but only require a small amount of labeled data. Meanwhile, GNN-based models can better preserve structural and semantic information by modeling the corpus. For example, Yao et al. [30] built a text graph for the corpus based on word co-occurrences and document-word relationships, then jointly learned the embeddings of both words and documents with graph convolutional neural networks. Ragesh et al. [20] designed a heterogeneous graph convolutional network modeling approach to learn feature embeddings and derive document embeddings by combining the best aspects of PTE [22] and TextGCN [30]. Moreover, Linmei et al. [13] proposed a heterogeneous graph attention network with a two-level attention mechanism for learning the importance of different neighboring nodes and node types to the current node. These GNN-based models aggregate the information of neighbors to strengthen node representations by semi-supervised learning. These existing methods simply construct a heterogeneous graph from documents or keywords, and their authors hold that the most important thing is to enrich the representation of the node itself through its neighbors. However, because a long text contains many words and complex features, it is also necessary to consider the semantic relationships inside the text and the high-order semantic structures in the graph. Unfortunately, these methods do not take these aspects into account.

To address the above problems, we propose a novel Heterogeneous Attention Network for semi-supervised Long Text classification (Han-LT). Firstly, according to the characteristics of long texts, we define the multi-interrelation based on entity-keyword-title. We extract the titles, entities, and keywords from the texts and obtain their initial embeddings. Then, their multi-interrelation is identified within and between texts, and edges are built according to the multi-interrelation to construct the heterogeneous information graph. In this way, the semantic and structural information of long texts can be preserved to a great extent. Secondly, a multi-semantic passing framework is designed to extract crucial semantic and structural information. Specifically, we first put forward the definition of the semantic degree to measure the importance of different semantic structures in the heterogeneous information graph. Then, the attention mechanism and the semantic degree are combined to capture high-order semantic information while capturing the importance differences among neighbor nodes. Finally, we construct a heterogeneous neural network named Han-LT based on the multi-interrelation heterogeneous information graph and the multi-semantic passing framework, and obtain the classification results by adding a softmax layer at the end of the network. The main contributions of this paper can be summarized as follows:

  • We construct a novel heterogeneous information graph for long texts by extracting titles, entities, keywords, and their multi-interrelation to preserve their significant semantic and structural information.

  • We design a special multi-semantic passing framework for capturing the importance of different nodes, higher-order semantics, and structural information by combining the attention mechanism and the semantic degree.

  • We evaluate the effectiveness of Han-LT and compare it with 7 state-of-the-art methods. Extensive experimental results show the superiority of our Han-LT method on the long text classification task.

In Section 2, we introduce the related work on text classification in deep learning and graph neural networks. In Section 3, we elaborate on our Han-LT method. In Section 4, we present a large number of designed experiments, and the experimental results and related analysis verify the superiority of Han-LT. Section 5 concludes this paper.

2 Related work

2.1 Text classification in deep learning

In the past decades, text classification has gradually shifted from shallow learning models to deep learning models. Deep learning methods avoid the manual design of rules and features. Convolutional neural networks (CNN) [11] were originally proposed for image classification, achieved groundbreaking results, and ushered in the era of deep learning. To apply CNN to text classification tasks, Kim et al. put forward a convolutional neural network called TextCNN [9]. It took an embedding obtained with a pre-trained word vector method as input and determined discriminative phrases through one convolution layer and one max-pooling layer. Tan et al. [21] utilized gated units and shortcut connections to transform and carry word information, controlling how much context information is incorporated into each position of the word embedding matrix of the text. Considering long-range or sequential semantics, Peng et al. [18] fed the word matrix that maintained word order into an attention graph capsule recursive CNN to learn semantic features, and then designed a hierarchical classification embedding method to learn the hierarchical relationships between category labels. To alleviate computational complexity, Johnson et al. [8] developed a low-complexity, word-level deep convolutional neural network for text classification called DPCNN. It obtains a global representation of the text by deepening the network without greatly increasing the computational cost. However, applying these methods directly to long texts is not satisfactory: the excessive length of the long text makes the network too complex, which leads to gradient vanishing or network degradation.

Recurrent Neural Networks (RNNs) have been widely used to capture long-term dependencies through recursive computation, and their performance in long text classification tasks is better than that of CNNs. For instance, Liu et al. [14] designed a model to capture long text semantics, which could extract context information and effectively reduce the time complexity of the model. Du et al. [4] proposed the Pointer-LSTM framework, which relied on a pointer network to select important words for target prediction. It generated a self-attention distribution over the whole input sequence through a small bidirectional LSTM network; then, a large BiLSTM network was used to obtain the top-k keywords for target prediction. Later, the authors put forward the Knowledge-Aware Leap-LSTM [5] to skip irrelevant words in the input and accelerate RNN models by integrating prior human knowledge. It integrated prior knowledge through factorized and gated integration to partially supervise the word-skipping process, which achieved higher accuracy and faster training speed. Moreover, Du et al. [6] proposed a recurrent BLS (R-BLS) and an LSTM-style gated BLS (G-BLS) architecture to learn multiple kinds of information simultaneously and achieve high accuracy in text classification. Unfortunately, the gradient problem of these RNN-based methods still exists and may become intractable when facing longer sequences. In addition, all these CNN-based and RNN-based methods are data-driven and usually require a large amount of high-quality labeled data or prior professional knowledge to achieve high performance.

The emergence of pre-trained models, such as Bert [3], GPT [19], XLNet [29], MacBERT [26], etc., has greatly promoted the development of text classification, especially for long and ultra-long texts. Bert adopted a novel masked language model to pre-train bidirectional transformers and generate deep bidirectional language representations. After pre-training, an output layer only needs to be added for fine-tuning to achieve state-of-the-art performance in tasks such as text classification. Meanwhile, unsupervised learning on large-scale data has significantly improved the classification performance of the model. However, Bert's mechanism requires the text to be truncated, so part of the semantic and global information is missing, which makes the classification results more likely to be disturbed by noise.

2.2 GNN for text classification

The appearance of GNN provided a new idea for the text classification task, which is transformed into a graph node classification task. GNN-based text classification methods can capture the structural information of texts, which other methods cannot replace.

In recent years, Graph Convolutional Networks (GCN) [10] performed convolution operations on graph-structured data and achieved attractive performance in various tasks. They can encode the characteristics of the graph structure and nodes without hand-designed features or fusion methods. Many variants of GCNs were proposed over the following years. These methods can be divided into 1) homogeneous graph neural networks and 2) heterogeneous graph neural networks. The difference between the two lies in how the graph is constructed and processed.

GraphSAGE [7] was a classic spatial-domain algorithm. It improved the traditional GCN in two aspects. First, during training, its sampling method replaced the full-graph sampling of GCN with node-centered neighbor sampling. Second, GraphSAGE studied several ways of neighbor aggregation. GAT [24] aggregated neighbor nodes through a self-attention mechanism to adaptively match the weights of different neighbors, which improved the accuracy of the model. Moreover, Yao et al. [30] designed a text graph convolutional network (TextGCN), which constructed a heterogeneous word-text graph for the entire dataset and captured global word co-occurrence information. These homogeneous graph neural network methods have achieved remarkable results in multiple fields. However, most networks in reality are heterogeneous, so it is essential to build and deal with heterogeneous graphs according to the actual situation. Zhang et al. [32] constructed the HetGNN model for processing heterogeneous graphs, which used LSTM for node-level aggregation and an attention mechanism for semantic-level aggregation. It can simultaneously capture the heterogeneity of structure and content and is suitable for both transductive and inductive tasks. The Heterogeneous Graph Attention Network (HGAT) with a two-level attention mechanism can learn the importance of different adjacent nodes and node types to the current node [27]. It propagates information on the graph and captures relationships to solve the semantic sparsity problem of semi-supervised short text classification. Ragesh et al. [20] designed a heterogeneous graph convolutional network modeling approach that uses different graphs across layers to learn feature embeddings and derive document embeddings. It greatly reduced the model's parameters and achieved better performance.

These GNN-based models have made remarkable achievements in text classification tasks by aggregating the information of a node's neighbors to enrich the embedding of the node itself. However, most of these methods use chapter-level texts as nodes or simply extract keywords as text embeddings, which inevitably leads to excessive computation or loss of semantic information when applied to the long text classification task.

3 The proposed method

In this paper, we propose a novel semi-supervised long text classification method named Han-LT, which can take advantage of limited labeled data to preserve and extract significant structural and semantic information. The general process of Han-LT is shown in Fig. 1. Firstly, we extract titles, entities, and keywords from long texts and obtain their initial embeddings using Bert and Word2vec [15]. Secondly, we give the definition of the multi-interrelation based on entity-keyword-title. The heterogeneous information graph is built based on the multi-interrelation to preserve the semantic and structural information of long texts. Thirdly, the definition of the semantic degree is used to measure the importance of different semantic structures in the heterogeneous information graph. By combining the semantic degree and the attention mechanism, we design the multi-semantic passing framework to capture the relationships between nodes and extract higher-order semantic and structural information. Finally, a softmax layer is added at the end of the network to obtain the final classification results.

Fig. 1 Illustration of our method Han-LT. (a) The acquisition of keywords, entities, titles, and their initial embeddings. (b) The heterogeneous information graph constructed by the multi-interrelation. (c) The multi-semantic passing framework, where AGGδ represents the multi-semantic passing mechanism and AGGα represents the attention mechanism. (d) The graph convolution layers. (e) The final classification result

3.1 Multi-interrelation heterogeneous information graph

Due to the high complexity of features, long text classification faces many challenges. The most difficult one is extracting valuable and essential information from complex features. Existing graph neural network methods usually construct information graphs simply from documents or keywords. This approach does not retain semantic information from the internal level of the text, which results in the loss of key information in subsequent processing. To address this issue, we present a heterogeneous graph construction method for the long text classification task. Specifically, we put forward the definition of the multi-interrelation based on entity-keyword-title to preserve the core semantic and structural information in long texts. The graph construction method is mainly divided into two steps. Firstly, we extract the titles, entities, and keywords from the texts and obtain their initial embeddings. Secondly, the multi-interrelations within and between texts are identified and edges are built for them accordingly.

Here, we construct the heterogeneous information graph G = (V, ξ) including entities E = {e1,...,em}, keywords K = {k1,...,ks}, and titles T = {t1,...,tn}, where V = E ∪ K ∪ T and ξ represents the relationships between nodes. The details of the graph construction are shown in Fig. 2 and described in the following paragraphs.

Fig. 2 Illustration of the multi-interrelation heterogeneous information graph for long texts

3.1.1 Information extraction

The title information is extracted directly, as the title is placed in the first row in most cases. Then the pre-trained Bert model is used to extract keywords and entities. Compared with other methods, Bert introduces the Masked Language Model (MLM) and Next Sentence Prediction (NSP) in pre-training to learn bidirectional features and capture the connection between two sentences. Therefore, the model has the ability to understand long-range contextual connections. Furthermore, a large-scale unlabeled corpus is used for pre-training, so the model contains text representation information with rich semantics. The text is processed into three embeddings (Bw, Bs, Bp) used as input to Bert. Bw is the word embedding. Bs is the segment embedding that helps Bert distinguish between paired input sequences. Bp is the position embedding, which indicates the index of the current word's position. The three embeddings are summed with dimensions (1, n, 768) to obtain the final Binput as Bert's input, where n represents the number of words in the text. At the end of the model, a fully connected layer is appended to obtain a 256-dimensional word embedding. The model is then fine-tuned on labeled keyword and entity corpora so that it has adequate extraction ability.
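To make the input construction concrete, the following minimal sketch shows how the three embeddings could be summed into Binput of shape (1, n, 768) and mapped to 256 dimensions by a fully connected layer. The vocabulary size, sequence length, and random token ids are hypothetical placeholders, and the Bert encoder itself is omitted.

```python
# A minimal sketch of assembling B_input = B_w + B_s + B_p and projecting to 256-d.
import torch
import torch.nn as nn

vocab_size, max_len, hidden, out_dim = 30000, 512, 768, 256  # illustrative sizes

word_emb = nn.Embedding(vocab_size, hidden)   # B_w: word embedding
seg_emb = nn.Embedding(2, hidden)             # B_s: segment embedding
pos_emb = nn.Embedding(max_len, hidden)       # B_p: position embedding
fc = nn.Linear(hidden, out_dim)               # final fully connected layer to 256-d

token_ids = torch.randint(0, vocab_size, (1, 128))            # n = 128 hypothetical tokens
segment_ids = torch.zeros_like(token_ids)                     # single segment
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)   # positions 0..n-1

b_input = word_emb(token_ids) + seg_emb(segment_ids) + pos_emb(position_ids)  # (1, n, 768)
# b_input would pass through the Bert encoder here; only the shapes are illustrated.
word_vectors = fc(b_input)                                    # (1, n, 256)
print(b_input.shape, word_vectors.shape)
```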

The embedding obtained from Bert is used as the initial embedding of keywords and entities. For the titles, Word2vec is chosen to embed them. It is worth noting that we treat the title as a separate sentence containing the core intent of the article, so the title and article information need to be processed separately when considering the relationships between texts. The semantics of titles are generally complete, and the words in a title can represent its semantics well. Furthermore, taking efficiency into consideration, Word2vec is finally chosen to embed the titles.
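As a rough illustration of the title embedding step, the sketch below trains a Word2vec model with gensim and averages the word vectors of a title; the toy corpus, the vector size, and the averaging strategy are assumptions rather than our exact settings.

```python
# A small sketch, assuming a title embedding is the average of its tokens' Word2vec vectors.
import numpy as np
from gensim.models import Word2Vec

corpus = [["stock", "market", "rises"], ["new", "vaccine", "approved"]]  # toy corpus
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

def embed_title(tokens, model):
    """Average the vectors of in-vocabulary tokens to embed a title."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

title_vec = embed_title(["stock", "market", "rises"], w2v)
print(title_vec.shape)  # (100,)
```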

3.1.2 Multi-interrelation and edge construction

Edges between different nodes are constructed according to the defined multi-interrelation and position information. Specifically, we construct a corresponding sub-graph for each text according to the multi-interrelation between different nodes. Then, the texts are connected through their titles and entities to finally obtain the multi-interrelation heterogeneous information graph.

Multi-interrelation:

Inside a text, the interrelation is expressed as the em-ks relationship in each sentence and the tn-em-ks relationship in the title, where ks and em appear in tn. Among texts, the interrelation is expressed as the ti-tj or ti-em-tj relationship, as well as the relationship between the same entities appearing in different texts and their interactive information.

The relationship between entities and keywords (em-ks) can represent the specific intent of the text, while the relationship of title-entity-keywords (tn-em-ks) can represent the core intent of the text. The relationship of title-title (ti-tj) and title-entity-title (ti-em-tj) can connect similar titles. As a core element of an article, a specific entity often appears in certain types of texts. Therefore, entities themselves have rich features and strong characteristics. We connect texts that contain the same entity.

Inside a text, we construct edges through the multi-interrelation between entities and keywords in each sentence. Entities and keywords in the same sentence are connected in the order of their appearance to complete the construction of the em-ks relationship. Among texts, we construct relationships between texts through the entities and the titles. Considering the relationship between texts, we regard the title as an independent sentence that contains the core intent of the article. If two titles contain the same entity, they are connected through this entity to complete the construction of ti-em-tj. Moreover, articles with similar titles are more likely to belong to the same category. Therefore, we use a similarity score s to measure the similarity of two titles when they do not contain the same entity. The similarity score s between title ti and title tj can be formulated as follows:

$$ \begin{aligned} s=\frac{{\sum}_{i=1}^{n} A_{i} \cdot B_{i}}{\left( {\sum}_{i=1}^{n} {A_{i}^{2}}\right)^{\frac{1}{2}} \cdot\left( {\sum}_{i=1}^{n} {B_{i}^{2}}\right)^{\frac{1}{2}} }, \end{aligned} $$
(1)

where A and B represent the vectors of ti and tj, respectively, and n represents the dimension of the vectors. If the similarity score s between titles ti and tj is greater than the set threshold, the ti-tj relationship is constructed. As for entities, we regard multiple identical entities in different texts as the same node, whereas the same keywords appearing in different texts are regarded as different nodes. The reason is that the same entity represents the same semantics in different articles in most cases, while keywords do not. Therefore, we connect different texts through entities and titles, while keywords are connected with their corresponding entities and titles. In this way, different texts can be related through title and entity information while keeping their established multi-interrelation. Furthermore, in Fig. 2(d), the color of each element in the heterogeneous information graph corresponds one-to-one to that in Fig. 2(a), which helps to better understand the multi-interrelation heterogeneous information graph construction method.
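The following sketch illustrates the cross-text edge construction described above under simplifying assumptions: each title carries a set of extracted entities and an embedding vector, titles sharing an entity are linked through that entity (ti-em-tj), and otherwise a ti-tj edge is added when the cosine similarity of Eq. (1) exceeds the threshold. The toy data and node naming are hypothetical.

```python
# A minimal sketch of cross-text edge construction via shared entities or title similarity.
import numpy as np

def cosine_similarity(a, b):
    """Eq. (1): cosine similarity between two title vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_cross_text_edges(titles, entities, vectors, threshold=0.6):
    """titles: node ids; entities: set of entities per title; vectors: title embeddings."""
    edges = []
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            shared = entities[i] & entities[j]
            if shared:                                    # t_i - e_m - t_j
                edges += [(titles[i], e) for e in shared]
                edges += [(titles[j], e) for e in shared]
            elif cosine_similarity(vectors[i], vectors[j]) > threshold:
                edges.append((titles[i], titles[j]))      # t_i - t_j
    return edges

titles = ["t1", "t2"]
entities = [{"e_apple"}, {"e_apple"}]          # hypothetical extracted entities
vectors = [np.random.rand(100), np.random.rand(100)]
print(build_cross_text_edges(titles, entities, vectors))
```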

The essential semantic and structural information of long texts is well preserved by constructing the novel multi-interrelation heterogeneous information graph. It also reduces much redundant information, which greatly benefits the subsequent classification task.

3.2 Multi-semantic passing framework

How to extract and represent the key information in a heterogeneous information graph is a complex problem in graph neural network-based long text classification. However, existing methods are more concerned with enriching the representation of a node itself through its neighbors, which inevitably loses important high-order semantic and local structural information, especially for complex information bodies such as long texts. To further capture this significant information, we design a novel multi-semantic passing framework based on the definition of the semantic degree. It can aggregate the information of different types of surrounding neighbors to obtain higher-order semantic information. In particular, combined with the constructed multi-interrelation heterogeneous information graph containing title, entity, and keyword information, more relevant information can be extracted.

Semantic degree:

The proportion of each specific semantic structure among all semantic structures in the multi-interrelation heterogeneous information graph.

The process is described as follows. Firstly, we search for and extract specific semantic structures in the heterogeneous information graph according to MotifNet [17], which analyzes integrated networks and searches for specific structures. Secondly, we define the semantic degree to measure the importance of specific semantic structures. By combining the semantic degree with the attention mechanism, the mutual importance between different nodes can be obtained, and the high-order semantic information can be captured according to the semantic degree of the structure in which the nodes are located. Finally, the semantic information retained by the node and the neighbors' information are locally propagated. The illustration of our multi-semantic passing framework is shown in Fig. 3.

Fig. 3 Illustration of our multi-semantic passing framework

3.2.1 Semantic degree of edge

In other networks, the motif is a metric used to measure the significance of a structure in a graph. In the heterogeneous information graph of long texts constructed from entity-keyword-title, motifs represent the core semantic information of texts to a large extent. For example, in an entity-keyword-keyword structure, the entity can be the subject of an event, and the keywords can be a time, an action, or a certain noun or adjective; such a semantic structure can therefore contain rich and important semantic information. Based on the semantic degree, we assign a weight to each edge that lies in a specific semantic structure. Formally, a semantic structure is denoted as f ∈ F, F = {f1, f2, ..., fk}, where k is the number of categories of semantic structures. The semantic degree ρf is given by

$$ \rho_{f} = 1 + \left( \frac{X_{f}}{{\sum}_{f \in F} X_{f}}\right)^{\frac{1}{2}}, $$
(2)

where ρf represents the semantic degree of each edge under the structure f, and Xf represents the number of occurrences of structure f in the entire heterogeneous information graph. Corresponding weights are set for each edge based on the semantic degree to differentiate the feature vectors of different nodes during aggregation. Besides, it is worth noting that some edges are not in any particular semantic structure, while others may be in multiple semantic structures. To accurately measure the semantic weight carried by each edge, the semantic degree of each edge is defined as

$$ \delta_{ij} = \prod\limits_{f \in F:\, e_{ij} \in f} \rho_{f}, $$
(3)

where δij represents the product of the semantic degrees of all semantic structures in which the edge eij is located, namely the semantic degree of the edge. The more types of semantic structures an edge lies in and the higher the semantic degrees of those structures, the larger the value of δij.
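A small sketch of Eqs. (2) and (3) is given below. It assumes the motif search has already produced, for every semantic structure type f, its count Xf and the set of edges it covers; edges lying in no structure keep a semantic degree of 1.

```python
# A sketch of Eqs. (2)-(3): semantic degrees of structures and of edges.
from collections import defaultdict

def semantic_degrees(structure_counts, structure_edges):
    """structure_counts: {f: X_f}; structure_edges: {f: [(i, j), ...]}."""
    total = sum(structure_counts.values())
    rho = {f: 1 + (x / total) ** 0.5 for f, x in structure_counts.items()}  # Eq. (2)
    delta = defaultdict(lambda: 1.0)             # edges in no structure keep degree 1
    for f, edges in structure_edges.items():
        for edge in edges:
            delta[edge] *= rho[f]                # Eq. (3): product over covering structures
    return rho, delta

# Toy counts for an entity-keyword-keyword and a title-entity-keyword structure.
rho, delta = semantic_degrees(
    {"e-k-k": 40, "t-e-k": 10},
    {"e-k-k": [(0, 1), (1, 2)], "t-e-k": [(0, 1)]},
)
print(rho, dict(delta))
```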

3.2.2 Multi-semantic message passing

According to the obtained semantic degree, a multi-semantic passing framework is designed for extracting the important higher-order semantics of long texts. Formally, for a graph G = (V, ξ), let \( X \in \mathbb{R}^{m \times n} \) be the feature matrix of the nodes, where each row is the feature vector of a node v. A is the adjacency matrix of G and D is the degree matrix, where \( D_{ii}={\sum }_{j} {A_{ij}} \). Moreover, each node is connected to itself. Then, according to the aggregation function AGG, the neighbors' information of node v is aggregated into Nv to update the embedding of node v recursively. Equations (4) and (5) demonstrate the steps of the attention mechanism:

$$ H_{N_{v}}^{l}= AGG({H_{j}^{l}},v_{j}\in N_{v_{i}}), $$
(4)
$$ H_{i}^{l+1}=\sigma(\alpha_{ij}\cdot \widetilde{A} \cdot W^{l}\cdot({H_{i}^{l}} \oplus H_{N_{v}}^{l})), $$
(5)

where \( \widetilde {A} = D^{-\frac {1}{2}} A D^{-\frac {1}{2}} \) is the symmetric normalized adjacency matrix. αij is the attention value of nodes vi and vj, which needs to be learned by the model and represents the different importance of each neighbor node to vi. The operator ⊕ denotes concatenation. σ denotes the activation function, such as Leaky ReLU. Wl is the trainable transformation matrix of layer l. Furthermore, \( {H_{i}^{0}}=X_{v_{i}} \). The calculation of αij is shown in Eq. (6),

$$ \alpha_{ij} = \frac{\exp(\sigma(\mu^{T} \cdot [{H_{i}^{l}} \oplus H_{j}^{l-1}]))}{{\sum}_{j^{\prime} \in N_{v_{i}}} \exp(\sigma(\mu^{T} \cdot [{H_{i}^{l}} \oplus H_{j^{\prime}}^{l-1}]))}, $$
(6)

where μ is the attention parameter. T = (τ1, τ2, τ3) denotes the different types of nodes, where τ1, τ2, τ3 represent the title, entity, and keyword types respectively. It is worth noting that attention values exist between all nodes, but not all nodes lie in a specific semantic structure. If both nodes of a node pair do not belong to any particular semantic structure, the semantic degree of the edge between them is treated as 1. The overall flow of the multi-semantic passing can be expressed as follows:

$$ H_{i}^{l+1}=\sigma(\delta_{ij} \cdot \alpha_{ij}\cdot \widetilde{A} \cdot W^{l}\cdot({H_{i}^{l}} \oplus H_{N_{v}}^{l})). $$
(7)

Considering the heterogeneity of different types of nodes, traditional methods generally concatenate the feature spaces of different node types to construct a new large feature space and set the values of the irrelevant dimensions of other types to 0 for summation. The obvious disadvantage of this approach is that it ignores the heterogeneous information of different nodes and increases the computational difficulty. To address this problem, we project different types of nodes into a common space through a type-specific transformation matrix Wτ. Thus, the representation Hl+1 is given by

$$ H^{l+1} = \sigma \left( \sum\limits_{\tau \in T} \delta \cdot \alpha \cdot \widetilde{A}_{\tau} \cdot W_{\tau}^{l} \cdot H_{\tau}^{l}\right), $$
(8)

where τ represents the type of the neighbor nodes. The rows of matrix \( \widetilde {A}_{\tau } \) correspond to all nodes, and the columns correspond to neighbor nodes of type τ. Then, the neighbor nodes of different types τ are aggregated with different transformation matrices \( W_{\tau }^{l}\) to obtain the final representation Hl+1 of node vi. The final aggregation formula can be described as

$$ H_{i}^{l+1} = \sigma \left( \sum\limits_{\tau \in T} \sum\limits_{j \in N_{\tau(i)}} \delta_{ij} \cdot \alpha_{ij} \cdot \widetilde{A}_{\tau} \cdot W_{\tau}^{l} \cdot {H_{j}^{l}}\right), $$
(9)

where Nτ(i) denotes the set of neighbors of node i belonging to type τ. In this way, the titles, entities, keywords, and the multi-interrelation information between them in the multi-interrelation heterogeneous information graph can be effectively aggregated, which yields higher-order semantic and structural information.
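The sketch below illustrates one layer of this aggregation in the spirit of Eq. (9): a type-specific transformation matrix Wτ per node type, a type-wise normalized adjacency, and edge weights that combine the attention values αij with the semantic degrees δij. It is a simplified, dense-matrix illustration with α and δ assumed to be precomputed, not a full implementation.

```python
# A simplified sketch of one multi-semantic passing layer in the spirit of Eq. (9).
import torch
import torch.nn as nn

class MultiSemanticLayer(nn.Module):
    def __init__(self, in_dims, out_dim, node_types=("title", "entity", "keyword")):
        super().__init__()
        # One type-specific transformation matrix W_tau per node type.
        self.w = nn.ModuleDict({t: nn.Linear(in_dims[t], out_dim, bias=False)
                                for t in node_types})

    def forward(self, h, adj, alpha, delta):
        """h[tau]: (N_tau, d_tau) features; adj[tau]: (N, N_tau) normalized adjacency;
        alpha[tau], delta[tau]: (N, N_tau) attention values and semantic degrees."""
        out = 0
        for tau, feats in h.items():
            weights = delta[tau] * alpha[tau] * adj[tau]   # delta_ij * alpha_ij * A~
            out = out + weights @ self.w[tau](feats)       # aggregate type-tau neighbors
        return torch.relu(out)

# Toy usage with random tensors (6 nodes in total, 2 of each type).
layer = MultiSemanticLayer({"title": 100, "entity": 256, "keyword": 256}, 64)
n, nt, ne, nk = 6, 2, 2, 2
h = {"title": torch.rand(nt, 100), "entity": torch.rand(ne, 256), "keyword": torch.rand(nk, 256)}
adj = {t: torch.rand(n, x) for t, x in [("title", nt), ("entity", ne), ("keyword", nk)]}
alpha = {t: torch.rand_like(a) for t, a in adj.items()}
delta = {t: torch.ones_like(a) for t, a in adj.items()}
print(layer(h, adj, alpha, delta).shape)  # (6, 64)
```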

3.2.3 Label classification

After going through an L-layer Han-LT, we feed the obtained final embedding Q of the long text into a softmax layer for classification. Formally,

$$ Z_{i} = softmax(Q_{i}^{(L)}). $$
(10)

Moreover, the binary cross-entropy loss function we utilized is as follows,

$$ \zeta = -\sum\limits_{i=1}^{N} \sum\limits_{j=1}^{t}(Y_{ij} \log(Z_{ij}) + (1-Y_{ij}) \log(1-Z_{ij})), $$
(11)

where t is the number of classes and N is the number of training examples. Yij denotes the binary ground-truth label value, and Zij represents the predicted value for long text i obtained by the Han-LT model, i.e., the likelihood that text i is labeled with class j.
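For clarity, the following minimal sketch reproduces Eqs. (10) and (11) on toy tensors: a softmax over the final embeddings followed by the summed binary cross-entropy; the shapes, random embeddings, and one-hot labels are placeholders.

```python
# A minimal sketch of Eqs. (10)-(11) on toy data.
import torch
import torch.nn.functional as F

N, t = 4, 3                              # training examples, classes (toy values)
Q = torch.randn(N, t)                    # final embeddings Q^(L) of the long texts
Y = torch.zeros(N, t).scatter_(1, torch.randint(0, t, (N, 1)), 1.0)  # one-hot labels

Z = F.softmax(Q, dim=1).clamp(1e-7, 1 - 1e-7)                    # Eq. (10)
loss = -(Y * torch.log(Z) + (1 - Y) * torch.log(1 - Z)).sum()    # Eq. (11)
print(loss.item())
```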

4 Experiments

Experiments have been conducted on four common datasets to evaluate the performance of the Han-LT method. This section introduces the datasets and preprocessing, the compared methods, the experimental settings and details, and the experimental results with corresponding analysis.

4.1 Datasets and preprocessing

We compare Han-LT with several state-of-the-art methods in different scenarios. Two Chinese datasets and two English datasets are selected from news topic classification and medical disease classification to perform our experiments. They are:

ThuCNews: The ThuCNews corpus is a news corpus generated by filtering the historical data of the Sina News RSS subscription channel from 2005 to 2011, which contains 14 news categories and about 740,000 news texts. About 6,000 pieces of text data with more than 300 characters are randomly selected for each category.

Sogou News: The Sogou News corpus is a news dataset provided by Sogou Lab, including the Sogou CA and Sogou CS datasets. It contains about 27,000 news items in ten categories. To balance the dataset, about 3,000 samples are randomly selected for each category, and the number of characters in each sample is greater than 300.

20NG: The 20newsgroups dataset is one of the international standard datasets for text classification, text mining, and information retrieval research. It contains 18,846 non-repeating news texts divided equally into 20 categories.

Ohsumed: Ohsumed contains 7,400 articles. Each article is a medical abstract with one or more labels from 23 cardiovascular disease categories. Since a document may carry multiple labels, the label with the highest level is taken as its final label in the experiments.

The two Chinese datasets are filtered by length to construct two real long text datasets, which makes our results more convincing for long text classification tasks. For all selected datasets, we remove stop words and low-frequency words (word frequency below 5). We select 70% of each dataset as the training set and the rest as the test set. About 30% of the data in the training set is labeled. In our datasets, all long texts contain entities that we have defined. The statistics of the pre-processed datasets are detailed in Table 1.
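The preprocessing and splitting procedure can be sketched as follows; the stop-word list and the random shuffling are assumptions, while the thresholds match the description above (frequency below 5, a 70/30 train/test split, roughly 30% of the training split labeled).

```python
# A sketch of the preprocessing described above, assuming tokenized documents.
import random
from collections import Counter

def preprocess(docs, stop_words, min_freq=5, train_ratio=0.7, labeled_ratio=0.3):
    """docs: list of token lists. Returns filtered docs and index splits."""
    freq = Counter(tok for doc in docs for tok in doc)
    docs = [[t for t in doc if t not in stop_words and freq[t] >= min_freq]
            for doc in docs]
    idx = list(range(len(docs)))
    random.shuffle(idx)
    cut = int(train_ratio * len(idx))
    train, test = idx[:cut], idx[cut:]
    labeled = train[: int(labeled_ratio * len(train))]   # labeled subset of the training set
    return docs, train, labeled, test
```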

Table 1 Summary statistics of datasets

4.2 Comparison of methods

To comprehensively evaluate our method, we compare it with the following 7 state-of-the-art algorithms:

CNN [11]: CNN is a classical neural network that utilizes convolutional computation. We explore a 13-layer CNN with two variants: 1) CNN-rand, which uses randomly initialized word embeddings, and 2) CNN-pre, which uses pre-trained word embeddings.

Bert [3]: A pre-trained model that stacks multiple transformer blocks and pre-trains bidirectional deep representations by jointly conditioning on context in all layers. We choose an existing trained Bert-base model and fine-tune it to convergence with our training data.

Pointer-LSTM [4]: An LSTM framework that relies on pointer networks to select important words for target prediction. It maintains a consistent input process for the LSTM module and allows it to vary the skip rate during inference.

TextGCN [30]: It uses a graph convolutional network to model the corpus for capturing neighborhood information, builds a text graph using word co-occurrence and word frequency information, and transforms the text classification problem into a node classification problem.

GAT [24]: Graph Attention Networks adopt the attention mechanism to learn the weights of neighbor nodes adaptively. A node's representation is obtained through the weighted summation of its neighbor nodes.

HAN [25]: It puts forward a novel dual-level attention mechanism, including node-level attention and semantic-level attention. The node-level attention learns the importance between the central node and its different types of neighbor nodes, and the semantic-level attention learns the importance of different meta-paths.

HeteGCN [20]: A heterogeneous graph convolutional network that combines the best aspects of PTE and TextGCN. It learns feature embeddings and derives document embeddings using a HeteGCN architecture with different graphs used across layers.

4.3 Experiments settings and details

The following experiments are conducted to compare and analyze our Han-LT method comprehensively. The first experiment provides an overall evaluation of all methods; our method achieves excellent results on multiple datasets, which demonstrates the effectiveness of Han-LT. The second and third experiments are designed to show the superiority and flexibility of the multi-interrelation heterogeneous information graph construction method and the multi-semantic passing framework. In the second experiment, the corpus is modeled with the heterogeneous information graph construction method of Han-LT, and the resulting graph is then processed with different heterogeneous graph neural networks. The third experiment is designed to demonstrate the scalability of the multi-semantic passing framework, which achieves good results on different heterogeneous graphs for text classification. The fourth experiment demonstrates the superiority of Han-LT among semi-supervised algorithms by changing the proportion of labeled data in the training set. To further verify the superiority of Han-LT in semi-supervised learning, the fifth experiment is designed to find out which part of Han-LT has a greater impact on semi-supervised learning. Then, the sixth experiment demonstrates the selection process of some core parameters of the method. Besides, for the readability of the experimental tables, the heterogeneous information graph is abbreviated as HIN.

We evaluate the performance of all classifier models using accuracy (Acc) and F1 score. Acc represents the ratio of correct predictions among all predicted samples. The F1 score is introduced to evaluate the model more comprehensively, and it is formulated as

$$ F1=\frac{2 \cdot P\cdot R}{P + R}, $$
(12)

where P represents precision and R represents recall. During model training, we train all models until the loss value converges and repeat this process ten times. Then, the best accuracy and F1 score obtained in each run are averaged as the final results.
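A minimal sketch of this evaluation protocol is shown below; the use of scikit-learn and macro-averaging for the F1 score are assumptions for illustration.

```python
# A sketch of averaging the per-run accuracy and F1 over ten runs.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def evaluate_runs(run_predictions, y_true):
    """run_predictions: list (one per run) of predicted label arrays."""
    accs = [accuracy_score(y_true, y_pred) for y_pred in run_predictions]
    f1s = [f1_score(y_true, y_pred, average="macro") for y_pred in run_predictions]
    return float(np.mean(accs)), float(np.mean(f1s))
```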

After experimental verification, we select the optimal number of entities and keywords for each document and the title similarity threshold x. To construct the heterogeneous information graph of long texts, we set the maximum number of entities extracted per document to K = 5 and the maximum number of keywords to J = 10. For all text corpora, the title similarity threshold x is set to 0.6. Furthermore, the initial word dimension is set to 258. As for model training, we set the learning rate l = 0.005, the regularization factor n = 1e-6, and the dropout rate to 0.5. All methods are run on a computer with an i7-9700kf CPU and an RTX 2070s GPU.

4.4 Experimental results

4.4.1 The overall experiment

Table 2 shows the classification accuracy and F1 score of different algorithms on the four datasets. The accuracy of Han-LT is higher than 98% on both Chinese long text datasets, reaching 98.86% on Sogou News, and it achieves 87.62% accuracy on the classic news classification dataset 20NG and 71.46% on the disease classification dataset Ohsumed. Compared with the classical attention mechanism-based method GAT, Han-LT performs better on all datasets, with the highest improvement reaching 8.9%. We note that Han-LT is improved by 0.38%–1.56% compared with the new baseline methods (except CNN-based methods) on the Chinese datasets, while it achieves an improvement of 0.55%–8.81% over the new baseline methods on the English datasets. Compared with the baselines, the improvement of Han-LT on the English datasets is significantly greater than that on the Chinese datasets because the Chinese datasets leave less room for improvement (the accuracy of the baseline methods is already about 97%). Nevertheless, Han-LT further improves the effect due to its ability to extract deeper semantic information. In general, Han-LT performs best on all datasets, which demonstrates the effectiveness and superiority of the Han-LT method on long text classification tasks.

Table 2 Test accuracy and F1 score of different methods on two Chinese datasets and two English datasets

It is noted that all methods, including Han-LT, perform better on the Chinese datasets than on the English datasets. After analysis, we conclude that this is because the data characteristics of each category in the Chinese datasets are obvious and the data of different categories differ considerably. The English datasets do not have this property, especially Ohsumed, whose original data may contain multiple labels for each sample, indicating that its textual information is intricate. Secondly, the English datasets contain fewer entities of the kinds we have defined. For example, most of the entities in Ohsumed are medical-related professional vocabulary. Nevertheless, the proposed Han-LT method can still achieve better results than other methods in such cases.

For a more in-depth performance analysis, we note that there are also specific differences among the baseline methods. For instance, CNN-pre, which uses pre-trained word vectors, is significantly improved compared with CNN-rand, which randomly initializes word vectors; this shows the importance of node representation learning. The pre-trained model Bert outperforms CNN-pre on the two datasets with longer text lengths but is not as good as CNN-pre on 20NG. Our analysis is that CNN can better model continuous and short-range semantics while Bert can better capture long-range semantic information. Similarly, the LSTM-based method Pointer-LSTM has a greater advantage in long sequence classification and performs better on longer texts. The graph neural network-based model TextGCN achieves results comparable to the pre-trained deep model Bert. Compared with CNN-pre, the GNN-based methods (such as TextGCN, GAT, HAN, and HeteGCN) can improve by up to 9.92% on the Ohsumed dataset, which is a significant improvement. The overall performance of GAT is better than that of TextGCN on most datasets since the attention mechanism can adaptively learn the weights of neighbor nodes, which indicates the superiority of the attention mechanism. HeteGCN combines the advantages of TextGCN and PTE to model the text corpus as a heterogeneous graph and achieves good results on the four datasets. However, these methods do not deeply consider the semantic information inside the text, resulting in the partial loss of the rich semantic information in long texts. The proposed Han-LT method takes this into account and achieves better results.

4.4.2 The analysis of the multi-interrelation heterogeneous information graph

The following experiments are designed to verify the superiority of the constructed heterogeneous information graph for long texts. We apply GAT, GCN, HeteGCN, and HAN to our constructed heterogeneous information graph for classification. GAT and GCN do not consider the heterogeneity of nodes, while HeteGCN and HAN do. To handle this, we treat nodes as homogeneous when running GAT and GCN. Although part of the feature information is lost, the core features of our graph construction method are preserved, namely the entities, keywords, titles, and their multi-interrelation.

The results obtained from the experiment are shown in Table 3. It can be seen that, with the graph we constructed, a certain degree of improvement is obtained on GCN, GAT, and HeteGCN. However, the performance of HAN is not very satisfactory because HAN requires the meta-paths to be specified manually in advance. Nevertheless, HAN's dual attention mechanism still yields a small improvement on the Chinese datasets with the heterogeneous information graph. The improvement of our graph construction method on GCN and GAT is more obvious, especially the 0.91% improvement of GAT-HIN on 20NG.

Table 3 Test accuracy and F1 score of different methods with the HIN we constructed

The experimental results show that the multi-interrelation heterogeneous information graph for long texts is superior. Because our method considers the relative position between words, the semantic relationships of the article are not only more intuitively acceptable to humans but also allow the model to learn more semantic information. Another benefit is that two distant but related words are allowed to be associated together, which makes the global semantics richer.

4.4.3 The analysis of the multi-semantic passing framework

Our heterogeneous information graph, constructed for the semantic information of long texts, has certain advantages in long text classification tasks. We design a multi-semantic passing framework on the constructed heterogeneous information graph to capture deeper semantics and more structural information. Moreover, we believe that the multi-semantic passing framework has a certain degree of adaptability and can also capture more semantic information in other heterogeneous graphs.

The process of the experiment is as follows. Firstly, we select three graph construction methods for text classification. 1) The most primitive document-document graph construction method, where edges are constructed according to the similarity between documents; this approach was used in GCN. 2) The document-word-based graph construction method mentioned in TextGCN, which constructs edges according to word co-occurrence and word frequency. 3) The graph construction method based on the unique words that appear in the text and the word co-occurrence mechanism proposed by TextING [33]. However, it is a separate graph construction method for each text; in this experiment the method is adopted inside the text, and the word co-occurrence mechanism is used to realize the interaction between texts. Secondly, we process the four text datasets through these three graph construction methods to construct the corresponding corpus information graphs. Thirdly, we compare their original algorithms with the multi-semantic passing framework. In particular, among these three graph construction methods, only the text-word graph can be considered heterogeneous, and the other two are homogeneous. To handle this, we treat the text-text and word-word graphs as homogeneous. Specifically, we change Aτ in the formula to A, and all nodes adopt the same transformation matrix.

The final results are shown in Table 4. The multi-semantic passing framework achieves better results than the original methods on these three graphs, especially on the word-word graph. We notice that on the document-document graph, the improvement is relatively insignificant after we apply the multi-semantic passing framework. This is because the homogeneous graph is constructed only from documents, and the semantic information captured by the multi-semantic passing framework is more graph-based and global than the semantic information inside the text. The multi-semantic passing framework we designed for the long text graph construction method is more sensitive to the capture of text semantics. Meanwhile, it can capture more structural information about the graph. Our multi-semantic passing framework can thus better extract semantic and structural information in both homogeneous and heterogeneous graphs.

Table 4 Test accuracy and F1 score of different graph construction methods with the multi-semantic passing framework

The above three experimental results show that the Han-LT method has obvious superiority on long text classification tasks. Moreover, the multi-interrelation heterogeneous information graph construction method and the multi-semantic passing framework in Han-LT are flexible and applicable. The main reasons why Han-LT works well are twofold: 1) The multi-interrelation heterogeneous information graph based on entities, titles, and keywords can better preserve the most significant semantic and structural information. 2) Based on the constructed heterogeneous information graph, the multi-semantic passing framework can adaptively find important nodes and extract more crucial higher-order semantic information to represent the long text. In conclusion, Han-LT shows clear superiority over the other methods.

4.5 Effects of labeled data size

Nowadays, training sets for long texts are difficult to construct, so there is an urgent need to develop semi-supervised long text classification methods. Han-LT produces good results in semi-supervised learning due to its superior local label transfer ability and can achieve better results than other methods with limited labeled data. We design the following experiment to verify the effectiveness of our semi-supervised long text classification method.

We choose 4 related algorithms, CNN-rand, TextGCN, HAN, and Han-LT, and study the effect of the number of labeled documents. Specifically, we vary the proportion of labeled texts on each dataset and compare their accuracy on all datasets. The proportion of labeled data is increased from 2% to 30%. In addition, the experimental results are the average values obtained by running each algorithm 10 times.

From Fig. 4, it can be seen that the accuracy of all algorithms on all datasets increases with the ratio of labeled data. Generally, the GNN-based methods achieve better accuracy, which indicates that they can make better use of limited labeled data through a message passing framework. When the proportion of labeled data is low, the performance of the other algorithms drops significantly, while our method still maintains a relatively high accuracy, which shows that it can better utilize limited annotated data to achieve better results in long text classification. This is because the multi-semantic passing framework can adaptively learn the importance of different nodes, which better spreads the label information of nodes locally. Owing to its superiority in semi-supervised learning, Han-LT achieves satisfactory results even when labeled data is relatively scarce.

Fig. 4 The test accuracy with different proportions of labeled documents on the four datasets

4.6 Ablation analysis

To further verify the superiority of Han-LT in semi-supervised learning and find out which part of Han-LT contributes more, we design an ablation experiment as follows. Specifically, we divide Han-LT into two parts: the multi-interrelation heterogeneous information graph and the multi-semantic passing framework. We name the variant that keeps only the multi-interrelation heterogeneous information graph Han-LT_mhg, and the variant that keeps only the multi-semantic passing framework Han-LT_mpf. The proportion of labeled text on each dataset is varied, and the accuracy of Han-LT, Han-LT_mhg, and Han-LT_mpf is compared on THUCNews and 20NG. The proportion of labeled data is set to 2%, 5%, 10%, 15%, 20%, 25%, and 30%.

The experimental results are shown in Fig. 5. It can be seen that the improvement of Han-LT_mhg with the multi-interrelation heterogeneous information graph is more pronounced when the proportion of labeled data is low. As the labeled data increases, the improvement of the Han-LT_mpf variant with the multi-semantic passing framework also becomes obvious. The improvement of Han-LT_mhg is evident because the information in the multi-interrelation heterogeneous information graph is compact: we link the core information of each article by the multi-interrelation, so articles of the same category directly become neighbors of each other with high probability. To a certain extent, we can think of the multi-interrelation heterogeneous information graph as a "pre-categorized" graph built from core informational elements and relationships. This is because we do not make major changes to the core network structure but focus on improving the network's ability to perceive the multi-interrelation. It is worth noting that adding either the multi-interrelation heterogeneous information graph or the multi-semantic passing framework yields a noticeable improvement over the original GAT and TextGCN methods. In general, our proposed Han-LT method is an ensemble of the multi-interrelation heterogeneous information graph and the multi-semantic passing framework. The multi-interrelation heterogeneous information graph connects the core elements of the article, and the multi-semantic passing framework captures the essential semantics on the graph. The combination of the two gives Han-LT its superiority in semi-supervised learning.

Fig. 5 Test accuracy of Han-LT variants on datasets with different proportions of labeled documents

4.7 Parameter analysis

For Han-LT, the selection of entities and keywords is essential, since it determines the difficulty of capturing semantic information and the running time of the algorithm. To intuitively show the process and reasoning behind our graph construction method, the following experiment is designed and its results are visualized for reference. We experiment separately by varying the number of keywords and entities extracted for each text at each run. Because entities are highly important but occur infrequently in the text, we set the number of extracted entities to 3, 5, 7, 9, 11, and 13. The number of extracted keywords is set to 3, 6, 9, 12, 15, 18, and 21.

Figure 6 shows the test accuracy with different numbers of entities on the four datasets, and Fig. 7 shows the test accuracy with different numbers of keywords. It can be seen that, at the beginning of the experiment, the accuracy increases with the number of selected entities and keywords on all datasets. However, when the number of selected entities on 20NG is greater than 5 or the number of selected keywords is greater than 10, the accuracy decreases as more entities or keywords are selected. This is because selecting too many words makes the constructed heterogeneous information graph complicated; edges are added between nodes that are not closely related, so the model cannot accurately extract the semantics of the text. Combining the best experimental performance, we finally select 5 entities and 10 keywords per document.
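The parameter sweep can be sketched as follows; train_and_eval is a hypothetical stand-in for constructing the graph with the given entity and keyword counts and running Han-LT, which is not reproduced here.

```python
# An illustrative sketch of the separate sweeps over entity and keyword counts.
def sweep(dataset, train_and_eval,
          entity_counts=(3, 5, 7, 9, 11, 13),
          keyword_counts=(3, 6, 9, 12, 15, 18, 21)):
    results = {}
    for k in entity_counts:      # vary entities, keep keywords fixed
        results[("entities", k)] = train_and_eval(dataset, num_entities=k, num_keywords=10)
    for j in keyword_counts:     # vary keywords, keep entities fixed
        results[("keywords", j)] = train_and_eval(dataset, num_entities=5, num_keywords=j)
    best = max(results, key=results.get)   # setting with the highest accuracy
    return results, best
```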

Fig. 6 The average accuracy with different numbers of entities on the four datasets

Fig. 7 The average accuracy with different numbers of keywords on the four datasets

To comprehensively evaluate our proposed method Han-LT, we designed the above six experiments. Combined with all the obtained experimental results, they strongly demonstrate that the Han-LT method can effectively preserve and extract the semantic and structural information of long texts and achieve excellent results in semi-supervised learning tasks. Overall, Han-LT shows significant superiority in semi-supervised long text classification tasks.

5 Conclusion

This paper proposes a semi-supervised long text classification method based on a graph neural network. Aiming at the core intent expressed by the text, we construct the long text corpus from three aspects: titles, entities, and keywords. We model each text itself and link different texts together through their multi-interrelation to condense the meaning expressed by the texts while retaining their semantic structures. Then, we design a message passing framework that combines the attention mechanism and the semantic degree over the title-entity-keyword relationships to aggregate the multi-interrelation heterogeneous information graph. These designs give our model a strong ability to extract deeper semantic and structural information. Validated by extensive experiments, our method achieves remarkable results on long text classification tasks.