Introduction

Cognitive systems help humans understand textual information from the outside world and acquire the corresponding knowledge, and artificially simulating this cognitive process helps explain such cognitive phenomena [1]. Natural language processing (NLP) uses computers to understand human language, bringing machines closer to human cognitive systems. Text classification is one of the cognitively inspired methods in NLP. Text classification based on neural networks simulates human brain structure and cognitive processing [2], giving computers the ability to perform corresponding cognitive tasks. It realizes automatic abstract classification by analyzing large amounts of data within a discipline to understand abstracts comprehensively and extensively [3]. However, unlike general texts, abstracts span a variety of natural sciences. As a result, complex labels and the lack of label information make it difficult to map text features accurately onto the corresponding label space [4]. Moreover, abstracts are highly specialized, so general word vectors struggle to express their full semantic information [5]. Meanwhile, the many supplementary explanations that are unrelated to the topic introduce considerable noise, which lengthens the abstracts and scatters their features. Higher requirements are therefore placed on cognitive systems to understand contextual relevance.

Most word embeddings used by existing text classification methods are based on language models. For example, bidirectional encoder representations from transformers (BERT) is trained with unsupervised objectives on large amounts of text. Unlike earlier models, it uses a bidirectional encoding structure to enhance the generalization ability of pre-trained encoders, which has contributed greatly to text classification. Recently, Moirangthem and Lee [6] considered A Lite BERT (ALBERT), built with parameter reduction techniques, a better pre-trained model, since it significantly reduces the number of parameters and improves performance. The high degree of specialization in abstracts can easily lead to label semantics that are far from the sample semantics because of the lack of relevant knowledge [7]. To solve these issues, this paper proposes a fusion label information model that generates label semantics by integrating sample information. On this basis, label semantics and text information are fed into multi-head self-attention as two kinds of inputs, realizing feature interaction between labels and texts. This not only highlights the weight of domain-specific features but also enhances the semantic representation ability of the embeddings.

Apart from making the most of label information, the classification method itself is also critical for abstracts. Owing to the structure of abstracts in academic articles, they contain many supplementary explanations. These explanations not only introduce excessive noise that interferes with the model's mining of text information, but also increase the length of the abstracts, causing long-term dependency problems [8]. Because of this length, the local features extracted by traditional convolutional neural networks (CNN) are not comprehensive enough, and the global semantic information contained in long texts goes unused [9]. While recurrent neural networks can extract global feature information, the high proportion of noise in the texts scatters the distribution of features and easily degrades the extracted global features. Traditional feature extraction methods can no longer adapt well to classification tasks in professional fields, so a classification model tailored to highly specialized abstracts is urgently needed. In this work, we design a dual channel pooling mechanism to improve CNN. The deep semantic information channel uses max pooling to retain the most salient features of each sentence; it highlights the key content of the abstracts and prevents key information from being overwritten when the text is too long. The average pooling in the shallow semantic information channel retains the overall information of each sentence, which suits the underlying TSGRU for extracting context-related features. TSGRU adds a timescale that reintroduces filtered past features, strengthening long-term dependencies between texts and improving the model's ability to mine latent features. The AAPD dataset contains 55,840 abstracts of about 200 to 500 words each, and the WOS dataset collects abstracts from 46,985 articles published on the Web of Science; both are suitable for evaluating the model on long abstracts. The Amazon Review and Yahoo! Answers datasets have maximum lengths of 32,788 and 4000 characters, respectively, and are therefore suitable for evaluating classification performance on even longer texts.

The main contributions of this paper are as follows:

  • In terms of the pre-trained encoder model, we propose a method of fusing label information to improve the representation of abstracts. It uses a multi-attention mechanism to integrate the common information of samples as label semantics and multi-head attention to combine label and text information.

  • In terms of the text classification model, we propose a multi-granularity model to solve the problems of excessive noise and scattered features in abstracts. It introduces DCP-CNN to enhance the recognition of key features and the coverage of sequence information across the entire abstract.

  • Considering that CNN cannot effectively extract the spatial information of abstracts, TSGRU is proposed to obtain more comprehensive spatial semantic information through the timescale and to enhance noise suppression and the retention of contextual semantic features through a soft thresholding mechanism.

This paper is organized as follows. The “Related Work” section presents the review of literature. The “Research Methodology” section presents the details of the proposed model. The “Experiments and Analysis” section shows the analysis and results of experiments. The “Discussion” section discusses the results of experiments and the “Conclusion” section summarizes the paper.

Related Work

Deep neural network models have achieved great success in many natural language processing tasks. These cognitively inspired models achieve satisfactory results in text classification through optimizations in different aspects and promote the development of cognitive systems.

Word Embedding

Language models using pre-trained word embedding matrices achieve higher training speed and accuracy than those using randomly initialized embedding matrices [10]. GloVe's count-based word representation, built on global co-occurrence statistics [11], reduces computation and storage requirements. BERT is a language model [12] that combines masked language modeling, which predicts masked or replaced words, with next-sentence prediction to generate deep bidirectional language representations. ALBERT reduces the number of parameters while maintaining performance and improving parameter efficiency [13]; the specific numbers of parameters are shown in Table 1.

Table 1 Comparison of BERT and ALBERT

Compared with BERT, ALBERT has far fewer parameters under the same conditions, and its classification performance is comparable to BERT's. A language model pre-trained with ALBERT can not only understand text semantics accurately and overcome the polysemy problem that static word vectors cannot solve, but also improve the model's operating efficiency.

Text Classification

Traditional text classification methods hand-craft multiple categories of features, such as vocabulary, syntax and term frequency, and feed them into machine learning models such as support vector machines (SVM), naive Bayes and random forests [14]. However, extracting features manually requires substantial expertise, omits long-term relationships in the text corpus and makes it difficult to cope with the fast-growing body of academic articles. CNN and recurrent neural networks (RNN) have long been popular. RNN is a class of neural networks for processing sequence data, as shown in Fig. 1.

Fig. 1
figure 1

Structure of RNN

RNN treats the text as a sequence of words and learns the structure within it. However, for long texts, vanishing gradients appear when the network becomes too deep. The practice and theory of gated units have long been studied. Long short-term memory (LSTM) first applied them to the hidden layers of RNN, controlling the flow of information through a gating mechanism to mitigate vanishing gradients. It excels at processing sequence data and readily captures long-term and short-term dependencies [15]. However, it cannot attend to key information in the text and struggles to capture local features. The gated recurrent unit (GRU) is similar to LSTM but reduces the number of gating units while maintaining classification accuracy; it is therefore easier to train and greatly improves training efficiency. MTGRU builds on the GRU by increasing the proportion of past information through a timescale, which strengthens contextual relevance. The gating units of LSTM, GRU and MTGRU are shown in Fig. 2.

Fig. 2
figure 2

Gate recurrent units of LSTM, GRU and MTGRU

Variants of LSTM and GRU can obtain overall semantic information [16]. Sentiment analysis uses interactive LSTM [17] to model interactions between individuals and discover changes in each person's emotional state. Bi-directional long short-term memory (Bi-LSTM) is used to obtain the global representation of an article, combined with a multi-convolutional neural network (MCNN) to capture shallow features flexibly and an attention mechanism to capture more comprehensive key information [18]. LSTM with an attention layer [19] allows the network to select the most relevant feature for each label. A long text classification algorithm integrating a multi-feature-level attention mechanism [20] uses a bidirectional gated recurrent unit (Bi-GRU) and CNN to extract and fuse multiple features into specific target vectors. Bi-GRU with an attention mechanism and a capsule network performs better on tasks with less data while preserving correlations between words [21]. The effectiveness of timescales in neural networks has been demonstrated [22, 23]. On this basis, Moirangthem and Lee [6] proposed a hierarchical MTGRU to capture multiple compositions and enhance the network's ability to model longer text sequences. In addition, Pal et al. [24] designed two new decoding units in the GRU to speed up convergence and added a new gating unit to retain longer memory. Aote et al. [25] used a particle swarm optimization algorithm to process multiple features of the abstract and achieved good performance.

CNN has great advantages in parallel computing and can capture local correlations and extract higher-level correlations through pooling [26], which allows it to extract sentence features from a continuous context window. Kaur [27] used CNN on top of BERT to improve its performance. Rafiepour et al. [28] used several convolutional layers with different kernel sizes to preserve the correspondence between tokens and labels. Liang et al. [29] combined well-designed multi-view representation learning with data transfer to automatically extract and weight multi-granularity text representations. Ayetiran [30] used convolution operations to extract attention signals and highlight the emotional and polarity-flipping words on which the text focuses. Using character embeddings as CNN input avoids the poor performance of traditional word embeddings on low-frequency words [31]. In addition, Li et al. [32] introduced inductive learning on top of graph convolution to enhance the interpretability of text information; introducing exogenous knowledge to build the network effectively addressed existing methods' neglect of the semantic and structural information of nodes. However, using word frequency to measure the importance of words cannot reflect sequence information and is easily affected by dataset skew.

Study on Labels

In addition to text representations, label information can be leveraged to improve text classification. Hierarchical label structures have been combined with text-to-label attention so that labels participate in the text representation [33]. Label information can also exploit feedback from text representations to encode labels more informatively. Wang et al. [34] established an interaction function between labels and texts through a multilayer perceptron, and experiments proved that the information representation of labels can be effectively enhanced. However, datasets often use fixed label annotations, ignoring relationships between labels. Qian et al. [35] proposed the label-level contrastive learning (LLCL) paradigm to constrain unreasonable label distributions and capture label correlations. Wang et al. [36] designed a guide-network label strengthening strategy that uses label semantics to fine-tune the pre-trained classification model, but the model is only valid for labels with fixed semantics.

Research Methodology

This section describes the classification model in detail. The frame of the model is shown in Fig. 3.

Fig. 3
figure 3

The frame of the cognitively-inspired multi-granularity model incorporating label information (LIMG). Label-text fusion is used to improve representations of abstracts. DCP-CNN and TSGRU extract the features of different granularities. Fusion gate fuses multi-granularity features and puts them into classifier

Firstly, the label-text fusion obtains label semantics and enhances representation of abstracts with the use of label information. Secondly, the features of different granularities are extracted by DCP-CNN and TSGRU, and then the fusion gate realizes the information fusion of the two. Finally, the classifier outputs the prediction results according to the fusion features. The following sections will introduce the structure of the model in turn.

Pre-trained Encoder Layer

In this section, the fusion label information model is described in detail. Its purpose is to integrate label information into the encoding of text sequences, so that labels are more closely related to abstracts. As shown in Figs. 4 and 5, it mainly consists of a multi-attention semantic extraction layer and a multi-head attention layer.

Fig. 4
figure 4

Multi-attention semantic extraction

Fig. 5
figure 5

Multi-head attention

The multi-attention semantic extraction layer uses ALBERT to obtain word embeddings, and then puts samples into set \({S}_{i}\) according to their corresponding labels, where i represents a label, i ∈ [1, I], and I represents the number of labels. We use Eq. (1) to calculate the semantic similarity weight matrix between samples.

$${\delta }_{x}=\frac{Relu\left({d}_{x}^{\lambda }\right)}{\sum_{n=1}^{L}Relu\left({d}_{n}^{\xi }\right)}$$
(1)

where \({d}_{x}^{\lambda }\) is word embedding x from sample \(\lambda\) and \({d}_{n}^{\xi }\) is a word embedding from sample \(\xi\); \(\lambda\) and \(\xi\) are different samples from the set \({S}_{i}\). L is the length of the sample, and \(Relu\) is the activation function, chosen for its strong nonlinear fitting ability and high computational efficiency. The computed weight \({\delta }_{x}\) of x helps the attention mechanism extract common semantics.

As shown in Fig. 4, the process is as follows: the samples in \({S}_{i}\) are divided into pairs, and the attention mechanism attends to each pair to produce first-level intermediate semantics. The non-common semantics between the two samples are thereby weakened and the common semantics retained. The step is then repeated on the groups of first-level intermediate semantics to obtain higher-level intermediate semantics, until finally the semantics of label i are obtained.

The word embeddings of texts and labels obtained through the multi-attention semantic extraction layer can be expressed as \({x}_{emb}\) = {\({x}_{1}\), \({x}_{2}\), \({x}_{3}\), …, \({x}_{n}\)} and \({l}_{emb}\) = {\({l}_{1}\), \({l}_{2}\), \({l}_{3}\), …, \({l}_{c}\)}, where \({x}_{emb}\in {\mathbb{R}}^{n\times d}\), \({l}_{emb}\in {\mathbb{R}}^{c\times d}\), n is the number of words in the texts, c is the number of labels, and d is the embedding dimension.
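To make the pairwise merging concrete, the following is a minimal sketch of how the common semantics of one label set \({S}_{i}\) could be reduced to a single label vector. The exact attention form is only summarized by Eq. (1), so the mean-pooled sample representations, the normalization constant, and the function name are our assumptions rather than the authors' implementation.

```python
import torch

def pairwise_common_semantics(samples: list[torch.Tensor]) -> torch.Tensor:
    """Hierarchically merge the samples of one label set S_i into a label vector.

    `samples` is a list of (L, d) word-embedding matrices, one per sample.
    The pair weights follow the spirit of Eq. (1): ReLU activations normalized
    against the paired sample, so semantics the two samples share are kept.
    """
    # Collapse each sample to a single (d,) vector so pairs can be compared.
    reps = [s.mean(dim=0) for s in samples]
    while len(reps) > 1:
        merged = []
        for j in range(0, len(reps) - 1, 2):
            a, b = reps[j], reps[j + 1]
            w_a = torch.relu(a) / (torch.relu(b).sum() + 1e-8)  # Eq.(1)-style weight
            w_b = torch.relu(b) / (torch.relu(a).sum() + 1e-8)
            merged.append((w_a * a + w_b * b) / 2)              # intermediate semantics
        if len(reps) % 2 == 1:        # odd sample left over, carry it forward
            merged.append(reps[-1])
        reps = merged
    return reps[0]                    # final label semantics l_i
```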

To obtain a textual representation containing label information, the multi-head attention layer helps the model pay more attention to label-related words. The scaled dot product attention is as follows [37]:

$$Attention(Q,K,V)= Softmax(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}})V$$
(2)

where Q ∈ \({\mathbb{R}}^{q\times {d}_{k}}\), K ∈ \({\mathbb{R}}^{k\times {d}_{k}}\), V∈\({\mathbb{R}}^{k\times {d}_{v}}\) and we set \({d}_{k}\)=\({d}_{v}\). The definition of multi-head attention is as follows:

$$\begin{aligned}&MultiHead\left(Q,K,V\right)=Concat\left({H}_{1};\dots ;{H}_{h}\right){W}^{o}\\&\text{where}\;{H}_{i}=Attention(Q{W}_{i}^{Q},K{W}_{i}^{K},V{W}_{i}^{V})\end{aligned}$$
(3)

where \({W}^{o}\in {\mathbb{R}}^{h{d}_{h}\times {d}_{k}}\), and \({W}_{i}^{Q}\), \({W}_{i}^{K}\), \({W}_{i}^{V}\in {\mathbb{R}}^{{d}_{k}\times {d}_{h}}\). h is the number of heads and i ∈ [1, h]. The dimension of each head is \({d}_{h}={d}_{k}/h\), and Concat concatenates the outputs of the attention heads. To make the model pay more attention to the words related to the labels, we feed \({x}_{emb}\) and \({l}_{emb}\) into the multi-head attention module at the same time to get the label-attended text representation \({X}_{att}\) [33].

$${X}_{att}=MultiHead\left({x}_{emb},{l}_{emb},{l}_{emb}\right)$$
(4)

We use word embeddings as query vectors to calculate the relevance of word embeddings and labels. The word embeddings associated with labels obtain greater attention weight.
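Eq. (4) maps directly onto a standard multi-head attention call. A minimal PyTorch sketch is given below, assuming batch-first tensors and the settings reported later in the paper (128-dimensional embeddings, 2 heads); the toy sequence and label counts and the variable names are ours.

```python
import torch
import torch.nn as nn

d_model, n_heads = 128, 2                      # settings reported in this paper
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x_emb = torch.randn(1, 300, d_model)           # text word embeddings, n = 300 words
l_emb = torch.randn(1, 54, d_model)            # label embeddings, c = 54 labels (AAPD)

# Eq. (4): text tokens act as queries over the label semantics, so words related
# to a label receive larger attention weights in the returned representation.
x_att, attn_weights = attn(query=x_emb, key=l_emb, value=l_emb)
print(x_att.shape)                             # torch.Size([1, 300, 128])
```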

Sample: Select an abstract and its corresponding labels from the AAPD dataset. The text is the abstract, while stat.ME and cs.IR are its labels: stat.ME stands for the methodology subject in the domain of statistics, and cs.IR stands for the information retrieval subject in the domain of computer science. We then obtain the corresponding word embeddings \({x}_{emb}\) and \({l}_{emb}\) through the ALBERT coding layer; the dimension of the word embeddings is 128 and the number of heads is 2. With the help of the multi-attention semantic extraction layer, we obtain the weights of the words related to the label. Figure 6 visualizes the attention weights; the darker the color, the more relevant the word is to the label cs.IR.

Fig. 6
figure 6

Label-attended text encoding

Finally, we use two independent Feed Forward Networks (FFN) and residual connections to get their fused encoding. After Layer Normalization (LN), we get fusion encoding \({{\text{X}}}_{fuse}\):

$${X}_{fuse}=L{N}_{X}\left(FF{N}_{X}\left({X}_{att}\right)+{X}_{emb}\right)$$
(5)

The obtained fusion encoding \({{\text{X}}}_{fuse}\) will be classified in the multi-granularity classification model as word embeddings of the texts.
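Eq. (5) is a standard FFN, residual connection and LayerNorm block. A minimal sketch of the text-side branch follows; the FFN hidden width of 512 and the class name are assumptions, since the paper does not specify them.

```python
import torch.nn as nn

class LabelTextFusion(nn.Module):
    """Eq. (5): X_fuse = LN(FFN(X_att) + X_emb), the text-side fusion branch."""
    def __init__(self, d_model: int = 128, d_ff: int = 512):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x_att, x_emb):
        # residual connection with the original text embeddings, then LayerNorm
        return self.ln(self.ffn(x_att) + x_emb)
```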

LIMG

Dual Channel Pooling CNN

Because of the high proportion of noise in abstracts, it is difficult for a general shallow convolutional structure alone to extract long text features while eliminating the influence of noise. Therefore, the abstracts are divided into sentences according to the hierarchical structure of the text, and a dual channel pooling CNN is designed to extract both local key information and contextual sequence information at the same time. Firstly, the sentence feature vectors c are extracted by CNN; then mean-pooling and max-pooling are performed in the two channels respectively. We obtain the feature vector \({c}_{avg}\) containing the shallow semantic information of the texts and the feature vector \({c}_{max}\) containing the deep semantic information. \({c}_{avg}^{i}\) and \({c}_{max}^{i}\) are computed as follows:

$${c}_{avg}^{i}=\frac{1}{r}{\sum }_{k=1}^{r}{c}_{k}^{i}$$
(6)
$${c}_{max}^{i}=max({c}_{1}^{i},{c}_{2}^{i},\dots ,{c}_{r}^{i})$$
(7)

where \({c}_{k}^{i}\) denotes the k-th vector of sentence i when the kernel size is m, \({c}_{avg}^{i}\) denotes the vector obtained by mean-pooling over the \({c}_{k}^{i}\), and r is the number of these vectors. We then concatenate the \({c}_{avg}^{i}\) of all sentences to get \({c}_{avg}\). Similarly, replacing mean-pooling with max-pooling yields \({c}_{max}\).

The shallow semantic information focuses on the general content of the abstracts and is subsequently processed by TSGRU. The deep semantic information focuses on the key content to compensate for the key information forgotten during TSGRU extraction. CNN can also flexibly set multiple convolution filters to extract deep semantic features: features of different sizes are extracted by sliding over \({{\text{X}}}_{fuse}\) with different kernel sizes k (e.g., kernel = 1, kernel = 3, kernel = 5 in Fig. 7). Then max-pooling reduces the dimensionality of the features and extracts the more important information. On this basis, multi-head attention provides multiple subspaces to refine the distribution of attention weights, and each attention head can focus on measuring the weight of the word at the current position.

Fig. 7
figure 7

Dual channel pooling CNN

Algorithm Design of DCP-CNN

According to the above description of DCP-CNN, the following algorithm is designed. It takes as input \({{\text{X}}}_{fuse}\), into which label information has already been integrated by the pre-trained encoder layer. Each sentence extracts semantic information through two channels: the mean-pooling channel uses kernels of size 1 to retain complete sequence information, and the max-pooling channel uses kernels of different sizes to extract deep semantic information. The detailed procedure is shown in Algorithm 1.

Algorithm 1
figure a

Dual channels pooling CNN
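For reference, the following is a minimal PyTorch sketch of the dual channel pooling described above, assuming each abstract has already been split into sentences and encoded by the fusion layer. The kernel sizes follow Fig. 7, the filter count follows the experiment settings, and the multi-head attention refinement of the max channel is omitted; the class and variable names are ours.

```python
import torch
import torch.nn as nn

class DCPCNN(nn.Module):
    """Dual channel pooling over sentence-level convolutions (sketch).

    The average channel (kernel size 1) keeps the full sequence for the
    underlying TSGRU; the max channel (several kernel sizes) keeps the most
    salient features of each sentence.
    """
    def __init__(self, d_model: int = 128, n_filters: int = 128,
                 kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.avg_conv = nn.Conv1d(d_model, n_filters, kernel_size=1)
        self.max_convs = nn.ModuleList(
            nn.Conv1d(d_model, n_filters, k, padding=k // 2) for k in kernel_sizes)

    def forward(self, sentences):
        """`sentences`: list of (1, L_i, d_model) tensors, one per sentence."""
        h_mp, h_max = [], []
        for s in sentences:
            s = s.transpose(1, 2)                     # -> (1, d_model, L_i)
            c_avg = self.avg_conv(s).mean(dim=2)      # Eq. (6): mean pooling
            c_max = torch.cat([conv(s).amax(dim=2)    # Eq. (7): max pooling
                               for conv in self.max_convs], dim=1)
            h_mp.append(c_avg)
            h_max.append(c_max)
        # shallow channel feeds TSGRU as a sequence; deep channel is a summary
        return torch.stack(h_mp, dim=1), torch.stack(h_max, dim=1)
```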

The shallow semantic information \({h}_{mp}\) is extracted by the underlayer TSGRU to make up for the missing sequence information of DCP-CNN.

Timescale Shrink Gated Recurrent Units

GRU solves the problem that CNN cannot extract temporal features and alleviates vanishing gradients, but as abstracts grow longer, more and more past information disappears through the gating units. This destroys the long-term dependencies in long texts, so a variable called the timescale is added to the GRU to increase the proportion of past information and strengthen the contextual connection, yielding more comprehensive global features. To enhance the model's resistance to textual noise, a soft thresholding algorithm is introduced into the timescale. Soft thresholding is a common algorithm in signal denoising: features whose magnitude falls below the threshold are considered useless and zeroed out, while the remaining features are retained after being shrunk toward zero by the threshold. In this way, noise reduction is achieved. The formula is as follows:

$$y=\left\{\begin{array}{ll}x-\lambda , & x>\lambda \\ 0, & -\lambda \le x\le \lambda \\ x+\lambda , & x<-\lambda \end{array}\right.$$
(8)

where x and y are the input and output vectors, respectively, and \(\lambda\) is the threshold.
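A minimal implementation of Eq. (8) is shown below; PyTorch also ships this operation as torch.nn.functional.softshrink.

```python
import torch

def soft_threshold(x: torch.Tensor, lam: float) -> torch.Tensor:
    """Eq. (8): zero out small-magnitude features and shrink the rest toward zero."""
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

# equivalent built-in: torch.nn.functional.softshrink(x, lam)
```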

Essentially, the timescale adds another constant gating unit that blends the features of the current and past hidden states.

Each step of TSGRU takes \({x}_{t}\) from \({h}_{mp}\) and the previous hidden state \({h}_{t-1}\) as input to produce the hidden-layer output \({h}_{t}\). It contains a reset gate \({r}_{t}\) and an update gate \({z}_{t}\) that determine how many features of the past hidden state are retained [6], as shown in Eq. (9):

$${r}_{t}=\sigma \left({W}_{xr}{x}_{t}+{W}_{hr}{h}_{t-1}\right)$$
$${z}_{t}=\sigma \left({W}_{xz}{x}_{t}+{W}_{hz}{h}_{t-1}\right)$$
$${u}_{t}=\mathrm{tanh}\left({W}_{xu}{x}_{t}+{W}_{hu}\left({r}_{t}\odot {h}_{t-1}\right)\right)$$
$${\widetilde{h}}_{t}={z}_{t}{h}_{t-1}+\left(1-{z}_{t}\right){u}_{t}$$
(9)

where \({u}_{t}\) and \({\widetilde{h}}_{t}\) serve as candidate activation and hidden state vectors of the current gating unit. σ(⋅) and tanh(⋅) are the sigmoid and tanh activation functions. \(\odot\) denotes the Hadamard product.

The timescale gating unit is shown in Eq. (10):

$${h}_{t}={\widetilde{h}}_{t}\frac{1}{\tau }+\left(1-\frac{1}{\tau }\right){h}_{t-1}$$
(10)

The parameter τ controls the timescale of each TSGRU cell. On the one hand, a larger τ increases the contribution of the previous text sequence, so the gated unit retains more long-term dependency, which is conducive to extracting features from longer texts. On the other hand, a smaller τ makes the scale factor 1/τ larger, so the current step \({\widetilde{h}}_{t}\) receives more weight and the gating unit contains more features of the current time step. Like other weight parameters in neural networks, τ is a trainable variable optimized with the final loss.

Algorithm Design of TSGRU

Based on the above description, the algorithm of TSGRU to extract the global features of abstracts is shown in Algorithm 2.

Algorithm 2
figure b

Timescales shrink gated recurrent units

The algorithm is called at every step of the training process. All parameters are initialized before training, including the hidden state vector \({h}_{t-1}\) and the timescale parameter τ. We feed the word embedding vectors \({x}_{t}\) from the dual channel pooling CNN into the model. We then obtain \({h}_{st}\) after filtering the noise in \({h}_{t-1}\) with the soft thresholding algorithm. \({h}_{st}\) and \({x}_{t}\) pass through the reset gate and update gate to produce the candidate activation \({u}_{t}\). The timescale τ then combines \({u}_{t}\) and \({h}_{st}\) to produce the next hidden state \({h}_{t}\). The update of τ starts after a specific number of training batches. The final output \({h}_{t}\) of the algorithm is fused with the deep semantic information \({h}_{max}\) in the next section.
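Putting Eqs. (8) to (10) and the procedure above together, a minimal sketch of one TSGRU step might look as follows. The threshold value, the use of separate linear layers for the weight matrices, and the class name are our assumptions; following the algorithm description, the gates read the denoised state \({h}_{st}\) rather than the raw \({h}_{t-1}\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSGRUCell(nn.Module):
    """One timescale shrink GRU step (sketch of Eqs. (8)-(10))."""
    def __init__(self, input_size: int, hidden_size: int, lam: float = 0.1):
        super().__init__()
        self.lam = lam                                   # soft threshold (assumed value)
        self.w_xr = nn.Linear(input_size, hidden_size, bias=False)
        self.w_hr = nn.Linear(hidden_size, hidden_size, bias=False)
        self.w_xz = nn.Linear(input_size, hidden_size, bias=False)
        self.w_hz = nn.Linear(hidden_size, hidden_size, bias=False)
        self.w_xu = nn.Linear(input_size, hidden_size, bias=False)
        self.w_hu = nn.Linear(hidden_size, hidden_size, bias=False)
        self.tau = nn.Parameter(torch.tensor(1.0))       # timescale, initialized to 1.0

    def forward(self, x_t, h_prev):
        h_st = F.softshrink(h_prev, self.lam)                  # Eq. (8): denoise h_{t-1}
        r = torch.sigmoid(self.w_xr(x_t) + self.w_hr(h_st))    # reset gate
        z = torch.sigmoid(self.w_xz(x_t) + self.w_hz(h_st))    # update gate
        u = torch.tanh(self.w_xu(x_t) + self.w_hu(r * h_st))   # candidate activation
        h_tilde = z * h_st + (1 - z) * u                       # Eq. (9)
        scale = 1.0 / self.tau
        return scale * h_tilde + (1 - scale) * h_st            # Eq. (10)
```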

Fusion Gate

Since the deep semantic information \({h}_{max}\) enhanced by DCP-CNN and the global features \({h}_{t}\) extracted by TSGRU may be both complementary and redundant, we use a gating unit to fuse the two kinds of features:

$${g}_{t}=\sigma \left({W}_{g}{o}_{g}+{W}_{c}{o}_{c}+b\right)$$
(11)
$${o}_{t}={g}_{t}{o}_{g}+\left(1-{g}_{t}\right){o}_{c}$$
(12)

\({{\text{g}}}_{t}\) is the gating unit for selecting features, \({{\text{o}}}_{{\text{g}}}\) is the global feature vector extracted by TSGRU, \({{\text{o}}}_{{\text{c}}}\) is the deep semantic feature vector extracted by DCP-CNN, and \({{\text{o}}}_{t}\) is the text feature vector filtered by the gating unit. Finally, the fully connected layer and activation of the classifier output the probabilities of the labels to which the abstract belongs.
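A minimal sketch of the fusion gate in Eqs. (11) and (12) follows; the class name and the use of linear layers to realize \({W}_{g}\), \({W}_{c}\) and b are our assumptions.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Eqs. (11)-(12): gate that mixes TSGRU global features o_g with
    DCP-CNN deep semantic features o_c."""
    def __init__(self, d_feat: int):
        super().__init__()
        self.w_g = nn.Linear(d_feat, d_feat, bias=False)
        self.w_c = nn.Linear(d_feat, d_feat, bias=True)    # bias term plays the role of b

    def forward(self, o_g, o_c):
        g = torch.sigmoid(self.w_g(o_g) + self.w_c(o_c))   # Eq. (11)
        return g * o_g + (1 - g) * o_c                     # Eq. (12)
```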

Experiments and Analysis

Datasets

In order to comprehensively compare the performance between LIMG and the traditional classification models, four benchmark datasets are used to cover different text lengths and multiple classification tasks. The statistics summary of these datasets is shown in Table 2.

Table 2 The details of the text classification datasets

Arxiv Academic Paper Dataset (AAPD): Contains the abstracts of 55,840 academic articles from arXiv. Each abstract involves multiple disciplines, out of 54 in total, so each abstract has multiple labels and each label has many samples. Each abstract contains about 200 to 500 words, which makes the dataset suitable for evaluating our model.

WOS-46985: The Web of Science (WOS) dataset collects abstracts, domains and keywords from 46,985 articles published on the Web of Science. The first-level categories comprise 7 domains: computer science, psychology, mechanical engineering, electrical engineering, biochemistry, medical science and civil engineering.

Amazon Review: Comes from the Stanford Network Analysis Project (SNAP). The full dataset (Amazon F) includes 34,686,770 reviews of 2,441,053 products, with a maximum review length of 32,788 characters; reviews are rated 1–5 stars according to user satisfaction. The Amazon Review Polarity dataset (Amazon P) is a subset containing 3,600,000 training samples and 400,000 test samples across 2 sentiment polarities.

Yahoo! Answers: A topic classification corpus of questions and their related answers from Yahoo! Answers, with text lengths of up to 4000 characters. It includes 10 classes, each containing 140,000 training samples and 5000 test samples.

Experiment Settings

Word embeddings with label information are used as input in the experiments. 128 TSGRU units and 128 DCP-CNN units are used to extract features. Following the parameter settings in Yun et al. (2022), the timescale parameter τ is initialized to 1.00, and the learning rate for updating τ is set to 0.00001 so that the timescale does not change too quickly. Gradient clipping with a clipping value of 1.00 is used to prevent gradient explosion, and the learning rate is 2e-5. For regularization [4], a dropout of 0.5 is applied to LIMG to reduce overfitting. We use ALBERT to obtain word embeddings with a dimension of 128 and two heads in multi-head attention.
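For convenience, the reported hyperparameters can be collected in a single configuration object; the field names below are ours.

```python
from dataclasses import dataclass

@dataclass
class LIMGConfig:
    """Hyperparameters as reported in the experiment settings."""
    embed_dim: int = 128       # ALBERT word-embedding dimension
    attn_heads: int = 2        # heads in multi-head attention
    tsgru_units: int = 128     # TSGRU units
    dcp_cnn_units: int = 128   # DCP-CNN units
    tau_init: float = 1.0      # timescale initialization
    tau_lr: float = 1e-5       # learning rate for updating tau
    lr: float = 2e-5           # main learning rate
    grad_clip: float = 1.0     # gradient clipping value
    dropout: float = 0.5       # dropout applied to LIMG
```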

Competitor Methods

Model evaluation focuses on two aspects. The first is the pre-trained encoder: the text encoding of LIMG incorporates label information to highlight label-related word vectors, so the effectiveness of label-text fusion must be verified by comparison with word vectors without fused label information. The second is performance on long text classification: in the experiments we select strong text classification models such as Char-CNN, Attn-LSTM and MTGRU to analyze whether, given the same input, the improved GRU and CNN perform better on long texts.

Experiment I: Comparative Accuracy Analysis of Classification Models

Experiment I compares the accuracy of LIMG and the baseline models on the four datasets above. Table 3 uses accuracy as the classification metric, defined in Eq. (13). To measure the performance of the models on abstracts more comprehensively, Table 4 reports \(micro-{F}_{1}\) [33] on the two academic abstract datasets, as shown in Eq. (14). \(micro-{F}_{1}\) and \(macro-{F}_{1}\) are commonly used to evaluate multi-class tasks; \(macro-{F}_{1}\), which computes an \({F}_{1}\) value for each category, is more susceptible to unbalanced data distributions than \(micro-{F}_{1}\). We therefore choose \(micro-{F}_{1}\) to evaluate classification performance.

Table 3 Accuracy of our model against other methods on various benchmark datasets
$$Acc=\frac{TP+TN}{TP+TN+FP+FN}$$
(13)
$$micro-{F}_{1}=\frac{{\sum }_{i=1}^{c}2T{P}_{i}}{{\sum }_{i=1}^{c}\left(2T{P}_{i}+F{P}_{i}+F{N}_{i}\right)}$$
(14)
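A minimal computation of Eq. (14) from per-class counts is shown below; the function name and the toy counts are ours.

```python
def micro_f1(tp, fp, fn):
    """Eq. (14): micro-F1 from per-class TP, FP and FN counts."""
    num = 2 * sum(tp)
    den = 2 * sum(tp) + sum(fp) + sum(fn)
    return num / den if den else 0.0

# toy example with three classes
print(micro_f1(tp=[50, 30, 20], fp=[5, 10, 5], fn=[10, 5, 5]))   # ~0.833
```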

The baseline models include the traditional classification models CNN and LSTM and their variants Attn-LSTM, Char-CNN, LSTM-CNN, Bi-LSTM and MTGRU. All baseline models use the same pre-trained word embeddings with fused label information.

From Table 3, LIMG achieves higher accuracy on all datasets than MTGRU, the best-performing baseline. At the same time, LIMG achieves its maximum improvement of 2.28% on the long-text Amazon and Yah.A datasets, because the hierarchical structure obtains comprehensive features from different granularities of the texts.

Table 4 shows the \(micro-{F}_{1}\) scores of each model on the AAPD and WOS datasets. It can be seen that LIMG achieves a maximum improvement of 3.22% over MTGRU, which shows that the improved timescale can effectively filter out noise in abstracts and facilitate the extraction of fragmented, scattered features.

Table 4 Micro-F1 scores on the abstracts in AAPD and WOS datasets

The LIMG model performs best on both evaluation metrics, showing good generalization and the ability to cope with various complex long text classification tasks.

Experiment II: Comparison with Large Pre-trained Models

Experiment II compares LIMG with several state-of-the-art pre-trained models. Although some models obtain excellent language representations from large-scale corpora, they find it difficult to learn the specific meanings of labels in professional abstracts. The experiment uses \(micro-{F}_{1}\) to measure the performance of these models on AAPD and WOS; the results are shown in Table 5.

Table 5 Comparison with pre-trained models in terms of micro-F1 scores 

As shown in Table 5, the \(micro-{F}_{1}\) score of LIMG improves by up to 5.81 points compared with the other pre-trained models. By extracting the common semantics of similar samples, LIMG avoids labels lacking actual semantics, so its text representation ability is better than that of the other models. The effect is most obvious on the WOS dataset because WOS has fewer labels than AAPD: there are more homogeneous samples for the model to learn from, and fewer labels make it easier for multi-head attention to focus on the words related to each label.

In Fig. 8, the weight assignment in the attention layer is visualized, with different colors marking the parts of the abstract relevant to different labels. The results show that the multi-head attention layer captures the label-related parts of the text sequence, verifying the effectiveness of fusing label information with abstracts.

Fig. 8
figure 8

Visualization of the attention scores in multi-head attention

Experiment III: Classification Performance on Different Length of Texts

In text classification, model accuracy declines significantly as text length increases. Therefore, experiment III divides the AAPD dataset by text length and evaluates six indicators: precision, recall, F1, micro-precision, micro-recall and \(micro-{F}_{1}\), to comprehensively measure the models' performance on long texts. The experiment has three parts. The first compares GRU with and without the Timescale Shrink (TS), as shown in Fig. 9(a) and (b); the second compares classification performance before and after adding DCP-CNN, as shown in Fig. 9(c) and (d); the third compares LIMG with the best baseline model, MTGRU, at different text lengths, as shown in Fig. 9(e) and (f). The larger the plotted area, the better the model performs.

Fig. 9
figure 9

Compare classification performance based on different lengths of input. TSGRU means timescale shrink GRU and DCP-CNN means dual channel pooling CNN. a 200 words, b 400 words, c 200 words, d 400 words, e 200 words, f 400 words

As shown in Fig. 9(a) and (b), there is little difference among the indicators when TS is added on abstracts of about 200 words. When processing abstracts of about 400 words, the indicators of the model without TS decrease significantly, while those of the model with TS decrease only slightly. This indicates that TS effectively avoids the forgetting of earlier information and retains the long-term dependence of the context.

In Fig. 9(c) and (d), the model with DCP-CNN performs better, indicating that the dual channel pooling compensates for the loss of key features caused by the GRU cell's forgetting mechanism.

Figure 9(e) and (f) compares the indicators of LIMG with the optimal baseline model MTGRU. On datasets of different text lengths, LIMG outperforms MTGRU in all indicators.

Figure 10 further subdivides the text length so that we can directly observe how each model's classification accuracy changes as the text length increases. LSTM+CNN is the only model without a timescale, and its accuracy decreases the most; the timescale therefore brings the most significant improvement on long abstracts. On shorter texts, TSGRU with the soft thresholding algorithm performs similarly to the ordinary timescale GRU, but the gap between the two gradually widens as the text length increases, which further illustrates the necessity of soft thresholding for filtering text noise.

Fig. 10
figure 10

Compare classification performance on AAPD and WOS datasets. TSGRU means timescale shrink GRU and T means timescale GRU. a AAPD dataset, b WOS dataset

Experiment IV: Ablation Study

To further verify the effectiveness of the LIMG modules, experiment IV conducts ablation studies on AAPD and WOS. Ablation studies remove some components of a model or algorithm and observe how this affects performance. The experiment examines the following three components: the fusion label information model (LI), the dual channel pooling model (DCP) and the timescale shrink model (TS). Accuracy and \(micro-{F}_{1}\) scores are then calculated on the two datasets, as shown in Tables 6 and 7.

Table 6 Accuracy of LIMG on AAPD and WOS datasets
Table 7 Micro-F1 scores of LIMG on AAPD and WOS datasets

Tables 6 and 7 show that TS has the greatest impact on the overall performance of the model. TS reintroduces past features after filtering out noise, preventing the information from being overwritten by the GRU and retaining long-range dependence. DCP and LI also improve abstract classification. The LI results show that giving reasonable semantics to labels helps the model attend to label-related features, while the DCP model extracts sentence-level features through a hierarchical structure, which helps the GRU aggregate the fragmented distribution of features.

To demonstrate visually whether the features extracted by the model are beneficial to classification, the experiment uses the AAPD dataset and maps the multi-dimensional features extracted by the model onto a two-dimensional plane. We randomly selected 5 labels that are not associated with each other and packaged the abstracts belonging to these labels into a separate training set. Principal component analysis (PCA) is used to map the feature vectors to two-dimensional vectors; PCA retains most of the feature information and avoids feature loss. We visualize the feature extraction results by marking the abstracts of the 5 categories with 5 colors and evaluate the extraction according to how tightly similar features converge and how far apart heterogeneous features are, as shown in Fig. 11.
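A minimal sketch of this visualization step with scikit-learn and matplotlib follows; the function name and plotting details are our assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_feature_map(features, labels):
    """Project (N, d) feature vectors to 2-D with PCA and color points by class id,
    as in Fig. 11 (5 classes in the paper)."""
    xy = PCA(n_components=2).fit_transform(features)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10", s=8)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.tight_layout()
    plt.show()
```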

Fig. 11
figure 11

Visualize the features extracted by different models on AAPD dataset. The models extract the features of abstracts belonging to the five labels and we mark them with five colors. a DCP-CNN, b DCP-CNN+GRU, c DCP-CNN+TSGRU

In Fig. 11, TSGRU means timescale shrink GRU and DCP-CNN means dual channel pooling CNN. Figure 11(a) is the two-dimensional feature map of the dataset extracted by DCP-CNN, from which it can be seen that the boundaries between the various features are not obvious. Figure 11(b) further uses GRU to extract text information of different granularities on the basis of Fig. 11(a); the text features of different categories now show relatively clear dividing lines. Figure 11(c) adds the TS model on the basis of Fig. 11(b); compared with Fig. 11(b), the dividing lines between different features are more obvious and the clusters converge more tightly. Together, the three visual feature maps show that each part of the model contributes differently to the classification of abstracts.

Discussion

We compared the classification performance of the baseline models and our model. As shown in Tables 3 and 4, although the model is based on CNN and MTGRU [6], its performance is significantly improved. The results of experiment III show that TSGRU is particularly effective for long abstracts, because it filters out text noise while reducing the loss of information transmitted through the deep network; this is similar in purpose to the residual networks used in image recognition [38]. The method helps computers process large amounts of information when simulating cognitive systems. Tables 6 and 7 show that label vectors integrating text information also significantly improve classification performance. Assigning appropriate semantics to labels could bring improvements in other cognitive domains as well. Just as in human cognition, labels carry some unique characteristics: for example, the polarity label used in sentiment classification [2, 17] is usually an integer, and if the label contained the corresponding emotional information, the results might improve. Since abstracts are long and content-rich, it is convenient to extract the corresponding label semantics. For sparse, short-text data, the model has certain limitations, but with the help of external knowledge [32] this problem can be addressed. Our model is suitable for single-label and multi-label classification tasks with long texts, such as highly specialized abstracts and patent classification.

Conclusion

This paper addresses the problem of long text classification for abstracts. We develop a cognitively inspired multi-granularity long text classification model that integrates label information, in view of the complex domains and excessive length of abstracts. Firstly, the label information fusion model is designed to obtain the semantic information of each label and improve the semantic representation. Secondly, the dual channel pooling convolutional neural network (DCP-CNN) is proposed to solve the loss of critical information caused by the excessive length of abstracts. Finally, the shallow semantic information channel of DCP-CNN and timescale shrink gated recurrent units (TSGRU) are used to obtain global information; on the basis of timescale gated recurrent units, a soft thresholding algorithm is added to filter noise and strengthen the long-term dependence in abstracts. In the experiments, ablation studies are carried out on each part of the model. The results show that the proposed model maintains its performance as the length of abstracts gradually increases. The model makes up for the shortcomings of current classification models in the use of label semantics, and its multi-granularity feature extraction addresses both text noise and long-term dependency. As a result, computers can process large amounts of information in long abstracts, facilitating the cognitive system's understanding of academic texts. In the future, we plan to introduce external data to reduce the adverse effects of data sparseness on label information extraction and to improve the encoding of academic terminology, which will further improve the cognitive performance of the model.