1 Introduction

With the rapid development of e-commerce and social media platforms, users are generating a large volume of short texts on a daily basis, including product reviews and online forum posts, among others. This significant increase in short texts on the web has led to growing interest in the short text classification (STC) task from both industry and academia. The goal of STC is to automatically classify incoming short texts into different categories, thereby preventing users from being overwhelmed by the massive amount of raw web data. Furthermore, STC can be readily applied to a wide range of natural language processing (NLP) tasks, such as sentiment analysis, dialogue systems, and offensive language detection.

In the earlier stage, Latent Semantic Analysis (LSA) (Dumais 2004) and its extensions, such as Independent Component Analysis (ICA) (Comon 1994) and the Language Independent Semantic (LIS) kernel (Kim et al. 2014), played an important role in STC. These approaches can extract latent semantic structures while classifying short texts by combining matrix decomposition techniques with machine learning-based classification algorithms, including Naïve Bayes, K-nearest neighbors, and support vector machines (Song et al. 2014). However, these approaches are computationally expensive and rely heavily on feature engineering.

Subsequently, STC methods based on deep neural networks (DNNs) have garnered considerable attention due to the advancements in deep learning in recent years. These methods primarily employ Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and other neural network structures (Mirończuk and Protasiewicz 2018) as the backbone. The CNN is effective in extracting local features, such as N-gram features, while the RNN captures long-distance features from texts. However, despite their respective strengths in modeling locality and sequentiality, both CNNs and RNNs overlook the valuable global word co-occurrence information that encompasses non-consecutive and long-distance semantics.

More recently, the Graph Convolutional Network (GCN) has emerged as a promising approach for addressing the STC task (Linmei et al. 2019; Zhang et al. 2020; Liu et al. 2020). For example, Yao et al. (2019) treat the text classification task as node classification, where they construct a text graph consisting of word and text nodes. They then employ a GCN to learn the node embeddings via message passing and predict the labels of text nodes. Wu et al. (2012) construct a word-level graph for each document, connecting nodes within a fixed-size window. This approach enables better capture of local features and significantly reduces memory consumption. Linmei et al. (2019) propose a Heterogeneous Graph Attention Network that incorporates a double-layer attention mechanism for text classification. By utilizing a heterogeneous information network, this method can integrate various types of additional information and the relationships between them.

However, it is worth noting that the aforementioned GCN-based approaches primarily focus on texts of normal length, and few studies have investigated their effectiveness on short texts. Moreover, applying the GCN to short texts poses significant challenges. Firstly, short texts are semantically sparse and lack sufficient context (Song et al. 2014). This sparsity results in the absence of connections between word pairs that are highly correlated according to common sense. Secondly, most GCN-based methods rely solely on SoftMax or Cross-Entropy objective functions to learn an optimal representation of a given text that is most similar to its ground-truth label. These methods ignore the intra-class and inter-class geometric structures in the global semantic space, resulting in unclear classification boundaries among samples from different categories.

To address the aforementioned challenges, we propose a novel GCN-based STC method named Topic-aware Cosine Graph Convolutional Network (ToCo-GCN), which effectively mitigates the sparsity problem and fully utilizes the global geometric structures of short texts. Specifically, given an STC corpus, the ToCo-GCN first captures the latent topic distributions of its words and short texts. Meanwhile, a text graph that takes the words and short texts as nodes is constructed. Then, the ToCo-GCN regards the latent topics as virtual nodes and constructs a topic-aware text graph. Based on the topic prior, this graph directly connects word pairs within each topic cluster, alleviating the sparsity of the text graph. During the graph learning stage, to learn discriminative text embeddings, the ToCo-GCN captures the intra-class and inter-class geometric structures over the graph in a cosine space. Specifically, inspired by the literature (Wang et al. 2018), the ToCo-GCN utilizes the cosine value of the angle between text embeddings and label embeddings to measure both the inter-class and intra-class geometric structures. Minimizing this geometric constraint enforces the angles between short texts from the same category to be smaller and the angles between short texts from different categories to be larger in the cosine space. It makes short texts of the same category more compact in space while pushing short texts from different categories farther apart. By doing so, the discriminative boundaries between different categories of short texts become clearer, which effectively enhances task performance. The contributions of our work are summarized as follows:

  • We propose the ToCo-GCN, which fully exploits geometric structures of data by simultaneously considering intra-class and inter-class geometric structures in the STC. Additionally, we make use of topic information to alleviate the sparsity problem for better adapting the model to short texts.

  • We experimentally evaluate the ToCo-GCN with other state-of-the-art models on 8 STC datasets. The ToCo-GCN shows significant improvements in terms of Accuracy and Macro-F1 score compared to the baselines.

The remainder of the paper is organized as follows: In Sect. 2, related work on the STC is introduced. In Sect. 3, we introduce the ToCo-GCN in detail. The experimental results and analyses are given in Sect. 4. Finally, we conclude this paper in Sect. 5.

2 Related work

In this section, we review existing research on the STC task from two perspectives: traditional STC methods and deep learning-based STC methods.

2.1 Traditional STC methods

Earlier studies on short text classification mainly made use of statistical machine-learning techniques. For instance, a bag-of-words (BoW) model built with rare vocabulary information is proposed in the literature (Heap et al. 2017). Samant et al. (2019) classify short texts based on the Vector Space Model (VSM) with a new weighting mechanism for each word. Moreover, other feature models, such as TFIDF and n-grams, are also employed for short text classification (Yang et al. 2021; Cavnar et al. 1994). However, neither the BoW nor the VSM adequately addresses the high-dimensionality and sparsity problems inherent in short texts. Feature selection methods involving the Chi-square test (CHI), GINI index (GINI), and dictionary learning have been proposed to address the high-dimensionality problem (Liu et al. 2022). To address the sparsity problem, Li et al. (2017) enrich short text features with concepts from the external corpus Probase [17]. Alsmadi et al. [18] make use of a keyword expansion method to extend the feature space of short texts. Although these approaches alleviate these problems and outperform previous work, a performance gap remains compared with deep learning-based methods.

2.2 Deep learning-based STC methods

With the breakthrough of deep learning in the past few years, more and more text classification approaches employ deep neural networks to automatically learn semantic features and classify texts. For example, Kim (2014) proposes a multi-channel CNN-based model to classify texts. Zhang et al. (2015) propose a character-level CNN that models different levels of features, improving the accuracy of text classification. However, directly applying these frameworks to short texts performs poorly because they ignore the aforementioned problems of short texts. Hu et al. (2018) leverage a combination of a CNN and a Support Vector Machine to enhance short text classification performance. Moreover, Alam et al. (2020) represent short texts with words and entities and exploit a CNN-based model to classify them. To obtain better short text features, Yin et al. (2019) apply a character-level attention mechanism and incorporate it into a CNN-based model. In addition to these CNN-based methods, Recurrent Neural Networks and their variants have also been widely explored in short text classification (Lee and Dernoncourt 2016; Liu and Guo 2019). However, both the CNN-based and RNN-based methods fail to make use of the global word co-occurrence information in a corpus that carries non-consecutive and long-distance semantics.

More recently, Graph Neural Networks (Zhou et al. 2020), which are designed to cope with arbitrary non-Euclidean data, have been well exploited in text classification. In addition to the aforementioned textGCN and the TL-GNN (Huang et al. 2019), Zhang et al. (2020) propose TextING, which encodes each document as a single graph and inductively learns node embeddings with a double-layer GNN. Moreover, Liu et al. (2020) propose a tensor graph that merges the semantic, syntactic, and sequential graphs of a corpus. Different from these methods, Ding et al. (2020) propose HyperGAT, which involves word-word edges. However, these methods do not perform well on short texts because of the lack of context information. Thus, GCN-based models for short texts have been proposed. For example, Linmei et al. (2019) propose HGAT, which simultaneously models topics, entities, and documents, where the entities are associated with knowledge graphs. Ye et al. (2020) propose STGCN, which develops a corpus-level graph based on not only traditional text relations but also topic relations, alleviating the sparseness of short texts. However, these GCN-based approaches for the STC task fail to consider both the intra-class and inter-class geometric structures of samples in a corpus, which impedes models from learning text representations that are both representative and discriminative.

3 Methodology

Fig. 1
figure 1

The architecture of the ToCo-GCN. This method first generates topic distributions for the incoming STC corpus \(\mathcal {D}\) via the GPU-DMM and then constructs a topic-aware text graph \(\mathcal {G}_s\). Then, an N-layer GCN is employed to learn the node embeddings. Finally, the predictive results of samples are leveraged to calculate the total loss

3.1 Problem definition

We now formulate the STC task, whose training dataset contains N labeled samples \(\mathcal {D}=\{(x_{i},\textbf{y}_{i})\}_{i=1}^{N}\). The notations x and \(\textbf{y} \in {\{0,1\}}^{C}\) denote the raw short text and the one-hot category label, respectively, where C is the number of categories. The goal of our work is to train a GCN-based classifier over \({\mathcal {D}}\), enabling it to distinguish the category of a given short text.
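As a minimal illustration of this formulation (the variable names, the example text, and the assumed C = 3 are ours), one labeled sample can be represented as follows:

```python
import numpy as np

C = 3                                      # number of categories (assumed for illustration)
x_i = "battery drains fast but the screen is gorgeous"   # a raw short text (made up)
y_i = np.zeros(C, dtype=int)
y_i[1] = 1                                 # one-hot label y_i in {0,1}^C, true class index 1
sample = (x_i, y_i)                        # one element of the training set D
```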

3.2 The basic GCN

In this subsection, we introduce the basic GCN that operates directly on graph-structured data. Specifically, consider a graph \(\mathcal {G} = \{\mathcal {V},\mathcal {E}\}\), where \(\mathcal {V}=\left\{ v_1, v_2, \ldots , v_{\textrm{T}}\right\} \) denotes the set of nodes, \(\mathcal {E}\) denotes the set of edges, and \(\textrm{T}\) is the total number of nodes in the graph \(\mathcal {G}\). We use \(\textbf{U}=\left[ u_1, u_2, \ldots , u_{\textrm{T}}\right] \in \mathbb {R}^{\textrm{T} \times \textrm{d}}\) to denote the node features, where \(\textrm{d}\) is the dimension of node features. The corresponding adjacency matrix is denoted as \(\textbf{A} \in \{0,1\}^{\textrm{T} \times \textrm{T}}\), where 1/0 denotes whether the corresponding component is an edge or not. Besides, each node of the graph has a self-loop. The degree matrix \(\textbf{D}\) is a diagonal matrix with \(\textbf{D}_{ii}=\sum _{j} \textbf{A}_{ij}\). Then, for a single-layer GCN, the node features can be updated by the following equation:

$$\begin{aligned} \textbf{L}^{(1)}=\rho \left( \tilde{\textbf{A}} \textbf{U} \textbf{W}_{0}\right) \end{aligned}$$
(1)

where \(\textbf{L}^{(1)} \in \mathbb {R}^{\textrm{T} \times \textrm{k}}\) is the learned node feature matrix and \(\textrm{k}\) is the expected dimension of node features. \(\tilde{\textbf{A}}=\textbf{D}^{-\frac{1}{2}} \textbf{A} \textbf{D}^{-\frac{1}{2}}\) is the symmetrically normalized adjacency matrix of \(\textbf{A}\). \(\textbf{W}_{0}\) is the trainable parameter matrix of the GCN, and \(\rho \) is the activation function, such as ReLU. By doing this, the single-layer GCN can aggregate node features from neighbors via a first-order message-passing mechanism, learning structure-aware node features.

Furthermore, a multi-layer GCN can incorporate information from higher-order neighborhoods. The learning procedure of node features can be further formulated as:

$$\begin{aligned} \textbf{L}^{(j+1)}=\rho \left( \tilde{\textbf{A}} \textbf{L}^{(j)} \textbf{W}_{j}\right) \end{aligned}$$
(2)

where j denotes the layer index and \(\textbf{W}_{j}\) is the trainable parameter matrix of the j-th layer.
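To make the propagation rules of Eqs. (1)–(2) concrete, the following NumPy sketch (ours, not the authors' released code) normalizes a toy adjacency matrix with self-loops and applies a single GCN layer with ReLU.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization: A_tilde = D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)                        # degrees D_ii = sum_j A_ij
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_tilde, H, W):
    """One GCN layer: rho(A_tilde H W) with ReLU as rho (Eqs. 1-2)."""
    return np.maximum(A_tilde @ H @ W, 0.0)

# Toy example: 4 nodes with self-loops, d = 3 input features, k = 2 output features.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
rng = np.random.default_rng(0)
U = rng.standard_normal((4, 3))                # node features U
W0 = rng.standard_normal((3, 2))               # trainable weights W_0
L1 = gcn_layer(normalize_adjacency(A), U, W0)  # L^(1), shape (4, 2)
```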

3.3 The proposed ToCo-GCN

In this subsection, we introduce the structure and training objective of the proposed ToCo-GCN. The overall framework is shown in Fig. 1.

3.3.1 Constructing a topic-aware text graph

Given the corpus \(\mathcal {D}\), the ToCo-GCN first constructs a text graph \(\mathcal {G}_s = \{\mathcal {V}_s,\mathcal {E}_s\}\). The set of nodes \(\mathcal {V}_s=\left\{ v^{s}_1, v^{s}_2, \ldots , v^{s}_{\textrm{T}_{s}}\right\} \) consists of two parts, words and texts, where \(\textrm{T}_{s}\) denotes the total number of nodes in the graph \(\mathcal {G}_s\). The set of edges \(\mathcal {E}_s\) also contains two kinds of relations: word-to-word and word-to-text. The former is defined by the Point-wise Mutual Information (PMI) values, while the latter is defined by the TFIDF values (Yao et al. 2019). The PMI value of a given word pair \(\langle v^{s}_i,v^{s}_j\rangle \) is calculated as:

$$\begin{aligned} {\text {PMI}}(v^{s}_i, v^{s}_j)&=\log \frac{p(v^{s}_i, v^{s}_j)}{p(v^{s}_i) p(v^{s}_j)} \end{aligned}$$
(3)
$$\begin{aligned} p(v^{s}_i, v^{s}_j)&=\frac{\# Count(v^{s}_i, v^{s}_j)}{N_{w}} \end{aligned}$$
(4)
$$\begin{aligned} p(v^{s}_i)&=\frac{\# Count(v^{s}_i)}{N_{w}} \end{aligned}$$
(5)

where \(N_{w}\) denotes the total number of word nodes and \(\# Count(v^{s}_i, v^{s}_j)\) is the co-occurrence frequency of the word pair in the corpus. However, for short texts, some synonyms or highly related word pairs never co-occur within a window due to the sparsity problem. Hence, \(p(v^{s}_i, v^{s}_j)\) equals zero and the PMI value of such word pairs tends to negative infinity, so no edge is created between them. The quality of node representations might be degraded because message passing between such node pairs is unavailable in the first layer of the GCN.
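The PMI-based word-word weights of Eqs. (3)–(5) can be estimated as in the sketch below; the sliding-window counting and the normalization by the number of windows are our assumptions for illustration, not necessarily the authors' exact implementation.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window_size=20):
    """Sketch: word-word edge weights via PMI over sliding windows (Eqs. 3-5).

    `docs` is a list of tokenized short texts. Counts are normalized by the
    number of sliding windows, which is one common choice; the paper's exact
    normalizer N_w may differ.
    """
    word_count, pair_count, n_windows = Counter(), Counter(), 0
    for tokens in docs:
        windows = [tokens] if len(tokens) <= window_size else [
            tokens[i:i + window_size] for i in range(len(tokens) - window_size + 1)]
        for win in windows:
            n_windows += 1
            uniq = sorted(set(win))
            word_count.update(uniq)
            pair_count.update(combinations(uniq, 2))   # unordered word pairs
    edges = {}
    for (wi, wj), c_ij in pair_count.items():
        p_ij = c_ij / n_windows
        p_i, p_j = word_count[wi] / n_windows, word_count[wj] / n_windows
        pmi = math.log(p_ij / (p_i * p_j))
        if pmi > 0:                                    # keep only positive-PMI edges (cf. Eq. 6)
            edges[(wi, wj)] = pmi
    return edges
```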

To alleviate the sparsity of short texts, we enrich the text graph with topic information that provides latent connections between words and documents. We leverage the topic model GPU-DMM (Li et al. 2016), which derives the topic distribution of each short text and the word distribution under each topic. The latent topics are added as virtual nodes to the text graph, and word-topic and topic-text edges are constructed accordingly. Then, the adjacency matrix \(\mathbf {A}^{s}\) of the graph \(\mathcal {G}_s\) can be defined as follows:

$$\begin{aligned} \mathbf {A}^{s}_{ij}=\left\{ \begin{array}{ll} {\text {PMI}}(i, j) &{} i, j \text { are words, } {\text {PMI}}(i, j)>0 \\ \textrm{TFIDF}_{ij} &{} i \text { is a text, } j \text { is a word} \\ \textbf{R}^{(tw)}_{ij} &{} i \text { is a topic, } j \text { is a word} \\ \textbf{R}^{(tx)}_{ij} &{} i \text { is a topic, } j \text { is a text} \\ 1 &{} \text {self-loop} \\ 0 &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(6)

where \(\textbf{R}^{(tw)}_{ij}\) denotes the extra word-topic relation: it equals 1 when the j-th word is associated with the i-th topic. \(\textbf{R}^{(tx)}_{ij}\) is the topic-text relation, which is initialized by the maximum probability in the topic distribution of the j-th text. Similar to word and text nodes, latent topic nodes are also initialized with one-hot vectors. Hence, the node embedding matrix \(\textbf{X} \in \mathbb {R}^{\mathrm {T_s} \times \mathrm {T_s}}\) can be initialized as an identity matrix \(\textbf{I}\).
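The following sketch assembles the topic-aware adjacency matrix of Eq. (6); the node ordering, the input containers, the symmetrization of edges, and all helper names are our assumptions for illustration.

```python
import numpy as np
import scipy.sparse as sp

def build_topic_aware_adjacency(n_word, n_text, n_topic,
                                pmi, tfidf, word_topic, text_topic_probs):
    """Assemble A^s of Eq. (6). Node ordering (assumed): [words | texts | topics].

    pmi:              dict {(word_i, word_j): positive PMI value}
    tfidf:            dict {(text_i, word_j): TFIDF weight}
    word_topic:       dict {topic_k: set of associated word indices}  -> R^{(tw)} = 1
    text_topic_probs: array (n_text, n_topic), topic distribution     -> R^{(tx)}
    """
    n = n_word + n_text + n_topic
    A = sp.lil_matrix((n, n))
    for (i, j), v in pmi.items():                      # word-word edges
        A[i, j] = A[j, i] = v
    for (t, w), v in tfidf.items():                    # text-word edges
        A[n_word + t, w] = A[w, n_word + t] = v
    for k, words in word_topic.items():                # topic-word edges
        for w in words:
            A[n_word + n_text + k, w] = A[w, n_word + n_text + k] = 1.0
    for t in range(n_text):                            # topic-text edges
        k = int(np.argmax(text_topic_probs[t]))        # most probable topic of text t
        p = float(text_topic_probs[t, k])
        A[n_word + n_text + k, n_word + t] = A[n_word + t, n_word + n_text + k] = p
    A.setdiag(1.0)                                     # self-loops
    return A.tocsr()
```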

3.3.2 Updating node embeddings over the graph

After obtaining the adjacency matrix \(\mathbf {A}^{s}\) and the node embeddings \(\textbf{X}\), we employ a two-layer GCN to learn node embeddings over the topic-aware text graph \(\mathcal {G}_s\). The learning process can be formulated as follows:

$$\begin{aligned} \textbf{Z}^{(0)}= & {} {\text {ReLU}}\left( \tilde{\mathbf {A^s}} \textbf{X} \textbf{W}_{0}\right) \end{aligned}$$
(7)
$$\begin{aligned} \textbf{Z}^{(1)}= & {} {\text {SoftMax}}\left( \tilde{\mathbf {A^s}} \textbf{Z}^{(0)} \textbf{W}_{1}\right) \end{aligned}$$
(8)

where \(\textbf{W}_{0}\) and \(\textbf{W}_{1}\) are the parameters of the first layer and the second layer, respectively. \(\textbf{Z}^{(1)} \in \mathbb {R}^{\mathrm {T_s} \times C}\) denotes the node embeddings derived from the last GCN layer. Such a two-layer structure allows each node to receive messages from its second-order neighborhood over the graph. \({\text {ReLU}}\) and \({\text {SoftMax}}\) are the activation functions.
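A minimal PyTorch sketch of the two-layer propagation in Eqs. (7)–(8) is given below. This is our illustration, not the authors' code; since \(\textbf{X}=\textbf{I}\), the multiplication by the identity is omitted, and the default hidden dimension of 200 follows Sect. 4.1.2.

```python
import torch
import torch.nn as nn

class ToCoGCNEncoder(nn.Module):
    """Sketch of the two-layer GCN of Eqs. (7)-(8) over the topic-aware graph."""

    def __init__(self, n_nodes, hidden_dim=200, n_classes=2):
        super().__init__()
        # With X = I, the first product A_tilde X W0 reduces to A_tilde W0.
        self.W0 = nn.Parameter(torch.empty(n_nodes, hidden_dim))
        self.W1 = nn.Parameter(torch.empty(hidden_dim, n_classes))
        nn.init.xavier_uniform_(self.W0)
        nn.init.xavier_uniform_(self.W1)

    def forward(self, A_tilde):
        # A_tilde: sparse, symmetrically normalized adjacency of the topic-aware graph.
        z0 = torch.relu(torch.sparse.mm(A_tilde, self.W0))                   # Eq. (7)
        z1 = torch.softmax(torch.sparse.mm(A_tilde, z0) @ self.W1, dim=-1)   # Eq. (8)
        return z1
```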

3.3.3 Optimizing with cosine-based training objective

For optimizing the ToCo-GCN, we design a cosine-based objective function \(\mathcal {L}_{total}\) that fully considers the global geometric structures of short texts in the semantic space. The \(\mathcal {L}_{total}\) is formulated as:

$$\begin{aligned} \mathcal {L}_{total}=\mathcal {L}_{ce}+\lambda \cdot \mathcal {L}_{cos} \end{aligned}$$
(9)

where the first term \(\mathcal {L}_{ce}\) is the cross-entropy loss, which enforces the model to learn features close to the ground-truth labels. The second term \(\mathcal {L}_{cos}\) is a cosine-margin loss that models the intra-class and inter-class geometric structures of short texts in a cosine space. \(\lambda \) is a trade-off parameter that balances the two terms.

Given the predictive results of texts \(\mathbf {Z_d}=\{\textbf{z}_i\}_{i=1} ^N \subset \textbf{Z}^{(1)}\), the cross-entropy term \(\mathcal {L}_{ce}\) is calculated as follows:

$$\begin{aligned} \mathcal {L}_{ce}=-\frac{1}{N} \sum _{i=1}^{N} \sum _{j=1}^{C} \textbf{y}_{ij} \log \left( z_{ij}\right) \end{aligned}$$
(10)

where C is the number of classes. \(\textbf{y}_{ij}\) equals 1 when the j-th label is true of the i-th text, otherwise it equals 0. Minimizing the \(\mathcal {L}_{ce}\) allows the ToCo-GCN to learn representative features of short texts.

The second regularization term \(\mathcal {L}_{cos}\) is leveraged to construct both intra-class and inter-class geometric structures in cosine space. It is calculated as follows:

$$\begin{aligned} \mathcal {L}_{cos}=-\frac{1}{N} \sum _{i=1}^{N}\log \frac{e^{s\left( \cos \left( \theta _{\textbf{y}_{i}, z_{i}}\right) -m\right) }}{e^{s\left( \cos \left( \theta _{\textbf{y}_{i}, z_{i}}\right) -m\right) }+\sum _{j \ne \textbf{y}_{i}} e^{s \cos \left( \theta _{j, z_{i}}\right) }} \end{aligned}$$
(11)

where \(s\) is a scaling factor and \(\textrm{m} \ge 0\) is a cosine margin that improves the discriminative ability. \(\theta _{\textbf{y}_{i}, z_i}\) denotes the angle between the i-th text and its corresponding label \(\textbf{y}_{i}\) in the angular space, while \(\theta _{j, z_i}\) represents the angle between the i-th text and the other labels. The ToCo-GCN simultaneously enforces intra-class compactness and inter-class separation in the cosine space. When minimizing \(\mathcal {L}_{\textrm{cos}}\), the angle \(\theta _{\textbf{y}_{i}, z_i}\) between the text \(z_{i}\) and the weight vector of its ground-truth label \(\textbf{y}_{i}\) is minimized, while the angle \(\theta _{j, z_i}\) between \(z_{i}\) and the weight vector of the j-th category, where j represents any label other than \(\textbf{y}_{i}\), is maximized. The \(\cos \left( \theta _{j, z_{i}}\right) \) is calculated by:

$$\begin{aligned} \cos \left( \theta _{j, z_{i}}\right) =\frac{\textbf{q}_{j}^{T} z_i}{\left\| \textbf{q}_{j}\right\| \left\| z_i\right\| } \end{aligned}$$
(12)

where \(\textbf{q}_{j}\) denotes the weight vector of the j-th category. Moreover, we apply \(L_{2}\) normalization to remove radial variations.
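Equations (11)–(12) can be implemented compactly as a large-margin cosine loss in the spirit of CosFace (Wang et al. 2018). The sketch below is ours; the scale \(s\) and margin \(m\) values are placeholders rather than the settings used in the paper.

```python
import torch
import torch.nn.functional as F

def cosine_margin_loss(z, labels, Q, s=30.0, m=0.25):
    """Minimal sketch of the large-margin cosine loss in Eqs. (11)-(12).

    z:      (N, d) text embeddings z_i
    labels: (N,)   ground-truth class indices derived from y_i
    Q:      (C, d) class weight vectors q_j
    s, m:   scale factor and cosine margin (placeholder values)
    """
    # cos(theta_{j, z_i}) via L2-normalized dot products (Eq. 12)
    cos = F.normalize(z, dim=1) @ F.normalize(Q, dim=1).t()
    margin = torch.zeros_like(cos)
    margin[torch.arange(z.size(0)), labels] = m        # subtract m only for the true class
    # Cross-entropy over the margin-adjusted, scaled cosines reproduces Eq. (11).
    return F.cross_entropy(s * (cos - margin), labels)
```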

4 Experiments

In this section, we first introduce several publicly available short text datasets and the experimental details. Then, we introduce several state-of-the-art baselines for comparison. Finally, the experimental results and analysis are provided.

Table 1 The statistics of the STC datasets

4.1 Experimental settings

4.1.1 Datasets

We evaluate the performance of our method on the following 8 benchmarks:

  1. (1)

    R8: This dataset represents a subset of the Reuters 21578 dataset.

  2. (2)

    CR: This dataset is a customer product review dataset.

  3. (3)

    MR: This dataset is a movie review dataset.

  4. (4)

    SST-binary (SST-Bi): This dataset is the Stanford Sentiment Treebank dataset.

  5. (5)

    StackOverflow (STOW): This dataset includes selected questions and the corresponding labels posted on stackoverflow.com from July 31, 2012, to August 14, 2012.

  6. (6)

Biomedical (BIO): This dataset is a subset of the challenge data published on the BioASQ website, where 19,974 paper titles from 20 groups are randomly selected.

  7. (7)

TagMyNews: This dataset consists of titles of English news from Really Simple Syndication (RSS) feeds.

  8. (8)

    Electronics (Tayal et al. 2019, 2020): This dataset is collected from Amazon e-commerce platform.

The detailed statistics of each dataset are shown in Table 1.

4.1.2 Training details

We follow the pre-processing of textGCN to clean and tokenize texts. We remove non-English characters, stop words, and low-frequency words appearing fewer than 5 times for the seven datasets other than MR. For the MR dataset, since the texts are too short, all words are retained after the cleaning and tokenizing operations. Table 1 reports the statistics of the datasets, including the number of documents, the average number of tokens and entities, the number of classes, and the proportion of texts containing entities (in parentheses). For the ToCo-GCN, the embedding dimension of the first GCN layer is set to 200, and the window size is 20. We set the learning rate to 0.001 and the dropout rate to 0.5. The maximum number of epochs is set to 1,000 with an early stopping mechanism. Moreover, we use Adam as the optimizer following the literature (Alam et al. 2020). For baselines that take pre-trained word embeddings as input, we use 300-dimensional GloVe word embeddings (Pennington et al. 2014). We evaluate the classification performance using test accuracy (denoted as Acc) and the macro-averaged F1 score (denoted as F1).
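For reference, the hyperparameters listed above can be collected into a single configuration sketch; the values come from this subsection, while the dictionary key names are our own.

```python
# Hyperparameters reported in Sect. 4.1.2; key names are ours, not the authors'.
toco_gcn_config = {
    "gcn_layer1_dim": 200,        # embedding dimension of the first GCN layer
    "window_size": 20,            # sliding window size
    "learning_rate": 1e-3,
    "dropout": 0.5,
    "max_epochs": 1000,           # with early stopping
    "optimizer": "Adam",
    "min_word_freq": 5,           # words appearing fewer than 5 times removed (except MR)
    "pretrained_embeddings": "GloVe-300d",  # for baselines using pre-trained embeddings
}
```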

4.1.3 Baselines

To evaluate the effectiveness of the proposed ToCo-GCN, we select the following 10 well-performing STC methods as baselines:

  1. (1)

TFIDF + LR: This method uses the TFIDF as the feature of short texts and takes Logistic Regression as the classifier.

  2. (2)

    textCNN: This method is based on the Convolutional Neural Network (Kim 2014). We develop two variants of the textCNN: CNN\(_{\textrm{rand}}\) and CNN\(_{\textrm{nsta}}\), respectively. The former randomly initializes word embeddings, while the latter uses the pre-trained word embeddings.

  3. (3)

    LSTM: We develop two LSTM variants: LSTM\(_{\textrm{rand}}\) and LSTM\(_{\textrm{nsta}}\), respectively.

  4. (4)

    PV-DBOW: This method uses a paragraph vector model (Le and Mikolov 2014) as the text features and takes the Logistic Regression as the classifier.

  5. (5)

    FastText (Joulin et al. 2016): This method treats the average of word/n-grams embeddings as document embeddings and feeds such document embeddings into a linear classifier.

  6. (6)

    SWEM (Shen et al. 2018): The method applies pooling strategies over pre-trained word embeddings.

  7. (7)

    LEAM (Wang et al. 2018): This method considers the label information, which jointly learns word and label embeddings. The label information is implemented via the textual label description.

  8. (8)

    textGCN: This method forms an STC corpus into a text graph with both document and word nodes and jointly learns node representations via message passing over the graph.

  9. (9)

    TL-GNN: This method treats each document as a single graph and employs GCN to learn its representation.

  10. (10)

TG-Transformer (Zhang and Zhang 2020): This method is a novel Transformer-based heterogeneous graph neural network, which addresses the limitations of prior corpus-level graph models that struggle with large-sized corpora and ignore the heterogeneity of the text graph.

Table 2 The experimental results of all comparing methods in terms of Accuracy (Acc) and Macro-F1 (F1). The best results are represented in bold. The second-best results are underlined
Table 3 The experimental results of all comparing methods in terms of Accuracy (Acc) and Macro-F1 (F1). The best results are represented in bold. The second-best results are underlined
Fig. 2
figure 2

The performance of the ToCo-GCN in terms of Acc under different values of the trade-off parameter \(\lambda \)

Fig. 3
figure 3

The performance of the ToCo-GCN in terms of Acc under different values of the margin m

Fig. 4
figure 4

The performance of the ToCo-GCN in terms of Acc under different numbers of topics

Fig. 5
figure 5

The performance of the ToCo-GCN in terms of Acc under different dimensions

Fig. 6
figure 6

The t-SNE visualization of text embeddings obtained by the ToCo-GCN on the R8 dataset

4.2 Results and analysis

We evaluate the proposed ToCo-GCN over 8 datasets for the STC task. The results are respectively shown in Tables 2 and 3. From the results, we can draw the following observations:

  1. (1)

    Overall, the proposed ToCo-GCN outperforms all the baselines by a large margin in terms of Acc and F1 score. For example, the ToCo-GCN achieves increases of 2.8% in Acc and 2.8% in F1 score on the SST-Bi dataset. This indicates that introducing the topic information of short texts and the cosine margin-based loss function can benefit the STC task.

  2. (2)

However, the ToCo-GCN shows a slight decrease of 0.2% in F1 score compared with the TG-Transformer on the Electronics dataset. One possible reason is that the scale of this dataset is large, and the TG-Transformer has many more parameters than the ToCo-GCN. Therefore, the TG-Transformer has a better ability to learn high-quality short text representations.

  3. (3)

We observe that the graph neural network (GNN)-based methods (textGCN, TL-GNN, TG-Transformer, and the ToCo-GCN) achieve better performance than the non-GNN methods in terms of Acc and F1 score on most benchmarks. This indicates that treating the corpus as a whole graph and globally learning word and text representations over the graph is effective for the STC task.

  4. (4)

We observe that STC methods with pre-trained word embeddings, such as LSTM\(_{\textrm{nsta}}\) and CNN\(_{\textrm{nsta}}\), consistently outperform those with randomly initialized word embeddings. This indicates that pre-trained word embeddings provide rich semantic information that benefits the STC task.

  5. (5)

Moreover, we observe that the PV-DBOW method, which ignores word order, performs poorly on most datasets. This indicates that word order is important for capturing the latent semantics of short texts.

4.3 Ablation study

We further evaluate the effectiveness of the two main components of the ToCo-GCN: the topic information and the cosine margin-based loss \(\mathcal {L}_{cos}\). The ablation results are respectively shown in Tables 2 and 3. From the results, we observe that when either the topic information or the \(\mathcal {L}_{cos}\) is removed, the performance of the ToCo-GCN in terms of accuracy and F1 decreases significantly on most datasets. This indicates that introducing the topic information can efficiently shorten the semantic interaction distances between word pairs or between words and documents over the graph, improving the quality of text representations. However, we also observe that the ToCo-GCN shows increases of 0.6% and 0.5% in accuracy and F1 on the MR dataset after removing the \(\mathcal {L}_{cos}\). One possible reason is that the angle between some text pairs that do not belong to the same category is incorrectly minimized, while the angle between some pairs that belong to the same category is maximized.

4.4 Parameter sensitivity

We further explore the effect of several important parameters of the ToCo-GCN: the trade-off parameter \(\lambda \), the cosine margin m, the number of latent topics, and the dimension of embeddings.

4.4.1 Effect of the trade-off parameter \(\lambda \)

We evaluate the effect of the parameter \(\lambda \), which controls the importance of \(\mathcal {L}_{cos}\). The value of \(\lambda \) ranges over \(\left[ 10^{-6}, 10^{-2}\right] \). Figure 2 shows the variation of accuracy as \(\lambda \) increases. Based on the results, we draw the following observations:

  1. (1)

On the R8 and MR datasets, the performance of the ToCo-GCN generally shows a trend of initially increasing and then decreasing. The ToCo-GCN achieves the optimal result on the R8 dataset when \(\lambda = 10^{-4}\), while for the MR dataset the optimal value is \(\lambda = 5 \times 10^{-3}\). The reason may be that samples from different categories in the R8 dataset typically use category-specific words or phrases to describe the news. Therefore, these samples can be well classified by the ToCo-GCN even when the weight of the discriminative constraint \(\mathcal {L}_{cos}\) is small. In contrast, the MR dataset focuses on sentiment classification, and some samples may simultaneously contain both positive and negative sentiment expressions, which are difficult to distinguish even for human beings. Therefore, a larger weight on \(\mathcal {L}_{cos}\) is needed to enforce the ToCo-GCN to learn discriminative sentiment-specific features for the MR dataset.

  2. (2)

    In contrast to the above performances, the performance of the ToCo-GCN on the CR and SST-Bi datasets gradually improves as the value of \(\lambda \) increases, and the ToCo-GCN performs best when \(\lambda = 10^{-2}\) on both datasets. This indicates that only using the cross-entropy loss \(\mathcal {L}_{ce}\) to minimize the difference between individual sample predictions and ground-truth labels is insufficient on the CR and SST-Bi datasets. Therefore, the ToCo-GCN further utilizes the global information of samples in the cosine space to learn discriminative text features, effectively improving the task performance of STC.

4.4.2 Effect of the cosine margin

We evaluate the effect of the parameter m, which controls the angular margin between sample pairs in the cosine space. The value of m ranges over [0.1, 0.9]. Figure 3 shows the variation of accuracy as m increases. Based on the results, we draw the following observations:

  1. (1)

On the R8 and CR datasets, the performance of the ToCo-GCN first gradually increases to a peak and then rapidly decreases within the [0.8, 0.9] range. This upward trend indicates that the ToCo-GCN can learn discriminative text features while sufficiently preserving the specific semantic information of each text. However, the rapid decline may be because an excessively large margin m incorrectly forces some samples from different categories to be closer.

  2. (2)

    Compared to the performances on the above two datasets, the performances of the ToCo-GCN on the MR and SST-Bi datasets are more sensitive to changes in the value of m. The possible reason for this is that the distinction between samples from different categories is relatively low, resulting in less clear category decision boundaries in the cosine semantic space. Therefore, even small changes in the value of m can have a noticeable impact on the task performances.

4.4.3 Effect of the latent topics

We further analyze the impact of the number of latent topics on the performance of the ToCo-GCN across four datasets. The results are shown in Fig. 4. Overall, the performance of the ToCo-GCN varies across the four datasets, and the optimal performance on the CR, MR, and SST-Bi datasets corresponds to 10, 15, and 25 topic nodes, respectively. This suggests that appropriately introducing topic nodes can reduce the distance between semantically related but distant word pairs or word-document pairs over the text graph, effectively improving the efficiency of capturing global semantic information. However, we observe that the ToCo-GCN performs best when the number of topic nodes is set to 30 on the R8 dataset. This may be because the R8 dataset has more categories than the other three datasets, and therefore, more fine-grained topic information allows the ToCo-GCN to better capture discriminative information between different categories.

4.4.4 Effect of the embedding dimensions

We evaluate the impact of different embedding dimensions in the \(1^{st}\) GCN layer on the performance of the ToCo-GCN. The results are reported in Fig. 5. From the results, we observe that the ToCo-GCN achieves optimal results on the CR, MR, and SST-Bi datasets when the dimension is set to 250. Additionally, on these three datasets, the performance initially increases and then slowly decreases as the dimension increases. This indicates that as the dimension increases, the ToCo-GCN can capture more discriminative and rich semantics. However, excessively large dimensions may introduce unnecessary noise and hurt the performance of the STC task.

4.5 Visualization of classification results

Figure 6 shows the t-SNE (Van der Maaten and Hinton 2008) visualization of the first-layer text embeddings learned on the R8 dataset. With the increase of m, samples of the acq class and samples of the earn class maintain good intra-class aggregation as well as inter-class separation. The reason is that the number of samples of these two categories is larger compared to the other classes, hence our model is able to learn discriminative features even with smaller margins. However, for categories with only a few samples, we can observe that the boundary between category A and the other categories gradually widens as the margin increases from 0.1 to 0.35. Additionally, there is an overlap between the interest class and the money-fx class, and this issue only slightly improves as m increases from 0.1 to 0.5. We believe there are two reasons for this: firstly, the two classes are similar in terms of topics or content, and secondly, the limited number of samples hinders the model from learning distinctive features of the two classes.
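A generic recipe for producing such a t-SNE plot with scikit-learn is sketched below; the embeddings and labels are random placeholders standing in for the learned first-layer text embeddings, and this is not the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data standing in for learned text embeddings and their class labels.
rng = np.random.default_rng(0)
text_embeddings = rng.standard_normal((200, 64))   # (N, d) first-layer text embeddings
labels = rng.integers(0, 8, size=200)              # 8 classes, as in R8

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(text_embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE of text embeddings (placeholder data)")
plt.show()
```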

Table 4 Comparison of average time consumption (in seconds) on 10 runs. The running environment is on the NVIDIA A100 80 G GPU

4.6 Time consumption of model training and testing

We further compare the proposed ToCo-GCN with textGCN in terms of time consumption during the training and testing stages, as shown in Table 4. From the results, we observe that there is almost no significant difference in the time consumption per training epoch between the ToCo-GCN and textGCN. This indicates that introducing topic information and the discriminative constraint \(\mathcal {L}_{cos}\) into the ToCo-GCN may not impose a heavy computational burden. However, on the MR dataset, the overall training time of the ToCo-GCN (4.3 s) is significantly longer than that of textGCN (3.1 s). This may be because optimizing with the \(\mathcal {L}_{cos}\) slows down the convergence of the ToCo-GCN; therefore, under the early stopping mechanism, the ToCo-GCN requires more training epochs to converge.

5 Conclusion and future work

Although GCN-based text classification methods construct corpus-level graphs that contain both local and global co-occurrence relations and employ multi-layer GCNs to exploit these relations in the raw corpus to learn text embeddings based on pre-trained embeddings, they do not fully exploit the geometric structures of labeled data. In this paper, we propose a novel method for short text classification, called the Topic-aware Cosine Graph Convolutional Network (ToCo-GCN). The ToCo-GCN can not only learn representative text embeddings but also make use of the underlying intra-class and inter-class geometric structures to enhance their discriminative power. Experiments on eight benchmark datasets show that the proposed model is superior to the GCN and several competing short text classification methods. In the future, we will investigate how to further extend graph neural networks to other NLP downstream tasks, as well as how to leverage external knowledge to enhance the ability of graph learning to capture task-relevant features from a global perspective.