
1 Introduction

Text classification is a crucial aspect of Natural Language Processing (NLP) and remains an active area of research. Many researchers are working to improve the speed, accuracy, or robustness of their algorithms. Traditional text classification, however, does not account for traits that appear in numerous real-world applications, such as short text. Therefore, studies have been conducted specifically on short texts [38, 47]. From user-generated content like social media to business data like accounting records, short text covers a wide range of topics. For example, the division into goods and services (see Sect. 4.1) is an important part of the tax audit. Currently, an auditor checks whether the element descriptions match the appropriate class of good or service. Since this can be very time-consuming, it is desirable to support this process semi-automatically with the help of classifiers. Also, the subdivision into more specific classes can be useful for determining whether a given amount for an entry in the accounting records is reasonable.

Since short texts are typically only one to two sentences long, they lack context and therefore pose a challenge for text classification. In order to obtain better results, many short text classifiers operate in a transductive setup [38, 41, 43], which includes the test set during training. However, as they need to be retrained each time new data needs to be classified, such transductive models are not well suited for real-world applications. The results of both transductive and the generally more useful inductive short text classifiers are typically unsatisfactory due to the challenge that short text presents. Recent studies on short texts have emphasized specialized models [33, 36, 38, 41, 43, 47] to address the issues associated with the short text length. However, State of the Art (SOTA) text classification methods, particularly the pure use of Transformers, have remained largely unexplored in this context. In this work, we examine and test the effectiveness of these methods on short texts by means of benchmark datasets. We also introduce two new, realistic datasets in the domain of goods and services descriptions. In summary, our contributions are:

  • We provide a comparison of various modern text classification techniques. In particular, specialized short text methods are compared with the top performing traditional text classification models.

  • We introduce two new real-world datasets in the goods and services domain to cover additional dataset characteristics in a realistic use-case.

  • Transformers achieve SOTA accuracy on short text classification tasks. This calls into question the need for specialized short text classifiers.

Below, we summarize the related work. Section 3 provides a description of the models that were selected for our experiments. The experimental apparatus is described in Sect. 4. An overview of the achieved results is reported in Sect. 5. Section 6 discusses the results, before we conclude.

2 Related Work

Despite the fact that Bag of Words (BoW)-based models have long represented the cutting edge in text classification, attention has recently shifted to sequence-based and, more recently, graph-based concepts. However, BoW-based models continue to offer a solid baseline [7]. For example, fastText [12] uses the average of the trained word representations as the text representation, which is then fed into a linear classifier. This results in an efficient model for text classification. To give an overview of the various concepts, Sect. 2.1 provides various works in the field of sequence-based models, Sect. 2.2 discusses graph-based models, and Sect. 2.3 examines how these concepts are applied to short text. Finally, a summary of the findings from the related work is presented in Sect. 2.4.
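To make the fastText idea concrete, the following minimal sketch averages trained word embeddings and feeds the result into a linear classifier; the vocabulary size, embedding dimension, and class count are arbitrary placeholders, not values from [12].

```python
import torch
import torch.nn as nn

class AveragingClassifier(nn.Module):
    """Minimal fastText-style model: mean of word embeddings -> linear layer."""
    def __init__(self, vocab_size=30000, embed_dim=100, num_classes=8):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, offsets):
        # EmbeddingBag averages the embeddings of each document's tokens.
        doc_vectors = self.embedding(token_ids, offsets)
        return self.classifier(doc_vectors)

# Two toy documents packed into one flat tensor with offsets.
model = AveragingClassifier()
token_ids = torch.tensor([1, 5, 7, 2, 9])   # doc1: [1, 5, 7], doc2: [2, 9]
offsets = torch.tensor([0, 3])               # start index of each document
logits = model(token_ids, offsets)           # shape: (2, 8)
```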

2.1 Sequence-Based Models

For any NLP task, Recurrent Neural Networks (RNN) and Long short-term memory (LSTM) are frequently used and a logical choice because both models learn historical information while taking location information for all words into account [17, 23]. Since RNNs must be computed sequentially and cannot be computed in parallel, the use of Convolutional Neural Networks (CNNs) is also common [17, 34]. The text must be represented as a set of vectors that are concatenated into a matrix in order to be used by CNNs. The standard CNN convolution and pooling operations can then be applied to this matrix. TextCNN [15] uses this in combination with pretrained word embeddings for sentence-level classification tasks. While CNN-based models extract the characteristics from the convolution kernels, the relationship between the input words is captured by RNN-based models [17]. An important turning point in the advancement of NLP technologies was the introduction of Bidirectional Encoder Representations from Transformers (BERT) [35]. By performing extensive pre-training in an unsupervised manner and automatically mining semantic knowledge, BERT learns to produce contextualized word vectors that have a global semantic representation.
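The following is a minimal sketch of the TextCNN idea described above: word vectors are stacked into a matrix, convolutions with several kernel sizes are applied, and global max pooling produces a fixed-size document representation. The dimensions and filter sizes are illustrative and not necessarily those used by Kim [15].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Sketch of TextCNN: convolutions over a matrix of word vectors, max-pooled."""
    def __init__(self, vocab_size=30000, embed_dim=100, num_classes=2,
                 kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # One feature per filter via global max pooling over the time dimension.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))

logits = TextCNN()(torch.randint(0, 30000, (4, 20)))   # (4, 2)
```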

The effectiveness of BERT-like models for text classification is demonstrated by Galke and Scherp [7].

2.2 Graph-Based Models

Recently, text classification has paid a lot of attention to graph-based models, particularly Graph Neural Networks (GNNs) [3, 28, 37]. This is due to the fact that tasks with rich relational structures benefit from the powerful representation capabilities of GNNs, which preserve global structure information [37]. The task of text classification offers this rich relational structure because text can be modeled as edges and nodes in a graph structure. There are different ways to represent the documents in a graph structure, but two main approaches have emerged [37, 38]. The first approach builds a graph for each document using words as nodes and structural data, such as word co-occurrence data, as edges. However, only local structural data is used. The task is constructed as a whole graph classification problem in order to classify the text. A popular document-level approach is HyperGAT [5], which uses a dual attention mechanism and hypergraphs applied to documents to learn text embeddings. The second approach creates a graph for the entire corpus using words and documents as nodes. The text classification task is now a node classification task for the unlabeled document nodes. The drawback of this method is that models using it are inherently transductive. For example, TextGCN [42] uses this concept by employing a standard Graph Convolutional Network (GCN) on this heterogeneous graph. Following TextGCN, Lin et al. [19] propose BertGCN, a model that makes use of BERT to initialize representations for the document nodes in order to combine the benefits of both the large-scale pretraining of BERT and the transductive TextGCN. However, the increase provided by this method is limited to datasets with long average text lengths. Zeng et al. [44] also experiment with combining TextGCN and BERT in the form of TextGCN-Bert-serial-SB, a Simplified-Boosting Ensemble, where BERT is only trained on the TextGCN's misclassifications. Which model is applied to which document is determined by a heuristic based on the node degree of the test document. However, TextGCN-CNN-serial-SB, which substitutes TextCNN for BERT, yields better results. Using a joint training mechanism, TextING [46] and BERT are trained on sub-word tokens, and the final prediction combines the outputs of the two models. In contrast to applying each model separately, this produces better results. Another approach combining graph classifiers with BERT is ContTextING [11]. ContTextING utilizes a joint training mechanism to create a unified model that incorporates both document-wise contextual information from a BERT-style model and node interactions within a document through the use of a GNN module. The predictions for the text classification task are determined by combining the output from both of these modules.
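As an illustration of the corpus-level graph construction used by TextGCN-style models, the sketch below builds a graph with document and word nodes and TF-IDF-weighted document-word edges; the PMI-based word-word edges of TextGCN are omitted for brevity, and the toy corpus is made up.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cheap flights to london", "fresh organic vegetables", "flight deals and hotels"]

# Document-word edges weighted by TF-IDF (word-word PMI edges omitted for brevity).
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)                 # sparse (num_docs, num_words)
words = vectorizer.get_feature_names_out()

graph = nx.Graph()
graph.add_nodes_from((f"doc_{i}" for i in range(len(docs))), kind="document")
graph.add_nodes_from(words, kind="word")
rows, cols = tfidf.nonzero()
for d, w in zip(rows, cols):
    graph.add_edge(f"doc_{d}", words[w], weight=float(tfidf[d, w]))
```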

2.3 Short Text Models

Of course, short texts can also be classified using the methods discussed above. However, this is challenging because short texts tend to lack context and adhere to a less strict syntactic structure [38]. This has led to the emergence of specialized techniques that focus on improving the results for short text. Early works focused on sentence classification using methods like Support Vector Machines (SVM) [29]. A survey by Galke et al. [6] compared SVM and other classical methods like Naive Bayes and kNN with multi-layer perceptron models (MLP) on short text classification. Other works on sentence classification used Convolutional Neural Networks (CNN) [13, 36, 45], which showed strong performance on benchmark datasets. Recently, methods exploiting graph neural networks have also been adapted to the needs of short text. For instance, Heterogeneous Graph Attention networks (HGAT) [41] is a powerful semi-supervised short text classifier. It was the first attempt to model short texts together with additional information, such as topics gathered from Latent Dirichlet Allocation (LDA) [1] and entities retrieved from Wikipedia, in a Heterogeneous Information Network (HIN). To achieve this, a HIN embedding with a dual-level attention mechanism for nodes and their relations is used. Both the additional information and the captured relations help to counter the semantic sparsity of short text. A transductive and an inductive HGAT model were released, with the transductive model performing better on every dataset. NC-HGAT [33] extends HGAT with neighbor contrastive learning to produce a more robust variant. Neighbor contrastive learning is based on the premise that connected documents have a higher likelihood of sharing a class label and should therefore be closer in feature space. SHINE [38] also makes use of a heterogeneous graph to represent the additional information. In contrast, SHINE generates component graphs in the form of word, entity, and Part Of Speech (POS) graphs and creates a dynamically learned short document graph by employing hierarchical pooling over all component graphs. In the semi-supervised setting, SHINE outperforms HGAT as a strong transductive model. SimpleSTC (Simple Short Text Classification) [48] is a graph-based method for short text classification similar to SHINE. Instead of constructing the word graph over the data corpus itself, SimpleSTC employs a global corpus to create a reference graph that enriches and helps to interpret the short texts in the smaller corpus. Articles from Wikipedia serve as this global corpus. The authors sample 20 labeled documents per class as training and validation sets. Short-Text Graph Convolutional Networks (STGCN) [43] is an additional short text classifier. A graph of topics, documents, and unique words is the foundation of STGCN. Although the STGCN results by themselves are not particularly strong, the authors also examined the impact of pre-trained word vectors obtained from BERT. The classification of the STGCN is significantly enhanced by combining STGCN with BERT and a Bi-LSTM.

2.4 Summary

Graph neural network-based methods are widely used in short text classification. However, in recent short text research, SOTA text classification methods, particularly the pure use of Transformers, have remained largely unexplored. The majority of short text models are transductive. The crucial drawback of being transductive is that the model must be retrained every time new data needs to be classified.

3 Selected Models for Our Comparison

We begin with models for short text classification in Sect. 3.1, and then Sect. 3.2 introduces a selection of top-performing models for text classification. Following Galke and Scherp [7], we have excluded works that employ non-standard datasets only, use different measures, or are otherwise not comparable. For example, regarding short text classification, there are works that are applied on non-standard datasets only [10, 49].

3.1 Models for Short Text Classification

The models listed below either make claims about their ability to classify short texts or were designed with that specific goal. SECNN [36] is a CNN-based text classification model that was created specifically for short texts with few and insufficient semantic features. Wang et al. [36] suggested four components to address this issue. First, in order to achieve better coverage of the word vector table, they use an improved Jaro-Winkler similarity during preprocessing to identify potential spelling mistakes. Second, they use a CNN built on the attention mechanism to look for related words. Third, in order to accomplish the goal of short text semantic expansion, the external knowledge base Probase [39] is used to enhance the semantic features of short text. Finally, the classification is performed using a straightforward CNN with a softmax output layer.

The Sequential Graph Neural Network (SGNN) [47] is a GNN-based model that emphasizes the propagation of features based on sequences. By training each document as a separate graph, it is possible to learn the words' local and sequential features. Pre-trained GloVe [24] word embeddings are utilized as semantic features of the words. In order to update the feature matrix for each document graph, a Bi-LSTM is used to extract the contextual feature of each word. After that, a simplified GCN aggregates the features of neighboring word nodes. Additionally, Zhao et al. [47] introduce two variants: Extended-SGNN (ESGNN), in which the initial contextual feature of words is preserved, and C-BERT, in which the Bi-LSTM is swapped for BERT.

The Deep Attention Diffusion Graph Neural Network (DADGNN) [22] is a graph-based method that combats the oversmoothing problem of GNNs and allows stacking more layers by utilizing attention diffusion and decoupling techniques. This decoupling technique is also very advantageous for short texts because it obtains distinct hidden features in deep graph networks.

The Long short-term memory (LSTM) [9], which is frequently used in text classification, has a bidirectional variant called Bi-LSTM [20]. Due to its strong results for short texts [23, 47] and years of use as the SOTA method for many tasks, this model is a good baseline for our purpose.
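A minimal Bi-LSTM baseline of this kind can be sketched as follows; the hidden size and other dimensions are placeholders rather than the settings of [20, 47].

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Sketch of a Bi-LSTM baseline: final states of both directions -> linear layer."""
    def __init__(self, vocab_size=30000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                # (batch, seq_len)
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)     # hidden: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        doc_vector = torch.cat([hidden[0], hidden[1]], dim=1)
        return self.classifier(doc_vector)

logits = BiLSTMClassifier()(torch.randint(0, 30000, (4, 15)))   # (4, 2)
```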

3.2 Top-Performing Models for Text Classification

This section provides an overview of top-performing text classification models that excel on texts of all lengths and were not specifically created with short texts in mind. We employ the base variants of all Transformer models.

The Bidirectional Encoder Representations from Transformers (BERT) [4] is a language representation model that is based on the Transformer architecture [35]. Encoder-only models, such as BERT, rely solely on the encoder component of the Transformer architecture, whereby the text sequences are converted into rich numerical representations [34]. These models are well suited for text classification due to this representation. BERT is designed to incorporate a token’s left and right contexts into its computed representation. This is commonly referred to as bidirectional attention.
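As an illustration of how an encoder-only model is used for classification, the sketch below loads a BERT checkpoint with a classification head via the Hugging Face transformers library; the checkpoint name, label count, and example sentence are placeholders.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint and label count; adjust to the dataset at hand.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("great camera , terrible battery", return_tensors="pt",
                   truncation=True, padding=True)
logits = model(**inputs).logits              # contextual representation -> classification head
predicted_class = logits.argmax(dim=-1).item()
```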

The Robustly optimized BERT approach (RoBERTa) [21] is a systematically improved BERT adaptation. In the RoBERTa model, the pre-training strategy was changed and training was done on larger batches with more data, to increase BERT’s performance.

To improve on the BERT and RoBERTa models, Decoding-enhanced BERT with disentangled attention (DeBERTa) [8] makes two architectural adjustments. The first is the disentangled attention mechanism, which encodes the content and location of each word using two vectors. The content of the token at position i is represented by \(H_i\), and the relative position between the tokens at positions i and j is represented by \(P_{i|j}\). The cross attention score is then computed as \( A_{i,j} = H_iH_j^T + H_iP_{j|i}^T + P_{i|j}H_j^T + P_{i|j}P_{j|i}^T\). The second adjustment is an enhanced mask decoder that uses absolute positions in the decoding layer to predict masked tokens during pre-training. For masked token prediction, DeBERTa includes the absolute position after the Transformer layers but before the softmax layer. In contrast, BERT incorporates the position embedding into the input layer. As a result, DeBERTa is able to capture the relative position in all Transformer layers.
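The following toy computation spells out the four terms of the disentangled attention score for a single pair of positions i and j; the vectors are random placeholders rather than trained DeBERTa parameters.

```python
import torch

d = 8                                              # toy hidden size
H_i, H_j = torch.randn(d), torch.randn(d)          # content vectors of tokens i and j
P_i_j, P_j_i = torch.randn(d), torch.randn(d)      # relative-position vectors P_{i|j} and P_{j|i}

# A_{i,j} = H_i H_j^T + H_i P_{j|i}^T + P_{i|j} H_j^T + P_{i|j} P_{j|i}^T
score = (H_i @ H_j) + (H_i @ P_j_i) + (P_i_j @ H_j) + (P_i_j @ P_j_i)
```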

Sun et al. [32] proposed ERNIE 2.0, a continuous pre-training framework that builds and learns pre-training tasks through continuous multi-task learning. This allows the extraction of additional valuable lexical, syntactic, and semantic information in addition to co-occurring information, which is typically the focus.

The concept behind DistilBERT [26] is to leverage knowledge distillation to produce a more compact and faster version of BERT while retaining most of its language understanding capabilities. DistilBERT reduces the size of BERT by \(40\%\), is \(60\%\) faster, and still retains \(97\%\) of its language understanding capabilities. In order to accomplish this, DistilBERT optimizes the following three objectives while using the BERT model as a teacher: (1) Distillation loss: the model is trained to output probabilities equivalent to those of the BERT base model. (2) Masked Language Modeling (MLM): the common pre-training via masked language modeling, as described by Devlin et al. [4] for the BERT model, is used. (3) Cosine embedding loss: the model is trained to align the DistilBERT and BERT hidden state vectors.
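The three objectives can be sketched as follows; the temperature, the equal weighting of the losses, and the way hidden states are pooled are illustrative simplifications rather than the exact DistilBERT training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_objectives(student_logits, teacher_logits, student_hidden, teacher_hidden,
                            mlm_logits, mlm_labels, temperature=2.0):
    # (1) Distillation loss: match the teacher's softened output distribution.
    loss_distill = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # (2) Masked language modeling loss on the masked positions.
    loss_mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
    # (3) Cosine embedding loss: align student and teacher hidden-state vectors.
    target = torch.ones(student_hidden.size(0))
    loss_cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)
    return loss_distill + loss_mlm + loss_cos

# Toy shapes: batch of 4, 2 classes, hidden size 16, vocab of 100, 8 masked positions.
loss = distillation_objectives(
    torch.randn(4, 2), torch.randn(4, 2),
    torch.randn(4, 16), torch.randn(4, 16),
    torch.randn(8, 100), torch.randint(0, 100, (8,)),
)
```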

A Lite BERT (ALBERT) [16] is a Transformer that uses two parameter-reduction strategies, factorized embedding parameterization and cross-layer weight sharing, to save memory and speed up training. This model is therefore particularly effective for longer texts. During pre-training, ALBERTv2 employs MLM and Sentence-Order Prediction (SOP), which predicts the order of two consecutive text segments.

WideMLP [7] is a BoW-based Multilayer Perceptron (MLP) with a single wide hidden layer of 1,024 Rectified Linear Units (ReLUs). This model serves as a useful benchmark against which we can measure actual scientific progress.
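A minimal sketch of this architecture is shown below, assuming a Bag-of-Words input vector; the exact placement of dropout in [7] may differ.

```python
import torch.nn as nn

# Sketch of a WideMLP-style model: BoW input, one wide hidden layer of 1,024 ReLUs,
# dropout, and a linear output layer (softmax is applied by the cross-entropy loss).
def build_wide_mlp(vocab_size, num_classes, hidden=1024, dropout=0.5):
    return nn.Sequential(
        nn.Linear(vocab_size, hidden),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden, num_classes),
    )

model = build_wide_mlp(vocab_size=30000, num_classes=8)
```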

InducTive Graph Convolutional Networks for Text classification (InducT-GCN) [37] is a GCN-based method that categorically rejects any information or statistics from the test set. To achieve the inductive setup, InducT-GCN represents document vectors as a TF-IDF-weighted sum of word vectors instead of representing document nodes with one-hot vectors. A two-layer GCN is employed for training, with the first layer learning the word embeddings and the second layer, whose output dimension equals the number of classes, feeding into a softmax activation function.
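The inductive document representation described above, a TF-IDF-weighted sum of word vectors, can be sketched as follows; the word-vector matrix is a random placeholder and the toy corpus is made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cheap flights to london", "fresh organic vegetables"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs).toarray()        # (num_docs, vocab_size)

vocab_size, embed_dim = tfidf.shape[1], 64
word_vectors = np.random.rand(vocab_size, embed_dim)    # placeholder word embeddings

# Each document vector is the TF-IDF-weighted sum of its word vectors.
doc_vectors = tfidf @ word_vectors                      # (num_docs, embed_dim)
```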

4 Experimental Apparatus

4.1 Datasets

First, we describe the benchmark datasets. Second, we introduce our new datasets in the domain of goods and services. The characteristics are denoted in Table 1.

Table 1. Characteristics of short text datasets. #C refers to the number of classes. Avg. L is the average document length.

Benchmark Datasets. Six short text benchmark datasets, namely R8, MR, SearchSnippets, Twitter, TREC, and SST-2, are used in our experiments. The following gives a detailed description of them. R8 is an 8-class subset of the Reuters 21578 news dataset. With an average length of 65.72 tokens, it is not a classical short text scenario but offers the ability to compare the methods to traditional text classification. MR is a widely used dataset for text classification. It contains movie-review documents with an average length of 20.39 tokens and is therefore suitable for short text classification. The dataset SearchSnippets, which is made up of snippets returned by a search engine and has an average length of 18.10 tokens, was released by Phan et al. [25]. Twitter is a collection of 10,000 tweets that are split into the categories negative and positive based on sentiment. The length of these tweets is 11.64 tokens on average. TREC, which was introduced by Li and Roth [18], is a question type classification dataset with six question categories. It provides the shortest texts in our collection of benchmark datasets, with an average text length of 10.06 tokens. SST-2 [30], or SST-binary, is a subset of the Stanford Sentiment Treebank, a fine-grained sentiment analysis dataset, in which neutral reviews have been removed and each document has either a positive or negative label. The average number of tokens in the texts is 20.32.

Fig. 1. Class distribution of our new datasets (separated by train and test split)

Goods and Services Datasets. In order to evaluate the performance on data with real-world applications, we introduce two new datasets that are focused on the distinction between goods and services. Although there are already datasets for product classification, such as WDC-LSPM, to the best of our knowledge, our datasets are the first to combine goods and services. NICE is based on the Nice Classification of the World Intellectual Property Organization (WIPO), a classification system that divides goods and services into 45 classes. There are 11 classes for various service types and 34 categories for goods. With 9,593 documents, NICE-45 is comparable in size to the benchmark datasets. This dataset, which has texts with an average length of 3.75 tokens, is an excellent example of extremely short text. For the division into goods and services, there is also the binary version NICE-2. Short Texts Of Products and Services (STOPS) is the second dataset we offer. With 200,341 documents and an average length of 5.64 tokens, STOPS-41 is a reasonably large dataset. The dataset was derived from a potential use case in the form of Amazon descriptions and Yelp business entries, making it the most realistic. Like NICE, STOPS has a binary version, STOPS-2. Both datasets provide novel characteristic properties that the benchmark datasets do not cover. In particular, the number of fine-granular classes presents a challenge that is not addressed by common benchmarks. For details on the class distribution of these datasets, please refer to Fig. 1.

4.2 Preprocessing

To create NICE, the WIPO classification data was converted to lower case, all punctuation was removed, and side information that was enclosed in brackets was also removed. Additionally, accents were dropped. Following a random shuffle, the data was divided into \(70\%\) train and \(30\%\) test.
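A sketch of these cleaning and splitting steps is shown below; the exact regular expression for removing bracketed side information and the random seed are assumptions, not the precise script used to build NICE.

```python
import random
import re
import string
import unicodedata

def clean(text):
    text = text.lower()
    text = re.sub(r"\(.*?\)|\[.*?\]", "", text)                        # drop bracketed side information
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()  # drop accents
    return " ".join(text.split())

def train_test_split_70_30(records, seed=42):
    random.Random(seed).shuffle(records)
    cut = int(0.7 * len(records))
    return records[:cut], records[cut:]

train, test = train_test_split_70_30([clean(t) for t in ["Café chairs (metal)", "Legal services"]])
```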

As product and service entries for STOPS, we use the product descriptions of MAVE [40] and the business names of YELP. Due to the different data sources, these also had to be preprocessed differently. All classes' occurrences in the MAVE data were counted, and 5,000 sentences from each of the 20 most common classes were chosen. The multi-label categories for the YELP data were broken down into a list of single-label categories, and the sentences were then mapped to the most common single label that each one has. In order to prevent any label from taking up too much of the dataset, the data was collected such that there is a maximum of 1,200 documents per label. After that, all punctuation was dropped, the data was converted to lower case, and accents were also dropped. The data was split into train and test in a 70:30 ratio after being randomly shuffled.
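The label mapping and per-label cap for the YELP part can be sketched as follows; the record structure and example entries are hypothetical.

```python
from collections import Counter, defaultdict

# Toy multi-label YELP-style records: (business name, list of category labels).
records = [("joe's pizza", ["restaurants", "pizza"]), ("quick cuts", ["hair salons"])]
global_counts = Counter(label for _, labels in records for label in labels)

# Map each entry to its most frequent single label across the corpus.
single_label = [(text, max(labels, key=lambda lab: global_counts[lab]))
                for text, labels in records]

# Keep at most 1,200 documents per label to limit class imbalance.
kept, seen = [], defaultdict(int)
for text, label in single_label:
    if seen[label] < 1200:
        kept.append((text, label))
        seen[label] += 1
```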

4.3 Procedure

The best short text classifiers and text classification models were retrieved from the literature (see the description of the models in Sect. 3), and their accuracy scores were extracted to establish a comparison. We conducted our own experiments, in particular with various Transformers, to complement this comparison, and investigated the impact of hyperparameters on short texts. More details are provided in Sect. 4.4. In order to test the methods in novel contexts, we also created two new datasets, of which STOPS stands out due to its much larger number of documents.

4.4 Hyperparameter Optimization

Our experiments for BERT, DistilBERT, and WideMLP used the hyperparameters from Galke and Scherp [7]. The parameters for BERT and DistilBERT are a learning rate of \(5 \cdot 10^{-5}\), a batch size of 128, and fine-tuning for 10 epochs. WideMLP was trained for 100 epochs with a learning rate of \(10^{-3}\), a batch size of 16, and a dropout of 0.5. For ERNIE 2.0 and ALBERTv2, we use the SST-2 values published by Sun et al. [32] and Lan et al. [16], respectively. For our hyperparameter selection for DeBERTa and RoBERTa, we used the BERT values from Galke and Scherp [7] as a starting point and investigated the effect of smaller learning rates. This resulted in learning rates of \(2 \cdot 10^{-5}\) for DeBERTa and \(4 \cdot 10^{-5}\) for RoBERTa while maintaining the other parameters. For comparison, we followed the same procedure to create ERNIE 2.0 (optimized), which yields a learning rate of \(2.5 \cdot 10^{-5}\). The Bi-LSTM values from Zhao et al. [47] were used for both the LSTM and the Bi-LSTM model. We used DADGNN with the default parameters of 0.5 dropout, \(10^{-6}\) weight decay, and two attention heads for all datasets.
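For illustration, the Transformer settings above can be expressed as Hugging Face TrainingArguments as in the sketch below, which would then be passed to a Trainer together with a tokenized dataset; the output directory names are placeholders.

```python
from transformers import TrainingArguments

def make_args(learning_rate, output_dir):
    """Shared fine-tuning settings; batch size and epochs follow Galke and Scherp [7]."""
    return TrainingArguments(
        output_dir=output_dir,
        learning_rate=learning_rate,
        per_device_train_batch_size=128,
        num_train_epochs=10,
    )

bert_args = make_args(5e-5, "bert-short-text")        # BERT and DistilBERT
deberta_args = make_args(2e-5, "deberta-short-text")  # smaller rate found in our search
roberta_args = make_args(4e-5, "roberta-short-text")
```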

4.5 Metrics

Accuracy is used to measure short text classification performance. For multi-class cases, the subset accuracy is calculated.
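For completeness, this reduces to the fraction of exactly matching predictions, e.g. with scikit-learn on toy labels:

```python
from sklearn.metrics import accuracy_score

y_true = ["goods", "services", "goods", "goods"]     # toy gold labels
y_pred = ["goods", "services", "services", "goods"]
print(accuracy_score(y_true, y_pred))                # 0.75
```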

5 Results

Table 2. Accuracy on short text classification datasets. The “Short?” column indicates whether the model makes claims about its ability to categorize short texts. Provenance refers to the source of the accuracy scores.
Table 3. Accuracy on our own short text classification datasets. The “Short?” column indicates whether the model makes claims about its ability to categorize short texts. Provenance refers to the source of the accuracy scores.

The accuracy scores for the text classification models on the six benchmark datasets are shown in Table 2. The findings demonstrate that the relatively straightforward models LSTM, Bi-LSTM, and WideMLP provide a strong baseline across all datasets. This comparison clearly demonstrates the limitations of some models, with InducT-GCN falling short on all datasets except SearchSnippets, SECNN underperforming on TREC, and DADGNN producing weak MR results in our own experiment. The Transformer models, on the other hand, perform best across all datasets with the exception of SearchSnippets. With consistently strong performance across all datasets, DeBERTa stands out in particular. The graph-based models from Zhao et al. [47], SGNN, ESGNN, and C-BERT, all perform well on the datasets for which results are available, and ESGNN even outperforms all other models on SearchSnippets. It is important to note that Zhao et al. [47] used a modified training split and additional preprocessing. While an increase of about 5 percentage points on MR could be obtained by extending ESGNN with BERT in C-BERT, the increase is not noticeable for other datasets. When applied to short texts, the inductive models even outperform the transductive models. On Twitter, ERNIE 2.0 and ALBERTv2 reach an accuracy of \(99.97\%\), and BERT obtains \(99.4\%\) on the TREC dataset. Non-Transformer models also perform well on TREC, although Transformers outperform them. For the graph-based models SHINE and InducT-GCN, we also calculated the mean and standard deviation of the accuracy scores across 5 runs. This is motivated by the observation that models based on graph neural networks are susceptible to the initialization of the embeddings [27]. SHINE shows a generally high standard deviation of up to nearly 5 points, indicating greater variance in its performance. In comparison, InducT-GCN has a rather small standard deviation of always below 1 point.

The accuracy results for our newly introduced datasets, NICE and STOPS, are shown in Table 3. New characteristics covered by NICE and STOPS include shorter average lengths and the need to distinguish between classes at a fine-granular level in NICE-45 and STOPS-41. STOPS additionally allows investigating a much larger number of documents. The results on NICE-45 and STOPS-41 reveal that DADGNN encounters issues when dealing with more classes, falling around 20 and 60 percentage points behind the baseline models, respectively. While still performing worse than the baseline models, InducT-GCN outperforms DADGNN on all four datasets. Transformers once again demonstrate their strength and rank as the top performing models on all four datasets. There are, however, also significant drops. ERNIE 2.0 performs worse than the baseline models with \(45.55\%\) on NICE-45. However, ERNIE 2.0 (optimized), which uses different hyperparameter values (see Sect. 4.4), comes in third with \(67.65\%\).

6 Discussion

Graph-based models are computationally expensive because they require not only the creation of the graph but also its training, which can be resource- and time-intensive, especially for word-document graphs with \(\mathcal {O}(N^2)\) space [7]. On STOPS, this drawback becomes very apparent. We observed that DADGNN required roughly 30 hours of training time, while BERT only took 30 minutes to fine-tune with the same resources. Although the pre-training of BERT was already very expensive, transfer learning allows this effort to be reused for a variety of tasks. Nevertheless, the Transformers outperform the inductive graph-based models as well as the short text models, with just one exception. The best model for SearchSnippets is ESGNN, but additional preprocessing and a modified training split were employed. Our Bi-LSTM results, obtained without additional preprocessing, differ by 16.66 percentage points from the Bi-LSTM results of Zhao et al. [47]. This indicates that preprocessing, and not a better model, is primarily responsible for the strong outcomes of the SearchSnippets experiments. Another interesting observation can be made on the sentiment datasets. In comparison to other datasets, the Transformers outperform graph-based models that do not utilize a Transformer themselves by a large margin. This suggests that graph-based models may not be as effective at sentiment prediction tasks. In contrast, the CNN-based models show strong performance on the sentiment analysis task SST-2. Still, the best CNN model is more than 6 points below the best Transformer. However, it should be noted that not all Transformers are consistently excellent. For instance, on NICE-45, one can observe a lower performance with ERNIE 2.0. The absence of this performance decrease in ERNIE 2.0 (optimized) suggests that choosing suitable hyperparameters is crucial in this case.

6.1 Key Results

Our experiments unambiguously demonstrate that Transformers achieve SOTA accuracy on short text classification tasks. This raises the question of whether specialized short text techniques are necessary, given that the existing specialized models do not reach this performance. This observation is especially interesting because many of the short text models used are from 2021 [22, 36, 41, 47] or 2022 [33]. Most short text models attempt to enrich the documents with some kind of external context, such as a knowledge base or POS tags. However, one could argue that Transformers implicitly contain such context in their weights through their pre-training.

Those short text models that compare themselves to Transformers assert that they outperform them. For instance, Ye et al. [43] claim to outperform BERT by 2.2 percentage points on MR, but their fine-tuned BERT only achieves \(80.3\%\). In contrast, our own experiments show that BERT achieves \(86.94\%\). With \(85.86\%\) on MR, Zhao et al. [47] achieve better BERT results, but only to beat them by a meager \(0.2\%\) with C-BERT. Given this small margin, they would no longer outperform BERT with a marginally better selection of hyperparameters. Therefore, it is reasonable to assume that the importance of good hyperparameters for Transformers is underestimated and that they are often not properly optimized. ERNIE 2.0 (optimized), which outperforms ERNIE 2.0 on every dataset, also demonstrates the effect of better hyperparameters. Finally, the results of Zhao et al. [47] are already outperformed by other Transformers such as RoBERTa and DeBERTa by 3 and 4 points, respectively.

Additionally, there is a need for new short text datasets because the widely used benchmark datasets share many characteristics and fall short in many use cases. The common benchmark datasets all contain around 10,000 documents, distinguish only a few classes, and frequently have a similar average length. Furthermore, many of them cover the same tasks. For instance, MR, Twitter, and SST-2 all target sentiment prediction, which makes sense given how much short text is produced by social media. In this paper, we introduce NICE and STOPS, two new datasets with distinctive attributes that cover additional cases. The new characteristics investigated with these datasets produce new and intriguing findings. In particular, the need to distinguish between classes at a fine-granular level reveals the shortcomings of various models, such as DADGNN or ERNIE 2.0. NICE-45 in particular proved to be challenging for all models, making it a good benchmark for future advancements.

6.2 Threats to Validity

In our study, each experiment was generally conducted once. The rationale is the extremely low standard deviation observed for text classification tasks in previous studies [7, 22, 47]. However, it has been reported in the literature that models using graph neural networks (GNN) generally have a high standard deviation in their performance, which has been attributed, among other factors, to the influence of random initialization in the evaluation [27]. Thus, we have run our experiments for SHINE and InducT-GCN five times and report averages and standard deviations. The high standard deviation observed in SHINE's performance adds to the evidence that caution is needed when interpreting the results of GNNs [27].

We acknowledge that STOPS contains user-generated labels, some of which may not be entirely accurate. However, given that this occurs frequently in numerous use cases, it is also crucial to test the models in these scenarios.

6.3 Parameter Count of Models

Table 4 lists the parameter counts of selected Transformer models, the BoW-based baseline method WideMLP, and the graph-based methods used in our experiments. Generally, the top performing Transformer models have a similar size of between 110M and 130M parameters. Although DistilBERT is only about half that size and ALBERTv2 only about a tenth, our experiments still show comparable accuracy scores on R8, SearchSnippets, Twitter, and TREC. ALBERTv2 with its 12M parameters outperforms the WideMLP baseline with 31.3M parameters on all datasets, some by a large margin. The graph-based model ConTextING-RoBERTa has a parameter count similar to the pure Transformer models, since the RoBERTa Transformer is used internally. It is the top performer among the graph-based models on R8 and MR but cannot outperform the pure Transformer models.

Table 4. Parameter counts for selected methods used in our experiments

6.4 Generalization

As our experiments cover a range of diverse domains, including sentiment analysis on various themes (MR, SST-2, Twitter), question type classification (TREC), news (R8), and even search queries (SearchSnippets), we expect equivalent results on other short text classification datasets. Additionally, the categorization of goods and services is covered by our new datasets NICE and STOPS. They include additional features not covered by the benchmark datasets, including a significantly larger amount of training data in STOPS, a shorter average length, and the need to differentiate between a wider range of classes. By using an example derived from a business problem, STOPS specifically demonstrates how the knowledge gained here can be applied in a corporate setting.

In this work, we cover a variety of models for each architecture, particularly the most popular and best-performing ones. Our findings are consistent with the studies by Galke and Scherp [7], which demonstrate the tremendous power of Transformers for traditional text classification.

7 Conclusion and Future Work

Our experiments unequivocally demonstrate the outstanding capability of Transformers for short text classification tasks. Additional research on our newly released datasets, NICE and STOPS, supports these findings and highlights the issue of becoming overly dependent on benchmark datasets with a limited number of characteristics. In conclusion, our study raises the question of whether specialized short text techniques are required, given that the current specialized models perform worse than general-purpose Transformers.

Future research on improving the performance of Transformers on short text could involve pre-training on short texts or on in-domain texts (i.e., pre-training in the same domain as the target task) [2, 31, 34], multi-task fine-tuning [31, 34], or ensembles of multiple Transformer models [50].