1 Introduction

Sentiment classification is a challenging problem. Early research focused on lexicon-based methods [1, 2] and traditional machine learning methods [3]. However, both sentiment lexicons and feature engineering require expert knowledge, and neither considers the context of text sequences. Current deep learning-based methods achieve impressive performance on sentiment classification [4, 5]. The performance of deep models depends on large-scale annotated training data, whereas labeling large amounts of high-quality opinionated texts is laborious. Fortunately, the Internet offers large amounts of opinionated texts with user tags, such as product reviews with ratings and tweets with emojis, where the tags reflect the overall sentiment orientation of the users. However, the orientation of a user tag may be inconsistent with the true sentiment semantics of the text, e.g. a 5-star customer review may contain a negative sentence. Such noisy labels have a negative impact on the training process [6]. In essence, user-tagged data is a kind of weakly-labeled data. The key problem of this work is learning robust representations from weakly-labeled data.

Learning a robust sentiment representation benefits the sentiment classification task. Metric learning is a representation learning paradigm that aims to decrease the distance between samples from the same category while increasing the distance between samples from different categories [7]. The triplet training strategy and the contrastive training strategy are two common approaches to attaining this goal. The former employs the triplet loss function [8] to constrain the distances within sampled triples, i.e. an anchor sample, a positive sample, and a negative sample. Guan et al. [9] applied the triplet training strategy to the sentiment classification task and obtained impressive results. However, because this technique samples only one negative and one positive sample for each anchor, the possible contrast patterns are not fully captured. In comparison, the contrastive training strategy uses the NT-Xent loss function [10], which samples multiple negative and positive samples in a mini-batch, to learn rich contrast patterns. In weakly-supervised settings, low-cost user tags replace data augmentation [10, 11] for generating positive and negative samples.

For weakly-supervised scenarios, we also need to improve the anti-noise capability of the algorithm. Ghosh et al. [12] observed that the contrastive training strategy improves model robustness under label noise: supervised robust methods work remarkably well when they are initialized with a contrastive representation learning model. We offer an explanation for this observation: sampling many dissimilar pairs not only provides rich contrast patterns (i.e. positive sample vs. negative sample), but also makes noisy instances non-significant among the sampled pairs. However, this robustness depends on a large mini-batch size and an appropriate temperature hyper-parameter. To achieve robust representation learning on weakly-labeled data, an ad-hoc anti-noise technique built on contrastive learning is required. Recent work on anti-noise methods falls into two categories. The first designs robust model structures [13,14,15]. These methods assume that there is a single transition probability (Footnote 1) between the noisy label and the ground-truth label, and add a noise adaptation layer to simulate the label transition matrix of the noisy data. However, user behavior is arbitrary in weakly-supervised scenarios, resulting in chaotic tags; this assumption may therefore be inconsistent with real-world noisy labels, and the additional modules increase model complexity. The second category uses designed robust loss functions to alleviate the impact of noisy data [16, 17], but these follow the same assumption about the noisy label transition. When noisy labels are present, deep learning models eventually memorize the incorrectly assigned labels, resulting in poor generalization performance [18]. Li et al. [19] investigated the influence of noisy instances on model training and discovered that noisy data usually has a greater impact on the top layers of the model. Based on the above, we propose a novel framework called Weakly-supervised Anti-noise Contrastive Learning (WACL) for sentiment classification. The framework uses a knowledge transfer strategy: we first use contrastive learning to pre-train a deep model on a large amount of weakly-labeled data. We then apply a dropping-layer strategy to remove the top layers of the pre-trained model that are susceptible to noise. Finally, we add a classification layer on top of the remaining model and run supervised training with a small labeled dataset. Because this framework accounts for the impact of noisy data on model training, it does not rely on any assumption about the noise distribution and does not require additional structures to combat noise.

The contributions can be summarized as follows:

  1. We propose a novel framework called Weakly-supervised Anti-noise Contrastive Learning (WACL) for sentiment classification. It uses contrastive pre-training to learn robust sentiment representations for downstream tasks even on data with a high noise ratio.

  2. The proposed framework is adaptable to encoders with various deep structures.

  3. Experiments on different sentiment classification datasets show that the WACL framework significantly improves the performance of deep models, and that WACL with Bert as the encoder outperforms the other baselines.

2 Related work

2.1 Deep learning on sentiment classification

Sentiment classification, also known as opinion mining, refers to mining the sentiment tendencies of opinionated texts and classifying their attitudes. Recent deep learning-based methods have achieved remarkable performance. The multi-layer perceptron (MLP) [20] demonstrated its powerful representation ability on sentiment classification tasks. Zhang et al. [21] and Habimana et al. [22] used convolutional neural networks (CNN) to identify the sentiment polarity of texts; convolutions capture local features with sliding filters and then aggregate them into high-level global representations. Al-Smadi et al. [23] and Arunava et al. [24] used long short-term memory networks (LSTM) for sentiment classification. LSTM can learn a text’s long-term dependencies and “understand” the text’s sentiment as a whole; compared to CNN, LSTM is better suited to long-text sentiment classification [6]. Ling et al. [25] designed a CNN-LSTM network to address word polysemy. Transfer learning based on BERT produced state-of-the-art models with minimal effort on 11 downstream NLP tasks [26]. BERT uses a bidirectional transformer encoder [27], which captures the bidirectional semantics of a text sequence and allows parallel processing. The Masked Language Model (MLM) is the core pre-training objective of BERT, combining the transformer encoder with masked tokens. Because many downstream NLP tasks (e.g. natural language inference) require understanding the relationship between two sentences, which language modeling does not directly capture, BERT adds a Next Sentence Prediction (NSP) objective and uses the special token [CLS] as the first token of every sequence. Together, these techniques elevate BERT’s performance to a new level.

Although deep learning methods do not require crafting complex features and rules, their effectiveness relies on large amounts of human-labeled data. Labeling large quantities of high-quality opinionated texts is time-consuming and requires considerable human and financial effort. Besides, it is challenging to maintain consistency among different annotators. Fortunately, low-cost user-tagged data provides a wide pool of resources to supplement human-labeled data.

2.2 Weakly-supervised learning

The scale of user-tagged data is huge, but it may contain noisy-labeled instances, because unconstrained users’ labeling behaviors do not follow a common standard. Hence, the user tag is a form of weak supervision. In recent years, researchers have attempted to exploit information from user-tagged data [28] for training sentiment classifiers. Qu et al. [29] proposed using review data with ratings as weakly-labeled data to train a probabilistic model for sentence sentiment classification. Täckström and McDonald [30] proposed a sentence-level sentiment classification method based on hidden conditional random fields (HCRF) that combines review-level and sentence-level sentiment labels. Wang et al. [31] proposed a sentiment classification method based on multi-dimensional (language symbols, emoticon symbols, and punctuation symbols) and multi-level (words, sentences, and documents) modeling, where emojis are fused into the input for training the deep model. However, the above methods require feature engineering and do not consider the impact of noisy data. The work in [9] is closest to ours: Guan et al. proposed a weakly-supervised learning framework for customer review sentiment classification, adopting the triplet training strategy to learn a good sentiment embedding space. But learning on randomly sampled triples cannot adequately capture the contrast patterns between positive and negative instances. To handle this problem, we adopt contrastive learning to sample multiple instances of different categories for each anchor instance. As a result, the model can learn richer contrast patterns between positive and negative samples than triplet learning. The temperature hyper-parameter governs the degree of attention paid to hard negative instances [32]. Sampled noisy negative instances that have the same true sentiment polarity as the anchor are deemed hard negative instances. We can adjust the temperature hyper-parameter so that the model does not focus too much on the hard ones during learning, preventing the noisy data from being pushed to the wrong side. In the weakly-supervised scene, contrastive learning is thus used not only to obtain a good sentiment representation, but also to reduce the negative influence of noisy instances.

2.3 Learning from noisy data

User-tagged data is simple to collect, but we must mitigate the effect of noisy instances. Contrastive learning has a certain anti-noise ability, but good anti-noise performance depends on extremely careful manual hyper-parameter tuning and a large mini-batch size. In our work, we develop an ad-hoc anti-noise strategy to improve the algorithm’s anti-noise performance. This strategy is inspired by research on learning from noisy labels, which falls into two categories:

Designing a robust model structure. The label transition matrix is the key to these methods. It models the transition probability between the true and noisy labels, i.e., a sample with a given true label has a certain probability of being marked with a noisy label. These methods commonly add a noise adaptation layer at the top of the network to model the transition probability, which is then removed during the evaluation phase. Chen et al. [33] used the confusion matrix of all training samples as the initial weight matrix \({{\mathbf {W}}}\) of the noise adaptation layer, and then modified the model’s output to achieve label correction. A series of studies proposed to initialize \({{\mathbf {W}}}\) with the identity matrix and then add a regularizer to constrain the learning of \({{\mathbf {W}}}\) during model training [34,35,36,37]. These methods commonly make strong assumptions about the distribution of noisy labels, limiting the model’s ability to handle the complex label noise of real-world scenarios [16]. Furthermore, these methods were developed specifically for computer vision tasks, and their efficacy has not been demonstrated on NLP tasks.

Designing a robust loss function. The key idea is to design a loss function that is robust to noisy labels. The design of such loss functions requires the noisy labels to satisfy certain conditions, ensuring that a classifier trained on noisy data has the same misclassification probability as one trained on noise-free data [16, 38]. Ghosh et al. [16] proved that the mean absolute error (MAE) loss generalizes better than the categorical cross-entropy (CCE) loss, because only the MAE loss satisfies the above conditions. When dealing with complicated data, however, the MAE loss has a significant limitation in generalization performance. To address this issue, a more general noise-resilient loss, the Generalized Cross Entropy loss, was developed, combining the benefits of both MAE and CCE [17]. But the distribution of noisy instances in user-tagged data is more complicated, so it is challenging to design such a robust loss.

Deep learning models will eventually memorize wrongly assigned labels, which leads to poor generalization performance [18]. Li et al. studied how architecture affects learning with noisy labels and observed that the last few layers of a model are more negatively affected by noisy labels [19]. Inspired by this, we devise a simple dropping-layer strategy to mitigate the harmful effects of incorrectly tagged instances.

3 Method

In this section, we describe the details of WACL. The framework is depicted in Fig. 1. It is a transfer-style method that consists of three steps: contrastive pre-training, dropping-layer, and supervised fine-tuning. During the contrastive pre-training phase, we feed a weakly-labeled instance into the encoder layer to obtain a high-level representation, which is then projected to a vector of fixed dimension. We use Eq. (6) as the objective function for pre-training, because the multiple sampling of dissimilar pairs (1) utilizes the large weakly-labeled dataset more effectively, (2) pushes samples from different classes as far apart as possible, and (3) makes the noisy-labeled instances inconsequential. After the pre-training phase, we apply a dropping-layer strategy to remove the top layers of the pre-trained model, because the last few layers of the model are more negatively affected by noisy labels [19]. Finally, we add a classification layer on top of the remaining model for standard supervised training with labeled data. In the following sections, we describe the model structure and the transfer-style training strategy.

Fig. 1 WACL framework. The solid arrow represents the pre-training phase, and the dashed arrow represents the fine-tuning phase

3.1 Model structure with BERT as encoder

In this section, we introduce the model structure of WACL. We choose Bidirectional Encoder Representations from Transformers (BERT) (Footnote 2) as the text encoder because of its good performance on NLP tasks [26]. The bidirectional transformer encoder is the core structure of BERT. The transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolutions [27]; it guides the model to learn semantic patterns from different views. Bert-base has clear advantages in semantic feature extraction, long-range information capture, and syntactic feature extraction. The model structure is shown in Fig. 2; hereafter we refer to it as WACL-Bert.

Fig. 2 WACL-Bert

Input layer. An input text sequence of length t is a word sequence \(< {w_1},{w_2},{w_3},...,{w_t} >\). To fit the input format of the Bert model, we add the [CLS] and [SEP] symbols at the two ends of the word sequence, so the model input is \({{{\mathbf {s}}}_{{\mathbf {w}}}} = < CLS,{w_1},{w_2},{w_3},...,{w_t},SEP>\). Note that \({{{\mathbf {s}}}_{{\mathbf {w}}}}\) is sampled from the weakly-labeled dataset.

Encoder layer. We use the pre-trained Bert-base as the encoder. The computational details can be found in [26]. For simplicity, we use the notation \(BERT( \cdot )\) to represent its encoding computation. Input sequence is encoded as:

$$\begin{aligned} {{\mathbf {H}}} = BERT({{{\mathbf {s}}}_{{\mathbf {w}}}}), \end{aligned}$$
(1)

where \({{{\mathbf {H}}}} \in {{\mathbb {R}}^{t \times d}}\) is the feature matrix of the input instance and d = 768 is the dimension of the word vectors. We choose the vector \({{{\mathbf {h}}}} \in {{\mathbb {R}}^{d}}\) corresponding to [CLS] as the input of the projection layer. This setting follows standard practice for fine-tuning pre-trained language models for classification [26].
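For readers who want to reproduce the encoding step, the snippet below is a minimal sketch of Eq. (1) using the HuggingFace transformers library; the checkpoint name, the example sentence, and the maximum length are illustrative assumptions rather than the authors' exact setup.

```python
import torch
from transformers import BertTokenizer, BertModel

# Sketch of Eq. (1): encode one sentence with Bert-base and take the vector
# at the [CLS] position as h. The tokenizer adds [CLS]/[SEP] automatically.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

text = "the camera is great but the battery dies too fast"   # illustrative example
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=150)

with torch.no_grad():
    H = encoder(**inputs).last_hidden_state   # shape: (1, t, 768)
h = H[:, 0, :]                                # [CLS] vector, shape: (1, 768)
```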

Projection layer. On top of the encoder layer, we apply a projection to generate a fixed-size high-level representation. The projection layer consists of two linear layers and two non-linear layers. The mapping operations are computed as follows:

$$\begin{aligned}&{{{\mathbf {m}}}_{{\mathbf {1}}}} = f({{\mathbf {h}}}{{{\mathbf {W}}}_{{\mathbf {1}}}} + {{{\mathbf {b}}}_{{\mathbf {1}}}}){{{\mathbf {W}}}_{{\mathbf {2}}}} + {{{\mathbf {b}}}_{{\mathbf {2}}}}, \end{aligned}$$
(2)
$$\begin{aligned}&{{\mathbf {v}}} = f({{{\mathbf {m}}}_1}{{{\mathbf {W}}}_3} + {{{\mathbf {b}}}_3}){{{\mathbf {W}}}_4} + {{{\mathbf {b}}}_4}, \end{aligned}$$
(3)

where \({{{\mathbf {m}}}_{{\mathbf {1}}}} \in {{\mathbb {R}}^{d}}\) is the output feature vector of the first linear block, and \({{{\mathbf {W}}}_{{\mathbf {1}}}} \in {{\mathbb {R}}^{d \times c}}\) and \({{{\mathbf {W}}}_{{\mathbf {2}}}} \in {{\mathbb {R}}^{c \times d}}\) are parameter matrices. Similarly, \({{{\mathbf {v}}}} \in {{\mathbb {R}}^{n}}\) is the output of the second linear block, with parameter matrices \({{{\mathbf {W}}}_{{\mathbf {3}}}} \in {{\mathbb {R}}^{d \times n}}\) and \({{{\mathbf {W}}}_{{\mathbf {4}}}} \in {{\mathbb {R}}^{n \times n}}\). \({{{\mathbf {b}}}_1},{{{\mathbf {b}}}_2},{{{\mathbf {b}}}_3},{{{\mathbf {b}}}_4}\) are bias terms. We use Gelu as the activation function f and empirically set c = 1000 and n = 384 for the projection layer.
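A compact PyTorch sketch of the projection layer in Eqs. (2)–(3) is given below; the class name and module grouping are our own, but the dimensions follow the paper (d = 768, c = 1000, n = 384) and the activation is Gelu.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Sketch of Eqs. (2)-(3): two linear blocks with GELU activations."""
    def __init__(self, d: int = 768, c: int = 1000, n: int = 384):
        super().__init__()
        # m1 = GELU(h W1 + b1) W2 + b2
        self.block1 = nn.Sequential(nn.Linear(d, c), nn.GELU(), nn.Linear(c, d))
        # v = GELU(m1 W3 + b3) W4 + b4
        self.block2 = nn.Sequential(nn.Linear(d, n), nn.GELU(), nn.Linear(n, n))

    def forward(self, h):
        m1 = self.block1(h)       # (batch, d)
        return self.block2(m1)    # (batch, n)
```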

The above describes the computation used for pre-training. Next, we adopt a dropping-layer strategy to remove the top layers of the pre-trained model that are susceptible to noisy samples; the remaining layers form what we call the remaining model. At the fine-tuning stage, we add a classification layer on top of the remaining model and fine-tune the parameters of the whole model with labeled data. Specifically, we first extract the high-level feature vector through the remaining model:

$$\begin{aligned} {{\mathbf {r}}} = RM({{{\mathbf {s}}}_{{\mathbf {l}}}}), \end{aligned}$$
(4)

where \({{\mathbf {r}}}\in {{\mathbb {R}}^u}\) is the high-level fixed vector, and \({{{\mathbf {s}}}_{{\mathbf {l}}}} = < CLS,{w_1},{w_2},{w_3},...,{w_m},SEP>\) is a text sequence from the labeled dataset. The mapping function of the remaining model is represented by \(RM( \cdot )\). The final output layer is computed as:

$$\begin{aligned} {{\mathbf {p}}} = \sigma ({{\mathbf {r}}}{{{\mathbf {W}}}_5} + {{\mathbf {b}}}), \end{aligned}$$
(5)

where \({{{\mathbf {p}}}}\) is the model’s output. \({{{\mathbf {W}}}_5} \in {{\mathbb {R}}^{u\times C}}\) and \({{\mathbf {b}}}\) are the parameter matrix and bias term, respectively, where C is the number of categories. \(\sigma (\cdot )\) is the softmax function.

3.2 Weakly-supervised anti-noise contrastive learning

Our WACL framework employs a knowledge transfer strategy: we first pre-train the deep model on large amounts of user-tagged data, then apply a dropping-layer strategy to alleviate the negative effects of noise on the top layers of the pre-trained model, and finally add a classification layer on top of the remaining pre-trained model and fine-tune the model’s parameters with a small labeled dataset. The three steps are described in detail in the following sections.

3.2.1 Contrastive pre-training with user-tagged data

The weakly-labeled data contains rich sentiment semantic information, but it also includes noisy instances that cannot be ignored. Pre-training aims to learn a good sentiment representation from a large amount of weakly-labeled data while minimizing the negative effects of the noisy instances. Pre-training includes two steps: (1) assigning a weak label to each sample in the user-tagged dataset. In most cases, the user tags directly serve as the weak labels of the corresponding instances, e.g. a grinning emoji is a positive weak label for a tweet. In other scenarios, we still need to derive weak labels from fine-grained tags; for example, Zhao et al. [6] set 3 stars as the threshold to binarize 5-level ratings for a sentence-level customer review sentiment classification task. (2) Training a deep model with the contrastive learning strategy. The pre-training procedure aims to pull samples with the same sentiment polarity closer while keeping samples with different sentiment polarities as far apart as possible. Meanwhile, we also need to prevent the noisy samples from being grouped under incorrect labels. To this end, we use the SupCon loss proposed by Khosla et al. [39] for pre-training:

$$\begin{aligned} L = - \sum_{i = 1}^M {\frac{1}{{{M_{{y_i}}} - 1}}\sum \limits _{j = 1}^M {{l_{i \ne j}}{l_{{y_i} = {y_j}}}\ln \left[ {\frac{{\exp ({{{s_{i,j}}} / t})}}{{\exp ({{{s_{i,j}}} / t}) + \sum \nolimits _{k = 1}^M {{l_{{y_i} \ne {y_k}}}\exp ({{{s_{i,k}}} / t})} }}} \right] } } , \end{aligned}$$
(6)

where M is the mini-batch size, \({y_i}\) and \({y_j}\) are the labels of the anchor sample i and the sample j respectively, and \({M_{{y_i}}}\) is the number of samples in the mini-batch whose label is \({y_i}\). \({l_{i \ne j}}\), \({l_{{y_i} = {y_j}}}\) and \({l_{{y_i} \ne {y_k}}}\) are indicator functions taking values in \(\{0,1\}\); for instance, \({l_{i \ne j}} = 1\) if \(i\ne j\) and \({l_{i \ne j}} = 0\) otherwise. \({s_{i,j}} = \frac{{{{\mathbf {v}}}_{{\mathbf {i}}}^T{{{\mathbf {v}}}_{{\mathbf {j}}}}}}{{\left\| {{{{\mathbf {v}}}_{{\mathbf {i}}}}} \right\| \left\| {{{{\mathbf {v}}}_{{\mathbf {j}}}}} \right\| }}\) is the cosine similarity between sample i and sample j, where \({{{\mathbf {v}}}_{{\mathbf {i}}}}\) and \({{{\mathbf {v}}}_{{\mathbf {j}}}}\) are the high-level feature vectors of sample i and sample j respectively; t is the temperature hyper-parameter.

In the loss function, samples with the same weak label as the anchor sample are treated as positive samples, and samples with different weak labels as negative samples. The SupCon loss samples multiple positive and negative samples for the anchor sample \(s_{i}\), and then guides the positive samples to be close to the anchor while pushing the negative samples away. In this way, the deep model can capture rich contrast patterns and the general sentiment distribution from a large amount of weakly-labeled data.
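To make Eq. (6) concrete, the following PyTorch sketch computes the SupCon loss for one mini-batch of projected feature vectors and weak labels. It is our own illustration of the formula as written above (where each positive pair's denominator contains that pair plus all negatives of the anchor), not the authors' released code.

```python
import torch
import torch.nn.functional as F

def supcon_loss(v: torch.Tensor, y: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Sketch of Eq. (6). v: (M, n) projected features, y: (M,) weak labels."""
    v = F.normalize(v, dim=1)                     # so that v_i . v_j is cosine similarity
    sim = v @ v.t() / t                           # (M, M) matrix of s_ij / t
    M = v.size(0)

    same = y.unsqueeze(0) == y.unsqueeze(1)       # l_{y_i = y_j}
    eye = torch.eye(M, dtype=torch.bool, device=v.device)
    pos_mask = same & ~eye                        # positives of each anchor (l_{i != j})
    neg_mask = ~same                              # l_{y_i != y_k}

    exp_sim = sim.exp()
    neg_sum = (exp_sim * neg_mask).sum(dim=1, keepdim=True)   # sum_k exp(s_ik / t) over negatives
    log_prob = sim - torch.log(exp_sim + neg_sum)             # ln[exp(s_ij/t) / (exp(s_ij/t) + neg_sum)]

    # 1 / (M_{y_i} - 1) average over the positives of each anchor, then sum over anchors
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    loss_per_anchor = -(log_prob * pos_mask).sum(dim=1) / n_pos
    return loss_per_anchor.sum()
```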

In the weakly-supervised scene, we must consider the negative effect of weak instances whose noisy labels are inconsistent with the sentiment orientation of the texts. Contrastive training with the SupCon loss has a natural ability to alleviate this noise because: (1) the loss function has the property of self-discovering hard negative samples. Hard negative samples have different labels from the anchor sample but embedding features very close to the anchor embedding; the loss function penalizes hard negative samples to improve the quality of the learned representations. (2) Choosing a suitable temperature hyper-parameter controls the distance between the anchor sample and hard negative samples [32]: a small value enlarges the distance, whereas a large value shortens it. When a noisy instance is sampled as a negative sample of a “clean” anchor, it may violate the training objective and move toward the wrong category. In this case, a larger temperature value can restrict the noisy instance’s incorrect movement. However, if the temperature is excessively large, true hard negative samples and the anchor may not separate sufficiently. Hence, a suitable temperature both improves anti-noise capability and sufficiently separates the true hard negative samples. (3) Setting a large mini-batch size reduces the relative influence of sampled noisy instances, since the proportion of noisy instances in the weakly-labeled dataset is small. This setting reduces the impact of noisy data, resulting in fewer erroneous movements during training. However, a large batch is memory-consuming. Hence, we propose a simple but effective anti-noise strategy to handle this problem.

3.2.2 Dropping-layer

In the pre-training phase, by introducing the SupCon loss, the model learns rich sentiment information and contrast patterns from a large amount of weakly-labeled data. Li et al. [19] found that noisy instances have a large negative impact on the last few layers of the model. Despite SupCon’s intrinsic anti-noise capacity, it cannot fully deal with this issue. As a consequence, we devise a dropping-layer strategy whose key idea is to drop the top layers that are susceptible to noisy samples. The bottom features of deep neural networks have excellent generalization ability [40], while noisy instances mainly impact the top layers [19]. Considering that the properties of the bottom and top features differ, we first divide the pre-trained model into an upper part and a lower part from the middle, and then devise two dropping strategies to evaluate which layers should be removed: one discards layers incrementally from the last layer (i.e. top-to-bottom), and the other discards layers incrementally from the first layer (i.e. bottom-to-top). Figure 3 depicts only the top-to-bottom strategy, because the two dropping processes are mirror images of each other.

Fig. 3 The dropping-layer strategy, using the Bert-base encoder as an example. Layers 1–12 form the encoder and layers 13–14 are the projection structure. The strategy includes two steps: (1) we divide the whole network into a lower and an upper part, using the middle layer as the boundary; in this network the boundary is the 7th layer, with layers 8–14 belonging to the upper part and layers 1–7 to the lower part. (2) We attempt to drop the appropriate layers of the upper part. D14 denotes discarding the 14th layer, D13–14 the 13th to 14th layers, and so on. We evaluate the dropping-layer strategy by removing different top layers; WACL-Bert achieves its best classification performance on the Amazon dataset when the 12th to 14th layers are removed. We investigate the dropping-layer strategy in Sect. 4.5

The dropping-layer technique not only mitigates the negative impact of noisy data on the top layers of the deep model, but also reduces the model’s parameter scale. The investigation of the dropping-layer strategy provides empirical guidance: we only need to remove layers incrementally from the last layer to find the best dropping configuration (see Sect. 4.5).
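As a concrete illustration of the dropping-layer step, the sketch below truncates a Bert-base encoder assuming the HuggingFace BertModel layout (encoder.layer); the helper name and the default of keeping 11 encoder layers (the D12–14 setting, where the projection head is simply not reused at fine-tuning time) are our assumptions.

```python
import torch.nn as nn
from transformers import BertModel

def drop_top_layers(encoder: BertModel, keep: int = 11) -> BertModel:
    """Sketch of the dropping-layer step for a 12-layer Bert-base encoder.

    keep=11 corresponds to the D12-14 setting: the 12th transformer layer is
    discarded here, and the two projection layers (13-14) are not carried over
    to fine-tuning.
    """
    encoder.encoder.layer = nn.ModuleList(encoder.encoder.layer[:keep])
    encoder.config.num_hidden_layers = keep   # keep the config consistent
    return encoder
```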

3.2.3 Supervised fine-tuning phase

We use the dropping-layer technique to remove the layers that are sensitive to noisy data while keeping the bottom pre-trained component, which has learned rich sentiment patterns from vast amounts of weakly-labeled data. Then, on top of the remaining model, we add a classification layer and use labeled data to fine-tune the parameters of the entire network in a standard supervised training paradigm. We use cross-entropy as the loss function:

$$\begin{aligned} {L_{ce}} = - \frac{1}{M}\sum \limits _{i = 1}^M {\sum \limits _{j = 1}^C {{y_{ij}}} } \ln {p_{ij}}, \end{aligned}$$
(7)

where M and C are the size of a mini-batch and the number of categories, respectively. \({y_{ij}} \in \left\{ {0,1} \right\}\) equals 1 if the category of the i-th sample is j and 0 otherwise, and \({{p_{ij}}}\) is the predicted probability that the i-th sample belongs to class j.
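The fine-tuning stage can be sketched as follows, assuming the truncated encoder from the previous sketch as the remaining model. The class name, hidden size, and the use of the [CLS] vector are illustrative assumptions consistent with Sect. 3.1, and the softmax of Eq. (5) is folded into PyTorch's cross-entropy loss (Eq. (7)).

```python
import torch.nn as nn

class WACLClassifier(nn.Module):
    """Sketch of the fine-tuning model: remaining (truncated) pre-trained
    encoder plus a fresh classification layer as in Eq. (5)."""
    def __init__(self, remaining_model: nn.Module, hidden: int = 768, num_classes: int = 2):
        super().__init__()
        self.rm = remaining_model                  # truncated BertModel
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        # [CLS] vector of the remaining model as the high-level feature r
        r = self.rm(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.classifier(r)                  # logits

# Eq. (7): standard cross-entropy on the small labeled set; the softmax of
# Eq. (5) is applied implicitly inside nn.CrossEntropyLoss.
# criterion = nn.CrossEntropyLoss()
# loss = criterion(model(input_ids, attention_mask), labels)
```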

3.3 The behavior of SupCon loss for noisy data

As described in Sect. 3.2.1, adjusting the temperature parameter of the SupCon loss can mitigate the negative effects of noisy samples in some cases [32]: (1) as the temperature decreases, the loss imposes a greater penalty on negative samples with high similarity to the anchor sample (hard negative samples) to force them apart; (2) as the temperature increases, the penalty becomes uniform across all negative samples, which reduces their tendency to separate. Next, we discuss the behavior of the SupCon loss for an anchor sample i in two cases.

Case 1: the anchor sample i is a noisy sample.

  a. Binary sentiment classification.

    (a) Positive sample j:

      (a) if j is a noisy sample and its true sentiment polarity is the same as that of the anchor i, the SupCon loss pulls them closer together, so the negative effect of the noisy label is alleviated.

      (b) if j is a clean sample and its true sentiment polarity is different from that of the anchor i, the SupCon loss still pulls them closer. This wrong movement is inevitable.

    (b) Negative sample k:

      (a) if k is a clean sample and its true sentiment polarity is the same as that of the anchor i, the embedding of k lies closer to the anchor i than those of the other negative samples in the mini-batch, i.e. k is a hard negative sample. We expect k and i to stay close because they share the same true sentiment polarity, so we can set a larger temperature value to alleviate the wrong movement caused by the noisy label of the anchor.

      (b) if k is a noisy sample and its true sentiment polarity is different from that of the anchor i, the SupCon loss pushes them apart, since they have different weak labels.

  b. Multiple sentiment classification. In addition to all of the above cases, multi-class sentiment classification has the following special cases:

    (a) Positive sample j:

      (a) if j is a noisy sample and its true sentiment polarity is the same as that of the anchor i, the negative effect is inevitable.

    (b) Negative sample k:

      (a) if k is a clean sample and its true sentiment polarity is different from that of the anchor i, the SupCon loss can push them apart.

      (b) if k is a noisy sample and its true sentiment polarity is the same as that of the anchor i, then k is a hard negative sample. We can use the same hyper-parameter setting as in Case 1-a-(b)-(a) to alleviate the negative effect of the noisy sample.

Case 2: the anchor sample i is a clean sample.

  a. Binary sentiment classification.

    (a) Positive sample j:

      (a) if j is a clean sample and its true sentiment polarity is the same as that of the anchor i, the SupCon loss pulls them closer.

      (b) if j is a noisy sample and its true sentiment polarity is different from that of the anchor i, the negative effect of the noisy sample is inevitable.

    (b) Negative sample k:

      (a) if k is a clean sample, its true sentiment polarity is different from that of the anchor i, and the SupCon loss correctly pushes them apart.

      (b) if k is a noisy sample, its true sentiment polarity is the same as that of the anchor i, so k is a hard negative sample. We can use the same hyper-parameter setting as in Case 1-a-(b)-(a) to alleviate the negative effect of the noisy sample.

  b. Multiple sentiment classification. In addition to all of the above cases, multi-class sentiment classification includes one special case: if k is a noisy sample and its true sentiment polarity is different from that of the anchor i, the SupCon loss pushes them apart.

Note that if a hard negative sample is a noisy sample, the behavior of the SupCon loss is the same as in the above cases where a negative sample k is a noisy sample.

4 Experiments and analysis

4.1 Datasets

We conduct experiments on the Amazon product review dataset [9], the Twitter dataset [28] and the SST5 dataset [41]. The customer reviews are collected from three domains (digital cameras, cell phones and laptops) and comprise 1,143,721 sentences with rating information plus 11,220 labeled sentences. The Twitter dataset contains 1,100,000 sentences with emojis (Footnote 3) and 4,714 labeled sentences. Sentences containing :), : ), :-), :D, or =) are assigned a positive label, while sentences containing :(, : ( or :-( are assigned a negative label (Footnote 4). We use the Twitter sentiment classification benchmark of SemEval 2013 as the labeled data, with neutral sentences removed (Footnote 5). Statistics for these two datasets are shown in Table 1. For the supervised fine-tuning step, we split the labeled data of the Amazon dataset into a training set (70%), a validation set (10%), and a test set (20%). Since the labeled Twitter dataset is smaller, we increase the proportion of the test set: the labeled set is divided into a training set (60%), a validation set (10%) and a test set (30%). The ratio of positive to negative samples in each of the above sets is 1:1.
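To make the tagging rule concrete, the helper below shows one way to turn the listed emoticons into weak labels; the function name and the decision to skip tweets with no or conflicting emoticons are our assumptions, not part of the original pipeline.

```python
POSITIVE_TAGS = {":)", ": )", ":-)", ":D", "=)"}
NEGATIVE_TAGS = {":(", ": (", ":-("}

def emoticon_weak_label(tweet: str):
    """Assign a weak sentiment label from the emoticons listed above.

    Returns 1 (positive), 0 (negative), or None when no emoticon is present
    or the signals conflict (hypothetical handling of ambiguous tweets).
    """
    has_pos = any(tag in tweet for tag in POSITIVE_TAGS)
    has_neg = any(tag in tweet for tag in NEGATIVE_TAGS)
    if has_pos == has_neg:        # none found, or conflicting signals
        return None
    return 1 if has_pos else 0
```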

The SST5 dataset (Footnote 6) contains 11,855 labeled sentences divided into 5 categories: 0 (very negative), 1 (negative), 2 (neutral), 3 (positive), and 4 (very positive). The detailed distribution of the categories is shown in Table 2. In order to simulate real-world scenarios, we split the SST5 data as follows: we randomly select 3,000 sentences from the SST5 labeled data as the fine-tuning corpus, 1,000 sentences as the test set, and use the remaining sentences as the pre-training corpus. The fine-tuning corpus is divided into a training set (90%) and a validation set (10%). The categories are equally represented in the fine-tuning corpus and the test set. For the pre-training corpus, we randomly replace 20% of the sentences’ true labels with other labels (20% is roughly the average noise ratio of the other two datasets). We also conduct experiments with different noise ratios in Sect. 4.7.
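The label-noise injection for the SST5 pre-training corpus can be sketched as follows; the helper is hypothetical, but it mirrors the described procedure of replacing a fraction of true labels with a different randomly chosen class.

```python
import random

def inject_label_noise(labels, noise_ratio: float = 0.2, num_classes: int = 5, seed: int = 0):
    """Replace a fraction of the true labels with a different, randomly
    chosen class (20% by default, as used for the SST5 pre-training corpus)."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_noisy = int(len(noisy) * noise_ratio)
    for idx in rng.sample(range(len(noisy)), n_noisy):
        choices = [c for c in range(num_classes) if c != noisy[idx]]
        noisy[idx] = rng.choice(choices)
    return noisy
```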

Table 1 Statistics about the Amazon and Twitter dataset
Table 2 SST5 labeled data

4.2 Experimental settings

We remove the stop words from the texts and set the input length to 150, 47 and 40 for the Amazon, Twitter and SST5 datasets respectively. Further training settings are shown in Table 3.

Table 3 Training settings. “64/128” means that the mini-batch size is 64 when training a model with a Bert encoder and 128 otherwise; other parameters written this way follow the same convention

4.3 Baselines and main comparison

We employ Accuracy and Macro-F1 as the evaluation metrics. The baseline methods are as follows:

Lexicon: A lexicon-based method [42]. It uses external evidence from other sentences and linguistic conventions to identify the polarity of opinion words.

NBSVM: NBSVM [43] combines naive Bayes and a support vector machine to construct a linear classifier for sentiment classification.

SSWE: SSWE [44] introduces sentiment information into a neural network to learn sentiment-specific word vectors on weakly-labeled data.

gMLP: A novel MLP-based neural network [20] that achieves good performance on the sentiment classification task.

WSD: A weakly-supervised method based on a multi-channel CNN, which adopts the triplet loss during the pre-training phase and adds subject (aspect) information in the fine-tuning phase [9]. For fairness, none of the methods in the following experiments use this aspect information.

CESCL: CESCL fine-tunes a pre-trained language model with a loss function that combines the cross-entropy loss and the supervised contrastive loss [45]. We adopt Bert-base as the pre-trained language model.

Bert: Training Bert-base with labeled data.

Bert-Weak: We train Bert-base using the weakly-labeled data as if it were labeled data. This baseline tests whether the weakly-labeled data has a negative impact on the training phase.

WACL-Bert w/o drop: Our method without the dropping-layer strategy.

WACL-Bert: Our method.

Note that all the above methods are evaluated on the Amazon, Twitter and SST5 datasets. SSWE, WSD, WACL-Bert w/o drop and WACL-Bert include a pre-training phase on the weakly-labeled data. The experimental results are shown in Table 4.

Table 4 The experimental results of baselines on Amazon dataset, Twitter dataset and SST5 dataset

The results (Footnote 7) on the three datasets show similar trends. WACL-Bert achieves the best classification performance while Lexicon achieves the worst: the manually constructed sentiment lexicon cannot cover all opinion terms in a specific dataset, and the designed rules cannot account for the varied contextual information of the sentences. NBSVM achieves higher accuracy, but its performance is constrained by its limited feature generalization. SSWE performs slightly better than NBSVM thanks to the ad hoc learning of sentiment embeddings on the weakly-labeled data; however, a linear classifier’s representation capacity is inferior to that of deep models. gMLP stacks multiple non-linear MLP layers, allowing the model to learn rich semantic information from the labeled data, so it outperforms SSWE. WSD outperforms gMLP thanks to the triplet training on weakly-labeled data. Bert performs better than WSD because of the rich prior knowledge of the pre-trained Bert. CESCL adds a contrastive loss to guide Bert-base to learn contrast patterns from the labeled data, resulting in better performance. Compared to our method, the accuracy of Bert-Weak drops dramatically, illustrating the harmfulness of the noisy instances. The performance degradation caused by ablating the dropping-layer strategy demonstrates its anti-noise capability and confirms that noisy-labeled samples indeed have a negative impact on the top layers of the pre-trained model.

4.4 WACL adopting different encoders

We investigate the performance of WACL with different deep encoders, including MLP, CNN and LSTM. Below we give a brief overview of the structures illustrated in Fig. 4.

Fig. 4 MLP-based model, CNN-based model and LSTM-based model. RM represents the remaining model

Fig. 5 The structure of the MixerBlock

MLP-based model. It uses the MixerBlock module [46]; the structure of the MixerBlock is shown in Fig. 5. A MixerBlock contains two types of MLP: a Dimension-Mix MLP and a Token-Mix MLP. The former operates on each token independently, allowing information exchange across the token’s vector dimensions, while the latter fuses information across tokens. The MixerBlock also employs skip connections and layer normalization. The mapping of the MixerBlock is computed as follows:

$$\begin{aligned} {{\mathbf {U}}} = {{\mathbf {S}}} + {(\sigma ({(LN({{\mathbf {S}}}))^T}{{{\mathbf {W}}}_{{\mathbf {1}}}}){{{\mathbf {W}}}_{{\mathbf {2}}}})^T},\end{aligned}$$
(8)
$$\begin{aligned} {{\mathbf {Y}}} = {{\mathbf {U}}} + \sigma (LN({{\mathbf {U}}}){{{\mathbf {W}}}_{{\mathbf {3}}}}){{{\mathbf {W}}}_{{\mathbf {4}}}}, \end{aligned}$$
(9)

where \({{\mathbf {S}}} \in {{\mathbb {R}}^{t \times d}}\) denotes an input sentence, t is the length of the sentence, and d is the dimension of the word vectors (we set d to 300). \({{\mathbf {U}}}\in {{\mathbb {R}}^{t \times d}}\) is the output of the Token-Mix MLP and \({{\mathbf {Y}}} \in {{\mathbb {R}}^{t \times d}}\) is the output of the Dimension-Mix MLP. \({{{\mathbf {W}}}_{{\mathbf {1}}}} \in {{\mathbb {R}}^{t \times q}}\), \({{{\mathbf {W}}}_{{\mathbf {2}}}} \in {{\mathbb {R}}^{q \times t}}\), \({{{\mathbf {W}}}_{{\mathbf {3}}}} \in {{\mathbb {R}}^{d \times q}}\) and \({{{\mathbf {W}}}_{{\mathbf {4}}}} \in {{\mathbb {R}}^{q \times d}}\) are the parameter matrices of the fully-connected layers; we empirically set q = 512. LN and \(\sigma\) denote layer normalization and the Gelu activation function respectively. We stack 6 MixerBlocks as the encoder layer and 2 MixerBlocks as the projection layer, and apply average pooling to the output of the projection layer.
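A PyTorch sketch of the MixerBlock defined by Eqs. (8)–(9) is shown below; the class name is ours, and the nn.Linear layers carry bias terms even though the equations omit them.

```python
import torch.nn as nn

class MixerBlock(nn.Module):
    """Sketch of Eqs. (8)-(9): token-mixing then dimension-mixing MLPs with
    layer normalization, GELU, and skip connections (d=300, q=512)."""
    def __init__(self, t: int, d: int = 300, q: int = 512):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.token_mix = nn.Sequential(nn.Linear(t, q), nn.GELU(), nn.Linear(q, t))
        self.ln2 = nn.LayerNorm(d)
        self.dim_mix = nn.Sequential(nn.Linear(d, q), nn.GELU(), nn.Linear(q, d))

    def forward(self, S):                       # S: (batch, t, d)
        # Eq. (8): mix information across tokens (MLP applied along the sequence axis)
        U = S + self.token_mix(self.ln1(S).transpose(1, 2)).transpose(1, 2)
        # Eq. (9): mix information across feature dimensions
        Y = U + self.dim_mix(self.ln2(U))
        return Y
```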

CNN-based model. We stack 5 identical convolutional layers as the encoder layer and apply max pooling to the output of the encoder layer. For each convolutional layer, the kernel size is 3, the number of output channels is 300, and the activation function is Relu. We use two standard MLPs as the projection layer, with Gelu as the activation function.

LSTM-based model. We stack 5 identical LSTM layers as the encoder layer. For each LSTM layer, the dimension of the hidden state is 300. The projection layer is the same as in the CNN-based model.

In this experiment, the word lookup table used in the pre-training phase is initialized with the publicly available 300-dimensional word2vec vectors trained on 100 billion words of Google News [47]. Unlike the Bert-based model, the above three models have 8, 7, and 7 backbone layers, respectively. We therefore again divide each model into an upper part and a lower part, using the 4th layer as the boundary. The partitions are shown in Table 5 and the experimental results in Table 6. The WACL variants are fine-tuned on the labeled datasets. Baseline-Weak and Baseline-Clean denote the above models trained directly on the weakly-labeled data and on the labeled data, respectively. The poor performance of the Weak baselines confirms the negative effect of the noisy samples. The results also show that WACL boosts the performance of the Clean baselines, because the contrastive pre-training and the dropping-layer strategy enable the models to learn sentiment representations well from noisily labeled data and improve their performance on the downstream tasks.

Table 5 Partitions of lower part and upper part
Table 6 Performance comparison on Amazon dataset, Twitter dataset and SST5 dataset

4.5 Investigation of the dropping-layer strategy

The dropping-layer strategy improves model robustness to noisy instances. We conduct experiments to search for an optimal dropping configuration: we first divide the pre-trained model into a lower part and an upper part, and then, using the two discard strategies (bottom-to-top and top-to-bottom), we remove layers incrementally until we find the best performance. Table 7 shows the performance of dropping different layers.

Table 7 The experimental results of the dropping-layer strategy on Amazon dataset, Twitter dataset and SST5 dataset

WACL adopts “contrastive pre-training + dropping-layer”, which further improves the classification performance compared to the WACL models w/o drop. From Table 7, the classification accuracy of WACL-Bert(D12–14) is 90.9% on the Amazon dataset, 0.6% higher than WACL-Bert w/o drop. Similarly, WACL-Bert(D11–14) is 0.5% more accurate than WACL-Bert w/o drop on the Twitter dataset, and on the SST5 dataset WACL-Bert(D12–14) exceeds WACL-Bert w/o drop by 0.7%. We find that if we keep the lower part intact and remove layers of the upper part starting from the last layer, the performance first improves and then degrades. This observation is consistent with the conclusions that (1) the top layers are susceptible to noisy instances and (2) the bottom layers of the pre-trained model have learned the sentiment semantics. Furthermore, the dropping-layer method reduces the model’s parameter scale: on the Amazon, Twitter and SST5 datasets, the number of parameters is reduced by 8.1 M, 15.2 M and 8.1 M respectively.

4.6 WACL with different pre-training loss functions

To demonstrate the benefit of the contrastive loss, we compare the SupCon loss with the cross-entropy loss and the triplet loss [6] on the Amazon and SST5 datasets. The experimental results are shown in Table 8. Essentially, the triplet loss and the contrastive loss both capture the relative relationships between samples. However, because the triplet loss samples only one positive and one negative instance for each anchor sample, it cannot learn contrast patterns as rich as those learned by the contrastive loss. Furthermore, sampling multiple positive and negative samples improves the model’s robustness to noisy instances. Hence the contrastive loss outperforms the triplet loss. The cross-entropy loss performs worst since it lacks sufficient discrimination between inter-class samples [48].

Table 8 The experimental results of the different pre-training strategies

Furthermore, we use intra-class and inter-class metrics [6] to assess the quality of the representations learned with the different pre-training strategies. We feed the test samples into the pre-trained model to obtain their high-level feature vectors, and then compute the inter-class and intra-class average distances of the sample vectors. A better pre-training strategy yields a sample distribution with a larger inter-class distance and a smaller intra-class distance. The two metrics are calculated as follows:

$$\begin{aligned}&{D_{inter}} = \frac{1}{|\Psi |}\sum _{(s_i,s_j)\in \Psi } dst(s_i,s_j), \end{aligned}$$
(10)
$$\begin{aligned}&{D_{intra}} = \frac{1}{|\Phi |}\sum _{(s_v,s_k)\in \Phi } dst(s_v,s_k), \end{aligned}$$
(11)

where \({D_{inter}}\) and \({D_{intra}}\) are the inter-class and intra-class average distances, respectively, \(\Psi\) and \(\Phi\) are the sampled pairs from different categories and from the same category, and \(dst(\cdot )\) is the Euclidean distance. The results are shown in Fig. 6. Pre-training with the cross-entropy loss produces the worst results due to its inability to capture inter-class patterns. The contrastive training strategy produces a larger inter-class average distance and a smaller intra-class average distance than the triplet training strategy, indicating that it better captures the contrast patterns in the training samples.
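Eqs. (10)–(11) can be computed directly from the projected test-set vectors, e.g. with the sketch below; the function name is ours and, as an assumption, it averages over all unordered pairs rather than a sampled subset.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def inter_intra_distance(vectors: np.ndarray, labels: np.ndarray):
    """Sketch of Eqs. (10)-(11): average Euclidean distance over pairs of test
    samples from different classes (inter) and from the same class (intra)."""
    dist = squareform(pdist(vectors, metric="euclidean"))    # (N, N) distance matrix
    same = labels[:, None] == labels[None, :]                # same-class mask
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)     # count each unordered pair once
    d_intra = dist[same & upper].mean()
    d_inter = dist[~same & upper].mean()
    return d_inter, d_intra
```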

Fig. 6 Inter-class average distance and intra-class average distance for different pre-training strategies

4.7 The effect of varying noise ratio on classification performance

In this section, we conduct experiments on the Amazon labeled dataset and the SST5 dataset to study the performance of the proposed method under different label noise ratios. For binary sentiment classification, we randomly select 50% of the sentences from the Amazon labeled data for the pre-training phase, 20% for fine-tuning, and use the remaining 30% as the test set. For the pre-training subset, we randomly replace the true labels of sentences with the opposite labels, setting the noise ratio from 10% to 50%. Since the task is binary classification and pre-training only learns the relative relationship between the two categories, varying the ratio from 50% to 100% would produce a mirrored set of results. For multi-class sentiment classification, we follow the procedure described in Sect. 4.1 and set the noise ratio from 10% to 100%. To investigate the anti-noise capabilities of the contrastive pre-training and the dropping-layer method, we compare WACL-Bert with Bert-Clean, Bert-Weak and WACL-Bert w/o drop. The results are reported in Tables 9 and 10.

Table 9 The experimental results on the Amazon labeled dataset
Table 10 The experimental results on the SST5 dataset

The performance of Bert-Weak decreases sharply as the noise ratio increases, indicating the significant harm caused by noisy data. From Table 9, when the noise ratio exceeds 30%, WACL-Bert without the dropping-layer method is clearly inferior to Bert-Clean, while WACL-Bert outperforms its w/o drop counterpart. On the SST5 dataset, WACL-Bert without the dropping-layer method is worse than Bert-Clean once the noise ratio exceeds 60%, whereas our full method falls below Bert-Clean only at a 100% noise ratio and outperforms WACL-Bert w/o drop at every noise ratio. From Table 10, we can see that the dropping-layer strategy shows its strength when the noise scale is extremely large: it discards the layers that are easily affected by noise and greatly mitigates the negative effects of noisy data, so the model still outperforms Bert-Clean even at a 90% noise ratio. These results demonstrate that the anti-noise capacity of the contrastive pre-training is limited at high noise ratios, whereas the dropping-layer strategy can still mitigate the negative effect of the noisy samples under the same conditions. The likely explanation is that when the noise scale is large, the top layers of the model are more easily influenced, so dropping layers is more appropriate in such circumstances. In summary, WACL-Bert is robust to label noise even at a high noise ratio.

4.8 Visualization of the embedding space

To intuitively show the effect of representation learning, we use t-SNE [49] to project the test samples’ high-dimensional feature vectors, which are generated by the projection layer of the final model. The visualization results are shown in Fig. 7. From left to right, the captions above each sub-figure, i.e., random initialization, after pre-training, and after fine-tuning, refer to the three statuses of the model; the inter-class and intra-class average distances are also shown at the top of each sub-figure. In Fig. 7a, the test samples are scattered since the model parameters are only randomly initialized. After the contrastive pre-training, Fig. 7b shows that the inter-class average distance increases while the intra-class average distance decreases, indicating that the model has learned the contrast patterns from the weakly-labeled data. Figure 7c reveals a further improvement in the model’s ability to distinguish different sentiment polarities.
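The visualization can be reproduced with a short t-SNE routine such as the one below; the function name, figure size, and color map are our choices, and scikit-learn's TSNE plus matplotlib are assumed (the paper does not specify its plotting code).

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_space(vectors, labels, title):
    """Project high-level feature vectors of the test samples to 2-D with
    t-SNE and color the points by sentiment label."""
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(vectors)
    plt.figure(figsize=(4, 4))
    scatter = plt.scatter(points[:, 0], points[:, 1], c=labels, s=5, cmap="coolwarm")
    plt.legend(*scatter.legend_elements(), title="label")
    plt.title(title)
    plt.show()
```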

Fig. 7 Visualization of the embedding space. “pos” and “neg” represent the two sentiment categories of the Amazon dataset, “0”–“4” represent the five sentiment categories of the SST5 dataset, and \({D_{inter}}\) and \({D_{intra}}\) are the inter-class and intra-class average distances, respectively

4.9 Temperature hyper-parameter tuning

As described in Sect. 3.2.1, selecting an appropriate temperature hyper-parameter t can improve the model’s performance on downstream tasks. We vary t and evaluate the model on the validation sets, searching over a relatively narrow interval of values. The tuning results are shown in Fig. 8. The curves for the two datasets follow the same pattern: a temperature that is too large or too small causes performance degradation. We eventually set the temperature hyper-parameter to 0.5.

Fig. 8 The effect of the temperature hyper-parameter on classification performance

5 Conclusion

In this paper we introduced the WACL framework, which learns robust representations from huge volumes of user-tagged opinionated texts containing label noise. We first adopt contrastive pre-training to learn rich sentiment information and contrast patterns from the weakly-labeled data, producing a good sentiment embedding space. Second, we remove the top layers of the pre-trained model that are susceptible to noisy data, and finally we fine-tune the model to further improve classification performance on the downstream tasks. The experimental results show that WACL can greatly boost the performance of deep models while providing good anti-noise ability even at high noise ratios.