1 Introduction

Sentiment classification is a challenging problem. Early research focused on lexicon-based methods [1, 2] and traditional machine learning methods [3]. However, both sentiment lexicons and feature engineering require expert knowledge, and neither considers the context of text sequences. Current deep learning-based methods achieve impressive performance on sentiment classification [4, 5]. The performance of deep models depends on large-scale annotated training data, whereas labeling large amounts of high-quality opinionated texts is laborious. Fortunately, the Internet offers large amounts of opinionated texts with user tags, such as product reviews with ratings and tweets with emojis, where the tags reflect the overall sentiment orientation of the users. However, the orientation of a user tag may be inconsistent with the true sentiment semantics of the text, e.g. a 5-star customer review may contain a negative sentence. Such noisy labels have a negative impact on the training process [6]. In essence, user-tagged data is a kind of weakly-labeled data. The key problem of this work is learning robust representations from weakly-labeled data.

Learning a robust sentiment representation benefits the sentiment classification task. Metric learning is a representation learning paradigm that aims to decrease the distance between samples from the same category while increasing the distance between samples from different categories [7]. The triplet training strategy and the contrastive training strategy are two common approaches to attaining this goal. The former employs the triplet loss function [8] to constrain the distances within sampled triples, i.e. an anchor sample, a positive sample, and a negative sample. Guan et al. [9] applied the triplet training strategy to the sentiment classification task and obtained impressive results. However, because this technique samples only one negative and one positive sample for each anchor, the possible contrast patterns are not fully captured. In comparison, the contrastive training strategy uses the NT-Xent loss function [10], which samples multiple negative and positive samples in a mini-batch, to learn rich contrast patterns. In weakly-supervised settings, low-cost user tags replace data augmentation [10, 11] for generating positive and negative samples.

For weakly-supervised scenarios, we also need to improve the anti-noise capability of the algorithm. Ghosh et al. [12] observed that the contrastive training strategy improves model robustness under label noise: supervised robust methods work remarkably well when they are initialized with a contrastive representation learning model. We offer an explanation for this observation: sampling many dissimilar pairs not only provides rich contrast patterns (i.e. positive sample vs. negative sample), but also makes noisy instances non-significant among the sampled pairs. However, this robustness depends on a large mini-batch size and an appropriate temperature hyper-parameter. To achieve robust representation learning on weakly-labeled data, an ad-hoc anti-noise technique built on contrastive learning is required. Recent work on anti-noise methods falls into two categories. The first designs robust model structures [13,14,15]. These methods assume that there is a single transition probability (Footnote 1) between the noisy label and the ground-truth label, and add a noise adaptation layer to simulate the label transition matrix of the noisy data. However, user behavior is arbitrary in weakly-supervised scenarios, resulting in chaotic tags; this assumption may therefore be inconsistent with real-world noisy labels, and the additional modules increase model complexity. The second category uses designed robust loss functions to alleviate the impact of noisy data [16, 17], but these follow the same assumption about the noisy label transition. When noisy labels are present, deep learning models eventually memorize the incorrectly assigned labels, resulting in poor generalization performance [18]. Li et al. [19] investigated the influence of noisy instances on model training and discovered that noisy data usually has a greater impact on the top layers of the model. Based on the above, we propose a novel framework called Weakly-supervised Anti-noise Contrastive Learning (WACL) for sentiment classification. The framework uses a knowledge transfer strategy: we first use contrastive learning to pre-train a deep model on a large amount of weakly-labeled data. We then apply a dropping-layer strategy to remove the top layers of the pre-trained model that are susceptible to noise. Finally, we add a classification layer on top of the remaining model and run supervised training with a small labeled dataset. Because this framework accounts for the impact of noisy data on model training, it does not rely on any assumption about the noise distribution and does not require additional structures to combat noise.

The contributions can be summarized as follows:

  1. We propose a novel framework called Weakly-supervised Anti-noise Contrastive Learning (WACL) for sentiment classification. It uses contrastive pre-training to learn robust sentiment representations for downstream tasks even on data with a high noise ratio.

  2. The proposed framework is adaptable to encoders with various deep structures.

  3. Experiments on different sentiment classification datasets show that the WACL framework significantly improves the performance of deep models, and that WACL with Bert as the encoder outperforms the other baselines.

2 Related work

2.1 Deep learning on sentiment classification

Sentiment classification, also known as opinion mining, refers to mining the sentiment tendencies of opinionated texts and classifying their attitudes. Recent deep learning-based methods have achieved remarkable performance. The multi-layer perceptron (MLP) [20] demonstrated its powerful representation ability on sentiment classification tasks. Zhang et al. [21] and Habimana et al. [22] used convolutional neural networks (CNN) to identify the sentiment polarity of texts; convolutions capture local features with sliding filters and then aggregate them into high-level global representations. Al-Smadi et al. [23] and Arunava et al. [24] used long short-term memory networks (LSTM) for sentiment classification. LSTM can learn a text’s long-term dependencies and “understand” the text’s sentiment as a whole; compared to CNN, LSTM is better suited to long-text sentiment classification [6]. Ling et al. [25] designed a CNN-LSTM network to address word polysemy. Transfer learning based on BERT produced state-of-the-art models with minimal effort on 11 downstream NLP tasks [26]. BERT uses a bidirectional transformer encoder [27], which captures the bidirectional semantics of a text sequence and allows parallel processing. The Masked Language Model (MLM) is the core pre-training objective of BERT, combining the transformer encoder with masked tokens. Because many downstream NLP tasks (e.g. natural language inference) require understanding the relationship between two sentences, which language modeling does not directly capture, BERT adds a Next Sentence Prediction (NSP) objective and uses the special token [CLS] as the first token of every sequence. Together, these techniques elevate BERT’s performance to a new level.

Although deep learning methods do not require crafting complex features and rules, their effectiveness relies on large amounts of human-labeled data. Labeling large quantities of high-quality opinionated texts is time-consuming and requires considerable human and financial effort. Besides, it is challenging to maintain consistency among different annotators. Fortunately, low-cost user-tagged data provides a wide pool of resources to supplement human-labeled data.

2.2 Weakly-supervised learning

The scale of user-tagged data is huge, but it may contain noisy-labeled instances, because unconstrained users’ labeling behaviors do not follow a common standard. Hence, the user tag is a form of weak supervision. In recent years, researchers have attempted to exploit information from user-tagged data [28] for training sentiment classifiers. Qu et al. [29] proposed using review data with ratings as weakly-labeled data to train a probabilistic model for sentence sentiment classification. Täckström and McDonald [30] proposed a sentence-level sentiment classification method based on hidden conditional random fields (HCRF) that combines review-level and sentence-level sentiment labels. Wang et al. [31] proposed a sentiment classification method based on multi-dimensional (language symbols, emoticon symbols, and punctuation symbols) and multi-level (words, sentences, and documents) modeling, where emojis are fused into the input for training the deep model. However, the above methods require feature engineering and do not consider the impact of noisy data. The work in [9] is closest to ours: Guan et al. proposed a weakly-supervised learning framework for customer review sentiment classification, adopting the triplet training strategy to learn a good sentiment embedding space. But learning on randomly sampled triples cannot adequately capture the contrast patterns between positive and negative instances. To handle this problem, we adopt contrastive learning to sample multiple instances of different categories for each anchor instance. As a result, the model can learn richer contrast patterns between positive and negative samples than triplet learning. The temperature hyper-parameter governs the degree of attention paid to hard negative instances [32]. Sampled noisy negative instances that have the same true sentiment polarity as the anchor are deemed hard negative instances. We can adjust the temperature hyper-parameter so that the model does not focus too much on the hard ones during learning, preventing the noisy data from being pushed to the wrong side. In the weakly-supervised scene, contrastive learning is thus used not only to obtain a good sentiment representation, but also to reduce the negative influence of noisy instances.

2.3 Learning from noisy data

User-tagged data is simple to collect, but we must mitigate the effect of noisy instances. Contrastive learning has a certain anti-noise ability, but good anti-noise performance depends on extremely careful manual hyper-parameter tuning and a large mini-batch size. In our work, we develop an ad-hoc anti-noise strategy to improve the algorithm’s anti-noise performance. This strategy is inspired by research on learning from noisy labels, which falls into two categories:

Designing a robust model structure. The label transition matrix is the key to these methods. It models the transition probability between the true and noisy labels, i.e., a sample with a given true label has a certain probability of being marked with a noisy label. These methods commonly add a noise adaptation layer at the top of the network to model the transition probability, which is then removed during the evaluation phase. Chen et al. [33] used the confusion matrix of all training samples as the initial weight matrix \({{\mathbf {W}}}\) of the noise adaptation layer, and then modified the model’s output to achieve label correction. A series of studies proposed to initialize \({{\mathbf {W}}}\) with the identity matrix and then add a regularizer to constrain the learning of \({{\mathbf {W}}}\) during model training [34,35,36,37]. These methods commonly make strong assumptions about the distribution of noisy labels, limiting the model’s ability to handle the complex label noise of real-world scenarios [16]. Furthermore, these methods were developed specifically for computer vision tasks, and their efficacy has not been demonstrated on NLP tasks.

Designing a robust loss function. The key idea is to design a loss function that is robust to noisy labels. The design of such loss functions requires the noisy labels to satisfy certain conditions, ensuring that a classifier trained on noisy data has the same misclassification probability as one trained on noise-free data [16, 38]. Ghosh et al. [16] proved that the mean absolute error (MAE) loss generalizes better than the categorical cross-entropy (CCE) loss, because only the MAE loss satisfies the above conditions. When dealing with complicated data, however, the MAE loss has a significant limitation in generalization performance. To address this issue, a more general noise-resilient loss, the Generalized Cross Entropy loss, was developed, combining the benefits of both MAE and CCE [17]. But the distribution of noisy instances in user-tagged data is more complicated, so it is challenging to design such a robust loss.

Deep learning models will eventually memorize wrongly assigned labels, which leads to poor generalization performance [18]. Li et al. studied how architecture affects learning with noisy labels and observed that the last few layers of a model are more negatively affected by noisy labels [19]. Inspired by this, we devise a simple dropping-layer strategy to mitigate the harmful effects of incorrectly tagged instances.

3 Method

In this section, we describe the details of WACL. The framework is depicted in Fig. 1. It is a transfer-style method that consists of three steps: contrastive pre-training, dropping-layer, and supervised fine-tuning. During the contrastive pre-training phase, we feed a weakly-labeled instance into the encoder layer to obtain a high-level representation, which is then projected to a vector of fixed dimension. We use Eq. (6) as the objective function for pre-training, because the multiple sampling of dissimilar pairs (1) utilizes the large weakly-labeled dataset more effectively, (2) pushes samples from different classes as far apart as possible, and (3) makes the noisy-labeled instances inconsequential. After the pre-training phase, we apply a dropping-layer strategy to remove the top layers of the pre-trained model, because the last few layers of the model are more negatively affected by noisy labels [19]. Finally, we add a classification layer on top of the remaining model for standard supervised training with labeled data. In the following sections, we describe the model structure and the transfer-style training strategy.

Fig. 1 WACL framework. The solid arrow represents the pre-training phase, and the dashed arrow represents the fine-tuning phase

3.1 Model structure with BERT as encoder

In this section, we introduce the model structure of WACL. We choose Bidirectional Encoder Representations from Transformers (BERT) (Footnote 2) as the text encoder because of its good performance on NLP tasks [26]. The bidirectional transformer encoder is the core structure of BERT. The transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolutions [27]; it guides the model to learn semantic patterns from different views. Bert-base has clear advantages in semantic feature extraction, long-range information capture, and syntactic feature extraction. The model structure is shown in Fig. 2; hereafter we refer to it as WACL-Bert.

Fig. 2 WACL-Bert

Input layer. An input text sequence of length t is a word sequence \(< {w_1},{w_2},{w_3},...,{w_t} >\). To fit the input format of the Bert model, we add the [CLS] and [SEP] symbols at the two ends of the word sequence, so the model input is \({{{\mathbf {s}}}_{{\mathbf {w}}}} = < CLS,{w_1},{w_2},{w_3},...,{w_t},SEP>\). Note that \({{{\mathbf {s}}}_{{\mathbf {w}}}}\) is sampled from the weakly-labeled dataset.

Encoder layer. We use the pre-trained Bert-base as the encoder. The computational details can be found in [26]. For simplicity, we use the notation \(BERT( \cdot )\) to represent its encoding computation. Input sequence is encoded as:

$$\begin{aligned} {{\mathbf {H}}} = BERT({{{\mathbf {s}}}_{{\mathbf {w}}}}), \end{aligned}$$
(1)

where \({{{\mathbf {H}}}} \in {{\mathbb {R}}^{t \times d}}\) is the feature matrix of the input instance and d = 768 is the dimension of the word vectors. We choose the vector \({{{\mathbf {h}}}} \in {{\mathbb {R}}^{d}}\) corresponding to [CLS] as the input of the projection layer. This setting follows standard practice for fine-tuning pre-trained language models for classification [26].
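For readers who want to reproduce the encoding step, the snippet below is a minimal sketch of Eq. (1) using the HuggingFace transformers library; the checkpoint name, the example sentence, and the maximum length are illustrative assumptions rather than the authors' exact setup.

```python
import torch
from transformers import BertTokenizer, BertModel

# Sketch of Eq. (1): encode one sentence with Bert-base and take the vector
# at the [CLS] position as h. The tokenizer adds [CLS]/[SEP] automatically.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

text = "the camera is great but the battery dies too fast"   # illustrative example
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=150)

with torch.no_grad():
    H = encoder(**inputs).last_hidden_state   # shape: (1, t, 768)
h = H[:, 0, :]                                # [CLS] vector, shape: (1, 768)
```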

Projection layer. On top of the encoder layer, we apply a projection to generate a fixed-size high-level representation. The projection layer consists of two linear layers and two non-linear layers. The mapping operations are computed as follows:

$$\begin{aligned}&{{{\mathbf {m}}}_{{\mathbf {1}}}} = f({{\mathbf {h}}}{{{\mathbf {W}}}_{{\mathbf {1}}}} + {{{\mathbf {b}}}_{{\mathbf {1}}}}){{{\mathbf {W}}}_{{\mathbf {2}}}} + {{{\mathbf {b}}}_{{\mathbf {2}}}}, \end{aligned}$$
(2)
$$\begin{aligned}&{{\mathbf {v}}} = f({{{\mathbf {m}}}_1}{{{\mathbf {W}}}_3} + {{{\mathbf {b}}}_3}){{{\mathbf {W}}}_4} + {{{\mathbf {b}}}_4}, \end{aligned}$$
(3)

where \({{{\mathbf {m}}}_{{\mathbf {1}}}} \in {{\mathbb {R}}^{d}}\) is the output feature vector of the first linear block, and \({{{\mathbf {W}}}_{{\mathbf {1}}}} \in {{\mathbb {R}}^{d \times c}}\) and \({{{\mathbf {W}}}_{{\mathbf {2}}}} \in {{\mathbb {R}}^{c \times d}}\) are parameter matrices. Similarly, \({{{\mathbf {v}}}} \in {{\mathbb {R}}^{n}}\) is the output of the second linear block, with parameter matrices \({{{\mathbf {W}}}_{{\mathbf {3}}}} \in {{\mathbb {R}}^{d \times n}}\) and \({{{\mathbf {W}}}_{{\mathbf {4}}}} \in {{\mathbb {R}}^{n \times n}}\). \({{{\mathbf {b}}}_1},{{{\mathbf {b}}}_2},{{{\mathbf {b}}}_3},{{{\mathbf {b}}}_4}\) are bias terms. We use Gelu as the activation function f and empirically set c = 1000 and n = 384 for the projection layer.
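A compact PyTorch sketch of the projection layer in Eqs. (2)–(3) is given below; the class name and module grouping are our own, but the dimensions follow the paper (d = 768, c = 1000, n = 384) and the activation is Gelu.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Sketch of Eqs. (2)-(3): two linear blocks with GELU activations."""
    def __init__(self, d: int = 768, c: int = 1000, n: int = 384):
        super().__init__()
        # m1 = GELU(h W1 + b1) W2 + b2
        self.block1 = nn.Sequential(nn.Linear(d, c), nn.GELU(), nn.Linear(c, d))
        # v = GELU(m1 W3 + b3) W4 + b4
        self.block2 = nn.Sequential(nn.Linear(d, n), nn.GELU(), nn.Linear(n, n))

    def forward(self, h):
        m1 = self.block1(h)       # (batch, d)
        return self.block2(m1)    # (batch, n)
```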

The above describes the computation used for pre-training. Next, we adopt a dropping-layer strategy to remove the top layers of the pre-trained model that are susceptible to noisy samples; the remaining layers form what we call the remaining model. At the fine-tuning stage, we add a classification layer on top of the remaining model and fine-tune the parameters of the whole model with labeled data. Specifically, we first extract the high-level feature vector through the remaining model:

$$\begin{aligned} {{\mathbf {r}}} = RM({{{\mathbf {s}}}_{{\mathbf {l}}}}), \end{aligned}$$
(4)

where \({{\mathbf {r}}}\in {{\mathbb {R}}^u}\) is the high-level fixed vector, and \({{{\mathbf {s}}}_{{\mathbf {l}}}} = < CLS,{w_1},{w_2},{w_3},...,{w_m},SEP>\) is a text sequence from the labeled dataset. The mapping function of the remaining model is represented by \(RM( \cdot )\). The final output layer is computed as:

$$\begin{aligned} {{\mathbf {p}}} = \sigma ({{\mathbf {r}}}{{{\mathbf {W}}}_5} + {{\mathbf {b}}}), \end{aligned}$$
(5)

where \({{{\mathbf {p}}}}\) is the model’s output. \({{{\mathbf {W}}}_5} \in {{\mathbb {R}}^{u\times C}}\) and \({{\mathbf {b}}}\) are the parameter matrix and bias term, respectively, where C is the number of categories. \(\sigma (\cdot )\) is the softmax function.

3.2 Weakly-supervised anti-noise contrastive learning

Our WACL framework employs a knowledge transfer strategy: we first pre-train the deep model on large amounts of user-tagged data, then apply a dropping-layer strategy to alleviate the negative effects of noise on the top layers of the pre-trained model, and finally add a classification layer on top of the remaining pre-trained model and fine-tune the model’s parameters with a small labeled dataset. The three steps are described in detail in the following sections.

3.2.1 Contrastive pre-training with user-tagged data

The weakly-labeled data contains rich sentiment semantic information, but it also includes noisy instances that cannot be ignored. Pre-training aims to learn a good sentiment representation from a large amount of weakly-labeled data while minimizing the negative effects of the noisy instances. Pre-training includes two steps: (1) assigning a weak label to each sample in the user-tagged dataset. In most cases, the user tags directly serve as the weak labels of the corresponding instances, e.g. a grinning emoji is a positive weak label for a tweet. In other scenarios, we still need to derive weak labels from fine-grained tags; for example, Zhao et al. [6] set 3 stars as the threshold to binarize 5-level ratings for a sentence-level customer review sentiment classification task. (2) Training a deep model with the contrastive learning strategy. The pre-training procedure aims to pull samples with the same sentiment polarity closer while keeping samples with different sentiment polarities as far apart as possible. Meanwhile, we also need to prevent the noisy samples from being grouped under incorrect labels. To this end, we use the SupCon loss proposed by Khosla et al. [39] for pre-training:

$$\begin{aligned} L = - \sum_{i = 1}^M {\frac{1}{{{M_{{y_i}}} - 1}}\sum \limits _{j = 1}^M {{l_{i \ne j}}{l_{{y_i} = {y_j}}}\ln \left[ {\frac{{\exp ({{{s_{i,j}}} / t})}}{{\exp ({{{s_{i,j}}} / t}) + \sum \nolimits _{k = 1}^M {{l_{{y_i} \ne {y_k}}}\exp ({{{s_{i,k}}} / t})} }}} \right] } } , \end{aligned}$$
(6)

where M is the mini-batch size, \({y_i}\) and \({y_j}\) are the labels of the anchor sample i and the sample j respectively, and \({M_{{y_i}}}\) is the number of samples in the mini-batch whose label is \({y_i}\). \({l_{i \ne j}}\), \({l_{{y_i} = {y_j}}}\) and \({l_{{y_i} \ne {y_k}}}\) are indicator functions taking values in \(\{0,1\}\); for instance, \({l_{i \ne j}} = 1\) if \(i\ne j\) and \({l_{i \ne j}} = 0\) otherwise. \({s_{i,j}} = \frac{{{{\mathbf {v}}}_{{\mathbf {i}}}^T{{{\mathbf {v}}}_{{\mathbf {j}}}}}}{{\left\| {{{{\mathbf {v}}}_{{\mathbf {i}}}}} \right\| \left\| {{{{\mathbf {v}}}_{{\mathbf {j}}}}} \right\| }}\) is the cosine similarity between sample i and sample j, where \({{{\mathbf {v}}}_{{\mathbf {i}}}}\) and \({{{\mathbf {v}}}_{{\mathbf {j}}}}\) are the high-level feature vectors of sample i and sample j respectively; t is the temperature hyper-parameter.

In the loss function, samples with the same weak label as the anchor sample are treated as positive samples, and samples with different weak labels as negative samples. The SupCon loss samples multiple positive and negative samples for the anchor sample \(s_{i}\), and then guides the positive samples to be close to the anchor while pushing the negative samples away. In this way, the deep model can capture rich contrast patterns and the general sentiment distribution from a large amount of weakly-labeled data.
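To make Eq. (6) concrete, the following PyTorch sketch computes the SupCon loss for one mini-batch of projected feature vectors and weak labels. It is our own illustration of the formula as written above (where each positive pair's denominator contains that pair plus all negatives of the anchor), not the authors' released code.

```python
import torch
import torch.nn.functional as F

def supcon_loss(v: torch.Tensor, y: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Sketch of Eq. (6). v: (M, n) projected features, y: (M,) weak labels."""
    v = F.normalize(v, dim=1)                     # so that v_i . v_j is cosine similarity
    sim = v @ v.t() / t                           # (M, M) matrix of s_ij / t
    M = v.size(0)

    same = y.unsqueeze(0) == y.unsqueeze(1)       # l_{y_i = y_j}
    eye = torch.eye(M, dtype=torch.bool, device=v.device)
    pos_mask = same & ~eye                        # positives of each anchor (l_{i != j})
    neg_mask = ~same                              # l_{y_i != y_k}

    exp_sim = sim.exp()
    neg_sum = (exp_sim * neg_mask).sum(dim=1, keepdim=True)   # sum_k exp(s_ik / t) over negatives
    log_prob = sim - torch.log(exp_sim + neg_sum)             # ln[exp(s_ij/t) / (exp(s_ij/t) + neg_sum)]

    # 1 / (M_{y_i} - 1) average over the positives of each anchor, then sum over anchors
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    loss_per_anchor = -(log_prob * pos_mask).sum(dim=1) / n_pos
    return loss_per_anchor.sum()
```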

In the weakly-supervised scene, we must consider the negative effect of weak instances whose noisy labels are inconsistent with the sentiment orientation of the texts. Contrastive training with the SupCon loss has a natural ability to alleviate this noise because: (1) the loss function has the property of self-discovering hard negative samples. Hard negative samples have different labels from the anchor sample but embedding features very close to the anchor embedding; the loss function penalizes hard negative samples to improve the quality of the learned representations. (2) Choosing a suitable temperature hyper-parameter controls the distance between the anchor sample and hard negative samples [32]: a small value enlarges the distance, whereas a large value shortens it. When a noisy instance is sampled as a negative sample of a “clean” anchor, it may violate the training objective and move toward the wrong category. In this case, a larger temperature value can restrict the noisy instance’s incorrect movement. However, if the temperature is excessively large, true hard negative samples and the anchor may not separate sufficiently. Hence, a suitable temperature both improves anti-noise capability and sufficiently separates the true hard negative samples. (3) Setting a large mini-batch size reduces the relative influence of sampled noisy instances, since the proportion of noisy instances in the weakly-labeled dataset is small. This setting reduces the impact of noisy data, resulting in fewer erroneous movements during training. However, a large batch is memory-consuming. Hence, we propose a simple but effective anti-noise strategy to handle this problem.

3.2.2 Dropping-layer

In the pre-training phase, by introducing the SupCon loss, the model learns rich sentiment information and contrast patterns from a large amount of weakly-labeled data. Li et al. [19] found that noisy instances have a large negative impact on the last few layers of the model. Despite SupCon’s intrinsic anti-noise capacity, it cannot fully deal with this issue. As a consequence, we devise a dropping-layer strategy whose key idea is to drop the top layers that are susceptible to noisy samples. The bottom features of deep neural networks have excellent generalization ability [40], while noisy instances mainly impact the top layers [19]. Considering that the properties of the bottom and top features differ, we first divide the pre-trained model into an upper part and a lower part from the middle, and then devise two dropping strategies to evaluate which layers should be removed: one discards layers incrementally from the last layer (i.e. top-to-bottom), and the other discards layers incrementally from the first layer (i.e. bottom-to-top). Figure 3 depicts only the top-to-bottom strategy, because the two dropping processes are mirror images of each other.

Fig. 3 The dropping-layer strategy, using the Bert-base encoder as an example. Layers 1–12 form the encoder and layers 13–14 are the projection structure. The strategy includes two steps: (1) we divide the whole network into a lower and an upper part, using the middle layer as the boundary; in this network the boundary is the 7th layer, with layers 8–14 belonging to the upper part and layers 1–7 to the lower part. (2) We attempt to drop the appropriate layers of the upper part. D14 denotes discarding the 14th layer, D13–14 the 13th to 14th layers, and so on. We evaluate the dropping-layer strategy by removing different top layers; WACL-Bert achieves its best classification performance on the Amazon dataset when the 12th to 14th layers are removed. We investigate the dropping-layer strategy in Sect. 4.5

The dropping-layer technique not only mitigates the negative impact of noisy data on the top layers of the deep model, but also reduces the model’s parameter scale. The investigation of the dropping-layer strategy provides empirical guidance: we only need to remove layers incrementally from the last layer to find the best dropping configuration (see Sect. 4.5).
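As a concrete illustration of the dropping-layer step, the sketch below truncates a Bert-base encoder assuming the HuggingFace BertModel layout (encoder.layer); the helper name and the default of keeping 11 encoder layers (the D12–14 setting, where the projection head is simply not reused at fine-tuning time) are our assumptions.

```python
import torch.nn as nn
from transformers import BertModel

def drop_top_layers(encoder: BertModel, keep: int = 11) -> BertModel:
    """Sketch of the dropping-layer step for a 12-layer Bert-base encoder.

    keep=11 corresponds to the D12-14 setting: the 12th transformer layer is
    discarded here, and the two projection layers (13-14) are not carried over
    to fine-tuning.
    """
    encoder.encoder.layer = nn.ModuleList(encoder.encoder.layer[:keep])
    encoder.config.num_hidden_layers = keep   # keep the config consistent
    return encoder
```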

3.2.3 Supervised fine-tuning phase

We use the dropping-layer technique to remove the layers that are sensitive to noisy data while keeping the bottom pre-trained component, which has learned rich sentiment patterns from vast amounts of weakly-labeled data. Then, on top of the remaining model, we add a classification layer and use labeled data to fine-tune the parameters of the entire network in a standard supervised training paradigm. We use cross-entropy as the loss function:

$$\begin{aligned} {L_{ce}} = - \frac{1}{M}\sum \limits _{i = 1}^M {\sum \limits _{j = 1}^C {{y_{ij}}} } \ln {p_{ij}}, \end{aligned}$$
(7)

where M and C are the size of a mini-batch and the number of categories, respectively. \({y_{ij}} \in \left\{ {0,1} \right\}\) equals 1 if the category of the i-th sample is j and 0 otherwise, and \({{p_{ij}}}\) is the predicted probability that the i-th sample belongs to class j.
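The fine-tuning stage can be sketched as follows, assuming the truncated encoder from the previous sketch as the remaining model. The class name, hidden size, and the use of the [CLS] vector are illustrative assumptions consistent with Sect. 3.1, and the softmax of Eq. (5) is folded into PyTorch's cross-entropy loss (Eq. (7)).

```python
import torch.nn as nn

class WACLClassifier(nn.Module):
    """Sketch of the fine-tuning model: remaining (truncated) pre-trained
    encoder plus a fresh classification layer as in Eq. (5)."""
    def __init__(self, remaining_model: nn.Module, hidden: int = 768, num_classes: int = 2):
        super().__init__()
        self.rm = remaining_model                  # truncated BertModel
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        # [CLS] vector of the remaining model as the high-level feature r
        r = self.rm(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.classifier(r)                  # logits

# Eq. (7): standard cross-entropy on the small labeled set; the softmax of
# Eq. (5) is applied implicitly inside nn.CrossEntropyLoss.
# criterion = nn.CrossEntropyLoss()
# loss = criterion(model(input_ids, attention_mask), labels)
```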

3.3 The behavior of SupCon loss for noisy data

As described in Sect. 3.2.1, adjusting the temperature parameter of the SupCon loss can mitigate the negative effects of noisy samples in some cases [32]: (1) as the temperature decreases, the loss imposes a greater penalty on negative samples with high similarity to the anchor sample (hard negative samples) to force them apart; (2) as the temperature increases, the penalty becomes uniform across all negative samples, which reduces their tendency to separate. Next, we discuss the behavior of the SupCon loss for an anchor sample i in two cases.

Case 1: the anchor sample i is a noisy sample.

  a. Binary sentiment classification.

    (a) Positive sample j:

      (a) if j is a noisy sample and its true sentiment polarity is the same as that of the anchor i, the SupCon loss pulls them closer together, so the negative effect of the noisy label is alleviated.

      (b) if j is a clean sample and its true sentiment polarity is different from that of the anchor i, the SupCon loss still pulls them closer. This wrong movement is inevitable.

    (b) Negative sample k:

      (a) if k is a clean sample and its true sentiment polarity is the same as that of the anchor i, the embedding of k lies closer to the anchor i than those of the other negative samples in the mini-batch, i.e. k is a hard negative sample. We expect k and i to stay close because they share the same true sentiment polarity, so we can set a larger temperature value to alleviate the wrong movement caused by the noisy label of the anchor.

      (b) if k is a noisy sample and its true sentiment polarity is different from that of the anchor i, the SupCon loss pushes them apart, since they have different weak labels.

  b. Multiple sentiment classification. In addition to all of the above cases, multi-class sentiment classification has the following special cases:

    (a) Positive sample j:

      (a) if j is a noisy sample and its true sentiment polarity is the same as that of the anchor i, the negative effect is inevitable.

    (b) Negative sample k:

      (a) if k is a clean sample and its true sentiment polarity is different from that of the anchor i, the SupCon loss can push them apart.

      (b) if k is a noisy sample and its true sentiment polarity is the same as that of the anchor i, then k is a hard negative sample. We can use the same hyper-parameter setting as in Case 1-a-(b)-(a) to alleviate the negative effect of the noisy sample.

Case 2: the anchor sample i is a clean sample.

  a. Binary sentiment classification.

    (a) Positive sample j:

      (a) if j is a clean sample and its true sentiment polarity is the same as that of the anchor i, the SupCon loss pulls them closer.

      (b) if j is a noisy sample and its true sentiment polarity is different from that of the anchor i, the negative effect of the noisy sample is inevitable.

    (b) Negative sample k:

      (a) if k is a clean sample, its true sentiment polarity is different from that of the anchor i, and the SupCon loss correctly pushes them apart.

      (b) if k is a noisy sample, its true sentiment polarity is the same as that of the anchor i, so k is a hard negative sample. We can use the same hyper-parameter setting as in Case 1-a-(b)-(a) to alleviate the negative effect of the noisy sample.

  b. Multiple sentiment classification. In addition to all of the above cases, multi-class sentiment classification includes one special case: if k is a noisy sample and its true sentiment polarity is different from that of the anchor i, the SupCon loss pushes them apart.

Note that if a hard negative sample is a noisy sample, the behavior of the SupCon loss is the same as in the above cases where a negative sample k is a noisy sample.

4 Experiments and analysis

4.1 Datasets

We conduct experiments on the Amazon product review dataset [9], the Twitter dataset [28] and the SST5 dataset [41]. The customer reviews are collected from three domains (digital cameras, cell phones and laptops) and comprise 1,143,721 sentences with rating information plus 11,220 labeled sentences. The Twitter dataset contains 1,100,000 sentences with emojis (Footnote 3) and 4,714 labeled sentences. Sentences containing :), : ), :-), :D, or =) are assigned a positive label, while sentences containing :(, : ( or :-( are assigned a negative label (Footnote 4). We use the Twitter sentiment classification benchmark of SemEval 2013 as the labeled data, with neutral sentences removed (Footnote 5). Statistics for these two datasets are shown in Table 1. For the supervised fine-tuning step, we split the labeled data of the Amazon dataset into a training set (70%), a validation set (10%), and a test set (20%). Since the labeled Twitter dataset is smaller, we increase the proportion of the test set: the labeled set is divided into a training set (60%), a validation set (10%) and a test set (30%). The ratio of positive to negative samples in each of the above sets is 1:1.
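To make the tagging rule concrete, the helper below shows one way to turn the listed emoticons into weak labels; the function name and the decision to skip tweets with no or conflicting emoticons are our assumptions, not part of the original pipeline.

```python
POSITIVE_TAGS = {":)", ": )", ":-)", ":D", "=)"}
NEGATIVE_TAGS = {":(", ": (", ":-("}

def emoticon_weak_label(tweet: str):
    """Assign a weak sentiment label from the emoticons listed above.

    Returns 1 (positive), 0 (negative), or None when no emoticon is present
    or the signals conflict (hypothetical handling of ambiguous tweets).
    """
    has_pos = any(tag in tweet for tag in POSITIVE_TAGS)
    has_neg = any(tag in tweet for tag in NEGATIVE_TAGS)
    if has_pos == has_neg:        # none found, or conflicting signals
        return None
    return 1 if has_pos else 0
```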

The SST5 dataset (Footnote 6) contains 11,855 labeled sentences divided into 5 categories: 0 (very negative), 1 (negative), 2 (neutral), 3 (positive), and 4 (very positive). The detailed distribution of the categories is shown in Table 2. In order to simulate real-world scenarios, we split the SST5 data as follows: we randomly select 3,000 sentences from the SST5 labeled data as the fine-tuning corpus, 1,000 sentences as the test set, and use the remaining sentences as the pre-training corpus. The fine-tuning corpus is divided into a training set (90%) and a validation set (10%). The categories are equally represented in the fine-tuning corpus and the test set. For the pre-training corpus, we randomly replace 20% of the sentences’ true labels with other labels (20% is roughly the average noise ratio of the other two datasets). We also conduct experiments with different noise ratios in Sect. 4.7.
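The label-noise injection for the SST5 pre-training corpus can be sketched as follows; the helper is hypothetical, but it mirrors the described procedure of replacing a fraction of true labels with a different randomly chosen class.

```python
import random

def inject_label_noise(labels, noise_ratio: float = 0.2, num_classes: int = 5, seed: int = 0):
    """Replace a fraction of the true labels with a different, randomly
    chosen class (20% by default, as used for the SST5 pre-training corpus)."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_noisy = int(len(noisy) * noise_ratio)
    for idx in rng.sample(range(len(noisy)), n_noisy):
        choices = [c for c in range(num_classes) if c != noisy[idx]]
        noisy[idx] = rng.choice(choices)
    return noisy
```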

Table 1 Statistics about the Amazon and Twitter dataset
Table 2 SST5 labeled data

4.2 Experimental settings

We remove the stop words from the texts and set the input length to 150, 47 and 40 for the Amazon, Twitter and SST5 datasets respectively. Further training settings are shown in Table 3.

Table 3 Training settings. “64/128” means that the mini-batch size is 64 when training a model with a Bert encoder and 128 otherwise; other parameters written this way follow the same convention

4.3 Baselines and main comparison

We employ Accuracy and Macro-F1 as the evaluation metrics. The baseline methods are as follows:

Lexicon: A lexicon-based method [42]. It uses external evidence from other sentences and linguistic conventions to identify the polarity of opinion words.

NBSVM: NBSVM [43] combines naive Bayes and a support vector machine to construct a linear classifier for sentiment classification.

SSWE: SSWE [44] introduces sentiment information into a neural network to learn sentiment-specific word vectors on weakly-labeled data.

gMLP: A novel MLP-based neural network [20] that achieves good performance on the sentiment classification task.

WSD: A weakly-supervised method based on a multi-channel CNN, which adopts the triplet loss during the pre-training phase and adds subject (aspect) information in the fine-tuning phase [9]. For fairness, none of the methods in the following experiments use this aspect information.

CESCL: CESCL fine-tunes a pre-trained language model with a loss function that combines the cross-entropy loss and the supervised contrastive loss [45]. We adopt Bert-base as the pre-trained language model.

Bert: Training Bert-base with labeled data.

Bert-Weak: We train Bert-base using the weakly-labeled data as if it were labeled data. This baseline tests whether the weakly-labeled data has a negative impact on the training phase.

WACL-Bert w/o drop: Our method without the dropping-layer strategy.

WACL-Bert: Our method.

Note that all the above methods are evaluated on the Amazon, Twitter and SST5 datasets. SSWE, WSD, WACL-Bert w/o drop and WACL-Bert include a pre-training phase on the weakly-labeled data. The experimental results are shown in Table 4.

Table 4 The experimental results of baselines on Amazon dataset, Twitter dataset and SST5 dataset

The results (Footnote 7) on the three datasets show similar trends. WACL-Bert achieves the best classification performance while Lexicon achieves the worst: the manually constructed sentiment lexicon cannot cover all opinion terms in a specific dataset, and the designed rules cannot account for the varied contextual information of the sentences. NBSVM achieves higher accuracy, but its performance is constrained by its limited feature generalization. SSWE performs slightly better than NBSVM thanks to the ad hoc learning of sentiment embeddings on the weakly-labeled data; however, a linear classifier’s representation capacity is inferior to that of deep models. gMLP stacks multiple non-linear MLP layers, allowing the model to learn rich semantic information from the labeled data, so it outperforms SSWE. WSD outperforms gMLP thanks to the triplet training on weakly-labeled data. Bert performs better than WSD because of the rich prior knowledge of the pre-trained Bert. CESCL adds a contrastive loss to guide Bert-base to learn contrast patterns from the labeled data, resulting in better performance. Compared to our method, the accuracy of Bert-Weak drops dramatically, illustrating the harmfulness of the noisy instances. The performance degradation caused by ablating the dropping-layer strategy demonstrates its anti-noise capability and confirms that noisy-labeled samples indeed have a negative impact on the top layers of the pre-trained model.

4.4 WACL adopting different encoders

We investigate the performance of WACL with different deep encoders, including MLP, CNN and LSTM. Below we give a brief overview of the structures illustrated in Fig. 4.

Fig. 4 MLP-based model, CNN-based model and LSTM-based model. RM represents the remaining model

Fig. 5 The structure of the MixerBlock

MLP-based model. It uses the MixerBlock module [46]; the structure of the MixerBlock is shown in Fig. 5. A MixerBlock contains two types of MLP: a Dimension-Mix MLP and a Token-Mix MLP. The former operates on each token independently, allowing information exchange across the token’s vector dimensions, while the latter fuses information across tokens. The MixerBlock also employs skip connections and layer normalization. The mapping of the MixerBlock is computed as follows:

$$\begin{aligned} {{\mathbf {U}}} = {{\mathbf {S}}} + {(\sigma ({(LN({{\mathbf {S}}}))^T}{{{\mathbf {W}}}_{{\mathbf {1}}}}){{{\mathbf {W}}}_{{\mathbf {2}}}})^T},\end{aligned}$$
(8)
$$\begin{aligned} {{\mathbf {Y}}} = {{\mathbf {U}}} + \sigma (LN({{\mathbf {U}}}){{{\mathbf {W}}}_{{\mathbf {3}}}}){{{\mathbf {W}}}_{{\mathbf {4}}}}, \end{aligned}$$
(9)

where \({{\mathbf {S}}} \in {{\mathbb {R}}^{t \times d}}\) denotes an input sentence, t is the length of the sentence, and d is the dimension of the word vectors (we set d to 300). \({{\mathbf {U}}}\in {{\mathbb {R}}^{t \times d}}\) is the output of the Token-Mix MLP and \({{\mathbf {Y}}} \in {{\mathbb {R}}^{t \times d}}\) is the output of the Dimension-Mix MLP. \({{{\mathbf {W}}}_{{\mathbf {1}}}} \in {{\mathbb {R}}^{t \times q}}\), \({{{\mathbf {W}}}_{{\mathbf {2}}}} \in {{\mathbb {R}}^{q \times t}}\), \({{{\mathbf {W}}}_{{\mathbf {3}}}} \in {{\mathbb {R}}^{d \times q}}\) and \({{{\mathbf {W}}}_{{\mathbf {4}}}} \in {{\mathbb {R}}^{q \times d}}\) are the parameter matrices of the fully-connected layers; we empirically set q = 512. LN and \(\sigma\) denote layer normalization and the Gelu activation function respectively. We stack 6 MixerBlocks as the encoder layer and 2 MixerBlocks as the projection layer, and apply average pooling to the output of the projection layer.
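A PyTorch sketch of the MixerBlock defined by Eqs. (8)–(9) is shown below; the class name is ours, and the nn.Linear layers carry bias terms even though the equations omit them.

```python
import torch.nn as nn

class MixerBlock(nn.Module):
    """Sketch of Eqs. (8)-(9): token-mixing then dimension-mixing MLPs with
    layer normalization, GELU, and skip connections (d=300, q=512)."""
    def __init__(self, t: int, d: int = 300, q: int = 512):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.token_mix = nn.Sequential(nn.Linear(t, q), nn.GELU(), nn.Linear(q, t))
        self.ln2 = nn.LayerNorm(d)
        self.dim_mix = nn.Sequential(nn.Linear(d, q), nn.GELU(), nn.Linear(q, d))

    def forward(self, S):                       # S: (batch, t, d)
        # Eq. (8): mix information across tokens (MLP applied along the sequence axis)
        U = S + self.token_mix(self.ln1(S).transpose(1, 2)).transpose(1, 2)
        # Eq. (9): mix information across feature dimensions
        Y = U + self.dim_mix(self.ln2(U))
        return Y
```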

CNN-based model. We stack 5 identical convolutional layers as the encoder layer and apply max pooling to the output of the encoder layer. For each convolutional layer, the kernel size is 3, the number of output channels is 300, and the activation function is Relu. We use two standard MLPs as the projection layer, with Gelu as the activation function.

LSTM-based model. We stack 5 identical LSTM layers as the encoder layer. For each LSTM layer, the dimension of the hidden state is 300. The projection layer is the same as in the CNN-based model.

In this experiment, the word lookup table used in the pre-training phase is initialized with the publicly available 300-dimensional word2vec vectors trained on 100 billion words of Google News [47]. Unlike the Bert-based model, the above three models have 8, 7, and 7 backbone layers, respectively. We therefore again divide each model into an upper part and a lower part, using the 4th layer as the boundary. The partitions are shown in Table 5 and the experimental results in Table 6. The WACL variants are fine-tuned on the labeled datasets. Baseline-Weak and Baseline-Clean denote the above models trained directly on the weakly-labeled data and on the labeled data, respectively. The poor performance of the Weak baselines confirms the negative effect of the noisy samples. The results also show that WACL boosts the performance of the Clean baselines, because the contrastive pre-training and the dropping-layer strategy enable the models to learn sentiment representations well from noisily labeled data and improve their performance on the downstream tasks.

Table 5 Partitions of lower part and upper part
Table 6 Performance comparison on Amazon dataset, Twitter dataset and SST5 dataset

4.5 Investigation of the dropping-layer strategy

The dropping-layer strategy improves model robustness to noisy instances. We conduct experiments to search for an optimal dropping configuration: we first divide the pre-trained model into a lower part and an upper part, and then, using the two discard strategies (bottom-to-top and top-to-bottom), we remove layers incrementally until we find the best performance. Table 7 shows the performance of dropping different layers.

Table 7 The experimental results of the dropping-layer strategy on Amazon dataset, Twitter dataset and SST5 dataset

WACL adopts “contrastive pre-training + dropping-layer”, which further improves the classification performance compared to the WACL models w/o drop. From Table 7, the classification accuracy of WACL-Bert(D12–14) is 90.9% on the Amazon dataset, 0.6% higher than WACL-Bert w/o drop. Similarly, WACL-Bert(D11–14) is 0.5% more accurate than WACL-Bert w/o drop on the Twitter dataset, and on the SST5 dataset WACL-Bert(D12–14) exceeds WACL-Bert w/o drop by 0.7%. We find that if we keep the lower part intact and remove layers of the upper part starting from the last layer, the performance first improves and then degrades. This observation is consistent with the conclusions that (1) the top layers are susceptible to noisy instances and (2) the bottom layers of the pre-trained model have learned the sentiment semantics. Furthermore, the dropping-layer method reduces the model’s parameter scale: on the Amazon, Twitter and SST5 datasets, the number of parameters is reduced by 8.1 M, 15.2 M and 8.1 M respectively.

4.6 WACL with different pre-training loss functions

To demonstrate the benefit of the contrastive loss, we compare the SupCon loss with the cross-entropy loss and the triplet loss [6] on the Amazon and SST5 datasets. The experimental results are shown in Table 8. Essentially, the triplet loss and the contrastive loss both capture the relative relationships between samples. However, because the triplet loss samples only one positive and one negative instance for each anchor sample, it cannot learn contrast patterns as rich as those learned by the contrastive loss. Furthermore, sampling multiple positive and negative samples improves the model’s robustness to noisy instances. Hence the contrastive loss outperforms the triplet loss. The cross-entropy loss performs worst since it lacks sufficient discrimination between inter-class samples [48].

Table 8 The experimental results of the different pre-training strategies

Furthermore, we use intra-class and inter-class metrics [6] to assess the quality of the representations learned with the different pre-training strategies. We feed the test samples into the pre-trained model to obtain their high-level feature vectors, and then compute the inter-class and intra-class average distances of the sample vectors. A better pre-training strategy yields a sample distribution with a larger inter-class distance and a smaller intra-class distance. The two metrics are calculated as follows:

$$\begin{aligned}&{D_{inter}} = \frac{1}{|\Psi |}\sum _{(s_i,s_j)\in \Psi } dst(s_i,s_j), \end{aligned}$$
(10)
$$\begin{aligned}&{D_{intra}} = \frac{1}{|\Phi |}\sum _{(s_v,s_k)\in \Phi } dst(s_v,s_k), \end{aligned}$$
(11)

where \({D_{inter}}\) and \({D_{intra}}\) are the inter-class and intra-class average distances, respectively, \(\Psi\) and \(\Phi\) are the sampled pairs from different categories and from the same category, and \(dst(\cdot )\) is the Euclidean distance. The results are shown in Fig. 6. Pre-training with the cross-entropy loss produces the worst results due to its inability to capture inter-class patterns. The contrastive training strategy produces a larger inter-class average distance and a smaller intra-class average distance than the triplet training strategy, indicating that it better captures the contrast patterns in the training samples.
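Eqs. (10)–(11) can be computed directly from the projected test-set vectors, e.g. with the sketch below; the function name is ours and, as an assumption, it averages over all unordered pairs rather than a sampled subset.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def inter_intra_distance(vectors: np.ndarray, labels: np.ndarray):
    """Sketch of Eqs. (10)-(11): average Euclidean distance over pairs of test
    samples from different classes (inter) and from the same class (intra)."""
    dist = squareform(pdist(vectors, metric="euclidean"))    # (N, N) distance matrix
    same = labels[:, None] == labels[None, :]                # same-class mask
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)     # count each unordered pair once
    d_intra = dist[same & upper].mean()
    d_inter = dist[~same & upper].mean()
    return d_inter, d_intra
```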

Fig. 6 Inter-class average distance and intra-class average distance for different pre-training strategies

4.7 The effect of varying noise ratio on classification performance

In this section, we conduct experiments on the Amazon labeled dataset and the SST5 dataset to study the performance of the proposed method under different label noise ratios. For binary sentiment classification, we randomly select 50% of the sentences from the Amazon labeled data for the pre-training phase, 20% for fine-tuning, and use the remaining 30% as the test set. For the pre-training subset, we randomly replace the true labels of sentences with the opposite labels, setting the noise ratio from 10% to 50%. Since the task is binary classification and pre-training only learns the relative relationship between the two categories, varying the ratio from 50% to 100% would produce a mirrored set of results. For multi-class sentiment classification, we follow the procedure described in Sect. 4.1 and set the noise ratio from 10% to 100%. To investigate the anti-noise capabilities of the contrastive pre-training and the dropping-layer method, we compare WACL-Bert with Bert-Clean, Bert-Weak and WACL-Bert w/o drop. The results are reported in Tables 9 and 10.

Table 9 The experimental results on the Amazon labeled dataset
Table 10 The experimental results on the SST5 dataset

The performance of Bert-Weak decreases sharply as the noise ratio increases, indicating the significant harm caused by noisy data. From Table 9, when the noise ratio exceeds 30%, WACL-Bert without the dropping-layer method is clearly inferior to Bert-Clean, while WACL-Bert outperforms its w/o drop counterpart. On the SST5 dataset, WACL-Bert without the dropping-layer method is worse than Bert-Clean once the noise ratio exceeds 60%, whereas our full method falls below Bert-Clean only at a 100% noise ratio and outperforms WACL-Bert w/o drop at every noise ratio. From Table 10, we can see that the dropping-layer strategy shows its strength when the noise scale is extremely large: it discards the layers that are easily affected by noise and greatly mitigates the negative effects of noisy data, so the model still outperforms Bert-Clean even at a 90% noise ratio. These results demonstrate that the anti-noise capacity of the contrastive pre-training is limited at high noise ratios, whereas the dropping-layer strategy can still mitigate the negative effect of the noisy samples under the same conditions. The likely explanation is that when the noise scale is large, the top layers of the model are more easily influenced, so dropping layers is more appropriate in such circumstances. In summary, WACL-Bert is robust to label noise even at a high noise ratio.

4.8 Visualization of the embedding space

To intuitively show the effect of representation learning, we use t-SNE [49] to project the test samples’ high-dimensional feature vectors, which are generated by the projection layer of the final model. The visualization results are shown in Fig. 7. From left to right, the captions above each sub-figure, i.e., random initialization, after pre-training, and after fine-tuning, refer to the three statuses of the model; the inter-class and intra-class average distances are also shown at the top of each sub-figure. In Fig. 7a, the test samples are scattered since the model parameters are only randomly initialized. After the contrastive pre-training, Fig. 7b shows that the inter-class average distance increases while the intra-class average distance decreases, indicating that the model has learned the contrast patterns from the weakly-labeled data. Figure 7c reveals a further improvement in the model’s ability to distinguish different sentiment polarities.
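The visualization can be reproduced with a short t-SNE routine such as the one below; the function name, figure size, and color map are our choices, and scikit-learn's TSNE plus matplotlib are assumed (the paper does not specify its plotting code).

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_space(vectors, labels, title):
    """Project high-level feature vectors of the test samples to 2-D with
    t-SNE and color the points by sentiment label."""
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(vectors)
    plt.figure(figsize=(4, 4))
    scatter = plt.scatter(points[:, 0], points[:, 1], c=labels, s=5, cmap="coolwarm")
    plt.legend(*scatter.legend_elements(), title="label")
    plt.title(title)
    plt.show()
```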

Fig. 7 Visualization of the embedding space. “pos” and “neg” represent the two sentiment categories of the Amazon dataset, “0”–“4” represent the five sentiment categories of the SST5 dataset, and \({D_{inter}}\) and \({D_{intra}}\) are the inter-class and intra-class average distances, respectively

4.9 Temperature hyper-parameter tuning

As described in Sect. 3.2.1, selecting an appropriate temperature hyper-parameter t can improve the model’s performance on downstream tasks. We vary t and evaluate the model on the validation sets, searching over a relatively narrow interval of values. The tuning results are shown in Fig. 8. The curves for the two datasets follow the same pattern: a temperature that is too large or too small causes performance degradation. We eventually set the temperature hyper-parameter to 0.5.

Fig. 8 The effect of the temperature hyper-parameter on classification performance

5 Conclusion

In this paper we introduced the WACL framework, which learns robust representations from huge volumes of user-tagged opinionated texts containing label noise. We first adopt contrastive pre-training to learn rich sentiment information and contrast patterns from the weakly-labeled data, producing a good sentiment embedding space. Second, we remove the top layers of the pre-trained model that are susceptible to noisy data, and finally we fine-tune the model to further improve classification performance on the downstream tasks. The experimental results show that WACL can greatly boost the performance of deep models while providing good anti-noise ability even at high noise ratios.