1 Introduction

Sentiment classification, which identifies the sentiment polarity of a review or sentence, has attracted increasing research attention over the past decades. Traditional sentiment classification methods generally perform well on a specific domain with abundant labeled data [1,2,3]. However, because labeling data is expensive and time-consuming, many domains lack sufficient labeled data, which makes traditional methods perform poorly.

To address this problem, cross-domain sentiment classification has been proposed. It uses knowledge from a source domain with sufficient labeled data to improve prediction accuracy in a target domain with few or no labeled data. Researchers have proposed many methods for this problem. Early work learns domain-shared and domain-specific features, exploiting words with high co-occurrence across domains and domain-independent words [4, 5]. However, these methods require manually extracting domain-independent words. More recently, deep neural networks have been used to learn better sample features [6,7,8,9]. Domain-Adversarial training of Neural Networks (DANN) [8] adds an adversarial mechanism to the training of a deep neural network: it introduces a domain classifier that minimizes the discrepancy between the source and target domains via gradient reversal. However, most previous efforts focus only on aligning the global marginal distribution, ignoring category-level alignment. As shown in Fig. 1 (left), positive/negative samples from one domain may be aligned with negative/positive samples from the other. This mismatch promotes negative transfer and reduces classification accuracy.

Fig. 1. “+” and “−” denote positive and negative samples respectively. Left: domain adaptation without category-level alignment. Right: domain adaptation with category-level alignment

To overcome this sample mismatch, we propose the Category-level Adversarial Network (CAN) for cross-domain sentiment classification. CAN achieves category-level alignment by introducing category-wise domain classifiers, as shown in Fig. 1 (right). CAN builds a category-level adversarial network by combining label information with document representations: the probability distribution over the label space decides how much of each document is sent to each category-wise domain classifier. Besides, words with sentiment polarity usually contribute more to the document representation, so CAN uses a hierarchical attention transfer mechanism that automatically transfers word-level and sentence-level attentions across domains. The main contributions of our work are summarized as follows:

  • We introduce category-level information to achieve fine-grained alignment of different data distributions.

  • We propose the CAN method, which achieves category-level alignment by adding category-wise domain classifiers that combine label information with document representations. Besides, its hierarchical attention transfer mechanism transfers attentions by assigning different weights to words and sentences.

  • The experimental results clearly demonstrate that our method outperforms other state-of-the-art methods.

2 Related Work

Domain Adaptation: Domain adaptation has been studied extensively in natural language processing over the past decades. Blitzer et al. [4] proposed Structural Correspondence Learning (SCL), which induces correspondences among features from different domains. Pan et al. [5] proposed Spectral Feature Alignment (SFA), which addresses the mismatch of data distributions by aligning domain-specific words. Unfortunately, these methods rely heavily on manually selected domain-shared features.

Recently, deep learning methods have produced better feature representations for cross-domain sentiment classification. Glorot et al. [6] proposed Stacked Denoising Auto-encoders (SDA), which successfully learn feature representations of documents from different domains. Chen et al. [7] proposed the Marginalized Stacked Denoising Autoencoder (mSDA), which reduces computing cost and scales better to high-dimensional features. Kim, Yu et al. [10, 11] used two auxiliary tasks to produce sentence embeddings based on convolutional neural networks. DANN leverages adversarial training to produce feature representations [8]. Li et al. [12] proposed the Adversarial Memory Network (AMN), which automatically obtains pivots using attention and adversarial training. Li et al. [9] proposed the Hierarchical Attention Transfer Network (HATN), which transfers word-level and sentence-level attentions. Zhang et al. [13] proposed the Interactive Attention Transfer Network (IATN), which provides an interactive attention transfer mechanism by combining sentence and aspect information. Peng et al. [14] proposed CoCMD, which simultaneously extracts domain-specific and domain-invariant representations. Sharma et al. [15] proposed a method that identifies transferable information by searching for significant consistent polarity (SCP) words. However, these methods only align the global marginal distribution by fooling domain classifiers, which brings category-level mismatch. To solve this problem, we align the category-level distribution by adding label information.

Attention Mechanism: Each word contributes differently to a document. To model this, attention mechanisms have been used in many tasks, such as machine translation [16], sentiment analysis [3], document classification [17], and question prediction [18]. Moreover, hierarchical attention outperforms word-level attention alone, capturing better feature representations because it reflects the hierarchical structure of the document.

3 Category-Level Adversarial Network

In this section, we first introduce the problem definition of cross-domain sentiment classification, followed by a summary of the model. Finally, we present the details of the CAN model.

3.1 Problem Definition

We assume that there are two domains \(D_s\) and \(D_t\), denoting a source domain and a target domain respectively. We are given a set of labeled training data \({{\mathbf {X_s^l}}}=\{x_s^i, y_s^i\}_{i=1}^{N_s^l}\) and unlabeled training data \({{\mathbf {X_s^u}}}=\{x_s^j\}_{j=N_s^l+1}^{N_s}\) from the source domain, where \(N_s^l\) is the number of labeled samples and \(N_s\) is the total number of source-domain samples. Besides, we are given a set of unlabeled training data \({{\mathbf {X_t}}}=\{x_t^j\}_{j=1}^{N_t}\) in the target domain, where \(N_t\) is the number of unlabeled samples. The goal of cross-domain sentiment classification is to train a robust model on the labeled data and adapt it to predict sentiment labels on the unlabeled target-domain data.

3.2 An Overview of CAN

We give an overview of CAN in Fig. 2. First, we obtain document representations with the Transferable Attention Network (TAN). Then we apply the Category-level Adversarial (CA) classifiers, which combine label information and document representations in the adversarial process. We train the CA classifiers on the source-domain data and the unlabeled target-domain data, and train the sentiment classifier on the labeled data. Finally, TAN and the sentiment classifier predict the sentiment label.

Fig. 2. The architecture of the CAN model, where \(\widehat{y}\) is the predicted sentiment label and \(\widehat{d}\) is the predicted domain label; y and d are the ground truth

3.3 Components of CAN

CAN mainly includes three parts: TAN, which transfers word-level and sentence-level attentions across domains; the sentiment classifier, which predicts sentiment labels; and the CA classifiers, which align categories from the source to the target domain. TAN comprises word attention transfer and sentence attention transfer. TAN produces the document representations, and the sentiment classifier produces the labels. On top of a single domain classifier, the CA classifiers add K category-wise domain classifiers, which combine document representations with pseudo-labels to avoid the mismatch. We describe the components of CAN in turn.

Transferable Attention Network: To transfer important words across domains, we use a hierarchical attention network that assigns different weights to words [17]. Assume a document has L sentences and each sentence \(s_p\) contains Q words, where \(w_{pq}\) is the q-th word in the p-th sentence, \(q \in [1,Q]\). We map words into dense vectors through an embedding matrix M, \(x_{pq} = Mw_{pq}\). The sentence vector \(s_p\) summarizes the representations of all words in the sentence via word attention transfer, and the document vector v summarizes the information of all sentences via sentence attention transfer.
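
As an illustration, the following is a minimal PyTorch sketch of such a hierarchical attention encoder in the style of Yang et al. [17]; the module names, hidden size, and additive-attention parameterization are our assumptions rather than the exact TAN configuration.

    import torch
    import torch.nn as nn

    class Attention(nn.Module):
        """Additive attention pooling: alpha = softmax(u^T tanh(W h + b))."""
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Linear(dim, dim)
            self.context = nn.Linear(dim, 1, bias=False)  # context vector u

        def forward(self, h):                                # h: (batch, steps, dim)
            scores = self.context(torch.tanh(self.proj(h)))  # (batch, steps, 1)
            alpha = torch.softmax(scores, dim=1)             # attention weights
            return (alpha * h).sum(dim=1)                    # weighted sum over steps

    class HierAttentionEncoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=300, hidden=70):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)  # matrix M
            self.word_gru = nn.GRU(emb_dim, hidden, bidirectional=True,
                                   batch_first=True)
            self.sent_gru = nn.GRU(2 * hidden, hidden, bidirectional=True,
                                   batch_first=True)
            self.word_attn = Attention(2 * hidden)
            self.sent_attn = Attention(2 * hidden)

        def forward(self, docs):                       # docs: (batch, L, Q) word ids
            B, L, Q = docs.shape
            x = self.embedding(docs.reshape(B * L, Q))  # x_pq = M w_pq
            h_w, _ = self.word_gru(x)                   # word-level states
            s = self.word_attn(h_w).view(B, L, -1)      # sentence vectors s_p
            h_s, _ = self.sent_gru(s)                   # sentence-level states
            return self.sent_attn(h_s)                  # document vector v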

Category-Level Adversarial: In domain adaptation problems, the data distribution is usually complicated, and complete alignment is hard to achieve. Incorrect alignment is prone to under-transfer or negative transfer. To enhance positive transfer and combat negative transfer, we need a mechanism that aligns the data distributions at a finer granularity.

We propose the CA classifiers. On top of the single domain classifier \(G_d\), we add K category-wise domain classifiers \(G_d^k\ (k = 1,2,\ldots ,K)\), where K is the number of categories. Since the target-domain data is unlabeled, we use the output of the sentiment classifier \(\widehat{y} = G_c(v)\) as its label probabilities; the same is done for the source domain. Much like an attention mechanism, \(\widehat{y}\) indicates how strongly a sample belongs to each category, so category-level alignment is taken into account during the adversarial process. The document vector v is weighted by \(\widehat{y}\) to form the input of domain classifier \(G_d^k\). We use the data \(X_s^l\), \(X_s^u\) and \(X_t\) to train the domain classifiers \(G_d\) and \(G_d^k\), which predict domain labels. The goal of the domain classifiers is to distinguish the two domains as accurately as possible, while we want to learn common features that the domain classifiers cannot distinguish. To this end, we introduce the Gradient Reversal Layer (GRL) [8], which reverses the gradient during training. The feedforward and backpropagation behaviors are as follows:

$$\begin{aligned} G(x) = x, \frac{\partial G(x)}{\partial x} = -\lambda I \end{aligned}$$
(1)
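
For concreteness, Eq. (1) can be implemented as a custom autograd function, following Ganin et al. [8]; the sketch below is one possible PyTorch realization, with names of our own choosing.

    import torch

    class GradReverse(torch.autograd.Function):
        """Identity in the forward pass; scales the gradient by -lambda backward."""
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # No gradient flows to lam itself, hence the trailing None.
            return -ctx.lam * grad_output, None

    def grl(x, lam=1.0):
        return GradReverse.apply(x, lam)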

The vectors v and \(v \widehat{y^k}\) are the inputs of the domain classifiers \(G_d\) and \(G_d^k\) respectively. Passing them through the GRL gives \(G(v) = \widetilde{v_d}\) and \(G(v \widehat{y^k}) = \widetilde{v_d^k}\), which we feed to the corresponding domain classifiers:

$$\begin{aligned} \widetilde{y_d} = softmax(fc(fc(\widetilde{v_d}))) \end{aligned}$$
(2)
$$\begin{aligned} \widetilde{y_d^k} = softmax(fc(fc(\widetilde{v_d^k}))) \end{aligned}$$
(3)

where fc denotes a fully connected layer.
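
A minimal sketch of the CA classifiers of Eqs. (2)-(3) follows, reusing the grl helper from the previous sketch; the hidden size and the ReLU between the fc layers are assumptions, and the softmax is deferred to the loss.

    import torch.nn as nn

    def domain_classifier(dim, hidden=100):
        # Two stacked fc layers as in Eqs. (2)-(3); softmax applied in the loss.
        return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                             nn.Linear(hidden, 2))

    class CAClassifiers(nn.Module):
        def __init__(self, dim, num_classes, lam=1.0):
            super().__init__()
            self.global_clf = domain_classifier(dim)             # G_d
            self.cat_clfs = nn.ModuleList(                       # G_d^k
                domain_classifier(dim) for _ in range(num_classes))
            self.lam = lam

        def forward(self, v, y_hat):
            # y_hat: (batch, K) class probabilities from the sentiment classifier.
            d_global = self.global_clf(grl(v, self.lam))
            d_cats = [clf(grl(v * y_hat[:, k:k + 1], self.lam))
                      for k, clf in enumerate(self.cat_clfs)]
            return d_global, d_cats                              # domain logits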

Sentiment Classifier: The sentiment classifier is trained on the labeled data from the source domain. Given the document representation v, we calculate its output as follows:

$$\begin{aligned} \widetilde{y_s} = softmax(fc(fc(fc(v)))) \end{aligned}$$
(4)
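
Correspondingly, a sketch of the sentiment classifier \(G_c\) of Eq. (4), with assumed hidden sizes and nonlinearities, is:

    import torch.nn as nn

    def sentiment_classifier(dim, hidden=100, num_classes=2):
        # Three stacked fc layers as in Eq. (4); softmax applied in the loss.
        return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                             nn.Linear(hidden, hidden), nn.ReLU(),
                             nn.Linear(hidden, num_classes))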

Training Strategy: The cross-domain sentiment classifier needs domain-shared features to predict sentiment labels. To achieve this, CAN has two tasks, i.e., domain classification and sentiment classification. We use cross-entropy loss functions to train the CA classifiers and the sentiment classifier respectively:

$$\begin{aligned} L_{dom_{con}} = -\frac{1}{N_s + N_t} \sum \limits _{i=1}^{N_s + N_t} \left[ y_d \ln \widetilde{y_d} + (1-y_d) \ln (1-\widetilde{y_d}) \right] \end{aligned}$$
(5)
$$\begin{aligned} L_{dom_{class}} = -\frac{1}{N_s + N_t} \sum \limits _{k=1}^{K} \sum \limits _{i=1}^{N_s + N_t} \left[ y_d \ln \widetilde{y_d^k} + (1-y_d) \ln (1-\widetilde{y_d^k}) \right] \end{aligned}$$
(6)
$$\begin{aligned} L_{CA} = L_{dom_{con}} + L_{dom_{class}} \end{aligned}$$
(7)
$$\begin{aligned} L_{sen} = -\frac{1}{N_s^l} \sum \limits _{i=1}^{N_s^l} \left[ y_s \ln \widetilde{y_s} + (1-y_s) \ln (1-\widetilde{y_s}) \right] \end{aligned}$$
(8)

where \(y_d\) and \(y_s\) are the ground-truth labels. Besides, we add squared \(l_2\) regularization for the sentiment classifier and the CA classifiers. Finally, the objective function is as follows:

$$\begin{aligned} L = L_{CA} + L_{sen} + \rho L_{reg} \end{aligned}$$
(9)

where \(L_{reg}\) is the regularization term that prevents overfitting, and \(\rho \) is the regularization parameter. The CAN model minimizes L, except for the GRL part, which is maximized. We optimize the parameters by SGD.
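
A sketch of one training step under Eq. (9) is given below, assuming the modules sketched above and PyTorch's cross-entropy on logits (the softmax of Eqs. (2)-(4) is folded into the loss); restricting \(L_{reg}\) to the classifier parameters is a simplification of ours.

    import torch
    import torch.nn.functional as F

    def train_step(encoder, G_c, ca, batch_lab, batch_dom, optimizer, rho):
        x_l, y_s = batch_lab   # labeled source reviews and sentiment labels
        x_d, y_d = batch_dom   # mixed source/target reviews and domain labels

        v_l = encoder(x_l)
        L_sen = F.cross_entropy(G_c(v_l), y_s)                       # Eq. (8)

        v_d = encoder(x_d)
        y_hat = torch.softmax(G_c(v_d), dim=1)                       # pseudo-labels
        d_global, d_cats = ca(v_d, y_hat)
        L_ca = F.cross_entropy(d_global, y_d)                        # Eq. (5)
        L_ca = L_ca + sum(F.cross_entropy(d, y_d) for d in d_cats)   # Eq. (6)

        L_reg = sum(p.pow(2).sum() for p in G_c.parameters())        # squared l2
        loss = L_ca + L_sen + rho * L_reg                            # Eq. (9)

        optimizer.zero_grad()
        loss.backward()   # the GRL flips the CA gradient w.r.t. the encoder
        optimizer.step()
        return loss.item()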

4 Experiments

4.1 Dataset Preparation

In this section, we use two datasets to evaluate CAN: the Amazon reviews dataset and the Airline reviews dataset. Table 1 summarizes both datasets. We select data from four domains: Books (B), DVD (D), Electronics (E) and Kitchen (K). Each domain contains 6000 labeled reviews with 3000 positive samples (higher than 3 stars) and 3000 negative samples (lower than 3 stars). Additionally, the dataset contains a large amount of unlabeled data, from which we randomly extract 8,000 unlabeled reviews as training data. These unlabeled data are used only to train the domain classifier, which distinguishes which domain a review comes from. We choose 1000 reviews from the target domain as test data. We conduct cross-domain experiments between every pair of domains, yielding 12 cross-domain sentiment classification tasks: B \(\rightarrow \) D, B \(\rightarrow \) E, B \(\rightarrow \) K, D \(\rightarrow \) B, D \(\rightarrow \) E, D \(\rightarrow \) K, E \(\rightarrow \) B, E \(\rightarrow \) D, E \(\rightarrow \) K, K \(\rightarrow \) B, K \(\rightarrow \) D, K \(\rightarrow \) E. For example, B \(\rightarrow \) D is the task that transfers from source domain B to target domain D.

The Airline reviews dataset is scraped from Skytrax's Web portal, one of the most popular review sites in the air travel industry. The labeled data includes 41,396 reviews for Airline (AL), 17,721 reviews for Airport (AP), 1,258 reviews for Seat (S), and 2,264 reviews for Lounge (L). We select 3000 positive reviews (recommended value 1) and 3000 negative reviews (recommended value 0) from Airline and Airport to be consistent with the Amazon reviews dataset. Besides, we randomly extract 8000 labeled reviews to train the domain classifier. We construct 8 cross-domain sentiment classification tasks: B \(\rightarrow \) AL, D \(\rightarrow \) AL, K \(\rightarrow \) AL, E \(\rightarrow \) AL, B \(\rightarrow \) AP, D \(\rightarrow \) AP, K \(\rightarrow \) AP, E \(\rightarrow \) AP.

Table 1. Statistics of Amazon and Airline datasets

4.2 Implementation Details

We adopt 300-dimensional word2vec vectors trained with the skip-gram model to initialize the embedding matrix M [19]. The maximum number of sentences per document L and the maximum number of words per sentence Q are 20 and 28 respectively. All weight matrices are randomly initialized from a uniform distribution U[−0.01, 0.01]. The dimension of the GRU hidden states is set to 70. The regularization weight \(\rho \) and the dropout rate are set to 0.005 and 0.6 respectively. We use stochastic gradient descent with momentum 0.9 to optimize the model during training. Because the classifiers have different training set sizes, we set batch size \(b_s=50\) for the sentiment classifier and batch size \(b_d=260\) for the domain classifiers. The adaptation rate is \(\lambda = \frac{2}{1+\exp (-10p)}-1\), where \(p=\frac{n}{N}\); n and N are the current epoch and the maximum epoch respectively, and N is set to 100. The learning rate is \(\eta = max(0.003 * 0.1^{\left\lfloor \frac{n}{10} \right\rfloor },0.0005)\).
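
For reference, the two schedules above can be transcribed directly as a small helper (assuming epochs are counted from 0):

    import math

    def schedules(n, N=100):
        """Adaptation rate lambda and learning rate eta at epoch n (0-based)."""
        p = n / N
        lam = 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0   # rises from 0 toward 1
        eta = max(0.003 * 0.1 ** (n // 10), 0.0005)     # stepped, floored at 5e-4
        return lam, eta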

4.3 Benchmark Methods

Naive (Hochreiter et al. 1997): it uses only source-domain data, based on an LSTM.

SCL (Blitzer et al. 2006): it aims to identify correlations between pivots and non-pivots using multiple pivot prediction tasks.

SFA (Pan et al. 2010): it aims to solve mismatch of data distribution by aligning domain-specific words.

DANN (Ganin et al. 2015): it performs domain adaptation on 5000-dimensional feature representations via adversarial training.

AMN (Li et al. 2017): it uses memory networks and adversarial training to get domain-shared representations.

HATN & HATN\(^h\) (Li et al. 2018): it extracts pivots and non-pivots with a hierarchical attention network across domains. HATN does not contain the hierarchical positional encoding and HATN\(^h\) does.

IATN & IATN\(^n\) (Zhang et al. 2019): it uses interactive attention, which combines aspect and sentence information. IATN\(^n\) does not contain the aspect information and IATN does.

CAN\(^s\) & CAN\(^c\) & CAN: our methods, which use the single domain classifier, the category-wise domain classifiers, and the full CA classifiers respectively.

Table 2. Classification results on the Amazon dataset.

The experimental results are shown in Table 2. Compared with the other methods, the CAN model achieves the best performance on most tasks. The Naive model performs poorly on every task because it uses only source-domain data. The SFA model achieves 77.6% on average, since its review features are manually extracted and a linear classifier is not expressive enough. The HATN\(^h\) model achieves 85.1% on average because it automatically learns domain-shared and domain-specific features. The IATN model achieves 85.9% on average because it combines sentence and aspect information. However, neither HATN\(^h\) nor IATN takes category-level alignment into account. CAN\(^c\) achieves 86.1% on average, 0.3% higher than CAN\(^s\), because the categories from different domains are aligned. CAN performs better than CAN\(^c\); the reason is that pseudo-labels alone may lead to incorrect alignment of data distributions, which the additional single domain classifier mitigates. For the hard transfer tasks E \(\rightarrow \) B and E \(\rightarrow \) D, the performance of CAN\(^c\) is very close to that of CAN\(^s\). But for the easy transfer tasks E \(\rightarrow \) K and K \(\rightarrow \) E, the improvement is obvious, since the classification accuracy on the target domain is higher and the pseudo-labels are thus more reliable. Finally, the CAN model achieves 86.6% on average, since it combines the single domain classifier and the category-wise classifiers.

The cross-domain sentiment classification tasks in Table 2 come from different domains with the same origin, such as the E \(\rightarrow \) K and B \(\rightarrow \) D tasks, and are highly relevant. To evaluate CAN on harder transfer tasks, we construct 8 new tasks whose domains come from different origins. The experimental results are reported in Table 3. The Naive method, which sees no target-domain samples, performs worst; its classification accuracy is \(8.1\%\) lower than CAN's. CAN and CAN\(^c\) improve classification accuracy by 0.9% and 0.3% over CAN\(^s\) respectively. These results show that our method is effective on difficult transfer tasks.

Table 3. Classification results on the Airline dataset.

4.4 Visualization of Features

To better illustrate the effectiveness of CAN, we visualize the features of the penultimate layer, as shown in Fig. 3, for the E \(\rightarrow \) K and B \(\rightarrow \) D tasks. The visualization shows that the CAN model produces more distinguishable features, while the features of the CAN\(^s\) model are not clearly discriminated owing to the lack of category-wise alignment.

The CAN model accounts for the complex structure of the data distributions: under CAN, the domains become more indistinguishable while the categories become more discriminable. This benefits cross-domain sentiment classification and results from adding the category-wise domain classifiers to the training process.

Fig. 3. The t-SNE visualization of features extracted by the CAN and CAN\(^s\) models for the E \(\rightarrow \) K and B \(\rightarrow \) D tasks. The red, blue, yellow and green points denote the source positive, source negative, target positive and target negative examples respectively (Color figure online)

Fig. 4. Visualization of CAN on the D \(\rightarrow \) E and D \(\rightarrow \) K tasks

4.5 Visualization of Attention

We also visualize the word attention transfer in Fig. 4. We choose the D \(\rightarrow \) E and D \(\rightarrow \) K tasks and highlight the words according to their attention weights; deeper red indicates a heavier weight. Figure 4 shows that words with sentiment polarity receive higher attention. For instance, the word "lightweight" intuitively indicates negative sentiment in the DVD domain, whereas it usually indicates positive sentiment in the electronics or kitchen domain. Even though some words (good, delight) in a document carry the opposite sentiment polarity, the CAN model still correctly predicts labels on the target domain, because the hierarchical attention mechanism weights these words appropriately in the document representation.

5 Conclusion

In this paper, we propose the CAN model. The CA classifiers align the category-level data distribution by combining label information and document representations in the adversarial process. Besides, we transfer word-level and sentence-level attentions with TAN. Experimental results show that the CAN model effectively improves classification accuracy on the Amazon reviews dataset and the Airline reviews dataset.