Moka-ADA: adversarial domain adaptation with model-oriented knowledge adaptation for cross-domain sentiment analysis

Zhang, Maoyuan; Li, Xiang; Wu, Fei

doi:10.1007/s11227-023-05191-6

Published: 29 March 2023

Volume 79, pages 13724–13743, (2023)

The Journal of Supercomputing

Abstract

Cross-domain sentiment analysis (CDSA) aims to overcome domain discrepancy to judge the sentiment polarity of the target domain lacking labeled data. Recent research has focused on using domain adaptation approaches to address such domain migration problems. Among them, adversarial learning performs domain distribution alignment via domain confusion to transfer domain-invariant knowledge. However, this method that transforms feature representations to be domain-invariant tends to align only the marginal distribution, and may inevitably distort the original feature representations containing discriminative knowledge, thus making the conditional distribution inconsistent. To alleviate this problem, we propose adversarial domain adaptation with model-oriented knowledge adaptation (Moka-ADA) for the CDSA task. We adopt the adversarial discriminative domain adaptation (ADDA) framework to learn domain-invariant knowledge for marginal distribution alignment, based on which knowledge adaptation is conducted between the source and target models for conditional distribution alignment. Specifically, we design a dual structure with similarity constraints on intermediate feature representations and final classification probabilities, so that the target model in training learns discriminative knowledge from the trained source model. Experimental results on a publicly available sentiment analysis dataset show that our method achieves new state-of-the-art performance.

1 Introduction

Sentiment analysis aims to judge the sentiment polarity of given textual data. Recently, with the development of deep networks and pre-trained language models, the performance of sentiment analysis has improved greatly. However, most existing works rely heavily on large amounts of labeled training data to train a separate sentiment classifier for each domain, and such data are both time-consuming and labor-intensive to obtain [1]. It is therefore necessary to leverage a domain rich in labeled data (the source domain) to help sentiment analysis in a domain poor in labeled data (the target domain), which makes cross-domain sentiment analysis (CDSA) a worthwhile research direction.

The major challenge of CDSA is the domain discrepancy between the source and target domains. Domain adaptation is a widely studied research field that can effectively tackle this problem [2], and its methods can be grouped into three major categories. First, pseudo-labeling techniques [3, 4] use a model trained on the labeled source data to produce pseudo-labels for unlabeled target data and then train a model for the target domain in a supervised manner. Second, pivot-based methods [5, 6] aim to select domain-invariant features and use them as a basis for cross-domain mapping. Third, adversarial training approaches [7, 8] aim to learn a domain-independent mapping for input samples by adding an adversarial cost during model training that minimizes the distance between the source and target domain distributions.

Adversarial domain adaptation uses adversarial training to confuse the distributions of the two domains, maximizing domain confusion while minimizing classification error. Representative work includes adversarial discriminative domain adaptation (ADDA) [8], which incorporates discriminative modeling, untied weight sharing, and a GAN-based loss. Specifically, the source encoder and classifier are first trained with labeled source data, and the source encoder weights are copied to the target encoder. Then, the target encoder and discriminator are alternately optimized in a two-player adversarial game similar to GANs [9]. The discriminator learns to distinguish the source and target domains, while the target encoder learns to fool the discriminator by acquiring domain-invariant knowledge.

Although adversarial training approaches such as ADDA can largely reduce the domain discrepancy, they are flawed when matching the feature distribution of the source domain to that of the target domain, and the discriminability of the learned features may not be guaranteed. As shown in Fig. 1b, they tend to align only the marginal distribution between the two domains to bridge the domain gap. However, this may not be sufficient, since a conditional distribution inconsistency remains, as shown in Fig. 1c. The reason is that the original feature representations containing discriminative knowledge are distorted, which enlarges the error of the ideal joint hypothesis. According to domain adaptation theory [10, 11], the error of the ideal joint hypothesis explicitly quantifies the adaptability between the two domains. When the adaptability is poor, we can hardly expect to learn a classifier with low target error by minimizing the source error and the distance between the two domain distributions.

To resolve the above problem, we propose adversarial domain adaptation with model-oriented knowledge adaptation (Moka-ADA) for the CDSA task, which aims to simultaneously align the marginal and conditional distributions as shown in Fig. 1d. In this work, we adopt ADDA as the base adversarial training framework to learn domain-invariant knowledge for marginal distribution alignment. Meanwhile, to learn discriminative knowledge for conditional distribution alignment, we first measure and minimize the distance between intermediate feature representations using the maximum mean discrepancy (MMD) [12] to reduce domain discrepancy. Wang et al. demonstrate that minimizing MMD increases the intra-class distance, and that intra-class and inter-class distances move in opposite directions: when one decreases, the other increases [13]. Thus, we further perform knowledge distillation (KD) [14] on the final classification probabilities to facilitate knowledge transfer, which helps to increase the inter-class distance and thus decrease the intra-class distance. Together, these form the complete model-oriented knowledge adaptation (Moka) module, comprising an intermediate feature representations similarity constraint (ISC) and a final classification probabilities similarity constraint (FSC), which helps the target model in training learn discriminative knowledge from the trained source model and thereby improves the effectiveness of adversarial domain adaptation (ADA). In particular, the ablation study indicates that this may prevent a mode collapse phenomenon in adversarial training.

The main contributions are summarized as follows:

We propose a new method, Moka-ADA, to learn domain-invariant and discriminative knowledge to ensure that the marginal and conditional distributions are aligned simultaneously.
We design a model-oriented knowledge adaptation module containing dual structure with similarity constraints, which enables the target model in training to learn discriminative knowledge from the trained source model.
We adopt knowledge distillation to facilitate the transfer of discriminative knowledge, which helps to increase inter-class distance and thus reduce intra-class distance, and enhance the stability of adversarial domain adaptation.
We conduct extensive experiments on the Amazon reviews benchmark datasets and achieve an average accuracy of 94.25%, improving the state-of-the-art performance on the CDSA task by 1.11%.

2 Related work

2.1 Cross-domain sentiment analysis

The CDSA task investigates the problem of cross-domain sentiment transfer. Many approaches have been proposed, such as word embedding-based techniques [15, 16], pivot and non-pivot-based methods [17, 18], and domain adaptation-based approaches [19, 20]. Recently, pre-trained language models have brought tremendous performance improvements to numerous natural language processing tasks, including CDSA. Du et al. pose domain adversarial training in the context of the pre-trained language model BERT [21]. Karouzos et al. highlight the merits of using language modeling as an auxiliary task during fine-tuning [22]. Zhou et al. pre-train a sentiment-aware language model (SentiX) via domain-invariant sentiment knowledge from large-scale review datasets [23]. In this work, we utilize a pre-trained language model to extract feature representations containing semantic information and then apply them to domain adaptation methods.

2.2 Domain adaptation

Domain adaptation aims to acquire transferable information by reducing domain discrepancy, which is widely used in various cross-domain tasks. Traditionally, the main direction has been to minimize some measure of distance between the source and target feature distributions. Deep Domain Confusion (DDC) [24] introduces an adaptation layer to minimize maximum mean discrepancy in addition to classification loss on the source data. Deep Adaptation Network (DAN) [25] applies multiple kernels to multiple layers based on previous work. Recently, adversarial training approaches to minimize domain discrepancy have received much attention. Domain Adversarial Neural Network (DANN) [7] proposes a domain binary classification with a gradient reversal layer to train in the presence of domain confusion. Adversarial Discriminative Domain Adaptation (ADDA) [8] trains two feature extractors for the source and target domains respectively, and produces embeddings fooling the discriminator. However, during adversarial training, there is a distortion of the original feature representations containing discriminative knowledge, which will lead to an enlarged error of the ideal joint hypothesis in domain adaptation theory. Based on existing studies, we adopt ADDA as a base adversarial training framework and attempt to improve it further by designing a model-oriented knowledge adaptation module.

2.3 Knowledge distillation

Knowledge distillation (KD) transfers knowledge from a trained teacher model to a student model in training [14]. Originally, KD was a model compression technique that transfers knowledge from a cumbersome model to a small model that is more suitable for deployment [27]. However, Furlanello et al. found that, given student and teacher models of the same size, it is possible to make the student model outperform the teacher model [28]. Wang et al. point out that hard labels are sensitive to incorrectly predicted samples, which may mislead the modeling process of label-induced loss [29]. Zhang et al. utilize the softer final classification probabilities of the teacher model as the learning objective for the student model, while choosing an appropriate distillation temperature to mitigate the negative transfer phenomenon [30]. In our model-oriented knowledge adaptation module, the student and teacher models have the same network structure, and the aligned KD objectives include intermediate feature representations and final classification probabilities, thereby facilitating knowledge transfer.

3 Methodology

3.1 Problem definition and notations

The CDSA task aims to generalize a robust classifier trained on labeled source data to judge the sentiment polarity of unlabeled target data. Let ${\mathbb {D}}_{S}$ and ${\mathbb {D}}_{T}$ represent the source and target sample distributions, respectively, and let $y_s^d$ and $y_t^d$ be the corresponding domain labels. In the source domain, ${\varvec{X}}_{S}=\left\{ ({\varvec{x}}_{s}^{i},y_{s}^{i})\right\} _{i=1}^{n_{s}}$ are $n_{s}$ labeled source domain samples, where ${\varvec{x}}_{s}$ denotes a sentence and $y_{s}$ is the corresponding polarity label, $({\varvec{x}}_{s},y_{s})\sim {\mathbb {D}}_{S}$. In the target domain, there is a set of unlabeled samples ${\varvec{X}}_{T}=\left\{ ({\varvec{x}}_{t}^{i})\right\} _{i=1}^{n_{t}}$, where $n_{t}$ is the number of unlabeled target domain samples, ${\varvec{x}}_{t}\sim {\mathbb {D}}_{T}$.

As shown in Fig. 2, the underlying network of our model consists of three components, including two feature extractors $E_{s}$ and $E_{t}$ that extract feature representations ${\varvec{h}}$, a classifier $C_{s}$ that maps the feature representations ${\varvec{h}}$ to the classification logits ${\varvec{p}}$, and a domain discriminator $C_{d}$ that maps the feature representations ${\varvec{h}}$ to the domain probabilities ${\varvec{q}}$.

3.2 Model-oriented knowledge adaptation

To make the target encoder in training learn discriminative knowledge from the trained source encoder, we design a model-oriented knowledge adaptation module, comprising an intermediate feature representations similarity constraint (ISC) and a final classification probabilities similarity constraint (FSC).

3.2.1 Intermediate similarity constraints (ISC) based on the reproducing kernel Hilbert space

The source and target encoders map the source data to a common feature space to obtain the feature representations, which are then mapped into a reproducing kernel Hilbert space (RKHS) by kernel functions to increase their matching probability in the high-dimensional space. However, there is no known pairwise correspondence between them, so pairwise testing is not possible. We therefore formulate the problem as a two-sample test and measure the distance by the maximum mean discrepancy (MMD). Minimizing MMD reduces the distance between intermediate feature representations, so that the knowledge of the source model is transferred to the target model, resulting in better feature representations and improved generalization ability.

Given the source data ${\varvec{x}}_{s}\sim {\mathbb {D}}_{S}$, we can obtain the feature representations ${\varvec{h}}_{s}=E_{s}({\varvec{x}}_{s})$ and $\varvec{\hat{h}}_{t}=E_{t}({\varvec{x}}_{s})$. Let ${\varvec{H}}_{S}=\{({\varvec{h}}_{s}^{i})\}_{i=1}^{n}\sim {\mathbb {H}}_{S}$, ${\varvec{H}}_{T}=\{(\varvec{\hat{h}}_{t}^{i})\}_{i=1}^{n}\sim {\mathbb {H}}_{T}$, where ${\mathbb {H}}_{S}$ and ${\mathbb {H}}_{T}$ are the respective feature distribution and n is the set cardinality. Thus, the distance between ${\mathbb {H}}_{S}$ and ${\mathbb {H}}_{T}$ can be defined below:

$$\begin{aligned} \begin{aligned}&\textrm{MMD}[{\mathcal {F}},{\varvec{h}}_{s},\varvec{\hat{h}}_{t}]\\&\quad =\mathop {\sup }_{\begin{array}{c} f\in {\mathcal {F}}\\ \Vert f\Vert _{\mathcal {H}}\le 1 \end{array}}\left( {\mathbb {E}}_{{\varvec{h}}_{s}\sim {\mathbb {H}}_{S}}f({\varvec{h}}_{s})-{\mathbb {E}}_{\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{T}}f(\varvec{\hat{h}}_{t})\right) \\&\quad =\mathop {\sup }_{\begin{array}{c} f\in \mathcal {F}\\ \Vert f\Vert _{\mathcal {H}}\le 1 \end{array}}\left( {\mathbb {E}}_{{\varvec{h}}_{s}\sim {\mathbb {H}}_{S}}\langle \phi ({\varvec{h}}_{s}),f\rangle _{\mathcal {H}}-{\mathbb {E}}_{\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{T}}\langle \phi (\varvec{\hat{h}}_{t}),f\rangle _{\mathcal {H}}\right) \\&\quad =\mathop {\sup }_{\begin{array}{c} f\in \mathcal {F}\\ \Vert f\Vert _{\mathcal {H}}\le 1 \end{array}}\left\langle {\mathbb {E}}_{{\varvec{h}}_{s}\sim {\mathbb {H}}_{S}}\phi ({\varvec{h}}_{s})-{\mathbb {E}}_{\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{T}}\phi (\varvec{\hat{h}}_{t}),f\right\rangle _{\mathcal {H}}\\&\quad =\left\| {\mathbb {E}}_{{\varvec{h}}_{s}\sim {\mathbb {H}}_{S}}\phi ({\varvec{h}}_{s})-{\mathbb {E}}_{\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{T}}\phi (\varvec{\hat{h}}_{t})\right\| _{\mathcal {H}}, \end{aligned} \end{aligned}$$

(1)

where $\mathcal {H}$ is an RKHS, $\mathcal {F}=\{f:\Vert f\Vert _{\mathcal {H}}\le 1\}$ is the function class, and $\phi (\cdot ):\mathcal {X}\rightarrow \mathcal {H}$ is an infinite-dimensional feature map. In addition, the feature map $\phi (\cdot )$ corresponds to a positive semi-definite kernel k such that $k({\varvec{u}},{\varvec{v}})=\langle \phi ({\varvec{u}}),\phi ({\varvec{v}})\rangle _{\mathcal {H}}$, so Eq. (1) can be rewritten in terms of k. Therefore, the objective function of the “intermediate” similarity constraint can be written as:

$$\begin{aligned} \begin{aligned}&\min _{E_{t}}\mathcal {L}_{\textrm{ISC}}({\varvec{x}}_{s})\\&\quad =\textrm{MMD}^{2}[\mathcal {F},{\varvec{h}}_{s},\varvec{\hat{h}}_{t}]\\&\quad =\left\| {\mathbb {E}}_{{\varvec{h}}_{s}\sim {\mathbb {H}}_{S}}\phi ({\varvec{h}}_{s})-{\mathbb {E}}_{\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{T}}\phi (\varvec{\hat{h}}_{t})\right\| _{\mathcal {H}}^{2}\\&\quad ={\mathbb {E}}_{{\varvec{h}}_{s},{\varvec{h}}_{s}^{\prime }\sim {\mathbb {H}}_{S},{\mathbb {H}}_{S}}k({\varvec{h}}_{s},{\varvec{h}}_{s}^{\prime })\\&\quad -2{\mathbb {E}}_{{\varvec{h}}_{s},\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{S},{\mathbb {H}}_{T}}k({\varvec{h}}_{s},\varvec{\hat{h}}_{t})\\ {}&\qquad +{\mathbb {E}}_{\varvec{\hat{h}}_{t},\varvec{\hat{h}}_{t}^{\prime }\sim {\mathbb {H}}_{T},{\mathbb {H}}_{T}}k(\varvec{\hat{h}}_{t},\varvec{\hat{h}}_{t}^{\prime }), \end{aligned} \end{aligned}$$

(2)

where ${\varvec{h}}_{s}^{\prime }$ is an independent copy of ${\varvec{h}}_{s}$ with the same distribution, and $\varvec{\hat{h}}_{t}^{\prime }$ is an independent copy of $\varvec{\hat{h}}_{t}$. For the kernel function k, we use a linear combination of multiple Gaussian kernels over a range of standard deviations, namely $k({\varvec{u}},{\varvec{v}})=\sum \nolimits _{i=1}^{m}\exp \left\{ -\frac{1}{2\delta _{i}}\Vert {\varvec{u}}-{\varvec{v}}\Vert _{2}^{2}\right\}$, where m is the number of kernel functions and $\delta _{i}$ denotes the standard deviation of the i-th Gaussian kernel.
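As an illustration of Eq. (2), the following minimal sketch computes a biased empirical estimate of the squared MMD with the multi-Gaussian kernel exactly as written above. The function names, the bandwidth values in `deltas`, and the toy two-dimensional features are assumptions for illustration only; they are not taken from the original implementation, which operates on encoder outputs.

```python
import math

def gaussian_mix_kernel(u, v, deltas):
    """k(u, v) = sum_i exp(-||u - v||^2 / (2 * delta_i)), as in the text."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return sum(math.exp(-sq_dist / (2.0 * d)) for d in deltas)

def mmd_squared(hs, ht, deltas=(0.5, 1.0, 2.0)):
    """Biased estimate of Eq. (2): E[k(s,s')] - 2 E[k(s,t)] + E[k(t,t')]."""
    def mean_kernel(xs, ys):
        total = sum(gaussian_mix_kernel(x, y, deltas) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return mean_kernel(hs, hs) - 2.0 * mean_kernel(hs, ht) + mean_kernel(ht, ht)

# Hypothetical features: source-encoder outputs and two target-encoder variants.
h_s = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
h_t_close = [[0.05, 0.1], [0.15, -0.05], [-0.1, 0.05]]
h_t_far = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

print(mmd_squared(h_s, h_t_close))  # small: the two feature sets nearly coincide
print(mmd_squared(h_s, h_t_far))    # larger: the target encoder has drifted
```

In this sketch the loss shrinks as the target encoder's features on source inputs approach those of the frozen source encoder, which is the behavior the ISC term is meant to encourage.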

3.2.2 Final similarity constraints (FSC) based on the knowledge distillation

The trained classifier will receive the feature representations and map them to the classification logits for judgment. The traditional training directly takes one-hot encoded labels as the target, which is prone to result in overfitting during repeated training epochs. To alleviate this problem, we utilize knowledge distillation (KD) to control the degree of knowledge transfer by producing a softer probability distribution. Unlike the hard label, which focuses only on the label value of maximum probability, the soft label describes the probability distribution by multiple probability values, which can better handle noise and uncertainty. Moreover, it contains information about the correlation between different classes, which can help to increase the inter-class distance and thus reduce the intra-class distance.

Given the acquired feature representations ${\varvec{h}}_{s}$ and $\varvec{\hat{h}}_{t}$, the trained classifier $C_{s}$ maps them to the classification logits ${\varvec{p}}_{s}=C_{s}({\varvec{h}}_{s})$ and $\varvec{\hat{p}}_{t}=C_{s}(\varvec{\hat{h}}_{t})$, respectively. As in KD, we obtain the softer classification probabilities ${\varvec{P}}=\sigma ({\varvec{p}}_{s}/T)$ and ${\varvec{Q}}=\sigma (\varvec{\hat{p}}_{t}/T)$, where $\sigma (\cdot )$ is the softmax function and T is the temperature that controls the degree of knowledge transfer. Therefore, the objective function of the “final” similarity constraint can be defined using the Kullback–Leibler divergence between ${\varvec{P}}$ and ${\varvec{Q}}$:

$$\begin{aligned} \begin{aligned}&\min _{E_{t}}\mathcal {L}_{\textrm{FSC}}({\varvec{x}}_{s})\\&\quad =T^{2}\cdot \textrm{KL}({\varvec{P}}\Vert {\varvec{Q}})\\&\quad =T^{2}\cdot {\mathbb {E}}_{{\varvec{x}}_{s}\sim {\mathbb {D}}_{S}}\sum _{k=1}^{K}P_{k}\log \frac{P_{k}}{Q_{k}}, \end{aligned} \end{aligned}$$

(3)

where ${\varvec{P}}\triangleq [P_{1},\cdots ,P_{K}]\in {\mathbb {R}}^{1\times {K}}$, $\sum _{k=1}^{K}P_{k}=1$ and ${\varvec{Q}}\triangleq [Q_{1},\cdots ,Q_{K}]\in {\mathbb {R}}^{1\times {K}}$, $\sum _{k=1}^{K}Q_{k}=1$, $P_{k}$ and $Q_{k}$ are the probabilities of the k-th class, and K is the number of classes.
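To make Eq. (3) concrete, the sketch below computes the temperature-softened probabilities and the resulting $T^{2}\cdot \textrm{KL}({\varvec{P}}\Vert {\varvec{Q}})$ term for a single sample. The function names and the example logits are hypothetical; only the formula and the temperature of 20 reported in Sect. 4.2 come from the text.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fsc_loss(teacher_logits, student_logits, temperature=20.0):
    """Eq. (3) for one sample: T^2 * KL(P || Q) with softened P and Q."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))
    return temperature ** 2 * kl

# Hypothetical logits from C_s on source-encoder and target-encoder features.
p_s = [4.0, -2.0]
p_t_hat = [1.0, 0.5]

print(softmax(p_s, temperature=1.0))   # sharply peaked, close to a hard label
print(softmax(p_s, temperature=20.0))  # much softer distribution
print(fsc_loss(p_s, p_t_hat))
```

A higher temperature flattens both distributions, so the student is asked to match the relative ordering of classes rather than a near one-hot target.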

In summary, the inputs to the source and target encoders are the same, and the target encoder imitates the source encoder in terms of “intermediate” and “final”, thereby transferring discriminative knowledge for conditional distribution alignment.

3.3 Adversarial domain adaptation with model-oriented knowledge adaptation

To compensate, via model-oriented knowledge adaptation, for the lack of discriminative knowledge in adversarial domain adaptation, we propose Moka-ADA, which ensures that both domain-invariant knowledge and discriminative knowledge are fully learned. Figure 2 illustrates the overall framework of our proposed model, which consists of three steps. Step 1: Supervised training of the source encoder $E_{s}$ and classifier $C_{s}$ on the source data. Step 2: Adversarial training of the target encoder $E_{t}$ and discriminator $C_{d}$ to align the source and target domain distributions. Step 3: Inference with the trained target encoder $E_{t}$ and classifier $C_{s}$ on the target data.

In Step 1, we aim to train a well-performing source model using labeled data from the source domain, which serves as a “teacher” for subsequent training of the target model. The source error can be minimized through supervised training of the source encoder $E_{s}$ and classifier $C_{s}$ on $({\varvec{x}}_{s},y_{s})$ by using the Cross-Entropy loss:

$$\begin{aligned} \begin{aligned}&\min _{E_{s},C_{s}}\mathcal {L}_{\textrm{cls}}({\varvec{x}}_{s},y_{s})\\&\quad ={\mathbb {E}}_{({\varvec{x}}_{s},y_{s})\sim {\mathbb {D}}_{S}}-\sum _{k=1}^{K}\mathbbm {1}_{\left[ k=y_{s}\right] }\log \sigma ({\varvec{p}}_{s})_{k}, \end{aligned} \end{aligned}$$

(4)

where ${\varvec{p}}_{s}=C_{s}({\varvec{h}}_{s})$, ${\varvec{h}}_{s}=E_{s}({\varvec{x}}_{s})$, $\sigma (\cdot )$ is the softmax function, and K is the number of classes.
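The cross-entropy objective of Eq. (4) can be sketched as follows for a small batch. The indicator in Eq. (4) simply selects the log-probability of the true class, which is what the indexing `[y]` does here. The function names and example logits are illustrative assumptions.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: z_k - log(sum_j exp(z_j))."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy(batch_logits, labels):
    """Eq. (4): mean over the batch of -log softmax(p_s)[y_s]."""
    losses = [-log_softmax(p)[y] for p, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Hypothetical classifier logits for three source sentences (K = 2 classes).
logits = [[2.0, -1.0], [0.3, 0.1], [-1.5, 2.5]]
labels = [0, 1, 1]
print(cross_entropy(logits, labels))
```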

Then, the source encoder parameters are frozen, which fixes the source domain feature distribution. This gives us the reference distribution for adversarial training, which is analogous to the real image distribution in the GAN setting [9]. Prior to adversarial training, we initialize the target encoder weights with the source encoder weights, as this practice can improve the convergence properties.

In Step 2, the discriminator $C_{d}$ aims to infer the domain probabilities ${\varvec{q}}_{s}$ or ${\varvec{q}}_{t}$ of a sample, i.e., whether it comes from the source or target domain. Thus, the discriminator $C_{d}$ is optimized on $({\varvec{x}}_{s},y_s^d=0)$ and $({\varvec{x}}_{t},y_t^d=1)$:

$$\begin{aligned} \begin{aligned}&\min _{C_{d}}\mathcal {L}_{\textrm{s}}^{\textrm{dis}}({\varvec{x}}_{s},y_s^d)\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{s}\sim {\mathbb {D}}_{S}}-[y_s^d\log {\varvec{q}}_{s}+(1-y_s^d)\log (1-{\varvec{q}}_{s})]\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{s}\sim {\mathbb {D}}_{S}}-\log (1-{\varvec{q}}_{s}), \end{aligned} \end{aligned}$$

(5)

where ${\varvec{q}}_{s}=C_{d}({\varvec{h}}_{s})$, ${\varvec{h}}_{s}=E_{s}({\varvec{x}}_{s})$, and

$$\begin{aligned} \begin{aligned}&\min _{C_{d}}\mathcal {L}_{\textrm{t}}^{\textrm{dis}}({\varvec{x}}_{t},y_t^d)\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{t}\sim {\mathbb {D}}_{T}}-[y_t^d\log {\varvec{q}}_{t}+(1-y_t^d)\log (1-{\varvec{q}}_{t})]\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{t}\sim {\mathbb {D}}_{T}}-\log {\varvec{q}}_{t}, \end{aligned} \end{aligned}$$

(6)

where ${\varvec{q}}_{t}=C_{d}({\varvec{h}}_{t})$, ${\varvec{h}}_{t}=E_{t}({\varvec{x}}_{t})$.

According to Eqs. (5) and (6), we can obtain the final objective function of the discriminator $C_{d}$:

$$\begin{aligned} \begin{aligned}&\min _{C_{d}}\mathcal {L}_{\textrm{dis}}({\varvec{x}}_{s},{\varvec{x}}_{t},y_s^d,y_t^d)\\&\quad =\min _{C_{d}}\left[ \dfrac{\mathcal {L}_{\textrm{s}}^{\textrm{dis}}({\varvec{x}}_{s},y_s^d)+\mathcal {L}_{\textrm{t}}^{\textrm{dis}}({\varvec{x}}_{t},y_t^d)}{2}\right] \\&\quad =\dfrac{{\mathbb {E}}_{{\varvec{x}}_{s}\sim {\mathbb {D}}_{S}}-\log (1-{\varvec{q}}_{s})+{\mathbb {E}}_{{\varvec{x}}_{t}\sim {\mathbb {D}}_{T}}-\log {\varvec{q}}_{t}}{2}. \end{aligned} \end{aligned}$$

(7)
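A minimal sketch of the discriminator objective in Eq. (7) is given below, assuming that $C_{d}$ outputs a scalar probability that a sample comes from the target domain (label 1). The function name, the `eps` guard against `log(0)`, and the example probabilities are assumptions added for illustration.

```python
import math

def discriminator_loss(q_source, q_target, eps=1e-12):
    """Eq. (7): average of the source term (label 0) and target term (label 1)."""
    loss_s = sum(-math.log(1.0 - q + eps) for q in q_source) / len(q_source)
    loss_t = sum(-math.log(q + eps) for q in q_target) / len(q_target)
    return 0.5 * (loss_s + loss_t)

# Hypothetical discriminator outputs q = C_d(h) for a small batch.
q_s = [0.1, 0.2, 0.15]   # source features, ideally close to 0
q_t = [0.9, 0.8, 0.85]   # target features, ideally close to 1
print(discriminator_loss(q_s, q_t))
```

When the discriminator cannot separate the domains and outputs 0.5 everywhere, this loss equals $\log 2$, which is the level the target encoder tries to push it back toward.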

To adversarially train the target encoder $E_{t}$, it is encouraged to fool the discriminator $C_{d}$ by reversing the domain label. Thus, the target encoder $E_{t}$ is optimized on $({\varvec{x}}_{t},y_s^d=0)$:

$$\begin{aligned} \begin{aligned}&\min _{E_{t}}\mathcal {L}_{\textrm{gen}}({\varvec{x}}_{t},y_s^d)\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{t}\sim {\mathbb {D}}_{T}}-[y_s^d\log {\varvec{q}}_{t}+(1-y_s^d)\log (1-{\varvec{q}}_{t})]\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{t}\sim {\mathbb {D}}_{T}}-\log (1-{\varvec{q}}_{t}), \end{aligned} \end{aligned}$$

(8)

where ${\varvec{q}}_{t}=C_{d}({\varvec{h}}_{t})$, ${\varvec{h}}_{t}=E_{t}({\varvec{x}}_{t})$.

Based on Eq. (2) and Eq. (3) in Sect. 3.2 and Eq. (8), the final objective function for training the target encoder $E_{t}$ can be defined as:

$$\begin{aligned} \begin{aligned}&\min _{E_{t}}\mathcal {L}_{\textrm{tgt}}({\varvec{x}}_{s},{\varvec{x}}_{t},y_s^d)\\&\quad =\min _{E_{t}}[\mathcal {L}_{\textrm{gen}}({\varvec{x}}_{t},y_s^d)+\mathcal {L}_{\textrm{ISC}}({\varvec{x}}_{s})+\mathcal {L}_{\textrm{FSC}}({\varvec{x}}_{s})]. \end{aligned} \end{aligned}$$

(9)
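The label inversion in Eq. (8) and the unweighted sum in Eq. (9) can be sketched as follows. The function names are hypothetical, and the ISC and FSC values passed in are placeholder numbers rather than results from the paper.

```python
import math

def generator_loss(q_target, eps=1e-12):
    """Eq. (8): target features are scored against the source label (0)."""
    return sum(-math.log(1.0 - q + eps) for q in q_target) / len(q_target)

def target_objective(loss_gen, loss_isc, loss_fsc):
    """Eq. (9): unweighted sum of the adversarial, ISC, and FSC terms."""
    return loss_gen + loss_isc + loss_fsc

# Hypothetical discriminator outputs on target features.
q_t = [0.9, 0.8, 0.85]
l_gen = generator_loss(q_t)
print(l_gen)  # high while the discriminator still recognizes target samples

# Placeholder ISC and FSC values computed on a source batch.
print(target_objective(l_gen, loss_isc=0.12, loss_fsc=0.30))
```

Note that the adversarial term is computed on target inputs while the two similarity terms are computed on source inputs, so one update of $E_{t}$ uses a batch from each domain.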

Through Eq. (7) and Eq. (9), the discriminator $C_{d}$ and target encoder $E_{t}$ are alternately optimized in a two-player adversarial game similar to GANs [9], as in the ADDA framework [8].

In Step 3, we can finally use the trained target encoder $E_{t}$ and classifier $C_{s}$ to make inferences on the target data used for testing, whose sentiment polarity label can be predicted as below:

$$\begin{aligned} \begin{aligned} \hat{y}_{t}=\mathop {\arg \max }{\varvec{p}}_{t}, \end{aligned} \end{aligned}$$

(10)

where ${\varvec{p}}_{t}=C_{s}({\varvec{h}}_{t})$, ${\varvec{h}}_{t}=E_{t}({\varvec{x}}_{t})$.
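For completeness, the prediction rule of Eq. (10) and the accuracy metric used in the experiments can be sketched as below. The mapping of index 0 to negative and index 1 to positive is an assumption, as are the function names and example logits.

```python
def predict(logits):
    """Eq. (10): index of the largest classification logit."""
    return max(range(len(logits)), key=lambda k: logits[k])

def accuracy(batch_logits, labels):
    """Fraction of samples whose predicted class matches the label."""
    correct = sum(predict(p) == y for p, y in zip(batch_logits, labels))
    return correct / len(labels)

# Hypothetical logits p_t = C_s(E_t(x_t)) for three target reviews.
p_t = [[2.0, 1.0], [0.0, 3.0], [1.0, 0.0]]
y_t = [0, 1, 1]
print([predict(p) for p in p_t])
print(accuracy(p_t, y_t))
```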

The overall iterative training procedure of Moka-ADA is summarized in Algorithm 1.

3.4 Theoretical analysis

We provide a theoretical understanding of why our method can enhance adversarial domain adaptation based on the domain adaptation theory from Ben-David et al. [10, 11], a key outcome of which is the following theorem:

Theorem 1. Let $\mathcal {H}$ be the hypothesis space, and let $\epsilon _{S}$ and $\epsilon _{T}$ be the generalization errors on the source domain ${\mathbb {D}}_{S}$ and the target domain ${\mathbb {D}}_{T}$, respectively. Then for any $h\in \mathcal {H}$, we have

$$\begin{aligned} \begin{aligned} \epsilon _{T}(h)\le \epsilon _{S}(h)+d_{\mathcal {H}\Delta \mathcal {H}}\left( {\mathbb {D}}_{S},{\mathbb {D}}_{T}\right) +\lambda , \end{aligned} \end{aligned}$$

(11)

where $d_{\mathcal {H}\Delta \mathcal {H}}$ is the $\mathcal {H}\Delta \mathcal {H}$-divergence [31] to measure the domain discrepancy between ${\mathbb {D}}_{S}$ and ${\mathbb {D}}_{T}$, defined as:

$$\begin{aligned} \begin{aligned} d_{\mathcal {H}\Delta \mathcal {H}}\triangleq \sup _{h,h^{\prime }\in \mathcal {H}}|{\mathbb {E}}_{{{\textbf {x}}}_{s}\sim {\mathbb {D}}_{S}}\left[ h({{\textbf {x}}}_{s})\ne {h^{\prime }({{\textbf {x}}}_{s})}\right] \\ -{\mathbb {E}}_{{{\textbf {x}}}_{t}\sim {\mathbb {D}}_{T}}\left[ h({{\textbf {x}}}_{t})\ne {h^{\prime }({{\textbf {x}}}_{t})}\right] |, \end{aligned} \end{aligned}$$

(12)

where h and $h^{\prime }$ are two hypotheses in $\mathcal {H}$, and $\lambda$ is the error of the ideal joint hypothesis $h^{*}$, where $h^{*}$ is defined as $h^{*}=\mathop {\arg \min }\limits _{h\in \mathcal {H}}\epsilon _{S}(h)+\epsilon _{T}(h)$, such that

$$\begin{aligned} \begin{aligned} \lambda =\epsilon _{S}(h^{*})+\epsilon _{T}(h^{*}). \end{aligned} \end{aligned}$$

(13)

From Eq. (11), the generalization error on the target domain $\epsilon _{T}(h)$ is upper bounded by a combination of the generalization error on the source domain $\epsilon _{S}(h)$, the domain discrepancy $d_{\mathcal {H}\Delta \mathcal {H}}$, and the error of the ideal joint hypothesis $\lambda$. First, it is easy to minimize $\epsilon _{S}(h)$ by supervised training with labeled source data. Then, $d_{\mathcal {H}\Delta \mathcal {H}}$ can be reduced by aligning the marginal distribution via adversarial domain adaptation. Moreover, the dual structure with similarity constraints can yield lower $\lambda$ and further reduce $d_{\mathcal {H}\Delta \mathcal {H}}$ by acquiring discriminative knowledge for conditional distribution alignment.
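As a purely hypothetical numerical illustration of Eq. (11) (the values below are invented for exposition and are not measured quantities), suppose marginal alignment alone drives the divergence down but distorts the features so that $\lambda$ grows, whereas additional conditional alignment keeps $\lambda$ small:

$$\begin{aligned} \text {marginal alignment only:}\quad&\epsilon _{T}(h)\le \underbrace{0.05}_{\epsilon _{S}(h)}+\underbrace{0.02}_{d_{\mathcal {H}\Delta \mathcal {H}}}+\underbrace{0.20}_{\lambda }=0.27,\\ \text {with conditional alignment:}\quad&\epsilon _{T}(h)\le \underbrace{0.05}_{\epsilon _{S}(h)}+\underbrace{0.02}_{d_{\mathcal {H}\Delta \mathcal {H}}}+\underbrace{0.03}_{\lambda }=0.10. \end{aligned}$$

Under these assumed values, the same divergence yields a much looser bound when $\lambda$ is large, which is why reducing $d_{\mathcal {H}\Delta \mathcal {H}}$ alone does not guarantee a low target error.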

4 Experiments

Table 1 Statistics of the Amazon reviews benchmark datasets

4.1 Datasets

We evaluate our method on the Amazon reviews benchmark datasets collected by Blitzer et al. [32], which are publicly available and widely used for the CDSA task. They include reviews from four product domains: Books (B), DVDs (D), Electronics (E), and Kitchen appliances (K). Each domain contains 2000 labeled samples, of which 1000 are negative and 1000 are positive. Following previous work [22, 33], we construct 12 cross-domain tasks from the source-target domain pairs. For each domain pair, 1600 labeled source samples and the same number of unlabeled target samples are used for training, and the remaining 400 labeled source samples are used for validation. We then test on all the labeled target samples. Table 1 lists the relevant statistics.
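The task construction described above can be sketched as follows. The 12 tasks are the ordered pairs of the four domains, and each source domain is split 1600/400. The function names, the random shuffling, and the seed are assumptions; the text does not state how the split is drawn.

```python
import random
from itertools import permutations

DOMAINS = ["B", "D", "E", "K"]  # Books, DVDs, Electronics, Kitchen appliances

def build_tasks(domains=DOMAINS):
    """All ordered (source, target) pairs with source != target."""
    return list(permutations(domains, 2))

def split_source(samples, n_train=1600, seed=0):
    """Shuffle and split labeled source samples into train and validation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

tasks = build_tasks()
print(len(tasks), tasks[:3])

# Placeholder sample ids standing in for the 2000 labeled source reviews.
train, val = split_source(list(range(2000)))
print(len(train), len(val))
```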

4.2 Implementation details

We adopt SentiX as the context feature extractor, which is a sentiment-aware pre-trained language model proposed by Zhou et al. [23]. For all experiments, we limit the maximum sequence length to 256 and set the batch size to 32. The optimizer is Adam with a learning rate of $10^{-5}$, $\beta _{1}$ = 0.9, and $\beta _{2}$ = 0.999. During supervised training, we train for 5 epochs and use the validation set to choose the epoch at which to save the model. For adversarial training, we train for 1 to 5 epochs and report the average results, and we empirically set a gradient norm of 1.0, a clip value of 0.01, and a knowledge distillation temperature of 20 for more stable adversarial training.
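For reference, the reported settings can be collected in a single configuration object, as in the sketch below. The class and field names are hypothetical; only the values are taken from the text. In particular, reading the "gradient norm" as a gradient-clipping threshold and the "clip value" as a parameter-clipping bound is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MokaAdaConfig:
    max_seq_length: int = 256
    batch_size: int = 32
    learning_rate: float = 1e-5
    adam_betas: tuple = (0.9, 0.999)
    supervised_epochs: int = 5
    adversarial_epochs: tuple = (1, 2, 3, 4, 5)  # results averaged over these
    max_grad_norm: float = 1.0   # assumed meaning of "gradient norm of 1.0"
    clip_value: float = 0.01     # assumed meaning of "clip value of 0.01"
    kd_temperature: float = 20.0

cfg = MokaAdaConfig()
print(cfg)
```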

Table 2 Accuracy results on 12 domain pairs from the Amazon reviews benchmark datasets. (The best performance is indicated in bold.)

4.3 Compared methods

We consider the following methods for comparison, including PERL [34], DAAT [21], p+CFd [35], UDALM [22], DA-SDS [33], and AdSPT [36]. We present the best results reported in the original paper of these approaches. Besides, we adopt the SentiX model as a baseline and design several variants of our model:

Baseline: The sentiment-aware pre-trained language model SentiX.
ISC-ADA: A variant of the proposed model, which only imposes similarity constraints on intermediate feature representations.
FSC-ADA: A variant of the proposed model, which only imposes similarity constraints on final classification probabilities.
Moka-ADA: The full model introduced in Sect. 3.3.

4.4 Experimental results

In Table 2, we report the accuracy results of the compared methods on 12 cross-domain tasks. The baseline outperforms most of the other compared works, which is mainly attributed to the sentiment knowledge it learns through pre-training on large-scale review datasets. Notably, Moka-ADA improves the average accuracy by 1.57% over the baseline and by 6.75%, 4.13%, 3.62%, 2.51%, 2.77% and 1.11% over the other compared methods, respectively.

As shown in Fig. 3, our methods outperform the baseline on almost all domain pairs, which suggests that both ISC-ADA and FSC-ADA can effectively use similarity constraints to enhance adversarial domain adaptation. Compared to ISC-ADA and FSC-ADA, the full Moka-ADA performs better on 7 of the 12 domain-pair tasks and mostly has smaller standard deviations, indicating greater robustness.
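The reporting protocol above (12 ordered domain pairs, with results averaged over adversarial training runs of 1 to 5 epochs and compared by standard deviation) can be sketched as follows. The domain abbreviations are an assumption: B, D and K appear in the text, and E is added on the assumption that the benchmark is the usual four-domain Amazon reviews set. The accuracy values are placeholders invented for this example and are not results from the paper; the use of the population standard deviation is also an assumption.

```python
from itertools import permutations
from statistics import mean, pstdev

# Assumed domain abbreviations; only B, D and K are named in the text.
DOMAINS = ["B", "D", "E", "K"]


def domain_pairs(domains):
    """Enumerate every ordered source->target pair of distinct domains."""
    return [f"{src}->{tgt}" for src, tgt in permutations(domains, 2)]


def summarize(accuracies):
    """Return the mean and population standard deviation of a list of accuracies."""
    return mean(accuracies), pstdev(accuracies)


pairs = domain_pairs(DOMAINS)

# Placeholder accuracies for one task after 1..5 adversarial epochs (made up).
placeholder_acc = [90.0, 91.0, 92.0, 91.0, 91.0]
avg, std = summarize(placeholder_acc)

print(len(pairs), "domain pairs:", pairs)
print(f"mean = {avg:.2f}, std = {std:.2f}")
```

A smaller standard deviation across the five epoch settings means the reported average depends less on when adversarial training is stopped, which is the sense of robustness used in the comparison above.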

4.5 Visualization of features

To more intuitively assess the effect of model-oriented knowledge adaptation on the feature distribution, we further visualize the feature representations of the source and target data for the B $\rightarrow$ D task. The visualization of the feature representations is performed using the t-SNE algorithm to transform the 768-dimensional feature space into a two-dimensional space. In Fig. 4, the visualization results of Baseline, ISC-ADA, FSC-ADA, and Moka-ADA are presented separately.

In Fig. 4a, we observe that samples of different polarities in the source domain are well separated, while in the target domain some samples of different polarities are mixed together, with unclear decision boundaries. In Fig. 4b, the situation improves: samples of the same polarity across domains tend to be consistent, indicating that ISC-ADA reduces the distance between feature representations across domains and thereby reduces the discrepancy between domain distributions. In Fig. 4c, although samples of the same polarity across domains are less aligned, FSC-ADA increases the inter-class distance and reduces the intra-class distance, making the decision boundaries clearer. In Fig. 4d, Moka-ADA not only makes samples of the same polarity across domains compact and aligned, but also yields better decision boundaries.
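The notions of intra-class and inter-class distance used in this discussion can be made concrete with a small sketch on two-dimensional points, such as those produced by a t-SNE projection. The definitions below (mean distance to the class centroid, and distance between class centroids) are one common choice and are an assumption here, since the text does not define them. The point coordinates are toy values, not data from the paper.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def intra_class_distance(points):
    """Mean Euclidean distance from each point to its class centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def inter_class_distance(class_a, class_b):
    """Euclidean distance between the centroids of two classes."""
    return math.dist(centroid(class_a), centroid(class_b))


# Toy 2-D embeddings for two polarities (illustrative values only).
positive = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]
negative = [(-1.0, -1.0), (-1.0, -3.0), (-3.0, -1.0), (-3.0, -3.0)]

intra_pos = intra_class_distance(positive)
inter = inter_class_distance(positive, negative)

print(f"intra-class (positive): {intra_pos:.3f}")
print(f"inter-class:            {inter:.3f}")
```

Under these definitions, a clearer decision boundary corresponds to a larger ratio of inter-class to intra-class distance.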

4.6 Ablation studies

Table 3 Experimental results of the Only-ADA

Table 4 Experimental results of the Moka-ADA

To analyze the effect of our method on adversarial training, we conduct ablation experiments, with the results shown in Tables 3 and 4, where Only-ADA denotes adversarial training without model-oriented knowledge adaptation. The comparison shows that our method is effective and robust, whereas the performance of Only-ADA decreases dramatically as the number of training epochs increases.

For further study, we visualize the features of Only-ADA for the K $\rightarrow$ B task, as shown in Fig. 5. In the first subplot, all samples fall into four clusters, which indicates that adversarial training brings domain awareness to the model. In the remaining subplots, however, samples of different polarities in the target domain gradually mix into the same cluster, which is a mode collapse phenomenon in adversarial training. In contrast, our models make adversarial training more stable and flexible, which effectively prevents mode collapse.

5 Conclusion and future work

In this study, we propose a novel method, Moka-ADA, for cross-domain sentiment analysis. It aims to learn domain-invariant and discriminative knowledge so that the marginal and conditional distributions are aligned simultaneously. The model-oriented knowledge adaptation module we designed effectively facilitates knowledge transfer. Extensive experiments show that Moka-ADA outperforms the state-of-the-art results on the Amazon reviews benchmark datasets. Theoretical analysis and ablation studies verify the reasonableness and effectiveness of our method.

In the future, we would like to adapt our method to more realistic and challenging scenarios, such as multi-source domains [37] and sparsely labeled source domains [38], and to further explore applications to other cross-domain tasks in natural language processing and computer vision.

Availability of data and materials

Data supporting the results of this study are available upon request from the corresponding author xli@mails.ccnu.edu.cn. Because they contain information that may compromise the consent of study participants, these data are not publicly available.

References

1. Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642
2. Wilson G, Cook DJ (2018) Adversarial transfer learning. arXiv preprint arXiv:1812.02849
3. Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189–196
4. Zhou Z-H, Li M (2005) Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on knowledge and data engineering 17(11):1529–1541
5. Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 120–128
6. Pan SJ, Ni X, Sun J-T, Yang Q, Chen Z (2010) Cross-domain sentiment classification via spectral feature alignment. In: Proceedings of the 19th International Conference on World Wide Web, pp. 751–760
7. Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M, Lempitsky V (2016) Domain-adversarial training of neural networks. J Mach Learn Res 17(1):2096
8. Tzeng E, Hoffman J, Saenko K, Darrell T (2017) Adversarial discriminative domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176
9. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. Advances in neural information processing systems 27
10. Ben-David S, Blitzer J, Crammer K, Pereira F (2006) Analysis of representations for domain adaptation. Advances in neural information processing systems 19
11. Ben-David S, Blitzer J, Crammer K, Kulesza A, Pereira F, Vaughan JW (2010) A theory of learning from different domains. Mach Learn 79(1):151–175
12. Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A (2012) A kernel two-sample test. J Mach Learn Res 13(1):723–773
13. Wang W, Li H, Ding Z, Nie F, Chen J, Dong X, Wang Z (2023) Rethinking maximum mean discrepancy for visual domain adaptation. IEEE Trans Neural Netw Learn Syst 34(1):264–277. https://doi.org/10.1109/TNNLS.2021.3093468
14. Hinton G, Vinyals O, Dean J, et al (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531
15. Bollegala D, Mu T, Goulermas JY (2015) Cross-domain sentiment classification using sentiment sensitive embeddings. IEEE Trans Knowledge and Data Eng 28(2):398–410
16. Liu J, Zheng S, Xu G, Lin M (2021) Cross-domain sentiment aware word embeddings for review sentiment analysis. Int J Mach Learn Cybernet 12(2):343–354
17. Ziser Y, Reichart R (2018) Pivot based language modeling for improved neural domain adaptation. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1241–1251
18. Zheng L, Ying W, Yu Z, Qiang Y (2018) Hierarchical attention transfer network for cross-domain sentiment classification. In: AAAI18
19. Li Z, Zhang Y, Wei Y, Wu Y, Yang Q (2017) End-to-end adversarial memory network for cross-domain sentiment classification. In: IJCAI, pp. 2237–2243
20. Qu X, Zou Z, Cheng Y, Yang Y, Zhou P (2019) Adversarial category alignment network for cross-domain sentiment classification. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2496–2508
21. Du C, Sun H, Wang J, Qi Q, Liao J (2020) Adversarial and domain-aware bert for cross-domain sentiment analysis. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4019–4028
22. Karouzos C, Paraskevopoulos G, Potamianos A (2021) Udalm: Unsupervised domain adaptation through language modeling. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2579–2590
23. Zhou J, Tian J, Wang R, Wu Y, Xiao W, He L (2020) Sentix: A sentiment-aware pre-trained model for cross-domain sentiment analysis. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 568–579
24. Tzeng E, Hoffman J, Zhang N, Saenko K, Darrell T (2014) Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474
25. Long M, Cao Y, Wang J, Jordan M (2015) Learning transferable features with deep adaptation networks. In: International Conference on Machine Learning, pp. 97–105. PMLR
26. Sun S, Cheng Y, Gan Z, Liu J (2019) Patient knowledge distillation for bert model compression. In: EMNLP/IJCNLP (1)
27. Yim J, Joo D, Bae J, Kim J (2017) A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141
28. Furlanello T, Lipton Z, Tschannen M, Itti L, Anandkumar A (2018) Born again neural networks. In: International Conference on Machine Learning, pp. 1607–1616. PMLR
29. Wang W, Li B, Wang M, Nie F, Wang Z, Li H (2022) Confidence regularized label propagation based domain adaptation. IEEE Trans Circuits and Syst Video Technol 32(6):3319–3333. https://doi.org/10.1109/TCSVT.2021.3104835
30. Zhang B, Zhang X, Liu Y, Cheng L, Li Z (2021) Matching distributions between model and data: Cross-domain knowledge distillation for unsupervised domain adaptation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5423–5433
31. Kifer D, Ben-David S, Gehrke J (2004) Detecting change in data streams. In: VLDB, vol. 4, pp. 180–191. Toronto, Canada
32. Blitzer J, Dredze M, Pereira F (2007) Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 440–447
33. Fu Y, Liu Y (2022) Domain adaptation with a shrinkable discrepancy strategy for cross-domain sentiment classification. Neurocomputing
34. Ben-David E, Rabinovitz C, Reichart R (2020) Perl: Pivot-based domain adaptation for pre-trained deep contextualized embedding models. Trans Assoc Comput Linguis 8:504–521
35. Ye H, Tan Q, He R, Li J, Ng HT, Bing L (2020) Feature adaptation of pre-trained language models across languages and domains with robust self-training. arXiv preprint arXiv:2009.11538
36. Wu H, Shi X (2022) Adversarial soft prompt tuning for cross-domain sentiment analysis. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2438–2447
37. Fu Y, Liu Y (2022) Contrastive transformer based domain adaptation for multi-source cross-domain sentiment classification. Knowledge-Based Syst 245:108649
38. Wang W, Chen S, Xiang Y, Sun J, Li H, Wang Z, Sun F, Ding Z, Li B (2021) Sparsely-labeled source assisted domain adaptation. Pattern Recognition 112:107803

Funding

This work is supported by the Fundamental Research Funds of the National Language Committee (Grant No. YB135-40).

Author information

Authors and Affiliations

Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, 430079, Hubei, China
Maoyuan Zhang, Xiang Li & Fei Wu
School of Computer, Central China Normal University, Wuhan, 430079, Hubei, China
Maoyuan Zhang, Xiang Li & Fei Wu
National Language Resources Monitor and Research Center for Network Media, Central China Normal University, Wuhan, 430079, Hubei, China
Maoyuan Zhang, Xiang Li & Fei Wu

Authors

Maoyuan Zhang
Xiang Li
Fei Wu

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Maoyuan Zhang, Xiang Li and Fei Wu. The first draft of the manuscript was written by Xiang Li and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiang Li.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical Approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A supplemental experimental results

See Tables 5 and 6.

Table 5 Experimental results of the ISC-ADA

Table 6 Experimental results of the FSC-ADA

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, M., Li, X. & Wu, F. Moka-ADA: adversarial domain adaptation with model-oriented knowledge adaptation for cross-domain sentiment analysis. J Supercomput 79, 13724–13743 (2023). https://doi.org/10.1007/s11227-023-05191-6

Accepted: 12 March 2023
Published: 29 March 2023
Issue Date: August 2023
DOI: https://doi.org/10.1007/s11227-023-05191-6
