1 Introduction

Sentiment analysis aims to judge the sentiment polarity of given textual data. Recently, with the development of deep networks and pre-trained language models, the performance of sentiment analysis has greatly improved. However, most existing works rely heavily on large amounts of labeled training data to train a separate sentiment classifier for each domain, and such data are both time-consuming and labor-intensive to obtain [1]. It is therefore desirable to leverage a domain rich in labeled data (the source domain) to help sentiment analysis in a domain poor in labeled data (the target domain), which makes cross-domain sentiment analysis (CDSA) a worthwhile research direction.

The major challenge of CDSA is the domain discrepancy between the source and target domains. Domain adaptation is a widely studied field of research that can be used to tackle this problem [2], and its methods can be grouped into three major categories. First, pseudo-labeling techniques [3, 4] use a model trained on the labeled source data to produce pseudo-labels for unlabeled target data and then train a model for the target domain in a supervised manner. Second, pivot-based methods [5, 6] aim to select domain-invariant features and use them as a basis for cross-domain mapping. Third, adversarial training approaches [7, 8] aim to learn a domain-independent mapping for input samples by adding an adversarial cost during model training that minimizes the distance between the source and target domain distributions.

Adversarial domain adaptation uses adversarial training to make the two domain distributions indistinguishable, maximizing the domain discriminator's error while minimizing the classification error. A representative work is adversarial discriminative domain adaptation (ADDA) [8], which incorporates discriminative modeling, untied weight sharing, and a GAN-based loss. Specifically, the source encoder and classifier are first trained with labeled source data, and the source encoder weights are copied to the target encoder. Then, the target encoder and discriminator are alternately optimized in a two-player adversarial game similar to GANs [9]. The discriminator learns to distinguish the source and target domains, while the target encoder learns to fool the discriminator by acquiring domain-invariant knowledge.

Fig. 1 An illustration of domain adaptation. (a) A classifier trained on the source domain does not apply well to the target domain before domain adaptation. (b) Aligning the marginal distribution via adversarial learning. (c) The inconsistency of the conditional distribution in (b) may still lead to a high classification error on the target domain. (d) The marginal and conditional distributions are aligned simultaneously by our method

Although adversarial training approaches such as ADDA can largely reduce the domain discrepancy, they are flawed when matching the feature distribution of the source domain to that of the target domain, because the discriminability of the learned features is not guaranteed. As shown in Fig. 1b, they mainly align only the marginal distribution between the two domains to bridge the domain gap. However, this may not be sufficient, since a conditional distribution inconsistency can remain, as shown in Fig. 1c. The reason is that the original feature representations containing discriminative knowledge are distorted, leading to an enlarged error of the ideal joint hypothesis. Based on the domain adaptation theory [10, 11], the error of the ideal joint hypothesis is an explicit quantification of the adaptability between the two domains. When the adaptability is poor, we can hardly expect to learn a classifier with low target error by minimizing the source error and the distance between the two domain distributions.

To resolve the above problem, we propose adversarial domain adaptation with model-oriented knowledge adaptation (Moka-ADA) for the CDSA task, which aims to simultaneously align the marginal and conditional distributions as shown in Fig. 1d. In this work, we adopt ADDA as the base adversarial training framework to learn domain-invariant knowledge for marginal distribution alignment. Meanwhile, to learn discriminative knowledge for conditional distribution alignment, we first measure and minimize the distance between intermediate feature representations using the maximum mean discrepancy (MMD) [12] to reduce domain discrepancy. Wang et al. demonstrate that minimizing MMD leads to an increase in intra-class distance, and that intra-class and inter-class distances trade off against each other: as one decreases, the other increases [13]. Thus, we further perform knowledge distillation (KD) [14] on the final classification probabilities to facilitate knowledge transfer, which helps to increase the inter-class distance and thus decrease the intra-class distance. We therefore propose the complete model-oriented knowledge adaptation (Moka) module, consisting of an intermediate feature representations similarity constraint (ISC) and a final classification probabilities similarity constraint (FSC). It helps the target model in training to learn discriminative knowledge from the trained source model, so that the effectiveness of adversarial domain adaptation (ADA) is improved. In particular, the ablation study indicates that this may prevent a mode collapse phenomenon in adversarial training.

The main contributions are summarized as follows:

  • We propose a new method, Moka-ADA, to learn domain-invariant and discriminative knowledge to ensure that the marginal and conditional distributions are aligned simultaneously.

  • We design a model-oriented knowledge adaptation module with a dual structure of similarity constraints, which enables the target model in training to learn discriminative knowledge from the trained source model.

  • We adopt knowledge distillation to facilitate the transfer of discriminative knowledge, which helps to increase the inter-class distance and thus reduce the intra-class distance, and enhances the stability of adversarial domain adaptation.

  • We conduct extensive experiments on the Amazon reviews benchmark datasets and achieve an average accuracy of 94.25%, improving the state-of-the-art performance on the CDSA task by 1.11%.

2 Related work

2.1 Cross-domain sentiment analysis

The CDSA task investigates the problem of cross-domain sentiment transfer. Many approaches have been proposed, such as word embedding-based techniques [15, 16], pivot and non-pivot-based methods [17, 18], and domain adaptation-based approaches [19, 20]. Recently, pre-trained language models have brought large performance improvements to numerous natural language processing tasks, including CDSA. Du et al. apply domain adversarial training in the context of the pre-trained language model BERT [21]. Karouzos et al. highlight the merits of using language modeling as an auxiliary task during fine-tuning [22]. Zhou et al. pre-train a sentiment-aware language model (SentiX) via domain-invariant sentiment knowledge from large-scale review datasets [23]. In this work, we use a pre-trained language model to extract feature representations containing semantic information and then apply them to domain adaptation methods.

2.2 Domain adaptation

Domain adaptation aims to acquire transferable information by reducing domain discrepancy and is widely used in various cross-domain tasks. Traditionally, the main direction has been to minimize some measure of distance between the source and target feature distributions. Deep Domain Confusion (DDC) [24] introduces an adaptation layer to minimize the maximum mean discrepancy in addition to the classification loss on the source data. Deep Adaptation Network (DAN) [25] extends this idea by applying multiple kernels to multiple layers. Recently, adversarial training approaches to minimizing domain discrepancy have received much attention. Domain Adversarial Neural Network (DANN) [7] adds a binary domain classifier with a gradient reversal layer so that training proceeds under domain confusion. Adversarial Discriminative Domain Adaptation (ADDA) [8] trains two feature extractors for the source and target domains, respectively, and produces embeddings that fool the discriminator. However, during adversarial training, the original feature representations containing discriminative knowledge are distorted, which leads to an enlarged error of the ideal joint hypothesis in domain adaptation theory. Building on existing studies, we adopt ADDA as the base adversarial training framework and attempt to improve it further by designing a model-oriented knowledge adaptation module.

2.3 Knowledge distillation

Knowledge distillation (KD) transfers knowledge from a trained teacher model to a student model in training [14]. Originally, KD was a model compression technique that transfers knowledge from a cumbersome model to a small model that is more suitable for deployment [27]. However, Furlanello et al. found that, given student and teacher models of the same size, the student model can outperform the teacher model [28]. Wang et al. point out that hard labels are sensitive to incorrectly predicted samples, which may mislead the modeling process of the label-induced loss [29]. Zhang et al. use the softer final classification probabilities of the teacher model as the learning objective for the student model, while choosing an appropriate distillation temperature to mitigate the negative transfer phenomenon [30]. In our model-oriented knowledge adaptation module, the student and teacher models have the same network structure, and the aligned KD objectives include intermediate feature representations and final classification probabilities, thereby facilitating knowledge transfer.

3 Methodology

3.1 Problem definition and notations

The CDSA task aims to generalize a robust classifier trained on labeled source data to judge the sentiment polarity of unlabeled target data. Let \({\mathbb {D}}_{S}\) and \({\mathbb {D}}_{T}\) represent the source and target sample distributions, respectively, and let \(y_s^d\) and \(y_t^d\) be the corresponding domain labels. In the source domain, \({\varvec{X}}_{S}=\left\{ ({\varvec{x}}_{s}^{i},y_{s}^{i})\right\} _{i=1}^{n_{s}}\) are \(n_{s}\) labeled source domain samples, where \({\varvec{x}}_{s}\) denotes a sentence and \(y_{s}\) is the corresponding polarity label, with \(({\varvec{x}}_{s},y_{s})\sim {\mathbb {D}}_{S}\). In the target domain, there is a set of unlabeled samples \({\varvec{X}}_{T}=\left\{ ({\varvec{x}}_{t}^{i})\right\} _{i=1}^{n_{t}}\), where \(n_{t}\) is the number of unlabeled target domain samples and \({\varvec{x}}_{t}\sim {\mathbb {D}}_{T}\).

As shown in Fig. 2, the underlying network of our model consists of three components, including two feature extractors \(E_{s}\) and \(E_{t}\) that extract feature representations \({\varvec{h}}\), a classifier \(C_{s}\) that maps the feature representations \({\varvec{h}}\) to the classification logits \({\varvec{p}}\), and a domain discriminator \(C_{d}\) that maps the feature representations \({\varvec{h}}\) to the domain probabilities \({\varvec{q}}\).

Fig. 2 The overall framework of our proposed method, where \(E_{s}\) and \(E_{t}\) are the feature extractors, \(C_{s}\) is the classifier, and \(C_{d}\) is the domain discriminator; \({\varvec{h}}\) denotes the feature representations, \({\varvec{p}}\) denotes the classification logits, and \({\varvec{q}}\) denotes the domain probabilities

3.2 Model-oriented knowledge adaptation

To make the target encoder in training learn discriminative knowledge from the trained source encoder, we design a model-oriented knowledge adaptation module, which includes an intermediate feature representations similarity constraint (ISC) and a final classification probabilities similarity constraint (FSC).

3.2.1 Intermediate similarity constraints (ISC) based on the reproducing kernel Hilbert space

The source and target encoders map the source data to a common feature space to obtain the feature representations, which are then mapped into a reproducing kernel Hilbert space (RKHS) by kernel functions to increase their matching probability in the high-dimensional space. Still, there is no known pairwise correspondence between them, so pairwise testing is not possible. We therefore formulate the problem as a two-sample test and measure the distance by the maximum mean discrepancy (MMD). By minimizing MMD to reduce the distance between intermediate feature representations, the knowledge of the source model is transferred to the target model, resulting in better feature representations and improved generalization ability of the model.

Given the source data \({\varvec{x}}_{s}\sim {\mathbb {D}}_{S}\), we can obtain the feature representations \({\varvec{h}}_{s}=E_{s}({\varvec{x}}_{s})\) and \(\varvec{\hat{h}}_{t}=E_{t}({\varvec{x}}_{s})\). Let \({\varvec{H}}_{S}=\{({\varvec{h}}_{s}^{i})\}_{i=1}^{n}\sim {\mathbb {H}}_{S}\), \({\varvec{H}}_{T}=\{(\varvec{\hat{h}}_{t}^{i})\}_{i=1}^{n}\sim {\mathbb {H}}_{T}\), where \({\mathbb {H}}_{S}\) and \({\mathbb {H}}_{T}\) are the respective feature distributions and n is the set cardinality. The distance between \({\mathbb {H}}_{S}\) and \({\mathbb {H}}_{T}\) can then be defined as follows:

$$\begin{aligned} \begin{aligned}&\textrm{MMD}[{\mathcal {F}},{\varvec{h}}_{s},\varvec{\hat{h}}_{t}]\\&\quad =\mathop {\sup }_{\begin{array}{c} f\in {\mathcal {F}}\\ \Vert f\Vert _{\mathcal {H}}\le 1 \end{array}}\left( {\mathbb {E}}_{{\varvec{h}}_{s}\sim {\mathbb {H}}_{S}}f({\varvec{h}}_{s})-{\mathbb {E}}_{\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{T}}f(\varvec{\hat{h}}_{t})\right) \\&\quad =\mathop {\sup }_{\begin{array}{c} f\in \mathcal {F}\\ \Vert f\Vert _{\mathcal {H}}\le 1 \end{array}}\left( {\mathbb {E}}_{{\varvec{h}}_{s}\sim {\mathbb {H}}_{S}}\langle \phi ({\varvec{h}}_{s}),f\rangle _{\mathcal {H}}-{\mathbb {E}}_{\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{T}}\langle \phi (\varvec{\hat{h}}_{t}),f\rangle _{\mathcal {H}}\right) \\&\quad =\mathop {\sup }_{\begin{array}{c} f\in \mathcal {F}\\ \Vert f\Vert _{\mathcal {H}}\le 1 \end{array}}\left\langle {\mathbb {E}}_{{\varvec{h}}_{s}\sim {\mathbb {H}}_{S}}\phi ({\varvec{h}}_{s})-{\mathbb {E}}_{\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{T}}\phi (\varvec{\hat{h}}_{t}),f\right\rangle _{\mathcal {H}}\\&\quad =\left\| {\mathbb {E}}_{{\varvec{h}}_{s}\sim {\mathbb {H}}_{S}}\phi ({\varvec{h}}_{s})-{\mathbb {E}}_{\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{T}}\phi (\varvec{\hat{h}}_{t})\right\| _{\mathcal {H}}, \end{aligned} \end{aligned}$$
(1)

where \(\mathcal {H}\) is an RKHS, \(\mathcal {F}=\{f:\Vert f\Vert \le 1\}\) is the function class, and \(\phi (\cdot ):\mathcal {X}\rightarrow \mathcal {H}\) is an infinite-dimensional feature map. In addition, the feature map \(\phi (\cdot )\) corresponds to a positive semi-definite kernel k such that \(k({\varvec{u}},{\varvec{v}})=\langle \phi ({\varvec{u}}),\phi ({\varvec{v}})\rangle _{\mathcal {H}}\), so Eq. (1) can be rewritten in terms of k. Therefore, the objective function of the “intermediate” similarity constraint can be written as:

$$\begin{aligned} \begin{aligned}&\min _{E_{t}}\mathcal {L}_{\textrm{ISC}}({\varvec{x}}_{s})\\&\quad =\textrm{MMD}^{2}[\mathcal {F},{\varvec{h}}_{s},\varvec{\hat{h}}_{t}]\\&\quad =\left\| {\mathbb {E}}_{{\varvec{h}}_{s}\sim {\mathbb {H}}_{S}}\phi ({\varvec{h}}_{s})-{\mathbb {E}}_{\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{T}}\phi (\varvec{\hat{h}}_{t})\right\| _{\mathcal {H}}^{2}\\&\quad ={\mathbb {E}}_{{\varvec{h}}_{s},{\varvec{h}}_{s}^{\prime }\sim {\mathbb {H}}_{S},{\mathbb {H}}_{S}}k({\varvec{h}}_{s},{\varvec{h}}_{s}^{\prime })\\&\quad -2{\mathbb {E}}_{{\varvec{h}}_{s},\varvec{\hat{h}}_{t}\sim {\mathbb {H}}_{S},{\mathbb {H}}_{T}}k({\varvec{h}}_{s},\varvec{\hat{h}}_{t})\\ {}&\qquad +{\mathbb {E}}_{\varvec{\hat{h}}_{t},\varvec{\hat{h}}_{t}^{\prime }\sim {\mathbb {H}}_{T},{\mathbb {H}}_{T}}k(\varvec{\hat{h}}_{t},\varvec{\hat{h}}_{t}^{\prime }), \end{aligned} \end{aligned}$$
(2)

where \({\varvec{h}}_{s}^{\prime }\) is an independent copy of \({\varvec{h}}_{s}\) with the same distribution, and \(\varvec{\hat{h}}_{t}^{\prime }\) is an independent copy of \(\varvec{\hat{h}}_{t}\). For the kernel function k, we use a linear combination (here with unit weights) of multiple Gaussian kernels over a range of bandwidths, namely \(k({\varvec{u}},{\varvec{v}})=\sum \nolimits _{i=1}^{m}\exp \left\{ -\frac{1}{2\delta _{i}}\Vert {\varvec{u}}-{\varvec{v}}\Vert _{2}^{2}\right\}\), where m is the number of kernel functions and \(\delta _{i}\) is the bandwidth parameter of the i-th Gaussian kernel. As written, \(\delta _{i}\) enters the exponent in the position of the variance, i.e., the square of the standard deviation.
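To make Eq. (2) concrete, the following is a minimal sketch of a sample-based estimate of the squared MMD with the multi-Gaussian kernel above. Several things here are assumptions rather than details given in the text: the function names `multi_gaussian_kernel` and `mmd_squared`, the default bandwidths `(1.0, 2.0, 4.0)`, and the choice of the simple biased estimator that averages over all pairs, including each sample paired with itself.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def multi_gaussian_kernel(u: Vector, v: Vector, deltas: Sequence[float]) -> float:
    """k(u, v) = sum_i exp(-||u - v||^2 / (2 * delta_i)), as the kernel is written in the text."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return sum(math.exp(-sq_dist / (2.0 * delta)) for delta in deltas)


def mmd_squared(
    h_s: Sequence[Vector],
    h_t: Sequence[Vector],
    deltas: Sequence[float] = (1.0, 2.0, 4.0),  # hypothetical bandwidths
) -> float:
    """Biased estimate of Eq. (2): E k(s, s') - 2 E k(s, t) + E k(t, t')."""

    def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
        total = sum(multi_gaussian_kernel(x, y, deltas) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return mean_kernel(h_s, h_s) - 2.0 * mean_kernel(h_s, h_t) + mean_kernel(h_t, h_t)


# Toy 2-D "features": E_s(x_s) versus two candidate outputs of E_t(x_s).
source_feats = [[0.0, 0.0], [1.0, 0.0]]
near_feats = [[0.1, 0.0], [1.1, 0.0]]
far_feats = [[5.0, 5.0], [6.0, 5.0]]
print(mmd_squared(source_feats, near_feats))  # small: the two sets nearly coincide
print(mmd_squared(source_feats, far_feats))   # larger: the two sets are far apart
```

In a real training setup the same quantity would be computed on mini-batches of encoder outputs inside an automatic-differentiation framework, so that its gradient with respect to the target encoder is available.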

3.2.2 Final similarity constraints (FSC) based on the knowledge distillation

The trained classifier will receive the feature representations and map them to the classification logits for judgment. The traditional training directly takes one-hot encoded labels as the target, which is prone to result in overfitting during repeated training epochs. To alleviate this problem, we utilize knowledge distillation (KD) to control the degree of knowledge transfer by producing a softer probability distribution. Unlike the hard label, which focuses only on the label value of maximum probability, the soft label describes the probability distribution by multiple probability values, which can better handle noise and uncertainty. Moreover, it contains information about the correlation between different classes, which can help to increase the inter-class distance and thus reduce the intra-class distance.

Given the acquired feature representations \({\varvec{h}}_{s}\) and \(\varvec{\hat{h}}_{t}\), the trained classifier \(C_{s}\) maps them to the classification logits \({\varvec{p}}_{s}=C_{s}({\varvec{h}}_{s})\) and \(\varvec{\hat{p}}_{t}=C_{s}(\varvec{\hat{h}}_{t})\), respectively. As in KD, we obtain the softer classification probabilities \({\varvec{P}}=\sigma ({\varvec{p}}_{s}/T)\) and \({\varvec{Q}}=\sigma (\varvec{\hat{p}}_{t}/T)\), where \(\sigma (\cdot )\) is the softmax function and T is the temperature that controls the degree of knowledge transfer. Therefore, the objective function of the “final” similarity constraint can be defined using the Kullback–Leibler divergence between \({\varvec{P}}\) and \({\varvec{Q}}\):

$$\begin{aligned} \begin{aligned}&\min _{E_{t}}\mathcal {L}_{\textrm{FSC}}({\varvec{x}}_{s})\\&\quad =T^{2}\cdot \textrm{KL}({\varvec{P}}\Vert {\varvec{Q}})\\&\quad =T^{2}\cdot {\mathbb {E}}_{{\varvec{x}}_{s}\sim {\mathbb {D}}_{S}}\sum _{k=1}^{K}P_{k}\log \frac{P_{k}}{Q_{k}}, \end{aligned} \end{aligned}$$
(3)

where \({\varvec{P}}\triangleq [P_{1},\cdots ,P_{K}]\in {\mathbb {R}}^{1\times {K}}\), \(\sum _{k=1}^{K}P_{k}=1\) and \({\varvec{Q}}\triangleq [Q_{1},\cdots ,Q_{K}]\in {\mathbb {R}}^{1\times {K}}\), \(\sum _{k=1}^{K}Q_{k}=1\), \(P_{k}\) and \(Q_{k}\) are the probabilities of the k-th class, and K is the number of classes.
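As an illustration of Eq. (3), the sketch below computes the temperature-scaled KL term for a single sample. The function names are hypothetical, the default temperature of 20 is the value reported in Sect. 4.2, and the expectation over source samples in Eq. (3) would correspond to averaging this quantity over a batch.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """sigma(p / T): dividing the logits by T > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fsc_loss(
    source_logits: Sequence[float],
    target_logits: Sequence[float],
    temperature: float = 20.0,
) -> float:
    """T^2 * KL(P || Q) for one sample, with P from the source path and Q from the target path."""
    p = softmax_with_temperature(source_logits, temperature)
    q = softmax_with_temperature(target_logits, temperature)
    kl = sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))
    return temperature ** 2 * kl


# Two-class toy logits: the target path disagrees with the source path.
print(softmax_with_temperature([2.0, -1.0], 1.0))   # sharp distribution at T = 1
print(softmax_with_temperature([2.0, -1.0], 20.0))  # much softer at T = 20
print(fsc_loss([2.0, -1.0], [-1.0, 2.0]))           # positive penalty for the disagreement
```

The factor \(T^{2}\) is the usual KD scaling that keeps gradient magnitudes comparable across temperatures, since the gradients of the softened KL term shrink roughly as \(1/T^{2}\).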

In summary, the inputs to the source and target encoders are the same, and the target encoder imitates the source encoder at both the “intermediate” and the “final” level, thereby transferring discriminative knowledge for conditional distribution alignment.

3.3 Adversarial domain adaptation with model-oriented knowledge adaptation

To compensate for the lack of discriminative knowledge in adversarial domain adaptation, we propose Moka-ADA, which uses model-oriented knowledge adaptation so that both domain-invariant knowledge and discriminative knowledge are learned. Figure 2 illustrates the overall framework of our proposed model, which consists of three steps. Step 1: supervised training of the source encoder \(E_{s}\) and classifier \(C_{s}\) on the source data. Step 2: adversarial training of the target encoder \(E_{t}\) and discriminator \(C_{d}\) to align the source and target domain distributions. Step 3: inference with the trained target encoder \(E_{t}\) and classifier \(C_{s}\) on the target data.

In Step 1, we aim to train a well-performing source model using labeled data from the source domain, which serves as a “teacher” for subsequent training of the target model. The source error can be minimized through supervised training of the source encoder \(E_{s}\) and classifier \(C_{s}\) on \(({\varvec{x}}_{s},y_{s})\) by using the Cross-Entropy loss:

$$\begin{aligned} \begin{aligned}&\min _{E_{s},C_{s}}\mathcal {L}_{\textrm{cls}}({\varvec{x}}_{s},y_{s})\\&\quad ={\mathbb {E}}_{({\varvec{x}}_{s},y_{s})\sim {\mathbb {D}}_{S}}-\sum _{k=1}^{K}\mathbbm {1}_{\left[ k=y_{s}\right] }\log \sigma ({\varvec{p}}_{s})_{k}, \end{aligned} \end{aligned}$$
(4)

where \({\varvec{p}}_{s}=C_{s}({\varvec{h}}_{s})\), \({\varvec{h}}_{s}=E_{s}({\varvec{x}}_{s})\), \(\sigma (\cdot )\) is the softmax function, and K is the number of classes.

Then, the source encoder parameters are frozen, which fixes the source domain feature distribution. We thus obtain the reference distribution for adversarial training, which is analogous to the real image distribution in the GAN setting [9]. Prior to adversarial training, we initialize the target encoder weights with the source encoder weights, as this practice can improve the convergence properties.

In Step 2, the discriminator \(C_{d}\) aims to infer the domain probabilities \({\varvec{q}}_{s}\) or \({\varvec{q}}_{t}\) of a sample, i.e., whether it comes from the source or the target domain. Thus, the discriminator \(C_{d}\) is optimized on \(({\varvec{x}}_{s},y_s^d=0)\) and \(({\varvec{x}}_{t},y_t^d=1)\):

$$\begin{aligned} \begin{aligned}&\min _{C_{d}}\mathcal {L}_{\textrm{s}}^{\textrm{dis}}({\varvec{x}}_{s},y_s^d)\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{s}\sim {\mathbb {D}}_{S}}-[y_s^d\log {\varvec{q}}_{s}+(1-y_s^d)\log (1-{\varvec{q}}_{s})]\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{s}\sim {\mathbb {D}}_{S}}-\log (1-{\varvec{q}}_{s}), \end{aligned} \end{aligned}$$
(5)

where \({\varvec{q}}_{s}=C_{d}({\varvec{h}}_{s})\), \({\varvec{h}}_{s}=E_{s}({\varvec{x}}_{s})\), and

$$\begin{aligned} \begin{aligned}&\min _{C_{d}}\mathcal {L}_{\textrm{t}}^{\textrm{dis}}({\varvec{x}}_{t},y_t^d)\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{t}\sim {\mathbb {D}}_{T}}-[y_t^d\log {\varvec{q}}_{t}+(1-y_t^d)\log (1-{\varvec{q}}_{t})]\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{t}\sim {\mathbb {D}}_{T}}-\log {\varvec{q}}_{t}, \end{aligned} \end{aligned}$$
(6)

where \({\varvec{q}}_{t}=C_{d}({\varvec{h}}_{t})\), \({\varvec{h}}_{t}=E_{t}({\varvec{x}}_{t})\).

According to Eqs. (5) and (6), we can obtain the final objective function of the discriminator \(C_{d}\):

$$\begin{aligned} \begin{aligned}&\min _{C_{d}}\mathcal {L}_{\textrm{dis}}({\varvec{x}}_{s},{\varvec{x}}_{t},y_s^d,y_t^d)\\&\quad =\min _{C_{d}}\left[ \dfrac{\mathcal {L}_{\textrm{s}}^{\textrm{dis}}({\varvec{x}}_{s},y_s^d)+\mathcal {L}_{\textrm{t}}^{\textrm{dis}}({\varvec{x}}_{t},y_t^d)}{2}\right] \\&\quad =\dfrac{{\mathbb {E}}_{{\varvec{x}}_{s}\sim {\mathbb {D}}_{S}}-\log (1-{\varvec{q}}_{s})+{\mathbb {E}}_{{\varvec{x}}_{t}\sim {\mathbb {D}}_{T}}-\log {\varvec{q}}_{t}}{2}. \end{aligned} \end{aligned}$$
(7)

To adversarially train the target encoder \(E_{t}\), it is encouraged to fool the discriminator \(C_{d}\) by reversing the domain label. Thus, the target encoder \(E_{t}\) is optimized on \(({\varvec{x}}_{t},y_s^d=0)\):

$$\begin{aligned} \begin{aligned}&\min _{E_{t}}\mathcal {L}_{\textrm{gen}}({\varvec{x}}_{t},y_s^d)\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{t}\sim {\mathbb {D}}_{T}}-[y_s^d\log {\varvec{q}}_{t}+(1-y_s^d)\log (1-{\varvec{q}}_{t})]\\&\quad ={\mathbb {E}}_{{\varvec{x}}_{t}\sim {\mathbb {D}}_{T}}-\log (1-{\varvec{q}}_{t}), \end{aligned} \end{aligned}$$
(8)

where \({\varvec{q}}_{t}=C_{d}({\varvec{h}}_{t})\), \({\varvec{h}}_{t}=E_{t}({\varvec{x}}_{t})\).
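The two adversarial objectives in Eqs. (7) and (8) can be written compactly once the discriminator outputs are available. The sketch below assumes, as in Eqs. (5) and (6), that each value of \({\varvec{q}}\) is the predicted probability of the target domain (label 1). The function names and the small constant `EPS` are illustrative additions and are not part of the original formulation.

```python
import math
from typing import Sequence

EPS = 1e-12  # numerical guard against log(0); an assumption, not in Eqs. (5)-(8)


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def discriminator_loss(q_source: Sequence[float], q_target: Sequence[float]) -> float:
    """Eq. (7): source features carry label 0, target features carry label 1."""
    loss_source = _mean([-math.log(max(1.0 - q, EPS)) for q in q_source])  # Eq. (5)
    loss_target = _mean([-math.log(max(q, EPS)) for q in q_target])        # Eq. (6)
    return 0.5 * (loss_source + loss_target)


def target_encoder_adversarial_loss(q_target: Sequence[float]) -> float:
    """Eq. (8): target features are scored against the reversed label 0."""
    return _mean([-math.log(max(1.0 - q, EPS)) for q in q_target])


# A confident, correct discriminator: low loss for C_d, high loss for E_t.
print(discriminator_loss([0.01], [0.99]))
print(target_encoder_adversarial_loss([0.99]))
# A fully confused discriminator: both losses equal log(2).
print(discriminator_loss([0.5], [0.5]), target_encoder_adversarial_loss([0.5]))
```

The reversed label in Eq. (8) means the target encoder is rewarded when the discriminator mistakes target features for source features, which is what pulls the two feature distributions together.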

Based on Eq. (2) and Eq. (3) in Sect. 3.2 and Eq. (8), the final objective function for training the target encoder \(E_{t}\) can be defined as:

$$\begin{aligned} \begin{aligned}&\min _{E_{t}}\mathcal {L}_{\textrm{tgt}}({\varvec{x}}_{s},{\varvec{x}}_{t},y_s^d)\\&\quad =\min _{E_{t}}[\mathcal {L}_{\textrm{gen}}({\varvec{x}}_{t},y_s^d)+\mathcal {L}_{\textrm{ISC}}({\varvec{x}}_{s})+\mathcal {L}_{\textrm{FSC}}({\varvec{x}}_{s})]. \end{aligned} \end{aligned}$$
(9)

Through Eq. (7) and Eq. (9), the discriminator \(C_{d}\) and target encoder \(E_{t}\) are alternately optimized in a two-player adversarial game similar to GANs [9], as in the ADDA framework [8].

In Step 3, we can finally use the trained target encoder \(E_{t}\) and classifier \(C_{s}\) to make inferences on the target data used for testing, whose sentiment polarity label can be predicted as below:

$$\begin{aligned} \begin{aligned} \hat{y}_{t}=\mathop {\arg \max }{\varvec{p}}_{t}, \end{aligned} \end{aligned}$$
(10)

where \({\varvec{p}}_{t}=C_{s}({\varvec{h}}_{t})\), \({\varvec{h}}_{t}=E_{t}({\varvec{x}}_{t})\).

The overall iterative training procedure of Moka-ADA is summarized in Algorithm 1.
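As a rough, framework-agnostic sketch of the alternating optimization in Step 2, the loop below calls one discriminator update and one target encoder update per mini-batch pair. It is not a reproduction of Algorithm 1: the function `train_adversarial`, the two callback parameters, and the choice to update the discriminator first within each iteration are assumptions made for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple

Batch = Any  # placeholder type for a mini-batch of sentences


def train_adversarial(
    batches: Sequence[Tuple[Batch, Batch]],
    discriminator_step: Callable[[Batch, Batch], float],
    target_encoder_step: Callable[[Batch, Batch], float],
    epochs: int = 1,
) -> List[Tuple[float, float]]:
    """Alternate the two updates and return the per-iteration losses.

    discriminator_step is expected to minimize Eq. (7) w.r.t. C_d only, and
    target_encoder_step to minimize Eq. (9) w.r.t. E_t only; E_s and C_s stay frozen.
    """
    history: List[Tuple[float, float]] = []
    for _ in range(epochs):
        for source_batch, target_batch in batches:
            dis_loss = discriminator_step(source_batch, target_batch)
            tgt_loss = target_encoder_step(source_batch, target_batch)
            history.append((dis_loss, tgt_loss))
    return history


# Usage with stub callbacks that only log the call order.
log: List[str] = []


def fake_discriminator_step(xs: Batch, xt: Batch) -> float:
    log.append("C_d")
    return 0.5


def fake_target_encoder_step(xs: Batch, xt: Batch) -> float:
    log.append("E_t")
    return 0.25


losses = train_adversarial(
    [("xs0", "xt0"), ("xs1", "xt1")], fake_discriminator_step, fake_target_encoder_step
)
print(log)     # alternating C_d / E_t updates
print(losses)  # one (dis_loss, tgt_loss) pair per iteration
```

Passing the update steps as callables keeps the sketch independent of any particular deep learning library; in practice each callback would wrap a forward pass, a backward pass, and an optimizer step.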


3.4 Theoretical analysis

We provide a theoretical understanding of why our method can enhance adversarial domain adaptation based on the domain adaptation theory from Ben-David et al. [10, 11], a key outcome of which is the following theorem:


Theorem 1. Let \(\mathcal {H}\) be the hypothesis space, and let \(\epsilon _{S}\) and \(\epsilon _{T}\) be the generalization errors on the source domain \({\mathbb {D}}_{S}\) and the target domain \({\mathbb {D}}_{T}\), respectively. Then for any \(h\in \mathcal {H}\), we have

$$\begin{aligned} \begin{aligned} \epsilon _{T}(h)\le \epsilon _{S}(h)+d_{\mathcal {H}\Delta \mathcal {H}}\left( {\mathbb {D}}_{S},{\mathbb {D}}_{T}\right) +\lambda , \end{aligned} \end{aligned}$$
(11)

where \(d_{\mathcal {H}\Delta \mathcal {H}}\) is the \(\mathcal {H}\Delta \mathcal {H}\)-divergence [31] to measure the domain discrepancy between \({\mathbb {D}}_{S}\) and \({\mathbb {D}}_{T}\), defined as:

$$\begin{aligned} \begin{aligned} d_{\mathcal {H}\Delta \mathcal {H}}\triangleq \sup _{h,h^{\prime }\in \mathcal {H}}|{\mathbb {E}}_{{{\textbf {x}}}_{s}\sim {\mathbb {D}}_{S}}\left[ h({{\textbf {x}}}_{s})\ne {h^{\prime }({{\textbf {x}}}_{s})}\right] \\ -{\mathbb {E}}_{{{\textbf {x}}}_{t}\sim {\mathbb {D}}_{T}}\left[ h({{\textbf {x}}}_{t})\ne {h^{\prime }({{\textbf {x}}}_{t})}\right] |, \end{aligned} \end{aligned}$$
(12)

where h and \(h^{\prime }\) are two hypotheses in \(\mathcal {H}\), and \(\lambda\) is the error of the ideal joint hypothesis \(h^{*}\), where \(h^{*}\) is defined as \(h^{*}=\mathop {\arg \min }\limits _{h\in \mathcal {H}}\epsilon _{S}(h)+\epsilon _{T}(h)\), such that

$$\begin{aligned} \begin{aligned} \lambda =\epsilon _{S}(h^{*})+\epsilon _{T}(h^{*}). \end{aligned} \end{aligned}$$
(13)

From Eq. (11), the generalization error on the target domain \(\epsilon _{T}(h)\) is upper bounded by a combination of the generalization error on the source domain \(\epsilon _{S}(h)\), the domain discrepancy \(d_{\mathcal {H}\Delta \mathcal {H}}\), and the error of the ideal joint hypothesis \(\lambda\). First, it is easy to minimize \(\epsilon _{S}(h)\) by supervised training with labeled source data. Then, \(d_{\mathcal {H}\Delta \mathcal {H}}\) can be reduced by aligning the marginal distribution via adversarial domain adaptation. Moreover, the dual structure with similarity constraints can yield lower \(\lambda\) and further reduce \(d_{\mathcal {H}\Delta \mathcal {H}}\) by acquiring discriminative knowledge for conditional distribution alignment.

4 Experiments

Table 1 Statistics of the Amazon reviews benchmark datasets

4.1 Datasets

We evaluate our method on the Amazon reviews benchmark datasets collected by Blitzer et al. [32], which are publicly available and widely used for the CDSA task. They include reviews from four product domains: Books (B), DVDs (D), Electronics (E), and Kitchen appliances (K). Each domain contains 2000 labeled samples, of which 1000 are negative and 1000 are positive. Following previous works [22, 33], we construct 12 cross-domain tasks of source-target domain pairs. For each domain pair, 1600 labeled source samples and the same number of unlabeled target samples are used for training, and the remaining 400 labeled source samples are used for validation. We then test on all the labeled target samples. Table 1 lists the relevant statistics.
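The task construction described above can be checked with a few lines of code. The sketch below enumerates the ordered source-target pairs and restates the split sizes given in the text; the variable names and the `"B -> D"` string format are illustrative choices.

```python
from itertools import permutations

# Domain abbreviations as used in the text.
DOMAINS = {"B": "Books", "D": "DVDs", "E": "Electronics", "K": "Kitchen appliances"}

LABELED_PER_DOMAIN = 2000                                # 1000 negative + 1000 positive
SOURCE_TRAIN = 1600                                      # labeled source samples for training
SOURCE_VALIDATION = LABELED_PER_DOMAIN - SOURCE_TRAIN    # remaining labeled source samples
TARGET_UNLABELED_TRAIN = SOURCE_TRAIN                    # "the same number" of unlabeled target samples
TARGET_TEST = LABELED_PER_DOMAIN                         # all labeled target samples

# Every ordered pair of distinct domains is one cross-domain task.
tasks = [f"{source} -> {target}" for source, target in permutations(DOMAINS, 2)]

print(len(tasks))          # 4 * 3 = 12 tasks
print(tasks)
print(SOURCE_VALIDATION)   # 400
```

Because the pairs are ordered, B \(\rightarrow\) D and D \(\rightarrow\) B count as different tasks, which is why four domains yield twelve tasks rather than six.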

4.2 Implementation details

We adopt SentiX as the context feature extractor, which is a sentiment-aware pre-trained language model proposed by Zhou et al. [23]. For all experiments, we set the maximum sequence length to 256 and the batch size to 32. The optimizer is Adam with learning rate \(10^{-5}\), \(\beta _{1}\) = 0.9, and \(\beta _{2}\) = 0.999. During supervised training, we train for 5 epochs and use the validation set to choose the epoch at which to save the model. For adversarial training, we train for 1 to 5 epochs and report the average results; for more stable adversarial training, we empirically set the gradient norm to 1.0, the clip value to 0.01, and the knowledge distillation temperature to 20.

Table 2 Accuracy results on 12 domain pairs from the Amazon reviews benchmark datasets. (The best performance is indicated in bold.)

4.3 Compared methods

We consider the following methods for comparison: PERL [34], DAAT [21], p+CFd [35], UDALM [22], DA-SDS [33], and AdSPT [36]. We present the best results reported in the original papers of these approaches. In addition, we adopt the SentiX model as a baseline and design several variants of our model:

  • Baseline: The sentiment-aware pre-trained language model SentiX.

  • ISC-ADA: A variant of the proposed model, which only imposes similarity constraints on intermediate feature representations.

  • FSC-ADA: A variant of the proposed model, which only imposes similarity constraints on final classification probabilities.

  • Moka-ADA: The full model introduced in Sect. 3.3.

Fig. 3 Accuracy results of our methods compared to the baseline

Fig. 4 Feature visualization for the B \(\rightarrow\) D task using the t-SNE algorithm

4.4 Experimental results

In Table 2, we report the accuracy results of the compared methods on 12 cross-domain tasks. Compared with most other works, the baseline achieves better performance, which is mainly attributed to the sentiment knowledge it learns through pre-training on large-scale review datasets. Notably, our Moka-ADA improves the average accuracy by 1.57% over the baseline and by 6.75%, 4.13%, 3.62%, 2.51%, 2.77%, and 1.11% over the other compared methods, in the order listed in Sect. 4.3.
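The average accuracies implied by these improvements can be recovered by simple subtraction. The sketch below rests on two assumptions: the improvements are absolute percentage points, and they follow the order in which the compared methods are listed in Sect. 4.3. Table 2 remains the authoritative source for the actual figures.

```python
MOKA_ADA_AVG = 94.25  # average accuracy (%) of Moka-ADA reported in the text

# Improvements (%) quoted in Sect. 4.4. Pairing them with method names assumes
# the order of the method list in Sect. 4.3.
improvements = {
    "Baseline (SentiX)": 1.57,
    "PERL": 6.75,
    "DAAT": 4.13,
    "p+CFd": 3.62,
    "UDALM": 2.51,
    "DA-SDS": 2.77,
    "AdSPT": 1.11,
}

# Implied average accuracy of each compared method, assuming absolute differences.
implied_averages = {
    name: round(MOKA_ADA_AVG - gain, 2) for name, gain in improvements.items()
}

for name, accuracy in implied_averages.items():
    print(f"{name}: {accuracy:.2f}")
```

Under these assumptions the strongest prior method sits 1.11 points below Moka-ADA, which agrees with the improvement over the state of the art claimed in the contributions.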

As shown in Fig. 3, our methods outperform the baseline in almost all domain pairs, which suggests that both ISC-ADA and FSC-ADA can effectively apply similarity constraints to enhance adversarial domain adaptation. Compared to ISC-ADA and FSC-ADA, the full Moka-ADA performs better on 7 of the 12 domain pair tasks and has smaller standard deviations in most cases, indicating greater robustness.

4.5 Visualization of features

To more intuitively assess the effect of model-oriented knowledge adaptation on the feature distribution, we further visualize the feature representations of the source and target data for the B \(\rightarrow\) D task. The visualization of the feature representations is performed using the t-SNE algorithm to transform the 768-dimensional feature space into a two-dimensional space. In Fig. 4, the visualization results of Baseline, ISC-ADA, FSC-ADA, and Moka-ADA are presented separately.

In Fig. 4a, we observe that samples of different polarities in the source domain are well separated, while for the target domain, some samples of different polarities are mixed together with unclear decision boundaries. In Fig. 4b, the situation improves and samples of the same polarity across domains tend to be consistent, indicating that ISC-ADA reduces the distance between feature representations across domains and thereby reduces the discrepancy between domain distributions. In Fig. 4c, although samples of the same polarity across domains are less aligned, FSC-ADA increases the inter-class distance and reduces the intra-class distance, making the decision boundaries clearer. In Fig. 4d, Moka-ADA not only makes samples of the same polarity across domains compact and aligned, but also yields better decision boundaries.

4.6 Ablation studies

Fig. 5 Feature visualization of the Only-ADA at different adversarial training epochs for the K \(\rightarrow\) B task

Table 3 Experimental results of the Only-ADA
Table 4 Experimental results of the Moka-ADA

To analyze the effect of our method on adversarial training, we conduct ablation experiments, and the results are shown in Tables 3 and 4, where Only-ADA denotes adversarial training without model-oriented knowledge adaptation. By comparison, our methods remain effective and robust, while the performance of Only-ADA decreases dramatically as the number of training epochs increases.

For further study, we perform feature visualization of the Only-ADA for the K \(\rightarrow\) B task as shown in Fig. 5. In the first subplot, all samples belong to four clusters, which indicates that adversarial training brings domain awareness to the model. Nonetheless, in the remaining subplots, it appears that samples of different polarities in the target domain gradually mix into the same cluster, which is a mode collapse phenomenon in adversarial training. In contrast, our models have better stability and flexibility of adversarial training, which effectively prevents the mode collapse phenomenon.

5 Conclusion and future work

In this study, we propose a novel method, Moka-ADA, for cross-domain sentiment analysis. It aims to learn domain-invariant and discriminative knowledge to ensure that the marginal and conditional distributions are aligned simultaneously. The model-oriented knowledge adaptation module we designed can effectively facilitate knowledge transfer. Extensive experiments show that our Moka-ADA outperforms the state-of-the-art result on the Amazon reviews benchmark datasets. Theoretical analysis and ablation studies verify the reasonableness and effectiveness of our method.

In the future, we would like to adapt our method to more realistic and challenging scenarios, such as multi-source domains [37] and sparsely labeled source domains [38], and to further explore applications to other cross-domain tasks in natural language processing and computer vision.