1 Introduction

The growing popularity of the Web for digital communication started with the first social media platform, Six Degrees, in 1997 [10] and continues to this day, with over 3.7 billion people, roughly 48% of the world population, exchanging information on the social Web [30]. Owing to this popularity and the resulting growth in reviews of products and services, a wealth of useful information is available to businesses and consumers alike. However, the sheer volume of reviews makes it difficult, or even impossible, to analyse them by hand.

Aspect-Based Sentiment Analysis (ABSA) classifies a person’s feeling towards specific aspects [31]. ABSA is concerned with target extraction (TE), aspect detection (AD), and target sentiment classification (SC) [27]. This paper concentrates on the last task, SC, which classifies the user’s sentiment with respect to the aspects and is called Aspect-Based Sentiment Classification (ABSC) [2]. As an example, consider the sentence “the atmosphere and service were terrible, but the food was good”. Here, “atmosphere”, “service”, and “food” are target words, and “terrible” and “good” are the expressions that provide the context needed to label the targets as positive, neutral, or negative. As one can notice, the context around the target words is essential to capture the explicit aspect sentiments.

There are multiple practical applications of ABSC. By evaluating which features need improvement, a company can apply specific enhancements to its products, efficiently improving its customer services. At the same time, social media platforms such as Facebook and Twitter can apply ABSC to tweets and messages and sell this valuable information to the marketing departments of multinationals. In addition, financial firms can apply ABSC to gauge the sentiment of investors towards the economic market and thereby predict future stock movements. This would have been extremely beneficial for Melvin Capital before the GameStop phenomenon on Reddit and might have prevented enormous losses [3]. Last, knowing the opinions of previous customers can help potential clients make better-informed buying decisions.

An issue that has gained much attention recently is the limited availability of labeled data. Generating new labeled data for specific domains is expensive, time-consuming, and requires manual labour. To decrease the dependence on labeled data, transfer learning, also called cross-domain learning, is a valuable solution [21]. This approach trains a model on a related source domain and then predicts for a different target domain. Several state-of-the-art cross-domain models rely on the Domain-Adversarial Neural Network (DANN) introduced in [8]. These neural networks are applied to diverse tasks, ranging from textual entailment analysis [14] to image classification [41]. However, little research is available on ABSA using Domain-Adversarial Training (DAT); to our knowledge, there is limited work on learning domain-invariant features for Aspect-Based Sentiment Classification (ABSC) using DANN. Early methods for sentiment classification apply a Support Vector Machine (SVM) [22] to predict the sentiments, but due to its shortcomings this approach was replaced by knowledge-based models [29] and deep learning algorithms [16]. Whereas deep learning methods are flexible, knowledge-based models require more manual labour but achieve better results [19]. Since a combination of both approaches benefits from the advantages of each, several researchers merge the methods into a hybrid model [32, 35].

The methodology proposed in this paper is based on the hybrid HAABSA++ model of [33], a state-of-the-art ABSC approach that produces excellent results on the commonly used SemEval 2015 [24] and 2016 [25] data sets, attaining classification accuracies of 81.7% and 87.0%, respectively. Our study, however, focuses on developing a neural network that can classify texts across multiple domains. For this reason, the knowledge-based ontology part of the HAABSA++ approach is eliminated, leaving the LCR-Rot-hop++ model based on [44]. This model is expanded by applying the structure of DAT as proposed by [8]. More specifically, the neural network is trained concurrently on labeled instances of a source domain and unlabeled instances of a target domain. We call the newly established system Domain Adversarial Training LCR-Rot-hop++, abbreviated as DAT-LCR-Rot-hop++. All source code and data can be retrieved from https://github.com/jorisknoester/DAT-LCR-Rot-hop-PLUS-PLUS.

The current work extends our previous work [15], which was dedicated to the applicability of DANN as a knowledge transfer method for ABSA domains without annotations. In addition to [15], we explore how sensitive our results are to the discriminator architecture, finding that an overly simple structure mainly leads to predicting the majority class. We also perform additional experiments on how the distribution of the training data affects the DANN performance. As expected, a DAT learning strategy requires a substantial amount of data for all class labels in order to avoid majority voting.

The rest of this paper is structured as follows. First, Section 2 discusses the relevant literature on ABSC, the components of the LCR-Rot-hop++ model, and the ideas behind transfer learning. Second, Section 3 provides a short description of the data together with some descriptive statistics. Third, Section 4 presents the methodology of this research, for which the results are reported in Section 5. Last, Section 6 provides our conclusion and suggestions for future work.

2 Related work

ABSC is concerned with classifying a person’s sentiment towards specific aspects in a sentence. One of the first works on sentiment analysis is presented in [23], and sentiment analysis has been a hot topic ever since. While traditional methods were mainly built on knowledge-based systems, which provide higher prediction scores for domain-specific documents [19], more recent ones tend to rely on deep learning solutions. Despite being considered alternative solutions, in [38] it is observed that the two approaches are in fact complementary. Subsequently, a hybrid method that incorporates both a domain ontology and a deep learning neural network is introduced in [28]. After several additional improvements, this resulted in the HAABSA++ method [33].

The HAABSA++ method has two steps: if the ontology is unable to provide reliable results, the LCR-Rot-hop++ neural network is used as a backup. As the topic of our work is transfer learning, we focus only on the neural network. LCR-Rot-hop++ is essentially a bi-directional Long Short-Term Memory (bi-LSTM) model accompanied by attention. The attention layer is able to focus on specific parts of a sentence, thereby limiting the influence of less important words. On top of this attention layer, a hierarchical attention layer is employed, enabling the model to process the text at the sentence level by bringing together the local sentence representations. The last component of the LCR-Rot-hop++ model is its rotary mechanism, which increases the interaction between targets and contexts by sharing information in order to capture the most indicative sentiment words.

Transfer learning [21] is a machine learning technique that focuses on storing information from one data set and applying this knowledge to another. Because obtaining annotated data is costly and time-consuming, new models must be developed that provide reliable results across multiple domains. The variety of transfer learning methods is continuously expanding. One proposed solution freezes the first layers of an LSTM neural network [4]. This approach builds on the observation that the source domain contains valuable universal information: higher-layer neurons tend to specialise towards the target domain, while the lower hidden layers generate more common word features [39]. A state-of-the-art method, BertMasker [40], uses the idea of masking [6] for its BERT Base network. BertMasker masks domain-related words, which renders the remaining sentence text domain-invariant while maintaining its most sentiment-explicit words. Another solution is a domain-adaptation network that creates counterfactual features [13]. These counterfactual representations reduce the inductive bias of the source domain; the designed positive (negative) counterfactuals bridge the dimensional gap between the positively (negatively) classified instances of the source and target domain. Different from the previous works, in this research we apply the methodology of Generative Adversarial Networks (GAN). This solution is introduced in [9] and has shown superior performance in a broad range of scientific areas, such as image classification [5], event detection [12], and textual cross-domain sentiment classification [43]. Applying the logic of GAN, DANN is introduced in [7]. A DANN model is able to perform machine learning tasks on unlabeled target domain data, while being trained on a labeled source domain with a relatively similar distribution, both in terms of polarity distribution and batch size [7]. The fact that one does not need annotated target data makes DANN very valuable for future cross-domain deep learning problems and is therefore an important contribution to the existing machine learning techniques.

To implement DANN for cross-domain sentiment classification, we rely on the features computed by the LCR-Rot-hop++ model. A study similar to ours, based on adversarial learning, is discussed in [37]. However, in [37], class and domain discrimination use features computed by different models. In contrast, in our work we consider class and domain discrimination to be complementary tasks that benefit from using shared features computed by the LCR-Rot-hop++ model.

3 Data

In this paper, two different data sets are used: the Semantic Evaluation (SemEval) 2014 data set [26] and the Amazon/LibraryThing (ALT) 2019 data set [18]. SemEval 2014 covers the restaurant and laptop domains, and ALT contains the data for the book domain. The reviews are divided into single sentences, consisting of one or more aspects and sentiment-bearing context words, which are used to classify the polarities (positive, neutral, or negative). The data are partitioned into a training and test set as follows. The aspects are divided into 80% training and 20% testing, and the training set in turn consists of 80% pure training data and 20% validation data used to tune the hyperparameters. The resulting split into training and test data is presented in Table 1.
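To make the partitioning concrete, the sketch below shows one way to implement the 80/20 train/test split with a further 80/20 train/validation split. The helper and variable names are hypothetical; this is not the authors' preprocessing script.

```python
import random

def split_aspects(aspects, seed=42):
    """Sketch of the 80/20 train-test split with a further 80/20
    train-validation split described above (hypothetical helper)."""
    random.seed(seed)
    shuffled = list(aspects)
    random.shuffle(shuffled)

    test_cut = int(0.8 * len(shuffled))
    train_full, test = shuffled[:test_cut], shuffled[test_cut:]

    val_cut = int(0.8 * len(train_full))
    train, val = train_full[:val_cut], train_full[val_cut:]
    return train, val, test

# Example with dummy aspect identifiers.
train, val, test = split_aspects(range(100))
print(len(train), len(val), len(test))  # 64 16 20
```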

Table 1 The distribution of the aspect sentiment polarities of the three domains

In terms of data pre-processing, the same approach is applied as introduced in [36] and [44]. Implicitly opinionated review sentences contain a sentiment, but the aspect term is missing, which makes it impossible to perform ABSC with the LCR-Rot-hop++ model. In addition, an aspect can have conflicting sentiments, which happens when there is both negative and positive context towards it. Both the aspects with conflicting sentiments and the implicitly opinionated sentences are removed from the data sets.

DAT-LCR-Rot-hop++ requires aspects from two domains, defined as the source domain and the target domain, to pass through the model. Only the source domain instances have a sentiment label attached to them. As a result, during training, the aspects of the source domain carry two labels, the domain class d and the sentiment category y, while the instances of the target domain only carry a domain class d. For testing, the polarity labels of the target domain aspects are predicted and compared with the true labels in order to evaluate the performance of DAT-LCR-Rot-hop++. Our proposed model is trained concurrently on one source and one target domain to obtain domain-invariant features. Results are presented for six different domain combinations (restaurant-laptop, restaurant-book, laptop-restaurant, laptop-book, book-restaurant, and book-laptop). As an example, the restaurant-laptop model means that the restaurant data set is the source domain and the laptop data set is the target domain. When training is finished, the test instances of the target domain are fed into the model for evaluation. The performance of DAT-LCR-Rot-hop++ is analysed according to its sentiment prediction accuracy on the target test aspects.

4 Framework

A GAN model [9] generally consists of two elements on top of the neural feature extractor, which in this paper is the LCR-Rot-hop++ model. The feature extractor transforms an input sentence into a vector representation that ought to capture the important characteristics of the sentence. The two additional elements are the generator and the discriminator. A visual representation of our neural network is shown in Figure 1.

Fig. 1 A visualisation of the LCR-Rot-hop++ model

The DAT method introduced in [7] is able to adapt to target domains without any labeled target data. This is done by generating deep features that are discriminative for the main classification task, using the labeled sentiments of the source domain, while being invariant to shifts between the source and target domain, which is enforced using the domain class of both domains. The proposed DANN solution is revisited in [8], which provides a more detailed and elaborate description of the mathematics behind the system.

Converting the GAN network into a DANN model amounts to replacing the generator with a Gradient Reversal Layer (GRL). The GRL aims to make the task of the domain discriminator as hard as possible, which is the direct connection to GAN. In DAT-LCR-Rot-hop++, the loss of the domain discriminator is passed through the GRL, which reverses the gradient before it is back-propagated into the feature extractor. This causes the hidden layers of LCR-Rot-hop++ to react by constructing features that the domain discriminator cannot recognise as belonging to a certain domain. This process continues until, at some point, the word vectors are completely domain-invariant, so that the domain discriminator is unable to distinguish the source and target domain in the shared feature representations. The rest of this section is structured as follows. Section 4.1 presents the feature extractor LCR-Rot-hop++, Section 4.2 describes the structure of our proposed DANN model, and Section 4.3 presents the training procedure.
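As an illustration of the mechanism described above, the snippet below gives a minimal PyTorch sketch of a Gradient Reversal Layer: an identity mapping in the forward pass whose backward pass multiplies the incoming gradient by \(-\lambda \). This is a generic GRL implementation, not the code from the linked repository.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda
    in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradientReversal.apply(x, lambd)

# Toy usage: features from the extractor pass through the GRL before the
# domain discriminator, so the discriminator's loss pushes the extractor
# towards domain-invariant features.
features = torch.randn(8, 2400, requires_grad=True)
reversed_features = grad_reverse(features, lambd=1.0)
reversed_features.sum().backward()
print(features.grad[0, :3])  # gradients are -1 instead of +1
```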

4.1 LCR-Rot-hop++

The LCR-Rot model is introduced in [44], extended in [34] with multi-hop attention to LCR-Rot-hop, and further expanded with deep contextual word embeddings and hierarchical attention resulting in LCR-Rot-hop++, as described in [33].

First, the sentences are split into three separate parts, consisting of the left context, \([s_1^l, ..., s_L^l]\), target phrase, \([s_1^t, ..., s_T^t]\), and right context, \([s_1^r, ..., s_R^r]\). These sentence fractions have lengths L, T, and R, respectively, such that L+T+R is equal to the complete sentence length, S. These chunks are converted to contextual word embeddings using the pre-trained BERT Base model (L = 12, A = 12, H = 768) as introduced in [6]. The final word embeddings are calculated by summing the last 4 layers of the BERT model:

$$\begin{aligned} BERT_i = \sum _{j = 9}^{12}H_{i, j}. \end{aligned}$$
(1)

All word embeddings have a dimension of \(1 \times d\), where d is equal to H = 768. So, the word representations are each a vector of size 768.
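Equation (1) can be reproduced with the Hugging Face transformers library as sketched below; the paper does not state that this library was used, so the exact pipeline may differ from the original implementation.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert.eval()

sentence = "the atmosphere and service were terrible, but the food was good"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

# outputs.hidden_states is a tuple of 13 tensors (embedding layer plus the
# 12 encoder layers), each of shape (batch, tokens, 768).
last_four = outputs.hidden_states[-4:]

# Equation (1): per-token sum of the last four encoder layers (layers 9-12).
embeddings = torch.stack(last_four, dim=0).sum(dim=0)  # (1, tokens, 768)
```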

Next, the left context word embeddings, \([w_1^l, ..., w_L^l]\), the target phrase word embeddings, \([w_1^t, ..., w_T^t]\), and the right context word embeddings, \([w_1^r, ..., w_R^r]\), are each fed into their own bi-LSTM neural network, resulting in the hidden states \([h_1^l, ..., h_L^l]\), \([h_1^t, ..., h_T^t]\), and \([h_1^r, ..., h_R^r]\). These hidden states all have dimension \(2d \times 1\) due to the bidirectional structure. This process is shown by the dark red dashed arrows in Figure 1.
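A minimal PyTorch sketch of this step is given below. As an assumption for the sketch, we use 300 cells per direction, the bi-LSTM size reported in Section 4.3, so each hidden state has 600 dimensions, which is consistent with the 2400-dimensional concatenated feature vector used later; the toy tensors and variable names are illustrative only.

```python
import torch
import torch.nn as nn

emb_dim, cells = 768, 300  # BERT embedding size; bi-LSTM cells per direction

# One bi-LSTM per sentence part: left context, target phrase, right context.
lstm_left = nn.LSTM(emb_dim, cells, batch_first=True, bidirectional=True)
lstm_target = nn.LSTM(emb_dim, cells, batch_first=True, bidirectional=True)
lstm_right = nn.LSTM(emb_dim, cells, batch_first=True, bidirectional=True)

w_left = torch.randn(1, 5, emb_dim)    # toy left context of L = 5 tokens
w_target = torch.randn(1, 2, emb_dim)  # toy target phrase of T = 2 tokens
w_right = torch.randn(1, 4, emb_dim)   # toy right context of R = 4 tokens

h_left, _ = lstm_left(w_left)        # (1, 5, 600) hidden states
h_target, _ = lstm_target(w_target)  # (1, 2, 600)
h_right, _ = lstm_right(w_right)     # (1, 4, 600)
```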

Afterwards, a rotary attention mechanism is applied to the outputs to capture the most indicative words in the left and right contexts and the target phrase. This is a two-step mechanism. In the first step the target2context vectors are computed. This is done by average pooling the target phrase, which results in \(r^t\), as shown in (2).

$$\begin{aligned} \underset{2d\times 1}{r^t}\ = pooling([\underset{2d\times 1}{h_1^t},...\underset{2d\times 1}{h_T^t}]) \end{aligned}$$
(2)

Then, the neural network utilises this as extra input in the bilinear attention layer of the two context parts of the sentence. In this attention layer, the target phrase representation is combined with \(h_i^l\) and \(h_i^r\) for the left and right contexts, respectively. A bilinear attention score f, see (3), is employed to achieve accurate representations of the left and right contexts. In the remainder of this section, the left context representation is used as an example to avoid duplication.

$$\begin{aligned} \underset{1 \times 1}{f(h^l_i, r^{t})} = tanh(\underset{1 \times 2d}{h^{l}_i} \times \underset{2d \times 2d}{W_c^l}\ \times \underset{2d \times 1}{r^{t}} + \underset{1 \times 1}{b^l_c}), \end{aligned}$$
(3)

where \(h^l_i\) is the hidden state of the left context bi-LSTM, \(W^l_c\) represents the weight matrix, and \(b^l_c\) depicts the bias term, for \(i = 1,...,L\).

Next, the attention scores are normalised to range from 0 to 1 by a softmax function, which results in \(\alpha _i^l\). This is defined as follows:

$$\begin{aligned} \alpha _i^l = \frac{exp(f(h_i^l, r^{t}))}{\sum _{j=1}^L exp(f(h_j^l, r^{t}))}. \end{aligned}$$
(4)

Last, the left and right context representation can be retrieved by computing a weighted combination of the hidden states and the attention scores:

$$\begin{aligned} \underset{2d \times 1}{r^l}\ = \sum _{i = 1}^{L}\underset{1 \times 1}{\alpha ^l_i}\ \times \underset{2d \times 1}{h_i^l}. \end{aligned}$$
(5)
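The bilinear attention of (3)-(5) can be sketched as the small module below. The hidden dimension of 600 follows the earlier sketch, and the module is a simplified stand-in rather than the repository code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttention(nn.Module):
    """Bilinear attention of (3)-(5): scores each hidden state against a
    representation r and returns their attention-weighted sum."""

    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, hidden, r):
        # hidden: (length, 2d); r: (2d,)
        scores = torch.tanh(hidden @ self.W @ r + self.b)  # eq. (3)
        alpha = F.softmax(scores, dim=0)                   # eq. (4)
        return (alpha.unsqueeze(1) * hidden).sum(dim=0)    # eq. (5)

h_left = torch.randn(5, 600)    # toy left-context hidden states
h_target = torch.randn(2, 600)  # toy target hidden states
r_t = h_target.mean(dim=0)      # eq. (2): average pooling of the target phrase
r_l = BilinearAttention(600)(h_left, r_t)  # left-context representation
```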

Next, in the second step of the rotary mechanism, these left and right context representations, \(r^l\) and \(r^r\), are fed into the bilinear attention layer of the left- and right-aware representations of the target phrase. The same methodology as described for the context representations is now applied to \(r^l\) or \(r^r\) and \(h_i^t\). This results in the context2target vectors, \(r_l^t\) and \(r_r^t\):

$$\begin{aligned} \underset{2d \times 1}{r^{t}_l} = \sum _{i = 1}^{T}\underset{1 \times 1}{\alpha _i^{t_l}} \times \underset{2d \times 1}{h_i^t}, \end{aligned}$$
(6)

where \(h_i^t\) represents the hidden states of the target phrase bi-LSTM layers and \({\alpha _i^{t_l}}\) is the attention score associated with the target phrase’s hidden output with respect to the left context.

Both representative target phrase features, \(r_l^t\) and \(r_r^t\), are then used as input for the first step, the target2context computation. As a result, the average pooling in (2) is skipped, because the newly calculated \(r_l^t\) and \(r_r^t\) are the input to (3)-(5). This whole procedure is repeated for three hops, which was found to be optimal in [34]. It is shown by the light green arrows (light grey for black and white printing) in Figure 1.
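The two-step rotary mechanism with three hops can be summarised as in the sketch below, reusing the functional form of (3)-(5). The parameter bookkeeping and variable names are hypothetical and deliberately simplified.

```python
import torch
import torch.nn.functional as F

def bilinear_rep(hidden, r, W, b):
    """Equations (3)-(5) in functional form."""
    alpha = F.softmax(torch.tanh(hidden @ W @ r + b), dim=0)
    return (alpha.unsqueeze(1) * hidden).sum(dim=0)

def rotary_attention(h_l, h_t, h_r, params, hops=3):
    """Two-step rotary mechanism repeated for three hops: target2context
    followed by context2target, each hop refining the four representations."""
    r_t_l = r_t_r = h_t.mean(dim=0)  # eq. (2): pooled target starts the first hop
    for _ in range(hops):
        # Step 1: target2context - context representations conditioned on the target.
        r_l = bilinear_rep(h_l, r_t_l, *params["left"])
        r_r = bilinear_rep(h_r, r_t_r, *params["right"])
        # Step 2: context2target - target representations conditioned on each context.
        r_t_l = bilinear_rep(h_t, r_l, *params["target_left"])   # eq. (6)
        r_t_r = bilinear_rep(h_t, r_r, *params["target_right"])
    return r_l, r_t_l, r_t_r, r_r

dim = 600
params = {name: (torch.randn(dim, dim) * 0.01, torch.zeros(1))
          for name in ["left", "right", "target_left", "target_right"]}
reps = rotary_attention(torch.randn(5, dim), torch.randn(2, dim),
                        torch.randn(4, dim), params)
```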

After having completed the first rotary attention mechanism, the four representations are fed into a hierarchical attention system. This component helps overcome the issue of only utilising local information. First, the word features are split into two groups: the target representations, \(r_l^t\) and \(r_r^t\), and the context representations, \(r^l\) and \(r^r\). Both combinations are then separately fed into a new attention layer with attention score f:

$$\begin{aligned} \underset{1 \times 1}{f(v^{i})} = tanh(\underset{1 \times 2d}{v^{i}} \times \underset{2d \times 1}{W_{h}^{c}} + \underset{1 \times 1}{b_{h}^{c}}), \end{aligned}$$
(7)

where \(W_{h}^{c}\) is the weight matrix of the hierarchical layer for the current partition and \(b_{h}^{c}\) represents the bias term for the current partition. In the context case, \(v^{i} \in \{r^l, r^r\}\), but for the target phrase \(v^{i} \in \{r_l^t, r_r^t\}\).

As for the bilinear attention layer, the function value is normalised by (8):

$$\begin{aligned} \alpha ^i = \frac{exp(f(v^i))}{exp(f(v^{i_1})) + exp(f(v^{i_2}))}, \end{aligned}$$
(8)

after which the representations are updated:

$$\begin{aligned} \underset{2d \times 1}{v^i}\ = \underset{1 \times 1}{\alpha ^i}\ \times \underset{2d \times 1}{v^i}. \end{aligned}$$
(9)

This procedure is also rerun multiple times, which is visualised by the green arcs (dark grey for black and white printing). Following the standard implementation of LCR-Rot-hop++, the four vectors are concatenated into one single vector, \(r = [r^l; r_l^t; r_r^t; r^r]\), with dimension \(8d \times 1\), which is then passed into a Multi-Layer Perceptron (MLP) that uses a softmax function to predict the polarities. Since in the current work we want to integrate the model into a DANN, the MLP layer is replaced by the domain adversarial component discussed in the next section.
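The hierarchical attention of (7)-(9) and the final concatenation can be sketched as follows; the weights and dimensions are again illustrative, with 600-dimensional representations so that r has 2400 entries, matching the discriminator input size discussed in Section 5.2.

```python
import torch
import torch.nn.functional as F

def hierarchical_rescale(v1, v2, W, b):
    """Equations (7)-(9): score a pair of representations, normalise the two
    scores with a softmax, and rescale each vector by its weight."""
    scores = torch.stack([torch.tanh(v1 @ W + b), torch.tanh(v2 @ W + b)])  # eq. (7)
    alpha = F.softmax(scores, dim=0)                                        # eq. (8)
    return alpha[0] * v1, alpha[1] * v2                                     # eq. (9)

dim = 600
W_ctx, b_ctx = torch.randn(dim) * 0.01, torch.zeros(1)
W_tgt, b_tgt = torch.randn(dim) * 0.01, torch.zeros(1)

r_l, r_r = torch.randn(dim), torch.randn(dim)      # context representations
r_t_l, r_t_r = torch.randn(dim), torch.randn(dim)  # target representations

r_l, r_r = hierarchical_rescale(r_l, r_r, W_ctx, b_ctx)
r_t_l, r_t_r = hierarchical_rescale(r_t_l, r_t_r, W_tgt, b_tgt)

# Final feature vector r fed to the domain adversarial component.
r = torch.cat([r_l, r_t_l, r_t_r, r_r])  # shape (2400,)
```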

4.2 DANN

The domain adversarial component consists of two standard feed-forward MLPs, the class discriminator and the domain discriminator, which take as input the context and target representations concatenated into r. First, the domain discriminator aims to correctly classify the domain of r. The predicted domain is given by s; classifying the domain is a binary problem with \(s=0\) for the source and \(s=1\) for the target domain. Next, the class discriminator uses a softmax function to compute the probabilities of the sentiment of the aspect, resulting in a \(1 \times 3\) output vector, p. The polarity with the largest probability is chosen as the final sentiment. The sigmoid function is used for domain prediction because it performs well on binary problems and is applied by multiple studies in domain discriminators [11, 42]. The DAT component is visualised by the dark purple solid arrows in Figure 1.

The objective is to reduce the error terms of both the domain discriminator, denoted as \(L_d(\theta _f,\theta _d)\), and the class discriminator (sentiment discriminator), denoted as \(L_c(\theta _f,\theta _c)\). Here, \(\theta \) represents the parameters of the feature extractor (LCR-Rot-hop++ without the original MLP), the domain discriminator, and the class discriminator, indicated by the subscripts f, d, and c, respectively. Hence, the objective function to optimise is:

$$\begin{aligned} \min _{\theta } \, L_{c,d}(\theta _f,\theta _c,\theta _d) = L_d(\theta _f,\theta _d) + L_c(\theta _f,\theta _c). \end{aligned}$$
(10)

However, as previously described, the GRL tries to fool the domain discriminator. After the domain is predicted and the parameters \(\theta _d\) are updated, the loss is back-propagated into the feature extractor to change its weights accordingly. This loss first passes through the GRL, which reverses the gradient by multiplying it with \(-\lambda \) in order to hinder the performance of the domain discriminator. Reversing the gradient forces the hidden layers of LCR-Rot-hop++ to adjust their weights in exactly the opposite direction to the one desired by the domain discriminator, thereby making the task of the domain classifier more difficult. As a result, the features become more domain-indiscriminative. This process leads to the following adjusted loss function:

$$\begin{aligned} \min _{\theta } \, L^{\prime }_{c,d}(\theta _f,\theta _c,\theta _d) = -\lambda L_{d}(\theta _f,\theta _d) + L_c(\theta _f,\theta _c), \end{aligned}$$
(11)
$$\begin{aligned} L_d(\theta _f, \theta _d) = -\sum _{i = 1}^{N} d_i * log(s_i) + \pi _d * ||\theta _d||^2, \end{aligned}$$
(12)
$$\begin{aligned} L_c(\theta _f, \theta _c) = -\sum _{i = 1}^{n} y_i * log(p_i) + \pi _c * (||\theta _f||^2 + ||\theta _c||^2). \end{aligned}$$
(13)

Here \(d_i\) refers to the actual domain class and \(y_i\) represents the real polarity, while \(s_i\) is the predicted domain and \(p_i\) is the predicted sentiment. \(\pi \) represents the L2-regularisation term for the class and domain discriminator, with subscripts c and d, respectively. Last, n equals the source domain sample size, and N is the total sample size of the source and target domain data combined. As described, both the source and target aspects are fed into the domain discriminator, while only the source instances are passed into the class discriminator. Applying the principle of adversarial training, \(L^{\prime }_{c,d}\) is first maximised with respect to the domain discriminator parameters, which teaches the discriminator to differentiate the domains. This min-max situation resolves to:

$$\begin{aligned} \hat{\theta _d} = {arg max}_{\theta _d} L^{\prime }_{c,d}(\hat{\theta _f},\hat{\theta _c},\theta _d) \end{aligned}$$
(14)
$$\begin{aligned} (\hat{\theta _f},\hat{\theta _c}) = {arg min}_{\theta _f,\theta _c} L^{\prime }_{c,d}(\theta _f,\theta _c,\hat{\theta _d}) \end{aligned}$$
(15)

At this saddle point, the parameters of the domain discriminator, \(\theta _d\) in (12), minimise the domain classification error. Second, \(\theta _c\) and \(\theta _f\) are computed to optimise (11) by minimising the sentiment prediction loss and maximising the domain classification error (confusing the domain discriminator). The hyperparameter \(\lambda \) regulates the trade-off between both goals.
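For clarity, the losses (11)-(13) can be written out as in the sketch below. It uses PyTorch's built-in cross-entropy functions (which average rather than sum over instances) and applies the \(-\lambda \) factor explicitly; in an implementation that uses a GRL, the sign flip happens in the backward pass instead and should not be applied twice.

```python
import torch
import torch.nn.functional as F

def l2(params):
    """Sum of squared parameters, the ||theta||^2 terms in (12)-(13)."""
    return sum((p ** 2).sum() for p in params)

def domain_loss(s_logits, d_true, theta_d, pi_d):
    """Eq. (12): cross-entropy of domain predictions plus an L2 penalty."""
    return F.binary_cross_entropy_with_logits(s_logits, d_true.float()) + pi_d * l2(theta_d)

def class_loss(p_logits, y_true, theta_f, theta_c, pi_c):
    """Eq. (13): cross-entropy of source-domain sentiments plus L2 penalties."""
    return F.cross_entropy(p_logits, y_true) + pi_c * (l2(theta_f) + l2(theta_c))

def adversarial_loss(p_logits, y_true, s_logits, d_true,
                     theta_f, theta_c, theta_d, lambd, pi_c, pi_d):
    """Eq. (11): class loss minus lambda times the domain loss."""
    return (class_loss(p_logits, y_true, theta_f, theta_c, pi_c)
            - lambd * domain_loss(s_logits, d_true, theta_d, pi_d))
```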

The original DANN paper [7] uses Stochastic Gradient Descent (SGD) optimisation. However, the state-of-the-art image classification model proposed in [20] shows that the faster momentum method [17] also produces accurate results. In each iteration, the parameters of the neural network are updated as follows:

$$\begin{aligned} v_t \longleftarrow \gamma * v_{t-1} + \eta * \nabla _{\theta _k}L(\theta _k) \end{aligned}$$
(16)
$$\begin{aligned} \theta _{k} \longleftarrow \theta _k - v_t. \end{aligned}$$
(17)

Here, the hyperparameters are the learning rate, \(\eta \), and the momentum factor, \(\gamma \). In addition, the parameter \(\theta _k\) represents the weights and biases for the domain discriminator, the feature extractor, and the class discriminator, with \(k = d\), \(k = f\), and \(k = c\), respectively. Last, L represents the corresponding loss function.
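A literal implementation of the update rule (16)-(17) is sketched below; for a constant learning rate it matches PyTorch's built-in SGD optimiser with momentum, up to where the learning rate is folded into the velocity.

```python
import torch

def momentum_step(params, velocities, eta, gamma):
    """One update of (16)-(17) for a list of parameters whose .grad is set."""
    with torch.no_grad():
        for p, v in zip(params, velocities):
            v.mul_(gamma).add_(eta * p.grad)  # v_t = gamma * v_{t-1} + eta * grad
            p.sub_(v)                         # theta_k = theta_k - v_t

# Toy usage on a single parameter tensor.
theta = torch.randn(3, 3, requires_grad=True)
velocity = [torch.zeros_like(theta)]
loss = (theta ** 2).sum()
loss.backward()
momentum_step([theta], velocity, eta=0.01, gamma=0.9)
```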

4.3 Training procedure

After the feature extractor has constructed the feature representations, both the source and target domain aspects are passed into the domain discriminator, but only the source instances are fed into the class discriminator. In our research, the aspects of the target domain also have a sentiment polarity, but this information is not used during training and remains unknown to the model until the moment of testing. The performance of DAT-LCR-Rot-hop++ is evaluated based on this testing accuracy. The ability to apply a model trained only on the labels of a source domain to a target domain gives the DANN approach an advantage over other methods.

The weights and biases are improved using the combined loss function given by (11). This equation includes the multiplication by \(-\lambda \) in order to create sentiment-discriminative and domain-indiscriminative features. The domain discriminator uses gradient ascent to maximise this loss function, whereas the feature extractor and the class discriminator use gradient descent to minimise it. The exact training procedure is shown in Algorithm 1. The stopping condition is \(max(acc_{t-1}, acc_{t-2})\) \(-\) \(acc_{t-3}>\epsilon \), which specifies that training continues if the maximum of the accuracies of the previous two epochs, minus the accuracy of three epochs ago, is larger than \(\epsilon \). In other words, we continue if there is still a significant improvement.

Algorithm 1 Training procedure of Domain-Adversarial Learning
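The stopping rule in Algorithm 1 can be expressed as a small helper function; the epsilon value and accuracy history below are illustrative.

```python
def keep_training(accuracies, epsilon):
    """Continue while the best of the last two epoch accuracies still exceeds
    the accuracy of three epochs ago by more than epsilon."""
    if len(accuracies) < 3:
        return True  # not enough history yet
    return max(accuracies[-1], accuracies[-2]) - accuracies[-3] > epsilon

history = [0.61, 0.64, 0.66, 0.665, 0.664]
print(keep_training(history, epsilon=0.001))  # True: 0.665 - 0.66 > 0.001
```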

The other hyperparameters besides \(\lambda \) in DAT-LCR-Rot-hop++ are the learning rates, \(\eta _k\), the momentum terms, \(\gamma _k\), the L2-regularisation terms, \(\pi _k\), and the dropout rate, where \(k = d\) refers to the domain discriminator and \(k = c\) to the feature extractor and class discriminator. First, \(\eta \) determines the rate at which the momentum optimiser converges. Second, \(\gamma \) determines the influence of past gradient values on the current update. Third, \(\pi \) reduces overfitting. Fourth, as previously described, \(\lambda \) balances the trade-off between the discriminative objectives of the class and domain discriminator. Last, the dropout probability regulates the fraction of layer outputs that are randomly dropped from the network in order to prevent overfitting.

Because the dropout rate does not differ between the methods proposed in [33] and [34], this variable is kept at 0.3 in this research. The remaining hyperparameters (\(\eta _k\), \(\gamma _k\), \(\pi _k\), and \(\lambda \)) are determined by a Tree-structured Parzen Estimator (TPE), which models the observations non-parametrically by applying a threshold that splits them into densities of good and bad configurations [1].
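The paper does not name a specific TPE implementation; one common choice is the hyperopt library, sketched below with an illustrative search space and a placeholder objective.

```python
from hyperopt import Trials, fmin, hp, tpe

# Illustrative search space; the bounds are not the ones used in the paper.
space = {
    "eta_c": hp.loguniform("eta_c", -9, -3),      # learning rates
    "eta_d": hp.loguniform("eta_d", -9, -3),
    "gamma_c": hp.uniform("gamma_c", 0.8, 0.99),  # momentum terms
    "gamma_d": hp.uniform("gamma_d", 0.8, 0.99),
    "pi_c": hp.loguniform("pi_c", -12, -4),       # L2-regularisation terms
    "pi_d": hp.loguniform("pi_d", -12, -4),
}

def objective(config):
    # Placeholder: train DAT-LCR-Rot-hop++ with `config` and return the
    # negative validation accuracy, since hyperopt minimises its objective.
    validation_accuracy = 0.0
    return -validation_accuracy

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=15, trials=Trials())
```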

As in the research performed in [33] and [34], the dimension of the word embeddings, \(1 \times d\), is equal to \(1 \times 768\). For convenience, the numbers of nodes in the bi-LSTMs, the bilinear attention layer, and the hierarchical attention layer are the same as in [33], namely 300, 600, and 600, respectively. The number of hidden layers and cells in both the class and domain discriminator is optimised by TPE. The weights of the layers are initialised randomly from a normal distribution with zero mean, and the biases are initially set to zero.

After the hyperparameters are initialised, DAT-LCR-Rot-hop++ is trained on the training set. The sentiment accuracy on the validation set is used to decide which combination of hyperparameters achieves the best performance. We let the program run 15 times for each source-target domain combination with different settings for the structure and the hyperparameters, each run lasting at most 50 epochs. The hyperparameter fine-tuning occurs in two stages. In the first stage, \(\lambda \) is excluded, because we want to show the effect of \(\lambda \) on the cross-domain performance of the model; a higher \(\lambda \) should increase the domain-invariance of the features. \(\lambda \) is therefore first set to 1.0 [8] in order to find the optimal values for the other parameters. After the influence of \(\lambda \) has been analysed, all hyperparameters, including \(\lambda \), are fine-tuned to define the best possible configuration. This setting is applied in the final training optimisation with a maximum of 200 epochs.

5 Evaluation

In Section 5.1, we first describe the influence of \(\lambda \) on the performance of DAT-LCR-Rot-hop++. Next, Section 5.2 provides a sensitivity analysis towards the structure of the class and domain discriminators. Then, in Section 5.3, the results for the final optimisation are shown. Last, Section 5.4 examines the performance of the algorithm when the neutral aspects are added to either the positive or negative class (to better balance classes).

5.1 Impact of \(\lambda \)

First, an estimate of the optimal settings for the algorithm is computed. These initial hyperparameters are shown in Table 2; similar settings are used for every source-target domain combination to increase comparability between the combinations. With these settings, DAT-LCR-Rot-hop++ is run for seven increasing values of \(\lambda \), from 0.5 up to 1.1 in steps of 0.1, for each domain combination (Figure 2). The impact of the balance hyperparameter \(\lambda \) is visualised in Figure 2. In these graphs, the dark blue (dark grey in black and white printing) line represents the labeling accuracies on the test set of the target domain, while the light orange (light grey in black and white printing) line shows the baseline performance obtained by always selecting the majority class of the test sample. Notice that the algorithm is only trained on the sentiments of the source domain, so one could argue that the baseline should equal the majority aspect class of the source domain. However, we take the more conservative view and define the benchmark based on the target domain, which decreases the relative performance of our model. The model uses a maximum of 50 epochs.

Table 2 Hyperparameter values for structure sensitivity analysis
Fig. 2 Labeling accuracies for different values of \(\lambda \). The dark blue (dark grey in black and white printing) line shows the target domain classification by our model and the light orange (light grey in black and white printing) line the target domain classification by the majority classifier

When analysing Figure 2a and c, we observe that the classification accuracy for the restaurant-laptop and laptop-restaurant domain combinations is significantly higher than for the other four. Since the similarities between laptops and restaurants do not seem more prevalent than those between books and laptops, this might come across as a surprising result. However, both the laptop and restaurant domains are taken from the SemEval 2014 dataset [26], while the book domain is retrieved from the ALT 2019 dataset [18]. First of all, the SemEval domains share common context and target words such as “service” and “quality”. Whereas the ALT 2019 dataset contains these target words 6 and 0 times, respectively, they occur 59 and 85 times in the laptop set and 420 and 85 times in the restaurant domain, respectively. In addition, the (digital) language might have changed over these 5 years. Last, the fraction of neutral aspects in the book test set is significantly higher than the neutral percentage in the training sets of both the restaurant and laptop domains, at 63.1% as compared to 17.7% and 19.8%, respectively. Relatedly, emotional phrases such as “awesome” appear only 5 times in the book domain, compared to 30 and 16 times in the laptop and restaurant domains, respectively. On these grounds, it is to be expected that the prediction scores for the book domain in combination with either laptop or restaurant are lower than for laptop-restaurant or restaurant-laptop.

Furthermore, the accuracy with the book as target domain is worse than with the book as source domain. Because each domain has an imbalanced training set, the model tends to start by predicting the sentiment that occurs most often, especially in the early iterations. For both the restaurant and laptop domains this results in predicting a positive sentiment, which leads to a low score for the book target domain. On the other hand, using the book as source domain does lead to acceptable performance. Not surprisingly, most neutral aspects are correctly classified in the book-restaurant and book-laptop combinations, with an average accuracy of 65% and 81%, respectively. Next, DAT-LCR-Rot-hop++ focuses on the second largest polarity, the positive sentiments, which form the majority in both the restaurant and laptop domains. The drawback is a poor score for the negative polarities, with an average accuracy of 3.2% and 0%, respectively. Besides this result, the restaurant-laptop combination does beat the baseline of 52% with an average of 65%. The same applies to the laptop-restaurant case, for which the observation at \(\lambda =0.9\) appears to be an outlier.

Looking at Figure 2a, we observe a scattered graph with an almost flat regression line. Accordingly, the OLS slope coefficient is 0.0061, which means that increasing \(\lambda \) by 1.0 increases the testing accuracy by only 0.61%. The same holds for Figure 2c, which has a slope of 0.0204. One reason for this could be the previously mentioned overlap between the restaurant and laptop domains, which decreases the difficulty of the cross-domain task and hence makes \(\lambda \) less important. In contrast to the flat regression lines of the restaurant-laptop and laptop-restaurant cases, the restaurant-book case in Figure 2b shows a clear positive linear trend with a slope of 0.0973. The same applies to the graph in Figure 2d, which has a slope of 0.0691, showing the effect of \(\lambda \).

As previously stated, employing the book as target domain produces a poor labeling accuracy. In particular, the positive outlier in Figure 2b illustrates the effect of the imbalanced data sets. Only during this run was DAT-LCR-Rot-hop++ capable of predicting neutral sentiments correctly, resulting in a neutral accuracy of 26% as compared to a maximum of 5% for the other runs with the book as target domain. The ability to classify the neutral aspects precisely immediately leads to a significantly better performance, with an accuracy of 45% as compared to approximately 35%. The imbalanced sets cause DAT-LCR-Rot-hop++ to overfit on the training set and focus only on the two major polarities, resulting in low scores for the restaurant-book and laptop-book combinations. Whereas there is a clear ascending performance on the training set, reaching percentages of up to 92%, the maximum accuracy on the target domain is reached after approximately 100 epochs. The statistic then fluctuates around this level with some large outliers in both directions, but without really improving. After some time, the accuracy starts to drop: the model becomes too specialised towards the information in the training set.

Both book-restaurant and book-laptop, in Figure 2e and f, show an ascending line, with slope coefficients of 0.0704 and 0.0488, respectively. However, the data points in Figure 2e are scattered, resulting in a standard deviation of 10%. The positive relationship might therefore be questioned in this case.

5.2 Sensitivity to discriminator structures

To estimate the sensitivity of the model to the neural structure of the domain and class discriminators, we examine the performance of three different structures. We run DAT-LCR-Rot-hop++ for 125 epochs with the same hyperparameter settings as for the \(\lambda \)-analysis, given in Table 2. The output layer consists of the three sentiment classes. The three structures all have an input layer of 2400 nodes, but the hidden layers differ. The analysed set-ups are: 1) no hidden layer, 2) one hidden layer with 600 neurons, and 3) two hidden layers, one with 1200 nodes and the other with 600 neurons.
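For reference, the three class discriminator structures compared here can be written as plain feed-forward networks, as in the sketch below. The ReLU activations are an assumption, since the paper does not specify the activation function, and the domain discriminator would use a single sigmoid output instead of three.

```python
import torch.nn as nn

# 1) 2400-3: no hidden layer.
disc_no_hidden = nn.Sequential(nn.Linear(2400, 3))

# 2) 2400-600-3: one hidden layer with 600 neurons.
disc_one_hidden = nn.Sequential(
    nn.Linear(2400, 600), nn.ReLU(),
    nn.Linear(600, 3),
)

# 3) 2400-1200-600-3: two hidden layers with 1200 and 600 neurons.
disc_two_hidden = nn.Sequential(
    nn.Linear(2400, 1200), nn.ReLU(),
    nn.Linear(1200, 600), nn.ReLU(),
    nn.Linear(600, 3),
)
```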

Table 3 Test accuracies for DAT-LCR-Rot-hop++ model for three different structures of the class and domain discriminator

The results of this examination are shown in Table 3. As one can see, the 2400-3 structure does not perform well on the classification problem. Due to the simplicity of this model, DAT-LCR-Rot-hop++ mostly predicts the majority class of the training domain; it is not trained enough to acquire knowledge of the other polarities as well. This leads to extremely low and high accuracies for specific classes, for instance the 99% positive score for restaurant-laptop and the 0% negative accuracy for book-laptop. For five out of six domain combinations, the dominant sentiment is predicted correctly at least 93% of the time, while the minority polarity scores at most 3%. Only once, for the laptop-book combination, is there some division over positive and negative aspects.

On the other hand, the 2400-600-3 system shows better scores. The accuracies are now more equally divided over the three different sentiments. Especially for laptop-restaurant, book-restaurant, and book-laptop, the improvement in the division of correctly classified aspects is substantial. These three domain combinations lose some of their majority-polarity accuracy but gain significantly better performance on the less prominent sentiments. The laptop-restaurant model, for instance, sees an increase in accuracy of 39 percentage points for the negative aspects. With an additional hidden layer, the algorithm is thus able to learn about the less dominant classes in the training sample. The performance of the 2400-1200-600-3 structure is similar to that of the 2400-600-3 model, with some small differences in the distribution of scores over the sentiment classes. However, the increase of 4 minutes per iteration (50% extra) has to be considered, making the 2400-600-3 model a better fit for our classification task.

5.3 Final optimisation

Based on the results in Table 3, we decided to continue with the same domain and class discriminator structure for all domain combinations: an input layer with 2400 neurons, one hidden layer with 600 nodes, and an output layer with 3 neurons representing the sentiment classes. The hyperparameter values for the final optimisation with a maximum of 200 epochs are given in Table 4. Each domain combination is evaluated using these parameter settings, and the results are shown in Table 5. As expected, the accuracies improve for each source-target domain model compared to the previous run with a maximum of 50 epochs and optimal hyperparameters including \(\lambda \). The training label accuracy increases from 84% up to 90% for the book-laptop combination. In addition, the maximum testing accuracy of 78% for the restaurant-laptop case is an improvement of 7 percentage points over the previous 71%. The ratios of correctly predicted polarities follow the previously observed distribution.

Table 4 Hyperparameter values for final model estimation
Table 5 Test accuracies for DAT-LCR-Rot-hop++ model

The performance of the restaurant-laptop combination increased significantly, resulting in a total test accuracy of 75%. The relevance of not requiring any labeled target data should not be underestimated when comparing this with other research, because this ability reduces labeling costs significantly. The outcomes for the book-restaurant case are particularly promising. Both domains are not closely related in terms of sentiment distribution, yet the model achieves an encouraging test accuracy of 69%, an improvement of 5 percentage points compared to the previous runs. Interestingly, the fractions of correctly labeled sentiments are more balanced, instead of one polarity driving the results.

5.4 Extension on neutral sentiments

Looking at Table 5, one can conclude that correctly classifying neutral sentiments is difficult for the algorithm. One reason might be the imbalanced datasets, especially in terms of neutral aspects. Another problem is the difference in proportions between the source and target domains. Third, one can argue that predicting neutral sentiments is inherently more difficult than labeling negative or positive polarities. For example, “amazing” and “terrible” are clearly positive and negative words, respectively, while “fine” and “okay” can indicate both positive and neutral sentiment. We therefore also examine the model with a standard binary classification of either positive or negative. To do this, we analyse three cases: 1) the “Base” case with the neutral sentiment included, 2) the neutral aspects added to the positive polarity, and 3) the neutral and negative polarities combined. Each specification is tested using the same settings as for the structure analysis in Table 2, and the algorithm is run for 125 epochs.
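The three label configurations can be obtained with a simple relabeling step, sketched below with a hypothetical string encoding of the polarities.

```python
def relabel(polarity, mode):
    """Map the original polarity to the label set of the chosen configuration."""
    if mode == "base":
        return polarity  # keep positive, neutral, and negative
    if mode == "neutral_to_positive":
        return "positive" if polarity == "neutral" else polarity
    if mode == "neutral_to_negative":
        return "negative" if polarity == "neutral" else polarity
    raise ValueError(f"unknown mode: {mode}")

print(relabel("neutral", "neutral_to_positive"))  # positive
```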

Table 6 shows the new polarity proportions. First of all, we hypothesise that the scores will be better in the binary case because there are two options to choose from instead of three. Furthermore, from this table one can expect that the “neutral to positive” adaptation produces better results than “neutral to negative”, due to the dominant share of positive polarities. Last, one can assume that the book domain performs significantly better as either source or target domain, as the differences between its distribution and those of the restaurant and laptop domains become smaller.

Table 6 The distribution of the aspect sentiment polarities of the three domains for the neutral extension
Table 7 Test accuracies for DAT-LCR-Rot-hop++ model for the neutral extension

The results of the neutral extension are given in Table 7. The percentage scores for the base case are significantly lower than in Table 5. This can be the consequence of a mix of reasons. First, the optimisation is run for 125 epochs instead of 200. Second, the optimal hyperparameters for each individual domain combination are not used; instead, a general setting is used to increase comparability. Third, it can simply be bad luck: the algorithm may end up in a local minimum without further exploring other minima.

As expected, the test scores are higher for every run of both the “neutral to positive” and the “neutral to negative” transformation, with average improvements over the base case of 37% and 16%, respectively. The same holds for the training classification performance. Transforming the three-class polarity labeling problem into a binary classification problem thus improves the accuracy substantially.

Furthermore, it is not surprising that the positive transformation performs better than the negative transformation for each of the domain combinations. The heavily positive-skewed sentiment distributions of all domains result in the algorithm predicting almost all positive aspects correctly: a 100% score is reached for the book-restaurant and 98% for the book-laptop combination, both accompanied by a low score for the negative sentiment. In addition, it is interesting to see that positive opinions are classified least accurately when the laptop is the source domain, because its training set is the most evenly divided. One can also notice that negative aspects reach a minimum accuracy of 27% with the laptop as source domain, compared to lower scores for the other source domains. This again demonstrates the importance of similar sentiment distributions between the source and target domains.

Last, Table 7 indeed shows that the scores with the book domain as either source or target domain are significantly higher with the neutral extension than in the base case. It is interesting to observe that accurately labeling a polarity in the test set depends on the amount of that polarity in the training set. The book-restaurant combination with the positive adaptation returns 100% and 2% accuracy for the positive and negative aspects, respectively, whereas the same combination with the negative transformation results in scores of 16% and 96%, respectively. This shows that DAT-LCR-Rot-hop++ requires a substantial number of instances of a certain sentiment class in order to label the corresponding aspects correctly.

6 Conclusion

The important role of user-generated content on the Web increases the relevance of ABSA, and in particular ABSC. Since obtaining labeled target data is extremely costly, new models should be developed that can be employed in a variety of domains. The state-of-the-art LCR-Rot-hop++ structure forms the basis of our proposed DAT-LCR-Rot-hop++, which adds an adversarial component based on DANN. Our results show that the domain invariance enforced by DAT-LCR-Rot-hop++ can, in general, improve performance on the target data, especially for similar domains.

All accuracy scores for the restaurant-laptop, laptop-restaurant, and book-restaurant combinations exceed 70%. So in half of the considered source-target domain cases, DAT-LCR-Rot-hop++ is able to classify polarities properly, although performance depends on which combination of domains is used. Domains with similar polarity distributions seem to benefit the most from the proposed approach.

To further examine our method, we investigate the effect of adding the neutral aspects to either the positive or the negative class, creating a binary classification problem. Judging from the results, classifying neutral aspects appears to be a hard task for the neural network, as this class is poorly represented in two of our three domains. The transformation to the binary problem improves the results, as expected. In addition, it again shows that similarity between the source and target domain is crucial.