
2.1 Introduction

Mobile applications became an alternative solution for various needs during the pandemic. Bank Indonesia recorded 547 million transactions through e-commerce applications, with a nominal value of IDR 88 trillion, in the first quarter of 2021. Shopee, Tokopedia, and Lazada are the three e-commerce applications with the highest number of visitors per month. This high level of use means that mobile applications collect many user opinions, both negative and positive, about their features. Machine learning for sentiment analysis on these opinion texts is therefore required to provide information about the advantages and disadvantages of an application or its services as a whole [1, 2].

From a machine learning point of view, sentiment analysis can be grouped under supervised learning because of its sentiment labels [3, 4]. Deep learning is the primary machine learning approach for unstructured data such as text. Deep learning extends standard machine learning with additional layers that extract a relevant representation of the data [5]. The deep learning models most widely used in sentiment analysis are convolutional neural networks (CNN) and long short-term memory (LSTM). Their effectiveness and development have been reported in [6,7,8], including for Indonesian sentiment analysis [9]. The Bidirectional Encoder Representations from Transformers (BERT) model is another deep learning architecture that achieves state-of-the-art performance on many natural language processing problems [10]. BERT also improves on the performance of standard deep learning for Indonesian sentiment analysis [11].

Continual learning, also known as lifelong learning or incremental learning, is the ability of a model to learn continuously from new data while retaining previously learned knowledge [12]. The recommender systems of applications such as Netflix and Amazon are well-known examples of continual learning: these applications continuously collect new labeled data as people interact with them. Continual learning algorithms have also succeeded in computer vision and clinical applications [13,14,15]. In practice, the main issue in continual learning is catastrophic forgetting, i.e., training a model on new information interferes with previously learned knowledge. This phenomenon typically leads to an abrupt drop in performance or, in the worst case, to the old knowledge being entirely overwritten by the new.

In this paper, we analyze the performance of BERT as a pretrained text representation model for continual deep learning across several domains of Indonesian sentiment analysis. We compare it with two standard text representations in deep learning: fine-tuned embedding with CNN and fine-tuned embedding with LSTM. Our simulations show that the BERT model gives the best accuracy for transfer of knowledge, whereas the fine-tuned embedding with LSTM model is better for retention of knowledge. Moreover, our simulations show that the order of the source domains affects the performance of BERT for both transfer and retention of knowledge.

The structure of this paper is as follows: Sect. 2.2 briefly explains the methods, Sect. 2.3 describes the experiments, and Sect. 2.4 presents a general conclusion about the results.

2.2 Methods

This section explains the methods used in this research: convolutional neural networks (CNN), long short-term memory (LSTM), and Bidirectional Encoder Representations from Transformers (BERT).

2.2.1 Convolutional Neural Network

The convolutional neural network (CNN) is a deep learning model widely used in text classification. A CNN uses filters to extract essential features from each region of the text. During the construction of the word representation, the input passes through a convolution layer and a max-pooling layer (Fig. 2.1).

Fig. 2.1 CNN architecture

Convolution Layer

In the convolution layer, the input is processed by l filters W to extract the essential features of each region of size h. Suppose a sentence consists of n words whose d-dimensional vector representations are \( {\boldsymbol{x}}_1,\dots, {\boldsymbol{x}}_n \), and let \( {X}_{\left[i:i+h-1\right]} \) denote the matrix formed by stacking \( {\boldsymbol{x}}_i \) to \( {\boldsymbol{x}}_{i+h-1} \). Eq. (2.1) calculates the feature vector \( \boldsymbol{c}=\left[{c}_1,{c}_2,\dots, {c}_{n-h+1}\right] \) for each filter, where k and j index the rows and columns of the matrices and f is a nonlinear activation function. The convolution layer’s output is then used as the input of the max-pooling layer.

$$ {c}_i=f\left({\sum}_{k=1}^h{\sum}_{j=1}^d{X}_{\left[i:i+h-1\right]k,j}\cdotp {W}_{k,j}\right) $$
(2.1)

Max Pooling Layer

The max-pooling layer processes the output of the convolution layer by taking the most essential feature of each feature vector \( \boldsymbol{c} \), that is, \( \hat{c}=\max \left\{\boldsymbol{c}\right\} \). The purpose of this layer is to reduce the dimension of the representation so that only the most salient feature of each filter is kept for the subsequent layers.
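To make Eq. (2.1) and the max-pooling step concrete, the following sketch implements them for a single filter in plain NumPy; the array sizes and random values are illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np

def conv_feature_vector(X, W, f=np.tanh):
    """Eq. (2.1): slide one h x d filter W over the n x d matrix of word
    embeddings X and return c = [c_1, ..., c_{n-h+1}]."""
    n, d = X.shape
    h, _ = W.shape
    return np.array([
        f(np.sum(X[i:i + h] * W))   # element-wise product, summed over rows (k) and columns (j)
        for i in range(n - h + 1)
    ])

def max_pool(c):
    """Max-pooling layer: keep only the most salient feature of the filter."""
    return np.max(c)

# Illustrative example: a 6-word sentence with 4-dimensional word vectors
# and a single filter covering regions of h = 3 words.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 4))   # X: n x d
W = rng.normal(size=(3, 4))            # filter: h x d
c = conv_feature_vector(embeddings, W)
print(c, max_pool(c))
```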

2.2.2 Long Short-Term Memory

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) designed to remember long-term information. The LSTM model controls which information is kept and which is removed at each time step (Fig. 2.2). At the t-th time step, the LSTM receives two input vectors: the vector representation of the t-th word in the sentence (\( {\boldsymbol{x}}_t \)) and the output vector of the previous hidden state (\( {\boldsymbol{h}}_{t-1} \)). The model first determines which information should be removed from the cell state \( {\boldsymbol{C}}_{t-1} \). This is done at the forget gate (\( {\boldsymbol{f}}_t \)) shown in Eq. (2.2).

$$ {\boldsymbol{f}}_t=\sigma \left({W}_{fx}{\boldsymbol{x}}_t+{W}_{fh}{\boldsymbol{h}}_{t-1}+{\boldsymbol{b}}_f\right) $$
(2.2)
Fig. 2.2 LSTM architecture

Next, the model stores selected information in the cell state \( {\boldsymbol{C}}_t \). During this step, the model determines the values to be updated through the input gate (\( {\boldsymbol{i}}_t \)) in Eq. (2.3) and constructs a new vector, the candidate cell state (\( {\overset{\sim }{\boldsymbol{C}}}_t \)), in Eq. (2.4).

$$ {\boldsymbol{i}}_t=\sigma \left({W}_{ix}{\boldsymbol{x}}_t+{W}_{ih}{\boldsymbol{h}}_{t-1}+{\boldsymbol{b}}_i\right) $$
(2.3)
$$ {\overset{\sim }{\boldsymbol{C}}}_{\boldsymbol{t}}=\tanh \left({W}_{cx}{\boldsymbol{x}}_{\boldsymbol{t}}+{W}_{ch}{\boldsymbol{h}}_{\boldsymbol{t}-\mathbf{1}}+{\boldsymbol{b}}_{\boldsymbol{c}}\right) $$
(2.4)

The cell state \( {\boldsymbol{C}}_{t-1} \) is updated to a new cell state \( {\boldsymbol{C}}_t \) using the outputs of the forget gate, the input gate, and the candidate vector \( {\overset{\sim }{\boldsymbol{C}}}_t \), as shown in Eq. (2.5).

$$ {\boldsymbol{C}}_{\boldsymbol{t}}={\boldsymbol{f}}_{\boldsymbol{t}}\cdotp {\boldsymbol{C}}_{\boldsymbol{t}-\mathbf{1}}+{\boldsymbol{i}}_{\boldsymbol{t}}\cdotp {\overset{\sim }{\boldsymbol{C}}}_t $$
(2.5)

The last step is to determine the output using the new cell state. This is done at the output gate (\( {\boldsymbol{o}}_t \)) shown in Eq. (2.6). The vector \( {\boldsymbol{o}}_t \) is then combined with the cell state \( {\boldsymbol{C}}_t \) to determine the hidden state \( {\boldsymbol{h}}_t \) in Eq. (2.7). The vector \( {\boldsymbol{h}}_t \) is used in the next time step,

$$ {\boldsymbol{o}}_{\boldsymbol{t}}=\sigma \left({W}_{ox}{\boldsymbol{x}}_{\boldsymbol{t}}+{W}_{oh}{\boldsymbol{h}}_{\boldsymbol{t}-\mathbf{1}}+{\boldsymbol{b}}_{\boldsymbol{o}}\right) $$
(2.6)
$$ {\boldsymbol{h}}_{\boldsymbol{t}}={\boldsymbol{o}}_{\boldsymbol{t}}\ast \tanh \left({\boldsymbol{C}}_{\boldsymbol{t}}\right) $$
(2.7)

where \( {W}_{\ast} \), with \( \ast \in \left\{ fx, fh, ix, ih, cx, ch, ox, oh\right\} \), are weight matrices and \( {\boldsymbol{b}}_{\ast} \), with \( \ast \in \left\{f,i,c,o\right\} \), are bias vectors [16].
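As an illustration of Eqs. (2.2)–(2.7), the sketch below computes a single LSTM time step in NumPy; the dimensions and randomly initialized weights are assumptions for demonstration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM time step following Eqs. (2.2)-(2.7).
    p is a dict holding the weight matrices W_* and bias vectors b_*."""
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fh"] @ h_prev + p["b_f"])      # forget gate, Eq. (2.2)
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ih"] @ h_prev + p["b_i"])      # input gate, Eq. (2.3)
    C_tilde = np.tanh(p["W_cx"] @ x_t + p["W_ch"] @ h_prev + p["b_c"])  # candidate cell state, Eq. (2.4)
    C_t = f_t * C_prev + i_t * C_tilde                                  # cell state update, Eq. (2.5)
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_oh"] @ h_prev + p["b_o"])      # output gate, Eq. (2.6)
    h_t = o_t * np.tanh(C_t)                                            # hidden state, Eq. (2.7)
    return h_t, C_t

# Illustrative sizes: 4-dimensional word vectors, 3-dimensional hidden state.
rng = np.random.default_rng(0)
d, m = 4, 3
p = {name: rng.normal(size=(m, d if name.endswith("x") else m))
     for name in ["W_fx", "W_fh", "W_ix", "W_ih", "W_cx", "W_ch", "W_ox", "W_oh"]}
p.update({b: np.zeros(m) for b in ["b_f", "b_i", "b_c", "b_o"]})
h_t, C_t = lstm_step(rng.normal(size=d), np.zeros(m), np.zeros(m), p)
```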

2.2.3 Bidirectional Encoder Representations from Transformers

Bidirectional Encoder Representations from Transformers, commonly called BERT, is a pre-trained language representation model developed by Devlin et al. [10]. Unlike earlier language representation models, BERT does not use a traditional left-to-right or right-to-left language model. Instead, BERT is designed to train a bidirectional representation that looks at both the left and right context in every layer simultaneously. The main building block of BERT is the Transformer encoder layer (Fig. 2.3).

Fig. 2.3 Transformer encoder layer

BERT comprises 12 Transformer encoder layers, each with a hidden size of 768 and h = 12 heads in the multi-head self-attention sub-layer. Each encoder layer consists of two sub-layers: multi-head attention and a position-wise feed-forward network (Fig. 2.4).

Fig. 2.4 BERT architecture

Multi-head Attention

Multi-head attention is an architecture that performs the attention function h times in parallel using different Query, Key, and Value matrices. The goal of multi-head attention is to produce several different attention distributions for each word, drawn from different representation subspaces. As the model processes each word (each position in the input sequence), attention allows it to look at other positions in the sequence for clues that help encode this word better.
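A minimal sketch of the underlying scaled dot-product attention, computed once per head, is shown below; the tensor shapes are illustrative assumptions, and the learned projection matrices that produce Q, K, and V are omitted for brevity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of every position with every other position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key positions
    return weights @ V                               # weighted sum of the value vectors

# Illustrative example: 5 token positions, head dimension 64 (768 / 12 in BERT-base).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
context = scaled_dot_product_attention(Q, K, V)      # shape (5, 64)
```

In multi-head attention, this computation is repeated h = 12 times with different learned projections, and the resulting context vectors are concatenated and projected back to the hidden size.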

Position-Wise Feed Forward Network

The position-wise feed-forward network transforms the representation at every sequence position using the same feed-forward network. This network consists of two linear transformations with a ReLU (rectified linear unit) activation function between them.

$$ \mathrm{FFN}(x)=\max \left(0,x{W}_1+{b}_1\right){W}_2+{b}_2 $$
(2.8)

where x is the input vector, \( {W}_1 \) and \( {W}_2 \) are the weight matrices of the first and second linear transformations, and \( {b}_1 \) and \( {b}_2 \) are the corresponding bias vectors.
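A small NumPy sketch of Eq. (2.8), applied independently to each position of an illustrative input, follows; the dimensions mirror BERT-base (hidden size 768, inner size 3072), but the weights are random placeholders.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position in the same way."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU between the two linear maps

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 768))                       # 5 positions, hidden size 768
W1, b1 = rng.normal(size=(768, 3072)) * 0.02, np.zeros(3072)
W2, b2 = rng.normal(size=(3072, 768)) * 0.02, np.zeros(768)
out = position_wise_ffn(x, W1, b1, W2, b2)          # shape (5, 768)
```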

The BERT model used in this study is IndoBERT base uncased, the Indonesian version of the BERT model pre-trained on uncased Indonesian text. This model has 12 Transformer encoder layers, a hidden size of 768, and 12 heads in the attention sub-layer.
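For reference, a hedged sketch of how such a checkpoint can be loaded with the Hugging Face transformers library is given below; the model identifier indolem/indobert-base-uncased is an assumption on our part, since the text only specifies "IndoBERT base uncased".

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint identifier matching the "IndoBERT base uncased" model described here.
MODEL_NAME = "indolem/indobert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,   # negative vs. positive sentiment
)
print(model.config.num_hidden_layers,     # 12 encoder layers
      model.config.hidden_size,           # hidden size 768
      model.config.num_attention_heads)   # 12 attention heads
```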

2.3 Experiment

This section describes the experiments. We implement continual learning on several domains of Indonesian sentiment analysis using BERT and compare it with two other models, the fine-tuned embedding with CNN and the fine-tuned embedding with LSTM. The models are trained on a personal computer with an Intel(R) Core i7 processor, 16 GB of RAM, an NVIDIA GeForce RTX 3050, and Python 3.7.

2.3.1 Data Sets

Six data sets are used in this study, as shown in Table 2.1. Calon Presiden contains tweets about the 2014 Indonesian presidential election, while E-commerce contains tweets about e-commerce in Indonesia. The remaining four data sets, DANA, Shopback, Grab, and Jenius, contain Indonesian reviews of applications from the Google Play Store.

Table 2.1 Data sets details

2.3.2 Preprocessing

Several cleaning steps are applied to the text: capital letters are converted to lowercase; website addresses, Twitter usernames, hashtags, punctuation, numbers, and extra spaces are removed; words repeated with a dash are separated by removing the dash; letters repeated more than twice are reduced to two occurrences; single-letter words are removed; and the token "rt" is deleted. The sentiment labels are processed by one-hot encoding. Each text is labeled −1 or 1, where −1 represents negative sentiment and 1 represents positive sentiment. Through this preprocessing, a label is mapped to a two-dimensional vector: negative sentiment (−1) is mapped to the vector whose first and second elements are 1 and 0, respectively, and positive sentiment (1) to the vector whose first and second elements are 0 and 1. The data are split into training and testing sets in a proportion of 80:20.
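A sketch of these cleaning and label-encoding steps, assuming the raw texts and labels are held in Python lists, is shown below; the regular expressions are one possible realization of the listed rules, not the authors' exact code.

```python
import re
from sklearn.model_selection import train_test_split

def clean_text(text):
    text = text.lower()                                  # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove website addresses
    text = re.sub(r"@\w+", " ", text)                    # remove Twitter usernames
    text = re.sub(r"#\w+", " ", text)                    # remove hashtags
    text = re.sub(r"(\w+)-\1", r"\1 \1", text)           # separate reduplicated words by removing the dash
    text = re.sub(r"[^\w\s]|\d", " ", text)              # remove punctuation and numbers
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)          # keep at most two repeated letters
    text = re.sub(r"\brt\b", " ", text)                  # remove "rt"
    text = re.sub(r"\b\w\b", " ", text)                  # remove single-letter words
    return re.sub(r"\s+", " ", text).strip()             # collapse extra spaces

def encode_label(sentiment):
    """One-hot encoding: -1 -> [1, 0] (negative), 1 -> [0, 1] (positive)."""
    return [1, 0] if sentiment == -1 else [0, 1]

# texts, sentiments = ...  (loaded from one of the data sets in Table 2.1)
# X = [clean_text(t) for t in texts]
# y = [encode_label(s) for s in sentiments]
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```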

2.3.3 Model Implementation

The first step in the BERT model is to convert every word of the input sentence into a numerical vector representation, which is then fed into the encoder layers. First, BERT uses the WordPiece model as a tokenizer and adds two special tokens: [CLS] at the beginning and [SEP] at the end of the sentence. Padding and truncation are performed so that every sentence in the data has the same number of tokens. In this study, the maximum number of tokens is 256: each document shorter than 256 tokens is padded with the special token [PAD] until its length reaches 256 tokens, and each sentence longer than 256 tokens is truncated to 256 tokens. The next step is embedding, which maps each token to a numeric vector of a particular dimension. Each token has three embeddings: token embedding, segment embedding, and position embedding. The input representation is illustrated in Fig. 2.5.

Fig. 2.5 Illustration of the input representation for BERT

Finally, as shown in Fig. 2.5, the token embedding, segment embedding, and positional embedding vectors are added together to obtain the input of the BERT model.
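A minimal sketch of the tokenization and padding steps described above, using the Hugging Face tokenizer loaded earlier, is given below; the helper assumes the same 256-token limit and relies on the library to insert [CLS], [SEP], and [PAD].

```python
MAX_LEN = 256

def encode_texts(texts, tokenizer):
    """Tokenize with WordPiece, add [CLS]/[SEP], and pad or truncate to MAX_LEN tokens."""
    return tokenizer(
        texts,
        padding="max_length",   # pad shorter documents with [PAD] up to MAX_LEN
        truncation=True,        # cut longer documents down to MAX_LEN
        max_length=MAX_LEN,
        return_tensors="pt",    # PyTorch tensors: input_ids, token_type_ids, attention_mask
    )

# encodings = encode_texts(X_train, tokenizer)
# encodings["input_ids"].shape  ->  (number_of_documents, 256)
```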

The simulation performs fine-tuning using BERTForSequenceClassification with a batch size of 16, the Adam optimizer, a learning rate of 2e−5, and 15 epochs. Early stopping that monitors the validation loss with a patience of 3 is used.
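A hedged sketch of this fine-tuning setup in PyTorch is shown below; the data sets are assumed to yield the tensors produced by the tokenizer above, and validation_loss is an assumed helper rather than the authors' exact training script.

```python
from torch.optim import Adam
from torch.utils.data import DataLoader

def fine_tune(model, train_dataset, val_dataset, epochs=15, batch_size=16, lr=2e-5, patience=3):
    """Fine-tune a sequence-classification BERT model with early stopping on the
    validation loss. The data sets are assumed to yield dicts containing
    input_ids, attention_mask, token_type_ids, and labels tensors."""
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    optimizer = Adam(model.parameters(), lr=lr)
    best_val_loss, bad_epochs = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(**batch).loss   # the classification head returns the loss when labels are given
            loss.backward()
            optimizer.step()
        val_loss = validation_loss(model, val_dataset)   # assumed helper returning the mean validation loss
        if val_loss < best_val_loss:
            best_val_loss, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # early stopping with patience = 3
                break
    return model
```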

2.3.4 Continual Learning Implementation

Continual learning is implemented following the flowchart shown in Fig. 2.6, using five source data sets. The role of each data set is listed in Table 2.1. After the model is built using Source Domain 1, it continues to learn from Source Domain 2. After learning from Source Domain 2, the model is tested for retention of knowledge (loss of knowledge) by evaluating it on the test data of Source Domain 1. In addition, the model is tested for transfer of knowledge by evaluating it on the Target Domain. The subsequent learning from Source Domains 3, 4, and 5 goes through the same steps.

Fig. 2.6 Continual learning implementation process
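A sketch of the evaluation loop in Fig. 2.6 is given below; it reuses the fine_tune helper sketched in Sect. 2.3.3 and assumes a hypothetical accuracy helper that evaluates the model on a test set.

```python
def continual_learning(model, source_domains, target_test):
    """source_domains: list of (train_set, val_set, test_set) tuples in the chosen
    order; target_test: the target-domain test set used to measure transfer."""
    retention, transfer = [], []
    first_test = source_domains[0][2]                   # test data of Source Domain 1
    for train_set, val_set, _ in source_domains:
        fine_tune(model, train_set, val_set)            # continue learning on the next source domain
        retention.append(accuracy(model, first_test))   # retention of knowledge on Source Domain 1
        transfer.append(accuracy(model, target_test))   # transfer of knowledge to the Target Domain
    return retention, transfer
```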

2.3.5 Transfer of Knowledge

First, we evaluate the performance of BERT for continual learning in terms of transfer of knowledge, namely, the performance of BERT on the target domain after learning from a series of source domains. Figure 2.7 compares BERT, the fine-tuned embedding with LSTM (LSTM), and the fine-tuned embedding with CNN (CNN) for transfer of knowledge.

Fig. 2.7 The accuracies of BERT, LSTM, and CNN for transfer of knowledge (accuracy in % after each of Source Domains 1–5)

Based on Fig. 2.7, the BERT model reaches an accuracy of 89.60%, an increase of 6.67% over its initial accuracy of 82.93%. The LSTM model increases its accuracy to 84.51%, an increase of 19.86% over its initial accuracy of 64.65%, and the CNN model increases its accuracy to 85.86%, an increase of 17.09% over its initial accuracy of 68.77%. Based on these results, we conclude that BERT provides the highest accuracy for transfer of knowledge.

2.3.6 Retention of Knowledge

Next, we evaluate the performance of BERT for continual learning in terms of retention of knowledge, namely, the performance of BERT on an initial source domain after learning from a series of other source domains. In this simulation, the initial source domain is Source Domain 1. Figure 2.8 compares BERT, the fine-tuned embedding with LSTM (LSTM), and the fine-tuned embedding with CNN (CNN) for retention of knowledge.

Fig. 2.8 The accuracies of BERT, LSTM, and CNN for retention of knowledge (accuracy in % on Source Domain 1 after each of Source Domains 1–5)

Figure 2.8 shows that the LSTM model retains more knowledge of Source Domain 1 than BERT and CNN. The LSTM model's accuracy decreases to 83.63%, a drop of 2.37% from its initial accuracy of 86.00%; the BERT model's accuracy decreases to 80.96%, a drop of 8.21% from its initial accuracy of 89.17%; and the CNN model's accuracy decreases to 72.63%, a drop of 13.92% from its initial accuracy of 86.55%.

2.3.7 Sequences of Source Domains

Further experiments were conducted on the 120 possible orderings (permutations) of the five source domains. These experiments aim to see whether the order of the source domains affects the accuracy of the BERT model for continual learning.
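The orderings can be enumerated with itertools.permutations, as in the sketch below; it reuses the continual_learning helper sketched in Sect. 2.3.4, and load_fresh_model is an assumed helper that re-loads the pretrained IndoBERT model before each run.

```python
from itertools import permutations

def evaluate_all_orderings(source_domains, target_test, load_fresh_model):
    """Run continual learning (Sect. 2.3.4) for every permutation of the source
    domains and record the final retention and transfer accuracies."""
    results = {}
    for order in permutations(range(1, len(source_domains) + 1)):   # 5! = 120 orderings
        model = load_fresh_model()                                  # assumed helper: re-load pretrained IndoBERT
        ordered = [source_domains[i - 1] for i in order]
        retention, transfer = continual_learning(model, ordered, target_test)
        results[order] = {"retention": retention[-1], "transfer": transfer[-1]}
    return results

# results = evaluate_all_orderings(source_domains, target_test, load_fresh_model)
# best = max(results, key=lambda o: results[o]["transfer"])   # ordering with the highest transfer accuracy
```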

Table 2.2 shows the top five BERT accuracies for transfer of knowledge. The highest overall accuracy for transfer of knowledge with the BERT model, 91.66%, is achieved with the domain sequence 1-5-4-2-3. This is an improvement of 2.06% over the earlier experiment with the sequence 1-2-3-4-5, which had an accuracy of 89.60%. These simulations show that the order of the source domains affects the performance of BERT for transfer of knowledge.

Table 2.2 The top five highest accuracies of BERT for transfer of knowledge

We use five scenarios to simulate retention of knowledge, where each source domain in turn becomes the initial source domain. Tables 2.3, 2.4, 2.5, 2.6, and 2.7 show the five highest BERT accuracies for retaining knowledge of each initial source domain. Based on Table 2.3, the highest accuracy of BERT on the source domain Calon Presiden after learning from a series of other source domains is 83.29%, obtained with the order 1-5-3-2-4. For the source domain E-commerce, the highest accuracy is 95.65%, given by the source domain sequence 3-4-1-2-5 in Table 2.4. For the remaining source domains, Shopback, Grab, and Jenius, the highest accuracies result from the source domain sequences 3-4-1-2-5, 4-2-5-1-3, and 5-3-1-4-2, respectively.

Table 2.3 The top five highest accuracies of BERT for retaining knowledge in the source domain of Calon Presiden
Table 2.4 The top five highest accuracies of BERT for retaining knowledge in the source domain of E-commerce
Table 2.5 The top five highest accuracies of BERT for retaining knowledge in the source domain of Shopback
Table 2.6 The top five highest accuracies of BERT for retaining knowledge in the source domain of Grab
Table 2.7 The top five highest accuracies of BERT for retaining knowledge in the source domain of Jenius

These simulations also show that the order of the source domains affects the performance of BERT in retaining knowledge. Moreover, the orderings of the source domains that give the highest accuracy for transfer of knowledge do not coincide with those that give the highest accuracy for retention of knowledge.

2.4 Conclusion

In this paper, we analyze the performance of the BERT model for continual (lifelong) learning in Indonesian sentiment analysis and compare it with two standard deep learning models: fine-tuned embedding with CNN and fine-tuned embedding with LSTM. Our simulations show that the BERT model gives the best accuracy for transfer of knowledge: continual learning increases its accuracy by 6.67% from the initial source domain to the last source domain, reaching a final accuracy of 89.60%.

The fine-tuned embedding with CNN model comes second with a final accuracy of 85.86%, followed by the fine-tuned embedding with LSTM with 84.51%. However, the fine-tuned embedding with LSTM model is the best model for retention of knowledge: its final accuracy is 83.63%, while the BERT model reaches only 80.96%. Moreover, our simulations show that the order of the source domains affects the performance of BERT for both transfer and retention of knowledge, and the orderings that give the highest accuracy for transfer of knowledge do not coincide with those that give the highest accuracy for retention of knowledge.