Introduction

Social media connects us with other people, allowing us to share aspects of our lives that we value. Unarguably, this means of communication has become a part of everyday life in the digital age. In the past decade, the number of social media users has increased tremendously, as has the number of platforms designed and developed to better suit users' needs, including social networks, microblogs, Web forums, and photo/video sharing sites. The blooming of social media brings an overwhelming amount of data to the network, in combinations of text, images, videos, and more. People share their experiences and opinions through these channels, so this raw data can yield useful information when properly extracted. For these reasons, social media has become a subject of interest for many researchers [1].

At present, a large amount of raw data in social media appears in the form of text. It is impossible for humans to analyze all of those text messages and manually extract useful information from them. Computational analysis is therefore required to manage the massive volume of data. A specific set of techniques known as natural language processing (NLP) has long been employed to enable computers to understand human languages.

Many companies have invested heavily in digital media marketing, hoping to increase their sales volume [2]. To ensure the budget is well spent, they also use social monitoring as a tool to trace customer satisfaction and other feedback, as well as to keep track of customers' desire for new services and/or products. For social monitoring, NLP techniques have been applied to perform sentiment analysis of the text messages that customers provide. The findings from the analysis allow suppliers to gain better insight into customers' attitudes (positive, negative, or neutral) toward their products. In the case of a negative comment in particular, a quick response and prompt action from suppliers can demonstrate their trustworthiness. Thus, analysis of customer sentiment is necessary not only for improving products and services but also for the credibility and good image of the companies. Many commercial social media marketing tools, such as ZocialEye and Evolve24, have been developed to assist companies in keeping track of their customers [3]. More details on sentiment analysis can be found in [4].

At present, deep learning is the predominant technique for constructing prediction models in various fields of study, including NLP. Various deep learning algorithms have been adopted to deal with different types of data. For example, researchers use the convolutional neural network (CNN), the prevailing algorithm for computer vision and text analysis, to extract local structure, while resorting to long short-term memory (LSTM) and bi-directional LSTM (BLSTM) to manage sequential data and various linguistic patterns, respectively [5,6,7]. Furthermore, previous works have suggested that the quality of a prediction model can be improved when multiple features are integrated into an analysis [8,9,10], because the features can complement each other. For this reason, Pasupa and Seneewong Na Ayutthaya (2019) incorporated word embedding, part-of-speech (POS), and sentic features, which represent words as vectors, identify POS, and associate words with feelings, respectively, into various deep learning models [11]. That study's findings, based on sentiment analysis of Thai children's tales, reaffirm the claim that integrating multiple features can enhance performance. Apart from combining features, the study also compared the efficiency of various deep learning models and found CNN to be the most efficient model for sentiment analysis. In addition, Seneewong Na Ayutthaya and Pasupa (2018) attempted to fuse two deep learning models, BLSTM and CNN, in order to, firstly, examine sequences of words and, secondly, explore local features of the text [12]; the fusion resulted in higher sentiment analysis accuracy. However, these two works used only a single dataset of 40 Thai children's tales. It should be noted that most of the Thai sentiment analysis studies mentioned above and in "Related Works" were conducted on only one dataset and undertook different pre-processing steps. Consequently, it is hard to compare them and identify the most effective way to construct sentiment analysis models, because their experimental frameworks differ. Additionally, deep learning models have already been fused in various ways, e.g., [13,14,15], but those works compared their combined networks only against individual models, not against the other hybrids. Thus, models hybridized in different manners should be compared against each other.

In this paper, we aimed to construct a Thai sentiment analysis framework that fuses CNN and BLSTM in different ways and to evaluate it on three datasets. Our literature review revealed a number of research gaps that this work addresses. Our contributions are as follows:

1. To the best of our knowledge, this is the first Thai sentiment analysis study that draws its findings from more than one dataset. Precisely, we performed sentiment analysis on three datasets collected from different sources: (i) the Wisesight dataset, collected from various social media platforms such as Facebook, Twitter, and Web forums; (ii) the Thai Economy Twitter dataset, collected solely from Twitter; and (iii) 40 Thai children's tales. The first two datasets came from social media while the latter came from literature; hence, the writing styles differ.

2. To effectively analyze human sentiment, we propose that instead of relying on only one feature, a combination of features (word embedding, POS, and sentic features) should be incorporated into the analysis to increase the accuracy of sentiment predictions. This strengthens the finding of our previous work [11].

3. According to the literature, there have been different approaches to fusing models, but the hybrid models were only compared against individual models running on different datasets, not against each other. Therefore, we compared different combinations of deep learning techniques, namely CNN-BLSTM, BLSTM-CNN, BLSTM+CNN, and BLSTM×CNN, on the same framework and on several datasets. We demonstrate that among these combinations, BLSTM-CNN generated the most reliable results and is thus the best method for Thai sentiment analysis.

4. The Thai-SenticNet used in this study is the latest version, updated from the previous Thai-SenticNet2 proposed in [16]. Unlike its predecessor, this corpus draws its information from SenticNet5 and includes more words. To achieve this, we incorporated LEXiTRON [16], Volubilis [17], and Thai-WordNet [18] to translate between Thai and English. As a result, we successfully constructed Thai-SenticNet5.

This paper is arranged as follows: "Related Works" reviews the literature related to our work. The proposed framework is explained in "Proposed Framework," which covers the data pre-processing steps, the feature extraction process, and the hybrid deep learning models. "Experimental Framework" describes the experimental setup and datasets, followed by the results and discussion in "Results and Discussions." Finally, we conclude our work in "Conclusion."

Related Works

Many researchers have investigated sentiment analysis under various types of learning problems, such as supervised learning [11, 12, 19], unsupervised learning [20], semi-supervised learning [21], and reinforcement learning [22]. Most studies in this line of research have been conducted on English texts [23, 24]; few have addressed other languages, e.g., German, French, Japanese, and Chinese [25,26,27,28]. Recently, a framework for multilingual sentiment analysis called BabelSenticNet was proposed [29]. It translates the SenticNet corpus via a statistical machine translation tool into 40 languages based on WordNet and its multilingual versions. As for Thai, even though sentiment analysis was introduced for Thai texts a decade ago [30], the number of Thai sentiment analysis studies is still limited because analyzing Thai text effectively requires multiple pre-processing steps. It is a challenge for researchers to build Thai NLP tools that deal with the lack of word delimiters and sentence boundary markers, Thai slang, and so on. Doing so requires a fine-grained Thai corpus that enables researchers to perform word segmentation, POS tagging, named entity recognition, and syntactic parsing, among other tasks. Unfortunately, the tools and resources supporting Thai sentiment analysis are still inadequate [31]. In 2010, Thai text sentiment analysis was first conducted using term frequency as an input feature [32, 33]. Since then, Thai sentiment analysis research has continued [34,35,36,37,38]. Most of the studies rely on corpora that apply a dictionary-based technique for feature extraction. In those corpora, words are categorized into three groups (positive, neutral, and negative) and tagged with a label of −1, 0, or +1. However, three labels are, in fact, insufficient to describe human sentiment. Therefore, Lertsuksakda et al. (2014) proposed a corpus with finer-grained weights, ranging from −1 to +1, for Thai sentiment analysis [16]. The corpus was constructed based on SenticNet2, proposed by Cambria and his colleagues [39]. To translate English terms into Thai and verify the Thai meanings obtained from the translation process, that study adopted a bi-directional translation technique. Sentic features were then extracted from sentences and used to analyze sentiment in Thai children's tales [19, 40].

Deep learning has played a major role in sentiment analysis tasks. An important feature used with deep learning is word embedding, which transforms words into vectors; each dimension of such a vector represents a meaning or context of the word. Word embedding can be produced by a Word2Vec model [41]. Besides word embedding, several other features have been used in sentiment analysis, such as term frequency, POS, and sentic features. One of our previous works combined word embedding with other features, such as the POS tag feature, which identifies the grammatical type of a word in a sentence, and the sentic feature, which represents the emotion of a word in vector form [11]. Those combinations clearly improved the performance of our model for sentiment classification of Thai tales. Furthermore, consolidating sentiment information into the text embedding process can yield better representations for sentiment analysis [42].

Conventional deep learning models include CNN and LSTM. A CNN model is a feed-forward neural network and a long-time favorite for computer vision tasks. A CNN processes groups of pixels that compose image data; for NLP tasks, it processes groups of words instead [43]. An LSTM model is a recurrent neural network (RNN) that can learn sequential data, such as sequences of words in an NLP task. Normally, an LSTM model learns sequential data in the forward direction, but in some cases a pattern needs to be learned in the backward direction as well, so BLSTM was developed to handle this; it learns sequential data in both forward and backward directions. Many research studies have shown that BLSTM performs better than LSTM [44,45,46]. CNN, LSTM, and BLSTM models have all been used in sentiment analysis tasks. Ouyang et al. (2015) used a CNN model to perform sentiment classification on a movie review dataset [47] and showed that the CNN model was more accurate than shallow classification algorithms such as naïve Bayes and support vector machines [48]. Nowak et al. (2017) compared LSTM against BLSTM models in sentiment classification of an Amazon book review dataset and found that the BLSTM model was more accurate than the LSTM model in this task [49].

Besides studies using individual models, there have been studies that used combinations of models to improve performance. Wang et al. [13] showed that a combination of a CNN model with an LSTM model yielded a lower error measure than individual CNN and LSTM models alone in predicting the valence-arousal value (dimensional sentiment in numerical form) of the Stanford Sentiment Treebank (English) [50] and the Chinese Valence-Arousal Text corpus (Chinese) [51]. Lin et al. (2017) showed that a combination of BLSTM with CNN provided the best performance among BLSTM, CNN, CNN-LSTM, LSTM-CNN, and LSTM in predicting the type of customer feedback in the IJCNLP 2017 Shared Task on Customer Feedback Analysis dataset [14]. Minaee and colleagues [15] showed that an ensemble of CNN and LSTM models provided a higher accuracy than the individual CNN and LSTM models alone in sentiment classification of an IMDB review dataset [52] and the Stanford Sentiment Treebank dataset.

Sentiment analysis is normally performed at a coarse level, i.e., the document or sentence level. Recently, aspect-level sentiment analysis has been proposed [6], since a sentence may convey multiple feelings, for example, "Bad service but really good food." Here, there are two aspects, "service" and "food": the customer has a positive sentiment toward the food but a negative sentiment toward the service of this restaurant. Aspect-level sentiment analysis therefore aims to understand the sentiment toward a particular aspect term. This can be achieved by integrating an attention mechanism into learning models [53]. The mechanism imitates human attention behavior in reading, focusing on the context words that draw attention. Wang et al. (2016) employed the attention mechanism on LSTM and proposed a model called Attention-based LSTM with Aspect Embedding [6]. The aspect (target word) embedding is concatenated with the word embedding vector and fed into the LSTM layer, and it is also concatenated with the hidden state vector and fed into the attention layer. Ma et al. (2017) proposed an Interactive Attention Network that utilizes two LSTM models to separately learn context and target words [54]. The hidden states of both models are then interactively learned through the attention mechanism and combined. Both studies showed that employing an attention mechanism in LSTM clearly improves the overall performance of aspect-level sentiment analysis on SemEval 2014 Task 4 [55].

Proposed Framework

The framework for Thai sentiment analysis in this experiment consisted of three main parts: (i) data pre-processing, (ii) feature extraction, and (iii) learning model, as shown in Fig. 1.

Fig. 1 Thai sentiment analysis framework

Data Pre-Processing

Data Cleansing

To boost the performance of sentiment classification, a text must first be passed through a cleansing process to get rid of noise that can degrade downstream processes, especially tokenization. The text input in our work was processed in the following ways: (a) any English words were changed from upper case to lower case; and (b) because the text data used in this experiment, such as the Thai Economy Twitter and Wisesight datasets, were collected from social media, the text contained some Uniform Resource Locators (URLs). Many URLs consist of long strings of characters and numbers without any useful meaning, i.e., they are quite noisy, so we replaced every URL with the token "xxurl."
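A minimal sketch of this cleansing step follows. The regular expression and function name are ours, not the paper's; only the "xxurl" placeholder token comes from the text above.

```python
import re

# Matches common http(s) and www-style URLs; an illustrative pattern only.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def clean_text(text: str) -> str:
    text = text.lower()                    # (a) lower-case any English words
    text = URL_PATTERN.sub("xxurl", text)  # (b) replace noisy URLs with "xxurl"
    return text

print(clean_text("Check THIS out: https://example.com/x?y=1"))
# -> "check this out: xxurl"
```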

Word Tokenization

Each word in a sentence must be separated before being fed into the model as input; this process is called word tokenization. Unlike English, where two adjacent words are separated by a space, Thai has no explicit word delimiters, so splitting Thai words in a sentence requires a special technique. In this experiment, we split words using a technique based on the maximum matching algorithm from the PyThaiNLP library [56]. The maximum matching algorithm implemented in PyThaiNLP is a dictionary-based approach: it scans a series of input characters and matches them with words in a dictionary [57,58,59], then employs breadth-first search to select the segmentation that contains the minimum number of word tokens. Tokens that are not in the dictionary are segmented into Thai character clusters. A Thai character cluster is an unambiguous, indivisible unit smaller than a word; clustering is performed by the character clustering algorithm proposed by Theeramunkong and his colleagues [60], which utilizes a set of simple rules based on the types of Thai characters. After tokenization, all tokens (except emoticons) are fed into the spell-check algorithm proposed by Peter Norvig [61]. The algorithm finds possible permutations of each original word within an edit distance of two (inserts, replaces, transposes, deletes) and selects the candidate with the highest frequency from the list of edited words that match words in the dictionary.
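In code, this step might look as follows with PyThaiNLP, assuming its default "newmm" engine corresponds to the dictionary-based maximum-matching tokenizer described above, and that pythainlp.spell.correct implements the Norvig-style checker:

```python
from pythainlp.tokenize import word_tokenize
from pythainlp.spell import correct

sentence = "ผมชอบกินข้าวผัด"   # "I like to eat fried rice"
tokens = word_tokenize(sentence, engine="newmm")
print(tokens)                   # e.g. ['ผม', 'ชอบ', 'กิน', 'ข้าวผัด']

# Spell-correct each token (in the real pipeline, emoticons are skipped).
corrected = [correct(t) for t in tokens]
```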

POS Tagging

In this experiment, we used a POS tagging process to identify the types of words in a sentence, as needed for constructing POS tag features. Tagging was performed with a model based on the Perceptron Tagger from the PyThaiNLP library. This model categorizes words into 47 types based on the ORCHID corpus, a Thai POS-tagged corpus [62]. However, 47 word types seemed too complex and could be difficult for a model to learn. Therefore, the 47 ORCHID types were mapped to the 17 Universal POS tags defined by Universal Dependencies (UD) [63]. These 17 types are simple, easily comprehensible, and widely used across many languages. In the mapping process, however, only 15 UD types were actually produced, because no ORCHID type mapped to two of the UD types, "symbol" (SYM) and "other" (X), as shown in Table 1. Please note that the mapping function is part of the PyThaiNLP library.

Table 1 POS mapping between ORCHID and UD POS tags [56]

Since social media data often expresses the sentiment or emotion of users with emoticons, we added an "EMOJI" type as an additional POS tag type to identify emoticons in sentences. Emoticons are directly associated with emotion and are thus beneficial to sentiment analysis [64]. All emoticons were mapped to their English names via an emoji library [65]. In addition, we used a padding process, explained in "Padding," and the tokens added by this process were given another type, "PAD." Therefore, we categorized POS tags into 19 types, which differ slightly from the original 17 UD types.

Regarding emoticons, the sentic vector of each emoticon was computed from its name as mapped by the emoji library. If an emoticon's name was longer than one word, its sentic vector was the average of the sentic vectors of all the words in the name.
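A sketch of the tagging step follows, assuming PyThaiNLP's "orchid_ud" corpus option performs the ORCHID-to-UD mapping described above, and using the emoji library to map emoticons to their English names:

```python
from pythainlp.tag import pos_tag
import emoji

tokens = ["ผม", "ชอบ", "กิน", "ข้าวผัด"]
print(pos_tag(tokens, engine="perceptron", corpus="orchid_ud"))
# e.g. [('ผม', 'PRON'), ('ชอบ', 'VERB'), ('กิน', 'VERB'), ('ข้าวผัด', 'NOUN')]

# Emoticons get their own "EMOJI" tag; their sentic vector is the average
# over the words of the emoji's English name.
name = emoji.demojize("😂", delimiters=("", ""))  # 'face_with_tears_of_joy'
name_words = name.split("_")                       # words to average over
```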

Sentic Tagging

A feature that represents the emotion of a word is called a sentic feature. The sentic value of a word is encoded in a 5-dimensional vector consisting of four affective dimensions and a polarity, as presented by [66]. The emotion represented by each dimension is explained in detail in "Sentic." Sentic vectors representing the English sentiment lexicon can be obtained from SenticNet, which is now in version 5.0. For use with the Thai language, we needed to construct a new Thai-SenticNet, built on the two following concepts: bi-directional translation and Thai-WordNet.

1. Thai words were aligned with English words, and their alignment was verified with a bi-directional translation technique [67] based on the Bi-LEXiTRON [16] and Bi-Volubilis [17] corpora. Furthermore, several words were added from Thai-WordNet [68]: Thai words that aligned with English words in a Synset.

2. New entries were constructed by deleting some stop words [69] from entries in our corpus (details about the corpus are below) and adding the resulting words back to the corpus to make it more comprehensive.

In this work, we used the following corpora to map English sentic vectors to Thai sentic vectors: (i) bi-directional LEXiTRON (Thai-English) [16]; (ii) bi-directional Volubilis 11K (Thai-English) [17]; (iii) Volubilis-100K [17]; (iv) Thai-WordNet [18]; and (v) SenticNet5 [70]. Our final corpus contained 23,093 successfully verified Thai words and 15,247 sentic vectors. We called it Thai-SenticNet5.

The purpose of Thai-SenticNet5 was to accurately map the sentic values of English words in SenticNet5 to corresponding Thai words in our constructed corpus. Several Thai-English dictionaries and corpora exist, such as LEXiTRON, Volubilis, and Thai-WordNet. We wanted our corpus to contain as many Thai words as possible, so we combined the three Thai corpora mentioned above under the constraint that each translated word had to be verified by the bi-directional translation technique: the meaning of every Thai word in the corpus must remain the same when it is translated into English and then back into Thai [16].

First, we considered the Thai words in Volubilis-100K, the biggest corpus in this experiment with 107,607 entries. We then merged duplicate words that had more than one entry (i.e., words listed in different POS sections of the corpus) into single entries, reducing the corpus to 100,107 words. Words in this modified Volubilis-100K were not verified by the bi-directional translation technique; therefore, we selected only the Thai words from this corpus and matched them with translated English words from the following corpora, which had already been verified by the bi-directional translation technique:

1. the Thai-WordNet corpus, created by aligning Princeton WordNet's Synsets with Thai words using a bilingual dictionary;

2. the LEXiTRON-Volubilis-Bi corpus, created by merging LEXiTRON-Bi [16] (2871 words) with Volubilis-Bi [17] (11,065 words). Please note that the Thai words in Volubilis-Bi were a subset of the 11,820 entries in Volubilis-100K.

Afterwards, we merged the Thai words from the modified Volubilis-100K (100,107 words) with the list of Thai words from LEXiTRON-Bi (we did not merge the Thai words from Volubilis-Bi because they were a subset of the words in Volubilis-100K), obtaining 100,118 Thai words. To get even more Thai words, we deleted stop words from each relevant entry in the list and added the resulting entries to the list, giving 119,281 Thai words. Let us call this list ThaiWordList.

After that, we created a verified dictionary by aligning the Thai words in ThaiWordList with sets of English words in the Thai-WordNet and LEXiTRON-Volubilis-Bi corpora. We matched each Thai word in ThaiWordList with a set of English words in Thai-WordNet, provided the Thai word had a corresponding English word in a Synset in Thai-WordNet. If it did not, we matched it with a set of English words in LEXiTRON-Volubilis-Bi instead. This yielded 23,093 matches. Finally, we created the Thai-SenticNet5 corpus by mapping the set of English words for each Thai word to the SenticNet5 corpus to obtain a set of sentic values and an average sentic value for the Thai word. The result was the 15,247-word Thai-SenticNet5 corpus, with a sentic vector for every Thai word in it.

The construction of the Thai-SenticNet5 corpus is shown as pseudocode in Algorithm 1, and a schematic sketch follows Table 2. Table 2 shows the number of verified words and the number of words with a sentic vector in each of the mentioned corpora; the number of verified words with an associated sentic vector increased after every step of the construction process. Thai-SenticNet5 is available for download at https://github.com/dsmlr/ThaiSenticNet5.

Algorithm 1 Construction of the Thai-SenticNet5 corpus
Table 2 Number of verified words and the number of words that had a sentic vector in each corpus
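The matching procedure can be summarized in Python. This is a schematic under our reading of the text, not the authors' exact Algorithm 1; all corpora are represented as hypothetical dictionaries ({thai_word: set of English words} and {english_word: 5-dim sentic vector}).

```python
import numpy as np

def build_thai_senticnet5(thai_word_list, thai_wordnet,
                          lexitron_volubilis_bi, senticnet5):
    corpus = {}
    for thai_word in thai_word_list:
        # Prefer the Thai-WordNet alignment; fall back to LEXiTRON-Volubilis-Bi.
        english = thai_wordnet.get(thai_word) or \
                  lexitron_volubilis_bi.get(thai_word)
        if not english:
            continue                      # no verified translation
        vectors = [senticnet5[w] for w in english if w in senticnet5]
        if vectors:                       # average sentic vector per Thai word
            corpus[thai_word] = np.mean(vectors, axis=0)
    return corpus
```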

Padding

A set of words has to be transformed into vector data before being fed into the model. The vector data are processed in chunks (batches), and every sample vector in a batch must be of the same size. Therefore, we padded shorter sequences with 〈PAD〉 tokens so that every sample vector in a batch had the same size.
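A minimal padding sketch (the 〈PAD〉 token is written <PAD> here):

```python
def pad_batch(batch, pad_token="<PAD>"):
    # Extend every token sequence to the length of the longest one.
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_token] * (max_len - len(seq)) for seq in batch]

print(pad_batch([["a", "b", "c"], ["d"]]))
# [['a', 'b', 'c'], ['d', '<PAD>', '<PAD>']]
```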

Feature Extraction

Word Embedding

Deep learning models are a subset of neural network models. They are mathematical models that cannot learn directly from raw text data; they can only learn from vector data, so a word must first be transformed into a vector. This transformation is called word embedding and can be done by models such as Word2Vec, GloVe, and ULMFiT. There are two conventional Word2Vec models: the Continuous Bag-of-Words model, which uses context words (words surrounding the target word) as input to predict the target word, and the Skip-Gram model, which uses the target word to predict the context words [71]. An efficient Word2Vec model should be trained on a large corpus. The GloVe model [72] learns word vectors by using information from word co-occurrence probabilities at the global level (the whole dataset) and gives good results when trained on a large corpus. However, training a model on a large corpus consumes a lot of time. Fortunately, a technique has been proposed for fine-tuning a language model so that knowledge gained from one task can be transferred to other NLP tasks, which means the language model does not need to be trained from scratch. This technique is called Universal Language Model Fine-tuning (ULMFiT) [73].
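Although we ultimately used a pre-trained model (next paragraph), the two conventional Word2Vec variants can be illustrated with gensim, used here purely as an example (the paper does not name a library); the toy corpus stands in for a large pre-tokenized one:

```python
from gensim.models import Word2Vec

sentences = [["ผม", "ชอบ", "กิน", "ข้าวผัด"],
             ["เด็ก", "ชอบ", "อ่าน", "นิทาน"]]

# sg=0 trains Continuous Bag-of-Words; sg=1 trains Skip-Gram.
cbow = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)

vec = cbow.wv["ชอบ"]   # a 300-dimensional word vector
```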

The datasets used in this experiment were relatively small, so it was difficult to train an efficient model from scratch; hence, we used a pre-trained language model to transform words into vectors. The pre-trained language model came from the thai2fit library [74]: an ASGD Weight-Dropped LSTM model [75] trained by the ULMFiT method on a Thai Wikipedia dataset. The pre-trained embedded word vectors had 300 dimensions.

POS Tagging

In this experiment, we used a POS tag feature that represents the type of a word in a sentence in one-hot vector form. The number of dimensions of the one-hot vector equals the number of POS types. For each POS tag type, the corresponding one-hot vector has a value of 1 in the dimension for that type and 0 in all other dimensions. We categorized POS tags into the 17 UD POS tag types shown in Table 1 plus two additional types, EMOJI and PAD.
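A sketch of this encoding follows; the ordering of the tag list is our assumption, while its 19 members (17 UD tags plus EMOJI and PAD) come from the text:

```python
import numpy as np

POS_TYPES = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN",
             "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM",
             "VERB", "X", "EMOJI", "PAD"]

def pos_one_hot(tag: str) -> np.ndarray:
    # 19-dimensional vector with a single 1 at the index of the word's tag.
    vec = np.zeros(len(POS_TYPES))
    vec[POS_TYPES.index(tag)] = 1.0
    return vec

print(pos_one_hot("VERB"))
```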

Sentic

The sentic feature is a five-dimensional vector composed of four affective dimensions, based on the Hourglass of Emotions model, plus a polarity. The model follows psychological principles that relate brain activity to changes in mood [76, 77]. The four dimensions are sensitivity (Snst), aptitude (Aptit), attention (Attnt), and pleasantness (Plsnt). The polarity of a word can be calculated by:

$$ p = \sum_{i=1}^{N} \frac{Plsnt(\kappa_{i}) + \lvert Attnt(\kappa_{i}) \rvert - \lvert Snst(\kappa_{i}) \rvert + Aptit(\kappa_{i})}{3N}, $$
(1)

where N is the total number of word concepts (concepts that describe perceived objects or actions) and $\kappa_i$ is the i-th input concept. Here, p lies in the range [−1, 1], from extremely negative to extremely positive emotion.
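A worked instance of Eq. (1) for a single concept (N = 1), with made-up sentic values:

```python
# Hypothetical sentic values in [-1, 1] for the four affective dimensions.
plsnt, attnt, snst, aptit = 0.6, 0.3, -0.2, 0.5

N = 1
p = (plsnt + abs(attnt) - abs(snst) + aptit) / (3 * N)
print(p)   # 0.4: a moderately positive polarity
```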

This research obtained the sentic values from the SenticNet5 corpus, a sentiment lexicon at the concept level that employs BLSTM to infer primitives by lexical substitution. Feature vectors were extracted using the Thai-SenticNet5 corpus, as explained in "Sentic Tagging." Any Thai word that does not exist in the corpus is represented by a 5-D zero vector.

Learning Model

Bi-Directional Long Short-Term Memory

One of the powerful algorithms in the RNN family is BLSTM, which is able to learn sequential data in both forward and backward directions. It is an extension of LSTM. RNNs are known to suffer from the vanishing gradient problem on long data sequences [78]; LSTM was introduced to deal with this problem.

LSTM processes data in the forward direction with the ability to remember and forget information. An LSTM model consists of the following components: forget gate ($f_t$), input gate ($i_t$), input modulation gate ($\tilde{c}_t$), cell state ($c_t$), output gate ($o_t$), and hidden state ($h_t$). The forget gate ($f_t$) enables the model to reset itself, i.e., to forget old information at an appropriate time. When a new sample ($x_t$) arrives, it is considered together with the previous hidden state ($h_{t-1}$) to decide how much information should be forgotten. A sigmoid function is employed for this task; its value ranges from 0 to 1, corresponding to completely forgetting or fully remembering the previous information, respectively:

$$ f_{t} = sigmoid(W_{f} [h_{t-1}, x_{t}] + b_{f}). $$
(2)

The input gate ($i_t$) decides which information will be updated, again considering the new sample together with the previous hidden state ($h_{t-1}$). The sigmoid function decides how much new information should be updated, with values from 0 to 1:

$$ i_{t} = sigmoid (W_{i} [h_{t-1}, x_{t}] + b_{i}). $$
(3)

The input modulation gate ($\tilde{c}_t$) produces a candidate cell state that learns from both the new information ($x_t$) and the previous hidden state ($h_{t-1}$). It utilizes the tanh activation function and creates a vector of new candidate values:

$$ \tilde{c_{t}} = tanh (W_{c} [ h_{t-1}, x_{t} ] + b_{c} ). $$
(4)

The cell state ($c_t$) is a long-term memory cell that combines the old information ($c_{t-1}$), scaled down by the forget gate, with the new information, which is the product of the input gate ($i_t$) and the input modulation gate ($\tilde{c}_t$):

$$ c_{t} = f_{t}\cdot c_{t-1} + i_{t} \cdot \tilde{c_{t}}. $$
(5)

The output gate ($o_t$) decides what the next hidden state should be. It sends information to the hidden state ($h_t$) after restriction to the interval [0, 1] by a sigmoid function:

$$ o_{t} = sigmoid(W_{o} [h_{t-1},x_{t}]+b_{o}). $$
(6)

The last component is the hidden state ($h_t$), referred to as the output of the LSTM. It carries the information on what the LSTM has seen:

$$ h_{t} = o_{t} \cdot tanh (c_{t} ). $$
(7)

On the other hand, BLSTM processes data in both forward and backward directions. The architecture of BLSTM is shown in Fig. 2.

Fig. 2 Bi-directional long short-term memory for sentiment analysis

In this study, once a sentence was fed into the model, it went through the embedding layer, which converted the sentence into word embedding features that were further fed to a dropout layer. The embeddings were then combined with the POS tag and sentic vectors, as shown in Fig. 3, and the combined features were fed to a BLSTM layer. The hidden states of the forward and backward directions (the last outputs of the BLSTM) were concatenated and fed to a dropout layer, to prevent over-fitting [79], before being pushed on to the output layer. A softmax activation function was used at the output layer to produce class probabilities.

Fig. 3 Word-to-vector transformation process

Comparing the operation of the BLSTM model to human reading behavior, it is like reading each word from the beginning to the end of the sentence and analyzing the sentence in both forward and backward directions. This allows humans to interpret and analyze the meaning of the sentence, including grammatical patterns that may run in either direction.
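The BLSTM branch just described can be sketched as follows. This is a minimal PyTorch illustration; the framework and layer sizes are our assumptions, not taken from the paper. The 324-dimensional input assumes the 300-d word embedding, 19-d POS one-hot, and 5-d sentic features described earlier.

```python
import torch
import torch.nn as nn

class BLSTMClassifier(nn.Module):
    def __init__(self, input_dim=324, hidden_dim=128, num_classes=3):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):                        # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.blstm(x)              # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=1)   # concat forward/backward states
        return torch.softmax(self.out(self.dropout(h)), dim=1)

model = BLSTMClassifier()
probs = model(torch.randn(8, 20, 324))           # 8 sentences of 20 tokens
```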

Convolutional Neural Network

CNN is a feed-forward neural network with at least one convolutional layer as its core component; this layer automatically generates feature maps by sliding a filter over an image. Another important component is the pooling layer, which is employed to reduce the size of the feature maps. Applied to text, a CNN is able to capture local features. The architecture is shown in Fig. 4.

Fig. 4 Convolutional neural network for sentiment analysis

An input feature vector is first fed into the convolutional layer, which allows the model to learn information from groups of words through a striding filter. The striding (sliding) filter has dimensions w × h, where w is the length of the feature vector and h is the number of words the filter covers at a time. This leads to an output of size s × n, where n is the number of nodes in the convolutional layer and s is the number of strides, equal to l − (h − 1), with l being the number of words in the sentence. The output of the convolutional layer then passes through the Rectified Linear Unit (ReLU) activation function [80]. Because the vector passed to the output layer has to be one-dimensional, 1-D dynamic max pooling with a size of s × 1 is applied; it strides n times and gives a 1-D output vector that goes to the dropout layer and then to the output layer.
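A minimal PyTorch sketch of this branch (again, framework and sizes are our assumptions): a 1-D convolution with filter height h = 3 slides over the word dimension, giving s = l − (h − 1) positions, followed by ReLU and max pooling over time.

```python
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    def __init__(self, input_dim=324, n_filters=128, h=3, num_classes=3):
        super().__init__()
        self.conv = nn.Conv1d(input_dim, n_filters, kernel_size=h)
        self.dropout = nn.Dropout(0.5)
        self.out = nn.Linear(n_filters, num_classes)

    def forward(self, x):                      # x: (batch, seq_len, input_dim)
        x = x.transpose(1, 2)                   # Conv1d expects (batch, dim, len)
        feat = torch.relu(self.conv(x))         # (batch, n_filters, s)
        pooled = feat.max(dim=2).values         # max pooling over time
        return torch.softmax(self.out(self.dropout(pooled)), dim=1)

probs = CNNClassifier()(torch.randn(8, 20, 324))  # s = 20 - 2 = 18 positions
```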

Hybrid Deep Learning Models

We proposed the following four different hybrids of deep learning models.

BLSTM-CNN

BLSTM-CNN is a hybrid deep learning model that appends a CNN to a BLSTM. The model aims first to learn sequences of words with the BLSTM and then to capture local features with the CNN. The model is shown in Fig. 5. After a sentence is input into the model, its features are extracted and sent to the BLSTM layer to learn the sequence of the sentence in both forward and backward directions. The output of the BLSTM, which carries long-range dependency information from both directions, then goes to the CNN to extract local features of the text.

Fig. 5 BLSTM-CNN model for sentiment classification
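As a minimal PyTorch sketch of this arrangement (framework and layer sizes are our assumptions), the BLSTM's per-timestep outputs, which carry bi-directional long-range context, are fed to the convolutional layer, which then extracts local features before classification:

```python
import torch
import torch.nn as nn

class BLSTMCNN(nn.Module):
    def __init__(self, input_dim=324, hidden_dim=128, n_filters=128,
                 h=3, num_classes=3):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.conv = nn.Conv1d(2 * hidden_dim, n_filters, kernel_size=h)
        self.dropout = nn.Dropout(0.5)
        self.out = nn.Linear(n_filters, num_classes)

    def forward(self, x):                       # x: (batch, seq_len, input_dim)
        seq, _ = self.blstm(x)                   # (batch, seq_len, 2*hidden_dim)
        feat = torch.relu(self.conv(seq.transpose(1, 2)))
        pooled = feat.max(dim=2).values          # max pooling over time
        return torch.softmax(self.out(self.dropout(pooled)), dim=1)

probs = BLSTMCNN()(torch.randn(8, 20, 324))
```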

CNN-BLSTM

CNN-BLSTM is the reverse arrangement. The model aims first to learn local features of the text with the CNN, after which long-range dependencies between the words are learned by the BLSTM. The model is shown in Fig. 6. The output of the convolutional layer goes through the ReLU activation function, and the resulting output, with local features embedded, is fed into the BLSTM layer to learn the sequence in forward and backward directions. The hidden states of the forward and backward directions are concatenated before passing through the dropout layer and on to the output layer.

Fig. 6 CNN-BLSTM model for sentiment classification

BLSTM+CNN

This model learns the local features and the bi-directional word sequence at the same time. The model is shown in Fig. 7. An input sentence is feature-extracted and then fed into the BLSTM and CNN layers in parallel. The outputs of both layers are concatenated before going through the dropout and output layers.

Fig. 7 BLSTM+CNN model for sentiment classification

BLSTM×CNN

In this type of hybrid model, we simply ensembled both models with a soft voting scheme. The sentiment probability predicted by BLSTM×CNN is calculated by averaging the probabilities given by BLSTM and CNN; the final predicted sentiment is the class with the highest average probability. The model is shown in Fig. 8.

Fig. 8 BLSTM×CNN model for sentiment classification
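Soft voting reduces to a few lines; the probabilities below are made-up values for illustration:

```python
import numpy as np

p_blstm = np.array([0.2, 0.5, 0.3])     # hypothetical class probabilities
p_cnn = np.array([0.1, 0.3, 0.6])

p_ensemble = (p_blstm + p_cnn) / 2       # soft voting: average the two models
prediction = int(np.argmax(p_ensemble))  # -> class 2 (highest average)
```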

Experimental Framework

Datasets

The proposed hybrid models (BLSTM-CNN, CNN-BLSTM, BLSTM+CNN, and BLSTM×CNN) were compared with their individual counterparts (CNN and BLSTM) on three datasets.

Wisesight Sentiment Dataset

This set, Wisesight, was collected from public pages on Facebook, Twitter, YouTube, Pantip.com, and other Web forums between 2016 and early 2019. Most of the topics in this dataset concern consumer products and services. There were 26,740 messages divided into four classes: 6800 negative, 14,500 neutral, 4700 positive, and 500 query messages. The length of each message was between 1 and 428 words. It should be noted that this dataset was labelled by a group of annotators, with each message given one label by a single annotator. The dataset is available for download at https://github.com/PyThaiNLP/wisesight-sentiment.

Thailand Economy Twitter Dataset

This dataset, ThaiEconTwitter, was proposed by [81] and was collected from Twitter. Tweets with two hashtags, (stock) and (economic), posted between 17 April 2017 and 5 May 2017 were retrieved. The set consisted of 2000 sentences in three classes: positive, neutral, and negative sentiment. Each sentence was given a label by each of three experts. In this work, we selected only the sentences given the same label by all three experts, leaving 1041 sentences: 608 negative, 84 neutral, and 349 positive.

The 40 Thai Children’s Tales Dataset

This dataset, ThaiTales, was first used in [19]. It was collected from 40 Thai tales and consisted of 1964 sentences. Each sentence was labelled with one of three classes, i.e., positive, neutral, or negative sentiment, by three experts. All three experts agreed on the label for only 1115 sentences: 309 positive, 508 neutral, and 298 negative. The dataset is available for download at https://github.com/dsmlr/40-Thai-Children-Stories.

Experiment Settings

The performance of the proposed hybrid models, i.e., BLSTM-CNN, CNN-BLSTM, BLSTM+CNN, and BLSTM×CNN, was compared with that of the individual models, i.e., BLSTM and CNN. Three types of features and their combinations were also compared: word embedding (FW), POS tag (FP), sentic (FS), FW + FP, FW + FS, FP + FS, and FW + FP + FS. The experiments were conducted on the three datasets (ThaiTales, ThaiEconTwitter, and Wisesight) described in the previous subsection. Each dataset was split in a stratified manner into three subsets (training, validation, and test) at a ratio of 60:20:20; hence, all subsets inherited the characteristics of the original dataset, including its class distribution and sentence length distribution. We employed the Adam optimizer [82] with a learning rate of 0.001. Every model was trained for 300 epochs on the ThaiTales and ThaiEconTwitter datasets and 50 epochs on the Wisesight dataset. The reasons for the 50-epoch setting on Wisesight were that (i) the loss function converged at around the 30th epoch, as shown in Fig. 9, and (ii) the large number of samples led to a high computational cost. We performed grid searches to tune the hyperparameters of each algorithm. The search settings were as follows:

– BLSTM: the number of hidden nodes in the BLSTM layer was one of {16, 32, 64, 128, 256, 512}.

– CNN: the number of filters was one of {16, 32, 64, 128, 256, 512}, and the filter size was fixed at 3.

– BLSTM-CNN: the number of hidden nodes in the BLSTM layer was one of {16, 32, 64, 128, 256, 512}, the number of filters was one of {16, 32, 64, 128, 256, 512}, and the filter size was fixed at 3.

– CNN-BLSTM, BLSTM+CNN, and BLSTM×CNN: the settings were similar to those of BLSTM-CNN.

Fig. 9 Convergence curve of the loss function for each algorithm and dataset. Black and red indicate training and validation sets, respectively

Dropout layers were employed in all models, between the embedding layer and the output layer, with the dropout value set to 0.5. We then selected the optimal parameters based on the F1-score obtained on the validation set, as explained in the following subsection. The optimal parameters were used to train the final model on the combined training and validation sets, and this model was evaluated on the test set. The above process was repeated with 10 different random splits.
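The evaluation protocol can be sketched as follows, assuming scikit-learn's train_test_split for the stratified 60:20:20 split (the paper does not state which tool it used); X and y are toy stand-ins for the extracted features and sentiment labels:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]          # toy features
y = [i % 3 for i in range(100)]        # toy 3-class labels

def stratified_split(X, y, seed):
    # 60% train, then split the remaining 40% evenly into validation and test.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

for seed in range(10):                 # 10 different random splits
    train, val, test = stratified_split(X, y, seed)
    # ... grid-search hidden sizes on val, retrain on train+val, evaluate on test ...
```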

Performance Evaluation

As our datasets were mostly imbalanced, we used F1 as the performance measure. F1 seeks to balance precision and recall; it is their harmonic mean:

$$ F_{1}= 2\cdot\frac{P\cdot R}{P+R}, $$
(8)

where P is precision and R is recall, which can be calculated as in (9) and (10), respectively.

$$ P = \frac{TP}{TP+FP} $$
(9)
$$ R = \frac{TP}{TP+FN} $$
(10)

where TP, FP, and FN denote true positive, false positive, and false negative, respectively.
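As a quick sanity check of Eqs. (8)-(10), the following compares a hand computation against scikit-learn on toy predictions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

P = precision_score(y_true, y_pred)   # TP=3, FP=1 -> 0.75
R = recall_score(y_true, y_pred)      # TP=3, FN=1 -> 0.75
print(2 * P * R / (P + R), f1_score(y_true, y_pred))   # both 0.75
```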

Results and Discussions

The fused deep learning models were evaluated with each of the three features and their combinations on the three datasets. Table 3 lists the average F1-scores across ten random splits. We first compared the performance of each individual feature: FW, FP, and FS. The best individual feature was FW, which yielded an F1-score of 0.6576 on average across all datasets, models, and ten random splits, followed by FP (0.4669) and FS (0.4598). When two or more features were combined, the overall performance improved: FW + FS was the best contender at 0.6653, followed by FW + FP + FS (0.6647) and FW + FP (0.6593). Clearly, FW was the most important feature, because the top four contenders from all runs by all models on all datasets always included FW, either alone or in combination with other features, which was not true of any other feature. Combining FP and FS improved the overall performance to an F1-score of 0.5294, better than using either of them individually. Thus, combining features led to an improvement in overall performance.

Table 3 Performance comparison of all models with different combinations of features on three datasets

There was variation in the performances shown in Table 3. Eighteen judges (one per model-dataset combination) ranked the performance of each of the 7 feature sets based on F1-score. The ranks are shown in Table 4. The significance of the ranks was tested using Kendall's coefficient of concordance (W), which was 0.8137 (p < 0.01 with 6 degrees of freedom). W is particularly useful for testing inter-judge or inter-test reliability [83]. The rank sum of each feature set indicates the best overall ordering [83], which is as follows:

$$ F_{W}+F_{P}+F_{S} \sim F_{W}+F_{S} > F_{W}+F_{P} > F_{W} > F_{P}+F_{S} > F_{S} > F_{P}. $$
Table 4 Ranks assigned to the 7 feature sets by 18 judges (one per model-dataset pair), based on the F1-scores in Table 3

We further employed multiple t-tests on the results in Table 3 to test the significance of the differences between the means of two independent samples [84]. The tests show that every possible pairwise difference is very highly significant (p < 0.001) except FP vs. FS (p = 0.5122), FW vs. FW + FP (p = 0.3239), and FW + FS vs. FW + FP (p = 0.7013), which are less conclusive.

According to the tests, fusing features clearly improved the prediction performance. Findings from two of our previous works support this observation [11, 12]. For example, they suggest that combining FW and FP can improve prediction performance: FW captures some syntactic information about a word, while FP directly captures the grammatical type of a word. Pasupa and his colleagues showed that intransitive verbs (vi), transitive verbs (vt), adverbs (adv), common nouns (n), and adjectives (adj) are the word types that stimulate the strongest human emotions [19]. In addition, Pasupa and Seneewong Na Ayutthaya showed that simply including POS information for all words in a sentence can improve prediction performance [11], and that it was even better when POS information was included in the RNN model for only selected words of the five types above (n, vi, vt, adj, and adv).

Since the Wisesight dataset contained many more samples than the others, we separately evaluated the ranks of the performance of every feature set on two groups of datasets: (i) the small-sized dataset group, ThaiTales and ThaiEconTwitter, and (ii) the large-sized dataset group, Wisesight. In the analysis of the small-sized dataset group, the ranks of all feature sets led to W = 0.8585, significant at p < 0.01. This high value enabled us to report with confidence that the following ranking is valid:

$$ F_{W}+F_{S} > F_{W}+F_{P}+F_{S} > F_{W}+F_{P} > F_{W} > F_{P}+F_{S} > F_{P} > F_{S}. $$

In the analysis of the large-sized dataset group, the ranking of all features was:

$$ F_{W} > F_{W}+F_{P} > F_{W}+F_{P}+F_{S} > F_{W}+F_{S} > F_{P}+F_{S} > F_{S} > F_{P}. $$

This ranking was significant at p < 0.01 with W = 0.8254. According to it, the combinations that include FW occupy the top ranks. Combining additional information (sentic and POS) was able to improve on FW alone for the small-sized datasets; on the large-sized dataset, however, FW alone was the best contender. Please note that the benefit of combined features applied to all models.

Moreover, Table 5 shows the F1-score achieved by every model on each dataset (averaged across all feature sets and ten random splits). Focusing on the individual models, BLSTM and CNN: CNN outperformed BLSTM on the ThaiTales dataset, and its performance was better than BLSTM's for all feature sets, as shown in Table 3. However, BLSTM achieved an F1-score of 0.4694 on the Wisesight dataset, better than CNN's 0.4183, and it also outperformed CNN there for all feature sets. This might be because the Thai tales are simple in vocabulary and grammatical structure, so learning from neighboring words (local features) is more relevant than learning sentence sequences, whereas users may use difficult words or complex sentences on social media. On ThaiEconTwitter, CNN achieved an average F1-score of 0.6481 while BLSTM achieved 0.6502; however, it is inconclusive whether CNN performed worse than BLSTM there, because its performance was worse in only 4/7 cases.

Table 5 Average F1 scores of every model on three datasets, averaged across all features and ten random splits

The best performer was BLSTM-CNN, which achieved an F1-score of 0.6100 on average. Combining two models improved the performance in most cases, the exceptions being BLSTM×CNN on Wisesight and ThaiTales and CNN-BLSTM on ThaiTales. Overall, however, the fused models performed better than the individual ones on average.

Considering only the fused models, BLSTM-CNN was clearly better than CNN-BLSTM in all cases (all feature sets) on ThaiTales and ThaiEconTwitter, while it outperformed CNN-BLSTM in only 1/7 cases (FP + FS) on Wisesight, as shown in Table 3. Nonetheless, CNN-BLSTM was the worst combination in all cases on ThaiTales and in 5/7 cases on ThaiEconTwitter. Learning with the concatenated outputs of BLSTM and CNN (BLSTM+CNN) yielded better performance than the simple voting ensemble of both models (BLSTM×CNN) in all cases on the Wisesight dataset; on the other hand, BLSTM×CNN performed better than BLSTM+CNN in 5/7 cases on ThaiEconTwitter and 6/7 cases on ThaiTales. The best combination on Wisesight was CNN-BLSTM (in 6/7 cases) and the worst was BLSTM×CNN (in all cases).

We further investigated why BLSTM-CNN performed worse than CNN-BLSTM only on Wisesight. Figure 10 shows the sentence length (number of words in an item) distributions of all three datasets. The sentence length distributions of the Wisesight and ThaiEconTwitter datasets were clearly skewed to the right; i.e., the sentences in these two datasets were mostly short, with a mode of 5. The distribution for ThaiTales, by contrast, was close to normal (unskewed), with a mean of 16.84, a median of 15, and a mode of 15. The length of sentences in Wisesight varied from 1 to 428 words, a much wider range than in the other two datasets: the longest sentence was 74 words in ThaiEconTwitter and 68 words in ThaiTales.

Fig. 10 Sentence length distribution on Wisesight, ThaiEconTwitter, and ThaiTales datasets

Because of this variation, we divided each range into 10 equal intervals and tested the samples in each interval separately for each dataset. The F1-score for every interval is reported for all datasets in Fig. 11. BLSTM-CNN performed better than CNN-BLSTM in all intervals on ThaiTales and ThaiEconTwitter. On the Wisesight dataset, CNN-BLSTM achieved better performance than BLSTM-CNN only in three intervals, [0–43], [87–129], and [173–215], but these intervals contained 88.08%, 1.94%, and 0.25% of the samples, respectively. The dominant first interval therefore pulled the overall result in CNN-BLSTM's favor: CNN-BLSTM performed better than BLSTM-CNN on short sentences, while BLSTM-CNN outperformed CNN-BLSTM on long sentences in the Wisesight dataset.

Fig. 11 F1-scores for every sentence length interval in 10 test sets on all datasets, averaged across 10 runs and all features

As there was a degree of variation in the ranking of the models, a significance test using Kendall's coefficient of concordance was again conducted, this time to evaluate the ranking of the models. The six models were assigned ranks by 21 judges (one per feature-dataset combination), as shown in Table 6. The computed W was 0.3263, significant at p < 0.01. Given that a significant level of agreement between the rankings had been established, the best overall ordering of the models was derived from the rank sums, giving the following ranking:

$$ \text{BLSTM-CNN} > \text{BLSTM}+\text{CNN} > \text{BLSTM}\times\text{CNN} > \text{CNN-BLSTM} > \text{CNN} > \text{BLSTM}. $$
Table 6 Ranks assigned to 6 models by 21 judges (across all features and datasets) according to their F1-scores listed in Table 3

Again, multiple t-tests were conducted to test the significance of the differences between the mean F1-scores achieved by pairs of models; we tested every possible pair. The tests showed that the differences for all pairwise combinations were very highly significant (p < 0.001), except for BLSTM vs. CNN (p = 0.3354), CNN vs. CNN-BLSTM (p = 0.1779), and BLSTM+CNN vs. BLSTM×CNN (p = 0.2866), which were less conclusive.

In addition, the ranking of every model on the small-sized datasets led to W = 0.6257 (p < 0.01), giving the following overall ranking:

$$ \text{BLSTM-CNN} > \text{BLSTM}\times\text{CNN} > \text{BLSTM}+\text{CNN} > \text{CNN} > \text{BLSTM} > \text{CNN-BLSTM}, $$

while the ranking of every model on the large dataset was

$$ \text{CNN-BLSTM} > \text{BLSTM-CNN} > \text{BLSTM}+\text{CNN} > \text{BLSTM} > \text{BLSTM}\times\text{CNN} > \text{CNN}. $$

The computed value of W was 0.9005 on the large dataset—significant at p < 0.01.

According to [85], the bigger the sentiment lexicon, the better the prediction accuracy. Therefore, we plotted the average ratio of the number of words with a sentic value to the total number of words in a sentence for each dataset against the average F1-scores across all models and features on the test sets, as shown in Fig. 12. There was no significant correlation between these F1-scores and ratios, either within each dataset or across all datasets. The ratios of ThaiEconTwitter and Wisesight were about the same, but their F1-scores were far apart. On the other hand, ThaiTales had the largest ratio of the three datasets, yet its F1-scores were lower than those of ThaiEconTwitter, which had a smaller ratio.

Fig. 12 Plot of average ratios of the number of words with sentic values to the number of words in a sentence against average F1-scores achieved by every model and feature across all three datasets

In addition, we show the confusion matrices of BLSTM-CNN with FW + FP + FS for each dataset, since this combination was the best contender according to our analysis (Fig. 13). Regarding the misclassified samples in the ThaiTales dataset, the model tended to misclassify negative samples as neutral rather than as positive, and positive samples were likewise misclassified as neutral. This shows that the classifier tended to predict toward the majority class, an observation that also applies to the remaining datasets. The majority class of ThaiEconTwitter was the negative class, and the misclassified positive and neutral samples were indeed assigned to it. It should be noted that the model predicted each class well on the ThaiTales and ThaiEconTwitter datasets but not on the Wisesight dataset, where most misclassified samples were, as expected, assigned to the majority (neutral) class. The model correctly classified 4741 of 9820 positive samples (48.3%), while classifying 4165 positive samples as neutral (42.4%). Likewise, only 283 of 1100 question-class samples were correctly classified (25.5%), while 617 question-class samples were classified as neutral (55.6%). Overall, all tested classifiers tended to be biased toward the majority class.

Fig. 13 Confusion matrices of BLSTM-CNN with FW + FP + FS features on test sets, from a total of 10 runs with different random splits

We further performed an error analysis to identify what caused errors in the best model, BLSTM-CNN with the FW + FP + FS feature set. Examples of prediction errors are shown in Fig. 14.

Fig. 14 Examples of sentences for which BLSTM-CNN predicted the wrong sentiment

The sentence in Example 1 can be divided into two parts: (1) and (2) . The first part is a sarcastic remark that the Japanese consume too much of only Japanese-made products, with a negative sentiment of over-consumption from the Thai word " ," a shortened pronunciation of the English word "over." The second part is a remark with a positive sentiment, supporting the idea that nationalism benefits the domestic economy. The sentiment of this two-part sentence should be positive, but BLSTM-CNN predicted it as negative because it focused on the first part and did not recognize that part as sarcasm.

The sentence in Example 2 can also be divided into two parts: (1) and (2) global . The words in the first part clearly convey a negative sentiment of being annoyed (by), while the words in the second part state that the Olympic Games are a world-class athletic event that every country wants to host to benefit its own economy, which conveys a positive sentiment. The sentiment of the whole sentence should be negative because of the annoyance, but BLSTM-CNN focused on the second part and predicted the sentiment of the whole sentence as positive.

Conclusion

This paper proposes a Thai sentiment analysis framework comprising data pre-processing, feature extraction, and model construction. In addition, we propose the Thai-SenticNet5 corpus, built on SenticNet5 in association with LEXiTRON, Volubilis, and Thai-WordNet. Furthermore, four hybrid deep learning models (BLSTM-CNN, CNN-BLSTM, BLSTM+CNN, and BLSTM×CNN) are proposed and evaluated on three datasets: ThaiTales, ThaiEconTwitter, and Wisesight. Three types of features (word embedding, POS tag, and sentic) were used to represent the meaning, POS, and sentiment of a word, respectively, and all of their combinations were also evaluated. The results show that feature combination improved the overall performance of sentiment analysis; the best candidate was the combination of word embedding, POS, and sentic features, which led to the highest F1-score. Moreover, the results demonstrate that hybrid deep learning models enhance task performance, with BLSTM-CNN the overall best contender.

As mentioned, our datasets were mostly imbalanced, but the current models did not account for class imbalance, which causes a model to be biased toward the majority class. In future work, we will consider applying focal loss, which can handle the class imbalance problem and has been successfully evaluated on image data, e.g., red blood cell classification [86]. Also, a more recent version of SenticNet integrates symbolic models and subsymbolic methods to encode meaning and learn syntactic patterns from data [87]; it could be employed to improve the overall performance of the sentiment analysis task.