1 Introduction

Recently, the world has witnessed an explosion of social networks (SNs) such as Twitter, Facebook, and Instagram, which have attracted a wide portion of internet users to collaborate interactively and communicate globally. In SNs, people express and share their opinions and experiences through different types of social data: textual data (e.g., comments, tweets, and reviews), visual data (e.g., shared and liked images), and multimedia data (e.g., videos and sounds). A huge volume of data is generated by SNs on a daily basis, and these data reflect the sentiment tendencies of the audience towards different aspects of life, such as political, business, and social subjects. For researchers, these data contain valuable information that can be used to improve and adapt products and services, predict upcoming marketing trends, change sales strategies, and more (Birjali et al. 2017). Social data are informal, unstructured, and rapidly evolving; therefore, processing and analyzing them with conventional analysis methods is a very time-consuming and resource-intensive task (Elouardighi et al. 2017). Natural Language Processing (NLP) is a theory-motivated computational technique that enables computers to understand, analyze, and derive meaning from human natural language (Tarwani and Edem 2017), which is complicated in its sequential and hierarchical structure. NLP algorithms make it possible to perform different language-related tasks such as part-of-speech (POS) tagging, parsing, machine translation, and dialogue systems. According to Al-ayyoub and Nuseir (2016), sentiment analysis (SA) is a hot research area of NLP concerned with classifying the opinions or emotions expressed towards a product, service, topic, etc., into a certain sentiment label. Text-based SA aims to use text mining, linguistics, and statistical techniques to automatically assign predefined sentiment labels (e.g., negative, positive, or neutral) to text generated by online users (Alowaidi et al. 2017); the label set varies with the context of the analysis. Sentiment analysis comprises different subtasks, such as polarity classification, subjectivity detection, and humor detection, which can be conducted at the sentence, document, or aspect level (Mostafa 2017; Lu et al. 2018).

For decades, machine learning algorithms such as SVM and logistic regression have been applied to a range of NLP problems. Recently, neural networks based on dense vector representations have achieved state-of-the-art performance across NLP tasks (Sze et al. 2017; Haydar et al. 2018) thanks to their effectiveness and automatic learning capabilities (Ain et al. 2017). Deep neural networks have achieved impressive advances in pattern recognition and computer vision, and following this trend, several deep learning architectures have been introduced for difficult NLP tasks, particularly sentiment analysis. Sentiment analysis has gained considerable research attention; most studies have targeted English, the dominant language of science, along with other Indo-European languages. Recently, the Arabic language has recorded explosive growth in its number of internet users (Boudad et al. 2017; Alsmearat et al. 2015). Figure 1 illustrates the top ten languages by percentage of internet users according to the Internet World Stats ranking, in which Arabic ranks fifth with more than 168.1 million users (Alowaidi et al. 2017).

Fig. 1 Top ten internet-using languages

However, compared with other languages, relatively little research has investigated sentiment analysis of Arabic text, owing to the challenging nature of the language (Alowaidi et al. 2017; Guellil et al. 2019): its dialectal varieties and morphological complexity require heavier preprocessing and more advanced dictionaries (lexicons) than other languages (Altrabsheh et al. 2017). According to Al-kabi et al. (2014), a single Arabic sentence can take several inflectional and derivational forms; for instance, the positions of words in the sentence, and whether the sentence is verbal or nominal, may change the words' meanings. Arabic opinion mining is therefore sensitive to context and domain, and one word can express different polarity classes in different contexts. Arabic is also highly diverse in its suffixing, prefixing, and infixing, which directly affects word and sentence representation (Boudad et al. 2017). Common spelling mistakes and a lack of available corpora pose additional challenges. Efficient algorithms and tools are therefore required to perform effective, automated feature extraction.

The main contribution of this paper is a novel deep learning model, based on a convolutional neural network and long short-term memory, for Arabic sentiment analysis of user-generated textual content. This study also presents a comparative evaluation of the FastText (Skip-gram and CBOW), Word2Vec, and AraVec word embedding models on Arabic text classification.

The rest of this paper is organized as follows: Sect. 2 presents related work in SA. Section 3 introduces our proposed approach. Section 4 presents the experimental settings. Section 5 presents the experimental results and the evaluation of the proposed model. Section 6 concludes the paper and outlines future work.

2 Related works

There are two mainstream approaches to sentiment analysis: supervised (corpus-based) and unsupervised (lexicon-based) (Ravi and Ravi 2015). This section presents SA approaches in different languages.

2.1 Unsupervised based approach

Clustering-based approaches depend on the TF-IDF criterion for feature extraction: TF is proportional to the frequency of a term in a document, and IDF acts as a weighting factor; the terms with the highest TF-IDF values become the candidate features (Hemmatian and Sohrabi 2017). Claypo and Jaiyen (2014) used the K-means clustering algorithm with MRF feature selection for SA: MRF selected only the most relevant features, and K-means performed the final classification, outperforming Hierarchical Clustering and Fuzzy C-Means. Taj et al. (2019) used TF-IDF to determine the frequently used terms and their weights, employed WordNet to assign sentiment scores to the keywords, and used an operator to predict the final sentiment label. Huang et al. (2017a) presented a multi-modal model that jointly performs sentiment and topic classification based on latent Dirichlet allocation, evaluated on a multifarious dataset.
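As a minimal illustration of this feature-extraction scheme, the sketch below scores terms with scikit-learn's TfidfVectorizer on a toy English corpus; the corpus and the number of retained terms are our own illustrative choices, not drawn from the surveyed papers.

```python
# Illustrative TF-IDF feature extraction (toy corpus, hypothetical parameters).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the service was excellent and the staff were friendly",
    "terrible service and the room was dirty",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # documents x terms sparse matrix

# Terms with the highest TF-IDF weights in a document are candidate features.
weights = X.toarray()[0]
top = np.argsort(weights)[::-1][:3]
print(vectorizer.get_feature_names_out()[top])
```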

On the other hand, the lexicon-based approach (e.g., Keyvanpour et al. 2020; Elhawary and Elfeky 2010) is a popular, practical way to perform sentiment analysis; it uses a weighted dictionary to detect the semantic polarity of words. Lu et al. (2010) evaluated the sentiment polarity strength of reviews by multiplying the strengths of adjective and adverb words: adverb strengths were assigned manually, and adjective strengths were determined using progressive relation rules of adjectives and a propagation algorithm. Eirinaki et al. (2012) presented the High Adjective Count algorithm, which scores each noun by the number of adjectives associated with it, and the Max Opinion Score algorithm, which ranks the nouns by score; the highest-ranked nouns are selected as candidate features. Sasmita et al. (2017) extracted aspects using indicator words constructed from seed words, comparing each extracted pronoun or noun against the indicator words; an opinion lexicon then determined the sentiment orientation of each opinion term. Blair et al. (2017) performed SA using lists of positive and negative seed words and a number of topics, introducing three functions: objective topic detection, positive and negative sentiment detection, and sentiment classification. Pawar and Deshmukh (2015) proposed a hybrid SA approach in which n-gram and POS features were extracted using rule-based learning; after calculating sentiment scores, a threshold produced the final classification. NB, QDA, and RF classifiers were also used to classify tweets into their respective classes, and the ML approach achieved the best results.
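The sketch below shows the core of such a lexicon-based scorer with a deliberately tiny, hand-made weighted dictionary; real systems rely on large curated lexicons, and the words and weights here are purely illustrative.

```python
# Minimal lexicon-based polarity detection (toy dictionary, illustrative weights).
LEXICON = {"excellent": 2.0, "friendly": 1.0, "dirty": -1.5, "terrible": -2.0}

def lexicon_polarity(tokens):
    # Sum the weights of known words; the sign of the total gives the label.
    score = sum(LEXICON.get(t, 0.0) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_polarity("the staff were friendly and the food excellent".split()))
```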

2.2 Supervised based approach

According to Kim (2014), deep learning has made remarkable contributions in named-entity recognition (e.g., Chiu and Nichols 2015), computer vision (e.g., Krizhevsky et al. 2012), and speech recognition (e.g., Graves et al. 2013). Unlike conventional machine learning-based NLP models, DL models learn multi-layer feature representations automatically (Young et al. 2018; Chen and Zhang 2018), which lets even a simple DL model outperform the state of the art in AI tasks (Sohangir et al. 2018). Inspired by the human brain, a DL model is a complex neural network composed of several layers of perceptrons (Glorot et al. 2011). DL algorithms are effective at extracting implicit semantic features, which helps in transferring across domains, and their application to SA tasks has reduced human intervention, computation time, and feature engineering (Vateekul and Koomsubha 2016). Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are the most commonly used DL models for feature representation and classification (Hassan and Mahmood 2017).

CNN is a type of feed-forward neural network that requires less training data (Shickel et al. 2018). It has shown remarkable performance in different NLP tasks owing to its ability to capture task-specific syntactic and semantic features: CNN applies convolutional operations over the input layer to automatically extract local features, and weight sharing across neurons increases its learning capability (Ombabi et al. 2017). According to Ravuri and Stoicke (2016), CNN can outperform strongly competitive neural networks including FNN, RNN, and LSTM. Wint et al. (2018) used two parallel CNN layers with a BLSTM for sentence-level SA; the feature maps of the two layers were interleaved at a pooling layer, and a sigmoid function classified the reviews into (bullied/not bullied) and (positive/negative) labels, outperforming several baseline models. Huang et al. (2017b) proposed a deep learning model based on CNN and LSTM, supported by a pre-trained word representation model. Ouyang et al. (2015) combined Word2Vec with a 7-layer CNN containing 3 pairs of convolutional and pooling layers, with PReLU, normalization, and dropout; this model achieved the best classification accuracy over RNN and MV-RNN models.

The Recurrent Neural Network is mainly used in text classification owing to its ability to capture long-term dependencies and handle sequences of variable length. An RNN feeds the output of the previous hidden layer back as input to the current hidden layer; however, it suffers from the vanishing gradient problem (Al-Smadi et al. 2018). Preethi and Krishna (2017) explored RNNs for sentiment analysis, aiming to provide optimized place recommendations based on SA; experiments on an Amazon dataset showed improved classification performance.

Li et al. (2017) proposed a hybrid neural network architecture based on BTM, RSM, and latent semantic machines; a regularized transfer learning model incorporated semantic domain knowledge into the network and boosted classification performance. Wang and Cao (2017) proposed an LSTM-based SA approach with L2 regularization and the Nadam optimizer for Chinese text, evaluated on online-shopping reviews; the results indicated that the adopted loss and optimization functions improve classification accuracy. Ghosh et al. (2016) proposed a deep learning model for SA based on a Probabilistic Neural Network with a two-layer Restricted Boltzmann Machine (RBM): TF-IDF was used for data representation, and the PNN predicted the final sentiment class. Lalji and Deshmukh (2016) proposed a hybrid lexicon/machine-learning model for sentiment analysis, using TreeTagger and POS tagging for feature extraction and NB, RF, SVM, and LDA classifiers to label the tweets.

Few studies have investigated deep learning for Arabic NLP, particularly SA. Dahou et al. (2016) proposed a CNN and neural word embedding architecture for Arabic sentiment analysis that outperformed several existing approaches. Al-Smadi et al. (2018) introduced an aspect-based SA approach with two components: aspect opinion target expression (OTE) extraction using a character-level BLSTM with a CRF classifier, and aspect sentiment polarity classification using an aspect-based LSTM; both achieved significant improvements over the baselines. Alayba et al. (2018) combined a CNN with the SemEval-2016 Arabic Twitter and Arabic Health Twitter lexicons for Arabic SA, using Word2Vec (CBOW) as the embedding model, and obtained promising results. Hassan and Mahmood (2018) described a joint CNN and RNN framework stacked over an unsupervised word embedding model, in which the prior information was combined with the feature sets extracted by the convolutional layer; this approach outperformed several existing approaches in accuracy. Table 1 summarizes the supervised and unsupervised related works presented in this study.

Table 1 Summary of supervised and unsupervised related works

Unsupervised approaches are commonly used in SA, but keyword vagueness and ambiguity can decrease prediction accuracy, and these approaches cannot account for the semantic relationships between words in a sentence. For Arabic sentiment analysis, unsupervised approaches are ineffective because lexicons would have to cover the numerous words of several dialects. It has also been observed that using CNN alone or LSTM alone is inadequate for Arabic sentiment analysis (Huang et al. 2017b): CNN fails to maintain long-term dependencies, and LSTM is weak at capturing local features. Unlike other deep learning approaches, this work proposes a new architecture based on deep learning for both feature representation and feature classification. The proposed architecture uses the recent FastText model, which can generate vectors for out-of-vocabulary (OOV) and rare words. A convolutional neural network extracts n-gram, local-region features, and its performance is improved by two stacked LSTM layers that address the difficulty of training a CNN to capture long-term dependencies. Finally, the feature maps learned by the CNN and LSTM are passed to an SVM classifier to produce the final sentiment labels.

3 Proposed approach

In this study, we propose a novel deep learning model for Arabic SA, named Deep CNN–LSTM Arabic-SA, which joins the FastText word representation model to a one-layer CNN architecture inspired by Kim's work (Kim 2014). Owing to the locality of the convolutional and pooling layers, a CNN cannot capture long-distance dependencies in the input sentences, whereas a single recurrent layer can effectively overcome this limitation (Hassan and Mahmood 2017); we therefore use two LSTM layers to minimize the loss of local information. Finally, an SVM classifier assigns each sentence a sentiment label (positive or negative). Figure 2 illustrates the overall process of Deep CNN–LSTM Arabic-SA, Fig. 3 visualizes the fundamental architecture and information flow, and Fig. 4 presents the CNN and LSTM architecture used in Deep CNN–LSTM Arabic-SA.

Fig. 2 Overview of the proposed Deep CNN–LSTM Arabic-SA model

Fig. 3 Flow diagram of the proposed Deep CNN–LSTM Arabic-SA

Fig. 4 Proposed Deep CNN–LSTM Arabic-SA architecture

3.1 Word embedding

This architecture takes advantage of the recent FastText model proposed by Mikolov et al. (2017) for word embedding; FastText has been trained on a wide range of languages, including English and Arabic. As in Word2Vec, FastText provides two models: a Skip-gram model that predicts a target word from its close neighboring words, and a CBOW model that uses the surrounding context words to predict the target word. Both generate a text file containing numerical representations (vectors) of the learned words. In this study, the FastText Skip-gram model is used; each word is represented as a bag of character n-grams, and sentences are represented as the sum of their word vectors. FastText is run in its default configuration: a 100-dimensional vector space and a sub-word size of 3-6 characters, which suits Arabic text because Arabic words typically have a three-letter root (Altowayan 2017).
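For concreteness, the sketch below trains a comparable Skip-gram FastText model with gensim under the same settings (100-dimensional vectors, character n-grams of 3-6); the authors used the released FastText vectors, so gensim and the toy corpus here are our substitution for illustration.

```python
# FastText skip-gram sketch with the configuration described above (gensim
# stands in for the official FastText release; the corpus is a toy example).
from gensim.models import FastText

sentences = [["الخدمة", "ممتازة"], ["الغرفة", "سيئة", "جدا"]]

model = FastText(
    sentences,
    vector_size=100,   # embedding dimension
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    min_n=3, max_n=6,  # character n-gram range, suited to triliteral Arabic roots
    min_count=1,
)

# Out-of-vocabulary words still receive vectors built from character n-grams.
vec = model.wv["ممتاز"]   # an unseen inflection is still representable
print(vec.shape)           # (100,)
```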

3.2 Convolutional neural network

Let \(X_i\in R^k\) denote the k-dimensional word vector of the ith word in a sentence of length n, which is represented as the concatenation of its word vectors, see Eq. (1); zero padding is applied to sentences shorter than n.

$$\begin{aligned} X_{1:n}=X_1 \oplus X_2 \oplus \cdots \oplus X_n \end{aligned}$$
(1)

\(\oplus\) is the concatenation operator. Let \(X_{i:i+j}\) denote the concatenation of the words \(X_i,X_{i+1},\ldots ,X_{i+j}\), i.e., the local feature matrix from the ith to the \((i+j)\)th row of the sentence matrix. A convolution filter \(W\in R^{hk}\) is applied to a window of h words in the \(n\times k\) sentence representation matrix to generate new features: a feature \(C_i\) (the ith feature value) is generated from a window of words \(X_{i:i+h-1}\) using Eq. (2).

$$\begin{aligned} C_i=f(W . X_{i:i+h-1}+b). \end{aligned}$$
(2)

b refers to the bias term, with \(b \in R\); f is a nonlinear activation function such as the sigmoid or hyperbolic tangent. Both b and W are learned during training. The filter is convolved over every window of words in the input sentence \(\{X_{1:h},X_{2:h+1},\ldots ,X_{n-h+1:n}\}\) to produce a feature map using Eq. (3).

$$\begin{aligned} C=[C_1,C_2,\ldots ,C_{n-h+1}] \end{aligned}$$
(3)

with \(C\in R^{n-h+1}\).

This describes the production of one feature map from one filter; a convolutional layer with m filters produces \(m(n-h+1)\) features. Max-over-time pooling is not applied over the feature maps, because down-sampling the features would disrupt the sequence ordering before the LSTM layers. Instead, the feature maps are fed directly into the LSTM layers to encode the temporal patterns.
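The following numpy sketch traces Eqs. (1)-(3) for a single filter: a width-h filter slides over the \(n\times k\) sentence matrix and emits one feature per window, yielding a feature map of length \(n-h+1\); the dimensions and random values are placeholders.

```python
# One-filter convolution over a sentence matrix, following Eqs. (1)-(3);
# sizes and values are illustrative placeholders.
import numpy as np

n, k, h = 7, 100, 3                # sentence length, vector dim, filter width
rng = np.random.default_rng(0)
X = rng.standard_normal((n, k))    # sentence matrix X_{1:n}
W = rng.standard_normal(h * k)     # filter W in R^{hk}
b = 0.0                            # bias term

def f(z):
    return np.maximum(0.0, z)      # ReLU, the activation adopted in Sect. 4.2

# C_i = f(W . X_{i:i+h-1} + b) for every window of h words, Eqs. (2)-(3)
C = np.array([f(W @ X[i:i + h].ravel() + b) for i in range(n - h + 1)])
print(C.shape)                     # (n - h + 1,): one feature map
```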

3.3 Capturing long-term dependencies

LSTM efficiently controls the flow of information, preventing vanishing gradients and capturing long-term correlations in sequences of arbitrary length (Yuan et al. 2018). As shown in Fig. 5, the LSTM architecture adds a memory cell that selectively maintains information over long periods without degradation, alongside the input, output, and forget gates.

Fig. 5 Long short-term memory architecture with memory cell

To process the input vectors, LSTM recursively executes the current cell block using the previous hidden state \(h_{t-1}\) and the current input \(x_t\), where t and \(t-1\) denote the current and previous time steps, respectively. Here \(i_t\), \(f_t\), and \(o_t\) are the input, forget, and output gates, and \({\tilde{C}}_t\) is the candidate memory cell state at time t. The LSTM operates as follows: Eqs. (4) and (5) compute the values of \(i_t\) and \({\tilde{C}}_t\) for the memory cell state at time t.

$$\begin{aligned} i_t & = \sigma ( W_i x_t+U_i h_{t-1}+b_i ) \end{aligned}$$
(4)
$$\begin{aligned} {\tilde{C}}_t & = tanh ( W_c x_t+U_c h_{t-1}+b_c) \end{aligned}$$
(5)

Equation (6) calculates the activation value \(f_{t}\) of the forget gate at time (t):

$$\begin{aligned} f_t=\sigma ( W_f x_t+U_f h_{t-1}+b_f) \end{aligned}$$
(6)

Equation (7) calculates the new state \(C_t\) of the memory cell at a time (t):

$$\begin{aligned} C_t= i_t*{\tilde{C}}_t+f_t* C_{t-1} \end{aligned}$$
(7)

Memory cells output gates values are computed for the new state using \(C_t\) as in Eqs. (8) and (9).

$$\begin{aligned} o_t= \sigma (W_o x_t+ U_o h_{t-1}+V_o C_t+ b_o) \end{aligned}$$
(8)
$$\begin{aligned} h_t= o_t*tanh (C_t) \end{aligned}$$
(9)

where \(x_t\) is the input to the memory cell at time t. \(W_i\), \(W_C\), \(W_f\), \(W_o\), \(U_i\), \(U_C\), \(U_f\), \(U_O\), and \(V_o\) are weight matrices, and \(b_i\), \(b_f\), \(b_c\), and \(b_o\) are bias vectors. \(\sigma\) is the logistic sigmoid function, and * denotes element-wise multiplication. The weight matrices and bias vectors are learned during training, and the values of \(f_t\), \(i_t\), and \(o_t\) lie in [0, 1]. In this architecture, the output of the first LSTM layer is passed to the second LSTM layer, which produces a deeper representation of the original sentence. The final outputs of the LSTM layers are merged into one matrix, which is passed to a fully connected layer.
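A single LSTM time step following Eqs. (4)-(9) can be written directly in numpy, as sketched below; the weights are random placeholders standing in for learned parameters, and the peephole term \(V_o C_t\) of Eq. (8) is omitted for brevity.

```python
# One LSTM time step implementing Eqs. (4)-(9); weights are random
# placeholders that would normally be learned during training.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, m = 100, 128                            # input and hidden dimensions
rng = np.random.default_rng(0)
W_i, W_c, W_f, W_o = (0.01 * rng.standard_normal((m, d)) for _ in range(4))
U_i, U_c, U_f, U_o = (0.01 * rng.standard_normal((m, m)) for _ in range(4))
b_i = b_c = b_f = b_o = np.zeros(m)

def lstm_step(x_t, h_prev, C_prev):
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)      # Eq. (4)
    C_tilde = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)  # Eq. (5)
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)      # Eq. (6)
    C_t = i_t * C_tilde + f_t * C_prev                 # Eq. (7)
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)      # Eq. (8), peephole omitted
    h_t = o_t * np.tanh(C_t)                           # Eq. (9)
    return h_t, C_t

h_t, C_t = lstm_step(rng.standard_normal(d), np.zeros(m), np.zeros(m))
```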

4 Experiments

This section presents the experimental settings and configurations of Deep CNN–LSTM Arabic-SA on different datasets. The experiments were conducted with the TensorFlow framework running on Python.

4.1 Datasets

As discussed in Sect. 1, data acquisition and annotation are the most difficult tasks in Arabic sentiment analysis; we therefore relied on previously published works to construct a multi-domain sentiment corpus containing positive and negative reviews on five topics. We sampled subsets from the corpus collected by ElSahar and El-Beltagy (2011), which was scraped from different websites, and from the corpus collected by Aly and Atiya (2013), the largest sentiment corpus for Arabic text with 63,000 book reviews. As presented in Table 2, the constructed training set contains 15,100 reviews, equally split between positive and negative. The Arabic NLTK was used to automatically correct misspelled words and to remove stop words and duplicated letters (e.g., Beauuutiffulll → Beautiful). Letters were then normalized (e.g., alif variants to bare alif, and ta marbuta to ha), and non-Arabic content was filtered out. For testing and validation, we used a dataset of 4,000 reviews: 2,000 positive and 2,000 negative.
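A sketch of these cleaning steps is shown below; the exact normalization targets (bare alif, ha) are standard Arabic preprocessing choices that we assume here, since the original pipeline's NLTK calls are not specified.

```python
# Assumed Arabic normalization steps, sketched with regular expressions;
# the original pipeline used Arabic NLTK and may differ in detail.
import re

def normalize_arabic(text: str) -> str:
    text = re.sub(r"[أإآ]", "ا", text)               # unify alif variants (assumed)
    text = re.sub(r"ة", "ه", text)                    # ta marbuta -> ha (assumed)
    text = re.sub(r"(.)\1{2,}", r"\1", text)          # collapse duplicated letters
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)   # drop non-Arabic characters
    return " ".join(text.split())
```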

Table 2 Constructed training set statistics

Table 3 shows a sample of positive reviews from the hotels domain with their English translations, and Table 4 shows a sample of negative reviews.

Table 3 Sample of positive reviews
Table 4 Sample of negative reviews

4.2 Model hyper-parameters

Different hyper-parameters and settings were tested empirically. For the CNN, the convolutional layer used multiple filters of widths (3, 4, 5) with 256 feature maps and ReLU activation; dropout of 0.5 was applied before the recurrent layers to reduce overfitting, and zero padding was applied when needed. For the LSTM, the hidden state dimensionality was set to 128 with a sigmoid activation function. The number of epochs was set to 5-10 for the entire architecture.
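Under these settings, a minimal Keras sketch of the feature-extraction stack might look as follows; the exact layer wiring is our reading of Figs. 3-4, and "same" padding is one way to keep the three filter widths alignable without pooling.

```python
# A sketch of the CNN-LSTM stack with the hyper-parameters above; the wiring
# of the published model may differ (this follows our reading of Fig. 4).
import tensorflow as tf
from tensorflow.keras import layers

max_len, embed_dim = 100, 100   # padded sentence length, FastText dimension

inputs = layers.Input(shape=(max_len, embed_dim))        # pre-embedded sentences
convs = [layers.Conv1D(256, w, padding="same", activation="relu")(inputs)
         for w in (3, 4, 5)]                             # filter widths 3, 4, 5
x = layers.Concatenate()(convs)                          # no pooling: order kept
x = layers.Dropout(0.5)(x)                               # dropout before LSTMs
x = layers.LSTM(128, return_sequences=True)(x)           # first LSTM layer
x = layers.LSTM(128)(x)                                  # second LSTM layer
features = layers.Dense(128, activation="sigmoid")(x)    # fully connected layer
model = tf.keras.Model(inputs, features)
```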

5 Results and discussion

Deep CNN–LSTM Arabic-SA was trained on the multi-domain sentiment corpus of Table 2, and its classification performance was then evaluated on the testing set. The confusion matrix measures the correctness of classification; the confusion matrix obtained in this experiment is presented in Fig. 6, where 89.10% of the positive reviews are correctly classified as positive (only 10.90% misclassified as negative) and 92.40% of the negative reviews are correctly classified as negative (only 7.60% misclassified as positive).

Fig. 6 Confusion matrix

Following convention, we report the classification performance of Deep CNN–LSTM Arabic-SA using precision, recall, F1-score, and accuracy. As presented in Table 5, Deep CNN–LSTM Arabic-SA achieved competitive performance with 89.10% precision, 92.14% recall, and a 90.44% F1-score, along with 90.75% accuracy, a significant improvement over CNN-only models for Arabic sentiment classification.
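For reference, these metrics can be recomputed from the confusion-matrix rates in Fig. 6 and the balanced 2,000/2,000 test split; small differences from the reported values come from rounding, and which value plays the role of precision versus recall depends on the class taken as positive.

```python
# Recomputing the metrics from the Fig. 6 rates on the 2,000/2,000 test split.
tp, fn = 0.8910 * 2000, 0.1090 * 2000   # positive reviews: correct / missed
tn, fp = 0.9240 * 2000, 0.0760 * 2000   # negative reviews: correct / missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(precision, recall, f1, accuracy)  # ~0.921, 0.891, 0.906, 0.9075
```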

5.1 Best performing classifier

The final classifier ultimately determines how well the word embeddings and extracted features translate into classification quality. We therefore evaluated Deep CNN–LSTM Arabic-SA with Naive Bayes (NB) and K-Nearest Neighbor (KNN, \(K=10\)) classifiers, as well as a Softmax classification function after the fully connected layer, against SVM, using the same training parameters and dataset splits. As presented in Table 5 and Fig. 7, SVM outperformed NB, Softmax, and KNN. Based on these results, the SVM classifier is more reliable for Arabic text classification, consistent with (Nabil et al. 2015; Aly and Atiya 2013).
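The comparison can be reproduced along the lines of the sketch below, where the fully connected layer's output serves as the feature vector for each conventional classifier; the Keras feature extractor `model` is carried over from the Sect. 4.2 sketch, and the toy tensors are illustrative, not the published pipeline.

```python
# Feeding the CNN-LSTM features to conventional classifiers (illustrative data;
# `model` is the Keras feature extractor sketched in Sect. 4.2).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.standard_normal((64, 100, 100)).astype("float32")  # toy tensors
y_train = rng.integers(0, 2, 64)
X_test = rng.standard_normal((16, 100, 100)).astype("float32")
y_test = rng.integers(0, 2, 16)

F_train = model.predict(X_train)   # deep features from the fully connected layer
F_test = model.predict(X_test)

for clf in (SVC(), GaussianNB(), KNeighborsClassifier(n_neighbors=10)):
    clf.fit(F_train, y_train)
    print(type(clf).__name__, clf.score(F_test, y_test))
```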

Table 5 Classification performances with different classifiers
Fig. 7 Classification performances using different classifiers

5.2 Optimal number of LSTM layers

Although LSTMs can themselves be regarded as deep feed-forward architectures when unrolled over time, we validated the effect of the number of LSTM layers on classification performance by comparing Deep CNN–LSTM Arabic-SA with one LSTM layer against two, each layer having 128 units. The confusion matrix for this experiment is presented in Fig. 8. As shown in Table 6, two stacked LSTM layers improve performance by +2.77% in accuracy, and by +3% and +2.69% in precision and recall respectively, over a single LSTM layer. Two LSTM layers are therefore appropriate for producing higher-order feature representations of Arabic sentences that are more easily separable into classes. This result is consistent with Pal et al. (2018), who found that stacking LSTM layers one upon another can increase classification accuracy.

Fig. 8 Confusion matrix of Deep CNN–LSTM Arabic-SA using a single LSTM layer

Table 6 Effects of the number of LSTM layers

5.3 Best performing embedding model

The classification performance of Deep CNN–LSTM Arabic-SA was also examined with two other pre-trained word representation models. Word2Vec, introduced by Mikolov et al. (2013), uses a two-layer neural network to construct distributed representations of words; the pre-trained model contains 3 million words in a 300-dimensional vector space. AraVec, introduced by Soliman et al. (2017), is a pre-trained distributed word representation model for Arabic, providing CBOW and Skip-gram architectures with a 300-dimensional vector space. To gain more insight, Table 7 and Fig. 9 compare the test accuracy of the Word2Vec-CNN–LSTM and AraVec-CNN–LSTM models against the FastText-CNN–LSTM models. The FastText (Skip-gram and CBOW) based methods achieved superior performance with accuracies of 90.75% and 88.90% respectively, outperforming Word2Vec and AraVec by up to +3.3% and +8.8%. The FastText Skip-gram model achieved the best classification accuracy, consistent with Bojanowski et al. (2017): FastText Skip-gram produces high-quality vector representations from the semantic and syntactic information in the text, and it covers out-of-vocabulary words.

Table 7 Classification accuracy using different embedding models
Fig. 9 Classification performances using different embedding models

5.4 Comparison with the state-of-the-art

To validate the performance of Deep CNN–LSTM Arabic-SA against the state of the art, we performed experiments on several datasets: the Large Scale Arabic Book Reviews (LABR) dataset constructed by Aly and Atiya (2013), which contains 63,000 book reviews collected from Goodreads; the Arabic Sentiment Tweets Dataset (ASTD) collected by Nabil et al. (2015), which contains 10,000 Arabic tweets; and the Arabic sentiment analysis Twitter dataset collected by Abdulla et al. (2013), which contains 2,000 positive and negative tweets. The performance of Deep CNN–LSTM Arabic-SA is compared with: Dahou et al. (2016), who used a one-layer CNN architecture over a Word2Vec model; Altowayan (2017), who experimented with FastText with SVC and Logistic Regression classifiers on the LABR and ASTD datasets; Altowayan and Tao (2016), who incorporated POS tags and word stemming features with Logistic Regression on both LABR and ASTD; ElSahar and El-Beltagy (2011), who used three feature representation techniques (Delta-TF-IDF, TF-IDF, and Count) with a linear SVM for feature selection and classification; Abdulla et al. (2013), who proposed both a lexicon-based and a supervised (SVM) approach; and Nabil et al. (2015), who used token counts and TF-IDF with an SVM classifier. Table 8 and Fig. 10 show the classification accuracy of Deep CNN–LSTM Arabic-SA on each dataset against the other approaches, listing their best classification accuracy.

Table 8 Accuracy comparison with the existing methods
Fig. 10 Comparison of accuracy with other methods

On the LABR dataset, Deep CNN–LSTM Arabic-SA achieved 90.20% classification accuracy, outperforming the baseline results by up to +11.6%; it reached its highest accuracy on this dataset thanks to the sufficient dataset size and balanced label distribution. On the ASTD dataset, Deep CNN–LSTM Arabic-SA achieved a significant accuracy increase of +10.65% over the CNN-only model proposed by Dahou et al. (2016), and up to +20.71% over the three other approaches. On the Twitter dataset (Ar-Twitter), Deep CNN–LSTM Arabic-SA was the best performing model with an accuracy of 88.52%, improving on Dahou et al. (2016) and Abdulla et al. (2013) by +3.51% and +1.32% respectively.

Table 9 presents further details of the performance of Deep CNN–LSTM Arabic-SA on each dataset, alongside the other approaches, in terms of precision and recall. Deep CNN–LSTM Arabic-SA achieved the best performance on all datasets, with 89.79% precision and 85.92% recall; this evaluation demonstrates the reliability of the proposed deep learning model for Arabic text sentiment analysis. According to the obtained results, a one-layer CNN architecture supported by two LSTM layers improves Arabic feature representation and classification, as confirmed by Hassan and Mahmood (2017). Moreover, these results confirm that generating word vectors with FastText works better than Word2Vec and AraVec at the word level, as it better learns the hidden features of the language and handles out-of-vocabulary words.

Table 9 Performance comparison with the existing methods

Thanks to the large dataset size and balanced distribution of the data, Deep CNN–LSTM Arabic-SA reached its highest performance on the LABR dataset, with up to +15.60% and +3.87% improvement over the baseline performance. On the ASTD dataset, Deep CNN–LSTM Arabic-SA tops the list of recent works with +3.36% and +3.40% performance improvement. On the Ar-Twitter dataset, Deep CNN–LSTM Arabic-SA raised the baseline performance by +4.87% and +11.01%.

6 Conclusion

Recently, social media have witnessed exponential growth in user-generated content, which contains enormously valuable information for different applications. Sentiment analysis analyzes social data to identify the inclinations of the public audience. For Arabic, sentiment analysis is challenging without deep consideration of semantic and syntactic rules and of the term dependencies within the input sentence. This paper therefore proposed a deep learning model for Arabic sentiment analysis that joins a one-layer CNN architecture with two LSTM layers, supported by the FastText word embedding model as the input layer. Experiments on a multi-domain corpus showed the remarkable performance of this model: 89.10% precision, 92.14% recall, 90.44% F1-score, and 90.75% accuracy. This study extensively validated the effect of word embedding techniques on Arabic sentiment classification and found the FastText model to be the most relevant choice for learning semantic and syntactic information. Furthermore, the performance of the proposed model was evaluated with NB and KNN classifiers; the results showed that SVM is the best performing classifier, with up to +3.92% accuracy improvement. Owing to the efficiency of the CNN in feature extraction and the recurrent nature of the LSTM, the proposed model achieved encouraging results and outperformed state-of-the-art methods on several benchmarks by up to +11.6% in accuracy.

For future research, it is worth investigating deep learning architectures for user interest discovery and recommendation, and improving the quality of the word embeddings by integrating the WordNet lexical database into the input layer.