1 Introduction

Humans experience different emotions on a daily basis. For example, we show our emotions when there is something to be happy about, something to be afraid of, or someone whom we are angry at [51, 71]. Emotions have been studied for a very long time. Darwin [34] studied emotion expression on the face and through body gestures. His study focused not only on humans but also on animals, and it showed that both express emotions in similar ways.

Plutchik [120] estimated that more than 90 definitions of emotion exist. Kleinginna and Kleinginna [84] classified these definitions into eleven categories: affective, cognitive, external emotional stimuli, physiological, emotional/expressive behavior, disruptive, adaptive, multi-aspect, restrictive, motivational, and skeptical. An emotion can simply be defined as a specific feeling that describes a person’s state of mind, such as joy, love, anger, disgust, and fear [2]. In other words, emotions are intense feelings directed at something or someone [51, 71]. There are two terms that are closely related to emotion and often mistaken for emotion: affect and mood. Affect is a broad range of feelings that people experience. It includes both emotions and moods [71]. Moods are feelings, but they are less intense than emotions and often lack a contextual stimulus [71, 163].

To explain emotion and emotion expressions, researchers have proposed different emotion models. The circumplex model of emotion was developed by Russell [131]. This model represents emotions in a two-dimensional circular space, with valence as the horizontal axis and arousal as the vertical axis [121]. Plutchik [120] proposed that the primary emotions can be conceptualized in a manner analogous to a color wheel: similar emotions are placed close together, while opposite emotions are placed 180 degrees apart. He then extended the circumplex model into a third dimension to represent the intensity of emotions; the resulting structure is shaped like a cone. Inspired by the work of Plutchik [120], Cambria et al. [24] developed the Hourglass of Emotions. This model represents affective states through labels and through four affective dimensions: pleasantness, attention, sensitivity, and aptitude.

Currently, people increasingly rely on computers to perform their daily tasks, which has increased the need to improve human–computer interactions. The lack of commonsense knowledge makes emotions difficult for a computer to recognize and generate. Therefore, substantial research on emotion recognition has been conducted. Emotion recognition is divided into three main categories: emotion recognition from facial expressions, emotion recognition from speech, and emotion recognition in text. Emotion recognition in text refers to the task of automatically assigning to a text an emotion selected from a set of predefined emotion labels. Emotion recognition in text is important because text is the main medium of human–computer interactions in the form of emails, text messages, chat rooms, forums, product reviews, Web blogs, and other social media platforms, including Twitter, YouTube, and Facebook. Applications of emotion recognition in text can be found in business, psychology, education, and many other fields where there is a need to understand and interpret emotions [9].

Emotion recognition in text, particularly implicit emotion recognition, is one of the difficult tasks in natural language processing (NLP), and it requires natural language understanding (NLU). There are different levels of text emotion recognition: document level, paragraph level, sentence level, and word level. The difficulty starts at the sentence level, where an emotion is expressed through the meanings of words and their relations; as the level increases, the complexity of the problem increases. Moreover, not all thoughts are expressed clearly: text contains metaphors, sarcasm, and irony.

Fig. 1 The structure of the paper

Different approaches have been used to recognize emotions in text. Keyword-based approaches for explicit emotion recognition have been investigated [90, 116]. For example, the sentence “Sunny days always make me feel happy” explicitly expresses happiness and includes the emotion keyword “happy.” A keyword-based approach would be able to recognize the emotion successfully. However, the presence of an emotion keyword does not always match the expressed emotion. For example, the sentences “Do I look happy to you!” and “I am not happy at all” include the emotion keyword “happy” but do not express that emotion. Additionally, a sentence can express emotion without the presence of an emotion keyword. Other approaches, namely rule-based approaches [86, 155], classical learning-based approaches [4, 6, 105], deep learning approaches [18, 48, 93], and hybrid approaches [8, 57, 136], were specifically introduced for recognizing implicit emotions in text.

Several survey papers regarding emotion recognition in text have been published. In general, these surveys provide only a shallow investigation; none of them reported results or evaluated the reviewed papers. For instance, Kao et al. [79] did not review any papers; they only discussed the limitations of the keyword-based, learning-based, and hybrid approaches and suggested a solution to overcome these limitations. Canales and Martínez-Barco [25] classified the published work on emotion recognition based on the emotion model and approach used. For emotion models, they included the categorical approach and the dimensional approach but did not mention the componential approach. The survey papers of Jain and Sandhu [75] and Deborah et al. [36] focused only on learning-based approaches; neither evaluated the reviewed papers or reported their features and results. In this survey paper, we review the following explicit and implicit emotion recognition approaches: keyword-based approaches, rule-based approaches, classical learning-based approaches, deep learning approaches, and hybrid approaches. Nevertheless, the main focus is on implicit emotion recognition approaches. We include the strengths and limitations of the reviewed papers, compare them in tables, and discuss some open problems. We review more papers than the previous surveys and cover emotion recognition in different languages, including Arabic, Chinese, English, bilingual (English and Hindi), Indonesian, and Spanish. We also present emotion modeling approaches and resources (corpora and affect lexicons) available for emotion recognition in text.

The remainder of this paper is organized as follows. Section 2 presents the emotion modeling. Section 3 presents different resources for emotion recognition in text. Section 4 investigates prior work related to emotion recognition in English text and other languages. Section 5 discusses the advantages and limitations of the state-of-the-art approaches. Finally, the main conclusions are presented in Sect. 6. Figure 1 illustrates the paper structure.

Table 1 Summary of the three dominant emotion modeling approaches [70]

2 Emotion modeling

Psychology research has distinguished three major approaches for emotion modeling [60, 62]. Table 1 summarizes the three dominant emotion modeling approaches:

  • Categorical approach This approach is based on the idea that there exist a small number of emotions that are basic and universally recognized [62]. The most commonly used model in emotion recognition research is that of Paul Ekman [44], which involves six basic emotions: happiness, sadness, anger, fear, surprise, and disgust.

  • Dimensional approach This approach is based on the idea that emotional states are not independent but are related to each other in a systematic manner [62]. This approach covers emotion variability in three dimensions [20, 78]:

    • Valence: This dimension refers to how positive or negative an emotion is [62].

    • Arousal: This dimension refers to how excited or apathetic an emotion is [62].

    • Power: This dimension refers to the degree of power or control a person feels over the emotional state [62].

  • Appraisal-based approach This approach can be considered an extension of the dimensional approach. It uses componential models of emotion based on appraisal theory [132], which states that emotions arise from a person’s evaluation of events and that the resulting emotion depends on the person’s experience, goals, and opportunities for action. Here, emotions are viewed through changes in all significant components, including cognition, physiology, motivation, motor reactions, feelings, and expressions [62].

In the categorical approach, the emotional states are limited to a fixed number of discrete categories, and it may be difficult to address a complex emotional state or mixed emotions [172]. These types of emotions can be addressed well in the dimensional approach, although the reduction of the emotion space to three dimensions is extreme and may result in information loss. Furthermore, not all basic emotions fit well in the dimensional space; some become indistinguishable, and some may lie outside the space. The advantage of componential models is that they focus on the variability of different emotional states arising from different types of appraisal patterns [62].
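To make the dimensional approach concrete, the short sketch below places a few emotions in the valence–arousal plane of the circumplex model and measures their proximity. The coordinates are illustrative assumptions, not values from any published mapping; the point is only that similar emotions end up close together while near-opposites end up far apart.

    import math

    # Hypothetical valence-arousal coordinates in [-1, 1]; illustrative only.
    EMOTIONS = {
        "joy":     (0.8, 0.5),
        "anger":   (-0.6, 0.7),
        "sadness": (-0.7, -0.4),
        "calm":    (0.6, -0.6),
    }

    def distance(a, b):
        """Euclidean distance between two emotions in valence-arousal space."""
        (v1, r1), (v2, r2) = EMOTIONS[a], EMOTIONS[b]
        return math.hypot(v1 - v2, r1 - r2)

    print(round(distance("joy", "calm"), 2))     # nearby: both positive valence
    print(round(distance("joy", "sadness"), 2))  # far apart: near opposites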

3 Resources

This section presents the resources (corpora and lexicons) available for emotion recognition in text.

3.1 Corpora

  • Alm [4]: This corpus consists of approximately 185 children’s stories, including those of Grimm, H.C. Andersen, and B. Potter. The annotation is performed at the sentence level with one of the following labels: neutral, anger-disgust, sadness, fear, happiness, positive surprise, and negative surprise. For the annotation, annotators work in pairs on the same stories. To avoid any bias, each annotator is trained separately and works independently. If a disagreement occurs between the annotators, the paper’s first author chooses one of the selected emotion labels.

  • Aman [6]: This corpus consists of blog posts retrieved using seed words that represent Ekman’s six basic emotions. (For example, the words happy, pleased, and enjoy are selected as seed words for the happiness emotion category.) The annotation is performed at the sentence level with eight categories: Ekman’s six emotions plus two additional categories, mixed emotion and no emotion. Four annotators manually annotate the corpus.

  • ISEAR [133]: This corpus is the result of the International Survey on Emotion Antecedents and Reactions (ISEAR) project. A large group of psychologists from around the world worked on this project, and approximately 3000 students participated by reporting situations in which they experienced the following emotions: joy, fear, anger, sadness, disgust, shame, and guilt.

  • SemEval-2007 [148]: This corpus consists of news headlines taken from newspapers, such as BBC News, CNN, and the New York Times, in addition to the Google News search engine. The short structure of headlines allows for sentence-level annotation. Each headline is annotated with one or more of the following emotions: anger, disgust, fear, joy, sadness, and surprise. Two datasets are provided: a trial set (250 annotated headlines) and a test set (1000 annotated headlines).

  • SemEval-2018 [104]: This corpus consists of tweets. Each tweet is either neutral or expresses one or more of eleven given emotions: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust. Separate training, trial, and test datasets for English, Arabic, and Spanish tweets are provided.

  • SemEval-2019 [29]: This corpus consists of textual dialogues between two individuals, each consisting of three turns: the first individual starts the conversation, the second responds, and the first individual replies again. Each conversation is labeled as joy, anger, sadness, or others, based on the third turn of the conversation. Separate training, trial, and test datasets are provided.

  • Neviarouskaya: The annotation is performed at the sentence level with ten labels: the nine emotions defined by Izard [73] (anger, disgust, fear, guilt, interest, joy, sadness, shame, and surprise) plus neutral. The annotation process is performed by three annotators [28].

    • Dataset 1 [108] consists of 1000 annotated sentences collected from stories in 13 different categories grouped by topic.

    • Dataset 2 [107] consists of 700 annotated sentences collected from diary-like blog posts.

Table 2 presents the available datasets for emotion recognition in text, the recognized emotions in each dataset, and the number of instances in each emotion class. Note that the SemEval-2007 and SemEval-2018 datasets are multi-label multiclass datasets; thus, the total number of instances may appear smaller than the sum of the numbers of instances across the emotion classes. Each dataset is built from a different source. The sources vary in the style of writing (formal, informal), the quality of the text (with/without spelling errors and grammatical mistakes), and the use of special symbols, such as emojis, emoticons, and hashtags. Additionally, Alm, Aman, and ISEAR each provide a single dataset, while SemEval-2007 provides two (trial and test) datasets, and both SemEval-2018 and SemEval-2019 provide three (train, trial, and test) datasets. All of the datasets include anger, sadness, and joy. ISEAR is the only dataset that lacks surprise and the only one that includes shame and guilt. SemEval-2018 is the only dataset that includes anticipation, optimism, pessimism, trust, and love. Of the group, Alm and SemEval-2007 have the fewest instances, and ISEAR is the only balanced dataset. The Alm dataset merges anger and disgust into one class; although they share similar characteristics, they are different emotions. Even if the reason behind this choice is the low number of instances in each class, the emotions should have been represented separately to measure the ability of a model to recognize each one accurately. SemEval-2019 is the only dataset of textual dialogues. The first step in any NLP task is preprocessing the data, and the success of any emotion recognition model depends on this step. In the SemEval-2018 competition, the highest-ranked teams used Twitter-specific preprocessing to accommodate the special characteristics of tweets: standard tokenization, segmentation, and part-of-speech (POS) tagging tools are not designed to handle emojis, emoticons, hashtags, and an informal style of writing with many grammatical and spelling mistakes.
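As an illustration of such Twitter-specific preprocessing, the sketch below normalizes a tweet before standard tokenization. It is a minimal example assuming simple regular-expression rules; the actual pipelines of the competition teams (e.g., the ekphrasis tool discussed in Sect. 4.4) are considerably richer.

    import re

    def preprocess_tweet(text):
        """Normalize Twitter-specific tokens before standard tokenization."""
        text = text.lower()
        text = re.sub(r"https?://\S+", " <url> ", text)        # URLs
        text = re.sub(r"@\w+", " <user> ", text)               # mentions
        text = re.sub(r"#(\w+)", r" <hashtag> \1 ", text)      # unpack hashtags
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)             # soooo -> soo
        text = re.sub(r"[:;=]-?[)(dp]", " <emoticon> ", text)  # crude emoticons
        return text.split()

    print(preprocess_tweet("Sooooo happy today!! :) #blessed @friend"))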

Table 2 Corpora for emotion recognition in text

3.2 Lexicons

  • WordNet [50]: This is an online English lexical database. It groups verbs, nouns, adjectives, and adverbs into sets of synonyms called synsets.

  • WordNet-Affect [149]: This lexicon is an affective extension of WordNet. A subset of WordNet synsets, namely those containing words that directly or indirectly express emotions, is annotated with affective labels.

  • SentiWordNet [10, 47]: This lexicon assigns three sentiment scores, namely positivity, negativity, and objectivity, to each synset of WordNet.

  • SenticNet [19]: This is a lexicon of concepts with their respective emotions.

  • Multi-perspective Question Answering (MPQA) Subjectivity Lexicon [164]: This lexicon consists of over 8000 single-word subjectivity clues; each clue is classified as positive or negative.

  • Bing Liu Lexicon [68]: This lexicon consists of a list of positive and negative opinion words or sentiment words.

  • AFINN [109]: This lexicon consists of words manually rated for valence with an integer between minus five (negative) and plus five (positive).

  • NRC Word-Emotion Association Lexicon [102, 103]: This lexicon is manually annotated using Amazon’s Mechanical Turk. Eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) and two sentiments (positive and negative) are included.

  • NRC Affect Intensity Lexicon [100]: This lexicon provides real-valued intensity scores for the emotions anger, fear, sadness, and joy.

  • NRC Valence, Arousal, and Dominance (VAD) Lexicon [99]: This lexicon includes a list of more than 20,000 words and their valence, arousal, and dominance scores. The scores range from 0 to 1.

  • NRC Hashtag Emotion Lexicon [98, 101]: This lexicon is automatically generated from tweets that include emotion-word hashtags, such as #happy. It associates words with the emotions anger, disgust, fear, sadness, anticipation, surprise, joy, and trust.

  • NRC Hashtag Sentiment Lexicon [83]: This lexicon is automatically generated from tweets that include sentiment-word hashtags, such as #amazing. It associates words with a positive or negative sentiment.

  • Sentiment140 Lexicon [83]: This lexicon is automatically generated from tweets with emoticons.

AffectiveTweets is a WEKA (Waikato Environment for Knowledge Analysis) package for analyzing the emotion and sentiment of tweets. The following are the filters most used by the participants who achieved high rankings in the SemEval-2018 competition:

  • TweetToLexiconFeatureVector: calculates features from a tweet using the lexicons:

    • MPQA: counts the number of positive and negative words from the Multi-perspective Question Answering (MPQA) Subjectivity Lexicon.

    • Bing Liu: counts the number of positive and negative words from the Bing Liu Lexicon.

    • AFINN: calculates positive and negative variables by aggregating the positive and negative word scores provided by the AFINN lexicon.

    • Sentiment140: calculates positive and negative variables by aggregating the positive and negative word scores provided by the Sentiment140 Lexicon.

    • NRC Hashtag Sentiment Lexicon: calculates positive and negative variables by aggregating the positive and negative word scores provided by the NRC Hashtag Sentiment Lexicon.

    • NRC Word-Emotion Association Lexicon: counts the number of words matching each emotion from the NRC Word-Emotion Association Lexicon.

    • NRC-10 Expanded [23]: adds the emotion associations of the words matching the Twitter-specific expansion of the NRC Word-Emotion Association Lexicon.

    • NRC Hashtag Emotion Association Lexicon: adds the emotion associations of the words matching the NRC Hashtag Emotion Association Lexicon.

    • SentiWordNet: calculates positive and negative scores using SentiWordNet.

    • Emoticons: calculates a positive and a negative score by aggregating the word associations provided by a list of emoticons.

    • Negations: counts the number of negating words in the tweet.

  • TweetToInputLexiconFeatureVector: calculates features from a tweet using a given list of affective lexicons in ARFF format. The NRC Affect Intensity Lexicon is used by default.

  • TweetToSentiStrengthFeatureVector: calculates positive and negative sentiment strengths for a tweet using SentiStrength [152].
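The sketch below re-implements, in Python, the core idea behind filters such as TweetToLexiconFeatureVector: aggregating lexicon matches into a fixed-length feature vector. The toy lexicon entries are invented placeholders standing in for MPQA-style polarities, AFINN-style scores, and a negation list, not excerpts from those resources.

    # Toy stand-ins for real lexicons (MPQA-style polarity, AFINN-style scores).
    POLARITY = {"happy": "positive", "love": "positive",
                "sad": "negative", "hate": "negative"}
    AFINN_SCORES = {"happy": 3, "love": 3, "sad": -2, "hate": -3}
    NEGATORS = {"not", "never", "no"}

    def lexicon_features(tokens):
        """Return [pos count, neg count, AFINN+ sum, AFINN- sum, negation count]."""
        pos = sum(1 for t in tokens if POLARITY.get(t) == "positive")
        neg = sum(1 for t in tokens if POLARITY.get(t) == "negative")
        scores = [AFINN_SCORES.get(t, 0) for t in tokens]
        afinn_pos = sum(s for s in scores if s > 0)
        afinn_neg = sum(s for s in scores if s < 0)
        negations = sum(1 for t in tokens if t in NEGATORS)
        return [pos, neg, afinn_pos, afinn_neg, negations]

    print(lexicon_features("i do not love rainy days".split()))  # [1, 0, 3, 0, 1]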

4 Emotion recognition in text

This section investigates prior work related to both explicit and implicit emotion recognition in English and other languages. Our review of the literature has led us to distinguish five classes of approaches for recognizing emotions in text: keyword-based approaches, rule-based approaches, classical learning-based approaches, deep learning approaches, and hybrid approaches. The articles are classified based on the proposed approach. This classification helps in evaluating these approaches based on their performance, strengths, and limitations and in drawing comparisons between them. Explicit emotion is mainly recognized with keyword-based approaches. Rule-based approaches, classical learning-based approaches, deep learning approaches, and hybrid approaches have mainly been introduced to recognize implicit emotions in text, although they have also been used for explicit emotion recognition.

Fig. 2 Main steps of a keyword-spotting technique

4.1 Keyword-based approaches

A keyword-based approach relies on finding occurrences of keywords in a given text and assigning an emotion label based on the detected keywords. The most used approach is the keyword-spotting technique; Fig. 2 outlines its main steps. First, a list of emotional words for each emotion label is defined using lexicons such as WordNet or WordNet-Affect. Then, text preprocessing, which includes tokenization, stop word removal, and lemmatization, is performed on the emotion dataset. The next step is to spot the emotion keywords present in the text using the predefined emotion keyword list. After that, the intensity of the emotion is analyzed. Then, negation checking is performed. Finally, the emotion label for each sentence is determined.
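A minimal sketch of this pipeline is shown below, with an invented keyword list and a naive window-based negation check. It reproduces the behavior discussed in Sect. 1: a keyword match decides the label unless a nearby negator suppresses it.

    EMOTION_KEYWORDS = {
        "happiness": {"happy", "joyful", "delighted"},
        "sadness":   {"sad", "unhappy", "miserable"},
        "anger":     {"angry", "furious", "mad"},
    }
    NEGATORS = {"not", "never", "no"}

    def spot_emotion(sentence):
        """Assign an emotion label by spotting keywords, with naive negation."""
        tokens = sentence.lower().replace("!", " ").replace(".", " ").split()
        for i, tok in enumerate(tokens):
            for emotion, keywords in EMOTION_KEYWORDS.items():
                if tok in keywords:
                    # A negator shortly before the keyword suppresses the match.
                    if any(t in NEGATORS for t in tokens[max(0, i - 3):i]):
                        return "no-emotion"
                    return emotion
        return "no-emotion"

    print(spot_emotion("Sunny days always make me feel happy"))  # happiness
    print(spot_emotion("I am not happy at all"))                 # no-emotion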

Tao [151] created a lexicon in which each word was classified as either a content word or an emotion functional word (EFW). The EFWs were then classified as an emotion keyword, modifier word, or metaphor word. The emotion keyword class consists of six labels of emotions and their corresponding weights. The modifier word class consists of words that emphasize the emotion by making it stronger or weaker. The metaphor word class consists of words that either show spontaneous expressions or show personal character. A coefficient is assigned to each word that was classified as a modifier word or a metaphor word. To obtain the relationship between the content word and the EFWs, POS tagging, a semantic tree, and HowNet [40], which is a Chinese knowledge database, were used. To recognize emotions, the first step is to apply the POS tagger, check for EFWs, and assign an emotion rating. The second step is to assign the weights for each emotion keyword and construct the link between the EFWs. The final step is to sum the weights of the emotion keywords across all sentences, run the scores through a fuzzy-logic process to determine the overall score, and assign each sentence a suitable emotion. Although the approach was able to classify the emotions conveyed in many sentences correctly, mislabeled emotions still occurred.

Ma et al. [90] proposed a model to recognize emotions in text messages in a chat system. First, they defined emotion keywords. Then, WordNet and WordNet-Affect were used to find the synonyms of the selected keyword. Each word was assigned a weight based on its sense. After building the affective lexicon, the overall emotion estimation was calculated by summing the weights of the matched keywords. Finally, sentence-level processing, which includes sentence splitting, POS tagging, and negation detection, was applied. The strategy used to address negation, which involved flipping the polarity of an emotion word, is not practical and will cause errors.

Perikos and Hatzilygeroudis [116] utilized NLP techniques, including POS tagging and parsing, to analyze the structure of a sentence. The emotion words were recognized using WordNet-Affect. The overall emotion of a sentence was selected based on the sentence dependency graph. The performance was tested on a corpus created by the authors. The corpus consists of 180 sentences, 120 of which convey emotion. Although the results were promising, the model must be tested on a known emotion corpus to truly measure its performance. Shivhare et al. [137] proposed a model that used an ontology with the keyword-spotting technique. The emotion ontology consists of three levels based on the emotion hierarchy presented by Parrott [114], and the Protégé application was used to create it. The results showed that adding the ontology improves the accuracy but does not overcome all of the limitations of the keyword-spotting technique.

Fig. 3 Main steps of a rule-based approach

4.2 Rule-based approaches

A rule-based approach is based on the manipulation of knowledge to interpret information in a useful way; Fig. 3 outlines the main steps of such an approach. First, text preprocessing is performed on the emotion dataset. The preprocessing steps may include tokenization, stop word removal, lemmatization, POS tagging, and dependency parsing. Then, emotion rules are extracted using linguistic, statistical, and computational concepts, and the best rules are selected. Finally, the rules are applied to the emotion dataset to determine the emotion labels.
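The toy sketch below illustrates the final step, applying hand-written rules to text. The patterns are invented for the demonstration; real systems such as [86, 155] derive far richer rules from POS tags, dependency parses, and annotated corpora.

    import re

    # Invented surface-pattern rules; each maps a linguistic cue to a label.
    RULES = [
        (re.compile(r"\bcan'?t wait\b"),               "joy"),
        (re.compile(r"\b(scared|afraid) (of|that)\b"), "fear"),
        (re.compile(r"\b(miss|lost|alone)\b"),         "sadness"),
        (re.compile(r"\bhow dare\b"),                  "anger"),
    ]

    def apply_rules(sentence):
        """Return the label of the first matching rule, else no-emotion."""
        s = sentence.lower()
        for pattern, emotion in RULES:
            if pattern.search(s):
                return emotion
        return "no-emotion"

    print(apply_rules("I can't wait to see you"))  # joy
    print(apply_rules("I am afraid of the dark"))  # fear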

Lee et al. [86] proposed a rule-based model for recognizing emotion cause events in Chinese. Cause events refer to the explicitly expressed opinions or events that evoke a corresponding emotion. First, an annotated emotion causes corpus is constructed. Second, the distribution of cause event types, the position of cause events relative to emotional experiences, and keywords are calculated for each emotion class. Then, seven groups of linguistic cues are identified, and two sets of linguistic rules for recognizing emotion causes are generalized. Finally, based on the linguistic rules, a system that recognizes the causes of emotions is developed. The experiments showed that the system has promising performance in terms of cause occurrence recognition and cause event recognition.

Udochukwu and He [155] proposed an emotion recognition model based on the emotion model that was created by Ortony et al. [110] and modified by Steunebrink et al. [146]. The Ortony, Clore, and Collins (OCC) model consists of five variables: direction, tense, overall sentence polarity, event polarity, and action polarity. To fill the OCC model variables, the data must first be preprocessed. The following techniques were used for this purpose: sentence splitting and tokenization, POS tagging, word sense disambiguation (WSD), dependency parsing, sentence tense detection based on the POS tags, and polarity detection using the majority vote based on the lexicon matching results obtained from SentiWordNet [46], AFINN, and the subjectivity lexicon [164]. Because the goal was to recognize implicit emotions, sentences that express explicit emotions via emotion words were filtered. Thus, any sentence that contains an emotion word found in WordNet-Affect was deleted. The results showed that their approach is very sensitive to the text quality.

4.3 Classical learning-based approaches

A classical learning-based approach provides systems with the ability to automatically learn and improve from experience. Machine learning algorithms are often categorized as supervised or unsupervised. The most used classification algorithm in the reviewed papers is the support vector machine (SVM), a supervised machine learning algorithm. Figure 4 outlines the main steps of using an SVM for emotion recognition. First, text preprocessing is performed on the emotion dataset. The preprocessing steps may include tokenization, stop word removal, lemmatization, and POS tagging. The next step is to extract useful features. Then, the features with the highest information gain are selected. Given the feature set and emotion labels, the SVM algorithm outputs an optimal separating hyperplane. Finally, the trained SVM model is used to classify emotions in unseen text.

Fig. 4 Main steps of SVM
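A minimal scikit-learn version of this pipeline is sketched below, assuming TF-IDF features in place of the hand-engineered feature sets used in the reviewed papers; the four training sentences are invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Invented toy training data; real work uses corpora such as ISEAR or Aman.
    texts = ["i lost my best friend", "what a wonderful surprise party",
             "stop yelling at me", "i am terrified of spiders"]
    labels = ["sadness", "joy", "anger", "fear"]

    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),  # unigram + bigram features
        LinearSVC(),  # learns a maximum-margin hyperplane per class (one-vs-rest)
    )
    model.fit(texts, labels)
    print(model.predict(["terrified of spiders"]))  # -> ['fear']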

One of the earliest works on emotion recognition in text was conducted by Alm et al. [4]. They proposed a supervised machine learning approach using a variation of the Winnow update rule implemented in the SNoW learning architecture [26]. Three experiments were performed. The first experiment tested whether a sentence was neutral or emotional. The second experiment tested whether the sentence was neutral or conveyed a positive or negative emotion. The results were affected by the size of the dataset; the worst result was obtained when classifying positive emotion because only 9.87% of the sentences were annotated with this class. The third experiment tested the performance when different configurations of features were selected. The authors concluded that the features interact and are not independent of one another; hence, selecting the best feature set is challenging.

Aman and Szpakowicz [6] annotated a corpus for text emotion recognition. In their experiment, only the sentences for which all annotators agreed on the emotion category were selected. However, the focus was on recognizing emotional sentences regardless of their emotion category. Thus, there were two classes: one representing all nonemotional sentences and one containing all the sentences labeled with one of Ekman’s emotions. Different feature sets were tested, including features from the General Inquirer [147] only, features from WordNet-Affect only, and combined features from the previous lexicons with and without emoticons, exclamation marks, and question marks. The authors used naïve Bayes and SVM for the classification. The best result was achieved using the SVM, and although the nonlexical features did not improve the results of the SVM, they did improve the accuracy of the naïve Bayes classifier. Moreover, Aman and Szpakowicz [7] conducted an experiment using different sets of features, including corpus-based unigram features, features derived from an emotion lexicon constructed based on the structure of Roget’s Thesaurus [77], and features extracted from WordNet-Affect. The best result was obtained when all three feature sets were combined.

Danisman and Alpkocak [33] proposed a vector space model (VSM), where each class of emotion is represented by a set of documents. To classify an input text, the similarity between each emotion class document and the input text is calculated by considering the cosine angle between them. The emotion class with the maximum similarity value is selected to be the label of the input text. The model was compared to ConceptNet [89], naïve Bayes, and SVM classifiers [64]. The experimental results showed that the VSM classifier performs better than all three classifiers.
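The sketch below captures the core of this vector space model: each emotion class is represented by one aggregate document, and an input text receives the label of the class whose vector has the largest cosine similarity. The class documents are toy stand-ins for the per-class document collections of [33].

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy aggregate documents, one per emotion class.
    class_docs = {
        "joy":     "won celebrate wonderful smile party gift",
        "sadness": "cried funeral lost lonely tears grief",
        "fear":    "dark scream terrified danger hide spider",
    }
    vectorizer = TfidfVectorizer()
    class_matrix = vectorizer.fit_transform(class_docs.values())

    def classify(text):
        """Label the text with the most cosine-similar emotion class."""
        sims = cosine_similarity(vectorizer.transform([text]), class_matrix)[0]
        return list(class_docs)[sims.argmax()]

    print(classify("tears at the funeral"))  # sadness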

Ghazi et al. [55] proposed a novel hierarchical approach to emotion recognition. First, the input text is classified into two categories: emotion sentences or nonemotion sentences. Second, the polarity of the emotion sentences is determined. Positive polarity represents the happiness emotion, while negative polarity represents the other five emotions: sadness, fear, anger, surprise, and disgust. The final step is to classify the emotions of the sentences with negative polarity. Two experiments were performed: one that involved two-level classification and one that involved three-level classification. The main focus was to compare the hierarchical and flat classifications. The experimental results showed that this hierarchical approach outperforms the flat classification approach.
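Structurally, the hierarchy amounts to a cascade of classifiers, as in the sketch below; the three classifier arguments are placeholders for trained models (e.g., SVMs), and the interface is invented for illustration.

    def hierarchical_classify(sentence, is_emotional, polarity, negative_emotion):
        """Cascade: emotional? -> polarity -> fine-grained negative emotion."""
        if not is_emotional(sentence):
            return "no-emotion"
        if polarity(sentence) == "positive":
            return "happiness"  # positive polarity maps directly to happiness
        return negative_emotion(sentence)  # sadness/fear/anger/surprise/disgust

    # Usage with stub classifiers standing in for trained models:
    print(hierarchical_classify(
        "I keep waking up at night",
        is_emotional=lambda s: True,
        polarity=lambda s: "negative",
        negative_emotion=lambda s: "fear",
    ))  # fear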

Ghazi et al. [56] aimed to take the context of a sentence into consideration by using different emotion lexicons and NLP techniques to extract meaningful feature sets. The performance of the features was tested using two classification algorithms: SVM and logistic regression. Logistic regression performed better than the SVM, and both performed better than the baseline, which used an SVM with bag-of-words (BOW) features. Additionally, the features were grouped by similarity to test their contribution and significance. The results showed that lexical, POS, dependency, and negation features significantly improve the performance.

Xu et al. [167] proposed a hierarchical emotion classification for a Chinese microblog. In the first level, the input text is classified as neutral or emotional. The second level finds the polarity of the emotional sentences. The third level classifies the sentences with negative polarity as distress, surprised, fearful, angry, or disgusted and classifies the positive sentences as fond or joyful. Then, each emotion class of the third level is divided into a number of emotion classes, resulting in 19 different classes of emotions in the fourth level. The support vector regression (SVR) algorithm is used for classification. Moreover, Zhang et al. [174] proposed a knowledge-based topic model (KTM) to identify implicit emotion features. Additionally, a hierarchical emotion structure was employed to classify emotions into 19 classes over four levels using an SVM. The authors achieved good results. However, the tree-structured classification was a time-consuming process.

As mentioned in Sect. 2, there are three major approaches for emotion modeling. Kim et al. [81] presented an evaluation of the categorical model and the dimensional model. For the categorical model, features were derived from WordNet-Affect, and the VSM was used for text representation. To reduce the VSM representation, three dimensionality reduction techniques were used: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), and nonnegative matrix factorization (NMF). Regarding the dimensional model, features were derived from the Affective Norms for English Words (ANEW) [22]. The experimental results showed that the categorical NMF model and the dimensional model achieved the best results.

Chaffar and Inkpen [28] investigated the use of a heterogeneous emotion-annotated dataset, which included the SemEval-2007 dataset, Alm’s dataset, Aman’s dataset, and the Neviarouskaya dataset. The best results were obtained using the SVM classifier. The results also showed that using n-gram features for the SemEval-2007 dataset yields better results than those obtained using BOW, whereas the opposite is true for the Neviarouskaya dataset. Additionally, using features extracted from WordNet-Affect did not improve the accuracy.

Ho and Cao [66] developed an emotion recognition model based on two ideas: emotions depend on the mental state of humans [118], and emotions are caused by emotional events [69], which means that when a certain event occurs, the mental state of a human transitions from one state to another. The authors implemented this idea using a hidden Markov model (HMM), where each sentence consists of multiple sub-ideas and each sub-idea is considered an event that causes a transition to a certain state. The states of the HMM were automatically generated based on the dataset, and its parameters were estimated during training. Compared to the other models, the results were not promising. The results could be improved by using a better dimensionality reduction method and including more linguistic information.

Bandhakavi et al. [15] created an emotion lexicon from a labeled emotion corpus. To show that a domain-specific emotion lexicon (DSEL) is more suitable than a general-purpose emotion lexicon (GPEL), the authors tested the quality of features extracted from their emotion lexicon against features extracted from GPELs, such as WordNet-Affect and the NRC Word-Emotion Association Lexicon, and from a lexicon learned using pointwise mutual information (PMI). The results showed that their features outperform the GPEL features and the BOW features. Moreover, the BOW features were better than the GPEL features, revealing that a GPEL is not sufficient for a specific domain such as Twitter.

Anusha and Sandhya [9] developed a model that uses NLP techniques to improve the performance of learning-based approaches by including the syntactic and semantic features of text. Two classification algorithms were trained: naïve Bayes and SVM. The authors performed two experiments: one classified the emotions as either positive or negative and tested the sentence polarity, and the other tested the model performance on emotion recognition. Each experiment was repeated twice: once with the use of NLP techniques to preprocess the data and once without this step. The difference in the results when using the NLP techniques was significant. This outcome showed that applying methods that select the important part of a sentence is essential for improving the results.

Thomas et al. [153] investigated the use of multinomial naïve Bayes (MNB) with unigram, bigram, and trigram features from English text. The unigram features provided better results than the other two feature types. Later, Yuan and Purver [173] utilized an SVM with high-order n-gram features for Chinese text. Character-based 4-gram features were the most effective. The results showed that the classification performance varies among the emotions; the highest accuracy was achieved for happiness. Note that the size of the labeled data was not the same for each emotion and that happiness had the largest amount.

Due to the importance of features and their effect on the results, Gao et al. [52] proposed a feature extraction method that takes the syntactic and grammatical structure of a sentence written in Chinese into account. First, they expanded the standard emotion lexicon, which was manually annotated by three annotators, using a Chi-square test and PMI with word2vec. Second, the quality of the selected features was improved by using POS tagging and dependency parsing. The SVM was used for classification because it performs well and has been widely used. The results showed that using features with syntactic and grammatical structure improves the accuracy.

Emotions are not limited to 6, 7, or 8 categories; people express themselves using a wide range of emotions. Desmet and Hoste [38] proposed a binary SVM to recognize 15 emotions. They defined seven feature groups, and to determine the optimal feature combination, they combined the seven feature groups into 17 feature sets and used bootstrap resampling. The input text was checked for spelling errors, and these errors were corrected. The experiments showed that applying spelling correction improves the results. The results also showed that the performance improves if the number of emotions is reduced. Through experimentation, the best result was obtained when retaining the following seven emotions: blame, guilt, hopelessness, information, instruction, love, and thankfulness. Yan and Turtle [169] proposed two learning-based approaches to recognize 28 emotions. The experiments demonstrated that the SVM and Bayesian networks consistently provide good performance.

Douiji et al. [41] proposed an unsupervised machine learning algorithm based on the previous work of Agrawal and An [2]. YouTube comments were used as the data corpus because of the similarity between the writing styles of YouTube comments and instant messages. To recognize the emotion of a text entry, the similarity between the text and each target emotion was computed using the normalized version of PMI. Then, the average PMI values were computed, and the emotion category with the highest average value was assigned to a sentence. Since an unsupervised approach was used, the corpus required no labeling.
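The sketch below illustrates the PMI computation underlying such an approach: the association between a word and an emotion seed word is estimated from document co-occurrence counts, and the emotion with the highest average association wins. The four-document corpus and the seed lists are invented, and Douiji et al. [41] additionally normalize the PMI values.

    import math

    # Invented toy corpus standing in for YouTube comments.
    docs = [d.split() for d in [
        "exam tomorrow so nervous fear",
        "nervous before the interview fear",
        "party tonight joy dance",
        "dance and music joy",
    ]]
    N = len(docs)
    SEEDS = {"fear": ["fear"], "joy": ["joy"]}

    def pmi(word, seed):
        """PMI from document co-occurrence: log p(w, s) / (p(w) p(s))."""
        p_w = sum(word in d for d in docs) / N
        p_s = sum(seed in d for d in docs) / N
        p_ws = sum(word in d and seed in d for d in docs) / N
        return math.log(p_ws / (p_w * p_s)) if p_ws else 0.0

    def classify(text):
        words = text.lower().split()
        scores = {e: sum(pmi(w, s) for w in words for s in seeds) / len(words)
                  for e, seeds in SEEDS.items()}
        return max(scores, key=scores.get)

    print(classify("nervous about the exam"))  # fear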

Muljono et al. [105] proposed a model to recognize emotions in Indonesian text. The following preprocessing and feature extraction steps were performed: tokenization, case normalization, stop word removal, stemming, and term frequency–inverse document frequency (TF-IDF) weighting. Four classification methods, implemented in WEKA, were evaluated: naïve Bayes, J48, k-nearest neighbors (KNN), and support vector machine with sequential minimal optimization (SVM-SMO). The best result was achieved using SVM-SMO. Jain et al. [76] proposed a multilingual English–Hindi emotion recognition framework in which two classification methods were tested: naïve Bayes and SVM. The best results were obtained using the SVM.

Mulki et al. [106] formulated emotion recognition as a binary classification problem. Different preprocessing steps were tested. The preprocessing pipeline that achieved their highest result for Arabic consisted of replacing emojis with emotion tags, stemming, and stop word removal. For English and Spanish, the pipeline consisted of replacing emojis with emotion tags, lemmatization, and stop word removal. TF-IDF was used to generate features. The classification was performed using a one-vs-all SVM classifier with a linear kernel. Their model ranked 3rd for Arabic, 14th for English, and 3rd for Spanish among the teams in the SemEval-2018 competition.

Xu et al. [168] proposed a model for multi-label emotion recognition in English tweets. The proposed model used different types of features, including linguistic features, sentiment lexicon features, emotion lexicon features, and domain-specific features. Additionally, different classification algorithms were tested: logistic regression, SVR, bagging regressor (BR), AdaBoost regressor (ABR), gradient boosting regressor (GBR), and XGBoost regressor (XGB). The combination of all feature types with logistic regression obtained the highest results. Their model achieved 13th rank among the teams in the SemEval-2018 competition. Deborah et al. [37] proposed a simple multilayer perceptron (MLP) for multi-label emotion recognition in English tweets. The MLP had an input layer, two hidden layers with 128 and 64 neurons, and an output layer. The model used the Nadam optimizer with a learning rate of 0.01. Their model achieved 18th rank among the teams in the SemEval-2018 competition.

Plaza-del-Arco et al. [119] proposed a model for multi-label emotion recognition in English and Spanish tweets. First, text preprocessing was performed: the Natural Language Toolkit (NLTK) TweetTokenizer was used for tokenization, the NLTK Snowball stemmer was used for stemming, stop words were removed (only for English), and all letters were converted to lowercase. Then, different lexicons were tested, including the Spanish Emotion Lexicon [139], the NRC Word-Emotion Association Lexicon [102], and WordNet-Affect. The information extracted from these lexicons, together with the TF-IDF representation of the tweets, was used as the features. Finally, the authors used the random forest (RF) algorithm for classification. Their model achieved 25th and 5th ranks in the SemEval-2018 competition for English and Spanish, respectively. However, although they ranked high in Spanish, a small number of teams participated in that language compared to English.

Singh et al. [140] proposed a two-stage text feature selection method to identify significant features for emotion recognition. First, they extracted meaningful words, namely nouns, verbs, adverbs, and adjectives, using a POS tagger. Then, a Chi-square method was employed to compute a statistical significance score for each word, and words with low scores were removed. They used an SVM with a radial basis kernel function to build the classification model. The results show a significant improvement with the proposed approach compared with using the POS or the statistical method alone.

Fig. 5 Main steps of LSTM

4.4 Deep learning approaches

Deep learning is a branch of machine learning in which programs learn from experience and understand the world in terms of a hierarchy of concepts, where each concept is defined in terms of its relation to simpler concepts. This approach allows a program to learn complicated concepts by building them from simpler ones [59]. The most used deep learning model in the reviewed papers is long short-term memory (LSTM). LSTM is a special form of recurrent neural network (RNN) capable of handling long-term dependencies, and it overcomes the vanishing and exploding gradient problems common in RNNs. Figure 5 outlines the main steps of using an LSTM for emotion recognition in text. First, text preprocessing is performed on the emotion dataset. The preprocessing steps may include tokenization, stop word removal, and lemmatization. After that, an embedding layer is built and fed into one or more LSTM layers. Then, the output is fed into a dense neural network (DNN) layer with as many units as there are emotion labels and a sigmoid activation function to perform the classification.
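A minimal Keras version of this pipeline is sketched below, with a bidirectional LSTM and illustrative hyperparameters (vocabulary size, sequence length, embedding width are assumptions); the sigmoid output layer makes it a multi-label model, e.g., for the eleven SemEval-2018 labels.

    import tensorflow as tf
    from tensorflow.keras import layers

    VOCAB_SIZE, NUM_EMOTIONS = 20000, 11  # illustrative values

    model = tf.keras.Sequential([
        layers.Embedding(VOCAB_SIZE, 128),      # token ids -> dense vectors
        layers.Bidirectional(layers.LSTM(64)),  # context in both directions
        layers.Dense(NUM_EMOTIONS, activation="sigmoid"),  # one unit per label
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["binary_accuracy"])
    # model.fit(padded_token_ids, multi_hot_labels, epochs=..., batch_size=...)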

Wang et al. [159] utilized a convolutional neural network (CNN) to solve multi-label emotion recognition. The experiments were conducted on the NLPCC 2014 Emotion Analysis in Chinese Weibo Text (EACWT) task [158] and the Chinese blog dataset Ren_CECps [122]. The experimental results showed that the CNN with the help of word embedding outperforms strong baselines and achieves excellent performance.

Baziotis et al. [18] proposed a deep learning model for multi-label emotion recognition in English tweets. Their model consisted of a two-layer bidirectional long short-term memory (Bi-LSTM) network equipped with a multilayer self-attention mechanism. They utilized the ekphrasis [17] tool to process the text. The preprocessing steps included Twitter-specific tokenization, spell correction, word normalization, word segmentation, and word annotation. Due to the limited amount of training data, they utilized a transfer learning approach by pretraining the Bi-LSTMs on the SemEval-2017 Task 4A dataset [129]. They also collected a dataset of 550 million English tweets for calculating the word statistics needed in the text preprocessing and for training word2vec embeddings [96] and affective word embeddings. The experimental results showed that transfer learning did not outperform the random initialization model. Their model achieved 1st rank among the teams in the SemEval-2018 competition.

Meisheri and Dey [93] proposed a robust representation of a tweet. Two parallel architectures were designed to generate the representation using various pretrained embeddings. The first architecture generated the embedding matrix from emoji2vec [43], GloVe [115], and character-level embeddings. The resulting matrix was fed into a Bi-LSTM [61], and the output of each time step was then fed into an attention layer [14]. The second architecture generated the embedding matrix using pretrained GloVe embeddings trained on a Twitter corpus. This matrix was fed into another Bi-LSTM, and max-pooling was applied to the output of the Bi-LSTM. The outputs of the two architectures were concatenated and then fed into two fully connected networks. Their model achieved 2nd rank among the teams in the SemEval-2018 competition.

Du and Nie [42] proposed a deep learning model that uses pretrained word embeddings for tweet representation. The embeddings were fed into a gated recurrent unit (GRU), and the classification was obtained using a dense neural network (DNN). Their model achieved 15th rank among the teams in the SemEval-2018 competition. Abdullah and Shaikh [1] formulated emotion recognition in tweets as a binary classification problem. Word embeddings were used for tweet representation and fed into four DNNs. The output of the fourth DNN was normalized to either one or zero based on a 0.5 threshold. Their model achieved 4th rank for Arabic and 17th rank for English among the teams in the SemEval-2018 competition. Li et al. [87] proposed a deep learning model that uses word embeddings for tweet representation. The embeddings were fed into an LSTM. For the classification, the model calculated a score for each emotion label and selected the labels with the top three scores. Their model achieved 23rd rank in the SemEval-2018 competition.

Ezen-Can and Can [48] formulated the multi-label emotion recognition problem as a binary classification problem, which allowed different model architectures and parameters for each emotion label. The authors utilized three GRU layers, two of which were bidirectional. Due to the small size of the training dataset, they built an autoencoder and used unlabeled tweets to learn weights that could be reused in the classifiers. They used pretrained embeddings for the representation of emojis [43], words, and hashtags [115]. Their results were better than the baseline but not as high as those of the other participants; their model achieved 24th rank among the teams in the SemEval-2018 competition.

Basile et al. [16] proposed a deep learning model for emotion recognition in textual conversation. The model consists of four submodels: a three-input submodel (INP3), a two-output submodel (OUT2), a sentence-encoder submodel, and a bidirectional encoder representations from transformers (BERT) [39] submodel. The INP3 submodel takes each part of the conversation and feeds it into two Bi-LSTM layers, followed by an attention layer [170]; the outputs are concatenated and fed into three DNNs. The OUT2 submodel has the same architecture as the INP3 submodel, except that the three parts of the conversation are concatenated and used as one input, and an additional DNN inserted after the attention layer produces a second output, a classification of the conversation as emotional or others; the purpose of this submodel is to reduce the effect of the imbalanced dataset. In the sentence-encoder submodel, the authors built a feed-forward network with a fine-tuned universal sentence encoder (USE) [27] and used only the first and third parts of the conversation. In the BERT submodel, they modeled the problem as a sentence-pair classification problem using only the first and third turns of the conversation; this submodel is combined with a lexical normalization system [156]. Different classification algorithms were tested, including SVM, SVM with normalization (SVM-n), logistic regression, naïve Bayes, the JRip rule learner [31], random forest, and J48. The results show that the features learned by the INP3 and OUT2 submodels lead to better performance than the features learned by the USE and BERT submodels. However, an ensemble of the four submodels with SVM-n leads to the best performance.

Xiao [166] proposed a deep learning model for emotion recognition in textual conversation. The ekphrasis tool was used for preprocessing the text. The author fine-tuned the following models: the universal language model (ULM) [67], the BERT model, OpenAI’s Generative Pretraining (GPT) [123] model, the DeepMoji [49] model, and a DeepMoji model trained with NTUA [18] embeddings. The results show that the ULM model performs best among the individual models, with the DeepMoji model trained with NTUA embeddings in second place. However, an ensemble of these models obtained the highest result: the models were combined by taking the unweighted average of their posterior probabilities, and the emotion class with the largest averaged probability was selected.

Ragheb et al. [124] proposed a deep learning model for emotion recognition in textual conversation. The three parts of the conversation were concatenated and fed into the embedding layer, whose output is fed into three consecutive Bi-LSTM layers trained with average stochastic gradient descent. Then, a self-attention mechanism followed by average pooling is applied to the first and third parts of the conversation. The difference between the two pooled scores is taken as input to two DNN layers followed by a softmax to obtain the emotion labels. The Wikitext-103 dataset [94] was used for training the language model. The results show low performance in recognizing the happy emotion label.

Ma et al. [91] proposed a deep learning model for emotion recognition in textual conversation. To overcome the out-of-vocabulary problem caused by using pretrained word embeddings, they replaced the emojis with suitable emotion words. The embeddings are fed into a Bi-LSTM layer, while an attention mechanism increases the weights of the emotion words. The inner product of the Bi-LSTM output and the attention weights is fed into another Bi-LSTM layer. Then, global max-pooling, global average pooling, and the last tensor are applied to the output of that Bi-LSTM layer. The pooled scores are fed into an LSTM layer and then a DNN with a softmax activation function. The results show low performance in recognizing the happy emotion label.

Ge et al. [53] proposed a deep learning model for emotion recognition in textual conversation. Three pretrained embeddings were used: word2vec-twitter [58], GloVe, and ekphrasis [17]. The embedding layer is fed into a Bi-LSTM followed by an attention layer and a CNN layer. The outputs of the Bi-LSTM and the CNN are concatenated, and global max-pooling is applied. The pooled scores are fed into a DNN with a softmax activation function for classification. The results show that using pretrained embeddings improved the performance. Moreover, by combining the outputs of the Bi-LSTM and CNN layers, the model was able to learn local features as well as long-term features.

Rathnayaka et al. [125] proposed a deep learning model for multi-label emotion detection in microblogs. They used the ekphrasis tool for preprocessing and the pretrained GloVe word embeddings. The embedding layer is fed into two Bi-GRU layers. The embedding layer and the output of the first Bi-GRU layer are fed into the first attention layer, while the embedding layer and the outputs of both Bi-GRU layers are fed into the second attention layer. The two attention layers are then concatenated and fed into a DNN with a sigmoid activation function to perform the classification. Their model achieved state-of-the-art results.

Seyeditabari et al. [135] formulated emotion recognition in text as a binary classification problem. Two word embedding models were used: ConceptNet Numberbatch [144] and fastText [97]. The embedding layer is fed into a Bi-GRU layer, followed by a concatenation of global max-pooling and global average pooling layers. The pooled scores are fed into a DNN, and a sigmoid layer performs the classification. The results show that deep learning models can learn more informative features, which improves the performance significantly.

Shrivastava et al. [138] proposed a deep learning model for emotion recognition in multimedia text. The word2vec [95] model was used for constructing the word embeddings. The embedding layer is fed into convolutional layers, followed by a max-pooling layer and then a DNN layer. The output of the DNN is fed into an attention layer, and the classification is performed by a softmax. The results show that the precision for the emotion labels anger and fear is better than that for the other labels, while the recall and F1-score for the happiness label are better than those for the other labels.

4.5 Hybrid approaches

Seol et al. [134] proposed a hybrid of keyword-based and learning-based approaches. First, the system searches for emotion keywords in a sentence using the emotional keyword dictionary (EKD), which consists of words that express emotional meaning. If the system finds at least one emotional keyword, then the sentence is classified according to the EKD. However, if the input sentence does not contain any emotional keyword, then a knowledge-based artificial neural network (KBANN) classifier is used. The KBANN is a type of artificial neural network (ANN) that uses domain knowledge to initialize the network. Except for the neutral class, each emotion is recognized by a separately trained KBANN. Moreover, Haggag [63] proposed a KBANN that is trained using an evolutionary algorithm. A structured knowledge base is created to store semantic and syntactic information for frame elements, and emotions are recognized via a matching process. There are four methods for matching a frame against the knowledge-based frame set: first matching, best matching, best opposite matching, and average matching. This choice allows for a trade-off between the performance and the strength of the matches found. The experimental results showed that the recognition accuracy of the proposed model is better than those of other existing emotion approaches, including keyword-spotting and supervised machine learning models.

Gievska et al. [57] proposed a hybrid approach that combines lexicon-based and learning-based approaches. A lexicon of emotion words related to Ekman’s six basic emotions was derived from the following: WordNet-Affect, AFINN, H4Lvd, and the NRC Word-Emotion Association Lexicon. A number of classification algorithms were tested, including naïve Bayes, SVM, and decision trees; the SVM provided the best results and was therefore selected. The results showed that the SVM classifier compensates for the limitations of the lexicon approach, as the hybrid was able to recognize the implicit emotions in the sentences.

Shaheen et al. [136] proposed a framework that combines rule-based and learning-based approaches. Their emotion recognition system has two main phases. First, a set of annotated reference rules, called emotion recognition rules (ERRs), which capture the emotional part of a sentence, is constructed. Second, the ERR of the input sentence is compared with the annotated ERRs using a KNN classifier. The KNN takes the input ERR and searches the annotated ERR set for a similar match using two similarity measures: semantic similarity, which shows how close two ERRs are in meaning, and keyword similarity, which counts the matched words between two ERRs. The input ERR takes the emotion label of the annotated ERR with the maximum semantic similarity; ties are broken using the keyword similarity. If the KNN classifier fails to find a match, which may occur when the training dataset is small, a PMI classifier is used, and if that also fails, PMI with information retrieval (PMI-IR) is used. PMI-IR uses search engines (the authors used Google) to find a match. Two datasets were used: Aman’s dataset and a dataset of sentences collected from Twitter. In one of their experiments, the authors trained on the second dataset and tested on the first. The results demonstrated the strength and robustness of their approach, given that they trained on one dataset and tested on a completely different one.

Amelia and Maulidevi [8] proposed a hybrid method that combines the keyword-spotting technique and a learning-based method to recognize the dominant emotion in short stories. The emotion words used in the keyword-spotting technique came from the NRC Emotion Lexicon. Although the NRC lexicon contains emotion words in 20 different languages, none of them is Bahasa Indonesia. Thus, the emotion words were translated from English into Indonesian using Google Translate and kamus.net, and the translations were double-checked against Kamus Besar Bahasa Indonesia (KBBI), the comprehensive dictionary of the Indonesian language published by the national Language Center, to avoid auto-translation mistakes. For the learning-based method, they used three algorithms: logistic regression, SVM, and naïve Bayes. Both methods were run separately, each recognizing one or more dominant emotions for each short story, and the hybrid method then selected the most dominant emotion. When no single result could be chosen from the two methods, the result of the keyword-spotting technique was taken. We believe that because the features included no syntactic or semantic information, the keyword-spotting technique performed better than the learning-based methods.

Li et al. [88] proposed a hybrid neural network (HNN) that incorporates a latent semantic machine (LSM) based on the biterm topic model (BTM). Three experiments were performed. The first evaluated the influence of the number of hidden neurons with one hidden layer; the best numbers of hidden neurons for ISEAR and SemEval-2007 were 80 and 60, respectively. In the second experiment, the authors compared their HNN against a CNN, both with one hidden layer, and the HNN outperformed the CNN on both datasets. In the third experiment, they compared performance with two hidden layers: the HNN with two layers outperformed the CNNs with one and two hidden layers, but the HNN with one hidden layer performed better than the HNN with two.

Riahi and Safari [127] proposed a hybrid model consisting of three submodels: a machine learning submodel, a VSM submodel, and a keyword-based submodel. Each submodel analyzes the input text from a different aspect and outputs an emotion label. If all submodels produce the same emotion label, that label is assigned to the input text; otherwise, the input text is left without an emotion label.
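Such a unanimous-vote combination reduces to a few lines (a sketch assuming each submodel exposes a predict method returning a single label):

    def unanimous_label(text, submodels):
        """Assign an emotion label only if all submodels agree; otherwise None."""
        labels = {model.predict(text) for model in submodels}
        return labels.pop() if len(labels) == 1 else None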

Herzig et al. [65] proposed an ensemble approach that combines the traditional representation of text as a BOW vector with a new representation based on pretrained word vectors, namely GloVe and Word2Vec (GoogleNews). To obtain a document representation from the word embeddings, they experimented with three methods: continuous bag-of-words (CBOW), TF-IDF weights, and classifier weights (CLASS). The experiments were performed on five datasets from different domains using a one-vs-all SVM classifier. The results showed that word vectors trained with GloVe achieved higher performance than Word2Vec-based vectors and that there is an advantage in combining a traditional text representation, such as BOW, with an embedded document representation.
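As an illustration, the TF-IDF-weighted variant of building a document vector from word vectors can be sketched as follows (a minimal sketch assuming a pretrained embedding dictionary, such as GloVe vectors, and precomputed IDF values; not the authors' exact pipeline):

    import numpy as np

    def doc_vector(tokens, embeddings, idf, dim=300):
        """TF-IDF-weighted average of pretrained word vectors (e.g., GloVe)."""
        vec, total = np.zeros(dim), 0.0
        for tok in set(tokens):
            if tok in embeddings and tok in idf:
                w = tokens.count(tok) * idf[tok]   # tf * idf weight
                vec += w * embeddings[tok]
                total += w
        return vec / total if total > 0 else vec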

Park et al. [113] proposed two models for multi-label emotion recognition in tweets. The first model was formulated as a linear regression with label distance as the regularization term. The second was formulated as a logistic regression classifier chain; a classifier chain treats a multi-label problem as a sequence of binary classification problems, where each classifier takes the predictions of the previous classifiers as additional input (see the sketch below). For the features, the authors trained a CNN on another Twitter corpus distantly labeled with hashtags to obtain emotional word vectors. Additionally, they used two deep models to learn emoji vectors. In the first, they used the pretrained deep learning network of Felbo et al. [49], which consists of a Bi-LSTM with an attention layer, to extract features from the original competition datasets. For the second, they collected 8.1 million tweets containing 34 different emojis relevant to the emotion labels, clustered these emojis into 11 clusters based on the distances in the correlation matrix of the hierarchical clustering from [49], and trained a one-layer Bi-LSTM classifier with 512 hidden units to predict the emoji cluster of each sample. They also included human-engineered features, such as the number of elongated words and the number of exclamation and question marks. The results showed that the regularized linear regression performed better than the classifier chain; however, the best result was achieved by an ensemble of both models. Their model achieved 3rd rank in the SemEval-2018 competition.
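A classifier chain of this kind can be reproduced with off-the-shelf tooling (a generic sketch using scikit-learn with random placeholder data, not the authors' features or tuning):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multioutput import ClassifierChain

    rng = np.random.default_rng(0)
    X = rng.random((200, 50))            # placeholder feature vectors
    Y = rng.integers(0, 2, (200, 11))    # 11 binary emotion labels per tweet

    # Each binary classifier in the chain receives the predictions of the
    # preceding classifiers as additional input features, which lets the
    # model exploit dependencies between emotion labels.
    chain = ClassifierChain(LogisticRegression(max_iter=1000), order=None)
    chain.fit(X, Y)
    print(chain.predict(X[:3]))          # (3, 11) multi-label predictions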

Gee and Wang [54] proposed a model for multi-label emotion recognition in English tweets. The proposed model consists of five submodels: a Bi-LSTM, an LSTM with an attention mechanism, a second Bi-LSTM, a lexicon-vector component, and a five-layer DNN. Transfer learning was performed to learn the weights of the LSTM networks. The input to the first two submodels was word embeddings, while the input to the third and fourth submodels was a lexicon vector extracted by the TweetToLexiconFeatureVector filter of the AffectiveTweets WEKA package. The outputs of the four submodels were concatenated and fed into the fifth submodel. The model was trained incrementally for emotions within the same cluster formed by hierarchical clustering. Their model achieved 4th rank among the teams in the SemEval-2018 competition.

Kim et al. [82] proposed a model for multi-label emotion recognition in tweets. The model uses pretrained word embeddings, which are fed into three self-attention layers; the output of the self-attention layers is fed into a CNN followed by max-pooling, and the max-pooled output is fed into a DNN for classification. They experimented with the impact of using emojis, self-attention layers, and lexicon features. The results showed that utilizing emojis, the attention mechanism, and lexicon features improved the results. Their model achieved 5th rank for English and 1st rank for Spanish among the teams in the SemEval-2018 competition.
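The layer stacking described above can be sketched in PyTorch (a minimal illustration of the architecture family, not the authors' exact configuration; vocabulary size, head count, and filter counts are placeholder values):

    import torch
    import torch.nn as nn

    class AttnCnnClassifier(nn.Module):
        """Self-attention -> CNN -> max-pooling -> DNN, for multi-label emotions."""
        def __init__(self, vocab_size=20000, emb_dim=300, n_labels=11):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.attn = nn.MultiheadAttention(emb_dim, num_heads=4, batch_first=True)
            self.conv = nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1)
            self.fc = nn.Linear(128, n_labels)

        def forward(self, token_ids):                     # (batch, seq_len)
            h = self.emb(token_ids)                       # (batch, seq, emb)
            h, _ = self.attn(h, h, h)                     # self-attention over tokens
            h = torch.relu(self.conv(h.transpose(1, 2)))  # (batch, 128, seq)
            h = h.max(dim=2).values                       # global max-pooling over time
            return self.fc(h)                             # logits; use BCEWithLogitsLoss

    logits = AttnCnnClassifier()(torch.randint(0, 20000, (8, 30)))  # smoke test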

Rozental and Fleischer [130] proposed a model for multi-label emotion recognition in English tweets. Two preprocessing pipelines, one simple and one complex, were implemented. Both versions used the following steps: word tokenization using CoreNLP [92], POS tagging using the Tweet NLP tagger [111], replacing emojis with representative keywords, replacing URLs with a special keyword, removing duplications, and breaking hashtags into individual words. The complex version added word lemmatization using CoreNLP, named entity recognition using CoreNLP with the entities replaced by representative keywords, synonym replacement, and word replacement using a Wikipedia dictionary. Two hundred million tweets were randomly sampled using the Twitter Firehose service and cleaned with the two preprocessing pipelines. The authors then trained embeddings using the Gensim package [126], creating four embeddings for words and two for POS tags. In addition to the deep features, they extracted lexicon features as well as semantic and syntactic features. The embeddings were fed into a bidirectional gated recurrent unit (Bi-GRU) with a CNN attention mechanism, and the output was fed into two fully connected neural networks. Their model achieved 6th rank in the SemEval-2018 competition.

De Bruyne et al. [35] formulated emotion recognition in tweets as a set of binary classification problems. Different syntactic, semantic, and stylistic features were used to represent the tweets, and different classifiers were tested, including SVM, linear SVM with stochastic gradient descent (SGD) learning, logistic regression, and RF. The authors took the best-performing classifier for each emotion label and combined them in a classifier chain, where the prediction of the previous model was passed to the next classifier as an additional feature. Their model achieved 11th rank in the SemEval-2018 competition.

Kravchenko and Pivovarova [85] proposed a model for multi-label emotion recognition in English tweets. The model combines two types of features, lexicon features and word embeddings, and uses a gradient boosting classifier for classification. They concluded that the model performed better with word embeddings than with lexicon features, and the best result was achieved by combining both. Their model achieved 15th rank in the SemEval-2018 competition.

Badaro et al. [12] proposed a model for multi-label emotion recognition in Arabic tweets. Several features were tested, including n-grams, affect lexicons, sentiment lexicons, and word embeddings from AraVec [143] and FastText [21]; the AraVec embeddings outperformed the other features. The authors also tested several learning models, including a support vector classifier (SVC) with L1 and L2 penalties, ridge classification (RC), RF, and an ensemble of the three; linear SVC with the L1 penalty outperformed the other learning models. Their model achieved 1st rank in the SemEval-2018 competition.

Agrawal and Suri [3] proposed combining lexical and deep learning features for emotion recognition in textual conversation, aiming to build a model robust to emoticons, slang, abbreviations, spelling mistakes, and writing style. They trained LightGBM [80] and logistic regression models, with LightGBM performing better. They also performed a hold-one-out experiment on the features, which showed that the maximum gain came from character n-grams.
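The usefulness of character n-grams on noisy conversational text is easy to probe with off-the-shelf tools (a toy sketch, not the authors' pipeline; the three example utterances are invented, and a real run would use the conversation corpus):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from lightgbm import LGBMClassifier

    texts = ["im sooo happyyy :)", "ugh this is awful", "ok whatever"]
    labels = ["happy", "angry", "others"]

    # Character n-grams are robust to spelling mistakes, elongation, and
    # slang, which is exactly the noise conversational text exhibits.
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))
    X = vec.fit_transform(texts)

    clf = LGBMClassifier(n_estimators=100)
    clf.fit(X, labels)
    print(clf.predict(vec.transform(["so happyy today"])))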

4.6 Evaluation measures

This section presents the different evaluation measures used in related work. These include multi-label accuracy (Jaccard accuracy) (Eq. 1), accuracy (Eq. 2), \(F^\mathrm{{micro}}\) (Eq. 5), and \(F^\mathrm{{macro}}\) (Eq. 9).

$$\begin{aligned} {\text {Jaccard accuracy}}= \frac{1}{|S|}\sum _{s\in S}\frac{|G_s\cap P_s|}{|G_s\cup P_s|} \end{aligned}$$
(1)

where \(G_s\) is the set of gold labels for sentence s, \(P_s\) is the set of predicted labels for sentence s, and S is the set of sentences.

$$\begin{aligned} {\text {Accuracy}}=\frac{\sum _{e\in E}{\text {TP}}+\sum _{e\in E}{\text {TN}}}{\sum _{e\in E}{\text {TP}}+\sum _{e\in E}{\text {TN}}+\sum _{e\in E}{\text {FP}}+\sum _{e\in E}{\text {FN}}} \end{aligned}$$
(2)

where E is the set of emotion labels, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.

For the micro-averaged results, the TP, FP, and FN counts are summed over all emotion labels, and precision and recall are computed from the pooled counts. \(P^\mathrm{{micro}}\) and \(R^\mathrm{{micro}}\) are calculated as follows:

$$\begin{aligned} P^\mathrm{{micro}}= & {} \frac{\sum _{e\in E}{\text {TP}}}{\sum _{e\in E}{\text {TP}}+\sum _{e\in E}{\text {FP}}} \end{aligned}$$
(3)
$$\begin{aligned} R^\mathrm{{micro}}= & {} \frac{\sum _{e\in E}{\text {TP}}}{\sum _{e\in E}{\text {TP}}+\sum _{e\in E}{\text {FN}}} \end{aligned}$$
(4)

\(F^\mathrm{{micro}}\) is the harmonic mean of \(P^\mathrm{{micro}}\) and \(R^\mathrm{{micro}}\).

$$\begin{aligned} F^\mathrm{{micro}}=2\cdot \frac{P^\mathrm{{micro}}\times R^\mathrm{{micro}}}{P^\mathrm{{micro}}+R^\mathrm{{micro}}} \end{aligned}$$
(5)
[Table 3: Summary of related work in emotion recognition in text]
[Table 4: Features used for emotion recognition in text]
[Table 5: Strengths and limitations of the reviewed papers]

\(F^\mathrm{{macro}}\) computes the harmonic mean of precision and recall independently for each emotion label e and then takes the average, thereby treating all emotion labels equally.

$$\begin{aligned} {\text {precision}}_e= & {} \frac{\mathrm{TP}_e}{\mathrm{TP}_e+\mathrm{FP}_e} \end{aligned}$$
(6)
$$\begin{aligned} \mathrm{{recall}}_e= & {} \frac{\mathrm{TP}_e}{\mathrm{TP}_e+\mathrm{FN}_e} \end{aligned}$$
(7)
$$\begin{aligned} F_e= & {} 2\cdot \frac{\mathrm{{precision}}_e\times \mathrm{{recall}}_e}{\mathrm{{precision}}_e+\mathrm{{recall}}_e} \end{aligned}$$
(8)
$$\begin{aligned} F^{\mathrm{{{macro}}}}= & {} \frac{1}{|E|}\sum _{e\in E}F_e \end{aligned}$$
(9)

4.7 Summary

This section summarizes the reviewed state-of-the-art approaches in tabular form. Table 3 reports the language of the text, the approach, the corpus used for testing, and the results obtained by the reviewed approaches. Table 4 presents the features used by the classical learning-based, deep learning, and hybrid approaches. Table 5 presents the strengths and limitations of the reviewed approaches.

5 Discussion

Different approaches have been examined to address emotion recognition in text. Keyword-based approaches are the main approach for explicit emotion recognition, but they do not always succeed even there. If a sentence expresses an emotion but contains no word from the emotion keyword set, the emotion will not be recognized. Conversely, even when a sentence includes an emotion keyword, it is not guaranteed to express the corresponding emotion, because a word's meaning can change with context.

The main approaches for implicit emotion recognition are rule-based approaches, classical learning-based approaches, deep learning approaches, and hybrid approaches. Rule-based approaches are sensitive to the quality of the text: if the text is written in an informal style and contains many grammatical mistakes, they may fail to recognize the implicit emotion correctly, and an implicit emotion can only be recognized if a rule representing it exists in the rule set. Classical learning-based approaches need effective features to recognize implicit emotions; human-engineered features do not cover all the ways emotions are expressed, so many implicit emotions are mislabeled or missed, and only those the model was trained to recognize are handled successfully. Deep learning provides high-quality features and eliminates the need for feature engineering, one of the most time-consuming parts of machine learning practice; however, it requires a large quantity of training data. A hybrid approach can improve the results because it takes advantage of the approaches integrated into it, but it can also inherit their disadvantages and limitations.

This study shows that Chinese is the most dominant language after English in emotion recognition in text. Additionally, research has recently been published for other languages, including Arabic, Hindi, Indonesian, and Spanish. For emotion recognition in English, some researchers used one or more existing emotion-annotated corpora—Alm, Aman, ISEAR, SemEval-2007, SemEval-2018, and SemEval-2019—to measure the performance of their models, while others, such as Ma et al. [90], Shivhare et al. [137], Perikos and Hatzilygeroudis [116], Bandhakavi et al. [15], Douiji et al. [41], Yan and Turtle [169], and Haggag [63], evaluated their models on corpora they created themselves. For emotion recognition in Chinese, most of the work, except that of Lee et al. [86] and Wang et al. [159], evaluated performance on a self-built corpus, but all drew their text from the same resource, Sina Weibo, a Chinese microblogging Web site. Researchers who created their own corpus had the opportunity to test their model on more emotions; however, as the number of emotions increases, so does the difficulty of the problem, which reduces performance.

This study shows that the most used learning-based method is SVM; in comparison with other methods, SVM obtained the best results in almost all cases (it achieved the second-best result in Ghazi et al. [56]). The most important part of any learning-based approach is the features: the success of the approach depends on whether the correct set of features is selected. Representation learning gained attention following the success of the word embeddings of Mikolov et al. [95, 96]. This study shows that utilizing word embeddings and deep neural networks enhances performance. It also shows that hybrid approaches reached good results on the annotated datasets. Herzig et al. [65] tested the performance of their model, which combines the traditional text representation with word embeddings, on the Alm, ISEAR, and SemEval-2007 datasets. Their system obtained one of the best results on the ISEAR and Alm datasets but did not achieve good results on SemEval-2007; unlike SemEval-2007, the ISEAR dataset offers over 1000 instances per class. To overcome the dataset size limitation, many participants in the SemEval-2018 competition used transfer learning to pretrain the weights of their deep neural networks.

Although English corpora exist, some are not large enough to train deep learning approaches. Moreover, the accuracy of recognizing an emotion depends on how well balanced the dataset is. In the Alm dataset, only 9.86% of the sentences are labeled as expressing a positive emotion, and the results for this class are the worst. In SemEval-2019, low performance in recognizing the label happy was a problem common to the participants. The opposite holds for Yuan and Purver [173], where the highest accuracy was obtained for the happy label, which has the largest number of annotated sentences. In SemEval-2018, the highest reported result was 58.8, which is relatively low, and the dataset is highly imbalanced. Almahdawi and Teahan [5] tested the effect of downsampling all classes to the size of the smallest class, and the accuracy improved significantly.

Emotion recognition in text has made some progress in the last few years. Judging by the SemEval-2018 and SemEval-2019 participants, deep learning approaches are dominating the emotion recognition field. New language models have been created [27, 39, 67, 123], and strong deep models have been built along with creative attention mechanisms. Nevertheless, more research is needed to overcome the following challenges:

  • The difficulty of recognizing implicit emotions: Emotions are complex; humans themselves have trouble expressing and understanding them. Recognizing emotions in text is harder still because visible facial expressions, body gestures, and voice are absent. Automating emotion recognition is therefore a difficult task: a machine must deal with the complexity of linguistics and the context of the written text.

  • The quality of the datasets: The available datasets are not large enough to support the new trends, especially deep learning. Moreover, all of the datasets except ISEAR are imbalanced, which shifts the focus from the task of emotion recognition to dealing with the problems caused by underrepresented classes. Thus, high-quality data must be created to improve emotion recognition models.

  • The limited resources in languages other than English: Emotion recognition in other languages is not as advanced as in English. Thus, resources, including high-quality data and lexicons, must be created for other natural languages.

In the future, we predict that using pretrained word embeddings for emotion recognition in text will become standard practice and that new pretrained models will be developed. Furthermore, transfer learning will play a more important role, especially given the lack of large datasets for training deep learning models. Lastly, transformer [157] models will likely come to dominate deep learning models. The transformer is gaining popularity and has already been used in OpenAI's GPT-2, the successor to the GPT language model. Moreover, Dai et al. [32] proposed an improvement to the transformer that enables learning dependencies beyond a fixed length.

6 Conclusion

In this paper, we surveyed existing approaches for both explicit and implicit emotion recognition in text. Keyword-based approaches are mostly used for explicit emotion recognition; however, they fail to fully recognize implicit emotion in text due to the lack of linguistic information. The main approaches for implicit emotion recognition include rule-based approaches, classical learning-based approaches, deep learning approaches, and hybrid approaches. Rule-based approaches can only recognize implicit emotions that are already represented in their rule sets. Classical learning-based approaches can recognize implicit emotions, provided that the classifier has already been trained on such emotions; on the other hand, they do not require large training datasets to achieve reasonable performance. Deep learning approaches can outperform the other approaches, given a very large quantity of training data. Hybrid approaches generally inherit the advantages of the approaches integrated into them, but also their disadvantages and limitations. Although most of the best results are obtained by learning-based, deep learning, and hybrid approaches, other approaches performed rather well and deserve further investigation, including the compression-based approach and the constraint optimization approach.

The results of this work show that POS tagging, parsing, and other basic NLP tasks can strongly affect the performance of emotion recognition systems. This study also identified the sets of features used by the best-performing approaches and highlighted the features automatically extracted by deep learning models, which can capture both explicit and implicit information. Combining handcrafted features and word embeddings in classical machine learning or deep learning approaches represents a promising research avenue.