
1 Introduction

Depression is an increasingly common illness around the world, and its prevalence is likely to keep rising. Symptoms of depression may include restlessness, irritability, impulsivity, anxiety, palpitations, sadness, loss of energy, a sense of hopelessness and many more [26]. According to the World Health Organization (WHO), 3.8% of the population suffers from this disorder, which means that approximately 280 million people around the world live with depression [1]. In Europe, 6.38% of the population suffers from depression, ranging from 2.58% in the Czech Republic to 10.33% in Iceland [16]. Mental illness is often difficult to diagnose; however, the growth and globalization of social network usage can help to reduce the number of cases that go unnoticed. Social networks play a key role and are directly correlated with depression, as suggested by Yoon et al. [31]. Over the past few years, there has been an increase in the number of people interested in studying and using machine learning (ML) algorithms to create medical decision support systems [24]. This is due to the great evolution that has taken place in the industry in terms of computing power and the ever-increasing amount of data available [25].

To diagnose depression as quickly and accurately as possible, the sentiment behind social media posts should be examined [6]. This requires a system capable of processing, analyzing, and deriving knowledge from a diverse and unstructured data set. One specific domain within the field of ML, particularly deep learning (DL), is able to accomplish this task: natural language processing (NLP) is capable of understanding human language and extracting valuable information from it [32]. This has been a hot research topic in recent years, using data mining and ML techniques, and the potential of these techniques for clinical use is very high, as shown in the study by Ricard et al. [23]. Consequently, we propose an ML approach to detect the sentiment associated with a tweet made on the Twitter social network, which can be depressive, neutral, or non-depressive. Twitter was chosen because of its massive use, being the third most popular social network in the world. It also has a simple data model and an easily accessible API to collect data.

The objective of this paper is to create a predictive model capable of detecting a possible depressive feeling associated with a tweet, allowing worldwide improvements in the way people at risk of depression are detected. The main contributions are: (i) a description of a full ML pipeline that covers collecting data, processing it, training a classification algorithm, and evaluating its performance; (ii) a comparative analysis of different feature generation techniques and classification algorithms; (iii) a comparison of the achieved results and the proposed model with prior research works in the literature.

This paper has the following structure: Sect. 2 summarizes similar papers in the literature; Sect. 3 presents the methodology used, including information about the dataset, pre-processing techniques, label validation, exploratory data analysis, feature generation with TF-IDF and Word2Vec, the experimental setup and how the models were evaluated. Section 4 shows the results obtained with both ML and DL models. Section 5 presents the discussion, where we state the findings of this study and compare them with what has been done previously in the literature. Section 6 presents the conclusions we can draw from our work, including contributions, limitations, and future work.

2 Related Work

In the last few years, several works have proposed ways to automate the diagnosis of depression in a patient, aiming to curb the rising number of depression cases and to reduce the number of suicides caused by major depression.

The study proposed by [7] used the public dataset Sentiment 140, which contains data without signals of depression. In addition, the authors gathered a dataset with signals of depression by collecting Twitter data with the Twint tool. The following keywords were used to pick tweets with signals of depression: hopeless, lonely, antidepressant and depression. In pre-processing, stop words, punctuation marks and hyperlinks were removed, and the authors used lemmatization to group different forms of the same word. After pre-processing the data, feature generation was performed based on techniques such as tokenization, which separates the words in the text into a form that the machine understands. Valence Aware Dictionary and sEntiment Reasoner (VADER) was also used to extract the polarity of the tweets and obtain the overall emotion of the text, and finally Word2Vec was used to transform the text into word vectors.

After these data preparation steps, the dataset was divided into 60% for training, with the rest split between validation and testing. The data was then classified using two types of approaches: a) a Long Short-Term Memory (LSTM) network; b) a hybrid CNN-LSTM model. With the first approach, they obtained 90.33%, 91%, 91%, and 91% of Accuracy, F1-Score, Precision and Recall, respectively. With the second approach, they obtained 91.35%, 91%, 92% and 91% on the same metrics, respectively.

Another study, proposed by [9], improved on these results using a combination of Word2Vec and DL models. It achieved 99.02% Accuracy, 99.04% Precision, 99.01% Recall and 99.02% F1-Score for the LSTM network, and 99.01% Accuracy, 99.20% Precision, 99.01% Recall and 99.10% F1-Score for the hybrid CNN+LSTM model.

Many works make use of feature extraction tools such as Bag-of-Words (BOW), Tokenizer and TF-IDF models. In the study proposed by [29], the authors aimed to predict whether a person was not depressed, half depressed, moderately depressed or severely depressed, so they used the unsupervised K-Means clustering algorithm to label the tweets. Decision Trees, Random Forest and Naive Bayes algorithms were used for the classification of the tweets, with the dataset split into 80% for training and 20% for testing. In the end, the performance of the algorithms was evaluated through the classification metrics Accuracy, F1-Score, Recall, Precision and R-Score. The combination of TF-IDF feature generation with the Random Forest algorithm stood out in this approach, obtaining 95% Accuracy, 95% Precision, 95% Recall, 67% R-Score and 95% F1-Score.

The authors of [27] used a public dataset with 43000 tweets. Each tweet went through pre-processing, which consisted of removing non-alphabetic characters (e.g., HTML tags, punctuation, hashtags, numeric values, special characters, URLs), normalizing the tweets by converting the text to lowercase, removing stop words (e.g., prepositions, conjunctions, and articles), and finally applying stemming. Since ML algorithms cannot process raw text, TF-IDF was used to extract features, which were then provided as input to the model. As a result, using Multinomial Naive Bayes, they achieved 72.97%, 74.58% and 75.04% of accuracy, precision and recall, respectively.

Alsagri et al. [5] used almost the same pre-processing steps, but the data was obtained through the Twitter API and in a much smaller amount, about 3000 tweets. Their approach is also distinct in that it tries to classify the user as depressive based on the various tweets associated with them. Using TF-IDF, it obtained 82.50% accuracy, 73.91% precision, 85% recall, 79% F1-Score and, finally, 77.50% AUC.

Kabir et al. [14] proposed a new typology for diagnosing depression in Twitter messages, called DEPTWEET, and introduced a unique clinically validated labelled dataset in which a confidence score was assigned to each label. The Twitter messages were retrieved using the Twint tool, and the search keywords were defined based on the PHQ-9 questionnaire for depression. They classified each tweet as one of four possible values: non-depressed, mildly depressed, moderately depressed, or severely depressed. As classifiers, the authors used Support Vector Machine (SVM), Bidirectional LSTM (BiLSTM), and two pre-trained transformer-based models: BERT and DistilBERT. The ROC score was chosen as the evaluation metric, and the best results were obtained with the transformer-based models, with DistilBERT standing out at 78.88% for non-depressed, 74.72% for mild, 78.79% for moderate and 86.60% for severe depression.

When compared to the literature, our work proposes a collection of tweets using the Twint tool and the assignment of a label (POSITIVE, NEGATIVE, or NEUTRAL) to the text through the sentence polarity score produced by VADER. In addition, we perform a manual validation of each phrase present in the data to ensure that the classes are assigned correctly, thus reducing the probability of error in the label assignment process used in [14, 28]. Word2Vec and TF-IDF were also used for feature generation. Finally, to classify the text, we use two DL architectures (LSTM and hybrid CNN+LSTM) and several ML algorithms.

3 Methodology

The methodology proposed in this article to detect depression consists of 3 steps: i) collecting data from Twitter using the Twint tool, then cleaning and categorizing the collected tweets; ii) manual validation of the label assigned to each tweet and data augmentation; iii) generating features for each sentence using TF-IDF and Word2Vec to train ML and DL algorithms, as shown in Fig. 1.

Fig. 1. Experimental setup.

3.1 Dataset

The data collection for this work consisted of acquiring Twitter data with signs of depression using the Twint tool. The keywords lonely, depressed, frustrated, hopeless and antidepressant were used to obtain phrases with signs of depression [7].

Using Twint, we configured the search parameters, such as the search keyword, the tweet limit, and the language in which the tweets are written. Twint then systematically extracts the desired data from Twitter, aggregating tweets, user details, and even interaction metrics. At the end of this process, we have a dataset ready to be processed and analyzed.

All tweets were gathered on November 3, 2022. Twint starts by retrieving the latest tweets and continues to fetch older tweets until it reaches a stopping condition, such as a specified number of tweets or a time limit; in our case, it stopped when it reached the limit of 3500 collected tweets. Each keyword thus produced a dataset of 3500 tweets, and these were later merged into a final dataset with 17500 records.
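As a rough illustration, a collection script along these lines could drive Twint; the keyword list and per-keyword limit come from the text above, the configuration attributes follow Twint's documented interface, and the output file names are our own assumption.

import twint

# Keywords used to retrieve tweets with signs of depression [7].
KEYWORDS = ["lonely", "depressed", "frustrated", "hopeless", "antidepressant"]

for keyword in KEYWORDS:
    config = twint.Config()
    config.Search = keyword          # one search keyword per run
    config.Limit = 3500              # stop after 3500 tweets per keyword
    config.Lang = "en"               # restrict to English tweets
    config.Store_csv = True          # persist results to disk
    config.Output = f"tweets_{keyword}.csv"  # hypothetical output path
    twint.run.Search(config)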

3.2 Data Pre-processing

When collecting tweets for analysis, it is crucial to account for the presence of noise resulting from the limitations of the collection process, which is based on a single keyword and offers no control over the content of the retrieved tweets. Therefore, pre-processing techniques were applied to the tweets. First, the sentences were normalized by converting them to lowercase. Next, all hyperlinks, hashtags, identifications of other users and emojis were removed. In addition, all stopwords were removed from the collected tweets through the stopwords function of the nltk library. These steps have been suggested in several previous works [11, 15, 22].
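A minimal sketch of this cleaning step is given below, assuming simple regular expressions for the removals described above; the exact patterns are illustrative rather than the paper's verbatim code.

import re
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    text = text.lower()                              # normalize to lowercase
    text = re.sub(r"https?://\S+", "", text)         # remove hyperlinks
    text = re.sub(r"[#@]\w+", "", text)              # remove hashtags and user mentions
    text = text.encode("ascii", "ignore").decode()   # drop emojis and other non-ASCII symbols
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean_tweet("Feeling so hopeless today https://t.co/abc #depression @friend"))
# -> "feeling hopeless today"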

After the collected data had been cleaned, the VADER tool was used to categorize the tweets into Positive, Negative and Neutral. VADER is widely used in sentiment analysis tasks due to its simplicity, effectiveness, and ability to handle domain-specific and colloquial language. It was introduced by Hutto et al. in 2014 [13] and has been employed in various applications, including social media monitoring, customer feedback analysis, and opinion mining [3, 8, 10, 12, 20]. The labelling of the tweets was based on the compound score produced by VADER: the label Positive was assigned for compound scores greater than or equal to 0.05, the label Negative for compound scores less than or equal to -0.05, and values between -0.05 and 0.05 were identified as Neutral. These thresholds are the ones recommended in the article by Hutto et al. [13].
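A sketch of this labelling rule, using the VADER implementation shipped with NLTK (the thresholds are those of Hutto et al. [13]):

from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

analyzer = SentimentIntensityAnalyzer()

def label_tweet(text: str) -> str:
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:       # Positive threshold recommended in [13]
        return "Positive"
    if compound <= -0.05:      # Negative threshold
        return "Negative"
    return "Neutral"           # everything in between

print(label_tweet("i feel hopeless and alone"))  # -> "Negative"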

During pre-processing, a check was made for the existence of missing values, and they were removed as recommended by [21]; this cost 1057 of the 17500 records in the final dataset. Besides the missing values, the existence of duplicate data was also verified and, as suggested by [30], duplicates were removed, eliminating a further 955 records.
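Assuming the merged tweets live in a pandas DataFrame with a text column (the file and column names are our assumption), these two steps reduce to:

import pandas as pd

df = pd.read_csv("tweets_merged.csv")        # hypothetical merged dataset (17500 records)
df = df.dropna(subset=["tweet"])             # drop records with missing text [21]
df = df.drop_duplicates(subset=["tweet"])    # drop duplicate tweets [30]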

3.3 Manual Validation of Label

The manual validation process for the sentences in the collected dataset started by excluding sentences with fewer than three words after pre-processing, since sentiment cannot be determined from a sentence of fewer than three words [18]. This step eliminated 1520 sentences.

After this validation, the resulting sentences were checked to determine whether the label assigned by the VADER algorithm was correct. In instances where the pre-processed sentence was ambiguous, we turned to the corresponding original sentence to better understand its meaning before deciding. When incorrect labels were identified, appropriate corrections were made to ensure accuracy.

Furthermore, as part of this procedure, a check was performed to ensure that the collected sentences were relevant to the theme of the study. Sentences whose semantics were incorrect, or which were in a different language, were discarded. In total, 10,920 sentences were subjected to validation; of these, 8,519 were removed from the dataset, leaving 2,512 sentences for the study. This high number of discarded sentences is due to several factors: (i) many of the collected tweets did not fit the topic of depression, often consisting of reviews, opinions, quotes or other types of text unrelated to a depressive feeling; (ii) despite the Twint settings, some tweets came in other languages and were therefore removed; (iii) a few tweets were ambiguous or even contradictory about the possible associated sentiment, and we decided to discard them.

3.4 Exploratory Data Analysis

The purpose of the exploratory data analysis was to examine the distribution of the assigned labels before and after manual validation and dataset augmentation. Figure 2 illustrates the class distribution prior to validation and augmentation; a clear class imbalance exists, as VADER tends to assign a strong Negative sentiment label to sentences containing negative keywords. Figure 3 presents the class distribution after the dataset went through validation and augmentation, revealing that the data is now almost perfectly balanced.

Furthermore, we built three Word Clouds to visualize which words are present in the collected sentences. The Word Clouds for the sentences considered positive and negative are shown in Fig. 4 and Fig. 5, respectively. Finally, Fig. 6 shows the words of the sentences considered neutral.

Fig. 2. Distribution of the classes before the manual validation and augmentation.

Fig. 3. Distribution of the classes after the manual validation and augmentation.

3.5 Feature Generation

This section presents the algorithms selected to generate features, following the approaches described in the literature. This work used the TF-IDF algorithm, which vectorizes documents by computing a score for each word based on its importance in the document and in the corpus [2]. The Word2Vec algorithm represents all the words of the sentences extracted from Twitter as embeddings based on word similarity [4]; it was configured with an embedding size of 300 and a window size of 10.
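The sketch below shows both feature generators on a toy corpus. The embedding and window sizes come from the text; averaging word vectors into a sentence vector is a common choice we assume here, not one stated by the paper. Note that Gensim 3.x (the version used) names the embedding-size parameter `size`, renamed `vector_size` in Gensim 4.x.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

sentences = ["feeling hopeless and tired", "lovely day with friends"]  # toy corpus

# TF-IDF: one sparse vector per document, scoring words by importance.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(sentences)

# Word2Vec with the configuration from the text: embedding size 300, window 10.
tokenized = [s.split() for s in sentences]
w2v = Word2Vec(tokenized, size=300, window=10, min_count=1)  # `vector_size=` in Gensim 4.x

def sentence_vector(tokens):
    # Assumed aggregation: average the embeddings of a sentence's words.
    return np.mean([w2v.wv[t] for t in tokens if t in w2v.wv], axis=0)

X_w2v = np.vstack([sentence_vector(t) for t in tokenized])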

Fig. 4. Word Cloud with label Positive.

Fig. 5. Word Cloud with label Negative.

3.6 Experimental Setup

For the development of this work, the Pycaret library in version 2.3.10, TensorFlow in version 2.11.0, Scikit-learn in version 1.2.0, NLTK in version 3.7 and Gensim in version 3.6.0 were used. In addition, the nlpaug library in version 1.1.11 was used for data augmentation, applied to increase the number of instances and their diversity when training the ML algorithms. The augmented sentences were generated by replacing certain words with synonyms throughout the sentence, always maintaining the meaning and the associated sentiment. This step also allowed the target to be balanced: from 2512 sentences, the final dataset grew to 5032 instances.
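A minimal sketch of the augmentation step with nlpaug's WordNet-based synonym augmenter (the augmenter needs NLTK's wordnet data; the example sentence is ours):

import nlpaug.augmenter.word as naw  # requires nltk.download("wordnet")

# Replace some words with WordNet synonyms, preserving meaning and sentiment.
aug = naw.SynonymAug(aug_src="wordnet")

original = "i feel so hopeless and tired of everything"
augmented = aug.augment(original)  # returns a list in nlpaug 1.1.11; earlier versions return a str
print(augmented)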

After augmenting the dataset and generating the features with Word2Vec and TF-IDF, several experiments were performed combining ML and DL methods with the two feature generation techniques. The ML algorithms were implemented using the Pycaret library, and the top 10 algorithms with the best performance were selected. All models were run with their default parameter configurations.
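With PyCaret 2.3, this selection collapses to a few lines; the DataFrame layout and column names below are our assumption, not the paper's.

import pandas as pd
from pycaret.classification import setup, compare_models

# df holds one column per generated feature plus a "label" target column (assumed names).
df = pd.read_csv("features_word2vec.csv")  # hypothetical feature file

setup(data=df, target="label", session_id=42, silent=True)  # silent=True skips the dtype prompt
top10 = compare_models(n_select=10)  # train candidate algorithms, keep the 10 best performers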

Fig. 6. Word Cloud with label Neutral.

On the other hand, the hybrid model and the LSTM network were built with the Sequential API. The hybrid model consists of 1 Embedding layer, 2 Conv1D layers, 2 BatchNormalization layers, 2 MaxPooling1D layers, 1 LSTM layer, 4 Dense layers and 3 Dropout layers. The LSTM network consists of 1 Embedding layer, 1 LSTM layer, 1 Flatten layer, 5 Dense layers and 4 Dropout layers. Both DL models use SparseCategoricalCrossentropy as the loss function and SGD as the optimizer, with a learning rate of 1.0e-04 in the experiment with TF-IDF and a learning rate of 1.0e-02 in the experiment with Word2Vec. They were trained for 50 epochs with a batch size of 32, and early stopping with a patience of 5 was applied during training.
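The layer counts, loss, optimizer, and training schedule below follow the text; filter sizes, unit counts, dropout rates, vocabulary size, and sequence length are illustrative assumptions, so this is a sketch of the hybrid model rather than the exact configuration.

import tensorflow as tf
from tensorflow.keras import Sequential, layers

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 100, 3  # assumed dimensions

hybrid = Sequential([
    layers.Embedding(VOCAB_SIZE, 300, input_length=MAX_LEN),    # 1 Embedding
    layers.Conv1D(64, 5, activation="relu"),                    # Conv1D / BatchNorm / MaxPool x2
    layers.BatchNormalization(),
    layers.MaxPooling1D(2),
    layers.Conv1D(32, 5, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling1D(2),
    layers.LSTM(64),                                            # 1 LSTM
    layers.Dense(128, activation="relu"), layers.Dropout(0.3),  # 4 Dense, 3 Dropout
    layers.Dense(64, activation="relu"), layers.Dropout(0.3),
    layers.Dense(32, activation="relu"), layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

hybrid.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2),  # 1e-4 was used with TF-IDF
    metrics=["accuracy"],
)

early_stop = tf.keras.callbacks.EarlyStopping(patience=5)
# hybrid.fit(X_train, y_train, validation_split=0.1,
#            epochs=50, batch_size=32, callbacks=[early_stop])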

3.7 Evaluation

Before proceeding to the classification of the tweets with the ML and DL algorithms, the dataset was divided into 70% for training and 30% for testing in all experiments performed [17].

The performance of the ML algorithms was evaluated through the classification metrics Accuracy, F1-Score, Recall, Precision and AUC. The performance of the DL algorithms was assessed with Accuracy, Loss, F1-Score, Recall and Precision. All metrics were calculated on the test set.
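For illustration, the split and metric computation look like the following with scikit-learn; the synthetic features stand in for the actual TF-IDF/Word2Vec matrices.

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Toy stand-in for the generated features and the three sentiment labels.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=42)

# 70% training / 30% testing split, as used in all experiments.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

clf = ExtraTreesClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="weighted")
print(f"Accuracy={acc:.4f} Precision={prec:.4f} Recall={rec:.4f} F1={f1:.4f}")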

4 Results

Only the best outcomes from the experiments conducted are presented in this section. Table 1 presents the top 10 ML algorithms with feature generation performed by the Word2Vec algorithm, and Table 2 presents the top 10 ML algorithms where the TF-IDF algorithm was used to generate the features.

Tables 3 and 4 present the best results obtained with the DL algorithms using the TF-IDF and Word2Vec algorithms to generate features, respectively. The best results are marked in bold.

Table 1. Top 10 ML algorithms with Word2Vec.
Table 2. Top 10 ML algorithms with TF-IDF.

Analysing the results, we conclude that the Extra Trees Classifier combined with TF-IDF, with 84.83%, 85.01%, 84.83% and 84.87% of Accuracy, Precision, Recall and F1-Score, respectively, is the most accurate solution to predict whether a person has depression from the sentences collected from Twitter. The SVM algorithm shows a close performance, with 80.86%, 82.32%, 82.32% and 82.31% on the same metrics.

Based on these results, we can conclude that the DL algorithms do not demonstrate a good performance in predicting depression from the gathered tweets. The combination of the hybrid model with Word2Vec achieved the best DL result: 52.05% Accuracy, 59.25% Precision, 52.15% Recall, 51.92% F1-Score and a Loss of 1.0313.

Table 3. Results of the DL algorithms with TF-IDF.
Table 4. Results of the DL algorithms with Word2Vec.

5 Discussion

This comparative study between ML and DL algorithms uses different feature generation techniques. By examining how people express themselves on social media, we can, with a certain degree of confidence, determine whether they are feeling depressed. It serves as a benchmark for future research in sentiment analysis, offering a methodology that collects raw tweets and processes those sentences to predict the sentiment the person intends to express. In addition, it demonstrates the use of data augmentation tools in NLP problems.

Based on the findings of this study, it is clear that the Extra Trees Classifier, together with the TF-IDF feature generation technique, achieves good prediction results. The same results also show that the DL algorithms were inferior to the ML algorithms in both combinations, with one exception: when using the Word2Vec algorithm, the hybrid model outperformed algorithms such as Naive Bayes, Ada Boost, Gradient Boosting, K-Neighbors Classifier and Linear Discriminant Analysis. This can be explained by the low amount of data available and its high complexity [19]. Even so, we believe that DL techniques have the potential for better performance with a larger and more diverse dataset.

The best result presented in this study proves to be superior to the results obtained by a pre-trained transformer-based model [14], in which the authors collected data from Twitter through the Twint tool, had the sentences validated by a doctor, and selected keywords based on previously administered questionnaires. On the other hand, studies that used deep learning algorithms reported results significantly higher than those demonstrated in this work [7, 9]; however, the structures of those algorithms and their configurations were not available to justify the results obtained.

In addition, most works in the literature used a larger dataset than the one used in this work. Some used a public dataset such as Sentiment 140, which is already labelled and in some cases medically validated, and has a larger number of instances [27]. Others collected data following the same methodology as this study but also used a validated public dataset to increase the number of tweets and to balance the target [7, 9]. In contrast, the authors of this study collected sentences from Twitter and validated the sentiment expressed in them themselves.

The use of a two-step validation process, involving VADER first and manual review afterwards, enhances the accuracy of the sentiment associated with each phrase, thereby increasing confidence in the obtained results, despite the potential bias associated with the authors' interpretation. It is also worth noting that the Extra Trees Classifier presents an AUC of 95.15%, which means that our ML algorithm has a good ability to distinguish between the Positive, Negative and Neutral classes.

Table 5 presents a comparison between the proposed method and previous works in the literature.

Table 5. Comparison of the results with the literature.

When evaluating the insights provided by the literature, it remains unclear whether the algorithms' impressive performance translates into effective class distinction. Our study serves as an example: although the SVM and Ridge Classifier algorithms exhibit strong predictive capabilities for sentiment analysis on tweets, a closer examination of the AUC values reveals their inability to effectively differentiate between the classes within the dataset.

This work shows that it is possible to classify sentences from Twitter according to the associated sentiment, even in sentences with many grammatical gaps, a heterogeneous vocabulary and frequent spelling and grammatical errors. According to the authors, using raw data comprising colloquially written phrases that closely reflect real-life experiences enhances the classifier's performance and adds greater significance to the results. This approach is seen as more valuable than models trained on transformed phrases, free of spelling or grammatical errors and homogeneous in style, which fail to capture the authentic reality of social media.

6 Conclusion

Depression is a highly common illness in our society, characterized by feelings of sadness, lack of interest, and potential psychological and physical harm. Individuals with depression tend to engage more with social media compared to those without the condition. Detecting the underlying emotions expressed in social media posts could aid in identifying and monitoring individuals who require mental health support, ultimately enhancing their well-being.

Throughout this work, a predictive model capable of predicting whether a given Twitter phrase has a negative, neutral, or positive sentiment was developed. The best DL model was the hybrid combination (CNN + LSTM) with Word2Vec, achieving a low accuracy value (52.05%). Overall, the best model created was the Extra Trees Classifier combined with TF-IDF, achieving 84.83% accuracy. We therefore conclude that the Extra Trees Classifier with TF-IDF is the best combination to predict a possible depressive feeling associated with a sentence.

Nevertheless, this work has several limitations. The dataset size is limited, which can make it very difficult to create a model able to generalize to external examples; future research will be needed to see whether the model can maintain this performance on external data. In addition, our validation of the sentences will always carry a bias associated with the reviewers' interpretation, which may not correspond to the real feeling behind a sentence. The use of Large Language Models (LLMs) could automate and improve the data labelling process, potentially leading to better classifier performance.

In future work, we will investigate how to generalize the created model to external data. The augmentation technique used can introduce a potential bias: despite considerably increasing the volume of data, the variance remains low, which can affect the model's performance. To address this, multiple different augmentation techniques and classifiers can be analyzed to improve the results.