1 Introduction

The advancement of internet and communication technology [19, 21], particularly online social media, has revitalised how people engage with one another and express themselves in public. Users of online social media platforms such as Facebook, X (formerly Twitter), and Instagram can express their emotions and sentiments about a topic, subject, or any social or general issues in writing or through multimedia content. These social media platforms have given people infinite power to communicate their feelings, emotions, and thoughts [15, 42], which was previously confined to only physical contacts.

Depression is a common illness that affects an estimated 3.8 percent of the global population, including 5.0 percent of adults and 5.7 percent of people over the age of 60Footnote 1. Depression affects around 280 million individuals globally and can be harmful to one’s health, particularly if it occurs regularly and with moderate or severe intensity. It can cause a person to suffer greatly and perform poorly at work, school, and at home. At its worst, depression can lead to suicide, which is the fourth leading cause of mortality among people aged 15 to 29. Despite the fact that there are recognised, effective therapies for mental disorders, more than 75% of people in low- and middle-income countries do not receive therapy and do not discuss their situation with others. A lack of funds, a shortage of competent health-care staff, and the societal stigma associated with mental illnesses, all serve as impediments to proper diagnosis and treatment. During a depressive episode, the person feels dissatisfied, irritable, or empty for the most of the day, almost every day for at least two weeks. Impaired focus, feelings of excessive guilt or poor self-worth, hopelessness about the future, thoughts of dying or suicide, disrupted sleep, changes in diet or weight, and feeling particularly tired or low in energy are all possible symptoms. Depression can be caused by job discontent, everyday life troubles, sickness issues, a terrible relationship, exam failure, comparing oneself to others, financial situations, or an overdose of medications [28]. Social withdrawal and isolation are primary consequences of depression [27, 38] and other mental diseases [4, 52], which may lead to suicide.

Online social sites provide a simple platform for people of all ages, even those who are depressed. Examples of depressive posts are shown in Figs. 1 and 2.

Fig. 1
figure 1

An example of a tweet about depression

The detrimental effects of depression, as well as the lack of resources in many countries throughout the world, inspired us to create a system for identifying a person suffering from depression by analysing his social media activities, particularly tweets. One of the most significant challenges to developing a machine or deep learning based system for depression detection is an adequate dataset. We began by gathering depression-related datasets from Twitter as people commonly use these social media platforms to express their feelings and emotions [50]. So identifying different people’s mental levels by looking at their social media posts can be a viable alternative. Most people utilise unpleasant or negative posts to describe their level of depression. According to these findings, peer-to-peer communities on social media can challenge stigma, raise the possibility of professional aid, and directly offer online assistance to persons suffering from mental illness. In the United States, a similar study [5] found that internet users with stigmatisation such as depression or urinary incontinence are more likely to use online resources for health-related information and communication than people with other chronic diseases, emphasising the importance of research to assist depressed individuals on social media platforms and the Internet.

Fig. 2
figure 2

Instance of depressive Tweets

The openness of social media allows people to openly and freely engage and comment on any issue on these media; on the other side, it allows specialists working in the health area to obtain insight into what is going on in someone’s mental state. Machine-learning algorithms [14] may be able to provide such insight by evaluating the unique patterns buried in online communication and processing them to reveal the mental state (such as despair, sadness, happiness, rage, anxiety, etc.) among social media users.

Despandey et al. [12], Lin et al. [32], and Orabi et al. [41] were among the researchers who attempted to diagnose depressive persons based on their tweets. On their proprietary dataset, they used both traditional machine-learning and deep-learning models.

In light of the above discussion, this paper focuses on the following Research Questions (RQs):

  1. 1.

    RQ1: How does E-CLS improve depressive and non-depressive tweet classification compared to the individual models?

  2. 2.

    RQ2: What is the impact of word embedding and TF-IDF in E-CLS for sentence categorization?

The notable contributions of this research, including answers to the above raised research questions, are summarized as follows:

  1. 1.

    To mitigate the scarcity of publicly available depressive datasets, this paper proposes a new Depressive Dataset (DepDs), which has been labeled manually by a psychiatric expert, a professor, and a post graduate student (M. Tech.) DepDs consist of 5,000 depressive sentences and 7029 non-depressive sentences.

  2. 2.

    ARQ1: We propose an ensemble model, named E-CLS, that effectively utilities the GloVe and BERT embeddings and TF-IDF features of the DepDs dataset for depression detection.

  3. 3.

    ARQ2: The integration of word embedding and TF-IDF characteristics within the E-CLS network significantly amplifies its effectiveness in categorizing depressive and non-depressive sentences. Word embedding captures semantic relationships, while TF-IDF quantifies term importance. Their combined utilization enriches the model’s understanding of text, leading to more accurate categorization outcomes.

  4. 4.

    We investigated the performance of seven conventional machine-learning classifiers (Support Vector Machine (SVM), Logistic Regression (LR), K-Nearest Neighbor (KNN), Naive Bayes (NB), Gradient Boosting (GB), Decision Tree (DT), and Random Forest (RF)) and five deep-learning classifiers (Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) Network, Bi-LSTM (Bi-Directional LSTM), Attention-based (LSTM-A), and Bidirectional Encoder Representations from Transformers (BERT)). All the classification models are evaluated on DepDs dataset. Among these, E-CLS model achieves an impressive \(F_{1}\)-score of 0.91, surpassing existing machine-learning and deep-learning models by 2%.

To the best of our knowledge, this is the first study on the detection of depressive utterances using an ensemble of a traditional machine-learning model SVM and two deep-learning models, namely, CNN and LSTM-A, leading to an ensemble model E-CLS .

The remainder of the paper is organized as follows: Section 2 addresses the related literature of depression detection briefly. The dataset section is described in sction 3. The suggested model and its operations are detailed in Section 4. The findings of the proposed methodology are listed in Section 5. The detailed discussion is dictated in Section 6. Finally the study is concluded in Section 7.

2 Related works

Several studies on detecting depressive posts on social media have lately been published. One group of researchers [3, 6, 9, 11, 22, 25, 30, 43] used traditional machine-learning strategies to detect depressive sentences. They extracted text features such as n-grams, word tokens, TF-IDF vectorizer, sentiment polarisation, and so on. Another group of researchers, [10, 29, 41, 44, 47, 53,54,55,56], used deep-learning strategies to detect depressive statements in text. These researchers used several embedding approaches such as GloVe, word to vector (Word2Vec), and fastText. Tables 1 and 2 contain a summary of some of the prospective efforts for depression identification from social media posts.

Table 1 Prominent works in literature based on conventional machine-learning approaches
Table 2 Prominent works in literature based on deep-learning approaches

This section is divided into two subsections for better organisation of the associated literature: (i) depression detection using machine-learning approaches and (ii) depression detection using deep-learning approaches.

2.1 Depression detection using conventional machine-learning approaches

This section highlights research in which researchers used machine-learning techniques to identify depressed postings. Table 1 summarizes prominent works in existing literature based on conventional machine learning approaches. Khan et al. [25] used traditional machine-learning classifiers such as Multinomial NB, SVM, Random Forest RF, Decision Tree DT, KNN, and Extreme GB with countvector of the text data as a feature to achieve an accuracy of 0.87. Arun et al. [3] used an XGB classifier to reach an accuracy of 0.98 on clinical data comprising cognitive performance, mental health, and cardiometabolic illness. Hosseinifard et al. [22] employed KNN, Linear Discriminant Analysis (LDA), and LR on non-linear aspects of an electroencephalogram (EEG) signal to reach an accuracy of 0.90. Cohn et al. [9] contrasted clinical diagnosis of major depression with automatically measured facial motions and vocal prosody in patients undergoing depression treatment. To evaluate facial and vocal expression, they used manual Facial Action Coding System (FACS) coding, active appearance modelling (AAM), and pitch extraction. They had an accuracy of 0.88, 0.79, and 0.79 for diagnosing depression in a patient using FACS, AAM, and vocal prosody-based characteristics, respectively. Choudhury et al. [11] used crowdsourcing methods to produce a large corpus of depressive vs non-depressive datasets. They performed Principle Component Analysis (PCA) on the text characteristics to minimise their dimension before classifying them with SVM with an accuracy of 0.75. Jain et al. [24] suggested a depression analysis and suicidal ideation detection method based on the Parent Health Questionnaire (PHQ), which includes questions like “What’s your age?” or “Are you a regular in school/college?”. They tested multiple machine-learning models on the available answer data and reached a maximum accuracy of 0.86 with Logistic Regression. Asad et al. [1] gathered data on depression from Facebook and X (formerly Twitter). They achieved an accuracy of 0.74 by combining a text feature based on TF-IDF with SVM and NB classifiers.

Sadeque et al. [45] provided latency-based metrics for early identification of depressive posts and demonstrated that the new metrics outperformed the current statistic Early Risk Detection Error (ERDE). Peng et al. [43] presented a multi-kernel SVM-based model with user microblog content, user profile, and user behaviours as features to recognise depressed persons and achieved an F1-score of 0.76. Leiva et al. [30] employed numerous cutting-edge machine-learning models, as well as TF-IDF vectorizer and sentiment polarization, to reach an F1-score of 0.59. Islam et al. [23] worked on Facebook depressed data posted by diverse individuals. Various aspects such as emotional variables, temporal categories, and language dimension were extracted from these posts and employed with a KNN to distinguish depressive and non-depressive people with an F1-score of 0.61. Nadeem et al. [37] collected depressive and non-depressive data using a crowdsourced method. On this data, many typical machine-learning classifiers, such as DT, SVM, LR, NB, and Ridge regression were used, yielding an F1-score of 0.84. Hassan et al. [17] presented a model to detect a person’s depression state by monitoring and extracting emotions from text. The author achieved an accuracy of 0.91 by combining SVM, NB, and maximum entropy classifiers with N-gram, Part of Speech (POS), Negation, and Sentiment Analyzer data. Burdisso et al. [6] presented a new framework called Sequential S3 (Smoothness, Significance, and Sanction). When evaluated on the CLEF’s eRisk2017 datasetFootnote 2, this SS3 framework used word token as a feature and produced an F1-score of 0.61, which was much higher than other state-of-the-art machine-learning and deep-learning models. Almeida et al. [2] introduced an ensemble classification strategy that includes supervised learning, information retrieval, and feature selection methods and achieved an F1-score of 0.53.

2.2 Depression detection using deep-learning approaches

Table 2 summarizes prominent works in existing literature based on deep learning learning approaches. Zogan et al. [56] suggested a Multi-Modal Depression Detection Hierarchical Attention Network (MDHAN) with social information, interaction, emoji sentiment, topic distribution, and symptoms as features and attained an \(F_1\) score of 0.89. Oh et al. [40] used the depression score among panic disorder, generalised anxiety disorder, and depressive disorder diagnostic modules with dense-layer, feedforward, and neural network to achieve an Area Under Curve (AUC) of 0.91 and AUC2 of 0.89 respectively. Mumtaj et al. [36] used EEG signals to extract relevant features and used the one-layered Deep CNN and LSTM models to reach an accuracy of 0.98. Lam et al. [29] achieved a \(F_1\) score of 0.87 by using 1DCNN and transformer deep-learning models using transcript text and Mel-frequency spectrogram as features. Yang et al. [55] employed deep-learning models Deep neural network (DNN) and Deep CNN (DCNN), as well as paragraph vector, Histogram of Displacement Range (HDR), Geneva Minimalistic Acoustic Parameter Set (GeMAPS), and Low Level Descriptors (LLD), to achieve Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) of 5.974 and 5.163, respectively. Orabi et al. [41] achieved an F1 score of 0.87 by combining Bi-LSTM and CNN deep-learning models with Word2Vec embedding techniques. Wu et al. [54] proposed a model called Deep Learning-based Depression Detection with Heterogeneous Data Sources (D3-HDS) to predict an individual’s depression label by assessing his/her living environment, behaviour, and social media posting contents. They achieved an accuracy of 0.74 using the Word2Vec embedding technique.

Shetty et al. [49] presented a model to predict depression in people based on their online behaviour. To reach an accuracy of 0.95, an ensemble of LTSM, CNN, and SVM, as well as TF-IDF, count vectorizer, and n-gram, were used. Wang et al. [51] investigated deep-learning methods with pre-trained language representation models for depression risk prediction, including BERT, a robustly optimised BERT pretraining approach (RoBERTa), and generalised autoregressive pretraining for language understanding (XLNET), and compared them to previous methods on a manually annotated benchmark dataset. Their results produced an \(F_1\) score of 0.86. The depression risk was rated at four levels ranging from 0 to 3, with 0 denoting no inclination and 1, 2, and 3 denoting mild, moderate, and severe depression risk, respectively. Wongkoblap et al. [53] built a predictive MIL model with and without anaphora resolution to recognise users with depression from their generated textual material on Twitter, as well as predictive models that may highlight posts pertinent to mental health. The authors utilised the GloVe embedding approach and received an \(F_1\) score of 0.92. Cong et al. [10] utilised a hybrid deep-learning model that combined XGBoost with attention-based Bi-LSTM. The XGBoost component was utilised to deal with the imbalance dataset, while attention-based Bi-LSTM was employed to improve the suggested model’s performance. The authors claimed to have an \(F_1\) score of 0.61. Rosa et al. [44] proposed a Knowledge-Based Recommendation System (KBRS), a new system for monitoring and detecting potential users with emotional problems based on language classification with depressive or stressed content. For character-level representation a CNN is used and Bi-LSTM-RNN is used for disorder entity recognition. The researcher reported an accuracy of 0.89. Shah et al. [47] proposed a hybrid deep-learning model comprised of Bi-LSTM with various word embedding approaches and metadata variables to diagnose sad people using social media posts. The authors claimed to have an F1 score of 0.71.

To the best of our knowledge, this work is the first study on ensemble of one conventional machine-learning (SVM) model and two deep-learning models (CNN, LSTM) to detect depressive posts.

3 Dataset

3.1 Depressive dataset creation

To assemble our Depressive Dataset (DepDs), we meticulously compiled a comprehensive collection of English tweets. The selection of specific keywords and hashtags, such as “depressed", “sad", “anxiety", and “lonely", aimed to capture a diverse spectrum of tweets reflecting various aspects of mental health expression on social media. This careful curation ensured the richness and representativeness of the dataset, aligning with our primary goal of developing a robust model for the classification of depressive and non-depressive tweets.

3.1.1 Collection procedure

To address the lack of publicly available X (formerly Twitter) datasets focused on identifying depressive posts, we initiated the creation of a new dataset leveraging the X (formerly Twitter) API. The collected dataset comprises 12,029 English tweets. For detailed insights of the tweet collection process is presented in Algorithm 1.

Algorithm 1
figure a

Collect Tweets from Twitter API

3.1.2 Annotation

The collected tweets exhibited characteristics such as hashtags and Uniform Resource Locators (URLs) pointing to various websites and multimedia content. These features served as valuable cues for the labelling process, aiding in the classification of tweets as either depressive or non-depressive presented in Algorithm 2.

Algorithm 2
figure b

Annotation Process

3.1.3 Label assignment

The labeling process was carried out by a psychiatric doctor, a professor and one Post Graduate (M. Tech.) student. Each person independently assessed and labeled the tweets based on their content. Subsequently, a voting-based methodology was employed to determine the final label. In cases where two or more students reached a consensus on a specific label, that label was assigned to the respective tweet. The details are shown in Algorithm 3.

3.1.4 Inter-rater agreement

To assess the consistency of labels assigned by different annotators, we utilized the Kohen Kappa (K) measure. The resulting k-value for inter-rater agreement was calculated to be 0.85, indicating a substantial level of agreement among the persons involved in the labeling process as shown in Algorithm 4.

Algorithm 3
figure c

Label Assignment Process (Majority Voting)

Algorithm 4
figure d

Inter-Rater Agreement Process

3.1.5 Final dataset composition

Out of the 12,029 collected tweets, 7,029 were labelled as depressive, while the remaining 5,000 were designated as non-depressive. This balanced representation of both classes within the dataset contributes to its overall robustness and effectiveness for subsequent analyses.

The detailed of dataset creation were given in Algorithms (1-4)

3.2 Analysis of DepDs using BERT

Given that the dataset pertains to depressive content, where a level of 1 signifies a depressive sentence and 0 denotes a non-depressive one, the analysis was performed utilizing a pre-trained ‘bert-base-uncased’ model. The confidence distribution plot (see Figs. 3 and 4) offers valuable insights into the model’s certainty in distinguishing between depressive and non-depressive sentences. This visualization provides a clear overview of the model’s confidence scores across the dataset. For instance, when the ‘bert-base-uncased’ model and ‘sentiment-analysis’ analyzes a sentence like “I feel overwhelmed and sad", it assigns a confidence score of 0.85, indicating a high level of certainty that this sentence is depressive. Conversely, for a sentence like “The weather is lovely today", the model assigns a confidence score of 0.12, suggesting a low likelihood of depressive content.

Fig. 3
figure 3

Distributions of Level

Fig. 4
figure 4

Distribution of Confidence

The 2-dimensional distribution plot (see Fig. 5) provides an illuminating perspective on the relationship between confidence scores and assigned depressive levels. Imagine two axes where the x-axis represents confidence scores ranging from 0 to 1, and the y-axis represents the assigned levels (0 for non-depressive, 1 for depressive). In this plot, each data point corresponds to a sentence, with its position indicating both its confidence score and assigned level. For instance, a data point at coordinates (0.85, 1) signifies a highly confident prediction of a depressive sentence, while a point at (0.12, 0) indicates a low-confidence assignment of a non-depressive sentence. This visual representation encapsulates how the model associates confidence with actual assigned levels. The 2-dimensional distribution plot thereby offers a powerful visualization of the model’s proficiency in distinguishing between depressive and non-depressive content based on confidence scores.

Fig. 5
figure 5

2-d Distributions of Level

Fig. 6
figure 6

Values of Level

Fig. 7
figure 7

Values of Confidence

Fig. 8
figure 8

Faceted distributions of Level

Fig. 9
figure 9

Faceted distributions of Confidence

Suppose we have an input text: “I feel overwhelmed and sad, and it’s hard to find joy in anything lately". This text expresses emotions commonly associated with depression. The BERT classifier analyzes this text and assigns a sentiment label and a corresponding confidence score. In this case, the classifier predict a label of ‘Depressive’ and assign a confidence score of 0.87. Here, the score of 0.87 indicates a high level of confidence in the classifier’s prediction that the input text is depressive. This means that the model is quite certain in its assessment that the text exhibits characteristics associated with depression. On the other hand, if we had a different input text like “The weather is lovely today, and I’m feeling great!", the BERT classifier predicts a label of ‘Non-Depressive’ with a lower confidence score of 0.12 for depressive and 0.88 for non-depressive. It indicates that the input text is less likely to contain depressive content, and the model’s confidence in this prediction is relatively high for non-depressive label. So, the score (see Figs. 6 and 7) provides a numerical representation of the model’s confidence in its depression detection , with higher scores indicating greater confidence for the depressive posts.

The ‘Faceted distribution’ plot (see Figs. 8 and 9) presents a detailed view of confidence score distributions for both depressive and non-depressive sentences. It partitions the dataset into separate segments based on assigned levels. One facet represents sentences labeled as depressive (assigned level 1), while the other represents non-depressive sentences (assigned level 0). Each facet displays a histogram of confidence scores, showing the frequency of sentences at different confidence levels. This visual breakdown allows for a targeted examination of the model’s performance within specific level categories. For example, in the facet for depressive sentences, a peak at high confidence scores indicates accurate classifications, while a spread across various confidence levels may signal areas for improvement. This facet-based analysis refines our understanding of how the model performs within distinct depressive and non-depressive contexts.

To the best of our knowledge, none of the X (formerly Twitter) datasets used in existing works on identifying depressive posts are publicly available. The detailed datasets used in this work can be seen from Table 3. Figure 10 shows the word cloud distribution of the depressive tweets in the dataset, while Fig. 11 shows the same for nondepressive tweets.

4 Methodology

In this section, the detailed working of the proposed ensemble model (E-CLS) is presented. We consider the prediction of all the three models (LSTM-A, CNN, SVM) and label information associated with them based on the majority voting of the three models. Along with the proposed model, several machine-learning and deep-learning models were also implemented to compare the performance of the proposed hybrid network (E-CLS). The detailed flow diagram for the proposed model can be seen in Fig. 12.

Table 3 Statistics of Datasets Used in this Work
Fig. 10
figure 10

Distribution of words in Depressive Dataset

Fig. 11
figure 11

Distribution of words in non-Depressive Dataset

Fig. 12
figure 12

The proposed hybrid E-CLS network for depression detection

4.1 Problem statement

The primary objective of this research is to develop a robust and accurate model for the classification of depressive and non-depressive tweets. In pursuit of this goal, we introduce an ensemble model E-CLS, amalgamating three distinct models: LSTM-A, CNN, and SVM. Each model contributes to the prediction process based on features extracted from the input text data. To facilitate the training and evaluation of our model, we meticulously curated a novel dataset, DepDs. This dataset consists of 12,029 English tweets acquired from the X (formerly Twitter) platform through the X (formerly Twitter) API. Out of these, 70% (8,420 tweets) were used as our training set, 20% (2,406 tweets) constituted our testing set, and remaining 10% (1,203 tweets) was used as the validation set.

The classification process within the E-CLS model involves the prediction of all three models, resulting in label information associated with each prediction. These labels are further consolidated using a majority voting approach, enhancing the overall prediction accuracy. Additionally, several established machine-learning and deep-learning models are implemented for comparative evaluation.

4.2 Proposed approach

Let \(M\) be the set of models, including LSTM-A, CNN, and SVM, denoted as \(m_1, m_2, m_3\) respectively. For each model \(m_i\), \(i = 1, 2, 3\), a prediction \(P_i\) is generated based on the input features. Let \(L_i\) represent the label associated with the prediction of model \(m_i\). The majority voting strategy is defined as:

$$ L_{\text {majority}} = {\text {*}}{argmax}_{l \in \{0, 1\}} \sum _{i=1}^{3} \mathbb {I}(L_i = l) $$

where \(\mathbb {I}(x)\) is the indicator function, yielding 1 if \(x\) is true and 0 otherwise. Consequently, the final prediction \(L_{\text {E-CLS}}\) of the E-CLS model is determined as:

$$ L_{\text {E-CLS}} = L_{\text {majority}}(L_1, L_2, L_3) $$

where \(L_{\text {E-CLS}} \in \{0, 1\}\) denotes the predicted label by the E-CLS model.

The (E-CLS) model combines the strengths of LSTM-A, CNN, and SVM models through majority voting, resulting in a powerful hybrid network capable of accurate tweet classification.

Fig. 13
figure 13

Abstract diagram of the proposed work

Fig. 14
figure 14

CNN architecture used in E-CLS

Fig. 15
figure 15

LSTM architecture used in E-CLS

The abstract diagram (see Fig. 13) of the proposed E-CLS framework for classifying depressive tweets focusing on the classification of tweets with depressive content. Data collection involves using the Twitter API to gather a balanced dataset of both depressive and non-depressive tweets. Following this, data preprocessing and text cleaning steps ensure that the collected information is free from noise and irrelevant details. Feature extraction techniques like TF-IDF and GloVe word embedding are then applied to represent the textual data in a numerical format. The ensemble model, composed of CNN (see Fig. 14), LSTM-A (see Fig. 15), and SVM classifiers, is trained on the preprocessed data. This diagram encapsulates the entire process, emphasizing the framework’s robustness in addressing mental health-related content on social media.

4.3 Machine learning models

We used the following classifiers: (i) SVM, (ii) LR, (iii) NB, (iv) KNN, (v) DT, (vi) RF, and (vii) GB. To train these machine-learning classifiers, we extracted features from sentences using 1-gram, 2-gram, and 3-gram TF-IDF. Section 5 presents the experimental findings of these classifiers.

4.4 Deep learning models

We implemented six deep learning models: (i) Long Short-Term Memory (LSTM), (ii) LSTM-Attention (LSTM-A), (iii) Bidirectional-LSTM (Bi-LSTM), (iv) Convolutional Neural Network (CNN), (v) Ensemble of CNN, LSTM, and SVM (E-CLS) network (with BERT-embedding) and (vi) Ensemble of CNN, LSTM, and SVM (E-CLS) network (proposed model). In the case of the proposed E-CLS network and other deep-learning models, pre-trained GloVe word embedding was used to give input to the system. Word-embedding techniques generate word vectors for words that have similar contextual meanings. The reason for using GloVe is that it generally performs well as it is pre-trained over a large corpus and reduces the computation overheads. The complete architecture of the proposed E-CLS network can be seen in Fig. 12. The embedded word vectors are passed through CNN and LSTM block, whereas TF-IDF vectors were passed to the SVM block. All these three blocks are placed in parallel, as shown in Fig. 12. The output feature vector coming from the above three blocks were passed through majority voting block. This block will mark any sentence as depressive if at least any two block will detect this sentence as depressive otherwise the majority voting block will mark it as non-depressive. The hyperparameter details of the deep learning-based approaches are given in Table 4, while the same for the SVM classifier is given in Table 5.

Motivation for choosing CNN, LSTM, and SVM as baselines By incorporating these diverse baselines into our ensemble model, we aim to leverage the complementary strengths of each individual model to improve overall performance. This approach allows us to benefit from the local feature extraction capabilities of CNN, the sequential modeling strengths of LSTM, and the robustness of SVM in separating classes in a high-dimensional feature space.

Table 4 Hyper-parameter settings for the deep neural network-based approaches
Table 5 Basic Hyperparameters for SVM Classifier

4.5 Word embedding vectors

Representing each text into its corresponding vectors [20] is a necessary step to provide input to the deep-learning models. In this work, we have used word embedding to extract the robust features from the text. We separated each sentence into words, and then we created a 200-dimensional embedding vector for each word. Word-embedding vectors are initialized randomly from the uniform distribution. These weights are then trained using back-propagation.

4.6 Convolutional neural network

The convolution process of the CNN [26] model is used to extract the semantic features of the sentence by using n-gram information. In the CNN model, the convolution process involves a filter \( F\epsilon R^{h*k} \), with the size of h words with k dimensions (same as the embedding vector). This filter is applied to the text matrix \( T_{i} \) and element-wise multiplication is performed, then the summation of all values are passed through a non-linear function to produce a new feature. Next time, this filter is again applied to the text matrix by moving one column towards the right and convolved with the next h words of the matrix \( T_{i} \). Subsequently, it is passed through the non-linear function to produce a new feature, and so on. A feature \( c_{i} \) for a window of words \( W_{i:i+h-1} \) can be generated as:

$$\begin{aligned} c_{i} = \phi (F. W_{i:i+h-1} + b) \end{aligned}$$
(1)

where, \( b\epsilon R \) is the bias and \( \phi \) is a non-linear function. This filter is carried out to each feasible word window having m words, \( W_{1:h} \), \( W_{2:h+1} \), .......,\( W_{m-h+1:m} \), and each is embedded in n-dimension to produce a feature map.

$$\begin{aligned} c = [c_{1}, c_{2}, \ldots .., c_{m-h+1}] \end{aligned}$$
(2)

where \( c\epsilon R^{m-h+1} \).

A simple convolution operation with a filter having size \(h=3\) is represented as:

$$ \begin{bmatrix} &{} W_1 &{} W_2 &{} W_3 &{} \dots &{} W_m \\ &{} \alpha _{11} &{} \alpha _{21} &{} \alpha _{31} &{} \dots &{} \alpha _{m1}\\ &{} \alpha _{12} &{} \alpha _{22} &{} \alpha _{32} &{} \dots &{}\alpha _{m2}\\ &{} \alpha _{13} &{} \alpha _{23} &{} \alpha _{33} &{} \dots &{}\alpha _{m3}\\ &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ &{} \alpha _{1K} &{} \alpha _{2K} &{} \alpha _{3K} &{} \dots &{} \alpha _{mK}\\ \end{bmatrix} \bullet \begin{bmatrix} &{} f_{11} &{} f_{21} &{} f_{31} \\ &{} f_{12} &{} f_{22} &{} f_{32} \\ &{} f_{13} &{} f_{23} &{} f_{33} \\ &{} \vdots &{} \vdots &{} \vdots \\ &{} f_{1K} &{} f_{2K} &{} f_{3K} \\ \end{bmatrix} $$
$$\begin{bmatrix} &{} c_1 \\ &{} c_2 \\ &{} c_3 \\ &{} \vdots \\ &{} c_{m-h+1} \\ \end{bmatrix} = \begin{bmatrix} &{} f\{\alpha _{11}f_{11} +\alpha _{21}f_{21}+\alpha _{31}f_{31}+\alpha _{12}f_{12}+ \alpha _{22}f_{22}+\alpha _{32}f_{32}+ \dots + \\ {} &{} \alpha _{1K}f_{1K} + \alpha _{2K}f_{2K}+\alpha _{3K}f_{3K}\} \\ &{} f\{\alpha _{21}f_{11} +\alpha _{31}f_{21}+\alpha _{41}f_{31}+\alpha _{22}f_{12}+ \alpha _{32}f_{22}+\alpha _{42}f_{32}+ \dots + \\ {} &{} \alpha _{2K}f_{1K} + \alpha _{3K}f_{2K}+\alpha _{4K}f_{3K}\} \\ &{} f\{\alpha _{31}f_{11} +\alpha _{41}f_{21}+\alpha _{51}f_{31}+\alpha _{32}f_{12}+ \alpha _{42}f_{22}+\alpha _{52}f_{32}+ \dots + \\ {} &{} \alpha _{3K}f_{1K} + \alpha _{4K}f_{2K}+\alpha _{5K}f_{3K}\} \\ &{} \vdots \\ &{} f\{\alpha _{(m-2)1}f_{11} +\alpha _{(m-1)1}f_{21}+\alpha _{(m)1}f_{31}+\alpha _{(m-2)2}f_{12}+ \alpha _{(m-1)2}f_{22}+ \\ {} &{} \alpha _{(m)2}f_{32}+ \dots + \alpha _{(m-2)K}f_{1K} + \alpha _{(m-1)K}f_{2K}+\alpha _{(m)K}f_{3K} \} \\ \end{bmatrix} $$

where, \([\alpha _{m1}~ \alpha _{m2}~ \alpha _{m3} \ldots .. \alpha _{mK}]\) represents the embedding of word \(W_{m}\).

We used a Rectified Linear Unit (ReLu) as an activation function. The ReLu activation function is defined as:

$$\begin{aligned} \tau (x) = max (0, x) \end{aligned}$$
(3)

This means for \(\tau (x) < 0 \) it returns 0 and for \(\tau (x) \ge 0 \) it returns x itself.

We used this function because it improves the training of CNN by speeding up the training process, as the computation step in ReLu is easier.

The purpose of applying the pooling operation is to get the most important feature in each of the windows, i.e., the one with the highest value. Similarly, we got a number-of-features matrix, one for each of the filters. After the convolution layer, we concatenated each of the matrices and flattened it to a single feature vector as seen in Fig. 12. Obtained features from the flattened layer are passed through a fully connected dense layer with a softmax activation function, which predicts the output as a probability of occurrence of a depressive and non-depressive sentence. To overcome the situation of over-fitting at the dense layers, we used dropout as the regularization technique. Dropout prevents the interdependency between the hidden neurons by simply dropping it out randomly with the probability of \( D_{p} \). This allows the neural network to learn more robust features and speed up the training.

Algorithm 5
figure e

E-CLS Algorithm

4.7 Long short-term memory

The long short-term memory (LSTM) [46] is the extended form of recurrent neural network (RNN) consisting of special units in the recurrent hidden layer known as memory blocks. All the memory cells are interconnected to each other. Each memory cell consists of an input gate, output gate, forget gate, and cell state. LSTM has the capability to keep long-term dependencies using these gates and cell state. The information is added or deleted from the cell state based on their importance and is encoded by different gates. The detailed working of an LSTM cell can be seen in [18, 46].

4.8 Support vector machine

In this section we give some theoretical definitions of SVMs. Assume that we are given the training data

\((x_{i}, y_{i}),...,(x_{l}, y_{l}), x_{i}\in R^n, y_{i}\in (+1, -1).\)

The decision function \(\phi \) in SVM framework is defined as:

$$\begin{aligned} \phi (x) = sgn(\theta (x)) \end{aligned}$$
(4)
$$\begin{aligned} \theta (x) = \sum _{i=1}^{l} y_{i} \alpha _{i}K(x_{i}, x) + b \end{aligned}$$
(5)

where K is a kernel function, \(b\in R\) is a threshold, and \(\alpha _{i}\) are weights. Besides the weights \(\alpha _{i}\) satisfies the following constraints:

\(\forall i : 0 \le \alpha _{i} \le C\) and \(\sum _{i=1}^{l}\alpha _{i} y_{i} = 0\)

where C is a misclassification cost. The vectors \(x_{i}\) with non-zero i are called Support Vectors. For linear SVMs, the kernel function K is defined as:

\(K(x_{i}, x) - x_{i}.x\)

In this case, (5) can be written as:

$$\begin{aligned} \theta (x) = w.x + b \end{aligned}$$
(6)

where \(w = \sum _{i=1}^{l} y_{i}\alpha _{i} x_{i}.\) To train an SVM is to find i and b by solving the following optimization problem:

maximize  \(\sum _{i=1}^{l} \alpha _{i} - \frac{1}{2} \sum _{i=1}^{l} \alpha _{i} \alpha _{j} y_{i} y_{j}K(x_{i}, x_{j})\)

subject to  \(\forall i : 0 \le \alpha _{i} \le C\) and \(\sum _{i=1}^{l}\alpha _{i} y_{i} = 0.\)

The solution gives an optimal hyperplane, which is a decision boundary between the two classes.

4.9 Proposed E-CLS network

In many cases single machine-learning and/or deep-learning models are not sufficient to correctly predict depressive or non-depressive post. For the proposed dataset of depression, an ensemble of SVM, CNN, and LSTM outperform various machine-learning and deep-learning models.

The E-CLS algorithm is devised for the classification of textual statements into either Depressive or Non-Depressive categories. Initially, each statement undergoes preprocessing to eliminate extraneous information, followed by tokenization into individual words. Subsequently, an embedding matrix is generated to represent the semantic information of each word in the statement. Simultaneously, TF-IDF vectors are computed to create a high-dimensional representation of the statement. These embeddings and TF-IDF vectors serve as inputs to an ensemble of LSTM, CNN, and SVM. These models are adept at capturing contextual, pattern-based, and high-dimensional information, respectively. The predictions from these models are amalgamated, and if the cumulative prediction score surpasses or equals 2, the statement is classified as depressive; otherwise, it is labeled as non-depressive. This amalgamation of LSTM, CNN, and SVM models, coupled with TF-IDF vectors, enables a comprehensive understanding of the sentiment and content of social media posts, enhancing the classification process’s accuracy and robustness.

5 Results

Extensive experiments were performed with conventional machine-learning and deep-learning models. The performance of the system is measured using precision, recall, \(F_1\)-score, confusion matrix, and AUC-ROC curves. The details of each of the evaluation metrics are listed in Section 5.1.

5.1 Evaluation metrics

To evaluate the performance of the proposed model, we used precision, recall, \(F_1\)-score, confusion matrix, and AUC-ROC curve. The precision of depressive/non-depressive sentences can be calculated by obtaining the ratio of predicted depressive/non-depressive sentences as depressive/non-depressive sentences to total predicted depressive/non-depressive sentences.

$$\begin{aligned} \text {Precision (depressive)} = \frac{\text {Number of accurately predicted depressive post}}{\text {Total number of predicted depressive post}} \end{aligned}$$
(7)

Recall of depressive/non-depressive sentences can be calculated as the ratio of predicted depressive/non-depressive sentences as depressive/non-depressive sentences to total actual depressive/non-depressive sentences.

$$\begin{aligned} \text {Recall (depressive)} = \frac{\text {Number of accurately predicted depressive sentences}}{\text {Total number of actual depressive statements}} \end{aligned}$$
(8)

\(F_1\)-score is the harmonic mean of precision and recall. The mathematical equations for the precision, recall, and \(F_1\)-score can be seen from (7), (8), and (9), respectively.

$$\begin{aligned} F_{1}\text {-score (depressive)} = 2 \times \frac{\text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(9)

The confusion matrix is used to show the distribution of the predicted testing samples in both depressive and non-depressive classes. Confusion matrices using SVM, CNN, LSTM, and the proposed E-CLS network can be seen from Figs. 16, 17, 18, and 19, respectively.

The Area Under The Curve-Receiver Operating Characteristics (AUC-ROC) curve plotted with True Positive Rate (TPR) against the False Positive Rate (FPR). The detailed description of the AUC-ROC curve can be seen in (Fig. 20, 21, 22) for different deep-learning models (LSTM, CNN, and E-CLS)

Fig. 16
figure 16

Confusion Matrix with SVM classifier on DepDs dataset

Fig. 17
figure 17

Confusion matrix with CNN on DepDs dataset

Fig. 18
figure 18

Confusion matrix with LSTM on DepDs dataset

Fig. 19
figure 19

Confusion matrix with proposed E-CLS model on DepDs dataset

Fig. 20
figure 20

Receiver Operating Characteristic Curve with LSTM model

Fig. 21
figure 21

Receiver Operating Characteristic Curve with CNN model

Fig. 22
figure 22

Receiver Operating Characteristic Curve with E-CLS model

Table 6 Results of conventional machine-learning model
Table 7 Results of deep-learning models and the proposed E-CLS model

The result of different conventional machine-learning classifiers and deep-learning models are listed in Tables 6 and 7. With a conventional machine-learning classifier, SVM reported the best performance with an F1-score of 0.89 whereas the KNN classifier yielded the worst performance with an F1-score of 0.61. While with deep-learning models, the proposed E-CLS with GloVe-embedding model achieved an F1-score of 0.91, which is greater than the other implemented machine-learning and deep neural network-based models. The confusion matrix for the proposed model can be seen in Fig. 19. The confusion matrix shows that our model detects 89% of depressive statements as depressive and 92% of non-depressive statements as non-depressive.

5.2 E-CLS with BERT-embedding

The E-CLS model, leveraging BERT-embedding, demonstrated promising performance on the Proposed Dataset (DepDs). Specifically, for the depressive class, the model achieved a precision of 0.87, recall of 0.85, and an F1-score of 0.86. In the non-depressive class, the precision, recall, and F1-score were 0.90, 0.92, and 0.91, respectively. The weighted average precision, recall, and F1-score for the E-CLS with BERT-embedding were 0.89, 0.89, and 0.89 (see Table 7).

5.3 E-CLS with GloVe-embedding

The E-CLS model, utilizing GloVe-embedding, exhibited strong performance on the proposed dataset (DepDs). For the depressive class, the precision, recall, and F1-score were 0.91, 0.88, and 0.90, respectively. In the non-depressive class, the model achieved a precision of 0.90, recall of 0.94, and F1-score of 0.92. The weighted average precision, recall, and F1-score for the E-CLS with GloVe-embedding were 0.90, 0.91, and 0.91 as shown bold in Table 7.

The major finding of this research is that the proposed E-CLS model performed best for identifying depression on the collected proposed dataset. The E-CLS achieved an F1-score of 0.91 (see Table 7) on the proposed datasets. From these results, it is evident that the proposed E-CLS model is performing significantly well.

The comparison of our work to others is not direct because some of them used different datasets. However, our model performed better than other similar models. In fact, our proposed E-CLS model outperformed as it achieved an F1-score of 0.91.

5.4 Qualitative analysis

The E-CLS model efficiently processes sentences through a combination of its components: CNN, LSTM, and SVM. For example, consider the sentence ‘I am feeling very low’ which is analyzed by the E-CLS model. The CNN, specialized in detecting local patterns, identifies the word ‘low’ as a potential indicator of low mood or depression. The LSTM, designed for understanding sequential context, recognizes the phrase’s structure and interprets it as an expression of the current emotional state. The SVM, synthesizing the CNN and LSTM outputs, integrates these insights to confidently predict that the sentence conveys feelings of low mood or depression. This analysis showcases how the ensemble model effectively harnesses the strengths of its components, demonstrating its proficiency in detecting signs of depression in social media text.

The E-CLS model exhibits varying performance metrics across demographic subsets, which we define as Group A (users aged 18-30 employing informal language) and Group B (users aged 40-60 with a tendency towards formal linguistic styles). Within Group A, the model demonstrates higher precision, likely attributed to the prevalence of specific keywords or expressions associated with depressive sentiments in this age bracket. Conversely, in Group B, characterized by a more formal language style among older users, the model tends to exhibit higher recall, potentially due to the presence of more overt indicators of depressive sentiments. This subgroup analysis illuminates the influence of demographic characteristics and linguistic styles on the model’s performance. For instance, expressions like ‘feeling down’ may be prevalent in Group A, while phrases indicating diminished pleasure in activities may be more common in Group B. These observations underscore how linguistic nuances and demographic factors play a significant role in influencing the model’s effectiveness.

5.5 Error analysis

In order to check the robustness of the model, we manually checked a few tweets, their associated label and the predicted labels. Some of the manually checked tweets are presented in Table 8. The error analysis of our E-CLS model involved examining individual predictions on a set of real-world sentences. For instance, the sentence ‘I can’t find joy in anything anymore’ was correctly classified as depressive due to the presence of negative language and expressions of hopelessness. Conversely, the sentence ‘Life is beautiful and full of opportunities’ was appropriately classified as non-depressive, given its positive and optimistic tone, which highlights the beauty of life and the potential opportunities it offers.

Table 8 Error Analysis of E-CLS Model

6 Discussion

The major findings of this research is that an ensemble model of LSTM, CNN and SVM with tweets can be used to estimate the sign of depression in a person with a very good accuracy. The other major contributions of this research is the curated dataset for the said task. The dataset collected is relatively larger in size than existing datasets and hence a better option to be used for evaluating a deep learning model.

Due to the public unavailability of the datasets used by other prominent works on depression, it was not possible to compare our proposed model with state-of-the-art conventional machine-learning and deep-learning models on the dataset(s) used in the respective literature. Hence, we implemented and tested the prominent state-of-the-art models on our proposed dataest. We have implemented Khan et al. [25] SVM model on our depressive dataset (DepDs) and obtained an F1-score of .89. Similarly, we have also implemented LSTM-A derived from Orabi et al. [41] and CNN derived from Shetty et al. [49] to get an F1-score of .89 and .87, respectively; the results have been shown in Table 9. It can be clearly seen that our proposed E-CLS model achieves an improvement of at least 2% over state-of-the-art techniques.

Table 9 Comparative analysis of proposed work with state-of-the-art models

One of the major limitations of the current work is the usage of textual content only. Nowadays emojis and hyperlinks are a major part of the most tweet and they play a significant role in expressing the emotions of a user. A tweet with negative sentiments having a happy emoji may indicate a depressive post. However, in the current work, we have ignored these emojis. The audio and video contents also form a part of some tweets with hyperlinks. We have not explored the associated hyperlinks and the related content.

7 Conclusion

To encapsulate our findings and contributions, we have delved into the intricate task of identifying depressive posts within social media textual content. Through the integration of Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Support Vector Machine (SVM) networks, our ensemble model, E-CLS, showcases remarkable efficacy. Leveraging TF-IDF and GloVe word embeddings, the model excels in discerning between depressive and non-depressive posts, achieving an impressive F1-score of 0.91. This performance outshines several established machine-learning and deep-learning counterparts.

Looking ahead, this study lays a solid foundation for further exploration. We envision extending our approach to accommodate multilingual scenarios, addressing the diverse linguistic landscape found in non-native English-speaking regions. Additionally, incorporating diverse media modalities such as images, videos, and audio clips holds promise for even more comprehensive mental health monitoring on social media platforms. Furthermore, the integration of emojis and hyperlinks, ubiquitous elements in social media communication, presents a compelling area for future investigation.

This study makes a substantial stride in the domain of depression detection from social media posts. The ensemble model showcases promising performance, and its adaptability to different languages and inclusion of diverse media types hold potential for significant impact in mental health monitoring through social media platforms.