
1 Introduction

Depression, characterized by persistent sadness and a loss of interest in activities a person normally enjoys, or by the inability to carry out usual daily activities for at least two weeks, is very common, affecting more than 300 million people globally [1]. It significantly affects overall well-being and functioning at school, in the family, and in the workplace, often leading to self-harm or even suicide. With the COVID-19 pandemic, depression has become even more pronounced, as shown in the study by Rossi et al., which found COVID-related stressful events to be associated with depression and anxiety symptoms in the Italian general population [2]. Depression has also been shown to be highly associated with numerous chronic diseases such as diabetes, heart disease, cancer, stroke, and chronic obstructive pulmonary disease [3]. Prompt recognition of the disease coupled with early professional intervention can significantly improve mental symptoms and resolve somatic problems such as gastrointestinal complaints and sleeping disorders, thereby mitigating the negative implications for overall well-being [4]. To assess depression, it is crucial to determine the important contributing factors so that appropriate interventions can be planned. It is in this area of early diagnosis that machine learning (ML) can be utilized, enhancing the whole diagnostic process and enabling the institution of much-needed early intervention efforts and medical therapy.

Our objective is to predict depression using a variety of ML classification algorithms, namely Logistic Regression (LR), Naive Bayes (NB), k-Nearest Neighbor (kNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Adaptive Boosting (AdaBoost), and Extreme Gradient Boosting (XGBoost), evaluated on a publicly available dataset. We also aim to determine the features most relevant to depression prediction and the logic employed by the classifiers to explain their predictions.

2 Literature Review

In the study by Grzenda et al. involving depressed patients aged 60 years and above, the authors compared ML classifiers (SVM, RF, and LR) on sociodemographic characteristics, baseline clinical self-reports, cognitive tests, and structural magnetic resonance imaging features to predict treatment outcomes in late-life depression [5]. RF obtained an area under the receiver operating characteristic curve (AUROC) of 0.83, while SVM and LR recorded AUROCs of 0.80 and 0.79, respectively. They also reported anterior and posterior cingulate volumes, depression characteristics, and self-reported health-related quality scores as the most important predictors of treatment response. Lin et al. [6] compared regression-based models (LR, lasso, ridge) and RF in forecasting depression among home-based elderly Chinese. The authors concluded that these models have good diagnostic performance in differentiating depression from no depression, and reported life satisfaction, self-reported memory, cognitive ability, and impairment in activities of daily living to be the major determinants. In [7], the authors applied an XGBoost model to classify current depression versus no lifetime depression with a 0.86 AUROC. They further concluded that XGBoost and network analysis were useful for discovering depression-related factors and their relationships and can be applied to epidemiological studies.

Sabab Zulfiker et al. applied six ML classifiers coupled with three feature selection methods and the synthetic minority oversampling technique (SMOTE) to assess for the presence of depression [8]. Their results showed AdaBoost with the SelectKBest feature selection technique to be the best-performing model, with a 92.56% accuracy rate. Nemesure et al. [9] applied a novel ensemble of ML models (SVM, kNN, LR, RF, XGBoost, and a neural network (NN)) to predict depression and Generalized Anxiety Disorder (GAD) with moderate predictive performance (AUROC of 0.73 for GAD and 0.60 for depression). Shapley Additive Explanations (SHAP) was used to generate feature importance.

Sousa et al. [10] determined predictors of depression and reported that sex, living status, mobility, and nutritional status appear to be the important factors associated with depression. They concluded that these important predictors would be crucial for prevention and for the customization of interventions. Richter et al. evaluated several ML-based approaches that use behavioral data to classify depression and other psychiatric disorders. The authors classified these studies into laboratory-based assessments and data mining, the latter further divided into (a) social media usage and movement sensor data and (b) demographic and clinical information. They summarized the benefits and constraints and suggested future research directions for developing interventions and individually tailored treatments [11].

In the study by Vincent et al. [12], the authors used a multilayer perceptron (MLP) trained with the backpropagation technique to assess for depression using data collected from IT professionals. They reported that a deep MLP with backpropagation outperforms other machine learning-based models for effective classification of depression, with 98.8% accuracy. Jan et al. reviewed several ML algorithms for the diagnosis of bipolar disorders [13]. Their survey identified 18 classification models, five regression models, two model-based clustering methods, one natural language processing approach, one clustering algorithm, and three deep learning-based models. Magnetic resonance imaging data were most commonly used for classifying bipolar patients, whereas microarray expression data sets and genomic data were the least commonly used.

3 Methodology

In our research, the first step is loading the dataset. This is followed by pre-processing steps, which include data cleaning, dataset normalization, feature selection to identify important predictors, and addressing data imbalance. We then applied various ML algorithms and assessed their performance using accuracy, precision, sensitivity/recall, specificity, F1-score, and the Matthews correlation coefficient. Feature importance and AI explainability assessments were also done. The pipeline for this study is shown in Fig. 1.

Fig. 1.

Machine learning pipeline for depression prediction

3.1 Dataset Description

We used a publicly available depression dataset from GitHub [14]. The dataset contains 604 instances with a 455:149 male-to-female ratio, 30 predictor variables, and 1 target variable (depressed or not) based on the Burns Depression Checklist. The description of these attributes is shown in Table 1.

Table 1. Description of attributes of depression

3.2 Pre-processing Steps

Pre-processing methods were applied to the dataset in preparation for ML training. There were no missing values, but there were 10 duplicate records, which were promptly removed. The dataset also shows mild class imbalance, with 391 (65.82%) instances with depression and 203 (34.18%) without. We performed data encoding of the attributes and feature scaling with normalization using the StandardScaler function of the scikit-learn library. All categorical predictors were dummified, resulting in an increase in the number of columns. For feature selection, we applied and compared a wrapper method, recursive feature elimination with cross-validation (RFE-CV), and a filter method using Pearson correlation. In our study, we used a threshold correlation with the target variable of > 0.20 and a correlation between predictors of less than 0.80. As the dataset is imbalanced, we applied the Synthetic Minority Over-sampling Technique (SMOTE). The correlation heatmap is shown in Fig. 2.
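As a rough sketch of these pre-processing steps, the dummification, scaling, and Pearson-filter stages could look as follows; the toy frame and its column names (`ANXI`, `AGE`, `DEPRESSED`) are illustrative stand-ins, not the actual dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the depression dataset; column names are hypothetical.
df = pd.DataFrame({
    "ANXI": ["Yes", "No", "Yes", "No"],
    "AGE": [25, 40, 31, 52],
    "DEPRESSED": [1, 0, 1, 0],
})

# Remove exact duplicate records (10 were found in the real dataset).
df = df.drop_duplicates()

# Dummify categorical predictors; this widens the column set.
X = pd.get_dummies(df.drop(columns="DEPRESSED"), drop_first=True)
y = df["DEPRESSED"]

# Standardize features (zero mean, unit variance) with StandardScaler.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# Filter-style selection: keep predictors whose |Pearson r| with the
# target exceeds the 0.20 threshold used in the study.
corr_with_target = X_scaled.corrwith(y).abs()
selected = corr_with_target[corr_with_target > 0.20].index.tolist()
print(selected)
```

The RFE-CV wrapper method and SMOTE would be applied at this same stage; they are sketched separately below the sections that discuss them.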

3.3 Machine Learning Models

The dummified dataset was divided into 30% testing (179 records) and 70% training (415 records) with tenfold cross-validation. We utilized Python 3.8 and its various machine learning libraries (scikit-learn, Keras, TensorFlow, pandas, Matplotlib, seaborn, NumPy, and LIME) in our experiments. The models tested were LR, NB, kNN, SVM, DT, RF, AdaBoost, and XGBoost. Hyperparameter tuning was performed on each ML model. To determine the best-performing model, the Matthews correlation coefficient (MCC) was used.
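A minimal sketch of this split-train-evaluate setup, using a synthetic stand-in of the same size (594 records after duplicate removal) rather than the actual depression data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in: 594 records, ~66/34 class balance as in the dataset.
X, y = make_classification(n_samples=594, n_features=20,
                           weights=[0.34, 0.66], random_state=42)

# 70/30 split as in the paper (415 training / 179 testing records).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Tenfold cross-validation on the training portion.
cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()

# MCC on the held-out test set, used to rank the competing models.
mcc = matthews_corrcoef(y_test, clf.predict(X_test))
print(round(cv_acc, 2), round(mcc, 2))
```

The same pattern would be repeated for each of the eight classifiers, with hyperparameter tuning (e.g. a grid search) wrapped around the fit step.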

3.4 Feature Importance and Model Explainability

For the best-performing models, we generated feature importance scores to determine the attributes most relevant to depression prediction. To understand the local behavior of the model for a single instance of a patient with or without depression, we applied Local Interpretable Model-agnostic Explanations (LIME). LIME is used to explain the individual predictions of a black-box machine learning model.
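To make the idea concrete, the following is a hand-rolled sketch of LIME's local-surrogate principle (the study itself used the `lime` library): perturb a single instance, query the black-box model on the perturbations, and fit a proximity-weighted linear model whose coefficients act as per-feature contributions. The random forest and data here are illustrative stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# Black-box model to be explained (stand-in for the trained classifier).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# Perturb one instance and record the black-box probability for class 1.
rng = np.random.default_rng(0)
instance = X[0]
perturbed = instance + rng.normal(scale=0.5, size=(500, X.shape[1]))
probs = black_box.predict_proba(perturbed)[:, 1]

# Proximity kernel: perturbations closer to the instance get larger weight.
dists = np.linalg.norm(perturbed - instance, axis=1)
weights = np.exp(-(dists ** 2) / 2.0)

# Interpretable surrogate fitted locally; its coefficients play the role
# of LIME's per-feature contributions for this one prediction.
surrogate = Ridge(alpha=1.0).fit(perturbed, probs, sample_weight=weights)
print(np.round(surrogate.coef_, 3))
```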

Fig. 2.

Correlation heatmap of predictor variables for depression

4 Results and Analysis

The performance metrics of the 8 ML models for our dataset are shown in Table 2, where the effects of the feature selection method are assessed. LR is the best-performing model both when no feature selection technique is used and when Pearson correlation is used, with accuracy rates of 91% and 89%, respectively. With Pearson correlation, a mild increase in accuracy of 1–4% is seen for DT, RF, NB, kNN, and SVM, while a slight decrease of 2–5% is noted for LR, AdaBoost, and XGBoost. Nonetheless, LR remains the top model when Pearson correlation is used for feature selection. On the other hand, the top model with RFE-CV feature selection is XGBoost, with 85% accuracy. When RFE-CV is applied, accuracy generally decreases by 3–7% for most models, while the remaining models (DT, NB, kNN) show no significant changes. Overall, after considering the effects of feature selection, LR with no feature selection is the best-performing model, obtaining the highest MCC at 0.78 and 91% accuracy. Hence, in this dataset, all attributes appear to be important in depression prediction, and none need to be eliminated.

Table 2. Performance metrics for predicting depression – assessment of feature selection

To address the issue of imbalance, SMOTE was applied to the dataset. The assessment of SMOTE for the feature selection methods is shown in Table 3. Applying SMOTE with no feature selection resulted in a decrease in accuracy of 3–18% for most models. The only model posting a slight increase (2%) in accuracy is NB, while there was no change for RF and AdaBoost. Nevertheless, LR obtained the highest accuracy and MCC at 84% and 0.74, respectively. Applying SMOTE with Pearson correlation feature selection generally resulted in a very small decrease in accuracy (1–4%) for most models, while a small increase of 1% is seen for RF; no change was noted for DT, LR, and AdaBoost. When SMOTE was applied with RFE-CV, there were no significant changes in accuracy across the models: a very slight increase of 1–3% for SVM and AdaBoost, a decrease of 1–4% for NB and XGBoost, and no change for the rest. Overall, LR posted the highest accuracy and MCC at 89% and 0.76, respectively, in this experiment assessing the effects of SMOTE.

Table 4 highlights the confusion matrices of the best-performing models for the six experiments (no feature selection (FS), Pearson correlation, and RFE-CV, each without and with SMOTE). The comparative performance of the best models is also shown in Fig. 3. It can be deduced that the performance of the six best models is similar or comparable across all metrics. This suggests that for this particular dataset, one may or may not apply a feature selection method, and may or may not apply SMOTE to address imbalance. Nonetheless, the overall best-performing model is LR without any feature selection method and without SMOTE.

The feature importance of the attributes of the LR model is shown in Fig. 4. The most important features relevant to depression prediction are ANXI (feels anxiety), DEPRI (feels deprived), POSSAT (satisfied or not with current position/achievement), INFER (inferiority complex), and ENVSAT (satisfied or not with environment). These features are in consonance with clinical assessment of depression.

For the explainable AI part of this research, we used LIME, a technique that approximates any black-box machine learning model with a local, interpretable model to explain each individual prediction. LIME is model-agnostic and can therefore give explanations for any supervised machine learning model. To illustrate how LIME works, we randomly selected two patients, the first without depression and the second with depression. Consider first the patient diagnosed as having no depression, who was correctly classified by LR as 0 or "not depressed," as illustrated in Fig. 5. The LIME output in Fig. 5 consists of three parts: left, center, and right. The left shows the classification predicted by LR, in this case 0 or "not depressed," with a confidence of 90%. The center shows the features that influenced the classification. For this patient, LIME generated the important features used by LR to arrive at the "no depression" classification: the patient has no anxiety (ANXI_Yes = 0), has no inferiority complex (INFER_Yes = 0), has no suicidal thoughts (SUICIDE_Yes = 0), has not recently lost someone close to him (LOST_YES = 0), was not in conflict with family or friends (CONFLICT_Yes = 0), was not physically, sexually, or emotionally abused (ABUSED_YES = 0), never felt cheated by someone recently (CHEAT_YES = 0), and average sleep was not 8 h (AVGSLP_8 = 0). Note that there are also feature values leaning towards "depression" for this case: the patient is not satisfied with his current position or achievements (POSSAT_YES = 0) and felt deprived of something he deserves (DEPRI_YES = 1). However, the effects of these two features are not enough to oppose the effects of the other features contributing to the "no depression" classification. The rightmost part of the LIME output shows the actual values of the 10 most important features for this patient.

LIME can thus be an effective tool to explain the logic used by the model to arrive at its prediction.

For the second patient, who was diagnosed as "depressed," the LIME output in Fig. 6 shows the classification predicted by LR, in this case 1 or "depressed," with a confidence of 100%. The center shows the features that influenced the "depressed" classification: the patient is not satisfied with current position or achievements (POSSAT_True = 0), has anxiety (ANXI_Yes = 1), felt deprived of something he deserves (DEPRI_YES = 1), has an inferiority complex (INFER_Yes = 1), and is undergoing financial stress (FINSTR_True = 1). Note that there are also values contributing to "no depression" for this case: the patient did not lose someone (LOST_Yes = 0), did not feel abused (ABUSED_Yes = 0) or cheated (CHEAT_Yes = 0), and is not in conflict with family or friends (CONFLICT_Yes = 0) nor threatened (THREAT_Yes = 0). However, the effects of these features are not enough to oppose the effects of the features contributing to the "depression" classification. The top features that influenced the "depression" classification for this patient are in agreement with the top features identified by LR as most influential, as seen in Fig. 4. The explainability provided by LIME can help health professionals understand and interpret a classifier's prediction, leading to increased trust in the use of these methods.

In our study, we applied a filter method using Pearson correlation with the target variable (presence of depression) and among predictor variables. Feature selection aims to remove redundant features, which can be expressed by other attributes, and irrelevant features, which do not contribute to the performance of the model in predicting depression [15]. RFE-CV reduces model complexity by removing attributes one at a time until it automatically finds an optimal number of features based on the cross-validation score of the model [16, 17]. It is commonly used due to its ease of use. Based on the associated feature weights, attributes with weights close to zero contribute very little to predicting depression. We must note, however, that removing a single attribute also changes the remaining feature weights, which suggests that elimination of features should be done in a stepwise fashion. On the other hand, pairwise correlation identifies highly correlated features and keeps only one of them, achieving predictive power with as few features as possible, since highly correlated features bring no new information to the dataset. Such features only increase model complexity, increase the chance of overfitting, and require more computation [18, 19].
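The pairwise-correlation filter described above can be sketched as follows, using toy columns (the names `X1`, `X2`, `X3` are hypothetical) where `X2` is nearly a duplicate of `X1`:

```python
import numpy as np
import pandas as pd

# Toy predictors: X2 is X1 plus a tiny amount of noise, X3 is independent.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
df = pd.DataFrame({"X1": x1,
                   "X2": x1 + rng.normal(scale=0.01, size=200),
                   "X3": rng.normal(size=200)})

# For each highly correlated pair (|r| >= 0.80, the study's threshold),
# keep only one member; scan the upper triangle to avoid double-counting.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] >= 0.80).any()]
kept = df.drop(columns=to_drop)
print(to_drop, list(kept.columns))
```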

SMOTE is an oversampling method that creates artificial minority data points within the cluster of minority class samples in a balanced way, rendering it an effective method for reducing the negative effects of imbalance and increasing performance [8, 20,21,22,23,24]. It works by utilizing a kNN algorithm: a random sample is first selected from the minority (no depression) class, its k nearest minority neighbors are found, and a synthetic data point is created between the random sample and a randomly selected neighbor. As such, there is not only an increase in the number of data points but also in their variety. However, SMOTE has disadvantages such as sample overlapping, noise interference, and blindness of neighbor selection, as well as questions about its suitability for clinical datasets [22, 25, 26].
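The SMOTE mechanism just described can be sketched by hand (the study used a library implementation; the minority-class data here are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Minority-class samples only; the majority class is left untouched.
rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 3))

# Find each minority sample's k nearest minority neighbours.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)  # +1: first hit is the point itself
_, idx = nn.kneighbors(minority)

# SMOTE step: pick a random minority sample, pick one of its neighbours,
# and interpolate a synthetic point somewhere on the segment between them.
synthetic = []
for _ in range(10):
    i = rng.integers(len(minority))
    j = rng.choice(idx[i][1:])          # random neighbour, skipping self
    gap = rng.random()                  # interpolation factor in (0, 1)
    synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
synthetic = np.array(synthetic)
print(synthetic.shape)
```

Because synthetic points are interpolated rather than copied, the oversampled class gains variety as well as size, which is the property the paragraph above highlights.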

Feature importance allows us to detect which features in our depression dataset have predictive power by assigning a score to each feature based on its ability to improve predictions, allowing us to rank the features. The increase in the model's prediction error after permuting the values of a feature determines that feature's importance: a larger increase in error means the feature is more important for predicting depression, while if the accuracy remains the same or changes only slightly, the feature is deemed unimportant for depression prediction [27,28,29]. However, this method also has disadvantages, such as a prohibitive computational cost, and it cannot be used as a substitute for statistical inference [30].
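A brief sketch of this permutation scheme using scikit-learn's `permutation_importance`, on a synthetic stand-in dataset rather than the depression data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in dataset with a few genuinely informative features.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time and measure how much held-out accuracy
# drops; a larger drop means a more important feature.
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print(ranking[:3])
```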

Table 3. Comparative Performance Metrics of ML Models with and without SMOTE
Table 4. Confusion Matrix of the best Performing ML Models in various Experiments
Fig. 3.

Performance metrics of best models for depression prediction

Fig. 4.

Feature importance of LR model for depression prediction

Fig. 5.

A sample of feature explainability for a correctly classified patient without depression by logistic regression

Fig. 6.

A sample of model explainability for a correctly classified patient with depression by logistic regression

Our results are comparable with those of other studies [5, 6, 9, 13] in the literature with respect to depression prediction. Our top-performing models have very good sensitivity and specificity, allowing mental health professionals to use them as a screening tool for depression in clinical practice. Additionally, we highlighted the importance of utilizing LIME as an XAI tool in depression prediction. In [31], the authors validated the use of their XAI-ASD system in improving diagnostic performance in predicting the presence of depression and reported that explainability allows humans to appropriately understand and trust the emerging AI phenomenon; it brings machines closer to humans through the capability to explain the logic behind a diagnosis. It should be emphasized that insufficient explainability and transparency in most existing AI systems appear to be a major reason for the unsuccessful implementation and integration of AI tools into routine clinical practice. Our findings suggest the utility of XAI models for making a diagnosis of depression with acceptable results. The clinical relevance of our experiment is further highlighted by XAI models that can provide fast and highly reliable support to physicians in screening patients for depression. An early, accurate diagnosis leading to prompt intervention is crucial to improve patients' quality of life, diminish the risk of developing chronic diseases, improve productivity, and prevent suicide [5, 8, 24, 32]. This research thus provides useful insights into the development of automated models that can assist healthcare workers in the assessment of depressive disorders.

5 Conclusion

Depression is a debilitating disease that leaves individuals persistently feeling sad or hopeless for at least two weeks and affects more than 300 million people globally. We applied several machine learning models with model explainability to a publicly available depression dataset. After a series of experiments assessing the effects of feature selection methods and of the technique used to address dataset imbalance, the best-performing model was logistic regression (LR), with 91% accuracy, 93% sensitivity/recall, 85% specificity, 93% F1-score, and a 0.78 Matthews correlation coefficient. The most important attributes identified by feature importance for depression classification are also in consonance with clinical assessment of depression. The LIME method provided tools to visualize the reasoning behind the machine learning model's classification of depression for better understanding by physicians. Incorporating XAI tools into clinical practice can further enhance the diagnostic acumen of health professionals. The primary limitation of our research is the use of a small dataset, due to the unavailability of large, open-source depression datasets.

Future enhancements of this study should focus on the inclusion of other tools for feature importance, as well as XAI techniques such as SHAP, for better understanding of the models by healthcare providers. Moreover, mixed datasets combining symptoms with neuroimaging features from functional magnetic resonance imaging can also be explored to achieve superior diagnostic accuracy. Our findings are promising and have generated useful insights into the development of fast, highly reliable automated models that can be of use to physicians in predicting depression. Nonetheless, early intervention efforts and treatment for depression ensure the best quality of care for our patients.