Introduction

Clinical assessment and treatment in psychiatry currently depend on diagnostic criteria built entirely on expert consensus rather than on objective biomarkers (Bzdok and Meyer-Lindenberg 2018). Such criteria, described in the Diagnostic and Statistical Manual of Mental Disorders, 5th Edition (DSM-5), and in the International Classification of Diseases (ICD-10), are still considered the gold standard for diagnosis in psychiatry (American Psychiatric Association 2013). Nevertheless, these diagnostic systems have been criticized for their lack of clinical predictive power and neurobiological validity (Bzdok and Meyer-Lindenberg 2018) and for their poor diagnostic stability (Baca-Garcia et al. 2007). While other medical fields have markers of disease presence and severity, such as tumor volume measurements and biochemical blood tests, psychiatry still lacks routine objective tests (Bedi et al. 2015; Mundt et al. 2012).

Historically, evaluation and treatment in psychiatry have been based on patients' reports and on clinical evaluation (Mundt et al. 2007). Diagnosis and therapeutic decisions are therefore highly sensitive to memory and subjectivity biases (Jiang et al. 2018). Over the last decades, there has consequently been an intense search for biomarkers for the diagnosis and follow-up of psychiatric patients (Iwabuchi et al. 2013; Mundt et al. 2012), most of them expensive and invasive (Higuchi et al. 2018). Despite all efforts, instruments for the assessment of mental disorders remain a conundrum (Mundt et al. 2007).

Major depressive disorder (depression) is the most common mental disorder, affecting more than 300 million people worldwide (Sadock et al. 2017; World Health Organization 2018). It is also a leading cause of disability and economic burden (Mundt et al. 2007, 2012). The global prevalence of depression was estimated to be 4.4% in 2015 (World Health Organization 2017), with more women affected than men in a 2:1 ratio (Weinberger et al. 2017). In Brazil, the prevalence of depression is currently the fifth largest in the world, 5.8% (World Health Organization 2017), while its lifetime prevalence can be as high as 16.8% (Miguel et al. 2011).

Patients suffering from depression may present with low mood, irritability, anhedonia, fatigue, psychomotor retardation, cognitive impairment (difficulty in decision-making, poor concentration), and disturbances of somatic functions (insomnia or hypersomnia, appetite disorders, changes in body weight). These symptoms are associated with intense suffering and decline in functioning and may ultimately lead to suicide (American Psychiatric Association 2013). Depression is associated with approximately half of all suicides globally (Cummins et al. 2015).

Early depressive symptoms such as psychomotor retardation and cognitive impairment are frequently related to disturbances in speech (Hashim et al. 2016). Indeed, patterns in depressed speech have long been documented (Mundt et al. 2007). In particular, the persistently altered emotional state in depression may affect vocal acoustic properties. As a result, depressive speech has been described by clinicians as monotonous, uninteresting, and lacking energy. These differences could enable the detection of depression through the analysis of patients' vocal acoustics (Jiang et al. 2018).

Machine learning is an active field of research, with successful applications to several problems in the health sciences, such as breast cancer diagnosis (Cordeiro et al. 2012; Cordeiro et al. 2016; de Lima et al. 2016; Rodrigues et al. 2019), Alzheimer’s disease diagnosis support based on neuroanatomical features (dos Santos et al. 2009; dos Santos et al. 2008; dos Santos et al. 2007), multiple sclerosis diagnosis (Commowick et al. 2018), and many applied neuroscience solutions (da Silva Junior et al. 2019; de Freitas et al. 2019).

Qualitative changes in the speech of people suffering from depression were reported decades ago (Darby and Hollien 1977), e.g., reduction in pitch range (Vanello et al. 2012), an increased number of pauses (Mundt et al. 2012), slower speech (Faurholt-Jepsen et al. 2016), and reduced intensity or loudness (Hönig et al. 2014).

For the recognition of changes in mood state, prosodic, phonetic, and spectral aspects of voice are relevant, in particular fundamental frequency (F0) or pitch, intensity, rhythm, speed, jitter, shimmer, energy distribution between formants, and cepstral features. Among these features, jitter is considered important for mood state recognition due to its ability to identify rapid temporary changes in voice (Maxhuni et al. 2016). Alterations of Mel-frequency cepstral coefficients (MFCCs) are also found in depressed individuals. MFCCs provide a parametric representation of the speech signal (Hasan et al. 2004) and have been extensively studied as possible features for the detection of major depressive disorder (Cummins et al. 2014; Jiang et al. 2018).
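To illustrate why jitter and shimmer capture rapid cycle-to-cycle irregularities in voice, here is a minimal sketch of their local (relative) forms. It assumes the per-cycle glottal periods and peak amplitudes have already been extracted; the extraction step itself, and the example values, are hypothetical.

```python
from statistics import mean

def local_jitter(periods):
    """Relative jitter: mean absolute difference between consecutive
    glottal cycle periods, normalized by the mean period."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return mean(diffs) / mean(periods)

def local_shimmer(amplitudes):
    """Relative shimmer: the same measure applied to per-cycle peak amplitudes."""
    diffs = [abs(b - a) for a, b in zip(amplitudes, amplitudes[1:])]
    return mean(diffs) / mean(amplitudes)

# Hypothetical cycle periods (ms) of a slightly irregular voice
print(round(local_jitter([8.0, 8.1, 7.9, 8.2, 8.0]), 4))  # → 0.0249
```

A perfectly steady signal yields zero for both measures, which is why even small non-zero values flag short-lived perturbations.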

In a sample of 57 depressed patients, Cohn et al. (2009) analyzed prosodic and facial expression elements using two machine learning classifiers: support vector machines (SVM) and logistic regression. Their accuracy for the identification of depression was 79–88% for facial expressions and 79% for prosodic features.

Due to its good generalization power, SVM is considered a state-of-the-art classifier (Alghowinem et al. 2013a), with strong performance in the identification of speech pathologies (Arjmandi and Pooyan 2012; Sellam and Jagadeesan 2014; Wang et al. 2011); along with GMM, SVM is the most widely used classification technique operating on voice parameters (Jiang et al. 2017). The strong performance of different SVM kernels on our dataset is consistent with previous findings in the literature. For instance, the SVM RBF kernel was successfully used for the detection of MDD and PTSD using voice quality parameters (Scherer et al. 2013); it has also outperformed other classifiers (MLP, HFS) in handling raw data and has shown strong discriminative power using features like intensity and root mean square (Alghowinem et al. 2013b). Another study reported the superiority of linear models (SVM with a linear kernel and LR) for the detection of depressive and anxiety disorders in early childhood using high-quality vocal features (Mcginnis et al. 2019).

In a study with adolescents, Ooi, Lech, and Brian Allen (2013) used glottal, prosodic, and spectral features and the Teager energy operator to predict early symptoms of depression in that age group, reporting an accuracy of 73% (sensitivity: 79%; specificity: 67%). In another study with a larger sample of adolescents, Low, Maddage, Lech, Sheeber, and Allen (2011) used the above attributes with the addition of cepstral parameters and submitted them to SVM and Gaussian mixture model (GMM) classifiers. They described significant gender differences in classifier performance for detecting depression: 81–87% for males and 72–79% for females.

With emphasis on vocal features for the identification of depression, Hönig et al. (2014) used automatic feature selection to study 34 features: spectral (MFCCs, formants F1 to F4), prosodic (pitch, energy, duration, rhythm), and vocal quality or phonetic features (jitter, shimmer, raw jitter, raw shimmer, logarithmic harmonics-to-noise ratio, spectral harmonicity, and spectral tilt). In agreement with the findings of Low et al. (2011), they reported a slightly higher correlation in males (r = 0.39 for males vs. 0.36 for females). This suggests that clinical depression may lead to more pronounced changes in vocal features in men than in women. Similarly, Jiang et al. (2017) also noticed gender differences in classifier performance, with superior results in males. In a sample of 170 subjects, they investigated the discriminative power of three classifiers for the detection of depression: SVM, GMM, and k-nearest neighbors (kNN). SVM achieved the best results in that study, with an accuracy of 80.30% (sens.: 75%; spec.: 85.29%) for males and 75.96% (sens.: 77.36%; spec.: 74.51%) for females.

Conversely, Higuchi et al. (2018) analyzed pitch (F0), spectral centroid, and five MFCC attributes using polytomous logistic regression for the classification of depression, bipolar disorder, and healthy controls. They did not find any difference between genders. They also reported an overall accuracy of 90.79%, the highest among the studies reviewed for this work. This discrepancy may stem from the fact that some features, such as voice quality parameters, could be more gender-independent than other vocal features like F0 (Scherer et al. 2013). In addition, several previous studies did not perform gender-based classification experiments (Alpert et al. 2001; Cannizzaro et al. 2004; Liu et al. 2015; Mundt et al. 2007, 2012; Ooi et al. 2013). The same approach is adopted in this study.

Another aspect of study design that might influence the performance of an automated instrument is the type of speech task. Spontaneous speech, e.g., social interactions or interviews, tends to yield higher classification performance than reading tasks. This finding suggests that spontaneous speech provides more acoustic variability, improving the recognition of depression (Alghowinem et al. 2013a; Jiang et al. 2017). Moreover, it is likely that depressed individuals can suppress their emotional state during reading tasks, because of the unimportant nature of the read content, their concentration on reading, or both (Mitra and Shriberg 2015).

This work is organized as follows: “Methods” describes in detail the implementation of the proposed voice-based instrument for the detection of depression. “Results” presents our experimental results. “Discussion” analyzes these results and their limitations, and “Conclusion” states our conclusions and suggestions for future work on this subject.

Methods

For this exploratory study, 33 volunteers over 18 years old of both genders were selected and assigned to one of the following groups:

  • Control group: 11 healthy participants (5 females), selected through the Self-Reporting Questionnaire (SRQ-20), a screening instrument for common mental disorders (Gonçalves et al. 2008; K. O. B. Santos et al. 2010).

  • Depression group: 22 patients (17 females) with a previous diagnosis of major depressive disorder, confirmed with the Hamilton Depression Rating Scale—HAM-D 17 (Hamilton 1960).

All individuals from the depression group fulfilled DSM-5 diagnostic criteria for major depressive disorder and had been diagnosed by an independent professional prior to this study. Data for this group was collected in outpatient settings and psychiatric wards of the Hospital das Clínicas, Federal University of Pernambuco, and of the Hospital Ulysses Pernambucano, both in Recife, Northeast Brazil. Participants with coexisting neurological disorders or who made professional use of their voices were excluded. The use of validated psychometric scales aimed to verify previous diagnostic consistency and to assess clinical severity. The mean age of the control group was 30.1 years (± 12.6 years), whereas the mean age of the depression group was 42.9 years (± 13.0 years). There is no standard practice for age control across studies: some selected controls age-matched to their samples (Alghowinem et al. 2013b; Alghowinem et al. 2012; Cummins et al. 2015), while others did not (Afshan et al. 2018; Cannizzaro et al. 2004; Higuchi et al. 2018; Jiang et al. 2017; Joshi et al. 2013; Liu et al. 2015; Ozdas et al. 2004; Scherer et al. 2013). Given this heterogeneity, we follow the majority of the reviewed studies, in which age was not controlled between groups. All participants gave written informed consent, and this study was conducted only after approval by a local Research Ethics Board. Table 1 provides a summary of mean age and scale scores for both groups.

Table 1 Mean age and rating scale scores

For the control group, the SRQ-20 cutoff score was 6/7 (K. O. B. Santos et al. 2010), and for the depression group, the eligibility criterion was a HAM-D score above 7. Consequently, patients suffering from mild to severe depression were selected, as this study aimed to encompass different diagnostic scenarios within depression. The average HAM-D 17 score of the depression group (19.32) corresponds to moderate illness (Zimmerman et al. 2013). By including depressed patients irrespective of disease severity, we believe we have created a database that represents a real-world clinical dataset. To the best of our knowledge, a similar approach was taken in the works of Afshan et al. (2018), Alghowinem et al. (2012), Alpert et al. (2001), Hönig et al. (2014), Jiang et al. (2018), and Ooi et al. (2013).

Acquisitions of voice samples

We used a Tascam™ 16-bit linear PCM recorder at a 44.1 kHz sampling rate, in WAV format, without compression. Audio acquisitions were made during an interview with a psychiatrist in naturalistic settings, i.e., patients from the depression group were recorded during a routine medical evaluation in an outpatient office or hospital ward. More specifically, 17 patients were recorded in an outpatient office, four in a psychiatry ward, and one in an internal medicine ward. After each interview, a clinician applied the HAM-D 17 scale to verify diagnostic suitability and assess clinical severity. Participants from the control group were asked to answer the SRQ-20 in order to verify their eligibility. Recordings for this group were made in different environments, as follows: six in offices or classrooms, three in gyms, and two in their residences. Despite being made in different facilities, recordings from both groups shared similar environmental conditions, such as closed rooms and little background noise interference. No duration limit was set for the recordings. As conversations were recorded in full, voices from the clinician and possible third parties were also captured and had to be removed later. Six of the 22 depressed patients had companions during the recording process, one of whom did not speak; four companions made brief remarks about medications in use; only one companion interfered significantly with the clinical evaluation, and the corresponding excerpt was entirely discarded. That said, we believe the presence of companions did not hinder the process of data acquisition. The total recorded time was 425.1 min (7.09 h). Figure 1 summarizes the process of data acquisition.

Fig. 1
figure 1

Block diagram of voice acquisition

Audio editing

After voice acquisition, we used Audacity™ audio software to remove voice signals from the interviewer and any companions. Editing was performed manually and yielded 271 min (4.52 h) of voice signals from participants: 96.9 min for the control group and 174.1 min for the depression group. Table 2 provides detailed information for both groups.

Table 2 Recording duration after audio editing

Feature extraction

All recordings were submitted to vocal feature extraction in GNU Octave™, a free, open-source signal-processing environment. We used rectangular windows with a frame length of 10 s and 50% overlap. As raw audio data was used, no filtering was applied. During this stage, we extracted the following 33 features: skewness; kurtosis; zero crossing rate; slope sign changes; variance; standard deviation; mean absolute value; logarithm detector; root mean square; average amplitude change; difference absolute deviation; integrated absolute value; mean logarithm kernel; simple square integral; mean value; third, fourth, and fifth moments; maximum amplitude; power spectrum ratio; peak frequency; mean power; mean frequency; median frequency; total power; variance of central frequency; first, second, and third spectral moments; Hjorth parameters (activity, mobility, and complexity); and waveform length. We aimed to include as broad a scope of features as possible, avoiding manual selection, which could introduce a priori knowledge undesirable for this study.
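The framing scheme and a few of the time-domain features above can be sketched as follows. This is an illustrative Python reimplementation, not our actual GNU Octave code, and the toy signal and frame length are hypothetical (at 44.1 kHz, a real 10 s frame would be 441,000 samples).

```python
import math

def frames(signal, frame_len, overlap=0.5):
    """Split a signal into rectangular frames with fractional overlap."""
    step = int(frame_len * (1 - overlap))
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, step)]

def frame_features(x):
    """A few of the 33 time-domain features named above."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    sd = math.sqrt(var)
    return {
        "zero_crossings": sum(1 for a, b in zip(x, x[1:]) if a * b < 0),
        "rms": math.sqrt(sum(v * v for v in x) / n),
        "skewness": (sum((v - mu) ** 3 for v in x) / n) / sd ** 3 if sd else 0.0,
        "waveform_length": sum(abs(b - a) for a, b in zip(x, x[1:])),
    }

sig = [0.0, 1.0, -1.0, 2.0, -2.0, 1.0, 0.5, -0.5]  # toy signal
for fr in frames(sig, frame_len=4):
    print(frame_features(fr)["zero_crossings"])  # → 2, 3, 2
```

Each frame then contributes one feature vector to the dataset, so a single recording yields many training instances.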

This broad set of characteristics combines representations in the time and frequency domains. It is an assumption of this work to use a signal representation that does not depend on human specialist knowledge, avoiding potential representation biases. This approach assumes no prior knowledge of the origin and nature of the signal over time, and it has yielded good results with electroencephalographic and electromyographic signals, where rectangular windows were also used (da Silva Júnior et al. 2019; de Freitas et al. 2019). Although audio signals have very different characteristics from those signals, and comparable results are therefore not guaranteed, those findings are good motivation for investigating the same approach in audio signals.

In order to investigate the influence of a reduced feature set and possible gains in model generalization, we added a feature selection stage based on genetic algorithms, using a J48 decision tree as the objective function. The genetic algorithm parameters were set empirically as follows: crossover probability of 0.6, mutation probability of 0.1, population size of 20 individuals, and 500 generations. Individuals are represented as binary vectors with 33 positions, each position indicating whether the corresponding feature is selected. The objective/fitness function guides the heuristic search according to overall classification accuracy.
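The feature selection stage can be sketched as follows, with the stated parameters (population 20, crossover 0.6, mutation 0.1). The `toy` objective is a hypothetical stand-in for what our actual fitness function computes, namely the cross-validated accuracy of a J48 decision tree trained on the masked feature set.

```python
import random

N_FEATURES = 33

def select_features(evaluate_accuracy, pop_size=20, generations=500,
                    p_cross=0.6, p_mut=0.1, seed=0):
    """Genetic search over binary feature masks, maximizing accuracy."""
    rng = random.Random(seed)

    def fitness(mask):
        return evaluate_accuracy(mask) if any(mask) else 0.0

    pop = [[rng.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        nxt = pop[:2]                                  # elitism: keep the two best
        while len(nxt) < pop_size:
            a, b = rng.sample(pop[:pop_size // 2], 2)  # parents from the fitter half
            if rng.random() < p_cross:                 # one-point crossover
                cut = rng.randrange(1, N_FEATURES)
                a = a[:cut] + b[cut:]
            nxt.append([g ^ 1 if rng.random() < p_mut else g for g in a])  # bit-flip
        pop = nxt
    return max(pop, key=fitness)

# Toy stand-in for J48 accuracy: pretend features 0-4 carry the signal.
toy = lambda mask: sum(mask[:5]) - 0.01 * sum(mask[5:])
best = select_features(toy, generations=50)
```

With the real J48 objective, each fitness call trains and cross-validates a tree, so run time is dominated by the roughly 20 × 500 evaluations.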

Classification experiments

Classes were balanced by adding artificial instances in the Weka™ machine learning environment. Balancing was performed with Weka's class balancer method, which estimates the parameters of a uniform distribution from the instances of the minority class and generates random instances until both classes have the same number of instances. This step avoids computational bias towards the better-represented class, in this case the depression class. All experiments were performed using tenfold cross-validation. Figure 2 summarizes the steps of our proposed solution. The hyperparameters of the classifiers were determined empirically, based on the experience of the research group and on results from the related literature.
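The balancing step can be sketched as follows. This is an illustrative reading of the procedure described above (uniform sampling within the minority class's per-feature range), not Weka's actual implementation, and the class data shown is hypothetical.

```python
import random

def balance_by_uniform_oversampling(majority, minority, seed=0):
    """Oversample the minority class by drawing synthetic feature vectors
    from a uniform distribution fitted, per feature, to the observed
    min-max range of the minority class, until class sizes match."""
    rng = random.Random(seed)
    lo = [min(col) for col in zip(*minority)]
    hi = [max(col) for col in zip(*minority)]
    synthetic = [[rng.uniform(l, h) for l, h in zip(lo, hi)]
                 for _ in range(len(majority) - len(minority))]
    return majority, minority + synthetic

maj = [[1.0, 2.0]] * 4            # e.g., depression class (more instances)
mino = [[0.0, 0.0], [2.0, 4.0]]   # e.g., control class
maj, mino = balance_by_uniform_oversampling(maj, mino)
```

Because the synthetic instances stay inside the minority class's observed range, the classifier never sees values outside what that class actually produced.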

Fig. 2
figure 2

Block diagram of the proposed solution, considering audio editing, vocal feature extraction, and the binary classification between depressed subjects and healthy controls

Feature vectors were submitted to experiments with the following ML algorithms in Weka™: multilayer perceptron, logistic regression, random forests, decision trees, Bayes net, Naïve Bayes, and support vector machines with different kernels (linear, polynomial, radial basis function or RBF, and PUK), configured as follows:

  • Multilayer perceptron (MLP): learning rate 0.3, momentum 0.2, 50 neurons in the hidden layer

  • Random forests (RF): 10, 50, and 100 trees

  • Support vector machines (SVM):

    • Parameter C for 0.01, 0.1, and 1.0

    • Polynomial kernel: degrees 1 (linear), 2, and 3

    • Radial basis function (RBF) kernel: Gamma for 0.01, 0.25, and 0.5, i.e., a small value, a medium value, and a value corresponding to a pure Gaussian curve, respectively

    • Pearson Universal VII Kernel (PUK) with preset hyper-parameters

  • Bayes Network

  • Naïve Bayes classifier

  • Logistic regression classifier
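The experimental grid implied by the list above can be enumerated as follows. The dictionary keys are hypothetical labels of our own, with each configuration standing in for one Weka™ classifier setup.

```python
# Hypothetical enumeration of the configurations listed above; each entry
# would be handed to the corresponding Weka classifier.
kernel_extras = {
    "poly": [{"degree": d} for d in (1, 2, 3)],   # degree 1 is the linear kernel
    "rbf": [{"gamma": g} for g in (0.01, 0.25, 0.5)],
    "puk": [{}],                                  # preset hyper-parameters
}
svm_grid = [
    {"clf": "SVM", "kernel": k, "C": c, **extra}
    for c in (0.01, 0.1, 1.0)
    for k, extras in kernel_extras.items()
    for extra in extras
]
rf_grid = [{"clf": "RandomForest", "n_trees": n} for n in (10, 50, 100)]
mlp_grid = [{"clf": "MLP", "lr": 0.3, "momentum": 0.2, "hidden": 50}]
other_grid = [{"clf": "BayesNet"}, {"clf": "NaiveBayes"}, {"clf": "LogisticRegression"}]

grid = svm_grid + rf_grid + mlp_grid + other_grid
print(len(grid))  # 21 SVM + 3 RF + 1 MLP + 3 others = 28
```

Enumerating the grid up front makes it straightforward to run every configuration under the same cross-validation protocol.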

Results

Experiments for this exploratory study were initially run with default settings in Weka™. After this, we tested different setups for all algorithms with adjustable settings (MLP; SVM with polynomial, normalized polynomial, and PUK kernels; and random forest), using tenfold cross-validation with 30 repetitions for each configuration. Tables 3 and 4 describe in detail our best results for each ML model, considering overall accuracy, kappa index, sensitivity, and specificity (sample mean and standard deviation), for the datasets without and with automatic feature selection, respectively. Figures 3 and 4 present boxplots of multiple configurations for overall accuracy, kappa index, sensitivity, and specificity, for the datasets without and with automatic feature selection using genetic algorithms, in this order. For SVM with polynomial, PUK, and RBF kernels, we present only the best results for each kernel type.
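The tenfold cross-validation underlying these experiments can be sketched as a shuffled split of instance indices; repeating the split with 30 different seeds corresponds to the 30 repetitions per configuration. The fold count and instance count in the usage line are illustrative.

```python
import random

def tenfold_indices(n, seed=0):
    """Shuffle instance indices and deal them into 10 near-equal folds;
    each fold serves once as the test set while the other nine train."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

# Seeds 0..29 would give the 30 repetitions for one configuration.
folds = tenfold_indices(100)
```

Averaging the per-fold metrics over all repetitions yields the sample means and standard deviations reported in Tables 3 and 4.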

Table 3 Classification performance of multiple classifier configurations without automatic feature selection for binary classification (control vs. depression)
Table 4 Classification performance for multiple machine learning techniques with automatic feature selection for binary classification (control vs. depression)
Fig. 3
figure 3

Boxplots with classification performance to distinguish depressive patients from the control group using all extracted attributes. Random forest with 100 trees performed better, with highest values of accuracy, kappa index, and specificity. In addition, it achieved a reasonable value of sensitivity. In the case of SVMs, only the configurations with the best results for each type of kernel were plotted

Fig. 4
figure 4

Classification performance to distinguish depressive patients from the control group using selected attributes by an automated method. The random forest achieved better results, being visually similar when tested with 50 or 100 trees

Discussion

Analyzing Table 3, when using all extracted attributes, we notice that mean classification accuracy varied widely for the SVM classifier (50.4675–84.5768%), depending on the kernel used. Furthermore, to the best of our knowledge, this is the first study to compare the performance of several SVM kernels for the detection of depression. This highlights the need for further studies on the impact of different kernels on the discriminative power of SVM classifiers.

However, random forest with 100 trees provided the highest accuracy (87.5575% ± 1.9490) among all classifiers in this study. Similarly, the kappa index and specificity values were the highest for this configuration: 0.7508 ± 0.0319 and 0.8354 ± 0.0254, respectively. In contrast, SVMs with an RBF kernel showed the highest sensitivity values, with a mean of 1 in some cases, but performed poorly on the other metrics. In addition, classifiers based on Bayes' theorem showed inferior results, which may indicate that the attributes considered in this study are statistically dependent.

It is also important to highlight that the automatic selection of attributes worsened classification performance (Table 4). This suggests that the use of all 33 extracted attributes is important for the binary classification. As in the original scenario, random forest with 100 trees showed the best results after feature selection with genetic algorithms, with a mean accuracy of 80.5193% ± 1.9490, a mean kappa index of 0.6100 ± 0.0391, and mean sensitivity and specificity of 0.8548 ± 0.0243 and 0.7547 ± 0.0323, respectively.

The confusion matrix for the best classifier (random forest with 100 trees on all attributes) is shown in Table 5. As can be seen, 89.72% of participants belonging to the control group were classified as control, while 83.74% of patients with depression were classified correctly. It is important to notice that confusion was greater for the depression group, with 16.26% of depressed patients classified as healthy.

Table 5 Confusion matrix for the model with the highest performance classifier (random forest with 100 trees), considering all extracted attributes
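The performance figures discussed here all follow from the confusion matrix. A minimal sketch, using hypothetical raw counts (Table 5 reports percentages; the counts below are illustrative only):

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity, and Cohen's kappa from a 2x2
    confusion matrix, taking depression as the positive class."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n
    sens = tp / (tp + fn)  # depressed patients correctly flagged
    spec = tn / (tn + fp)  # controls correctly identified
    # Chance agreement for kappa, from the row/column marginals
    pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (acc - pe) / (1 - pe)
    return acc, sens, spec, kappa

# Hypothetical counts for a 200-instance split (not the actual Table 5 data)
acc, sens, spec, kappa = binary_metrics(tp=84, fn=16, fp=10, tn=90)
```

Kappa discounts the agreement expected by chance, which is why it is a stricter summary than raw accuracy for balanced two-class problems like this one.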

As mentioned earlier, except for the work of Higuchi et al. (2018), we achieved better results than previously published studies. Our work provided high classification accuracy for both depressed and healthy individuals. However, it is important to note that our small sample size may limit statistical interpretation. Factors that may influence vocal acoustic properties, such as smoking history, pharmacotherapy, and demographic variables (age, gender, and educational level), were not controlled and represent a limitation. In a prospective study, we aim to control possible confounders and repeat the same experiments with more participants.

Conclusion

Current psychiatric diagnosis still lacks objective biomarkers and relies mostly on specialist opinion based on diagnostic manuals. Nevertheless, such diagnostic systems have been heavily criticized for their lack of correlation with the neurobiology and etiopathogenesis of mental disorders. Among these disorders, depression presents with vocal acoustic alterations that may be used as objective parameters for its identification.

Therefore, this exploratory study focused on the development of an auxiliary instrument for the diagnosis of depressive disorders. To this end, we extracted vocal acoustic features and performed experiments using different automated classification techniques, including some of the most widely used classifiers. Our results suggest the viability of a machine learning tool for the detection and even screening of major depressive disorder in a cost-effective and non-invasive manner. In future studies, we intend to perform the same experiments on a larger sample with age-matched groups, as well as with controlled disease severity and gender-based datasets. With this approach, we aim to evaluate the impact of depression severity on vocal acoustic parameters. We would also like to assess whether depression affects the vocal acoustic properties of men and women differently and how such differences influence the performance of automated classifiers.