
1 Introduction

The increasing prevalence of clinical depression has been linked to a range of serious outcomes. Depression is a common mental disorder that persists over long periods and heavily burdens patients, their families and society. It is associated with half of all suicides and a significant economic cost [1]. The World Health Organization (WHO) estimates that about 350 million people of all ages suffer from this disease [2]. Moreover, depression was projected to become the second greatest disease burden in the world by the year 2020.

However, current depression diagnosis relies almost entirely on patient self-report and professional interviews assessing symptom severity [3]. Patient self-report instruments, such as the Self-rating Depression Scale (SDS) [4], risk a range of subjective biases. Similarly, the outcome of a professional interview varies with the clinician's experience and the diagnostic criteria used (e.g., the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) [5]). An objective and convenient method for depression evaluation is therefore necessary.

Developments in affective sensing technology (e.g., facial expression, body gesture, speech, motion, eye movement, etc.) may enable an objective depression evaluation method. Among these modalities, speech signals can be collected easily with non-invasive, portable instruments. The voices of depressed individuals are perceived as monotonous, slurred, low in intensity and lacking in fluctuation [6]. Vocal characteristics have been shown to change with a speaker's mental condition and emotional state [7–9]. Such changes are complicated processes involving the coordination of several brain areas and peripheral muscle control [10]. Moreover, studies support the feasibility and validity of vocal acoustic measures of depression severity [11, 12]. We therefore focus on the speech analysis of depressed patients.

Early research focused on the correlation between depression and particular speech features [13, 14]. Many experiments have been conducted to reveal the relevance of various acoustic features to depression, such as pitch, jitter, speaking rate, formants, Mel-Frequency Cepstral Coefficients (MFCC) and so on. Low et al. [15] and Mundt et al. [3] illustrated the relation between depressive severity and several acoustic features. More recently, approaches for the automatic detection of depression have been investigated. Alghowinem et al. investigated and compared different features for depression classification, and found that spontaneous speech gives better results than reading [16]. Many researchers believe that optimizing feature combinations may improve recognition accuracy; Moore et al. proposed new feature sets with good performance on depression classification [7].

In this paper, we hypothesize that speech signals correlate with the severity of depression. To validate this hypothesis, we take two steps: first, choose a feature set by comparing the classification accuracy in different tasks; second, explore the correlation between this feature set and the severity of depression.

The rest of this paper is organized as follows: Sect. 2 presents the details of our method and experiment in seven parts: the participants and their basic information, the procedure of the experiment, data collection, data preprocessing and feature extraction, feature selection, classifiers, and correlation analysis. In Sect. 3, we report the results of our experiment. Following this, we present a discussion in Sect. 4, and conclusions are drawn in Sect. 5.

2 Method

2.1 Participants

Data from 111 participants (54 males, 57 females) in an ongoing study in Beijing and Lanzhou, China, were used for experimental validation. The participants, aged 18–55, were screened by psychiatrists following the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV). All participants were asked to sign an informed consent form, provide basic information and complete a series of scales. The basic information of the subjects is summarized in Table 1.

Table 1. Basic information of subjects

All participants were interviewed by a psychiatrist to complete the Patient Health Questionnaire-9 (PHQ-9) [17]. They were divided into three groups according to their PHQ-9 scores: 38 healthy control subjects (PHQ-9 < 5), 36 mildly depressed patients (5 ≤ PHQ-9 < 17) and 37 severely depressed patients (PHQ-9 ≥ 17). This three-group division captures the trend of speech features across severity levels while keeping a relatively large number of subjects in each group. The results are shown in Table 2.

Table 2. Basic information of groups
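
For concreteness, the grouping rule can be expressed as a small Python helper (a minimal sketch; the cut-offs are exactly those stated above):

```python
def phq9_group(score: int) -> str:
    """Assign a subject to a study group from the PHQ-9 total score,
    using the cut-offs described above."""
    if score < 5:
        return "healthy control"
    elif score < 17:       # 5 <= score < 17
        return "mild depressive"
    else:                  # score >= 17
        return "severe depressive"
```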

2.2 The Procedure of Experiment

Our experiment comprises three parts: interview, reading and picture description. Each part is divided into three groups according to its induced emotion: positive, neutral and negative. To counteract sequence effects of the evoked emotions, the emotion order was assigned randomly for each participant. Details of the experiment follow below.

Interview.

The interview part consisted of 18 questions, divided into three groups according to emotional valence: 6 positive, 6 neutral and 6 negative. The topics came from the DSM-IV and from depression scales that are often used in diagnosing depressive disorder. For example: What is your favorite TV program? What is the best gift you have ever received [18]? Please describe one of your friends. How do you evaluate yourself? What makes you desperate?

Reading.

This part consisted of a short story, “The North Wind and the Sun”, which is often used for acoustic analysis in international, multilingual clinical research, and three groups of words with positive (e.g., outstanding, happy), neutral (e.g., center, since) and negative (e.g., depression, wail) emotional valence. The positive and negative words were selected from the affective ontology corpus created by Lin [19], and the neutral ones from the Chinese affective words extremum table [20]. All of them are commonly used Chinese words with similar stroke counts.

Picture Description.

This part comprises four pictures. Three of them, expressing positive (happy), neutral and negative (sad) faces, were selected from the Chinese Facial Affective Picture System (CFAPS); the last one, depicting a “crying woman”, came from the Thematic Apperception Test (TAT) [18]. Participants were asked to describe all four pictures freely.

2.3 Data Collection

We collected the recordings in a clean, quiet and soundproof laboratory. The whole experiment lasted about 25 min per participant. During recording, subjects were asked not to touch any equipment and to keep a distance of about 20 cm between mouth and microphone. A NEUMANN TLM102 microphone and an RME FIREFACE UCX audio interface with a 44.1 kHz sampling rate and 24-bit sampling depth were used to collect the voice signals. All recordings were saved in uncompressed WAV format. Throughout the experiment, ambient noise was required to stay under 60 dB to prevent interference with the subjects' audio signals.

In the experiment, 29 recordings per participant were stored and numbered 1 to 29 in a fixed sequence, as follows: the positive, neutral and negative interview recordings are numbered 1–6, 7–12 and 13–18 respectively; the recording of the short story is numbered 19; the readings of the six word groups are numbered 20–21, 22–23 and 24–25 in the order positive, neutral, negative; recordings 26–28 are the picture descriptions in the same emotion order as the reading part; and the TAT recording is numbered 29.

2.4 Data Preprocessing and Feature Extraction

All recordings were segmented and labeled manually, and only the subjects' voice signals were retained for analysis. Preprocessing mainly includes filtering (a band-pass filter of 60–4500 Hz), framing, windowing and, for some particular features, endpoint detection. Each frame is 25 ms long with 50% overlap. Voice characteristics can be divided into two categories: acoustic and linguistic features [21, 22]. The latter are not analyzed here, since we aim at general characteristics of depressed speech regardless of the language used. Several software tools were employed to extract sound features: the open-source toolkit openSMILE [23], VOICEBOX [24] and Praat [25], yielding 1753-dimensional feature vectors. These features are used in the following feature selection, classification and correlation analysis.
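
To make the preprocessing settings concrete, the sketch below reproduces the filtering, framing and windowing chain in Python with SciPy. It is illustrative only; the actual features were extracted with openSMILE, VOICEBOX and Praat, and the window type is our assumption since it is not specified above.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def preprocess(path, low=60.0, high=4500.0, frame_ms=25.0, overlap=0.5):
    """Band-pass filter a recording (60-4500 Hz) and cut it into
    25 ms frames with 50% overlap, as described above.
    Assumes a mono WAV recording longer than one frame."""
    sr, x = wavfile.read(path)
    x = x.astype(np.float64)
    # 4th-order Butterworth band-pass filter
    sos = butter(4, [low, high], btype="bandpass", fs=sr, output="sos")
    x = sosfiltfilt(sos, x)
    # framing: 25 ms frames, 50% overlap
    frame_len = int(sr * frame_ms / 1000.0)
    hop = int(frame_len * (1.0 - overlap))
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # windowing (Hamming window assumed)
    return frames * np.hamming(frame_len)
```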

Two steps yield the final acoustic feature subsets of the speech signal. First, the recordings of the story (19) and the TAT (29) are excluded in this paper, so only 27 recordings per subject were analysed. Second, for each participant, we compute the average value of every feature over the recordings that share the same part and induced emotion. For example, recordings 1–6 belong to the interview in positive emotion, and the mean values of all their features are stored as Data_1 (see Table 3). The details are presented in Table 3.

Table 3. Names of nine data sets
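
A possible implementation of this averaging step, assuming the per-recording features are held in a pandas DataFrame indexed by subject and recording number (the recording-to-data-set mapping follows the numbering in Sect. 2.3; the column layout is our assumption):

```python
import pandas as pd

# recording numbers -> data set (story 19 and TAT 29 are excluded)
RECORDING_GROUPS = {
    "Data_1": range(1, 7),    # interview, positive
    "Data_2": range(7, 13),   # interview, neutral
    "Data_3": range(13, 19),  # interview, negative
    "Data_4": range(20, 22),  # reading, positive
    "Data_5": range(22, 24),  # reading, neutral
    "Data_6": range(24, 26),  # reading, negative
    "Data_7": [26],           # picture description, positive
    "Data_8": [27],           # picture description, neutral
    "Data_9": [28],           # picture description, negative
}

def build_datasets(feats: pd.DataFrame) -> dict:
    """feats: one row per recording, MultiIndex (subject, recording_no),
    one column per feature. Returns one mean feature vector per subject
    for each of the nine data sets."""
    datasets = {}
    for name, recs in RECORDING_GROUPS.items():
        mask = feats.index.get_level_values("recording_no").isin(list(recs))
        datasets[name] = feats[mask].groupby(level="subject").mean()
    return datasets
```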

2.5 Feature Selection

Feature selection means choosing, from the universal feature set, the features that are effective for classification. It is a critical preprocessing problem in data mining, needed to cope with the curse of dimensionality [26]. In our experiment, we use a two-stage feature selection method that combines a filter and a wrapper to reduce the feature dimension. A filter approach uses only the data themselves to decide which features to keep; in general, it offers an efficient search strategy at some cost in accuracy. A wrapper method, by “wrapping” the accuracy of a classifier into the selection, can achieve better performance than a filter. Combining both yields an efficient and effective method.

The details of our two-stage feature selection method are as follows. We use the minimal-redundancy-maximal-relevance (mRMR) criterion [27] as the filter and the Sequential Forward Floating Selection (SFFS) algorithm [28] as the search strategy of the wrapper. In the first stage, a candidate subset is selected from the universal feature set by mRMR. In the second stage, the final subset is obtained from the candidate subset by SFFS; this final feature subset is used in the following analysis. Throughout this process, a Support Vector Machine (SVM) [29] under a Leave-One-Out Cross-Validation (LOOCV) scheme is employed for evaluation and testing. The feature selection scheme is carried out on the nine data sets separately, yielding nine feature subsets, which we name fs_1, fs_2, …, fs_9.
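
The exact implementations are not specified here, but the two-stage scheme can be sketched with scikit-learn and mlxtend (whose SequentialFeatureSelector supports floating search). X and y stand for one data set's feature matrix and group labels, and the candidate-set size k is our assumption:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

def mrmr(X, y, k=100):
    """Greedy mRMR (difference scheme): at each step pick the feature
    maximising relevance to the label minus mean redundancy with the
    already-selected features."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                for s in selected])
            if relevance[j] - redundancy > best_score:
                best, best_score = j, relevance[j] - redundancy
        selected.append(best)
    return selected

# Stage 1: mRMR filter reduces the 1753 dimensions to a candidate subset.
candidates = mrmr(X, y, k=100)
# Stage 2: SFFS wrapper around an RBF-kernel SVM, scored with LOOCV.
sffs = SFS(SVC(kernel="rbf"), k_features="best", forward=True,
           floating=True, scoring="accuracy", cv=LeaveOneOut())
sffs = sffs.fit(X[:, candidates], y)
final_subset = [candidates[i] for i in sffs.k_feature_idx_]
```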

2.6 Classifier

We evaluate each feature subset by treating the measurement of depression severity as a pattern classification problem. Three widely used classifiers were employed: SVM, Naïve Bayes (NB) [29] and Random Forest (RF) [30]. For the SVM, the Radial Basis Function (RBF) kernel in the LIBSVM package [31] was chosen. Because the sample size is small compared with the dimensionality of the feature set, we use the LOOCV scheme for testing. LOOCV is a special case of cross-validation: in each round, one sample is used for testing and the rest for training; this is repeated for every sample, and the result is the average accuracy over all rounds.
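
Under these settings, the evaluation loop looks roughly as follows (a sketch with scikit-learn, whose SVC wraps LIBSVM; X and y are the selected features and group labels, and the standardization step is our assumption):

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

classifiers = {
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
# LOOCV: each subject is the test sample exactly once; the reported
# accuracy is the mean over all rounds.
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print(f"{name}: LOOCV accuracy = {acc:.3f}")
```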

2.7 Correlation Analysis

Our main goal is to explore the correlation between vocal features and depression severity. The PHQ-9 is a brief depression assessment instrument with severity categories. It is the depression module of the Primary Care Evaluation of Mental Disorders [32, 33], designed for use in primary care [34], and scores each of the nine DSM-IV criteria on a severity scale from “0” (not at all) to “3” (nearly every day). In our study, one or more feature subsets, selected from the nine on the basis of classification accuracy, are used to explore the relation between voice and depression severity. Principal Component Analysis (PCA) is applied to the normalized data of these feature subsets. We then examine Pearson's correlation coefficient (r) and the corresponding significance level (ρ) between the first principal component (FPC) and the PHQ-9 score, with significance tested by a t-test.
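
The correlation step can be sketched as follows (assuming z-score normalization; X holds one selected feature subset and phq9 the subjects' PHQ-9 scores). Note that scipy's pearsonr already derives its p-value from a t-test, matching the procedure described above:

```python
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# normalize, then project onto the first principal component (FPC)
X_norm = StandardScaler().fit_transform(X)
fpc = PCA(n_components=1).fit_transform(X_norm).ravel()

# Pearson's r and its significance level
r, p = pearsonr(fpc, phq9)
print(f"r = {r:.3f}, p = {p:.4f}")
```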

3 Result

Table 4 shows the average classification accuracy of the three groups with three classifiers on the nine feature subsets. Although, for males, the accuracy of the interview on positive and negative emotions is inferior to reading or picture description, the interview achieves the best average accuracy and the smallest standard deviation. For both males and females, the interview is therefore the best of the three patterns for speech signal collection.

Table 4. Classification accuracy using data on nine feature subsets separately

Table 5 presents the Pearson correlation coefficients, with significance levels, between the FPCs of fs_1, fs_2 and fs_3 on the interview data and the PHQ-9 scores. The values of r and ρ show that these FPCs are related to depression severity at a moderate level and that the correlation is statistically significant for both males and females.

Table 5. Pearson's correlation coefficient (r) and corresponding significance level (ρ) between the FPC of fs_1, fs_2 and fs_3 in the interview and the PHQ-9 score

Figures 1, 2 and 3 show scatter diagrams of the FPCs from Data_1, Data_2 and Data_3 against the PHQ-9 scores, allowing the linear correlation under each emotion to be observed directly. The negative questions perform better than the positive and neutral ones for both males and females, and all correlation coefficients have opposite signs between genders.

Fig. 1. Scatter diagram of FPC and PHQ-9 score on the data of positive interview

Fig. 2. Scatter diagram of FPC and PHQ-9 score on the data of neutral interview

Fig. 3. Scatter diagram of FPC and PHQ-9 score on the data of negative interview

4 Discussion

Our research aims at exploring the correlation between acoustic features and depression in order to evaluate depression severity. From the results above, we draw three points. First, the average classification accuracies (male: 0.57, female: 0.52) are probably too limited for a real system, but they are well above chance level. Second, for both males and females, the interview is the best of the three patterns for collecting speech signals to evaluate the severity of depression. Third, the correlations between the feature subsets from the interview and the PHQ-9 show that depressive severity is related to speech at a moderate level and that the correlation is statistically significant.

The recording pattern may influence classification performance. In our experiment, the interview performed better than reading and picture description, which is consistent with the conclusion of Alghowinem et al. [16] that spontaneous speech gives better results than reading. Both interview and picture description can be considered spontaneous speech; however, picture description performed worse than interview. We speculate that because most interview questions concern the subjects themselves, subjects enter an emotional state more easily.

In further study, we intend to seek a more stable feature subset for depression assessment with a larger number of participants, and to combine speech features with other physiological and behavioral features (e.g., facial expression, gait, head movement) to improve classification accuracy.

5 Conclusion

Our work aims at an objective diagnostic aid to support clinicians in evaluating the severity of depression. The results confirmed our hypothesis through an examination of subjects' acoustic features under the interview, reading and picture description patterns. Speech may thus be considered a biomarker of depressive severity, and the interview is a proper way to obtain effective speech signals for depression assessment. The statistically significant correlation between the FPC of the speech feature subsets and the PHQ-9 score indicates that there may exist feature sets that can be used to evaluate depression severity.