Introduction

Current clinical practice in psychiatry depends on diagnostic criteria built entirely on expert consensus rather than on objective biomarkers (Bzdok and Meyer-Lindenberg 2018). Such criteria, described in the Diagnostic and Statistical Manual of Mental Disorders, 5th Edition (DSM-5), and in the International Classification of Diseases (ICD-10), are still considered the gold standard for diagnosis in psychiatry (American Psychiatric Association 2013). Nevertheless, these diagnostic systems have been criticized for their lack of clinical predictability and neurological validity (Bzdok and Meyer-Lindenberg 2018), and for their poor diagnostic stability (Baca-Garcia et al. 2007). This ultimately leads to trial-and-error treatment (Petzschner et al. 2017). While other medical fields rely on markers of disease presence and severity, such as tumor volume measurement and biochemical blood tests, psychiatry still lacks routine objective tests (Bedi et al. 2015; Mundt et al. 2012).

Assessment and treatment in psychiatry have historically been based on patient reports and clinical evaluation (Mundt et al. 2007). This makes diagnosis and therapeutic decisions highly sensitive to memory and subjectivity biases (Jiang et al. 2018). In this context, there has been an intense search over the last decade for biomarkers for the diagnosis and follow-up of psychiatric patients (Iwabuchi et al. 2013; Mundt et al. 2012). However, most candidate biomarkers are expensive and invasive (Higuchi et al. 2018). Therefore, despite all efforts, objective measures for the assessment of mental disorders are still lacking (Mundt et al. 2007).

Another major challenge for psychiatry is that nosology and clinical practice do not benefit from advances in the neurosciences. These difficulties can be tackled by computational psychiatry, which applies machine learning (ML) with a focus on clinical applications and single-subject treatments (Bzdok and Meyer-Lindenberg 2018; Petzschner et al. 2017). Machine learning has been successfully applied to problem-solving tasks in several medical fields, for example in supportive diagnostic tools based on neuroanatomical structures for Alzheimer’s disease (dos Santos et al. 2009; W. P. dos Santos et al. 2007), breast cancer (Cruz et al. 2018; de Lima et al. 2016; de Santana et al. 2018), and multiple sclerosis diagnosis (Commowick et al. 2018).

Schizophrenia is a group of severe psychotic disorders with heterogeneous etiologies, clinical presentations and responses to treatment (Sadock et al. 2017). It is characterized by hallucinations, delusions, thought and behavior disorder or catatonia, and “negative symptoms,” such as diminished emotional expression and avolition (American Psychiatric Association 2013). Since the first descriptions of this disorder, speech and language deficits have been described as remarkable features of schizophrenia, and they are often associated with core negative symptoms and social impairment (Alberto et al. 2019). These symptoms comprise poverty of speech, disorganized speech, derailment, tangentiality, neologism, incoherence, mutism, perseveration, echolalia, thought blocking (Mac-Kay et al. 2018), and inappropriate affect prosody or aprosodia (Chakraborty et al. 2018a; Covington et al. 2012; Elite et al. 2014). Also known as flattened speech intonation, aprosodia consists of diminished vocal emphasis (Alpert and Anderson 1977); reduced inflection and fluency (Alpert et al. 2000); and prosody comprehension deficits, such as difficulties in recognizing intonation patterns (Elite et al. 2014). Overall, these speech abnormalities result from disruptions in cognitive processes and contribute to the frequent communication deficits in schizophrenia (Mac-Kay et al. 2018).

In this framework, computational psychiatry has proven to be a promising method for dealing with the complexity of psychiatric diagnosis, translating neuroscientific advances into clinical applications. Its data-driven approach applies machine learning techniques to high-dimensional data in order to improve diagnostic classification, treatment selection and even treatment outcomes (Huys et al. 2016). ML models are appropriate for individual-level predictions, which could support personalized therapeutic decisions in the future (Bzdok and Meyer-Lindenberg 2018). Moreover, they may also enable mobile monitoring of patients and telemedicine applications that are accessible for clinical use (Cohen et al. 2016). In the context of speech-language deficits, vocal acoustic analyses using ML classifiers appear to be a promising avenue for understanding their role within mental disorders (Cohen et al. 2012).

With this in mind, this work proposes applying ML techniques to audio recordings to perform binary classification. To this end, we collected data from 31 participants divided into two groups: patients diagnosed with schizophrenia and a control group of healthy participants. We pre-processed all recordings in order to minimize environmental noise. We then extracted 33 features from each 10 s window of the signals. Finally, multiple classifiers were tested. Our goal is to provide an intelligent tool that performs accurate and non-invasive schizophrenia diagnosis at low computational cost.

This paper is organized as follows: Section 2 describes studies related to the characterization of schizophrenia based on vocal parameters. In Section 3, an instrument for the detection of schizophrenia is introduced and implemented. Results are presented and discussed in Sections 4 and 5, respectively. Section 6 states our conclusions with suggestions for future studies on this subject.

Related works

As speech-language abnormalities are a hallmark of schizophrenia, several related studies have been published. Most of them focus on natural language processing and semantics/syntax (Bedi et al. 2015; Chakraborty et al. 2018a; Elvevåg et al. 2010; Kayi et al. 2017; Tovar et al. 2019), whereas only a limited number address vocal patterns in schizophrenia (Tahir et al. 2019).

Compared with healthy individuals, patients with schizophrenia tend to show slowed speech, reduced pitch variability, a significantly increased number of pauses, and decreased variability in syllable timing. These characteristics were observed in a semi-automatic analysis of vocal pitch or fundamental frequency (F0) during an emotionally neutral reading task performed by Martínez-Sánchez et al. (2015). In a sample of 80 subjects, they reported a discrimination accuracy of 93.8% between schizophrenic patients and controls using signal processing algorithms. They also observed remarkable intergroup differences, with patients exhibiting slowed speech, low volume, and many pauses.

Likewise, Rapcan et al. (2010) compared vocal pitch, temporal, and energy parameters of 39 schizophrenic patients and 18 healthy controls during an emotionally neutral reading task. Their results demonstrated significant differences between groups, with patients showing decreased mean utterance duration and increased number of pauses, proportion of silence, mean pause duration, total length of pauses, and relative variation in energy. On the other hand, no statistically significant difference was reported for total length of utterances or relative variation in vocal pitch. However, because the task involved reading, the lack of matching for educational level between groups may represent an important limitation of their findings: differences in educational status may translate into differences in reading speed and fluency between the patient and control samples.

Vocal acoustic analysis is also capable of measuring the severity of negative symptoms such as aprosodia. Compton et al. (2018) analyzed audio recordings of schizophrenic patients with aprosodia, schizophrenic patients without aprosodia, and healthy controls, and compared variability in pitch (F0), first (F1) and second (F2) formants, and intensity/loudness. Their results showed significant differences among groups, with the group with aprosodia showing reduced variability in pitch, F2, and intensity/loudness compared with the other groups.

Similarly, Covington et al. (2012) analyzed F0, F1, and F2 of 25 video-recorded interviews. They investigated tongue movement as an indicator of the severity of negative symptoms in first-episode schizophrenia-spectrum patients. Their study concluded that F2, a measure of variability of tongue anterior or posterior position, was significantly correlated with the severity of negative symptoms.

Chakraborty et al. (2018b) employed low-level speech signals (or low-level descriptors, LLD) alone or in combination with body movements to predict negative symptoms of schizophrenia using automatic classifiers. For that purpose, they applied support vector machines (SVM), a supervised machine learning technique widely used in classification problems (Russell and Norvig 2016). They reported a classification accuracy of 79.49% using low-level speech signals alone, and of 86.36% for their combination with body movements.

Likewise, Tahir et al. (2019) investigated conversational and prosodic features as objective measures of negative symptoms in schizophrenia. Conversational features relate to duration of speech, speaking turns, interruptions, and interjections, while prosodic features comprise F0, F1, F2, and F3; mel frequency cepstral coefficients (MFCCs); and amplitude (minimum, maximum and mean volume, entropy). The performance of some ML algorithms in discriminating between patients and healthy controls was evaluated in their article: SVM, multilayer perceptron (MLP), random forest (RF), and ensemble (bagging). The best results were reported for MLP (accuracy = 81.3%), with speaking rate, frequency, and volume entropy showing significant differences between groups.

In a meta-analysis of 46 papers about acoustic patterns in schizophrenia, Alberto et al. (2019) compared three categories of study design: qualitative ratings, quantitative univariate analyses, and multivariate ML investigations. Machine learning studies provided superior results, with overall out-of-sample accuracy of 76.5–87.5%, and appeared to be more promising. They also identified remarkable differences in acoustic patterns between schizophrenic patients and healthy controls, with the patient group showing decreased proportion of spoken time, reduced speech rate, and increased duration of pauses. These abnormalities were directly related to flat affect and alogia. Additionally, they observed that studies with dialogical and free speech provided the greatest differences between groups, in contrast with studies using constrained monologs.

Methods

In this study, a sample of 31 volunteers over 18 years old was selected and divided into two subsamples:

  • Healthy control: 11 healthy participants (6 males) were selected through the Self-Reporting Questionnaire (SRQ-20), a screening instrument for common mental disorders (Gonçalves et al. 2008; K. O. B. Santos et al. 2010);

  • Schizophrenia: 20 patients previously diagnosed with schizophrenia (12 males) were assessed using the Brief Psychiatric Rating Scale (BPRS; Overall and Gorham 1962), one of the most widely used instruments for the evaluation of symptom severity in schizophrenia (Leucht et al. 2005).

All individuals from the schizophrenia sample (mean age = 36.00; SD = 12.39; 60.0% male) fulfilled DSM-5 diagnostic criteria for schizophrenia and had previously been diagnosed by an independent psychiatrist. Data for this group were collected at outpatient settings and at inpatient psychiatric units in Hospital das Clínicas, Federal University of Pernambuco, and in Hospital Ulysses Pernambucano, both in Recife, Northeast Brazil. Participants with coexistent neurological disorders or who made professional use of their voices were excluded.

Meanwhile, the control sample (mean age = 30.09; SD = 12.58; 54.5% male) was matched with the patient sample for age, gender and region of origin (Brazilian Northeast). The same exclusion criteria were applied to this group. Participants from both groups were literate, but the control sample had a higher educational level (p < 0.001). Unfortunately, it was not possible to match the subsamples with respect to educational level, as this was a challenging covariate to match in this particular population. Although this represents a limitation of our study, a similar approach was taken in some previous studies (Cannizzaro et al. 2005; Cohen et al. 2008; Rapcan et al. 2010). Sample characteristics are presented in Table 1.

Table 1 Sample characteristics: the 31 participants were divided into two groups: a control group composed of healthy participants, and a group of people diagnosed with schizophrenia. In both groups, there is a predominance of males. The average age of the control group is 30 years, while in the schizophrenia group it is 36 years

The SRQ-20 was used to remove participants with current mental illnesses from the control sample, with a cut-off score of 6/7 (Santos et al. 2010). In the schizophrenia sample, by contrast, participants with a prior diagnosis were included irrespective of their BPRS score. The mean BPRS score of schizophrenic patients in this sample was 44.55, which corresponds to moderate illness severity (Leucht et al. 2005). All participants gave written consent, and this study was conducted only after approval by a local Research Ethics Board.

Acquisitions of voice samples

A Tascam™ 16-bit linear PCM recorder was used at a 44.1 kHz sampling rate, in WAV format, without file compression. Audio recordings of the schizophrenia sample were acquired during an interview with a psychiatrist in naturalistic settings, i.e., patients were recorded during a routine medical assessment at outpatient offices or inpatient units. After each interview, a trained clinician assessed their symptoms using the BPRS. Meanwhile, healthy controls were audio-recorded in different environments (e.g., office, classroom, gym). Participants from this sample were asked to answer the SRQ-20, as this questionnaire is self-administered. No duration limit was set for the recordings. As conversations were recorded in full, the voices of the clinician and of possible third parties were also captured and had to be removed afterwards. The total duration of the recordings of both samples was 407.3 min (6.79 h). The process of data acquisition is summarized in Fig. 1.

Fig. 1

Block diagram of data acquisition: audio recordings of the schizophrenia sample were acquired during an interview with a psychiatrist. After that, a trained clinician assessed their symptoms using the BPRS. Healthy controls were audio-recorded in different environments and were also asked to answer the SRQ-20 questionnaire. BPRS and SRQ-20 scores were calculated. The SRQ-20 cut-off score of 6/7 was applied, while all patients diagnosed with schizophrenia (SCZ) were included regardless of their scores. After participant selection, the audio was edited to remove the voices of anyone other than the participants

Audio editing

After data collection, voice signals from the interviewer and any potential companion were manually removed using Audacity audio software (version 2.3.2). This process yielded 222.6 min of recorded audio from participants (3.71 h) as follows: 96.9 min for the control sample and 125.7 min for the schizophrenia sample. Recording duration of both samples after audio editing is shown in Table 2.

Table 2 Recording duration after audio editing

Feature extraction

All recordings were submitted to vocal feature extraction in GNU Octave™, a free open-source signal-processing software. Rectangular windows with a frame length of 10 s were used. To determine the window overlap percentage, three overlap sizes were tested: 10% (1 s), 25% (2.5 s), and 50% (5 s). For this, the random forest classifier was used. We performed these experiments 30 times, using 10-fold cross-validation in the Weka environment. The boxplots in Fig. 2 show the accuracy results for these three scenarios. As shown in the figure, the 50% overlap outperformed the others, reaching a higher mean accuracy as well as less dispersion.

Fig. 2

Boxplots with comparison of performance using different window overlaps: 10%, 25%, and 50%. The boxplot shows that an overlap of 50% achieved greater accuracy values and less dispersion of values
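The windowing scheme described in this section (10 s rectangular frames with a fixed fractional overlap) can be illustrated with a minimal sketch. This is not the authors' Octave code: the function name `frame_signal`, the handling of the final incomplete frame, and the toy sampling rate are all assumptions made for illustration.

```python
def frame_signal(samples, sample_rate, frame_seconds=10.0, overlap=0.5):
    """Split a 1-D sequence into fixed-length rectangular frames.

    overlap is the fraction of each frame shared with the next one
    (0.5 means consecutive frames share half of their samples).
    Trailing samples that do not fill a whole frame are dropped; this is
    an assumption, as the source does not say how the tail was handled.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    frame_len = int(round(frame_seconds * sample_rate))
    hop = int(round(frame_len * (1.0 - overlap)))
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame length and hop must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# Toy example: 60 s of "audio" at 100 Hz (not 44.1 kHz) to keep it small.
toy_rate = 100
toy_signal = list(range(60 * toy_rate))
frames_50 = frame_signal(toy_signal, toy_rate, 10.0, 0.5)  # hop of 5 s
frames_10 = frame_signal(toy_signal, toy_rate, 10.0, 0.1)  # hop of 9 s
```

A larger overlap yields more windows from the same recording, so the number of instances available to the classifier grows with the overlap percentage.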

As raw audio data were used, no filtering process was applied. Consequently, background noise was also captured. However, we believe such noise would not be able to interfere significantly, given the homogeneous spectral behavior of the acoustic features selected for extraction. At this stage, the following 33 parameters were extracted: skewness; kurtosis; zero crossing; slope sign changes; variance; standard deviation; mean absolute value; logarithm detector; root mean square; average amplitude change; difference absolute deviation; integrated absolute value; mean logarithm kernel; simple square integral; mean value; third, fourth and fifth moments; maximum amplitude; power spectrum ratio; peak frequency; mean power; mean frequency; median frequency; total power; variance of central frequency; first, second and third spectral moments; Hjorth parameter activity, mobility and complexity; and waveform length. The corresponding mathematical expressions of these attributes are presented in Table 3.

Table 3 Equations of the 33 extracted parameters
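To make a few of the listed time-domain attributes concrete, the sketch below computes seven of them for a single window. Because the equations of Table 3 are not reproduced here, the formulas are the common textbook definitions and may differ in detail (for example, thresholds or normalization) from the ones the authors used; the function and key names are hypothetical.

```python
import math


def _variance(values):
    """Population variance (divides by N)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def _diff(values):
    """First-order difference x[i+1] - x[i]."""
    return [b - a for a, b in zip(values, values[1:])]


def time_domain_features(window):
    """Compute a handful of time-domain descriptors for one window.

    A constant (e.g., fully silent) window has zero variance and would
    raise ZeroDivisionError in the Hjorth terms; callers should guard
    against that case.
    """
    n = len(window)
    d1 = _diff(window)   # first difference of the signal
    d2 = _diff(d1)       # second difference of the signal
    activity = _variance(window)
    mobility = math.sqrt(_variance(d1) / activity)
    complexity = math.sqrt(_variance(d2) / _variance(d1)) / mobility
    return {
        "mean_absolute_value": sum(abs(v) for v in window) / n,
        "root_mean_square": math.sqrt(sum(v * v for v in window) / n),
        # Counts strict sign changes between consecutive samples.
        "zero_crossings": sum(1 for a, b in zip(window, window[1:]) if a * b < 0),
        "waveform_length": sum(abs(d) for d in d1),
        "hjorth_activity": activity,
        "hjorth_mobility": mobility,
        "hjorth_complexity": complexity,
    }


# Toy window: an alternating signal, not real speech.
toy_window = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
toy_features = time_domain_features(toy_window)
```

Applying such a function to every window of a recording produces one feature vector per window, which is the form of input the classifiers operate on.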

The above parameters were chosen because they accurately represent the input signals to computational models, since the decision-making process of machine learning classifiers is not tied to human interpretation. Additionally, attributes from different domains (e.g., temporal and spectral) were selected so as to avoid feature selection biases. Furthermore, such parameters have already been used successfully to represent other biomedical signals, such as electroencephalography. Subsequently, the most relevant parameters were selected using particle swarm optimization (PSO), a feature selection method for dimensionality reduction within classification problems (Xue et al. 2012).

Feature selection using particle swarm optimization

Particle swarm optimization (PSO) algorithms were created in 1995 by James Kennedy and Russell Eberhart, a social psychologist and an electrical engineer, respectively (Kennedy and Eberhart 1995). PSO is inspired by the behavior and movement of groups of animals, such as schools of fish and flocks of birds. It is therefore grounded in theories that describe animal social behavior, and it shares elements with genetic algorithms and evolutionary programming (Eberhart and Kennedy 1995; Kennedy and Eberhart 1995; Santos and Assis 2013).

Similar to genetic algorithms, PSO is initialized with a random population. However, while in genetic algorithms the individuals in this initial population are represented by chromosomes, in PSO each individual is associated with a position vector and a velocity vector. In addition, PSO has no mutation or selection of individuals. Thus, at each iteration, only the positions and velocities of the individuals are adjusted toward the best global position and the best individual position, as determined by a given objective function, according to the following canonical expression (Eberhart and Shi 2011; Chuanwen and Bompard 2005; Van der Merwe and Engelbrecht 2003; Hu et al. 2003; Trelea 2003; Shi and Krohling 2002):

$$ {\boldsymbol{x}}_i\left(t+1\right)={\boldsymbol{x}}_i(t)+{\boldsymbol{v}}_i\left(t+1\right), $$
(1)

where

$$ {\boldsymbol{v}}_i\left(t+1\right)=w{\boldsymbol{v}}_i(t)+{c}_1{r}_1\left[{\boldsymbol{p}}_i(t)-{\boldsymbol{x}}_i(t)\right]+{c}_2{r}_2\left[{\boldsymbol{p}}_g(t)-{\boldsymbol{x}}_i(t)\right], $$
(2)

for 1 ≤ i ≤ m, where m is the number of particles in the swarm; w is the inertia factor, with 0 < w < 1; r1 and r2 are random numbers uniformly distributed in the interval [0, 1] and redrawn at each iteration t; c1 and c2 are constriction constants, also called acceleration coefficients, chosen so that c1 + c2 = 4 (typically, c1 = 2 + D and c2 = 2 − D, where D ≈ 0); c1 weights the individual (or local) component of the particle, depending on the implementation, while c2 weights the global component; xi is the position and vi is the velocity of the i-th particle; pg is the best global position; and pi is the best individual or local position found by the i-th particle.
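As a minimal sketch of Eqs. (1) and (2), the function below performs one update of a single particle. The name `pso_step` and the default values of w, c1, and c2 are illustrative assumptions (chosen only to satisfy 0 < w < 1 and c1 + c2 = 4); they are not the settings used in this study.

```python
import random


def pso_step(x, v, p_best, g_best, w=0.7, c1=2.0, c2=2.0, rng=random):
    """Apply Eq. (2) and then Eq. (1) to one particle.

    x, v       : current position and velocity (lists of floats)
    p_best     : best position found so far by this particle
    g_best     : best position found so far by the whole swarm
    Returns (new_x, new_v).

    One r1 and one r2 are drawn per particle per step, matching the scalar
    r1 and r2 of Eq. (2); some implementations draw them per dimension.
    """
    r1 = rng.random()
    r2 = rng.random()
    new_v = [
        w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)  # Eq. (2)
        for xi, vi, pi, gi in zip(x, v, p_best, g_best)
    ]
    new_x = [xi + nvi for xi, nvi in zip(x, new_v)]  # Eq. (1)
    return new_x, new_v


# A particle already sitting on both best positions only keeps its inertia.
example_x, example_v = pso_step(
    x=[1.0, 2.0], v=[0.5, -0.5], p_best=[1.0, 2.0], g_best=[1.0, 2.0]
)
```

In the example call, both attraction terms vanish because the particle coincides with its individual and global bests, so the new velocity is simply w times the old one.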

Local and global best positions are determined by the local and global maxima of a given objective function, whilst the position xi defines the i-th candidate solution. In this classification problem, we defined xi as an n-dimensional binary vector in which each coordinate indicates the presence (“1”) or absence (“0”) of the corresponding feature. Therefore, each candidate solution is associated with training and test sets composed of dimension-reduced feature vectors. As the objective function, we used the classification accuracy returned by a J48 decision tree. The parameters w, c1, c2, r1, and r2 were all set to 0.33. We used a population of 20 individuals evolving over 500 generations. This solution was implemented in Java using the Java machine learning library Weka (Moraglio et al. 2007; García-Nieto et al. 2009).
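The binary encoding of candidate solutions can be sketched as follows: a 0/1 mask selects feature columns, and the reduced data are scored by an objective function. This is only an illustration, not the Weka implementation. In particular, `centroid_accuracy` is a toy stand-in for the J48 accuracy used in the study, and returning 0.0 for an empty mask is an assumed convention.

```python
def apply_mask(feature_rows, mask):
    """Keep only the columns whose mask entry is 1."""
    if any(len(row) != len(mask) for row in feature_rows):
        raise ValueError("mask length must match the number of features")
    return [[value for value, keep in zip(row, mask) if keep == 1]
            for row in feature_rows]


def fitness(feature_rows, labels, mask, evaluate_accuracy):
    """Score one candidate: reduce the data, then call the objective."""
    if sum(mask) == 0:
        return 0.0  # assumed convention: an empty subset scores zero
    return evaluate_accuracy(apply_mask(feature_rows, mask), labels)


def centroid_accuracy(rows, labels):
    """Toy objective (NOT J48): resubstitution accuracy of a
    nearest-centroid rule, used here only to make the sketch runnable."""
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    def predict(row):
        return min(classes, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(row, centroids[c])))

    hits = sum(1 for r, y in zip(rows, labels) if predict(r) == y)
    return hits / len(rows)


# Toy data: feature 0 separates the classes, feature 1 does not.
toy_rows = [[0.0, 9.0, 1.0], [0.1, -9.0, 1.0],
            [1.0, 9.0, 1.0], [1.1, -9.0, 1.0]]
toy_labels = [0, 0, 1, 1]
score_informative = fitness(toy_rows, toy_labels, [1, 0, 0], centroid_accuracy)
score_noise = fitness(toy_rows, toy_labels, [0, 1, 0], centroid_accuracy)
```

In this toy case the mask that keeps the informative feature scores higher than the mask that keeps the uninformative one, which is the signal a swarm would follow when searching over masks.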

Classification

Both datasets (with all extracted features and after PSO selection) were balanced by adding artificial instances in the Weka™ artificial intelligence environment. This is essential to avoid bias toward the better-represented class, in this case the schizophrenia sample. Edited audio samples were submitted to classification experiments using the following ML algorithms in Weka™: multilayer perceptron (MLP), logistic regression, random forest (RF), decision trees, Bayes net, Naïve Bayes, and SVM with different kernels (linear, polynomial, radial basis function or RBF, PUK, and normalized polynomial). Given the relatively small number of subjects in each sample, experiments were performed with 10-fold cross-validation in order to maximize training samples. Figure 3 illustrates the steps of the prediction system.

Fig. 3

Block diagram of the proposed solution: after editing each audio recording, 33 attributes were extracted from each sample window (10 s with 50% overlap) in the Octave environment. Then, an .ARFF file was generated, and multiple classifiers were tested in the Weka software. The tested classifications were binary, seeking to differentiate healthy participants from patients diagnosed with schizophrenia
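The 10-fold cross-validation protocol mentioned in this section can be sketched as a partition of instance indices into ten folds, each used once for testing while the remaining nine are used for training. The function below is a plain shuffled split written for illustration; it does not stratify by class, which evaluation tools may additionally do, and the name `k_fold_indices` and the seed are assumptions.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Indices are shuffled once and dealt round-robin into k folds, so
    fold sizes differ by at most one instance.
    """
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Toy example with 25 instances instead of the full set of audio windows.
splits = list(k_fold_indices(25, k=10))
```

Each instance appears in exactly one test fold, so every window contributes once to the reported performance estimate.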

Results

Initially, computational experiments were performed using classifiers in their default settings. Subsequently, different setups were tested for all algorithms with adjustable settings (MLP; polykernel and normalized polykernel SVM; SVM PUK kernel; and random forest). The best performances for each classifier type are presented in the boxplots of Figs. 4 and 5, which show the accuracy and kappa index values, respectively. They also compare the classifiers’ performances using all 33 extracted attributes (white boxplots) and using the attributes selected by the PSO method (gray boxplots). Using PSO, 12 attributes were selected, which are listed in Table 4. As can be seen in Figs. 4 and 5, most classifiers perform better when all attributes are considered. The only exception occurs for classifiers based on Bayes’ theorem. However, these classifiers perform poorly on this problem and were therefore not selected. Furthermore, Table 5 presents accuracy, kappa index, sensitivity, and specificity values for the best classifiers.

Fig. 4

Accuracy boxplots for comparison of classifier performance. In most classification experiments, performance with all the extracted attributes exceeds performance with the attributes selected by PSO. Considering the boxplots with all the attributes (white), the support vector machines show higher accuracy values

Table 4 List of 12 attributes after selection with particle swarm optimization
Fig. 5

Kappa index boxplots for comparison of classifiers performance. Similarly to the accuracy values, the results of the Kappa index are higher for cases in which all 33 attributes extracted initially were considered. In addition, SVMs also performed better

Table 5 Classification performances of machine learning models (schizophrenia vs. healthy control). SVM with PUK kernel presented the best results in all four evaluated metrics (accuracy, kappa index, sensitivity, and specificity). It achieved an average accuracy of 91.76%, mean kappa index of 0.8352, sensitivity of 91.9%, and specificity of 91.6%

The results above demonstrate that classification accuracy for SVM models varied considerably (72.93–91.76%), depending on which kernel was used. The SVM with PUK kernel achieved a mean accuracy of 91.76% (sensitivity 91.9%; specificity 91.6%), which was the best performance of all classifiers used in this study. The confusion matrix of this model is shown in Table 6. The SVM with normalized polynomial kernel also achieved accuracy above 90%. The strong performance of different SVM kernels on this dataset supports findings from previous studies, which possibly indicate the superiority of this algorithm for classification tasks using vocal parameters.

Table 6 Confusion matrix for the model with the highest performance (SVM PUK): 91.59% of instances from the control group were correctly classified, while 91.89% of instances from the schizophrenia group were correctly classified
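The four reported metrics are all functions of the confusion matrix, which the sketch below makes explicit. The counts passed in the example are hypothetical (1000 instances per class, chosen only to mimic the reported sensitivity and specificity), since the raw counts are not given in the text; the function name `binary_metrics` is also an assumption.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity, and Cohen's kappa.

    tp/fn: schizophrenia instances classified correctly/incorrectly
    tn/fp: control instances classified correctly/incorrectly
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Chance agreement from the actual (row) and predicted (column) totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# Hypothetical counts for a balanced dataset with 1000 instances per class.
metrics = binary_metrics(tp=919, fn=81, fp=84, tn=916)
```

When the two classes are balanced, the chance agreement equals 0.5 and kappa reduces to 2 × accuracy − 1; for the reported accuracy of 91.76% this gives 0.8352, which agrees with the kappa index in Table 5.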

Discussion

This paper presents a study on discriminating between schizophrenic patients and healthy subjects based on vocal features and machine learning classifiers. The process of data acquisition was designed to provide high translational power, as this is, to our knowledge, the first study to collect audio recordings during actual psychiatric interviews. A feature extraction algorithm was developed locally for the reliable extraction of 33 acoustic features, which have been used successfully for modeling classification problems in neurology and psychiatry. Some machine learning models tested in this paper achieved high performance; in particular, SVM with PUK kernel yielded high classification accuracy for both schizophrenic patients and healthy controls. With the exception of Martínez-Sánchez et al. (2015), our results outperformed those from similar studies using vocal parameters for the detection of schizophrenia.

Nevertheless, although promising, the findings reported in this article should be considered preliminary due to limitations in study design. For instance, the small sample size and the lack of control for possible confounding factors, such as smoking history and use of medications, may limit statistical analyses. Additionally, an important caveat is the difference in educational level between samples, given that educational background is related to speech fluency. In future studies, we aim to address these limitations and perform the same experiments on a larger number of subjects. Table 7 presents a comparative analysis between some of the studies mentioned in this article and the present study.

Table 7 Comparative analysis of previous studies and this paper

Conclusion and future works

Current psychiatric diagnosis still lacks objective biomarkers and relies mostly on specialist opinion based on diagnostic systems. Nevertheless, these criteria have been criticized due to their lack of correlation with the neurobiology and etiopathogenesis of mental disorders, leading to trial-and-error treatments. In this context, patients with schizophrenia may present with vocal acoustic abnormalities that may be used as objective parameters for the identification and assessment of this disorder.

Therefore, this paper aimed to develop objective measures of schizophrenia to aid clinical practice in the future. For this purpose, we extracted vocal acoustic features and performed experiments using different automated classification techniques based on machine learning. Some of the most widely used machine learning classifiers were tested in this work. Our results demonstrate the viability of an inexpensive and non-invasive tool for the detection of schizophrenia based on vocal acoustic analysis through machine learning algorithms. In future studies, we intend to perform the same experiments in a larger sample, and also with gender-based datasets. We would like to evaluate whether schizophrenia affects the vocal acoustic properties of men and women differently and, if so, how these differences influence the performance of automated classifiers.