Introduction

Schizophrenia (SZ) is a severe brain disorder that affects thinking, memory, comprehension, speech, and behavior [1, 2]. This chronic psychiatric disorder disrupts employment, marriage, and lifestyle [3, 4]; quality of life is consequently compromised, many patients are unable to function in the workplace, and 20–40% attempt suicide at least once [5]. The World Health Organization (WHO) reports that 20 million people worldwide are affected by this mental disorder [6]. The WHO also notes that SZ is treatable, and a precise and timely diagnosis supports better treatment and recovery of the patient.

Currently, there is no established clinical test for SZ, and diagnosis relies on behavioral symptoms such as hallucinations, functional deterioration, and disorganized speech observed by experts. Such assessments are subjective and not very accurate. To overcome these limitations, an automatic, reliable, and reproducible approach based on brain imaging modalities and advanced machine learning methods is required. Such a system can be deployed anywhere without the need for highly trained experts. For the diagnosis of mental disorders such as SZ, the electroencephalogram (EEG) is a powerful tool, since it reflects the brain state well and is widely used in clinical applications [7,8,9]. Moreover, EEG is well accepted because of its high temporal resolution, easy setup, noninvasiveness, and portability. EEG signals have been used for brain source localization in the diagnosis of various brain disorders such as epilepsy, schizophrenia, and Parkinson's disease [10,11,12,13,14,15,16]. For example, in epilepsy studies a significant issue is locating the regions activated by spikes [15], while in mental disorders such as SZ, brain source localization is crucial for treatment approaches such as transcranial magnetic stimulation (TMS).

Traditionally, EEG-based machine learning pipelines have extracted features, performed feature selection, and finally applied conventional classifiers for automated detection of schizophrenia [17,18,19,20,21,22,23,24,25,26,27,28,29,30]. In recent years, however, there has been growing interest in deep learning methods as a disruptive alternative to these feature-based approaches [31, 32]. Deep learning algorithms automatically extract significant features and classify them directly from the data; they loosely imitate the way the human brain processes data and forms decision-making patterns.

Recent developments in neural network architecture design and training have enabled researchers to solve previously intractable learning tasks. As a result, many studies have applied deep learning, especially the convolutional neural network (CNN), as the state of the art in machine learning across a wide range of computer vision problems, particularly in medical applications [33,34,35,36,37,38,39,40], and also to EEG signal processing with great success [41,42,43,44,45]. Several works have also used CNNs and EEG signals to detect SZ patients. Recently, an automatic method for diagnosing SZ from EEG signals using a CNN was proposed in [46], where a CNN discriminated 14 healthy subjects from 14 SZ patients. In another study [47], a multi-domain connectome CNN with different fusion strategies was proposed for detecting SZ from EEG signals. Moreover, for functional magnetic resonance imaging (fMRI), a three-dimensional CNN combined with autoencoders [48] and a three-dimensional CNN [49] have been presented for identifying SZ.

The main novelty of this paper is a more generalized approach to modeling brain dysfunction by combining the continuous wavelet transform (CWT), transfer learning with four popular pre-trained deep CNNs (AlexNet, ResNet-18, VGG-19 and Inception-v3), and a support vector machine (SVM) for automated diagnosis of SZ patients from EEG signals. In addition, the discriminant brain regions for distinguishing 14 patients suffering from SZ from 14 healthy subjects are determined by the proposed method. Identifying these distinct brain sources is crucial for treating SZ patients with TMS.

Material and methods

Participant and EEG recording

The data used in this study, which is publicly available, were collected from 14 patients suffering from SZ and 14 healthy subjects [23]. The patient group comprises seven males (mean age 28.3 ± 4.1 years) and seven females (mean age 27.9 ± 3.3 years); the control group contains the same number of males and females, with mean ages of 26.8 ± 2.9 years for males and 28.7 ± 3.4 years for females. All patients met the International Classification of Diseases (ICD)-10 criteria for paranoid SZ. The study protocol was approved by the Ethics Committee of the Institute of Psychiatry and Neurology in Warsaw, Poland, and written consent was obtained from all participants. Inclusion criteria were a minimum age of 18 years and a medication washout period of at least seven days. Exclusion criteria were pregnancy, organic brain pathology, severe neurological diseases (e.g. epilepsy, Alzheimer's or Parkinson's disease), and the presence of a general medical condition. EEG was recorded for 12 min at a sampling rate of 250 Hz while participants were relaxed with their eyes closed. The signals were band-pass filtered with Butterworth high-pass and low-pass filters with cut-off frequencies of 0.5 Hz and 45 Hz, respectively. Each recording was divided into 5-s segments before the analysis, so each channel of each subject yields 144 segments. The standard international 10–20 system was used for recording, resulting in 19 channels per subject: Fp1, Fp2, F7, F3, Fz, F4, F8, C3, Cz, C4, P3, Pz, P4, T3, T4, T5, T6, O1, O2. These channels are divided into 5 brain regions (Table 1).
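The preprocessing described above can be sketched as follows. This is an illustrative Python snippet, not the authors' code (the original preprocessing used EEGLAB/MATLAB); the filter order is an assumption, and `x` stands for one raw EEG channel.

```python
# Illustrative sketch: band-pass filtering and 5-second segmentation of one
# EEG channel sampled at 250 Hz, as described above. The 4th-order filter is
# an assumption, not a detail taken from the paper.
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 250                       # sampling rate (Hz)
SEG_LEN = 5 * FS               # 5-second segments -> 1250 samples

def preprocess_channel(x, fs=FS):
    # Butterworth band-pass between 0.5 and 45 Hz (zero-phase filtering)
    sos = butter(4, [0.5, 45], btype="bandpass", fs=fs, output="sos")
    x_filt = sosfiltfilt(sos, x)
    # Split into non-overlapping 5-s segments; a 12-min recording gives 144
    n_seg = len(x_filt) // SEG_LEN
    return x_filt[: n_seg * SEG_LEN].reshape(n_seg, SEG_LEN)

# Example with a synthetic 12-minute channel
segments = preprocess_channel(np.random.randn(12 * 60 * FS))
print(segments.shape)          # (144, 1250)
```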

Table 1 Channel clusters for each brain region

EEG signal to image conversion

The wavelet transform provides a two-dimensional time–frequency representation of an EEG signal as an image. This image efficiently captures the variation of the spectral content of the signal over time and can represent discriminant properties of healthy and SZ subjects. The resulting image represents EEG power changes in time and frequency and is used to feed the CNNs. The transform represents a signal as a linear combination of basis functions called wavelets [45]:

$$X_{\omega}(a,b) = \frac{1}{|a|^{1/2}} \int_{-\infty}^{+\infty} x(t)\,\overline{\Psi}\!\left(\frac{t-b}{a}\right) dt$$
(1)

where \(a\) is the scale (a positive real number), \(b\) is the translation (a real number), \(\omega\) indexes the wavelet (window) used, and \(\Psi(t)\) is the mother wavelet. In this study, the Morse (3,60) mother wavelet is used, since it yields better localization in the frequency domain than other mother wavelets.
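As a rough illustration of the EEG-to-image step, the following Python sketch converts one 5-s segment into a scalogram image. PyWavelets does not provide the generalized Morse (3,60) wavelet used in the paper (the original work used MATLAB's CWT), so a complex Morlet wavelet is substituted here; the frequency range and plotting settings are assumptions.

```python
# Illustrative sketch: 5-s EEG segment -> scalogram image. The complex Morlet
# wavelet stands in for the Morse (3,60) wavelet, which PyWavelets lacks.
import numpy as np
import pywt
import matplotlib.pyplot as plt

FS = 250
segment = np.random.randn(5 * FS)             # placeholder 5-s segment

wavelet = "cmor1.5-1.0"                       # complex Morlet (bandwidth-center freq)
freqs = np.linspace(1, 45, 64)                # target frequencies in Hz (assumption)
fc = pywt.central_frequency(wavelet)          # normalized center frequency
scales = fc * FS / freqs                      # scales corresponding to 1-45 Hz

coeffs, _ = pywt.cwt(segment, scales, wavelet, sampling_period=1 / FS)

plt.imshow(np.abs(coeffs), aspect="auto", origin="lower",
           extent=[0, 5, freqs[0], freqs[-1]])
plt.xlabel("Time (s)"); plt.ylabel("Frequency (Hz)")
plt.savefig("scalogram.png", dpi=100)         # saved image is later fed to the CNN
```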

CNN

The CNN is one of the most powerful and popular deep learning tools in the field of medical imaging. It is a state-of-the-art deep learning architecture consisting of many stacked convolutional layers. The network contains convolutional layers, pooling layers, batch normalization, fully connected (FC) layers, and finally a softmax layer [31, 32]. Feature maps are extracted by the convolutional layers. Pooling layers reduce the size of the feature maps with maximum or average operators so that the most significant features are retained. Finally, the FC layers prepare the extracted features to be classified by the softmax layer. Nonlinear activation layers (mostly the ReLU function) enable the network to solve nonlinear problems; a ReLU activation is applied after each convolutional and fully connected layer. Dropout and batch normalization techniques are also used to reduce overfitting.
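For illustration only, the following minimal PyTorch sketch (not one of the networks used in this paper) shows how these layer types are typically stacked.

```python
# Minimal sketch of the layer types described above: convolution, batch
# normalization, ReLU, pooling, dropout, a fully connected layer and a
# softmax output for 2 classes.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # extracts feature maps
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # keeps strongest activations
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                             # regularization
            nn.Linear(32, n_classes),                    # fully connected layer
        )

    def forward(self, x):
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)              # class probabilities

probs = TinyCNN()(torch.randn(1, 3, 224, 224))
print(probs.shape)   # torch.Size([1, 2])
```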

Pre-trained CNNs

Pre-trained CNNs are networks trained on a very large number of images covering many categories. AlexNet [50], VGGNet [51], Inception [52] and the residual network (ResNet) [53] are popular pre-trained CNNs that were trained on the ImageNet database and were winners or runners-up of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) between 2012 and 2015. ImageNet is a well-known image database for visual object recognition containing over 1.2 million images from 1000 categories, ranging from animals (dogs, cats, lions, …) to objects (desks, pens, chairs, …).
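As an illustrative aside, all four networks are available in torchvision; the sketch below builds them and counts their parameters (the counts of these implementations may differ slightly from the figures quoted in the following subsections, and the original work used MATLAB rather than PyTorch).

```python
# Sketch: the four pre-trained CNNs discussed here are available in
# torchvision; passing weights="IMAGENET1K_V1" loads the ImageNet-trained
# parameters.
from torchvision import models

builders = {
    "AlexNet": models.alexnet,
    "VGG-19": models.vgg19,
    "Inception-v3": models.inception_v3,
    "ResNet-18": models.resnet18,
}
for name, build in builders.items():
    net = build(weights=None)            # use weights="IMAGENET1K_V1" for pre-trained
    n_params = sum(p.numel() for p in net.parameters())
    print(f"{name}: {n_params / 1e6:.1f} M parameters")
```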

AlexNet

AlexNet, with 61 million parameters, is a relatively simple CNN with a few convolutional layers that won ILSVRC 2012 [50]. It has 5 convolutional layers for extracting low- and high-level features, max-pooling layers, and 3 fully connected layers for classification. Figure 1a lists the layer names in the left column and the number and size of the kernels (filters) of the convolutional and pooling layers in the right column. For example, the 'Conv1' layer has 96 kernels of size 11 × 11 × 3 with stride 4 and padding 0.

Fig. 1

Block representation of convolutional and pooling layers of a AlexNet and b VGG-19. Each block contains information about the number of filters, size of filters, stride and padding

VGGNet-19

VGGNet, the runner-up of ILSVRC 2014, was introduced by Simonyan and Zisserman [51]. The network has two versions with different numbers of stacked convolutional layers, VGG-16 and VGG-19: in the last three blocks, VGG-16 stacks three convolutional layers each while VGG-19 stacks four. In this paper, VGG-19 is used, with 19 uniform weight layers (16 convolutional and 3 fully connected) and about 144 million parameters. Figure 1b shows the structure of this network. For example, the 'Conv1_1' layer has 64 kernels of size 3 × 3 × 3 with stride 1 and padding 1.

Inception-v3

Inception-v3, with 23.9 million parameters, was the runner-up of ILSVRC 2015 [52]. It stacks many inception modules, each consisting of parallel convolutional layers. This design reduces the number of connections without degrading the efficiency of the network. Figure 2a shows the structure of this network. For example, the 'Conv2d_1' layer has 32 kernels of size 3 × 3 × 3 with stride 2 and padding 0.

Fig. 2

Block representation of the a Inception-v3 and b ResNet-18 in compact form. Each block contains information about the number of filters, size of filters, stride and padding

ResNet-18

ResNet is the winner of ILSVRC 2015 [53]. It stacks many residual units with identity shortcut connections that help to alleviate the vanishing gradient problem of deep CNNs: when a network has very many layers, repeated multiplication during backpropagation drives the gradient toward zero, so the earlier layers are barely updated and performance degrades as layers are added. ResNet has several versions with different numbers of convolutional layers; ResNet-18 is the 18-layer version with 11.7 million parameters. Figure 2b shows the structure of this network. For example, the 'Conv_1' layer has 64 kernels of size 7 × 7 × 3 with stride 2 and padding 3.
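For reference, the first convolutional layer of torchvision's ResNet-18 matches the configuration described above; the short sketch below inspects it (illustrative only, since the original work used MATLAB).

```python
# Sketch: inspect ResNet-18's first convolutional layer (64 kernels of size
# 7x7x3, stride 2, padding 3) and its first stack of residual units.
from torchvision import models

resnet18 = models.resnet18(weights=None)    # "IMAGENET1K_V1" for ImageNet weights
print(resnet18.conv1.weight.shape)          # torch.Size([64, 3, 7, 7])
print(resnet18.conv1.stride, resnet18.conv1.padding)   # (2, 2) (3, 3)
print(resnet18.layer1)                      # first stack of residual (BasicBlock) units
```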

Transfer learning

The number of parameters increases as a network gets deeper, which can improve its learning capacity, but deeper networks also require more complicated computations and more training data. Transfer learning takes a reference deep model previously trained on a huge database and adapts it, using a smaller database, to a new application [54,55,56,57,58]; the learned parameters (layer weights and biases) are transferred to a problem for which the available data would otherwise be insufficient. This procedure has several benefits, such as shorter training time, cheaper hardware requirements, lower computational load, and fewer training images. In our procedure, the images obtained from the EEG signals via the CWT are used as input, the convolutional and pooling layers of the pre-trained CNN models (AlexNet, ResNet-18, VGG-19 and Inception-v3) are used as deep feature extractors, and the resulting features are fed into an SVM classifier whose parameters are then tuned to classify SZ patients and healthy subjects. In other words, the fully connected and softmax layers of the pre-trained CNN models are replaced with an SVM classifier, and the SVM parameters are tuned. Note that the networks expect slightly different input sizes (227 × 227 for AlexNet, 224 × 224 for VGG-19 and ResNet-18, and 299 × 299 for Inception-v3), so in the first step of data preparation all images were resized accordingly.
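A minimal sketch of this transfer-learning step is given below, assuming ResNet-18 as the backbone and Python/PyTorch/scikit-learn in place of the original MATLAB implementation; `images` and `labels` are hypothetical variables holding the RGB scalogram images (as NumPy arrays) and their class labels.

```python
# Illustrative sketch of the transfer-learning step: scalograms are resized
# to the network's input size, deep features are taken from ResNet-18 with
# its classifier head removed, and an SVM is trained on those features.
import torch
import torch.nn as nn
from torchvision import models, transforms
from sklearn.svm import SVC

device = "cuda" if torch.cuda.is_available() else "cpu"

# ResNet-18 with the FC layer replaced by identity -> 512-D feature extractor
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()
backbone.eval().to(device)

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),            # 227/299 for AlexNet/Inception-v3
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(images):
    batch = torch.stack([preprocess(img) for img in images]).to(device)
    return backbone(batch).cpu().numpy()      # shape: (n_images, 512)

# Hypothetical usage: features replace the original FC + softmax layers
# X_train = extract_features(images)
# svm = SVC(kernel="linear", C=1.0).fit(X_train, labels)
```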

SVM classifier

The SVM is a supervised classification method in machine learning that solves classification problems efficiently by maximizing the margin around the separating hyperplane. This classifier has been used successfully in EEG signal processing studies [59,60,61]. For a training set of data \(x_{i}\), the linear hyperplane is defined as in Eq. (2) [62]:

$$w^{T} x + b = 0$$
(2)

where \(w\) is an n-dimensional weight vector and \(b\) is the bias. A hyperplane must separate the data with the least possible error and with the maximum distance to the closest data point of each class. According to these two properties, a sample lies on one side of the hyperplane (y = 1) or the other (y = − 1). Equation (3) shows the two margins that control the separability of the samples:

$$w^{T} x_{i} + b \begin{cases} \ge 1 & \text{for } y_{i} = 1 \\ \le -1 & \text{for } y_{i} = -1 \end{cases}$$
(3)

The distance \(d\) between the two margins, which is maximized to find the best hyperplane, is computed as in Eq. (4):

$$d(w,b;x) = \frac{\left| \left( w^{T} x + b - 1 \right) - \left( w^{T} x + b + 1 \right) \right|}{\lVert w \rVert} = \frac{2}{\lVert w \rVert}$$
(4)

Maximizing the margin is equivalent to minimizing \(\lVert w \rVert\). The optimal hyperplane is therefore computed such that [62]:

$${\text{Minimize}}\quad \frac{1}{2} w^{T} w + C \sum_{i = 1}^{M} \xi_{i}$$
(5)
$${\text{Subject to}}\quad y_{i} \left( w^{T} x_{i} + b \right) \ge 1 - \xi_{i} \quad {\text{for}}\ i = 1, \ldots, M$$
(6)

where \(C\) is the margin parameter and \(\xi_{i}\) are the slack variables. The margin parameter determines the trade-off between maximizing the margin and minimizing the classification error, while the slack variables penalize data points that violate the margin requirements. Here, we used an L2-SVM classifier, which uses the sum of squared slack variables (\(\xi_{i}^{2}\)) in the objective function. The corresponding optimization problem is:

$${\text{Minimize}}\quad \frac{1}{2} \lVert w \rVert^{2} + \frac{C}{2} \sum_{i = 1}^{M} \xi_{i}^{2}$$
(7)
$${\text{Subject to}}\quad y_{i} \left( w^{T} x_{i} + b \right) \ge 1 - \xi_{i} \quad {\text{for}}\ i = 1, \ldots, M$$
(8)
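For reference, scikit-learn's LinearSVC minimizes a squared-hinge objective corresponding to Eqs. (7)–(8), with C playing the role of the margin parameter. The sketch below, on synthetic data, shows how such an L2-SVM can be trained; it is illustrative only and not the classifier configuration used in the paper.

```python
# Sketch: an L2-SVM (squared hinge loss) on synthetic data, standing in for
# the classifier described by Eqs. (7)-(8).
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=512, random_state=0)
l2_svm = make_pipeline(StandardScaler(),
                       LinearSVC(loss="squared_hinge", C=1.0))
l2_svm.fit(X, y)
print(l2_svm.score(X, y))     # training accuracy on the synthetic data
```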

Performance evaluation

Each of the four pre-trained CNN models was tuned independently on 90% of the data and then evaluated on the remaining 10%. Because of the limited dataset size, ten-fold cross-validation was used: the process was repeated ten times, with each fold used exactly once as the test set, so that the whole dataset was eventually used for evaluation. The mean and standard deviation of the ten results are reported. The accuracy, sensitivity and specificity are computed as follows:

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
(9)
$${\text{Sensitivity }} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(10)
$${\text{Specificity}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP }}}}$$
(11)

where TP, TN, FP and FN are the true positive, true negative, false positive and false negative counts from the confusion matrix, respectively.
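A small sketch of how Eqs. (9)–(11) follow from a confusion matrix, using scikit-learn and synthetic labels (SZ taken as the positive class):

```python
# Sketch: computing accuracy, sensitivity and specificity from a binary
# confusion matrix, with label 1 (SZ) as the positive class.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # synthetic ground truth
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])   # synthetic predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)   # Eq. (9)
sensitivity = tp / (tp + fn)                    # Eq. (10)
specificity = tn / (tn + fp)                    # Eq. (11)
print(accuracy, sensitivity, specificity)
```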

Results

The 19 EEG channels of each subject were preprocessed using the EEGLAB toolbox [63] in MATLAB (version 2019a). The EEG signals were then converted to scalogram images using the CWT with the Morse (3,60) wavelet; scalogram images were built from all 19 channels of each subject. Figure 3 shows sample scalograms of EEG channels from a healthy subject and an SZ patient, where the horizontal and vertical axes represent time (s) and frequency (Hz), respectively. The scalogram images were used as input, the convolutional and pooling layers of the pre-trained CNN models served as feature extractors, and the resulting features were fed into the SVM classifier (Fig. 3). Four pre-trained CNNs, Inception-v3, VGG-19, ResNet-18 and AlexNet, were used independently; in each case the fully connected and softmax layers were replaced with an SVM whose parameters were tuned to classify SZ patients and healthy subjects. Tuning was performed on 90% of the scalogram images, and the accuracy, specificity and sensitivity were computed on the remaining scalograms. This procedure was repeated ten times, and the mean and standard deviation of these measures were computed. All processing steps were carried out in MATLAB 2019a on a laptop with an Intel Core i7-6500U CPU @ 2.50 GHz.

Fig. 3

Block diagram of the proposed method. Sample scalogram images of EEG channels from a a healthy subject and b an SZ patient are shown

Figure 4 shows the average accuracy of SZ detection from healthy controls for each of the 19 EEG channels using AlexNet-SVM, VGG-19-SVM, ResNet-18-SVM and Inception-v3-SVM. ResNet-18-SVM achieved the highest accuracy on all EEG channels, followed by Inception-v3-SVM, VGG-19-SVM and AlexNet-SVM, in that order. Among the EEG channels, P4 and O2 achieved the highest accuracies using ResNet-18-SVM, with 88.05% and 86.25%, respectively. According to psychological studies, the parietal and occipital regions are discriminant brain regions in SZ. After analyzing MR images from SZ patients and normal subjects, the authors of [64] found that gray matter (GM) and white matter (WM) in these brain regions differed significantly between the two groups. Similarly, in [65] discriminant regions were found in the parietal and occipital areas of SZ patients after investigating GM and WM in MRI.

Fig. 4

Average accuracy values of SZ detection from healthy controls using the AlexNet-SVM, VGG-19-SVM, ResNet-18-SVM and Inception-v3-SVM on scalogram images of 19 EEG channels, separately

To improve the SZ recognition performance, the EEG channels of each region were combined; the 19 EEG channels are divided into 5 brain regions (Table 1). Because the highest accuracy was achieved with ResNet-18-SVM, this network was used for further analysis. Table 2 reports the average accuracy, sensitivity and specificity for the scalogram images of the 5 defined brain regions using ResNet-18-SVM to classify SZ patients versus healthy controls. The highest accuracy, 94.84%, was achieved for the parietal region. Finally, brain regions were combined to further improve the SZ recognition performance, considering all possible combinations of two, three, four and five brain regions. The highest accuracy, 98.60% ± 2.29, was achieved for scalogram images of the combination of the four regions frontal, central, parietal and occipital. Table 3 reports the mean and standard deviation of the accuracy, sensitivity and specificity for scalogram images of all possible four-region combinations and the five-region combination using ResNet-18-SVM. As can be seen, whenever the temporal region is included the accuracy decreases; the other four-region combinations containing it, and even the five-region combination (95.30%), yield lower accuracy. Thus, SZ patients can be differentiated from healthy controls with an accuracy of 98.60% ± 2.29 using the combination of the frontal, central, parietal and occipital regions.
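The region-combination search can be sketched as follows. The region-to-channel grouping below is an assumed mapping based on standard 10–20 labels (the paper's exact clusters are given in Table 1), and `evaluate` is a hypothetical function returning the cross-validated accuracy of the ResNet-18-SVM pipeline for a given set of channels.

```python
# Sketch: exhaustive search over combinations of 2-5 brain regions. The
# region grouping is an assumption for illustration; `evaluate` is a
# hypothetical scoring function.
from itertools import combinations

regions = {
    "frontal":   ["Fp1", "Fp2", "F7", "F3", "Fz", "F4", "F8"],
    "central":   ["C3", "Cz", "C4"],
    "parietal":  ["P3", "Pz", "P4"],
    "temporal":  ["T3", "T4", "T5", "T6"],
    "occipital": ["O1", "O2"],
}

def best_combination(evaluate, sizes=(2, 3, 4, 5)):
    results = {}
    for k in sizes:
        for combo in combinations(regions, k):
            channels = [ch for r in combo for ch in regions[r]]
            results[combo] = evaluate(channels)   # e.g. ResNet-18-SVM accuracy
    return max(results, key=results.get), results
```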

Table 2 Mean and standard deviation of accuracy, sensitivity and specificity values of SZ detection from healthy controls for scalogram images of brain regions using ResNet-18-SVM
Table 3 Mean and standard deviation of accuracy, sensitivity and specificity values of the SZ detection from healthy controls for scalogram images of combination of possible four and five brain regions using the ResNet-18-SVM

Discussion

In this research, we used transfer learning with deep CNNs, the CWT and an SVM for automated discrimination of SZ patients from healthy controls. An accuracy of 98.60% ± 2.29 was achieved by the ResNet-18-SVM architecture on scalogram images of the combined frontal, central, parietal and occipital EEG regions. Screening of SZ patients is important for early diagnosis and treatment.

The relatively small database used in this study limits how adequately the enormous number of parameters of a deep CNN model can be trained; transfer learning was therefore exploited to compensate for this limitation. We demonstrated the feasibility of four state-of-the-art pre-trained CNN architectures and applied them to a clinical dataset to perform SZ detection from EEG signals, replacing the fully connected and softmax layers with an SVM. Using an SVM as the classification layer is reasonable, since before deep learning became popular it was one of the most efficient classification methods and often achieved the highest discrimination performance.

As seen in the results section, the ResNet-18-SVM architecture is the best model in terms of accuracy, sensitivity and specificity; ResNet-18 obtained the highest average accuracy for recognizing SZ patients among the pre-trained CNNs (Fig. 4). To understand why this network performs better than the others, consider their architectures. The structures of VGG-19 and AlexNet are relatively similar, but VGG-19 has more convolutional layers and achieves higher accuracy than AlexNet, so the number of layers appears to affect performance here. However, ResNet-18 has fewer convolutional layers (18) than Inception-v3 (48) yet achieves higher accuracy, so the number of layers is not the only relevant factor. ResNet-18 contains residual units with stacked identity mappings and shortcut connections, whereas Inception-v3 uses multiple parallel convolutional layers in its inception modules. According to the accuracy results, the residual unit appears to perform better than the inception module for this discrimination task.

As shown in Table 3, the combination of the frontal, central, parietal and occipital regions achieved the highest average accuracy of all combinations, which suggests that these are the most relevant regions for distinguishing SZ patients from healthy controls. Our findings on the most discriminative regions are consistent with related studies using other methods [23, 25, 27, 28, 30]. In Table 4, the results of this study are compared with related studies that used EEG signals from the same database [25,26,27, 46, 47]. The accuracy achieved in this study is higher than in those studies, which used other machine learning methods, and demonstrates the advantage of the proposed method.

Table 4 Comparison of proposed method with other recent studies about SZ identification

The main limitation of this research is the size of the dataset available to train the networks; by using regularization and keeping the deep models simple, we were able to mitigate this problem. In the future, we aim to collect more samples and apply the developed methodology to other types of EEG data. We also plan to apply different methods for converting 1-D EEG signals into 2-D images that represent the information flow between EEG channels, to be fed to the CNN architecture.

Conclusion

Transfer learning with a popular deep CNN (ResNet-18), combined with the CWT and an SVM classifier, was used successfully for automated detection of SZ patients from healthy controls using EEG signals. The accuracy, sensitivity and specificity of this method are 98.60% ± 2.29, 99.65% ± 2.35 and 96.92% ± 2.25, respectively, for the combination of the frontal, central, parietal and occipital regions. Based on these results, the proposed deep learning model can effectively analyze brain function and can help health care professionals to identify SZ patients for early intervention.