Introduction

Depression is forecast to become the second leading cause of disability and death worldwide before 2030 (Mathers and Loncar 2006; Murray et al. 2013; World Organization of Mental Health 2012; World Health Organization 2017). The effects of depression tend to extend beyond the individual patient, negatively impacting the patient’s immediate social environment. In the Netherlands alone, depression was the second most prevalent cause of sick leave in 2016, accounting for some 11% of workforce losses (WHO Europe, Data and resources 2014). The underlying pathogenesis of depression is still not known. Nevertheless, considerable evidence from neuroimaging studies shows structural and functional changes affecting different brain regions and neurotransmitter systems (Arnone et al. 2012, 2013; Koolschijn et al. 2009; Kwaasteniet et al. 2013; Vederine et al. 2011).

In present clinical practice, depression is diagnosed using clinical interviews and structured and semi-structured symptom severity scales (Beck Depression Inventory, American Psychiatric Association DSM-5, ICD-10), all of which require accurate self-report from the patient. To the best of our knowledge, clinical professionals do not use any kind of quantification of EEG in the diagnostic process. Multiple factors (NESDA 2018) often make the detection and diagnosis of depression difficult. The use of biomarkers could facilitate diagnosis and potentially prevent future episodes, a prospect that has garnered significant attention in the past decade.

Most quantitative EEG (QEEG) measures, connectivity measures, vigilance-based measures, and sleep-related EEG measures use power spectrum analysis to quantify changes in brain activity related to depression. This approach has repeatedly shown abnormal frontal asymmetry in alpha activity (Allen et al. 2004; Stewart et al. 2010), reduced slow-wave activity in sleep (Nissen et al. 2006), increased alpha activity (Kemp et al. 2010; Köhler et al. 2011; Basar et al. 2011), and decreased alpha synchronization in the right fronto-central and centro-parietal connections (Kim et al. 2013). Other studies focused on changes in theta (Knott et al. 2000; Ricardo-Garcell et al. 2009) and beta activity (Roh et al. 2016). Van der Vinne et al. (2017) showed that frontal asymmetry, although well established in the literature, fails to serve as a prognostic biomarker for depression.

Among others, a recent study offered new insight into the possible mechanisms behind major depressive disorder (MDD) (de Kwaasteniet et al. 2013). De Kwaasteniet and colleagues confirmed abnormal functional connectivity in the fronto-limbic system by utilizing diffusion tensor imaging (DTI) and fMRI. A compromised second part of the uncinate fasciculus in MDD appears to be correlated with increased functional connectivity as well as with the severity of the disease. Kim et al. (2013), applying graph theory to EEG, showed that the ‘functional topological architecture of the resting-state brain network is disrupted in bipolar disorder’. That study also revealed impaired neural synchronization at resting state as well as a disruption of functional connectivity. Both studies confirmed that the disturbance of white matter tracts is associated with abnormal functioning of the fronto-limbic system, a key indicator of depression. Given that such sophisticated methodologies have proven effective, this study proposes to investigate, for the purpose of detection, simple and low-cost EEG as a measure of electrical activity in the human brain that might reflect changes in those deep brain regions at the cortical level. It is well described in the literature that the brain is a highly complex, nonlinear and mostly irregular system (Eke et al. 2002; Goldberger et al. 2002; Stam 2005). There is increasing evidence that the use of non-linear methods may provide significant advantages in deciphering the physiological processes underlying EEG signals (Acharya et al. 2005; Liang et al. 2015; Stokić et al. 2015), and EEG remains the most accessible method (compared to far more expensive alternatives such as fMRI) for supporting decisions in clinical practice. EEG has the potential to be utilized as a screening method for a variety of psychiatric disorders.

There is a degree of consensus among researchers that relying on a single nonlinear measure might be misleading (Burns and Ramesh 2015), while at the other extreme some published studies apply ten or more nonlinear measures (Liang et al. 2015), each of which provides a different kind of information about the signal under study. In our previous work, testing the effectiveness of different combinations of nonlinear measures, we found that Higuchi’s fractal dimension (HFD) and sample entropy (SampEn) are particularly well matched in a methodological sense (in press), showing different sensitivities to the frequency content of the signal: SampEn performs better in the lower frequency band, and HFD in the higher EEG frequencies. Both nonlinear measures are used to examine signal complexity: HFD is a complexity measure operating directly in the time domain, while SampEn is a regularity statistic showing how predictable or irregular a signal is. We then turned to establishing a methodology with measures that are computationally fast and robust to artifacts in the signal, and that could be clinically applicable. Such a methodology could be applied at different points in the diagnostic process to give the clinician additional support for a diagnostic decision. We did not attempt here to examine correlations with patients’ previous medical histories, but to show how proper quantification of their EEG can be used as an independent biomarker. Therefore, this study tested the use of HFD and SampEn to discriminate the complexity of the brain’s neuronal activity in patients diagnosed with depression.

‘Data mining is the extraction of implicit, previously unknown, and potentially useful information from data’ (Witten and Frank 2005), and machine learning, as a part of that discipline, has lately attracted considerable attention in the fields of epileptic seizure detection (Hussain 2018) and classification (Raghu et al. 2017) and methamphetamine use disorder (Khajehpour et al. 2019), as well as in cognition (Mora-Sánchez et al. 2019; Tafreshi et al. 2019). There is a relatively small number of publications dealing with the application of machine learning (data mining) algorithms to depression recognition using EEG. Ahmadlou et al. (2012) compared the two main algorithms for calculating fractal dimension from EEG: HFD was shown to provide better discrimination (91.3%) than Katz’s fractal dimension (using enhanced probabilistic neural networks). Bachmann et al. (2013) used HFD together with SASI (a novel spectral measure) as a discriminator and found an increased complexity of EEG in MDD. In a recent study (Bachmann et al. 2018) they also used linear regression with HFD as a feature, achieving an accuracy of 77%. Hosseinifard et al. (2013) used spectral measures and three nonlinear measures and found that classical spectral measures were not useful for classification; the aim of their study was to improve the accuracy of classifiers in combination with different nonlinear measures. In addition, it has been found that Support Vector Machines (SVM) provided the best classification results compared to other methods such as Decision Tree (DT), k-Nearest Neighbor (kNN) and Naïve Bayes (NB) (Bairy et al. 2015). A recently published detailed mathematical description of the interconnection between HFD and the components of Fourier analysis (the classical spectral analysis usually applied to EEG) strongly suggests that using both Higuchi’s fractal dimension and the Fast Fourier Transform (FFT) is redundant, because fractal dimensions are weighting functions of Fourier amplitudes (Kalauzi et al. 2012). Complex signals are information-rich. They are also fractal in nature, showing multiscaling and self-similarity. Another reason for our preference for nonlinear measures in the analysis of EEG is that nonlinear (nonstationary and irregular) signals ‘defy comprehensive understanding by a classic reductionist approach’ (Goldberger et al. 2002). Even the simplest nonlinear signals (originating from complex systems) will ‘foil the criteria of proportionality and superposition characteristic for linear systems’ (Goldberger et al. 2002; physionet.org).

After performing the complexity analyses, we compared seven different classifiers with different combinations of features and with different numbers of principal components (PCs) from Principal Component Analysis (PCA) (Jolliffe 2002). The aim of the study was to test the usefulness of employing nonlinear measures of complexity changes in EEG, together with machine learning, to separate patients diagnosed with depression from healthy controls. Our study utilizes seven classification methods (Logistic Regression, Support Vector Machines with both linear and polynomial kernels, Multilayer Perceptron, Decision Tree, Random Forests, and Naïve Bayes). The aim was to show that, with properly selected nonlinear features and additional PCA processing, every supervised learning method used can obtain high accuracy.
The idea of this study is that the utilization of appropriate nonlinear measures to characterize an EEG signal is crucial for highly accurate classification: if the features are appropriately generated, any classifier applied to them performs with high accuracy.

Materials and methods

Since the overall goal of the study was to demonstrate the usefulness of non-linear features for the classification of patients diagnosed with depression based on EEG signals, Higuchi’s Fractal Dimension (HFD) and Sample Entropy (SampEn) of the EEG time series (series of data points from the raw signal, indexed in time order) were calculated. Subsequently, we applied supervised machine learning algorithms to assess classification accuracy. To determine the linear dependence of EEG features, a correlation analysis was utilized. PCA was applied to determine the influence of linear feature extraction on classification accuracy; PCA is also known in the literature for its ability to reduce the dimensionality of a feature set, making machine learning models more sensitive. Various classification algorithms were examined, ranging from simple and linear to highly non-linear: Logistic Regression (LR), Support Vector Machines (SVM) with both linear and polynomial kernels, Multilayer Perceptron (MLP), Decision Tree (DT), Random Forests (RF) and Naïve Bayes (NB).

The non-linear features (both HFD and SampEn) were computed using a custom program written in Java. The classification was performed using Weka software (Weka v. 3.8, University of Waikato) (Hall et al. 2009). Principal component analysis was computed using Matlab (Matlab v. R2015b, Mathworks). Statistical analysis was performed using SPSS software (IBM SPSS Statistics 20).

Participants

The data used for this research were recorded at the Institute for Mental Health in Belgrade, Serbia. The subjects were 23 patients diagnosed with depression (13 women and 10 men), 24 to 68 years old (mean 31.53, SD 10.21). All of them were examined by a senior clinical psychiatrist (the diagnosis was made according to the ICD-10 classification) and all were medicated. As controls we used the EEG records of 20 age-matched (mean 30.14, SD 8.94) healthy subjects (10 males, 10 females) with no previous history of any neurological or psychiatric disorder, recorded at the Institute for Experimental Phonetics and Speech Pathology in Belgrade, Serbia. This was done because, according to the Institute’s internal rules, healthy people cannot be recorded in a psychiatric institution. Healthy controls underwent a general medical examination and testing by a clinical psychologist to confirm their inclusion in the study. The participants from both groups were all right-handed according to the Edinburgh Handedness Inventory (Oldfield 1971). All the participants were informed about the experimental protocol and signed informed consent forms. The protocol was approved by the Ethics Committees of the participating institutions (Ethics Committee of the Institute for Mental Health, October 27th 2015, Approval number 30/59, and Ethics Committee of the Scientific Council of the Institute for Experimental Phonetics and Speech Pathology, September 25th 2015, Approval number 87-EO/15). All procedures were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Data acquisition

The patients’ EEGs were recorded after a visit to a recommended psychiatrist. EEG was recorded in the resting state with the standard 10-20 system using a NicoletOne Digital EEG Amplifier (VIASYS Healthcare Inc., NeuroCare Group), with the subject sitting upright in a comfortable chair, eyes closed, and without any stimulus. Both recording rooms were Faraday cages by building design; noise was kept below 42 dB at both sites (measured with a phonometer), the temperature was kept at 22 °C, and the lighting was dim daylight, with the person seated surrounded by white curtains admitting some daylight. All the participants (from both groups) were recorded between 10 a.m. and noon. EEGs were obtained from 19 electrodes in a monopolar montage with the reference set to the earlobes (Electro-Cap International Inc., Eaton, OH, USA), using a sampling rate of 1 kHz and electrode resistance below 5 kΩ. The bandpass was 0.5–70 Hz. For the control group, the same setup was used (10-20 system for electrode placement, the same electroconductive gel, resistance, calibration, etc.), but on a Nihon Kohden apparatus, EEG-1200K Neurofax, with an Electro-Cap (model number 16755, Electro-Cap International Inc.). Previous studies (for a detailed review see Pivik et al. 1993) compared results for the same subject on different equipment and concluded that intra-subject variability is small. We kept all the settings identical in all measurements.

Every recording lasted 3 min. Subjects were instructed to minimize movement. The records of two patients had to be discarded from further analysis: one male participant’s due to low-voltage EEG, and one female participant’s due to an epileptic seizure occurring very close in time to the recording. Finally, records from 21 patients diagnosed with depression and 20 age-matched healthy controls were used for this study. Artifacts were carefully inspected by two independent experts. The artifacts were not removed from the EEG signal; instead, only artifact-free segments of the EEG trace were analyzed, in order not to distort the signal by “fusing” selected parts (after artifact removal). From the artifact-free traces, three epochs were extracted for further analysis; every epoch was 5 s (5000 samples) long. Altogether there were three epochs for every person recorded which, across the 19 electrodes, yielded 2337 time series for further analysis.

Fractal analysis

Fractal dimension (FD) of a signal is a measure of its complexity and self-similarity in the time domain. FD is a number in the interval [1, 2]; generally, higher self-similarity and complexity result in higher FD (Eke et al. 2002). The fractal dimension of EEG was calculated using Higuchi’s algorithm (Higuchi 1988), demonstrated to be the most appropriate for electrophysiological data (Esteller et al. 2001; Castiglioni 2010). This method works directly in the time domain, gives a reasonable estimate of the fractal dimension even for short signal segments, and is computationally fast (since it does not attempt to reconstruct a strange attractor, as described in Stam 2005). The Higuchi algorithm was computed with the maximal scale (Higuchi 1988) kmax = 8, shown to perform best for this type of signal (Spasić et al. 2005). Fractal dimensions were calculated for each electrode over the same signal duration (the epoch of recorded EEG) for all the participants, and the calculated values formed ensembles (sets of variables) for further analysis. HFD was calculated with an in-house script written in the Java programming language.
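For illustration, a minimal sketch of Higuchi’s algorithm (Higuchi 1988) in Java, the language of our in-house scripts, is given below. This is not the actual program used in this study; class, method, and variable names are illustrative, and the input is assumed to be a single artifact-free epoch of sufficient length.

```java
// Minimal sketch of Higuchi's fractal dimension (Higuchi 1988);
// illustrative names, not the in-house program used in this study.
public final class HiguchiFd {

    /** Estimates HFD of signal x over scales k = 1..kMax (here kMax = 8). */
    public static double fractalDimension(double[] x, int kMax) {
        int n = x.length;                         // e.g. 5000 samples per epoch
        double[] lnInvK = new double[kMax];
        double[] lnL = new double[kMax];
        for (int k = 1; k <= kMax; k++) {
            double sumLm = 0.0;
            for (int m = 0; m < k; m++) {         // k curves with offsets m
                int nInc = (n - 1 - m) / k;       // number of increments along this curve
                double length = 0.0;
                for (int i = 1; i <= nInc; i++) {
                    length += Math.abs(x[m + i * k] - x[m + (i - 1) * k]);
                }
                // Higuchi's normalization factor (N - 1) / (nInc * k^2)
                sumLm += length * (n - 1) / ((double) nInc * k * k);
            }
            lnInvK[k - 1] = Math.log(1.0 / k);
            lnL[k - 1] = Math.log(sumLm / k);     // mean curve length L(k)
        }
        return slope(lnInvK, lnL);                // FD = slope of ln L(k) vs ln(1/k)
    }

    /** Least-squares slope of ys on xs. */
    private static double slope(double[] xs, double[] ys) {
        double mx = 0, my = 0;
        for (int i = 0; i < xs.length; i++) { mx += xs[i]; my += ys[i]; }
        mx /= xs.length; my /= xs.length;
        double num = 0, den = 0;
        for (int i = 0; i < xs.length; i++) {
            num += (xs[i] - mx) * (ys[i] - my);
            den += (xs[i] - mx) * (xs[i] - mx);
        }
        return num / den;
    }
}
```

Since the curve length scales as L(k) ∝ k^(−FD), the fractal dimension is obtained as the slope of ln L(k) against ln(1/k) across the kmax scales.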

Sample entropy analysis

Another nonlinear measure, Sample Entropy (SampEn), was computed according to Richman and Moorman (2000). SampEn estimates signal complexity by computing the conditional probability that two sequences of a given length m, similar for m points, remain similar within a tolerance r at the next data point (self-matches are not included). Mathematically, SampEn is the negative natural logarithm of this conditional probability. Thus, SampEn measures the irregularity of the data (the higher the value, the less regular the signal), which is related to signal complexity (Goldberger et al. 2002). Changes in SampEn indicate the direction in which the signal changed (whether it became more or less complex). In accordance with a previous study (Molina-Picó et al. 2011), a tolerance of r = 0.15 times the standard deviation of the time series (the series of samples from the raw EEG recording) and m = 2 were used. SampEn was calculated with an in-house script written in the Java programming language.
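A minimal Java sketch of this computation is shown below; again, this is not the in-house program used here, and the names are illustrative. Following Richman and Moorman (2000), B counts pairs of length-m templates matching within r, A counts pairs of length-(m + 1) templates, and SampEn = −ln(A/B).

```java
// Minimal sketch of sample entropy (Richman and Moorman 2000);
// illustrative names, not the in-house program used in this study.
public final class SampleEntropy {

    /** SampEn with template length m (here 2) and tolerance rFactor * SD (here 0.15). */
    public static double sampEn(double[] x, int m, double rFactor) {
        int n = x.length;
        double r = rFactor * std(x);
        long b = countMatches(x, m, r, n - m);       // length-m template pairs
        long a = countMatches(x, m + 1, r, n - m);   // length-(m+1) template pairs
        return -Math.log((double) a / b);            // assumes a > 0 and b > 0
    }

    /** Counts template pairs (i < j) whose Chebyshev distance is within r,
        over the first 'limit' templates; i < j excludes self-matches. */
    private static long countMatches(double[] x, int len, double r, int limit) {
        long count = 0;
        for (int i = 0; i < limit; i++) {
            for (int j = i + 1; j < limit; j++) {
                boolean match = true;
                for (int k = 0; k < len; k++) {
                    if (Math.abs(x[i + k] - x[j + k]) > r) { match = false; break; }
                }
                if (match) count++;
            }
        }
        return count;
    }

    /** Sample standard deviation of the series. */
    private static double std(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0;
        for (double v : x) var += (v - mean) * (v - mean);
        return Math.sqrt(var / (x.length - 1));
    }
}
```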

Statistical analysis

HFD resulted in 19 features (the number of electrodes used when recording EEG), and SampEn also resulted in 19 features. All features were merged to obtain 38 features for further (supervised) machine learning analysis. To determine whether the HFD and SampEn feature values vary significantly between EEG electrodes and between the groups (patients vs. controls), MANOVA (SPSS Statistics version 20.0, SPSS Inc., USA) was used, followed by post hoc Bonferroni tests (comparing each of the 19 electrodes’ HFD and SampEn values between the two groups, resulting in 19 comparisons per measure).

Classifiers

This study compared the performance of several classifiers implemented in the Weka software (Hall et al. 2009), with their default parameter values, to discriminate between patients diagnosed with depression and controls. All classifiers were applied to normalized features. Normalization was performed by subtracting the sample means and dividing by the sample standard deviations, such that the inputs of the algorithms have zero mean and unit standard deviation. To reduce the dimensionality of the feature set and decorrelate the features, we utilized PCA (Jolliffe 2002) to obtain the m principal components (PCs) corresponding to the largest eigenvalues of the sample covariance matrix. We defined the percentage of variance explained by the first m PCs as the ratio between the sum of the variances of the m PCs and the sum of the variances of the original variables.
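In standard notation (ours, not quoted from the cited source), if \(\lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{k}\) are the eigenvalues of the sample covariance matrix of the k = 38 normalized features, the percentage of variance explained by the first m PCs is

\[
EV(m) = 100\% \times \frac{\sum_{i=1}^{m} \lambda_{i}}{\sum_{j=1}^{k} \lambda_{j}}.
\]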

The Naïve Bayes classifier (John and Langley 1995) is a maximum a posteriori classifier that outputs the class \(c_{i}\) with the highest probability given the observed feature values. The posterior probability conditioned on the set of attributes \(\{ x_{1}, \ldots, x_{k} \}\) (assumed independent) is calculated using Bayes’ theorem (Bishop 1995) as a product of Gaussian probabilities (Witten and Frank 2005).
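Written out in a standard form (our notation, not a quotation from the cited sources), the predicted class is

\[
\hat{c} = \arg\max_{c_{i}} \; P(c_{i}) \prod_{j=1}^{k} p(x_{j} \mid c_{i}),
\]

where each \(p(x_{j} \mid c_{i})\) is modeled as a Gaussian density with class-specific mean and variance.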

Logistic regression (Cox 1958) estimates the class conditional probability using a linear combination of the features passed through the logistic function. The model coefficients are estimated using the LogitBoost algorithm (Friedman et al. 2000).
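For two classes this takes the standard form (our notation)

\[
P(c \mid x_{1}, \ldots, x_{k}) = \frac{1}{1 + \exp\!\left( -\left( w_{0} + \sum_{j=1}^{k} w_{j} x_{j} \right) \right)},
\]

with the weights \(w_{j}\) fitted, in our setup, by LogitBoost.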

The multi-layer perceptron (MLP) is a generalization of logistic regression with an additional processing layer. The MLP assigns a class by thresholding the value of its output processing unit (Friedman et al. 2000; Haykin 2008). The output unit applies a non-linear transfer function to a linear combination of the outputs of the hidden neurons, and each hidden neuron applies a transfer function to a linear combination of the inputs. We utilized MLPs with \(\frac{k + 1}{2}\) hidden neurons and the backpropagation algorithm with learning rate 0.3, momentum 0.2, and 500 training epochs to determine the coefficients (Haykin 2008).
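Schematically (our notation; \(\varphi\) denotes the sigmoid transfer function), the network computes

\[
h_{j} = \varphi\!\left( b_{j} + \sum_{i=1}^{k} v_{ji} x_{i} \right), \qquad
y = \varphi\!\left( b_{0} + \sum_{j} w_{j} h_{j} \right),
\]

and the class is assigned by thresholding the output y.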

Support vector machines (SVM) classify by partitioning the feature space with a decision boundary that is linear in a transformed space defined by the kernel function and is uniquely determined by a subset of the data, the support vectors (Jolliffe 2002). SVMs produce a maximal margin classifier that maximizes the distance between the decision boundary and the support vectors. In this study, we utilized linear and polynomial (quadratic) kernel functions (Jolliffe 2002), a soft-margin classifier with regularization constant C = 1, and the sequential minimal optimization algorithm (Platt 1998). By design, SVMs maximize the classifier margin and hence, presumably, reduce overfitting.
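The resulting decision function has the standard form (our notation)

\[
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i \in \mathcal{SV}} \alpha_{i} y_{i} K(\mathbf{x}_{i}, \mathbf{x}) + b \right),
\]

where the sum runs over the support vectors and \(K\) is the linear kernel \(K(\mathbf{u}, \mathbf{v}) = \mathbf{u} \cdot \mathbf{v}\) or, in our quadratic case, a degree-2 polynomial kernel of the form \(K(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v})^{2}\) (the default polynomial form in Weka).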

Decision trees (Quinlan 1993) recursively partition the feature space into regions corresponding to classes by choosing, at each node, the feature that provides the highest information gain. The partitioning stops when the minimal number of 2 samples per node of the decision tree is reached. In the pruning phase, based on an estimate of the classification error (using a confidence level here set to 0.25), the complexity of the model may be reduced and its generalization capacity thus improved (Vapnik 1988).
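The information gain criterion has the standard form (our notation): splitting the samples S at a node on a feature A gains

\[
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_{v}|}{|S|} \, H(S_{v}), \qquad
H(S) = - \sum_{c} p_{c} \log_{2} p_{c},
\]

where \(p_{c}\) is the proportion of class c in S and \(S_{v}\) is the subset of S taking value v; for continuous features such as ours, candidate binary threshold splits are evaluated.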

The random forests classifier utilizes an ensemble of unpruned trees (Breiman 2001). The classification is performed by combining the classes predicted by the ensemble members. Unlike the C4.5 algorithm, at each node of a tree only a random subset of the features is considered for partitioning. This study utilized ensembles with 100 members and considered int(log2 k) + 1 random features for each split.

Evaluation of classifier’s performance

Classification accuracy was evaluated through a cross-validation procedure in which the dataset was split into K subsets of approximately equal size; K − 1 subsets were used to fit a classification model and the remaining subset was used to evaluate the classifier. This procedure was repeated K times, such that the classifier was evaluated on each subset. In this study, we used K = 10 (Picard and Cook 1984; Kohavi 1995). Classification performance was assessed through the overall accuracy, the percentage of correctly classified samples, and the area under the ROC curve (AUC). The overall accuracy of useful classifiers in two-class problems ranges from 50 to 100%. The ROC curve is created by plotting the true positive rate (the proportion of samples with depression that are detected as such) against the false positive rate (the ratio of the number of controls incorrectly detected as having depression to the total number of controls). AUC ranges from 0.5 (for a classifier that randomly guesses a class) to 1 (for an ideal classifier) (Fawcett 2006; Hand and Till 2012).
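As an aside, AUC can be computed directly from classifier scores as the probability that a randomly chosen patient receives a higher score than a randomly chosen control (the Mann-Whitney rank-sum formulation). A minimal Java sketch, with illustrative names and hypothetical score arrays, is:

```java
// Minimal sketch: AUC via the Mann-Whitney (rank-sum) statistic.
// 'patientScores' and 'controlScores' are hypothetical classifier outputs.
public final class AucEstimate {

    public static double auc(double[] patientScores, double[] controlScores) {
        double wins = 0.0;
        for (double p : patientScores) {
            for (double c : controlScores) {
                if (p > c) wins += 1.0;
                else if (p == c) wins += 0.5;   // ties count as half a win
            }
        }
        // Fraction of (patient, control) pairs ranked correctly:
        // 0.5 corresponds to chance, 1 to an ideal classifier.
        return wins / ((double) patientScores.length * controlScores.length);
    }
}
```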

In machine learning, model complexity (of the classifier, not of the signal) is defined as the ability of a classifier to distinguish among classes that are separated by complicated surfaces. Here, model complexity was measured using the Vapnik-Chervonenkis (VC) dimension. Burges (1998) indicates that models with a small number of parameters may nevertheless have a large VC dimension and complexity. According to statistical learning theory (Vapnik 1988), the classification accuracy on test data (measured by the tenfold cross-validation method in this study) decreases by a factor that is directly proportional to the VC dimension of the model and inversely proportional to the size of the training data set. Among the classification methods considered in this paper, the multilayer perceptron and the decision tree have VC dimensions that increase with the number of features utilized for classification.

Results

Fractal and SampEn analysis

Results showed that HFD for patients diagnosed with depression ranged from 1.0812 to 1.1553, and for controls from 1.0194 to 1.0198. SampEn of patients diagnosed with depression ranged from 0.3999 to 0.4160, and in the control group from 0.1417 to 0.1591. The first level of analysis tested for differences in HFD and SampEn values between the Patient (P) and Control (C) groups. MANOVA was utilized with the factors Electrode (19 electrodes: Fp1, Fp2, F3, F4, C3, C4, P3, P4, O1, O2, F7, F8, T3, T4, T5, T6, Fz, Cz, Pz) and Group (P and C).

Results of the HFD analysis (Fig. 1a) showed a statistically significant effect of group, F(1, 570) = 159.965, p < 0.001, as well as an interaction of group and electrode, F(18, 570) = 1.677, p = 0.039, on the calculated HFD values.

Fig. 1 Differences between the Patient (P) and Control (C) groups in HFD (a) and SampEn (b) values (all electrodes averaged). For both HFD and SampEn, mean values + standard error are presented. ***p < 0.001. HFD Higuchi’s fractal dimension, SampEn sample entropy

There was no statistically significant difference between electrodes within the P and C groups. Our results show higher HFD values in the P group compared to the C group. A significant difference in HFD values between the P and C groups exists for every electrode (post hoc Bonferroni correction, p < 0.05), except for electrodes P3 and F4.

Results of the SampEn analysis (Fig. 1b) showed a statistically significant effect of group, F(1, 570) = 625.914, p < 0.001, on SampEn values. Results showed higher SampEn values for EEG from the P group when compared to the C group. There were no statistically significant differences among electrodes within the P or C group when tested independently (post hoc Bonferroni correction, p < 0.05). A significant difference between the P and C groups in SampEn values exists for every electrode (post hoc Bonferroni correction, p ≤ 0.001). Subjects from the P group have higher SampEn values, calculated from EEG recorded at all the electrodes, when compared to the C group values.

Correlation of HFD and SampEn values

Figure 2 contains calculated Pearson’s correlation coefficients (Devore 2012) between the features. Figure 2a shows correlation coefficients between HFD values calculated for 19 electrodes. The minimal and the maximal correlation coefficient values were 0.77 and 0.99, respectively. Figure 2b displays the correlation coefficient values between SampEn features. The minimal and maximal values of the correlation were respectively 0.86 and 0.99. Figure 2c depicts the correlation between pairs of SampEn and HFD features; the minimal and maximal correlation coefficients were 0.44 and 0.79. All the estimated values of correlation coefficients were significantly different from 0 (p < 0.005).

Fig. 2 Pearson’s correlation coefficients computed between pairs of features. The value of the correlation coefficient is color-coded (the scales in each panel differ). The labels on the x and y axes denote the electrode corresponding to a feature. a Correlation coefficients between pairs of HFD features; b correlation coefficients between pairs of SampEn features; c correlation coefficients between SampEn (x axis) and HFD (y axis) features

When two feature sets are weakly correlated, but each of them separately provides good classification accuracy, then adding one feature set to the other could in principle result in higher classification accuracy than when the feature sets are used separately. In such a case, we could benefit from the (relative) orthogonality of two informative feature sets.

Classification

Table 1 shows the classification results for the different classifiers: multilayer perceptron, logistic regression, Support Vector Machines (SVM) with linear and polynomial kernels (p = 2), Decision Tree, Random Forest, and Naïve Bayes. Accuracy and area under the ROC curve (AUC) are shown for three different sets of features: HFD, SampEn, and their combination. Average accuracies are also shown for each feature set (across all classifiers) and for each method.

Table 1 Classification results for different classifiers and three different sets of features

Figure 3 shows the percentage of explained variance of the combined set of HFD and SampEn features as a function of the number of principal components. Figure 4 shows the first two normalized principal components, and Fig. 5 shows the absolute values of the loads used to calculate the first ten principal components from the HFD and SampEn features. Table 2 shows the classification results of the different classifiers using various numbers of principal components. The principal components were computed on a dataset containing the HFD and SampEn features and were normalized to have zero mean and unit standard deviation. The variance of the features explained by the corresponding principal components is also shown.

Fig. 3 Percentage of explained variance vs. number of principal components of the HFD and SampEn features. For each number of principal components m, the percentage of explained variance was calculated. When the number of principal components reaches 20, the percentage of explained variance saturates to a value close to 100%

Fig. 4 First two normalized principal components (PC) for the samples considered in this study. The principal components are computed on a feature set containing the HFD and SampEn features (a total of 38 features). Each symbol denotes one sample in the transformed space: red squares denote patients, while green diamonds correspond to controls. The vertical line denotes a separation line corresponding to a constant value of the first principal component (PC1)

Fig. 5 Absolute values of the principal component loads for the first 10 principal components. Each row indicates the coefficients multiplying the corresponding non-linear feature in order to generate a principal component

Table 2 Classification results for different classifiers and different number of principal components of the features from SampEn and HFD sets

Figure 3 and Table 2 indicate that a large portion of the variance of the combined HFD and SampEn features can be explained by a small number of principal components (Pokrajac et al. 2014): the first principal component explains 87.53% of the variance, the first three components explain more than 95%, and the first 10 components close to 99%.

Using only the first principal component, it is possible to achieve a classification accuracy of up to 95.12% (Table 2). The best performance was achieved using the Naïve Bayes method (in this case, when only one feature is utilized, the assumption of feature independence is automatically satisfied). The classification accuracy generally increases with the number of principal components used as classifiers’ inputs (the average accuracy of all classifiers is 88.15% with 1 and 93.73% with 10 principal components used).

Discussion

The major finding of this study is that the extraction of non-linear features linked to the complexity of EEG signals can lead to high and potentially useful separation between signals taken from control subjects and patients diagnosed with depression. Specifically, this study demonstrated that Higuchi’s Fractal Dimension (HFD) and Sample Entropy (SampEn) could be used as suitable features for various machine learning classification techniques. In other words, proper choice of a non-linear feature extraction method (HFD/SampEn) simplifies an important classification problem and makes it tractable. To the best of our knowledge, this study is the first to apply this specific feature extraction method on this particular classification task.

Compared to the existing literature, our research offers substantial originality: the combined use of HFD and SampEn on the broadband EEG signal (with minimal pre-processing); the application of a variety of classification methods, demonstrating robustness to the choice of classification method when non-linear features are utilized; and the application of principal component analysis (PCA), demonstrating its power for feature extraction. The rest of this Discussion concentrates on these aspects of our work.

In the present literature, only a few studies have applied an approach similar to ours (Bachmann et al. 2018; Ahmadlou et al. 2012; Hosseinifard et al. 2013; Acharya et al. 2015). Fractal dimension (Ahmadlou et al. 2012) and both linear and nonlinear measures of EEG (Hosseinifard et al. 2013) were applied to the classification of patients diagnosed with depression versus healthy controls. It was found that nonlinear features gave better results than spectral ones in the classification of patients diagnosed with depression (Hosseinifard et al. 2013). Note that the use of reductionist approaches, such as Fourier analysis, was found inferior (Rabinovich 2006; Klonowski 2007). The rationale comes from complexity theory and reflects a consensus among researchers dealing with nonlinear analysis: the key properties of linear systems are proportionality and superposition (Goldberger et al. 2002), and nonlinear systems, which obey neither, defy comprehensive understanding by a classic reductionist approach (such as Fourier analysis). Since the human brain is one of the most complex systems known, analyzing a signal originating from it (EEG) with a reductionist method could be misleading.

In line with previous findings (Bachmann et al. 2013; Ahmadlou et al. 2012) about the measures used for the characterization of EEG, in our study HFD detected increased complexity of EEG recorded from patients diagnosed with depression in comparison to healthy controls. This is also in agreement with the study by Hosseinifard et al. (2013), who demonstrated that non-linear features such as HFD, correlation dimension, and the Lyapunov exponent are more discriminative than linear features. The main difference in our study is that the broadband signal was analyzed, whereas other studies divided the signal into standard spectral bands. Ahmadlou et al. (2012) reported a classification accuracy of 91.3% when using two differently calculated fractal dimensions (Higuchi’s and Katz’s) as features and enhanced probabilistic neural networks for classification. Our results are in qualitative agreement with this finding. In addition to confirming previous results, this study showed not only that SampEn can also effectively discriminate these two categories of EEG signals, but also that its performance can be superior to that of HFD (see Table 1).

A variety of classification methods have been used in domains similar to ours: support vector machines (SVM), linear regression (LR), linear discriminant analysis (LDA), k-nearest neighbors (kNN), and enhanced probabilistic neural networks (Ahmadlou et al. 2012; Hosseinifard et al. 2013; Acharya et al. 2015). The choice of a specific classification method is frequently a matter of researcher bias (Pokrajac et al. 2014). Moreover, in the absence of standardized data repositories such as those that exist in other domains (Lichman 2013), and of strict statistical tests for the comparison of frequently non-linear and non-parametric classifiers (Efron and Tibshirani 1997), direct comparison of accuracies among different methods and results from different publications is challenging.

Instead of attempting to compare the performance of individual classifiers, our goal was to demonstrate that the usage of non-linear features can result in high classification accuracy regardless of the choice of classifier. A similar methodological approach to validating the usefulness of feature extraction was taken in Pokrajac et al. (2014) in another domain. For this reason, our study did not try to optimize classifier parameters, but utilized their default values (similarly to Unnikrishnan et al. 2016). The reported average accuracy of all the methods, as an indicator of the quality and usability of the features, was 95.12% on SampEn features (and 89.90% on HFD), see Table 1. To estimate the accuracy of each particular method, the standard tenfold cross-validation technique (Devijver and Kittler 1982) was utilized. Note that it yielded results qualitatively similar to those of the bootstrapping technique applied by Ahmadlou et al. (2012).

Note that this study examined classification methods with a range of underlying paradigms and complexities; the methods belong to statistics and supervised machine learning. Even the simplest methods, such as logistic regression (widely accepted in the medical community, although not a classification method in the strict sense), provided excellent classification accuracy. In fact, the high classification accuracy of methods such as SVM with the linear kernel indicates that, after a non-linear transformation, the data may become close to linearly separable; we are, however, well aware that this may be specific to this particular dataset and should be tested on further data. Consistent with known properties of the Naïve Bayes classifier (Witten and Frank 2005; Mitchell 1997), it had good accuracy (e.g., 92.68%, AUC of 0.983 when applied to SampEn and HFD features combined), albeit the underlying assumption of feature independence is not satisfied. Presumably due to the high correlation of the features (see Fig. 2), there was no benefit of using random forests in comparison to the standard decision tree classifier.

Our results indicate that the use of SampEn features may result in classification results comparable to or better than those of HFD. Five out of the seven examined classifiers provided better accuracy, and six provided a higher area under the ROC curve, when applied to SampEn features. The accuracy of a linear model (SVM with the linear kernel) increased by almost 10% when applied to SampEn features. Similarly, the accuracy of the SVM with the polynomial kernel increased by almost 15%. The relatively low accuracy of polynomial SVMs is in agreement with previously reported results (Hosseinifard et al. 2013).

To the best of our knowledge, there have been no previous attempts to utilize a combination of HFD and SampEn features. In our study, the use of an augmented feature set consisting of HFD and SampEn features did not lead to a further improvement of accuracy for the majority of the attempted classifiers (Table 1). Note that HFD features are less correlated with SampEn features than HFD or SampEn features are correlated among themselves: the maximal correlation between an HFD and a SampEn feature is 0.79, whereas the maximal correlation between two SampEn features or between two HFD features is larger than 0.98. The combination of two relatively uncorrelated feature sets, such as HFD and SampEn, provides an opportunity to train more expressive classification models that could result in better classification accuracy. This would fully exploit the orthogonality of the features when using more complex classification models, which would generalize better if trained on larger datasets (Kecman 2001). However, the prerequisite for achieving such increased accuracy is a sufficiently large data set, which may be produced in follow-up studies. If the available data set is not sufficiently large, some machine learning algorithms suffer from the potential for overfitting, in which a learning algorithm, in attempting to minimize error on a training set, results in a model with poor generalization ability (Vapnik 1988). Our results suggest this may be the case with random forests and the multilayer perceptron, where the accuracy achieved with the combined (HFD + SampEn) feature set is smaller than when using HFD or SampEn separately (Table 1). In contrast, machine learning models that have small or controlled complexity [e.g., expressed through the VC dimension (Vapnik 1988)], such as support vector machines, Naïve Bayes, or logistic regression, did not exhibit this behavior; in these models, the use of the augmented feature set led to the same or increased accuracy.

Principal component analysis, a feature extraction method in which a linear transformation is applied to the original feature vector to reduce its dimensionality (Jolliffe 2002), is applied in this paper in order to demonstrate that accurate classification is possible using a small number of principal components. Note that this technique is typically utilized to decorrelate features, such as HFD and SampEn in our case (see Fig. 2).

As noted above, using only the first principal component it is possible to achieve a classification accuracy of up to 95.12%, using the Naïve Bayes method (Table 2), and the classification accuracy generally increases with the number of principal components used as classifiers’ inputs (the average accuracy of all classifiers is 88.15% with 1 and 93.73% with 10 principal components). Since the data are close to linearly separable, the SVM with the linear kernel resulted in relatively high accuracy (85.37%) when only one PC was used; the accuracy further increased to 90.24% with 10 PCs. This can be in part explained by Cover’s theorem (Witten and Frank 2005), according to which data in higher dimensional spaces tend to be more linearly separable. The random forest method benefited from a larger number of utilized principal components, since the method is based on randomly choosing one from a set of available features to split a decision tree node (when the number of PCs used is small, the set of available features is small).

The determination of the minimal number of PCs depends on the desired classification accuracy and is generally problem dependent (Pokrajac et al. 2014; Kecman 2001). In this specific case, if the minimal AUC is set to 0.85 (corresponding to diagnostic tests considered good; see http://gim.unmc.edu/dxtests/Default.htm), three PCs are sufficient, according to Table 2.

Our results suggest that the classifiers could be implemented on simple and inexpensive hardware and embedded in existing EEG devices. This may ultimately lead to everyday clinical usage of our methodology for providing computer-aided diagnostics of depression. The technology could be of interest, e.g., where burnout or an extreme amount of stress is mistaken for symptoms of depression under the current diagnostic methods. Other researchers are testing whether a similar methodology can help identify a patient as a good responder to a particular treatment, be it medication or transcranial direct current stimulation (Shahaf et al. 2017; Al-Kayasi et al. 2017).

Note that the values of the PCA loads (used to weight the non-linear features in order to compute the principal components), Fig. 5, indicate that all non-linear features contribute to the calculated principal components. In other words, the use of PCA implies that the information is not contained in the signal from one particular electrode, but is distributed across multiple electrodes. Therefore, the use of multiple electrode signals can contribute to a better distinction between controls and patients diagnosed with depression. This is in agreement with previous fMRI and DTI findings (de Kwaasteniet et al. 2013). Namely, in patients diagnosed with depression, decreased functional connectivity within the fronto-limbic network and an anatomical difference in the second part of the uncinate fasciculus, a deep white matter tract connecting the prefrontal cortices with the limbic system, are observed. Our hypothesis was that such dysfunction might result in compensation that might be detected on the surface (cortex). The brain compensation in turn translates to elevated excitability on the cortex, which can be observed in signals from multiple EEG electrodes. Note that in the available literature, the number of utilized electrodes is smaller than in our study: Ahmadlou et al. (2012) used seven (prefrontal) electrodes, while Bachmann et al. (2013) recommended two electrodes, or one electrode (Bachmann et al. 2018).

Finally, it should be emphasized that an extension of the method to larger data sets is needed before final conclusions can be drawn about class separability and the potential applicability of the classification techniques for diagnostic purposes.

Conclusion

This study demonstrated that Higuchi’s fractal dimension and sample entropy are capable of distinguishing between the EEG of participants diagnosed with depression and that of healthy controls. If a feature extraction method results in good classification accuracy regardless of the applied machine learning technique, this provides evidence that the feature extraction method is useful. These results encourage further investigation with larger sample sizes, towards potential diagnostic application in clinical medicine and psychiatry.