1 Introduction

With a lifetime prevalence of up to 4–5 %, bipolar disorders (BD) are among the most prevalent psychiatric diseases [1]. Regrettably, BD still poses a diagnostic challenge, and the criteria for bipolarity remain controversial [2], which causes bipolar subjects to be misdiagnosed as having unipolar depression, leading to insufficient treatment and poor outcomes [3–5]. BD and unipolar disorder (UD) have distinct pathophysiologies but similar depressive presentations, and current diagnoses are determined mainly by structured clinical assessment based on the Diagnostic and Statistical Manual of Mental Disorders-IV (DSM-IV), a symptom-based rather than an etiology-based approach [6]. Patients' self-reports of past history and the depressive presentation of bipolar patients may also inadvertently cause misdiagnosis [7]. Thus, an effective classification method is required to dichotomize unipolar and bipolar subjects in order to apply the right treatment to the right patient [8]. Recent studies have used neuroimaging methods in bipolar and unipolar disorders to reveal discrete patterns of functional and structural abnormalities in neural systems critical for emotion regulation [9–11], while other studies employed traditional statistical techniques that rely on the basic assumption of linear combinations, which may not be appropriate for such tasks [12].

Classification is considered a useful tool for medical problems, with a common application area being medical diagnosis [13, 14]. Fundamentally, a classification policy could be established by medical experts to enable a better understanding of the problem. Recent engineering studies have contributed to the classification of diseases using techniques such as expert systems, artificial neural networks, linear programming, database systems, evolutionary algorithms, and swarm intelligence [14–19].

Over the past decade, machine learning (ML) methods have been used increasingly in the field of affective disorders and in the comparison of these patients to those with other psychiatric disorders [20]. With their extensive use and promising results, ML approaches avoid oversimplification by incorporating high-order interactions between predictive variables, and studies underline the superiority of artificial neural networks (ANNs) over linear methods in a number of areas of medical research [21–24]. In most ANN applications, gradient-based algorithms are used for training, which may become trapped in local minima. Although the training process may be repeated a number of times starting from different initial conditions, this rarely guarantees reaching the global optimum of a multimodal high-dimensional problem.

This problem in the training process can be addressed by applying optimization algorithms, which have become more popular in recent years [25]. The increasing number of features in the medical domain requires a feature selection (FS) process, and recent studies highlight swarm intelligence algorithms as a crucial step for evaluating and processing the data in an efficient way [14]. An appropriate and relevant feature subset selection process also reduces the risk of overfitting, thus improving model generalization by decreasing the model's complexity [26]. This is particularly important in small, high-dimensional datasets, where the curse of dimensionality is present and a significant gain in performance can be achieved with a small subset of features [27, 28].

In a recent study, a hybrid particle swarm optimization (PSO)–back-propagation algorithm was used for feed-forward neural network training. The combined method could overcome the slow search of PSO around the global optimum. In another study, two methods of neural network training using PSO and back-propagation learning for medical decision-making were proposed, and the experimental results suggested that back-propagation is generally preferable over PSO for imbalanced training data, especially with small datasets and a large number of features [29]. In this context, the motivation for the present research was an interest in developing a robust classification tool to address the diagnostic problem of unipolar and bipolar disorders. With its multidisciplinary nature, this study combined ML and metaheuristic approaches to discriminate unipolar and bipolar subjects with a reduced number of features, using the PSO algorithm for feature selection and employing quantitative EEG (QEEG) cordance as a biomarker.

2 Materials and methods

2.1 Subjects

We conducted a retrospective investigation involving a study group of 89 patients selected from a larger population of 1200 patients, all of whom were consecutively admitted for BD or UD at the Neuropsychiatry Istanbul Hospital Department of Psychiatric Outpatient Clinics between January 2010 and December 2013. We matched 31 bipolar disorder depressive episode patients and 58 unipolar depressive episode patients from various age groups and genders. Eligible subjects were outpatients suffering from a depressive episode associated with BD or UD. All participants were given a primary diagnosis of either BD or UD according to DSM-IV criteria, specifically the Structured Clinical Interview for Axis I Disorders (SCID-I). We included subjects with a diagnosis of UD who scored at least 8 on the 17-item Hamilton Depression Rating Scale (HDRS), or subjects with a diagnosis of a BD episode scoring higher than 13 points on the Young Mania Rating Scale (YMRS) [30]. We excluded subjects experiencing their first depressive episode or an episode with current psychotic features, as well as those with a history of rapid cycling (≥4 cycles during a year), a history of mixed episodes, current psychiatric comorbidity on Axis I, serious unstable medical illness or neurologic disorder (e.g., epilepsy, head trauma with loss of consciousness), alcohol or substance abuse within 6 months preceding the study, and patients treated with electroconvulsive therapy within 3 months before their participation in the study. All patients were medication-free for at least 48 h. Participants underwent routine laboratory studies (complete blood count, chemistry, thyroid-stimulating hormone); a urine toxicology screen and electrocardiogram were performed at study screening, and subjects were required to be medically stable before enrollment in the study.

2.2 EEG recordings and cordance calculations

For all patients, EEG was recorded for 12 h in a drug-free condition. In order to observe and reveal the efficacy of cordance, QEEG data were collected from the 89 subjects, who were seated in a sound-attenuated, electrically shielded room in a reclining chair with eyes closed (wakeful resting condition). Technicians monitored the QEEG data during the recording and re-alerted the subjects every minute as needed to avoid drowsiness. Electrodes were placed using an electrode cap with 19 recording electrodes distributed across the head according to the international 10–20 system. Three minutes of eyes-closed resting EEG were acquired using a Scan LT EEG amplifier and electrode cap (Compumedics/Neuroscan, USA) at a sampling rate of 250 Hz. Sintered Ag/AgCl electrodes were positioned according to the international 10–20 system with a binaural reference. For each individual, cordance values were calculated using the EEG data gathered from the recording electrodes and ten regions (prefrontal, frontocentral, central, left temporal, right temporal, left parietal, occipital, midline, left frontal, and right frontal) in the delta, theta, and alpha frequency bands.

Cordance combines complementary information from the absolute and relative power of the EEG spectrum to yield values having a stronger correlation with regional cerebral perfusion than either measure alone [31]. Absolute power, coherence, and cordance have been shown to be indices of local cerebral perfusion in previous studies [32, 33]. Increased slow-wave and decreased fast-wave activity on the electroencephalogram are common in brain dysfunction and may be caused by partial cortical deafferentation. Cordance is measured along a continuum of values: positive values denote concordance, an indicator associated with normally functioning brain tissue, and negative values denote discordance, an indicator associated with undercutting lesions, low perfusion, and low metabolism [34].

Raw EEG signals were filtered through a band-pass filter (0.15–30 Hz) before artifact elimination. Artifact detection was performed visually to remove EEG segments with obvious eye and head movements, muscle artifacts, or a decrease in alertness. Manually selected artifact-free EEG data (minimum 2 min) with a minimum split-half reliability ratio of 0.95 and test–retest reliability ratio of 0.90 were used for cordance calculations. The EEG reviewer was blind to the subjects' treatment condition and clinical status. The fast Fourier transform (FFT) was used to calculate absolute and relative power in each of four non-overlapping frequency bands [35], delta (1–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), and beta (12–20 Hz), using NeuroGuide Deluxe 2.5.1 software (Applied Neuroscience; St. Petersburg, FL, USA). Cordance values were calculated using a custom algorithm in MATLAB® 7.10.0.499 available for research purposes.
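As a rough illustration of this step, the following minimal Python sketch computes absolute and relative band power for a single artifact-free channel. It is not the NeuroGuide pipeline used in the study: Welch's method stands in here for the FFT-based power estimation, and the window length, the synthetic test segment, and the function name band_powers are illustrative assumptions.

```python
import numpy as np
from scipy.signal import welch

FS = 250  # sampling rate in Hz, as reported for the recordings
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12), "beta": (12, 20)}

def band_powers(eeg_segment, fs=FS, bands=BANDS):
    """Return absolute and relative power per band for one artifact-free channel."""
    # Power spectral density via Welch's method (2-s windows chosen here)
    freqs, psd = welch(eeg_segment, fs=fs, nperseg=2 * fs)
    df = freqs[1] - freqs[0]

    # Absolute power: integrate the PSD over each frequency band
    absolute = {name: psd[(freqs >= lo) & (freqs < hi)].sum() * df
                for name, (lo, hi) in bands.items()}

    # Relative power: each band's share of the total power over all bands
    total = sum(absolute.values())
    relative = {name: p / total for name, p in absolute.items()}
    return absolute, relative

# Example with a synthetic 2-min segment (placeholder for real EEG data)
rng = np.random.default_rng(0)
abs_p, rel_p = band_powers(rng.standard_normal(120 * FS))
```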

This algorithm normalizes power across both electrode sites and frequency bands in three consecutive steps. First, absolute power values are re-attributed to each individual electrode by averaging the power from all bipolar electrode pairs sharing that electrode; this electrode-referencing method is similar to the Hjorth transformation [36], except that the current method averages the power from neighboring electrode pairs, enabling a stronger correlation between surface-measured EEG and perfusion of the underlying brain tissue than either the linked-ears reference or the conventional Hjorth transformation [37]. Relative power values are then obtained by dividing the absolute power value by the total power at each electrode site and frequency band. In the second step, the maximum absolute and relative power values (A_MAX,f and R_MAX,f) in each frequency band (f) are determined, and normalized absolute (A_NORM(s,f)) and normalized relative (R_NORM(s,f)) power values are obtained by dividing the absolute and relative power values at each electrode site (s) and frequency band (f) by A_MAX,f and R_MAX,f, respectively. Finally, the cordance value at each electrode site and frequency band is calculated by summing the A_NORM and R_NORM values after the half-maximal value (0.5 on the normalized scale) is subtracted from each: CORDANCE(s,f) = (A_NORM(s,f) − 0.5) + (R_NORM(s,f) − 0.5) [38].
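The normalization and summation steps described above can be condensed into a short NumPy sketch. This is a schematic reimplementation, not the custom MATLAB algorithm used in the study; it assumes the re-attributed absolute power values of step 1 are already available as an electrodes × bands array, and the function name cordance and the random example values are illustrative.

```python
import numpy as np

def cordance(absolute_power):
    """Compute cordance values from re-attributed absolute power.

    absolute_power : array of shape (n_electrodes, n_bands), absolute power
                     per electrode site s and frequency band f (step 1 output).
    Returns an array of the same shape with CORDANCE(s, f) values.
    """
    A = np.asarray(absolute_power, dtype=float)

    # Relative power: divide by the total power at each electrode site
    R = A / A.sum(axis=1, keepdims=True)

    # Step 2: normalize by the maximum value within each frequency band
    A_norm = A / A.max(axis=0, keepdims=True)   # A_NORM(s, f)
    R_norm = R / R.max(axis=0, keepdims=True)   # R_NORM(s, f)

    # Step 3: cordance is the summed deviation from the half-maximal value 0.5
    return (A_norm - 0.5) + (R_norm - 0.5)

# Example: 19 electrodes x 4 bands of hypothetical absolute power values
rng = np.random.default_rng(1)
cord = cordance(rng.uniform(1.0, 30.0, size=(19, 4)))
```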

2.3 Feature selection

In the feature selection process, feature interaction is a challenge to overcome. The best features are usually a group of features exhibiting complementarity, and there may be two-way or multi-way interactions among features [39, 40]. Thus, an individually relevant and important feature may become redundant when considered together with other features, yet eliminating some features may also reduce the complexity. On the other hand, an individually redundant or weakly relevant feature may become highly relevant when considered with others. To resolve this dilemma, the optimal feature subset should be a group of complementary features, which implies a large search space. The size of the search space increases exponentially with the number of available features in the dataset [41] (for n features there are 2^n candidate subsets), which makes an exhaustive search practically impossible in most situations. Although various search algorithms have been applied to the feature selection process, many of them still suffer from stagnation in local optima or are computationally expensive [42, 43]. In order to better address feature selection problems, an efficient global search technique is required. Evolutionary computation (EC) techniques are well known for their global search abilities. PSO [44, 45], a relatively recent and promising EC method, is computationally less expensive than other EC algorithms and has been employed as an effective feature selection method in many studies [43, 46].
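One common way to couple such a search with a classifier, sketched below under illustrative assumptions, is to let each particle position encode a candidate feature subset by thresholding its components and to use the cross-validated error of the classifier on that subset as the fitness. Neither the 0.5 threshold nor the helper names below are taken from the study.

```python
import numpy as np

def subset_from_particle(position, threshold=0.5):
    """Interpret a continuous particle position as a feature-inclusion mask."""
    mask = np.asarray(position) > threshold
    # Guard against an empty subset by keeping the strongest component
    if not mask.any():
        mask[int(np.argmax(position))] = True
    return mask

def subset_error(position, X, y, cv_error):
    """Fitness of a particle: cross-validated error on the selected features.

    cv_error : callable returning the misclassification rate of the chosen
               classifier for a given feature matrix (assumed, not specified
               by the study).
    """
    mask = subset_from_particle(position)
    return cv_error(X[:, mask], y)
```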

2.4 Particle swarm optimization

Particle swarm optimization (PSO) is an evolutionary-based algorithm inspired by social behaviors of animals such as fish schooling, bird flocking, and insect swarming [44]. PSO searches for the optimum solution(s) by updating a population of candidate solutions and directing the search toward the regions of interest in the search space [47]. The search in PSO is performed by a population of s particles x_i (i ∈ [1, …, s]) that update their locations in the search space through a modified velocity V_ij (i ∈ [1, …, s], j ∈ [1, …, n], where n is the dimension of the search space) over time. The velocity (V) and particle position update equations are given in Eqs. (1) and (2), respectively,

$$ V_{ij} \left( t \right) = W \times V_{ij} \left( {t - 1} \right) + C_{ij} + S_{ij} $$
(1)
$$ x_{i,j} \left( t \right) = x_{i,j} \left( {t - 1} \right) + V_{i,j} \left( t \right) $$
(2)
$$ C_{ij} = c_{1} r_{1,j} \times \left( {Pbest_{i,j} \left( {t - 1} \right) - x_{i,j} \left( {t - 1} \right)} \right) $$
(3)
$$ S_{ij} = c_{2} r_{2,j} \times \left( {Gbest_{j} \left( {t - 1} \right) - x_{i,j} \left( {t - 1} \right)} \right) $$
(4)

In the given equations, W is the inertia weight, t is the current iteration, i is the particle index in the population, and j is the dimension. Here, r_{1,j} and r_{2,j} are distinct random values in the range between 0 and 1, and c_1 and c_2 are acceleration coefficients that control the effectiveness of the cognitive (C) and social (S) components, respectively. Several methods have been suggested to adjust the parameters in Eq. (1), including linearly decreasing inertia weight (LDIW), time-varying inertia weight (TVIW), linearly decreasing acceleration coefficients (LDAC), time-varying acceleration coefficients (TVAC), random inertia weight (RANDIW), fixed inertia weight (FIW), random acceleration coefficients (RANDAC), and fixed acceleration coefficients (FAC). LDIW and FAC are the most common approaches among the proposed methods in several studies [48–50]. Equation (5) gives the LDIW formulation as

$$ W = \left( {w_{1} - w_{2} } \right) \times \frac{{\left( {\max_{\text{iter}} - t} \right)}}{{\max_{\text{iter}} }} + w_{2} $$
(5)

where w_1 and w_2 are the initial and final inertia weights, t is the current iteration, and max_iter is the maximum number of iterations used to terminate the loop. Pbest_i and Gbest are the local (personal) and global best solutions, representing the best solution found by an individual particle and the best overall solution found by the swarm; they are updated using Eqs. (6) and (7). In these equations, f represents the fitness function used to assess the quality of a particle (x), a local best solution (Pbest), or the global best solution (Gbest).

$$ Pbest_{i} \left( t \right) = \begin{cases} Pbest_{i} \left( {t - 1} \right), & {\text{if}}\;f\left( {x_{i} \left( t \right)} \right) \ge f\left( {Pbest_{i} \left( {t - 1} \right)} \right) \\ x_{i} \left( t \right), & {\text{otherwise}} \end{cases} $$
(6)
$$ Gbest\left( t \right) = \arg \min \left\{ {f\left( {Pbest_{1} (t)} \right), f\left( {Pbest_{2} (t)} \right), \ldots , f\left( {Pbest_{s} (t)} \right)} \right\} $$
(7)

The update of the global best solution can be affected by the subset of particles that share their local best solutions; this is called the neighborhood topology. The common choices are local and global topologies: the local neighborhood allows sub-swarms of particles that update their best solutions based on the set of personal best solutions found by the members of the sub-swarm, whereas the global neighborhood topology maintains a single global best solution for the entire swarm [51–53]. The pseudocode of the PSO approach is given in the following algorithm, and the hybrid structure of PSO with ANN is given in Fig. 1.

Fig. 1 Flow chart of the hybrid PSO–ANN feature selection process

  • Initialization: Randomly initialize a population.

  • Initial evaluation: Evaluate all members of the population using the fitness function f.

  • repeat

    1.

      Updating the population: Update the velocity of each particle using Eq. (1) and then update the particle position by applying the new velocity in Eq. (2)

    2.

      Evaluation: Evaluate all members of the population using the fitness function f

    3.

      Updating the best findings: Update Pbest and Gbest using Eqs. (6) and (7)

  • until termination condition is satisfied: the maximum number of iterations is reached or the best member of the population (Gbest) performs above the highest expected performance [54].
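A minimal Python sketch of this loop, using the LDIW schedule of Eq. (5), the update rules of Eqs. (1)–(4), and the global-best topology of Eqs. (6)–(7), is given below. The swarm size, acceleration coefficients, inertia bounds, and the sphere-function example are illustrative assumptions rather than the settings of the study; for feature selection, the fitness would instead be the cross-validated classification error of the ANN on the feature subset encoded by the particle, as in Fig. 1.

```python
import numpy as np

def pso(fitness, dim, n_particles=30, max_iter=100,
        c1=2.0, c2=2.0, w1=0.9, w2=0.4, bounds=(-1.0, 1.0), seed=0):
    """Minimize `fitness` with a global-best PSO using a linearly decreasing inertia weight."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds

    # Initialization: random positions, zero velocities
    x = rng.uniform(lo, hi, size=(n_particles, dim))
    v = np.zeros_like(x)

    # Initial evaluation
    pbest = x.copy()
    pbest_f = np.array([fitness(p) for p in x])
    g = np.argmin(pbest_f)
    gbest, gbest_f = pbest[g].copy(), pbest_f[g]

    for t in range(max_iter):
        # Eq. (5): linearly decreasing inertia weight
        w = (w1 - w2) * (max_iter - t) / max_iter + w2

        # Eqs. (1)-(4): velocity and position updates
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        cognitive = c1 * r1 * (pbest - x)
        social = c2 * r2 * (gbest - x)
        v = w * v + cognitive + social
        x = np.clip(x + v, lo, hi)

        # Evaluation and Eqs. (6)-(7): update personal and global bests
        f = np.array([fitness(p) for p in x])
        improved = f < pbest_f
        pbest[improved] = x[improved]
        pbest_f[improved] = f[improved]
        g = np.argmin(pbest_f)
        if pbest_f[g] < gbest_f:
            gbest, gbest_f = pbest[g].copy(), pbest_f[g]

    return gbest, gbest_f

# Example: minimize the sphere function in 5 dimensions
best, best_f = pso(lambda p: float(np.sum(p ** 2)), dim=5, bounds=(-5.0, 5.0))
```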

2.5 Artificial neural network

An artificial neural network (ANN) is an artificial intelligence method used to create a computational model inspired by the structural and functional features of biological neural networks. One of the main outstanding properties of ANNs is their ability to model complex nonlinear relationships, potentially incorporating high-order interactions between predictive variables. The use of ANNs in research fields requiring classification or prediction, such as psychiatry, robotics, and biology, is appealing to researchers [55, 56] due to their ability to adapt, learn, generalize, organize, and classify data. The superiority of the ANN method over linear data mining methods is well documented in a number of areas of medical research as well [21–23]. An ANN model is formed of neurons arranged in layers and weighted connections transmitting signals between the neurons, in a forward or looped manner, to pass the information gathered from the inputs or former neurons to the output [57]. The generated model thus represents a distributed adaptive system built from multiple interconnected processing elements, much as real neural networks are.

In feed-forward neural networks (FNN), the processing elements, the neurons, are distributed over several layers. The intermediate layers are known as hidden layers, while the first layer is called the input layer and the last one the output layer. In general terms, each neuron receives signals processed and transmitted by the neurons in the preceding layer and transmits them to the next layer. The number of layers and the way in which the neurons are connected form the architecture of the network. The signal passing through each connection is scaled by an adjustable parameter associated with that connection, called a weight, which is set randomly before the modeling process is initiated. Each neuron in a hidden layer collects the signals from the former layer(s), sums them, and generates the output for the following layer using an activation function. Depending on the structure of the system, a linear or nonlinear transfer function is used at the junction points of the neurons. The output of each layer is transmitted to the following layer, and finally the output layer generates the output to be compared with the reference in order to calculate the error value. The weights of the neuron connections are then modified according to the selected training algorithm in order to minimize the error. This process is repeated until a previously established criterion is reached, for example, when the error value reaches a threshold or stops decreasing [58].

One way to minimize the error value is the back-propagation (BP) algorithm, a gradient-descent procedure which, ideally, requires infinitesimal changes in the connection weights. In BP, the network error for the given inputs is calculated, and the weights of the connections between the neurons in the last hidden layer and the output layer are modified according to the extent to which these connections have contributed to the current error [59].
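To make the forward pass and one BP weight update concrete, the following NumPy sketch implements a single-hidden-layer network with a logistic-sigmoid hidden layer and a linear output, trained by plain gradient descent on the squared error for one example. All sizes, the learning rate, and the synthetic input are illustrative assumptions; the study itself used the Levenberg–Marquardt variant of BP rather than plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, lr = 28, 20, 0.01       # illustrative sizes and learning rate

# Randomly initialized weights and biases
W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, (1, n_hidden))
b2 = np.zeros(1)

def logsig(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Forward pass: weighted sums plus activations, layer by layer."""
    h = logsig(W1 @ x + b1)   # hidden layer, nonlinear transfer
    y = W2 @ h + b2           # output layer, linear transfer
    return h, y

# One back-propagation step for a single (input, target) example
x = rng.standard_normal(n_in)
target = np.array([1.0])

h, y = forward(x)
err = y - target                     # output error
dW2 = np.outer(err, h)               # gradient of 0.5 * err**2 w.r.t. W2
db2 = err
dh = (W2.T @ err) * h * (1.0 - h)    # error back-propagated through the sigmoid
dW1 = np.outer(dh, x)
db1 = dh

# Gradient-descent update of the connection weights
W2 -= lr * dW2
b2 -= lr * db2
W1 -= lr * dW1
b1 -= lr * db1
```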

In this study, the ANN model used the back-propagation learning algorithm with one hidden layer of 20 neurons. Because of its nonlinear structure, the logsig transfer function was employed in the hidden layer and the purelin transfer function in the output layer, as shown in Fig. 2.

Fig. 2 Structure of the back-propagation neural network used

Input data were collected from 19 electrodes in three frequency bands, the trainlm (Levenberg–Marquardt) training function was used to train the model, and sixfold cross-validation was used to test the classifier. To evaluate the classification algorithm, the receiver operating characteristic (ROC) curve, a plot of sensitivity [true-positive rate (TPR)] as a function of the false-positive rate (FPR, 1 − specificity), was used.
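A hedged scikit-learn sketch of this evaluation protocol is shown below for orientation. It is not the MATLAB implementation used in the study: scikit-learn's MLPClassifier has no Levenberg–Marquardt (trainlm) solver, so L-BFGS is substituted, its output layer is logistic rather than purelin, and the feature matrix X and labels y are random placeholders standing in for the 28 cordance features of the 89 subjects.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

# Placeholder data: 89 subjects x 28 features; label 1 = bipolar, 0 = unipolar
rng = np.random.default_rng(0)
X = rng.standard_normal((89, 28))
y = np.concatenate([np.ones(31, dtype=int), np.zeros(58, dtype=int)])

# One hidden layer of 20 logistic (logsig-like) units; L-BFGS stands in for
# the Levenberg-Marquardt (trainlm) algorithm, which scikit-learn does not offer.
clf = MLPClassifier(hidden_layer_sizes=(20,), activation="logistic",
                    solver="lbfgs", max_iter=2000, random_state=0)

# Sixfold cross-validation with out-of-fold probability estimates
cv = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

accuracy = accuracy_score(y, (proba >= 0.5).astype(int))
auc = roc_auc_score(y, proba)
fpr, tpr, thresholds = roc_curve(y, proba)   # points of the ROC curve
```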

3 Results

In this study, the combination of a swarm intelligence method, PSO, for feature selection and an ML approach, ANN, was used to assess the value of artificial intelligence methods for the diagnosis, treatment planning, and monitoring of psychiatric and neurological diseases. Initially, the classification performance of the ANN model was expressed in terms of accuracy; then, in order to improve the outcome of the model, it was transformed into a hybrid model incorporating a feature selection process. Twenty-eight inputs were used from the Fp1, Fp2, F3, F4, F7, F8, T3, T4, T5, T6, C3, C4, P3, and P4 electrodes in the alpha and theta frequency bands. A significant improvement was observed with the contribution of PSO, and the classification accuracy increased despite the decreasing number of features. The classification results before and after the feature selection process are given in terms of overall accuracy, sensitivity, and area under the ROC curve in Table 1. The ROC curves for the compared approaches are plotted in Fig. 3 as well.

Table 1 Classification performance of PSO–ANN and standalone ANN models
Fig. 3 ROC curves of bipolar disorder subjects for the ANN and PSO–ANN hybrid models

Throughout the classification process, the (FPR, TPR) pair at each threshold is plotted to form the ROC curve. Each point on the ROC curve thus represents a sensitivity/(1 − specificity) pair corresponding to a particular decision threshold. Depending on the classification performance, the relative changes of TPR and FPR may differ, causing sharp transitions between cutoff points in the ROC curve. After the frequency band and channel selection phase, the PSO algorithm was used to reduce the feature set, with the classification error as the cost function. The contribution of the feature selection process to the accuracy is substantial. The hybrid model classified the subjects with 89.89 % overall accuracy, the percentage of examples classified correctly. Sensitivities also increased from 64.52 to 83.87 % for bipolar disorder subjects and from 77.59 to 93.1 % for unipolar disorder subjects.
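As a cross-check of these figures against the sample sizes (31 bipolar and 58 unipolar subjects), and assuming the hybrid model correctly classified 26 bipolar and 54 unipolar subjects, the only integer counts consistent with the reported sensitivities, the values follow as

$$ \frac{26}{31} \approx 83.87\,\% ,\qquad \frac{54}{58} \approx 93.10\,\% ,\qquad \frac{26 + 54}{89} = \frac{80}{89} \approx 89.89\,\% $$

which matches the overall accuracy given above.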

The area under the ROC curve (AUC) was also used to underline the performance of the PSO algorithm. Feature selection increased the AUC value for bipolar disorder subjects from 0.757 to 0.905; comparative plots are given in Fig. 3. Following the feature selection and classification process, 14 features, namely C3, C4, Fp1, Fp2, F3, F7, T4, and T5 from the alpha frequency band and Fp1, F4, C4, P4, T4, and T6 from the theta frequency band, were retained as prominent, and the 14 remaining features were eliminated due to their limited informative contribution.

4 Discussions and conclusions

In this paper, we present a hybrid artificial intelligence approach combining particle swarm optimization and an artificial neural network to discriminate unipolar and bipolar depressive disorders using more informative features. The literature on feature selection techniques is vast, encompassing many applications of ML and classification. The proposed approach first eliminated the less informative features according to their contribution to the output. Using the selected features, a back-propagation neural network was then generated in order to classify the subjects into two classes, unipolar or bipolar. The outcomes of the combined model are promising for clinicians, and the approach could be evaluated as a useful tool in the diagnostic process.

The clinical interpretation of the outcomes is noteworthy for subsequent interdisciplinary studies. Numerous clinical and neuroimaging studies have been conducted with the aim of validating the differentiation of unipolar and bipolar depression. Results of previous neuroimaging studies suggest that abnormal activation in prefrontal and subcortical regions underlies the impaired cognitive control and impulsivity commonly reported in BD and major depressive disorder (MDD) [60, 61]. Only a small number of neuroimaging studies have compared the brain functioning of bipolar and unipolar depressive subjects [62–64].

There is evidence that unipolar depression is associated with increased functional connectivity of three networks: the default mode network, the cognitive control network, and the affective network, converging on the dorsal medial prefrontal cortex [65]. In another study, 14 individuals with bipolar II depression and 26 patients with recurrent unipolar depression, aged between 21 and 45 years, were compared [66]. All participants underwent functional magnetic resonance imaging (fMRI) and functional connectivity analyses while performing two repetitions of a motor activation task. The two groups did not significantly differ in their task performance. However, bipolar patients had significantly stronger functional connectivity between the posterior cingulate cortex and one cluster in the right parietal/insular region, compared with unipolar patients. This cluster included portions of the right inferior parietal lobule, the precentral gyrus and insula, and surrounding regions.

Functional neuroimaging studies suggest that in BD, dysregulation of mood is caused by disturbed prefrontal modulation of subcortical and medial temporal structures within the anterior limbic network. Elevated activity and volume loss of the hippocampus, orbitofrontal and ventral prefrontal cortex, hypometabolism of the dorsal prefrontal cortex, and bidirectional metabolic changes of the anterior cingulate have been described in BD [67–69]. Results of other researchers also suggest similar dysfunctions in brain connectivity in unipolar depression [70, 71].

Efforts to develop a reliable biomarker to predict treatment response resulted in "cordance" studies [34]. In unipolar depression, one of the best-documented brain functional biomarkers predicting a response to an antidepressant is the decrease in quantitative EEG (QEEG) prefrontal cordance in the theta frequency band [72–74]. Furthermore, another study described that a decrease in cordance value was associated with a switch to mania [75].

Numerous former EEG studies have attempted to evaluate the distinct features of BD and UD as compared to other clinical and non-clinical populations. One well-replicated finding in UD is an inter-hemispheric frontal alpha asymmetry, compared to healthy subjects, due to increased left frontal alpha power, a well-known indicator of idling activity on that side [76]. Decreased alpha and increased theta power in the frontocentral regions are the most common findings in BD patients [77, 78]. A recent study reported that deficient left-hemisphere alpha power in BD and decreased inter-hemispheric theta coherence in UD could discriminate these two groups. That study also underlined that BD patients, as compared to UD patients, exhibited greater central–temporal theta and parietal–temporal alpha and theta coherence [79]. We therefore used only theta and alpha activity data and excluded delta and beta activity data in this study.

Unfortunately, despite intensive research in the field, findings in cerebral metabolic studies of BD remain controversial. When compared to unipolar depression, the cerebral metabolic changes observed in bipolar disorder have been suggested to be more associated with dysregulation of the dorsolateral prefrontal circuit [69] and the anterior cingulate [80]. One study suggested that the disrupted baseline metabolic status is reversed by effective treatment [81], but there is also some evidence of persistent metabolic abnormalities in euthymic patients [82].

Finally, the results demonstrate that EEG cordance values have the potential to discriminate between UD and BD. The loss of temporal synchronization in the frontal interhemispheric and right-sided frontolimbic neuronal networks was suggested to be a distinguishing feature between BD and UD in previous research [79]. In this context, this paper puts forward a two-step hybridized methodology: the PSO algorithm for the feature selection process and an ANN for the training process. The noteworthy performance of the PSO–ANN approach showed that it is possible to discriminate 31 bipolar and 58 unipolar subjects using selected features from the alpha and theta frequency bands with 89.89 % overall classification accuracy.

Our findings support the potential utility of the proposed methodology as a clinical tool for classifying UD and BD subjects. Functional neuroimaging methods provide information about differences in the neural processes associated with unipolar versus bipolar depression. Further studies are warranted to replicate this result in order to lead to the development of clinically useful diagnostic methodologies.