1 Introduction

Over the years, significant research has been conducted on improving the interface between humans and machines. One outcome of this effort is the brain–computer interface (BCI), which establishes a direct communication channel between the brain and a machine. When our brain responds to an event or activity, a difference in potential is generated. This potential difference corresponds to the transmission of a message from one synapse to another through the release of a chemical neurotransmitter. This activity generates brain potentials, or brain signals, which are recorded via acquisition devices. There are several different techniques to record these brain signals or potentials.

A few commonly used noninvasive acquisition techniques include electroencephalography (EEG), magnetoencephalography and functional magnetic resonance imaging. Invasive techniques such as electrocorticography (ECoG) place electrodes beneath the scalp, whereas the noninvasive techniques above place electrodes on the scalp and are safe and easy to use [1]. Various types of brain potentials are generated depending on the type of event, stimulus or command given to the brain. One of them is the event-related potential (ERP), a psychophysiological response originating from the reflex generated in the brain when performing a mental activity or observing something familiar. Exploiting ERPs, lie detectors have been developed which, by analyzing variations in brain signals, identify whether a subject is guilty or not. In the past, polygraph-based lie detectors were used, which analyze behavioral indicators such as sweating and increased heart rate. This indirect view, however, was not reliable for classifying a subject as guilty or innocent. ERP-based lie detectors provide a direct view for understanding and detecting deception.

An ERP component called P300 is commonly used by researchers for lie detection [2, 3]. P300 is a response generated between 300 and 1500 ms after the occurrence of a meaningful or “oddball” stimulus [2]. A “Guilty Knowledge Test (GKT)” or “Concealed Information Test (CIT)” is conducted to detect whether a given subject is guilty or innocent. To perform a CIT, subjects are trained for a mock crime scene, in which three different types of stimuli are randomly presented to the subject: target, irrelevant and probe. The target stimulus is recognized by both guilty and innocent subjects; the probe is a crime-related stimulus identified only by the guilty; and the irrelevant stimulus is unrelated to the crime and unidentified by either group. Of the total stimuli given to a subject, 70% are irrelevant and the rest are probe and target [2]. Probe and target are rarely occurring meaningful stimuli; hence, they generate a P300 wave, whereas the irrelevant stimulus does not generate any P300 component.

For a guilty subject, both probe and target stimuli generate a P300 response, whereas for an innocent subject, only the target stimulus does. Many methods have been applied to analyze probe, irrelevant and target responses and identify whether the responses come from a guilty or an innocent subject. We have categorized these methods into two approaches: statistical approaches and machine learning approaches.

1.1 Statistical approach

1.1.1 Bootstrapping techniques

Bootstrapping is a statistical technique to measure the similarity between the brain responses generated by stimuli presented to a subject. It is applied by repeatedly resampling the data with replacement over many iterations. A “double-centered correlation” method [3] has been applied by Farwell et al., who computed the correlation between probe and irrelevant responses and between probe and target responses. They concluded that if the correlation between probe and target stimuli is greater than the correlation between probe and irrelevant stimuli, then the subject is guilty, and vice versa. A different mock crime scenario has also been developed in which, instead of two groups, three groups were used: one guilty, one innocent, and a third countermeasure group, which had the freedom to perform covert responses to irrelevant stimuli. To identify concealed information, the “bootstrap amplitude method” has been applied: the average ERP amplitude is calculated from the probe set (sampled randomly with replacement) and from the irrelevant set (sampled randomly with replacement) [4, 5]. The average ERP of the irrelevant set is subtracted from that of the probe set, and the process is repeated 100 times. If, with 95% confidence, the difference between the probe-set and irrelevant-set averages is greater than zero, a guilty decision is made. A similar bootstrapping method, known as “bootstrap reaction time,” considers reaction time instead of the amplitude difference [4].
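The bootstrap amplitude method can be sketched in a few lines. This is an illustrative pure-Python sketch, not the authors' implementation; the amplitude values are hypothetical, while the 100 iterations and the 95% criterion follow the description above.

```python
import random

def bootstrap_amplitude_test(probe_amps, irrelevant_amps, n_iter=100, seed=0):
    """Resample each amplitude set with replacement, compare the means,
    and count how often the probe mean exceeds the irrelevant mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iter):
        probe_sample = [rng.choice(probe_amps) for _ in probe_amps]
        irr_sample = [rng.choice(irrelevant_amps) for _ in irrelevant_amps]
        if sum(probe_sample) / len(probe_sample) > sum(irr_sample) / len(irr_sample):
            hits += 1
    # Guilty decision if the probe mean exceeds the irrelevant mean in at
    # least 95% of the bootstrap iterations.
    return hits / n_iter >= 0.95

# Hypothetical ERP amplitudes (microvolts): large probe responses suggest guilt.
print(bootstrap_amplitude_test([9.5, 10.1, 11.0, 9.8], [2.0, 2.5, 1.8, 2.2]))
```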

1.1.2 Analysis of variance

Analysis of variance (ANOVA) is used to analyze the differences between the means of several groups. The total variance is partitioned into components attributable to different sources of variation; hence, it is useful for comparing several groups at once. Exploiting this, ANOVA has been applied to ERP responses [6]. A CIT using “cards” has been conducted by the authors, where cards of different suits served as either probe or irrelevant stimuli; five selected cards and a joker card from the pack were presented as the target stimulus.

Peak values of EEG waves have been used as a measure for identifying deception in many research works [5,6,7]. However, the signal peak alone cannot explain the various characteristics of the ERP responses generated during deception testing. EEG data are also affected by noise such as ocular and muscular artifacts, making it difficult to analyze the ERP response generated by a stimulus. To address noise removal, machine learning techniques are applied which reduce noise in the EEG data. Many authors have worked on artifact removal and classification of EEG data, which is discussed in the following subsection.

1.2 Machine learning approach

Machine learning approaches have become another tool for the CIT, as they analyze the signal for each trial. A GKT [3] has been applied by Abootalebi et al. [2] in their work. Instead of analyzing only peak values, a set of relevant features has been extracted from the EEG data: morphological features such as latency, amplitude and the latency-to-amplitude ratio; frequency features such as the mean, mode and median of the frequency; and wavelet features. However, not all of these features carry useful information about the data, and using all of them does not yield the best classification accuracy. Hence, a genetic algorithm [2] has been applied for feature selection, and a statistical classifier is then applied to classify the data as innocent or guilty. In another work, a genetic algorithm for feature selection has also been applied [8], with “Empirical Mode Decomposition (EMD)” used for feature extraction from P300 responses. EMD decomposes a signal into components called “Intrinsic Mode Functions (IMFs).” The IMFs are frequency- or amplitude-modulated waveforms which capture the physical characteristics of the wave. The IMFs have been used as features, and after feature selection, LDA has been applied as the classifier.

To resolve the issue of denoising EEG data, independent component analysis (ICA) has been used to remove noisy components [9]. ICA partitions the data into independent components (ICs) with associated spatial maps, categorizing the EEG data into task-related and task-unrelated components. However, distinguishing task-related from task-unrelated components is difficult. To overcome this, a template matching method has been devised [9]: the spatial map sets most similar to a template are selected from different guilty subjects for different stimuli and regarded as P300 ICs, from which the P300 and non-P300 responses are reconstructed at the sensors with high SNR. A birthday paradigm for the CIT has been applied, which considers time constraints along with accuracy to predict deception [10]; nonparametric feature extraction with LDA and KNN as classifiers has been used. A sparse representation method has been applied to irrelevant and target ERP responses of a single subject [11]. In [12], a genetic SVM classifier has been applied to identify guilty subjects using a novel CIT method. The author in [13] also proposed a novel association-based CIT, similar to the reaction time–CIT, which considers reaction time differences between irrelevant and probe stimuli.

A single classifier can give good performance, but combining various classifiers in an ensemble framework provides results closer to optimal. Hence, we have applied an ensemble classification approach in our work to obtain better results. Ensemble classification methods such as bagging, boosting and random forest have been applied by many authors on different datasets. Bagging [14] combines several learners into one model by decreasing its variance, whereas boosting [15] decreases the bias. Boosting and bagging were applied in [16] on 23 datasets using decision trees and neural networks, showing that ensemble methods often outperform single classifiers. Another ensemble approach, random forest (RF) [17], constructs multiple decision trees at training time and outputs the class that is the mode (or, for regression, the mean) of the individual trees' predictions.

In this work, an ensemble framework has been proposed in which five classifiers, namely LDA, SVM, MLFFNN, KNN and naïve Bayes, are applied to the EEG data. Classifier ranking is then applied, and the results of the three best classifiers are aggregated using a weighted voting (WV) approach. We recorded EEG signals by conducting a CIT, described in detail in Sect. 2. Brain signals are affected by considerable noise; therefore, noise removal, feature extraction and the proposed classification framework are discussed in Sect. 3. The results of the proposed framework are given in Sect. 4. The later sections provide the conclusion and future work, followed by references.

Fig. 1 Trial structure

2 Data acquisition

2.1 Subjects

An EEG-based CIT experiment has been conducted with 10 participants aged between 23 and 35 years. None of the subjects had a medical record of any psychological disorder, and all had normal or corrected-to-normal vision. Subjects were given a brief description of the complete experimental procedure and, before it began, gave written consent to participate. The EEG data were recorded by placing Ag/AgCl electrodes at the Fz, FC1, FC2, C3, Cz, C4, CP5, CP1, CP2, CP6, P3, Pz, P4, O1, Oz and O2 sites (10–20 international system).

To record the vertical electrooculogram (VEOG) and horizontal electrooculogram (HEOG), signals were recorded from the right eye: above and below the eye for VEOG, and from the outer canthus for HEOG. An electrode was placed on the mastoid as reference and another on the forehead as ground. For signal acquisition, we used an EasyCap [18] (a 32-channel EEG standard cap set, Munich, Germany), a V-Amp amplifier, a set of 16 electrodes and the Brain Vision Recorder software [19]. The electrode placement protocol is similar to that in [9].

Before execution of the experiment, subjects are randomly assigned to the “guilty” and “innocent” groups for two experimental sessions. All 10 participants act as innocent for one session of 30 trials and as guilty for a second session of 30 trials. In each of the 60 trials, subjects are presented with a set of images of known and unknown personalities for 31 seconds. Three types of stimulus are presented to the subjects:

  • Probe stimulus: a crime-related stimulus which is presented rarely and generates a P300 response for guilty subjects. Here, the probe is an image of a known person (a person from the institute)

  • Target stimulus: a stimulus familiar to all subjects, given to maintain the subject's concentration during the experiment. The target generates a P300 response for both guilty and innocent subjects. In this experiment, images of well-known personalities or celebrities have been used as targets

  • Irrelevant stimulus: a stimulus not related to the crime, which does not generate a P300 response for either guilty or innocent subjects. These are random images downloaded from internet sources. For an innocent subject, the responses generated by probe and irrelevant stimuli will be the same.

In the initial phase of the experiment, subjects are trained for a mock crime scenario: they are trained as if they have committed a crime with a person whom they know (i.e., a well-known colleague from college). Ten images are presented to each subject, of which one is the probe, two are targets and the rest are irrelevant. The images are shown on a 15.4-inch display screen. The image of the known person acts as the probe and generates a P300 response in the brain; the target images (celebrity images) also generate a P300 response, whereas the images of random unknown people, i.e., the irrelevant stimuli, generate a non-P300 response. These responses are recorded and analyzed using Brain Vision Analyzer.

Images were presented for 31 seconds in total; each stimulus image remained on screen for 1.1 s with a 2 s inter-stimulus period (Fig. 1) [2]. On observing a stimulus, the subject is instructed to respond “yes” or “no.” For probe stimuli, a guilty subject replying “no” is lying; as the probe is presented rarely and is related to the crime, it generates a P300. For irrelevant stimuli, the subject also replies “no,” which represents the truth; as irrelevant stimuli occur most frequently and are not related to the crime, they do not generate P300 responses. For target stimuli, subjects are instructed to reply “yes” as they recognize the target image; since the target is also rarely occurring and familiar to the subject (though not related to the crime), it likewise generates a P300 response.

Subjects go through a practice session of 5 min in which they perform a few trials of a task identical to the full experiment described above. After the training session, the experiment is conducted with 30 trials as innocent and 30 trials as guilty. The complete experimental procedure developed in our work is depicted in Fig. 2. The experiment has been conducted over two sessions for each subject.

Fig. 2 Experimental procedure

2.2 Signal preprocessing

Before analysis of the CIT data, the EEG signal is preprocessed for artifact removal. VEOG- and HEOG-based ocular artifacts are corrected using Brain Vision Analyzer. To remove other noise mixed with the signal, we applied a bandpass filter over the range 0.3–30 Hz, which eliminates frequency components outside this band from the raw signal. This frequency range is the one mostly observed during mental tasks [4]; hence, the bandpass filter is applied over this range.

Signals acquired with the EEG device are then converted into .mat format using Brain Vision Analyzer 2.1 [19]. Feature extraction and classification are carried out in MATLAB R2015a on an Intel i7 processor with 8 GB RAM running 64-bit Windows 10 Pro.

3 Proposed approach

This section describes the proposed ensemble framework and discusses the various feature extraction approaches applied.

3.1 Feature extraction

The EEG data have been recorded from 16 channels for 10 subjects (S-1 to S-10), but the data of subject 6 (S-6) have been excluded from the whole study due to the presence of excessive artifacts. Hence, we analyzed 16-channel data of 9 subjects (9 \(\times \) 16) for 30 trials from each of two sessions (a truth session and a lie session). Various statistical and machine learning approaches for feature extraction have been applied. Statistical approaches alone are not sufficient to capture all the information in the signal; hence, we have combined them with machine learning approaches and have tried to extract features from every domain of the signal. The following set of feature extraction methods has been used in this study:

3.1.1 Potential or amplitude

Potential gives the maximum peak value at any time instant t [2, 6]. Let x(t) represent the EEG signal at time instant t. The potential P(x(t)) is given as:

$$\begin{aligned} P(x(t))=\textit{maxpeak}(x(t)) \end{aligned}$$
(1)

3.1.2 Power

Power gives the energy of the signal, which is given by the square of the potential value, E(x(t)):

$$\begin{aligned} E(x(t))={(P(x(t))}^2 \end{aligned}$$
(2)
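Equations 1 and 2 reduce to two one-liners; a minimal Python sketch over a toy signal (the sample values are hypothetical):

```python
def potential(x):
    """Eq. 1: maximum peak value of the signal."""
    return max(x)

def power(x):
    """Eq. 2: square of the potential value."""
    return potential(x) ** 2

signal = [0.2, 1.5, -0.7, 3.0, 0.1]  # toy EEG samples
print(potential(signal))  # 3.0
print(power(signal))      # 9.0
```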

3.1.3 Frequency response

The frequency response \(\mu (x(t))\) [20], calculated using the fast Fourier transform (FFT), gives complex values. Hence, the magnitude of the frequency response is calculated as given in Eqs. 3–5.

$$\begin{aligned} X[N]=\Sigma _{m=0}^{M-1} x(m) W^{mN} \end{aligned}$$
(3)
$$\begin{aligned} x(m)=\frac{1}{S} \Sigma _{N=0}^{M-1} X[N] W^{-mN} \end{aligned}$$
(4)

where \(N=0,1, \ldots , M-1\) and S is the number of samples in time t

$$\begin{aligned} \mu (x(t))=abs(x(m)) \end{aligned}$$
(5)
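A direct (slow) evaluation of Eqs. 3 and 5 clarifies what the FFT computes; the FFT is only a fast algorithm for the same sum. This sketch takes \(W=e^{-2\pi i/M}\) and returns the magnitude of each complex bin:

```python
import cmath

def dft_magnitude(x):
    """Evaluate Eq. 3 directly and return abs() of each bin (Eq. 5)."""
    M = len(x)
    mags = []
    for N in range(M):
        acc = sum(x[m] * cmath.exp(-2j * cmath.pi * m * N / M) for m in range(M))
        mags.append(abs(acc))
    return mags

# A constant signal concentrates all of its energy in the DC bin.
print([round(v, 6) for v in dft_magnitude([1.0, 1.0, 1.0, 1.0])])  # [4.0, 0.0, 0.0, 0.0]
```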

3.1.4 Hjorth features

Hjorth [21] developed three statistical parameters in the time domain. We have used two of them, mobility and complexity, for extracting time-domain features of the EEG signals; these parameters were used earlier for EEG feature extraction in emotion recognition by [20]. Mobility is the square root of the ratio of the variance of the derivative of the signal to the variance of the signal, and complexity is the ratio of the mobility of the derivative to the mobility of the signal. Complexity equals 1 for a pure sine wave and increases as the signal deviates from a sine shape.

3.1.5 Mobility

$$\begin{aligned} M(x(t))=\sqrt{\frac{\sigma ({x'(t)})}{\sigma (x(t))}} \end{aligned}$$
(6)

where \(x'(t)\) represents the differential of x(t) and \(\sigma \) represents the variance of the data.

3.1.6 Complexity

$$\begin{aligned} C(x(t))=\frac{M(x'(t))}{M(x(t))} \end{aligned}$$
(7)
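Equations 6 and 7 can be sketched with first differences standing in for the derivative (a common discrete approximation; this is not the authors' code). For a sampled sine wave, the complexity comes out close to 1:

```python
import math

def variance(x):
    mu = sum(x) / len(x)
    return sum((v - mu) ** 2 for v in x) / len(x)

def diff(x):
    """First difference as a discrete stand-in for the derivative x'(t)."""
    return [b - a for a, b in zip(x, x[1:])]

def mobility(x):
    """Eq. 6: square root of var(x') / var(x)."""
    return (variance(diff(x)) / variance(x)) ** 0.5

def complexity(x):
    """Eq. 7: mobility of the derivative over mobility of the signal."""
    return mobility(diff(x)) / mobility(x)

sine = [math.sin(2 * math.pi * k / 64) for k in range(256)]
print(round(complexity(sine), 2))  # close to 1 for a sine wave
```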

3.1.7 Wavelet transform

All the features explained above consider only the time-domain characteristics of the ERP response. Since the ERP waveform has both time and frequency characteristics, authors now often turn to wavelet approaches [2]. The discrete wavelet transform (DWT) decomposes a signal into variable frequency ranges represented by approximation and detail levels. At the first level of decomposition, the “approximation coefficients” capture the low-frequency components, while the “detail coefficients” capture the high-frequency components of the signal. At each subsequent level, the approximation coefficients are further decomposed into new approximation and detail coefficients, and this process continues for the assigned number of levels. The DWT preserves time information while capturing the frequency components of the signal. Here, the DWT has been applied to 4 levels using the “db2” wavelet. After the 4-level decomposition, the output consists of one set of approximation coefficients and four sets of detail coefficients, which form the feature vector. For each trial, we obtain a large number of approximation and detail coefficients in the range 0.5–30 Hz; hence, the root-mean-square value [20] of the wavelet coefficients is calculated and used as the feature.

$$\begin{aligned} Wv=\sqrt{\frac{1}{N}\Sigma _{i=1}^N C_i^2} \end{aligned}$$
(8)

where \(C_i\) are wavelet coefficients and N gives a total number of coefficients generated from all levels.
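The wavelet RMS feature of Eq. 8 can be sketched as follows. For simplicity, this dependency-free sketch uses the Haar wavelet instead of “db2” (a db2 filter bank would swap the two-tap filters for four-tap ones); since the orthonormal Haar transform preserves energy, the RMS of the pooled coefficients equals the RMS of the signal, which serves as a sanity check:

```python
def haar_dwt(x):
    """One Haar DWT level: approximation (low-pass) and detail (high-pass)."""
    s = 2 ** -0.5
    approx = [(a + b) * s for a, b in zip(x[0::2], x[1::2])]
    detail = [(a - b) * s for a, b in zip(x[0::2], x[1::2])]
    return approx, detail

def wavelet_rms_feature(x, levels=4):
    """Decompose for `levels` levels (re-decomposing the approximation each
    time), pool every coefficient and return the RMS value (Eq. 8)."""
    coeffs, approx = [], list(x)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        coeffs.extend(detail)
    coeffs.extend(approx)
    return (sum(c * c for c in coeffs) / len(coeffs)) ** 0.5

sawtooth = [float(i % 8) for i in range(64)]  # toy periodic signal
print(round(wavelet_rms_feature(sawtooth), 4))  # 4.1833
```

The input length must be divisible by \(2^{\text{levels}}\) in this sketch; real EEG epochs would be padded or truncated accordingly.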

Fig. 3 Proposed ensemble framework for lie detection

Fig. 4 a–i Subject-wise results using various classifiers and different voting strategies for the 3-classifier and 5-classifier systems

3.2 Classification

For the EEG-based CIT data, subjects are classified into two classes: innocent and guilty. A classifier can perform well on a particular dataset while giving unproductive results on another; hence, one cannot predict the performance of a classifier for a given dataset in advance [22]. Many classifiers in the literature perform well on binary classification problems, and previous CIT-based studies have used LDA [8], SVM [9] and KNN [10] as classifiers. To improve the performance of our system, we propose an ensemble framework for lie detection in which the decision of innocent or guilty is taken by aggregating the outputs of different classifiers. Ensemble classification provides better performance compared to a single classifier [23]. An ensemble framework can be structured either by combining classifiers of the same type, i.e., a homogeneous framework, or by combining different types of classifiers, i.e., a heterogeneous framework. In our work, we have applied a heterogeneous ensemble framework by combining 5 classifiers, namely LDA, SVM, multilayer feedforward neural network (MLFFNN), KNN and naïve Bayes (NB).

3.2.1 Linear discriminant analysis (LDA)

LDA [24] is a typical linear classification technique, which provides separability by drawing a decision region between the data of two classes. To ensure maximum separability, it maximizes the ratio of between-class scatter to within-class scatter. LDA searches for a linear solution to separate the data into classes.

3.2.2 Support vector machines (SVM)

SVM [25] is a non-probabilistic linear classification approach which chooses an optimal separating hyperplane that maximizes the distance between data points of different classes. The training data points closest to the separating hyperplane, called support vectors, determine the width of the margin. SVM can be extended to cases where the data are not separable by a hard margin: a trade-off parameter allows the margin to be flexible and handles data that are not linearly separable. In addition, SVM can classify nonlinear data using the kernel trick. This is the reason behind the extensive use of SVM, as it achieves an optimal solution for both linear and nonlinear data.

3.2.3 Multilayer feedforward neural network (MLFFNN or NN)

MLFFNN is successfully used in classification tasks, feature extraction, pattern mapping, etc. It takes an input, processes it and produces an output. Layers between the input and output layers, called hidden layers, are added to improve the performance of the system. The connections between nodes carry weights, which are adjusted until the network reaches an optimal solution. An MLFFNN [26, 27] consists of one input layer, one output layer and one or more hidden layers. At the output layer, an activation function is applied to produce the final output; the activation function can be linear or nonlinear according to the specified problem.

3.2.4 k-nearest neighbor

KNN [28] is a nonparametric classification approach that assigns test data to a class based on the class of the majority of its neighbors. To identify the class of a given data point, its k nearest neighbors are chosen using a distance measure such as Euclidean, Mahalanobis or Manhattan distance. Here, k is a constant, which can be selected based on the number of data points to be classified; there is no predefined technique to calculate the value of k, and it is selected heuristically. KNN is best suited for low-dimensional data.
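A minimal KNN sketch with Euclidean distance (an illustration only, not the classifier configuration used in this study; the feature values and labels are hypothetical):

```python
from collections import Counter

def knn_classify(train, point, k=3):
    """Classify `point` by the majority label of its k nearest neighbors.
    `train` is a list of (feature_vector, label) pairs."""
    def euclidean(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda item: euclidean(item[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([1.0, 1.0], "innocent"), ([1.2, 0.9], "innocent"),
         ([5.0, 5.2], "guilty"), ([4.8, 5.1], "guilty"), ([5.1, 4.9], "guilty")]
print(knn_classify(train, [5.0, 5.0]))  # guilty
```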

3.2.5 Naïve Bayes

NB [29] is a probabilistic classification approach which uses the concept of maximum likelihood and is based on Bayes' theorem. It uses a conditional probability model to assign data to the various classes. It is generally suitable for high-dimensional data.

3.3 Ensemble framework

For any particular dataset, the classifiers discussed above perform differently; hence, we cannot predict in advance which classifier will give the best classification performance. An ensemble is a strong approach to reach a near-optimal solution for every dataset [30]. The five classifiers discussed above are combined to form a heterogeneous ensemble framework (Fig. 3) for classification of the CIT-based EEG data. LDA provides the best solution for linearly separable data; SVM gives an optimal solution by using a trade-off parameter; NN provides nonlinear data classification by making use of activation functions and synaptic weights; naïve Bayes classifies data on the basis of prior probabilities; and KNN uses distance as a measure to classify data. By combining these different properties of five different classifiers, we have tried to obtain a near-optimal solution for our dataset. Three approaches can be used to aggregate the results of the base classifiers, as follows:

  • Majority voting: It takes the decision in favor of a particular class if the majority (more than 50%) of classifiers classify it as that particular class.

  • Unanimous voting: for a given class, say Class-Guilty, if any classifier classifies the data as Class-Guilty, then unanimous voting decides Class-Guilty; a decision of Class-Innocent thus requires all classifiers to agree

  • Weighted voting: In this approach, an aggregation function is applied which assigns higher weight values to the classifiers with better performance. This increases their participation in the ensemble framework, in turn increasing the performance of the overall system.
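The first two voting rules can be sketched directly from the descriptions above (unanimous voting is implemented exactly as described: a single "guilty" vote yields a guilty decision):

```python
def majority_vote(preds):
    """Decide guilty only if more than 50% of the classifiers say guilty."""
    return "guilty" if preds.count("guilty") > len(preds) / 2 else "innocent"

def unanimous_vote(preds):
    """Any single guilty vote yields guilty, so an innocent verdict
    requires all classifiers to agree on innocent."""
    return "guilty" if "guilty" in preds else "innocent"

votes = ["guilty", "innocent", "guilty", "innocent", "innocent"]
print(majority_vote(votes))   # innocent (only 2 of 5 guilty votes)
print(unanimous_vote(votes))  # guilty
```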

Majority voting and unanimous voting assign equal weights to all classifiers; hence, the best-performing classifier is given the same weight as the worst-performing one. These approaches suit homogeneous ensemble frameworks, but they can reduce the performance of the system when heterogeneous classifiers are used. Weighted voting, however, gives better classification results, as it assigns larger weights to the better-performing classifiers. Hence, in this work, the heterogeneous ensemble framework is aggregated using the weighted voting approach. The outputs of the base classifiers are aggregated using the aggregation function in Eq. 9.

$$\begin{aligned} y=\Sigma _{i=1}^m w_iC_i \end{aligned}$$
(9)

where m denotes the number of classifiers, and \(w_i\) and \(C_i\) denote the weight and the output predicted by the ith classifier, respectively.

Table 1 Subject-wise performance using various classifiers and ensemble framework applied to potential values

For assigning the weights, equal weights are initially assigned to each base classifier, and on the basis of their performance, the weights are updated according to Eq. 10.

$$\begin{aligned} w_{i}= \frac{acc_i}{\Sigma _{i=1}^m acc_i} \end{aligned}$$
(10)

where \(acc_i\) is the accuracy of the ith classifier.
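Equations 9 and 10 can be sketched together. The +1/-1 encoding of the classifier outputs (Class-Guilty = +1, Class-Innocent = -1) is an assumption for illustration, as is the set of accuracies:

```python
def normalized_weights(accuracies):
    """Eq. 10: each weight is the classifier's accuracy divided by the
    sum of all accuracies, so the weights sum to 1."""
    total = sum(accuracies)
    return [a / total for a in accuracies]

def weighted_vote(outputs, weights):
    """Eq. 9: weighted sum of classifier outputs. Outputs are encoded as
    +1 (Class-Guilty) / -1 (Class-Innocent); a positive sum means guilty."""
    y = sum(w * c for w, c in zip(weights, outputs))
    return "guilty" if y > 0 else "innocent"

accs = [0.9, 0.8, 0.6]                  # hypothetical classifier accuracies
w = normalized_weights(accs)
print([round(v, 3) for v in w])         # [0.391, 0.348, 0.261]
print(weighted_vote([+1, +1, -1], w))   # guilty
```

Note that the two strongest classifiers outvote the weakest even though the vote is 2 to 1; with equal weights, the outcome here would be the same, but a guilty vote from only the single best classifier could not prevail.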

For evaluating the performance of all five classifiers, we have utilized a classifier ranking approach, which selects the best-performing classifiers among the five. To rank the classifiers, we have used the G-measure as the evaluation metric. The data are randomly partitioned into training and testing parts in a 9:1 ratio; this procedure is repeated for n iterations, and the mean over the n iterations is used to rank the classifiers. The mean of the weights of each classifier over all iterations is assigned to that classifier.

4 Experimental results and analysis

EEG data for deception detection have been recorded using Brain Vision Recorder and analyzed with Brain Vision Analyzer (Brain Products, Germany [19]). The recorded data were processed in MATLAB R2015a on a system with an Intel i7 processor and 8 GB RAM. Results of the various feature extraction approaches, namely potential, mobility, complexity, power, frequency response and wavelet, are discussed in this section. A comparison of the proposed 3-classifier ensemble framework with the 5-classifier framework and with the base classifiers is also presented.

Table 2 Subject-wise performance using various classifiers and ensemble framework applied on complexity values

4.1 Feature extraction

The EEG is a signal waveform: its amplitude provides the value of the signal at each peak; power provides the strength of the signal; the frequency components are extracted using the FFT; statistical parameters are obtained from the Hjorth features (mobility and complexity); and the wavelet transform extracts the time–frequency components of the EEG signal. Hence, instead of using a single type of feature extraction technique as in [8, 11], we have used various feature extraction techniques to analyze the EEG data more precisely. After signal preprocessing, each feature extraction approach (as discussed in Sect. 3) has been applied to each subject's 16 channels for the 30 trials of one session (\(1\times 16\times 30\)). The experiment was conducted over two sessions, with subjects behaving as guilty in one session and innocent in the other. For the results, data of 9 out of the 10 subjects have been considered; the data of one subject (subject 6) were not acquired properly and contained excessive artifacts, so they are excluded from this study. For comparative analysis, the various feature extraction approaches are combined with the various classification techniques and ensemble frameworks (the 5-classifier and 3-classifier ensemble frameworks), aggregated by majority voting, unanimous voting and weighted voting for the respective subjects. The results are depicted in Fig. 4. These results show the mean performance evaluated by applying subject-wise fivefold cross-validation (5-FCV). Here, 5-MV, 5-UV and 5-WV represent majority, unanimous and weighted voting using 5 classifiers, and 3-MV, 3-UV and 3-WV represent the same strategies using the 3 top-ranked classifiers. From Fig. 4, it can be inferred that in most cases mobility and wavelet features have performed better than the other feature extraction techniques. It is also observed that the combined result of the proposed approach (3-WV) is higher than the others for most of the feature extraction approaches.

A similar set of features was utilized by R. Jenke et al. for an emotion recognition dataset [20]. After applying various feature selection approaches, they achieved an average accuracy of 35.9% using LDA as the classifier, with a highest accuracy of 45% for subject 6 using the mRMR feature selection approach. Using the proposed 3-WV framework, an average accuracy of 84.6% has been achieved using comparatively fewer channels.

4.2 Classification

For classification, the data have been labeled into two classes: guilty as Class-1 and innocent as Class-2. The performance of the classifiers is measured using various performance measures such as accuracy, sensitivity, specificity and G-measure. Sensitivity and specificity provide the positive-class and negative-class accuracy, respectively, whereas the G-measure considers both: it is the geometric mean of sensitivity and specificity. Five classifiers have been considered for evaluating the performance of the system. Using a classifier ranking approach, the three best classifiers among the five have been identified, and their results are aggregated in an ensemble framework. For aggregation, the weighted voting approach has been used, and its performance is compared with other aggregation approaches, namely majority voting and unanimous voting. For the weight update, instead of using the conventional weight update formula (Eq. 11), we update the weights according to Eq. 10.

$$\begin{aligned} w_{i0}=w_{i1}+\frac{1}{2} \log \left( \frac{acc_i}{1-acc_i}\right) \end{aligned}$$
(11)

where \(w_{i0}\) and \(w_{i1}\) represent the updated and old weight, respectively, for the ith classifier, and \(acc_i\) represents the accuracy of the ith classifier. The logarithmic term in Eq. 11 becomes unbounded (undefined) when the accuracy is 0 or 100%; hence, in order to overcome that limitation, the normalization approach of Eq. 10 is considered.
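The G-measure used for ranking, and the behavior that motivates preferring Eq. 10 over Eq. 11, can both be sketched (the confusion counts and accuracies are hypothetical):

```python
import math

def g_measure(tp, fn, tn, fp):
    """Geometric mean of sensitivity (positive-class accuracy) and
    specificity (negative-class accuracy)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# 27/30 guilty trials and 24/30 innocent trials classified correctly.
print(round(g_measure(tp=27, fn=3, tn=24, fp=6), 3))  # 0.849

def log_odds_update(w_old, acc):
    """Eq. 11: conventional log-odds weight update. Undefined when acc
    is 0 or 1, which is why the Eq. 10 normalization is used instead."""
    return w_old + 0.5 * math.log(acc / (1 - acc))

print(round(log_odds_update(0.2, 0.8), 3))  # 0.893
try:
    log_odds_update(0.2, 1.0)  # perfect accuracy breaks the formula
except ZeroDivisionError:
    print("undefined at acc = 1.0")
```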

Using various classification performance measures, results are calculated and presented in Tables 1, 2, 3, 4, 5 and 6. Results are evaluated by applying 5-FCV on subject-wise EEG data, and the base classifiers are compared with the 5-classifier and 3-classifier ensemble frameworks. Each table presents the results for one feature extraction approach. Results after applying 5-FCV with amplitude (potential) as the feature are given in Table 1, which compares the performance of the single classifiers with the 5-classifier and 3-classifier ensemble frameworks. Classification performance for each subject is compared using accuracy, sensitivity, specificity and G-measure. Using amplitude as the feature, the proposed framework achieves an accuracy of 89% for subject 4, the highest among all subjects. In most cases, the G-measure is higher for the proposed framework, with a balanced trade-off between sensitivity and specificity. As sensitivity and specificity each measure accuracy toward a particular class, their values are highest for the unanimous voting approach, which tends to decide in favor of one class.

In Table 2, the performance of the various subjects using complexity as the feature is tabulated. All classifiers are applied on features extracted using complexity, and the results are compared with the proposed ensemble framework. Using complexity as the feature, an accuracy of 91.8% for subject 4 and an average accuracy of 80.4% have been achieved. Table 3 presents the comparative performance of the ensemble framework and the other classifiers using the frequency response. With frequency response as the feature, the proposed framework has attained a highest accuracy of 98.8% for subject 4 and an average accuracy of 76.3% across all subjects.

Table 3 Subject-wise performance using various classifiers and ensemble framework applied on frequency response values

Table 4 shows the results of the various classification approaches using one of Hjorth's parameters, viz. mobility. An accuracy of 96.9% has been achieved using the weighted voting approach for three classifiers, with an average accuracy of 89.8%.

Table 4 Subject-wise performance using various classifiers and ensemble framework applied on mobility values
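Hjorth mobility is the ratio of the standard deviation of a signal's first derivative to that of the signal itself; a minimal sketch (not the authors' implementation) for a discrete EEG channel:

```python
import numpy as np

def hjorth_mobility(x):
    """Mobility = sqrt(var(dx) / var(x)); for a pure sinusoid this
    approximates the angular frequency per sample."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)           # discrete first derivative
    return np.sqrt(np.var(dx) / np.var(x))
```

For a 5 Hz sinusoid sampled at 1 kHz, the mobility comes out close to 2π·5/1000 ≈ 0.031 per sample.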

In Table 5, power values extracted from the EEG data have been classified and compared, achieving a highest accuracy of 100% for subject 4 and an average accuracy of 88.2%.

Table 5 Subject-wise performance using various classifiers and ensemble framework applied on power values

Table 6 provides a comparative analysis of the various classifiers and the ensemble framework using wavelet coefficients as features. After applying the classifiers on wavelet features, an accuracy of 100% for subject 3 and an average accuracy of 92.4% are attained, the highest among all feature extraction approaches.

Table 6 Subject-wise performance using various classifiers and ensemble framework applied on wavelet coefficients
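As an illustration of the kind of time–frequency features involved, a single level of the Haar wavelet transform can be computed as below (a simplified stand-in; the paper does not specify the mother wavelet or decomposition depth used):

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Haar DWT: approximation coefficients capture
    the low-frequency trend of an EEG epoch, detail coefficients its
    high-frequency content."""
    x = np.asarray(signal, dtype=float)
    if len(x) % 2:            # pad odd-length signals to even length
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail
```

Applying the transform recursively to the approximation coefficients yields the multi-level decomposition typically used as a feature vector.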

From the results depicted in Tables 1, 2, 3, 4, 5 and 6, it can be inferred that, compared with the base classifiers, both the 5-classifier system (5-MV, 5-UV and 5-WV) and the 3-classifier system (3-MV, 3-UV and 3-WV) have performed better for almost all feature extraction approaches. Comparing the different feature extraction approaches, classification accuracy improves when wavelets are used as features (as depicted in Table 6), and the performance of the classifiers is similar for Hjorth's parameters. Among the various subjects, subject 4 has responded best for all features, providing the highest classification accuracy for almost all classifiers. The classifier ranking approach is applied to aggregate the best-performing classifiers. A graph comparing the two systems is given in Fig. 5. It is observed that the ensemble framework with the ranking approach has a substantial impact on the improvement of classification accuracy. Results of the 3-classifier ensemble system are aggregated using three different approaches, namely majority voting, unanimous voting and weighted voting. From the results, it is observed that the 3-WV (proposed) framework performs best for almost all feature extraction approaches on subject-wise single-trial EEG data.
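The ranking-plus-aggregation step can be sketched as follows, assuming (purely for illustration) that each classifier's weight is its validation accuracy:

```python
import numpy as np

def rank_and_vote(preds, accs, k=3):
    """Keep the k best classifiers by accuracy, then combine their
    Class-1/Class-2 predictions by weighted voting."""
    preds = np.asarray(preds)              # (n_classifiers, n_samples)
    accs = np.asarray(accs, dtype=float)
    top = np.argsort(accs)[::-1][:k]       # indices of the k best
    w = accs[top]
    # total weight voting for each class, per sample
    score1 = (w[:, None] * (preds[top] == 1)).sum(axis=0)
    score2 = (w[:, None] * (preds[top] == 2)).sum(axis=0)
    return np.where(score1 >= score2, 1, 2)
```

With five classifiers of accuracies 0.9, 0.8, 0.7, 0.6, 0.5, only the first three contribute to the vote, and a sample is labeled guilty (Class-1) when the combined weight of Class-1 votes is at least that of Class-2.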

Further, the results obtained by the proposed approach are compared with some state-of-the-art methods. All the approaches are applied on the EEG data recorded for this work: 16-channel EEG data of ten subjects, recorded by conducting a CIT. Analysis of variance (ANOVA) is the most commonly used statistical technique in CIT-based studies such as [5, 6, 13]. In this work, ANOVA has been applied on subject-wise single-trial EEG data, and it has been observed that the means of the two groups or classes overlap. ERP data generally overlap [31] (as shown in Fig. 6); hence, statistical approaches like ANOVA are not sufficient to identify the human behavior.

Figure 6 shows the mean values calculated using a two-way ANOVA; the overlap of the means of the two classes indicates a certain similarity in the data recorded for both classes. Wang et al. [10] applied a nonparametric weighted feature extraction technique based on LDA, with KNN as the classifier. Nonparametric feature extraction techniques are useful when a specific number of features needs to be extracted, and they also reduce the effect of outliers present in the data. We have applied the same approach on the EEG data recorded for the different subjects, achieving an accuracy of 76.8%, specificity of 70.0%, sensitivity of 73.1% and G-measure of 76.5%. Arasteh et al. [8] used empirical mode decomposition (EMD), which divides a signal into various intrinsic mode functions; EMD was used as the feature extraction approach and LDA as the classifier in their work. This framework has been applied on the recorded EEG data, obtaining an accuracy of 80.1%, specificity of 75.7%, sensitivity of 75.7% and G-measure of 77.8%. The authors also applied a genetic algorithm for feature selection; as this work is focused on feature extraction and classification, feature selection will be performed as future work. In another work [9], P300 components are separated from non-P300 components by ICA; to identify the independent components, topographic template matching has been performed on the extracted P300 components, with SVM used as the classification approach. Applying the same framework on the recorded EEG data, an accuracy of 60.17%, specificity of 51.33%, sensitivity of 67.83% and G-measure of 34.27% have been obtained. All the comparative results are tabulated in Table 7. From the results, it is observed that the proposed approach gives better performance in terms of accuracy, sensitivity, specificity and G-measure.
The approach attained an average overall accuracy of 84.7%, specificity of 83.9%, sensitivity of 82.5% and G-measure of 80.8%. From the results, it can be inferred that among the various feature extraction approaches, wavelets, which provide time–frequency information of the signals, give the best performance. Compared with applying the base classifiers individually, the ensemble technique provides better results by combining the best-performing classifiers. The selection of a classifier is a tedious task, as it depends on the type of dataset [22]. Therefore, with knowledge of the dataset and using the ensemble approach, better performance can be obtained.

Fig. 5
figure 5

Comparison of the performance of 5-classifier ensemble framework and 3-classifier ensemble framework

Fig. 6
figure 6

Comparison of the means of two classes using ANOVA

Table 7 Comparison with existing approaches

5 Conclusion

In this paper, an ensemble framework has been proposed by aggregating the three best-performing classifiers. A classifier ranking approach has been applied to select three classifiers among five, viz. LDA, SVM, MLFFNN, KNN and NB. The main aim is to develop an ensemble framework that provides a better approach for the classification of guilty and innocent subjects. The proposed framework is applied on EEG data recorded during a Concealed Information Test to analyze human behavior while lying. During data acquisition, a set of images is flashed in front of the subjects in two different sessions. Signals acquired during the experimental sessions are analyzed to identify guilty and innocent subjects. A wide range of feature extraction techniques has been applied on the acquired EEG data. For classification, five classifiers are applied and their results are aggregated using majority voting, weighted voting and unanimous voting. To improve the performance of the ensemble framework, classifier ranking is applied, and the best three classifiers are aggregated using the weighted voting approach. The results of the proposed ensemble framework (3-WV) are compared with the performance of the individual base classifiers and with the 5-classifier system. Among the various feature extraction approaches applied, wavelets perform best: using wavelets with the proposed framework, subject 3 has attained the highest accuracy of 100%, with an average accuracy of 92.4%. The results of the proposed framework are also compared with some existing approaches, and an improved overall classification accuracy has been achieved using the proposed ensemble framework (3-WV). In future work, different time–frequency domain feature extraction techniques can be applied to extract useful information from EEG signals, and feature selection approaches can be applied so as to feed the best set of features to the classifiers. Also, optimization using bio-inspired approaches like PSO can be applied to reach the optimal solution at a faster rate and with better accuracy.