1 Introduction

Logistic regression (LR) [1, 2] is a multivariable method devised for dichotomous outcomes. It is a standard statistical classification method that is particularly appropriate for models involving disease state (healthy/diseased), decision making (yes/no), or mortality (dead/living). It is widely used for binary classification problems in applied sciences such as medicine, biology and epidemiology, and owes its popularity to its simplicity and high interpretability. Logistic regression does, however, place special requirements on the data under consideration, such as little or no collinearity among the independent variables and linearity of the independent variables with the logit. In contrast, the support vector machine (SVM) [3, 4, 5] has recently become a very popular machine learning tool for classification, and it is easy and uncomplicated to apply compared with LR. Nowadays, SVM is used intensively in data mining, a general term for the science of extracting useful information from large databases or data sets.

There are many empirical studies comparing machine learning algorithms, including comparisons of LR and SVM. For example, Perlich et al. [6] presented a learning-curve comparison of decision trees and logistic regression using bagging; STATLOG [7] presented a study that included several machine learning algorithms and LR but did not include SVM; and [8] compared logistic regression (LR), probabilistic neural network (PNN) and support vector machine (SVM) classifiers for discriminating between normal subjects and Parkinson’s disease (PD) patients. There are various other pairwise comparison studies involving LR and/or SVM, such as LeXu and Gao [9], who compared logistic regression with an artificial neural network (ANN) on a power distribution system, and Chen et al. [10], who compared SVM with back-propagation neural networks in forecasting the six major Asian stock markets. There are numerous studies in medical fields too, such as Song et al. [11], who performed a comparative analysis of logistic regression and ANN for computer-aided diagnosis of breast masses. All these studies compare pairs of classifiers using only one data set for a single problem. To the best of our knowledge, the only direct comparison between SVM and LR has been done for the prediction of hospital mortality in critically ill patients with hematological malignancies [12]. That comparison focuses only on the mortality prediction model; the authors divided the data into training and testing sets, and the data set they used has only 350 instances. None of the previous studies used statistical analysis for evaluating the performance of the classifiers under comparison. Moreover, there are plenty of newer techniques, such as bagging and ensembles, that have been applied to classification methods to improve their performance; the classifiers’ performance needs to be compared again after incorporating these improvements.

The aim of this paper is to construct a standard, comprehensive comparative study between SVM and LR on multiple data sets. Recently, combining multiple classifiers has become a very active research area. It is widely accepted that combining multiple classifiers can achieve better classification performance than a single classifier [13, 14]; therefore, the bagging [15] predictors method has been used. Bagging is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The main idea of bagging is to draw various samples of the training set; a classifier is generated for each of these training set samples by a selected learning algorithm, so for k samples of the training data set we obtain k particular classifiers. There are many strategies for aggregating these classifiers; in this paper, the average of the estimated probabilities was used for aggregation. A variety of performance metrics have been used to assess the algorithms’ performance: accuracy, sensitivity, specificity, precision and F-score. These standard metrics were combined with other metrics that measure additional properties, such as failure avoidance and class discrimination: Youden’s index, the positive and negative likelihood ratios, and the diagnostic odds ratio (DOR), which may be useful in discovering hidden characteristics of an algorithm’s performance. Furthermore, receiver operating characteristic (ROC) analysis has been used in this study; it is a more powerful evaluation tool and has not been included in previous studies. The statistical significance of the difference between each pair of ROC curves was tested using the Mann–Whitney nonparametric test. Statistical evaluation of experimental results is an essential part of validating a comparison, and this study gives a detailed account of using statistical analysis to compare two algorithms.

2 Classification methods

The most widely studied and well-understood learning protocol is supervised learning, where a learning algorithm uses labeled instances to formulate a predictive model [16]. Logistic regression and the support vector machine are two broadly used supervised classification methods. Logistic regression is a parametric method for analyzing a dichotomous response variable and modeling the relationship between the response variable and the independent variables; it has been widely applied in medical fields, but seldom used in machine learning studies. The support vector machine has been broadly used in machine learning studies; recently, it has been extensively applied to classification problems and successfully used in many real-world fields [17, 18].

2.1 Support vector machine

Support vector machine (SVM) [35] is a comparatively new classification method that has drawn much attention in recent years [19]. The concept of SVM is as follows: input vectors x are mapped to a very high-dimensional feature space z through some nonlinear mapping \( \Upphi (x),z = \Upphi (x) \). In this space, an optimal separating hyperplane is constructed. Consider a training data set with n samples (x1, y1), (x2, y2), …, (xn, yn), where \( x_{i} \) is a feature vector in a d-dimensional feature space Rd and \( y_{i} \in \{ - 1, + 1\} \) is the corresponding class label. The task is to find a classifier with decision function \( f(x,w,b) = w^{T} \Upphi (x) + b \); SVM finds the optimal hyperplane with the maximal margin that separates the data points of the two classes. To find the optimal separating hyperplane having maximal margin, a learning machine should

$$ \min_{w,b} \frac{1}{2}w^{T} w $$
(1)

Subject to

$$ y_{i} [w^{T} \Upphi (x_{i} ) + b] \ge 1\quad i = 1, \ldots ,n $$
(2)

where w is the normal vector of the separating hyperplane \( w^{T} \Upphi (x_{i} ) + b = 0 \). This problem can be transformed into its dual form by minimizing the following primal Lagrangian

$$ L_{d} (w,b,\alpha ) = \frac{1}{2}w^{T} w - \sum\limits_{i = 1}^{n} {\alpha_{i} } \{ y_{i} [w^{T} \Upphi (x_{i} ) + b] - 1\} $$
(3)

with respect to w and b, by using \( \partial L_{d} /\partial w = 0 \) and \( \partial L_{d} /\partial b = 0 \), i.e., by exploiting

$$ \frac{{\partial L_{d} }}{\partial w} = 0,\quad w = \sum\limits_{i = 1}^{n} {\alpha_{i} y_{i} \Upphi (} x_{i} ) $$
(4)
$$ \frac{{\partial L_{d} }}{\partial b} = 0,\quad \sum\limits_{i = 1}^{n} {\alpha_{i} } y_{i} = 0 $$
(5)

Substituting w from (4) and using (5) leads to the following dual Lagrangian problem

$$ L_{d} (\alpha ) = \sum\limits_{i = 1}^{n} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i,j = 1}^{n} {y_{i} y_{j} } \alpha_{i} \alpha_{j} k(x_{i} ,x_{j} ) $$
(6)

where \( k(x_{i} ,x_{j} ) = \Upphi^{T} (x_{i} )\Upphi (x_{j} ) \) is a Mercer kernel that allows us to calculate the dot product in the high-dimensional space without explicitly knowing the nonlinear mapping. \( L_{d} (\alpha ) \) in (6) should be maximized subject to the following constraints:

$$ \begin{gathered} \alpha_{i} \ge 0\quad i = 1, \ldots ,n \hfill \\ \sum\limits_{i = 1}^{n} {\alpha_{i} } y_{i} = 0 \hfill \\ \end{gathered} $$
(7)

In the more general case, when the problem is not separable or is judged too costly to separate because the training data points overlap, the constraints of the dual Lagrangian problem in (6) change to the following:

$$ \begin{gathered} 0 \le \alpha_{i} \le c\quad i = 1, \ldots ,n \hfill \\ \sum\limits_{i = 1}^{n} {\alpha_{i} } y_{i} = 0 \hfill \\ \end{gathered} $$
(8)

where \( (\alpha_{1} , \ldots ,\alpha_{n} ) \) are the weights assigned to the training samples \( x_{i} \); if \( \alpha_{i} > 0 \), \( x_{i} \) is called a support vector. c is a regularization parameter used to trade off the training accuracy against the model complexity so that a superior generalization capability can be obtained. There are different forms of kernel function; however, the SVM with the Gaussian (RBF) kernel has been popular for practical purposes, since it can handle the case when the relation between classes and features is nonlinear, and it has fewer parameters than other nonlinear kernels such as the polynomial kernel [20–22]. Therefore, the RBF kernel function given in Eq. (9) is used in this paper.

$$ k(x_{i} ,x_{j} ) = \exp\left( - \frac{{\left\| {x_{i} - x_{j} } \right\|^{2} }}{{2\sigma^{2} }}\right)$$
(9)

After the Lagrange multipliers (\( \alpha_{1} , \ldots ,\alpha_{n} \)) are calculated by solving (6) subject to (8), and using (4), the decision function can be formulated as follows:

$$ f(x) = w^{T} \Upphi (x) + b = \sum\limits_{i = 1}^{n} {\alpha_{i} y_{i} k(x,x_{i} )} + b $$
(10)

where x is the d-dimensional vector of a test example and b is the SVM bias term, which depends upon the applied kernel (it can be implicitly part of the kernel function). It is found by fulfilling the requirement that the value of the decision function at a support vector \( x_{s} \) should equal the given label, i.e., \( f(x_{s} ) = y_{s} = \pm 1 \). For a given pattern \( x_{p} \), if \( f(x_{p} ) > 0 \), \( x_{p} \) belongs to class (y = +1); otherwise, it belongs to class (y = −1).
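As an illustration of Eqs. (9) and (10), the following is a hedged Python sketch, assuming scikit-learn and synthetic data rather than the LIBSVM/MATLAB setup used later in this paper: it fits an RBF-kernel SVM and reconstructs the decision function f(x) from the learned multipliers (the data set and parameter values are arbitrary).

```python
# Minimal sketch (assumption: scikit-learn, synthetic data): fit an RBF SVM
# and rebuild f(x) = sum_i alpha_i * y_i * k(x, x_i) + b of Eq. (10).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
y = np.where(y == 1, 1, -1)                      # labels in {-1, +1}

gamma = 0.5                                      # RBF gamma = 1 / (2 * sigma^2)
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

def rbf(a, b):                                   # the kernel of Eq. (9)
    return np.exp(-gamma * np.sum((a - b) ** 2))

x_test = X[:5]
# Sum over support vectors only, since alpha_i = 0 for all other points;
# clf.dual_coef_ stores the products alpha_i * y_i.
K = np.array([[rbf(sv, x) for sv in clf.support_vectors_] for x in x_test])
f = K @ clf.dual_coef_.ravel() + clf.intercept_

assert np.allclose(f, clf.decision_function(x_test))
print(np.sign(f))                                # predicted classes in {-1, +1}
```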

2.2 Logistic regression

Logistic regression (LR) [1, 2] is a well-known statistical approach for modeling dichotomous (binary) data; it is a member of the family of generalized linear models. In logistic regression there is a single outcome variable \( y_{i} \), \( i = 1, \ldots ,n \), where each \( y_{i} \) takes the value 1 with probability \( \pi_{i} \) and the value 0 with probability \( 1 - \pi_{i} \), so it follows the Bernoulli probability density function \( p(y_{i} ) = (\pi_{i} )^{{y_{i} }} (1 - \pi_{i} )^{{1 - y_{i} }} \). Our interest is in \( y_{i} = 1 \), whose probability \( \pi_{i} \) varies over the observations as an inverse logistic function of a vector \( x_{i} \), which includes a constant (x0) and k explanatory variables (x1,…, xk):

$$ \begin{gathered} y_{i} \sim {\text{Bernoulli}}(\pi_{i} ) \hfill \\ p(y_{i} = 1) = \pi_{i} = (1 + e^{{ - x_{i} \beta }} )^{ - 1} \hfill \\ \end{gathered} $$
(11)

where \( \beta = (\beta_{0} ,\beta_{1}^{\prime } )^{\prime } \) is a (k + 1) × 1 vector containing the parameters to be estimated; β0 is the intercept term corresponding to x0, and \( \beta_{1} \) is a (k × 1) vector with elements corresponding to the explanatory variables. The odds of y = 1 are \( p(y = 1)/(1 - p(y = 1)) = \pi_{i} /(1 - \pi_{i} ) \). Using these odds, the following transformation can be obtained:

$${\text{logit}}[p(y_{i} = 1)] = \ln [{\text{odds}}] = \ln \left(\frac{{\pi_{i} }}{{1 - \pi_{i} }}\right) = x_{i} \beta $$
(12)

The above logit function can be expressed in matrix form as follows:

$${\text{log}}it[p(y = 1)] = x\beta$$
(13)

The importance of the transformation in (13) is that the logit has many of the desirable properties of the linear regression model: it is linear in the parameter vector β. These parameters are estimated using maximum likelihood. The likelihood contribution of a single Bernoulli observation is \( L(\pi_{i} /y_{i} ) = (\pi_{i} )^{{y_{i} }} (1 - \pi_{i} )^{{1 - y_{i} }} \). Assuming independence over the observations, the likelihood function for \( y = y_{1} , \ldots ,y_{n} \) can be written as follows:

$$ L(\beta /y) = \prod\limits_{i = 1}^{n} {(\pi_{i} )^{{y_{i} }} (1 - \pi_{i} )^{{1 - y_{i} }} } $$
(14)

By taking the logarithm, the log-likelihood will be

$$ \ln L(\beta /y) = \sum\limits_{i = 1}^{n} {[y_{i} \ln (\pi_{i} )} + (1 - y_{i} )\ln (1 - \pi_{i} )] $$
(15)

After estimating the parameters, the significance of each parameter is assessed by comparing the observed values of the response variable with the predicted values obtained from the model with and without the variable included. In logistic regression this comparison is based on the log-likelihood function defined in (15) and uses the following statistic:

$$ G = - 2\ln \left[ {\frac{likelihood\;without\;the\;variable}{likelihood\;with\;the\;variable}}\right] $$
(16)

This statistic is compared with \( \chi^{2} (\alpha ,1) \) to test the hypothesis that the parameter equals zero: if G > \( \chi^{2} (\alpha ,1) \), the parameter is significant and is retained; otherwise it should be deleted from the model. There are several selection procedures for constructing the best-fitting model. Forward selection looks at each explanatory variable individually and first includes the single explanatory variable that best fits the data on its own; among the remaining variables, the one that adds the most is included next, and this is repeated until none of the remaining variables adds significantly. Backward selection starts with a model that contains all of the explanatory variables; the variable whose removal would cause the smallest change in the overall fit is then removed, and this continues until all variables remaining in the model are significant. For assessing the goodness of fit of the model, there are several tests that compare the overall difference between the observed and fitted values; among these, the Pearson chi-square χ2 and the deviance D are used most often. Suppose the number of covariate patterns is j, with j < n, and let \( m_{i} \) denote the number of observations and \( y_{i} \) the number of responses (y = 1) in the i-th pattern. The Pearson statistic is defined as follows:

$$ \chi^{2} = \sum\limits_{i = 1}^{j} {\frac{{(y_{i} - m_{i} \hat{\pi }_{i} )^{2} }}{{m_{i} \hat{\pi }_{i} (1 - \hat{\pi }_{i} )}}} $$
(17)

The residual deviance statistic is defined as follows:

$$ D = 2\sum\limits_{i = 1}^{j} {\left[ {y_{i} \ln \left( {\frac{{y_{i} }}{{m_{i} \hat{\pi }_{i} }}} \right) + (m_{i} - y_{i} )\ln \left( {\frac{{(m_{i} - y_{i} )}}{{m_{i} (1 - \hat{\pi }_{i} )}}} \right)} \right]} $$
(18)

Both statistics rely on the principle of comparing the observed \( y_{i} \) to the predicted \( m_{i} \hat{\pi }_{i} \) values, and they should be small if the model fits the data well. They are compared with the value of \( \chi^{2} (\alpha ,j - k - 1) \) to judge their statistical significance. These statistics are used when j < n; their results are invalid when j ≈ n [1, 23]. In that case there are alternative statistics that can be used, such as the Osius and Rojek statistic, the Farrington statistic and the Hosmer–Lemeshow statistic.
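As a worked illustration of the G statistic in Eq. (16), the sketch below compares nested logistic models on simulated data; statsmodels and the simulated data are assumptions for illustration (the paper itself used SPSS).

```python
# Hedged sketch: likelihood-ratio (G) test for one variable in a logistic model.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
p = 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0])))     # only x1 truly matters
y = rng.binomial(1, p)

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)            # with x2
reduced = sm.Logit(y, sm.add_constant(X[:, :1])).fit(disp=0)  # without x2

G = -2 * (reduced.llf - full.llf)                # -2 ln(L_without / L_with)
critical = chi2.ppf(0.95, df=1)                  # chi^2(alpha = 0.05, 1 df)
print(G, critical, "keep x2" if G > critical else "drop x2")
```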

The predicted label of the logistic regression model equals 1 if \( \hat{\pi }_{i} \) is greater than or equal to some threshold (the default is 0.5), as shown below:

$$ \begin{gathered} if\,(p(y = 1)) \ge 0.5\quad the\, instance \in class\,(y = 1) \hfill \\ if\,(p(y = 1)) < 0.5\quad the\, instance \in class\,(y = 0) \hfill\\ \end{gathered} $$
(19)
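A minimal sketch of the decision rule in Eq. (19), assuming scikit-learn and synthetic data for illustration: \( \hat{\pi }_{i} \) is estimated as in Eq. (11) and class 1 is assigned when it reaches the 0.5 threshold.

```python
# Minimal sketch (assumption: scikit-learn, synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

pi_hat = model.predict_proba(X)[:, 1]            # estimated p(y = 1), Eq. (11)
y_pred = (pi_hat >= 0.5).astype(int)             # threshold rule, Eq. (19)

assert np.array_equal(y_pred, model.predict(X))  # matches the default rule
```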

3 Materials and methods

3.1 The data sets

The data used in this study comprise 13 data sets with binary class attributes: 11 from the UCI repository (ftp to ics.uci.edu/pub/machine-learning-databases) and 2 from the LIBSVM binary classification collection (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets). These data sets are of different sizes; six of them are almost balanced and the remaining seven are unbalanced. Table 1 gives a numerical summary of the data sets.

Table 1 Summary of the data sets

3.2 Bagging and aggregating classifier decisions

Ensembles of classifiers represent one of the main research directions in machine learning [24]. Empirical studies have shown that, in both classification and regression problems, ensembles are often much more accurate than the individual base learners that make them up [25], and several ensemble methods have been developed recently. Bagging [15], proposed by Leo Breiman in 1996, is one of the most important recent developments in classification methodology. Applying bagging to many classification algorithms yields substantial gains in accuracy. Breiman showed that bagging works well for unstable procedures, where a small change in the training data set can result in large changes in predictions (e.g., neural networks, decision trees). Although SVM is a stable classification method, its performance can generally still be improved by bagging [25, 26]. Bagging has been applied widely to machine learning techniques, but it has rarely been applied to statistical tools such as logistic regression [6].

Bagging works by repeatedly applying a selected classification algorithm to modified versions of the training data set, so that a classifier is created for each subsample of the training data.

Our experiment was done according to the following bagging algorithm (a code sketch follows the listed steps).

(i) Initialize the training data set T.

(ii) Divide the training data set T into two sets T1 and T2, based on the data classes.

(iii) Draw two random samples (bootstraps) with replacement from T1 and T2 with the same proportions (some of the examples may be selected repeatedly and some may not be selected at all).

(iv) Mix the two samples (bootstraps) together to form the new training data set; in this way the class proportions are the same as in the original data set, and all the training data sets are of the same size.

(v) Train a particular classifier on this sub training data set using the selected learning algorithm.

(vi) Repeat the previous steps K times to obtain K classifiers.
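Assuming the data are held in NumPy arrays, steps (ii)–(iv) might look like the following minimal sketch (illustrative, not the authors’ implementation):

```python
# Hedged sketch of a class-stratified bootstrap (steps ii-iv above).
import numpy as np

def stratified_bootstrap(X, y, rng):
    idx = []
    for label in np.unique(y):                   # split T by class (step ii)
        cls = np.flatnonzero(y == label)
        # sample with replacement, keeping each class at its original size (iii)
        idx.append(rng.choice(cls, size=cls.size, replace=True))
    idx = np.concatenate(idx)                    # mix the bootstraps (step iv)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
Xb, yb = stratified_bootstrap(X, y, rng)         # same size, same class ratio
```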

Any instance in the training data set T has the probability \( 1 - (1 - 1/n)^{n} \) of being selected at least once when a bootstrap of size n is drawn with replacement from a training set of size n. For large n, this probability is approximately 0.632, which means that each sub training sample contains about 63.2% unique instances of the original training data set. In this way we can build classifiers with samples that are not identical.

After each classifier is trained independently for each algorithm, their results have to be aggregated with an appropriate combination approach. Several combination strategies have been suggested by previous studies. The first and simplest is a majority vote, which can be used when only class labels are available. Second, when continuous-valued outputs such as posterior probabilities are available, the average of the estimated probabilities can be an appropriate strategy; in this case the decision is made according to the mean of the posterior probabilities of the combined classifiers. Third is the average of the estimated parameters, where the final classifier is obtained by averaging the coefficients of the combined classifiers. Since both SVM and LR provide estimated probabilities, the average-of-estimated-probabilities strategy has been used. Each training set (bootstrap) generates estimated probabilities \( \hat{p}(j/x) \) that an object with predictor vector x belongs to class j; the class corresponding to x is then estimated as arg maxj \( \hat{p}(j/x) \). The bagging ensemble is obtained by averaging \( \hat{p}(j/x) \) over all bootstrap replications to obtain \( \hat{p}_{Be}^{{}} (j/x) \), and the estimated class arg maxj \( \hat{p}_{Be} (j/x) \) is used as the final prediction. This estimate was computed in all the classification examples in this paper; the resulting misclassification rate was always virtually identical to the voting misclassification rate [15].
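Assuming a list of K fitted models exposing a predict_proba method (the scikit-learn convention, used here purely for illustration), the averaging rule can be sketched as:

```python
# Hedged sketch: average p_hat(j|x) over the bagged models, predict arg max_j.
import numpy as np

def bagged_predict(classifiers, X):
    probs = np.mean([clf.predict_proba(X) for clf in classifiers], axis=0)
    return probs.argmax(axis=1)                  # arg max_j of p_Be(j|x)
```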

3.3 Performance measures

Central to constructing, deploying and using a classification method is the question of performance assessment. The support vector machine and logistic regression are now used in many domains, and different performance measures are appropriate for each domain. Different performance metrics measure different trade-offs in the predictions made by an algorithm, and it is possible for a learning algorithm to perform well on one metric but be suboptimal on another; because of this, it is important to evaluate algorithms on a broad set of performance metrics. Therefore, in this comparative study a variety of performance metrics has been used, divided into three groups. The first group contains the common threshold metrics that are well known and have been widely used in machine learning comparisons; the default threshold is 0.5, and these metrics only consider whether the prediction is above or below it. They are: accuracy (ACC), the number of correct predictions on the test data divided by the number of test instances; sensitivity (SN) and specificity (SP), which assess the effectiveness of the algorithm on the positive and negative classes respectively; the F-score, a composite measure that rewards algorithms with higher sensitivity and challenges algorithms with higher specificity; and precision, which assesses the predictive power of the algorithm for the positive or negative class. Second, metrics beyond these common ones have been used to assess the performance of the algorithms in other respects. These measures are used in the medical area; they are Youden’s index (γ), the likelihood ratios (LR) and the diagnostic odds ratio (DOR).

Youden’s index (γ) (1950) measures the ability of an algorithm to avoid failure. It weights the algorithm’s performance on negative and positive examples equally, and can be expressed as:

$$ \gamma = sensitivity - (1 - specificity) $$
(20)

A high value of γ indicates better ability to avoid failure [27].

Positive and negative likelihood ratios (LRs) [28] are familiar epidemiologic measures, used to select appropriate diagnostic tests, and they are useful for comparing two algorithms. Their advantage over sensitivity and specificity is that they evaluate the algorithm’s performance with respect to both classes. The positive (ρ+) and negative (ρ−) likelihood ratios can be expressed as:

$$ \rho + = \frac{sensitivity}{(1 - specificity)},\quad \rho - = \frac{(1 - sensitivity)}{specificity}$$
(21)

A higher positive likelihood ratio and a lower negative likelihood ratio indicate better performance on the positive and negative classes respectively [28]. It should be mentioned that if ρ+ < 1, the likelihood metrics should not be used. The relationship between the likelihood ratios and the performance of two algorithms A and B is as follows [28]:

$$ \begin{gathered} if\,\rho_{ + }^{A} > \rho_{ + }^{B} \,and\,\rho_{ - }^{A} < \rho_{ - }^{B} \quad implies\,A\,is\,superior\,overall. \hfill \\ if\,\rho_{ + }^{A} < \rho_{ + }^{B} \,and\,\rho_{ - }^{A} < \rho_{ - }^{B} \quad implies\,A\,is\,superior\,for\,confirmation\,of\,negative\,examples. \hfill \\ if\,\rho_{ + }^{A} > \rho_{ + }^{B} \,and\,\rho_{ - }^{A} > \rho_{ - }^{B} \quad implies\,A\,is\,superior\,for\,confirmation\,of\,positive\,examples. \hfill \\ if\,\rho_{ + }^{A} < \rho_{ + }^{B} \,and\,\rho_{ - }^{A} > \rho_{ - }^{B} \quad implies\,A\,is\,inferior\,overall. \hfill \\ \end{gathered} $$

The diagnostic odds ratio (DOR) [29] is also a global performance measure. It has been suggested as a superior measure of diagnostic discrimination, and it is used in medicine for comparing the diagnostic accuracy of two or more diagnostic tests. Similarly, this measure can be used in machine learning to measure and compare algorithm performance. It evaluates how well the algorithm distinguishes between positive and negative examples. It is calculated using the following equation:

$$ {\text{DOR}} = \frac{sensitivity /(1 - sensitivity)}{(1 - specificity)/specificity}$$
(22)
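For concreteness, Eqs. (20)–(22) can be computed from a confusion matrix as in the sketch below; tp, fn, fp and tn are hypothetical counts chosen for illustration.

```python
# Illustrative computation of Youden's index, likelihood ratios and DOR.
def diagnostic_metrics(tp, fn, fp, tn):
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    youden = sens - (1 - spec)                   # Eq. (20)
    lr_pos = sens / (1 - spec)                   # rho+, Eq. (21)
    lr_neg = (1 - sens) / spec                   # rho-, Eq. (21)
    dor = lr_pos / lr_neg                        # Eq. (22)
    return youden, lr_pos, lr_neg, dor

print(diagnostic_metrics(tp=80, fn=20, fp=10, tn=90))
```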

Combining these three metrics with the common metrics helps to obtain a balanced evaluation of the algorithms’ performance. Thirdly, to assess the algorithms’ performance with respect to their estimated probabilities, the area under the ROC (receiver operating characteristic) curve (AUC) is used [30, 31], which summarizes the algorithm’s performance averaged across all possible probability thresholds. The ROC curve plots observed sensitivity versus (1 − specificity) for all possible classification thresholds; it also measures the ability of the algorithms to separate the instances of the different classes. The power of the ROC curve comes from the fact that it characterizes the performance of a classification model as a curve rather than as a single point. An important statistical property of the AUC is that it is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance, which is equivalent to the Mann–Whitney statistic [32]. A high value of the test statistic indicates that the probability ranking is generally better. Thus we used the area under the ROC curve for comparing the class probability estimates of the two algorithms. To test the statistical difference between each pair of ROC curves of the two algorithms, the Wilcoxon test was used as an appropriate nonparametric test [31, 33].
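The AUC/Mann–Whitney equivalence can be checked numerically: the AUC equals U/(n+ · n−), the probability that a randomly chosen positive example scores above a randomly chosen negative one. The scores below are simulated purely for illustration.

```python
# Sketch: AUC from the Mann-Whitney U statistic (simulated scores).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=50)              # scores of positive examples
neg = rng.normal(0.0, 1.0, size=60)              # scores of negative examples

u, p_value = mannwhitneyu(pos, neg, alternative="greater")
auc = u / (pos.size * neg.size)                  # AUC = U / (n_pos * n_neg)
print(auc, p_value)
```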

3.4 Statistical comparison methods

Although there is no single prescribed procedure for comparing algorithms over multiple data sets, there are different statistical tests and common-sense techniques for testing whether two algorithms are significantly different or not. The key question in using a statistical test is its suitability and whether its assumptions are satisfied. In this paper, the paired sample T test as a parametric test and the Wilcoxon signed-ranks test as a nonparametric test have been used.

3.4.1 Paired sample T test

The paired T test is used to compare two population means where the observations in one sample can be paired with the observations in the other. It is used in this paper to test the statistical difference between the two algorithms over the various evaluation measures; the hypothesis is whether the average difference in their performance over the data sets is significantly different from zero. Let c1j and c2j be the metric scores of the two algorithms on the j-th data set and let dj be the difference c2j − c1j. The T statistic is computed as \( \bar{d}/s_{{\bar{d}}} \) and is distributed according to the T distribution with N − 1 degrees of freedom, where N is the number of data sets. The paired T test remains valid even if the variances of the two random variables under comparison are not homogeneous; however, it may be less effective if the two random variables are not normally distributed [34]. In addition, the paired T test requires a minimum sample size (number of data sets) of about 30. There are many tests that can be used to check that a variable is normally distributed, such as the Kolmogorov–Smirnov test; however, all such normality tests are affected by the sample size, so it is of little use to check the normality of samples smaller than 30.
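A sketch of the paired T test over N data sets, with illustrative scores, verifying the \( \bar{d}/s_{{\bar{d}}} \) formula against scipy:

```python
# Hedged sketch: paired T test on per-data-set scores (values illustrative).
import numpy as np
from scipy.stats import ttest_rel

c1 = np.array([0.84, 0.79, 0.91, 0.75, 0.88])    # algorithm 1 scores
c2 = np.array([0.85, 0.80, 0.90, 0.78, 0.87])    # algorithm 2 scores

d = c2 - c1
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))  # d_bar / s_(d_bar)
t, p = ttest_rel(c2, c1)
assert np.isclose(t, t_manual)
print(t, p)
```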

3.4.2 Wilcoxon signed-ranks test

The alternative to the paired T test is the Wilcoxon signed-ranks test, a nonparametric test. It needs neither the homogeneity nor the normality assumption and is not affected by the sample size [35]; therefore, it is appropriate when the paired T test’s assumptions are violated. The Wilcoxon signed-ranks test ranks the differences in the performance measurements of the two algorithms for each data set, ignoring the signs, and compares the ranks of the positive and negative differences. The differences are ranked according to their absolute values; average ranks are assigned in case of ties. Let R+ be the sum of ranks of the data sets on which the second algorithm outperformed the first, and let R− be the sum of ranks for the opposite. Ranks of \( d_{i} = 0 \) are split evenly between the two sums. Let T = min(R+, R−); the test statistic is then computed as follows:

$$ Z = \frac{{T - \frac{1}{4}(N(N + 1))}}{{\sqrt {\frac{1}{24}N(N + 1)(2N + 1)} }} $$
(23)

The statistic in (23) is distributed approximately normally. The null hypothesis is rejected if \( |Z| > Z_{\alpha /2} \).
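With scipy the test can be sketched as follows; d is an assumed vector of per-data-set performance differences, and the returned statistic corresponds to T = min(R+, R−) for the two-sided test.

```python
# Hedged sketch: Wilcoxon signed-ranks test on performance differences.
import numpy as np
from scipy.stats import wilcoxon

d = np.array([0.011, 0.008, -0.012, 0.030, -0.009, 0.021, 0.017])
stat, p = wilcoxon(d)                            # stat = min(R+, R-)
print(stat, p)
```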

4 Experimental set up

The bagging method is used to construct the experiment. As shown previously, a random sample (bootstrap) is drawn with replacement from the original data set to form a training set; each training set contains approximately 63% unique instances of the original data set. The support vector machine (SVM) with the Gaussian (RBF) kernel and LR were used in this experiment for classification. With methods like cross-validation, the relevant parameters of SVM can be chosen systematically, so cross-validation is widely used to choose the optimal parameters for SVM [36]; hence, with each bootstrap, 10-fold cross-validation (CV) was used to determine the best values of γ and C [37]. Normally, cross-validation (CV) is used to estimate the generalization capability on new samples that are not in the training data set. A k-fold cross-validation randomly splits the training data set into k approximately equal-sized subsets, leaves out one subset, builds a classifier on the remaining samples, and then evaluates classification performance on the held-out subset [38–40]. This process is repeated k times, once for each subset, to obtain the CV performance over the training data set. The best parameter values were determined for each training set, and each training set with its associated best values was used to construct a support vector machine model. This model generated estimated probabilities \( \hat{p}(j/x) \). The procedure was repeated 100 times, yielding 100 SVM classifiers, which were combined by taking the average of the \( \hat{p}(j/x) \) to obtain \( \hat{p}_{Be} (j/x) \). The same procedure was used for LR to construct 100 models. For the categorical variables, we deleted any training data set that did not include all the categories, so that all the testing and training data sets include all the categories of the categorical variables. The 2-way interactions between the independent variables were added to the model, and correlation and collinearity were checked before the analysis. The quartile method was used to assess the relationship between the continuous variables and the outcome, to check whether categorization of the continuous variables was needed. The backward selection procedure was used with 0.05 as the default significance level. After the estimated class arg maxj \( \hat{p}_{Be} (j/x) \) was calculated for the support vector machine and logistic regression, the various evaluation metrics were computed for both. The ROC curves were constructed using \( \hat{p}_{Be} (j/x) \). The paired T test and the Wilcoxon signed-ranks test were applied to all the performance measures, and the probabilities multiplication rule was used to combine the results of the two tests on those performance measures to obtain the final decision.
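The parameter search described above might be sketched as follows, assuming scikit-learn in place of LIBSVM under MATLAB; the powers-of-two grid is a common convention and the values are illustrative, not those used in the study.

```python
# Hedged sketch: 10-fold CV grid search over (C, gamma) for an RBF SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
grid = {"C": 2.0 ** np.arange(-5, 16, 4),        # illustrative grid values
        "gamma": 2.0 ** np.arange(-15, 4, 4)}
# probability=True enables predict_proba for the averaging step of Sect. 3.2.
search = GridSearchCV(SVC(kernel="rbf", probability=True), grid, cv=10)
search.fit(X, y)
print(search.best_params_)                       # best (C, gamma) for this set
```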

5 Results and discussions

The support vector machine results were obtained using the LIBSVM (3.0–1) software package [41], available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, under the Matlab (7.8.0.347, R2009a) interface, while the multiple logistic regression results were obtained using SPSS 16.0 (SPSS Inc., Chicago, IL, USA). The statistical tests were also calculated using SPSS 16.0.

5.1 Performances by measures

The results of the support vector machine and logistic regression on the data sets, for the common and the newly suggested performance measures, are shown in Tables 2 and 3 respectively. In each table the first six rows give the results for the balanced data sets. Each metric value in these tables represents the average over the 100 classifiers for the corresponding data set. The areas under the curves (AUC) shown in these tables were calculated using the average of the estimated probabilities over all 100 classifiers.

Table 2 The results of the performance measures for SVM
Table 3 The results of the performance measures for LR

5.2 The ROC curve analysis

As described above, the average of the estimated probabilities has been used to construct the ROC curves for SVM and LR. Figures 1 and 2 show the ROC curves for the credit approval and Pima Indian diabetes data sets, as examples of balanced and unbalanced data sets respectively. In Fig. 1, the two ROC curves for credit approval for SVM and LR are almost the same, which is the situation for most of the balanced data sets. In Fig. 2, the ROC curve for Pima Indian diabetes is higher for SVM; however, some of the other unbalanced data sets have almost identical curves for SVM and LR. The relationship between the areas under these ROC curves for SVM and LR is depicted in Fig. 3, which plots the AUCs of each data set, in the same order as Table 1, for SVM and LR; it shows that most pairs of AUCs of SVM and LR lie close together.

Fig. 1 ROC curve for credit approval

Fig. 2 ROC curve for Pima Indian diabetes

Fig. 3 The relationship between the AUCs of SVM and LR for the data sets

The Wilcoxon signed-ranks test with α = 0.05 for the balanced data sets shows no significant difference between the ROC curves of SVM and LR, since all the p values are in the range (0.078, 0.475). However, for the unbalanced data sets it shows a significant difference for the German number and page block data sets, whose p values are less than 0.025.

5.3 The statistical tests analysis

The paired T test and the Wilcoxon signed-ranks test are used to check whether the two algorithms perform equally well or not. Because we have no guarantee that the normality assumption holds, and because of the relatively small number of samples, we applied both tests. The results are shown in Tables 4 and 5; each value in these tables represents the p value of the corresponding measure over all the data sets.

Table 4 Paired T test’s results
Table 5 Wilcoxon signed-ranks test’s results

The multiplication rule [42] is used to combine these results; since all these tests are independent, the rule can be used to obtain a single p value on which to base the final statistical decision. In the case of independence, the rule is as follows:

$$ p\left(\mathop \cap \limits_{i = 1}^{n} A_{i} \right) = \mathop \prod \limits_{i = 1}^{n} p(A_{i} ) $$
(24)

This means that the probability of no statistically significant difference between the algorithms is equivalent to the joint probability of no statistically significant difference on all of their performance measures. Since the p value is the probability of the observed result under H0, the p value for no significant difference between the two algorithms is, by the above rule, equal to the product of all the individual p values. The combined p value for the paired T test is 5.387 × 10⁻⁶, while that for the Wilcoxon signed-ranks test is 1.4 × 10⁻⁸. Similarly, α is the probability of rejecting H0 when it is actually true, so the α value for testing no significant difference between the two algorithms is equal to (0.025)¹⁰ = 9.536743 × 10⁻¹⁷.
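As simple arithmetic, the multiplication rule amounts to the following sketch; the individual p values here are placeholders, not the study’s values.

```python
# Hedged sketch of the combined decision via the multiplication rule (Eq. 24).
import numpy as np

p_values = np.array([0.4, 0.3, 0.6, 0.5, 0.2, 0.7, 0.3, 0.4, 0.5, 0.6])
p_combined = np.prod(p_values)                   # product of per-metric p values
alpha_combined = 0.025 ** len(p_values)          # combined significance level
print(p_combined, alpha_combined, p_combined > alpha_combined)
```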

5.4 Discussion

The results show that using several performance measures with different data sets can help in understanding and comparing the performance of the algorithms. Moreover, the results show that it is not always reliable to compare algorithms using their performance measure scores alone. Regarding the comparison made in this paper, for the balanced data sets it was found that the support vector machine and logistic regression have very close overall performance measures on most of the data sets. For the Hared–Scale data set, which is the smallest one, the two classifiers performed equally in accuracy, sensitivity, specificity, precision, F-score and AUC, which indicates that the two algorithms can perform equally well on small data sets. The results obtained on the unbalanced data sets likewise show that, overall, the common performance measures are almost the same for the support vector machine and logistic regression; however, the support vector machine achieves higher values on some unbalanced data sets. For the semi-unbalanced spam data set, the two classifiers performed equally well. On the highly imbalanced data sets (German number and page block), logistic regression was found to be biased towards the majority class, although this would not have a major effect on the algorithms’ general performance. When comparing the ROC curves for the balanced data sets, the minimum p value was 0.078, for the credit approval data set, indicating no significant difference between the two classifiers on these data sets. For the unbalanced data sets, the only significant differences between the two classifiers were found for the German number and page block data sets, whose p values are less than 0.025. Generally, according to the ROC curve results, the two classifiers have equal performance; however, the support vector machine outperforms LR on the highly unbalanced data sets. Because there is no evidence that our sample satisfies the normality assumption, both the paired T test and the Wilcoxon signed-ranks test were used. The results of the paired T test with α = 0.05 show no significant difference in the overall performance measures, and the results of the Wilcoxon signed-ranks test with α = 0.05 show the same. Moreover, the combined p values of the paired T test and the Wilcoxon signed-ranks test are higher than the combined level of significance (α) required for rejecting the null hypothesis. This indicates that there is no statistically significant difference between SVM and LR, and that both of them perform equally well.

6 Conclusion

This study has empirically compared two familiar classifiers, the support vector machine and multiple logistic regression, using bagging and ensembles over balanced and unbalanced data sets of various sizes. The comparison was done in a different manner from that of most machine learning comparisons: this study represents a standard comparison that includes numerous statistical analyses over several algorithm performance measures, which enables us to draw a warranted and verified conclusion. The study shows that, generally, SVM and LR have equal performance over all the performance measures for balanced and unbalanced data; however, the support vector machine may work better on highly unbalanced data sets. The study also shows that some measures are higher for one classifier than for the other on some data sets; consequently, it is not appropriate to conclude from studies with one data set that one classifier is better than the other. There is no gold standard for making such comparisons, and the tests that are performed often have no statistical foundation. Logistic regression has higher interpretability, while the support vector machine is considered a black-box predictor: it neither makes its predictions explicit nor gives insight into the rules governing them, which is not the case for LR. Therefore, when only classification is considered, either of them can be used, whereas when interpretation is necessary, as in many medical studies, logistic regression should be used.