1 Introduction

Classification is the process of assigning a new subject or item to one of G known groups or classes on the basis of how closely specific characteristics of that subject/item match those of the groups. For example, on the basis of specific protein levels measured for a patient, an oncologist can determine with a certain degree of confidence whether or not that patient has a certain type of cancer; using a pixel-based satellite image, a geographer can classify land cover into different categories such as water, forested wetland, and upland forest; or using certain admissions criteria, a university can classify applicants as accepted or not accepted into its program.

There are many different methods (or rules) used to achieve this goal of classifying subjects/items into different classes. Note that classification should not be confused with clustering: classification assigns items to a known number of groups with specific characteristics, whereas clustering forms groups of items with similar characteristics when the number of groups is unknown. In the world of machine learning and computation, classification is referred to as supervised learning, whereas clustering is an example of unsupervised learning. For example, consider a chicken farm that packages chicken eggs. Before packaging, each egg has to be classified as medium, large, extra-large, or jumbo. This is a case of classification because the classes are well defined by egg size, and machines can be taught to classify eggs correctly based on their size. Sometimes clustering or other pattern recognition methods are used by scientists as a first step toward classification, in which the number of groups is determined using multiple characteristics of the constituents. In this article, we discuss classification only in the sense of supervised learning, i.e., a procedure in which discriminating variables or functions are used to predict group membership.

Classification into one of two groups (i.e., G = 2) is known as binary classification and is relatively easy to implement. The earliest classification procedures developed were mostly binary classification methods. With the development of computing and technology, however, more literature on different aspects of classification has become available (Hastie et al. 2001).

Broadly speaking, there are two types of classifiers: hard and soft. As Wahba (2002) described, soft classifiers, also known as probabilistic classifiers, typically provide the conditional probability of membership in each of the G (G ≥ 2) groups for each new subject to be classified and then assign the subject to the group with the largest membership probability. In contrast, hard classifiers only provide a hard classification boundary, like a fence around a property, for each group based on the explanatory variables. A new subject whose characteristics fall within the boundary of a group is assigned to that particular group.

The development of a classification procedure involves two major steps: building the classifier and validating it. In the first step, a classification rule or algorithm, also known as a classifier, is developed, and in the second step, the performance of this classifier is evaluated. The dataset used in the first step to develop the classification rule is known as the training dataset and includes the group membership (response or output) of each subject along with other random variables (explanatory variables or features). A similar dataset, known as the validation dataset, is used in the second step to validate the classifier developed in the first step.

Let us assume that all outcomes are independently observed for the random response variable Y and a vector of random explanatory variables X = (X 1, X 2,  … , X p). Let {(x i, y i) : i = 1, 2, …, n} denote a training dataset consisting of n independent measurements. Each measurement is a (p + 1)-tuple where x i = (x 1i,  x 2i,  … , x pi) is a vector of outcomes for the p explanatory variables and y i (i = 1, 2, …, n) is the outcome for the response variable, which indicates the membership of the ith unit in a certain class.

Consider the famous iris dataset of R.A. Fisher (Anderson 1935) containing 150 data points with four explanatory variables and three classes of iris. The explanatory variables are the sepal-length, sepal-width, petal-length, and petal-width of three types of irises, namely setosa, virginica, and versicolor. The goal is to identify the type of iris using the sepal and petal measurements. This is a classification problem. The sepal-length and petal-length distributions in Fig. 1 show that although setosa tend to have the lowest and virginica the highest sepal-lengths, the separation among the three classes based on sepal-length is not clear due to considerable overlap; hence sepal-length by itself is not a good classifier for these three types of irises. On the other hand, petal-length clearly distinguishes setosa from the other two types but not virginica from versicolor, possibly resulting in misclassification. Hence there is a need for more than one predictor to reduce misclassification. Figure 2 shows a scatterplot of iris petal-length versus petal-width. Once again, setosa are clearly separated from the other two varieties, but there is still some overlap between virginica and versicolor, possibly leading to misclassification.

Fig. 1
figure 1

Distributions of sepal- and petal-lengths of three varieties of iris

Fig. 2
figure 2

A scatterplot of iris petal-length vs petal-width

In this article, we aim to provide a basic description of the most well-known and commonly used classification methods, which are used to develop classifiers (or classification rules) based on the relation between the response variable Y and the explanatory variables X; these classifiers are then used to assign new objects to the known groups based on an observed \( {\mathbf{x}}_0^{\hbox{'}} \). Two soft classifiers (logistic regression and the naïve Bayes classifier) and four hard classifiers (linear discriminant analysis, K nearest neighbor, support vector machines, and classification trees) are described in Sects. 2 and 3, respectively, along with their strengths and weaknesses. A discussion assessing the performance of these classifiers on five different datasets, three real and two simulated, is provided in Sect. 4. Some concluding remarks about the choice of classifiers in practice are provided in Sect. 5.

2 Soft Classifiers

Intuitively, a soft classifier should appeal to anyone who wants to incorporate the uncertainty of the classification outcome, because it also reports the likelihood of a new observation being a member of each class. Here the two most commonly used soft classifiers, namely logistic regression and the naïve Bayes classifier, are discussed.

2.1 Logistic Regression

As described by Cramer (2003), the first use of the logistic function, later central to logistic regression, can be traced to the modeling of population growth in the nineteenth century. Berkson (1944) suggested the use of the logistic probability density function (pdf) instead of the normal pdf in certain bioassay procedures and coined the term logit model to describe the resulting model. Later, many researchers in statistics and epidemiology worked on what would eventually become one of the most widely used classification methods, namely logistic regression, particularly with G = 2 groups. Cox (1969) is considered one of the pioneers of binary logistic regression. More general versions of logistic regression, which can classify new items into G (G ≥ 2) groups, are credited to Gurland et al. (1960), Mantel (1966), and Theil (1969).

For the sake of simplicity, let us start with the case of binary logistic regression with the two classes coded as 1 and 2 (i.e., y i = 1 or 2). For π i1 = P(y i = 1| x i), i = 1, 2, … , n, and under the assumption that the response variable y i has a Bernoulli distribution with parameter π i1, the logistic model is given by,

$$ {\pi}_{i1}=P\left({y}_i=1|{\mathbf{x}}_i\right)=\frac{\exp \left({\beta}_0+{\sum}_{j=1}^p{\beta}_j{x}_{ji}\right)}{1+\exp \left({\beta}_0+{\sum}_{j=1}^p{\beta}_j{x}_{ji}\right)} $$
(1)

where β 0, β 1, … , β p are (p + 1) regression coefficients. Note that in a binary case, π i2 = 1 − π i1.

Alternatively this model can be presented as,

$$ logit\left({\pi}_{i1}\right)=\log \left(\frac{\pi_{i1}}{1-{\pi}_{i1}}\right)={\beta}_0+\sum \limits_{j=1}^p{\beta}_j{x}_{ji} $$
(2)

The regression parameters β j, j = 0, 1, … , p are estimated from the available training dataset. The maximum likelihood (ML) estimates \( \hat{\beta_j},j=0,1,\dots, p \) are obtained by maximizing the likelihood function

$$ L\left({\beta}_0,{\beta}_1,\dots, {\beta}_p\right)=\prod \limits_{i=1}^n{\pi}_{i1}^{I\left({y}_i=1\right)}{\left(1-{\pi}_{i1}\right)}^{1-I\left({y}_i=1\right)}, $$

where I(·) is the indicator function taking the value 1 if y i = 1 and 0 otherwise.

Since no closed-form solution exists for maximizing this likelihood function, iterative algorithms are used to obtain the ML estimates of the regression parameters. According to Agresti (2013), the most popular choices are the Newton–Raphson algorithm (Tjalling 1995) and the iteratively reweighted least squares (IRWLS) algorithm (Burrus et al. 1994). However, when too many explanatory variables or highly correlated explanatory variables are used, these algorithms may fail to converge, resulting in a failure to estimate the parameters. Another counter-intuitive situation occurs when there is complete separation between the two classes along some linear combination of the explanatory variables, in which case the ML estimates do not exist and the iterations drive some coefficient estimates toward infinity. More information on estimating the parameters of logistic regression is available in Menard (2002).
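As a minimal illustration of the above (a sketch, not the code used for the computations reported in Sect. 4), a binary logistic classifier can be fit in R with glm(). Here two of the iris species introduced earlier serve as a convenient two-class problem; variable names follow the built-in iris data.

```r
# Binary logistic regression for two iris species (illustrative sketch)
iris2 <- droplevels(subset(iris, Species != "setosa"))   # versicolor vs. virginica
fit   <- glm(Species ~ Petal.Length + Petal.Width,
             data = iris2, family = binomial)

# Estimated membership probabilities pi_i1 (soft output)
pi_hat <- predict(fit, newdata = iris2, type = "response")

# Assign each observation to the group with the larger estimated probability
pred <- ifelse(pi_hat > 0.5, levels(iris2$Species)[2], levels(iris2$Species)[1])
table(observed = iris2$Species, predicted = pred)
```

Fitting setosa against the other two species instead would illustrate the complete-separation problem mentioned above: the iterations push some coefficients toward infinity and glm() issues a warning about fitted probabilities numerically equal to 0 or 1.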

A simple extension of logistic regression from binary to multiclass classification is known as multinomial logistic regression. The multinomial logistic model is given by,

$$ {\fontsize{9.2}{11.2}\selectfont \begin{aligned}{\pi}_{ig}=\frac{\exp \left({\beta}_{0g}+{\sum}_{j=1}^p{\beta}_{jg}{x}_{ji}\right)}{1+{\sum}_{g=1}^{G-1}\exp \left({\beta}_{0g}+{\sum}_{j=1}^p{\beta}_{jg}{x}_{ji}\right)},\kern1.00em g=1,2,\dots, G-1\ \mathrm{and}\ i=1,2,\dots, n.\end{aligned} }$$
(3)

Extending the notation used in the binary case to G ≥ 2 groups, we can write π ig = P(y i = g| x i), for i = 1, 2, … , n and g = 1, 2, … , G. Although it does not matter which category is chosen as the baseline, generally category G is used, and π iG can be obtained from the fact that \( {\pi}_{iG}=1-\sum_{g=1}^{G-1}{\pi}_{ig} \). From the point of view of estimation, there are (p + 1)(G − 1) model parameters to be estimated. For estimating these parameters, the ML or the maximum a posteriori (MAP) method is commonly used (Murphy 2012). MAP estimation is similar to ML in the sense that it chooses the value of the parameter that maximizes a mathematical function, in this case the posterior distribution of the parameter itself. Most of the time a closed-form solution is not available, hence iterative algorithms are used for estimation, and IRWLS is a popular choice among practitioners.
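A multinomial logistic classifier of this form can be fit, for example, with multinom() from the nnet package (a package not among those listed in Sect. 4; it is used here only as one convenient implementation, with the iris data as the running example).

```r
# Multinomial logistic regression for G = 3 iris classes (illustrative sketch)
library(nnet)

fit_mn <- multinom(Species ~ Sepal.Length + Sepal.Width +
                     Petal.Length + Petal.Width,
                   data = iris, trace = FALSE)

head(predict(fit_mn, type = "probs"))   # estimated pi_ig for each observation
head(predict(fit_mn, type = "class"))   # class with the largest probability
```

Note that multinom() uses the first factor level as the baseline category rather than category G; as remarked above, the choice of baseline does not affect the fitted probabilities.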

If the G ≥ 2 classes are ordered, i.e., the response variable is ordinal, an alternative popular model often used in practice is the proportional-odds cumulative logit model. For example, consider a typical Likert scale question where responders are asked to grade a certain experience on a scale of 1 to 5, with 1 being the worst rating and 5 the best. It might be of interest to determine whether there exist explanatory variables that can explain how the responders rate their experience. First developed by Snell (1964), this model is given by,

$$ {L}_{ig}=\log \frac{\sum_{c=1}^g{\pi}_{ic}}{\sum_{c=g+1}^G{\pi}_{ic}}={\beta}_{0g}+\sum \limits_{j=1}^p{\beta}_j{x}_{ji},\kern1.00em \mathrm{for}\ g=1,2,\dots, G-1\ \mathrm{and}\ i=1,2,\dots, n. $$
(4)

Here, L ig represents the log-odds of two cumulative probabilities. A manageable total of (G − 1 + p) parameters are to be estimated in this model. Typically, the ML estimates of the parameters are obtained using iterative algorithms such as IRWLS or majorization–minimization (Lange 2016).
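One common implementation of the proportional-odds cumulative logit model is polr() in the MASS package. The sketch below uses simulated Likert-type ratings; the sample size, coefficients, cut points, and seed are illustrative assumptions only.

```r
# Proportional-odds cumulative logit model (illustrative sketch)
library(MASS)

set.seed(1)                                    # hypothetical seed
n   <- 200
x1  <- rnorm(n)
x2  <- rnorm(n)
eta <- 0.8 * x1 - 0.5 * x2                     # assumed linear predictor
rating <- cut(eta + rlogis(n),
              breaks = c(-Inf, -1, 0, 1, 2, Inf),
              labels = 1:5, ordered_result = TRUE)

fit_po <- polr(rating ~ x1 + x2, method = "logistic")
summary(fit_po)                  # the common beta_j and the G - 1 intercepts beta_0g
head(predict(fit_po, type = "probs"))
```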

2.2 Naïve Bayes Classifier

Naïve Bayes (NB) is a family of soft classifiers that uses Bayes' theorem (Bayes 1763) together with the very strong, and often unrealistic, assumption of independence among the explanatory variables. Nevertheless, this classifier works very well even in the presence of dependencies among many categorical explanatory variables (Rish 2001) and is quite fast to execute even with large datasets.

The NB classifier differs from the logistic regression classifier in how the probability π ig is modeled. When using a logistic regression classifier, π ig = P(y i = g| x i) is modeled directly from the data. In contrast, when using a naïve Bayes classifier, estimates of P(x i| y i = g) are obtained first from the data and then, assuming independence among the explanatory variables, π ig is modeled using Bayes' theorem as,

$$ {\pi}_{ig}\propto P\left({y}_i=g\right)\prod \limits_{j=1}^pP\left({x}_{ji}|{y}_i=g\right),\kern0.5em g=1,2,\dots, G\ \mathrm{and}\ i=1,2,\dots, n. $$
(5)

The estimate for P(y i = g) can be obtained from the training set as the proportion of training set observations that belong to class g (g = 1, 2,  … , G). The estimates for P(x ji| y i = g) are typically obtained via ML estimation technique. A new observation is assigned to a group for which probability π ig is maximum among all G groups.

Estimating the parameters from the likelihood function depends on how the likelihood, P(x ji| y i = g), i = 1, 2, … , n, j = 1, 2, … , p, and g = 1, 2, … , G, is modeled parametrically. If X j is a continuous random variable, then the popular choice of distribution is the normal (Gaussian), i.e., \( \left({X}_j|Y=g\right)\sim N\left({\mu}_{jg},{\sigma}_{jg}^2\right) \). If X j is a categorical random variable with m categories, then the most commonly used distribution is the multinomial, i.e., (X j| Y = g)∼Multinomial(1, ϕ 1g,  … , ϕ mg) for one trial, where ϕ lg, l = 1, 2, … , m is the probability associated with the l th category such that \( {\sum}_{l=1}^m{\phi}_{lg}=1 \).

The NB classifier is remarkably effective considering that the assumptions needed to obtain the probabilities are almost always wrong (Hand and Yu 2001). This method is the building block of what is commonly known as a Bayesian spam filter (Nigam et al. 2000) used by email providers. A semi-parametric version of the NB classifier performs much better when the explanatory variables are clearly non-normal (Soria et al. 2011).
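A Gaussian NB classifier of this kind can be fit with naiveBayes() from the e1071 package (the naivebayes package listed in Sect. 4 offers a similar interface); the iris data are again used for illustration in this sketch.

```r
# Gaussian naive Bayes classifier for the iris data (illustrative sketch)
library(e1071)

fit_nb <- naiveBayes(Species ~ ., data = iris)

# Posterior probabilities pi_ig (soft output) and the resulting hard assignments
head(predict(fit_nb, iris, type = "raw"))
table(observed = iris$Species, predicted = predict(fit_nb, iris))
```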

3 Hard Classifiers

Hard classifiers typically do not provide a probability of group membership. In other words, no uncertainty is attached to the classification because the classifier provides a hard boundary between groups, and for exactly this reason some researchers prefer them. The four commonly used classifiers discussed here are linear discriminant analysis, K nearest neighbor, support vector machines, and classification trees.

3.1 Linear Discriminant Analysis

Linear discriminant analysis (LDA) is a hard classification method. The statistical literature indicates that LDA is one of the first methods developed for classification, and its basic idea originated with none other than Fisher (1936). The idea behind LDA is to determine the linear combination of explanatory variables that maximizes the difference between the two classes, making correct classification easier. The generalization of this idea to classification into G (G > 2) classes is credited to Rao (1948). The NB classifier is similar in nature to LDA (Hand and Yu 2001), although in LDA the aim is to obtain a classifier, while in NB the emphasis is on identifying the class with the maximum posterior probability.

Fisher (1936) proposed a classification rule for two groups which involves determining a vector r that maximizes the function \( \delta \left(\mathbf{r}\right)={\left({\mathbf{r}}^{\hbox{'}}\boldsymbol{\Sigma} \mathbf{r}\right)}^{-1}{\left({\mathbf{r}}^{\hbox{'}}\left({\boldsymbol{\upmu}}_1-{\boldsymbol{\upmu}}_2\right)\right)}^2 \) under the assumption that (X| Y = g)∼N(μ g, Σ) for g = 1, 2. This is equivalent to finding a hyperplane that provides a solution to the equation

$$ \mathit{\log}\left[\frac{P\left(Y=1|\mathrm{X}\right)}{P\left(Y=2|\mathrm{X}\right)}\right]=0. $$
(6)

Using Bayes’ rule, we can write,

$$ P\left(Y=g|\mathbf{X}\right)=P\left(Y=g\right)\frac{P\left(\mathbf{X}|Y=g\right)}{P\left(\mathbf{X}\right)}\kern1.25em \mathrm{for}\ g=1,2 $$

where p g = P(Y = g) for g = 1, 2 is the overall class probability and can be estimated from the training data. Under the assumption that the explanatory variables are multivariate normal, the hyperplane can be found by solving the following equation for r,

$$ \mathit{\log}\left[\frac{p_1}{p_2}\right]+{\mathbf{r}}^{\hbox{'}}{\boldsymbol{\Sigma}}^{-1}\left({\boldsymbol{\upmu}}_1-{\boldsymbol{\upmu}}_2\right)-\frac{1}{2}\left({\boldsymbol{\upmu}}_1^{\hbox{'}}{\boldsymbol{\Sigma}}^{-1}{\boldsymbol{\upmu}}_1-{\boldsymbol{\upmu}}_2^{\hbox{'}}{\boldsymbol{\Sigma}}^{-1}{\boldsymbol{\upmu}}_2\right)=0. $$
(7)

The solution to (7) leads to a linear classifier (or a linear boundary between the two groups) because Eq. (6) is a linear function of the explanatory variables. The first step in LDA is to estimate the mean vectors (μ 1 and μ 2) and the variance–covariance matrix (Σ) from the training dataset. For any new observation, x 0, one can then compute the estimate \( {\hat{\Delta}}_{{\mathbf{x}}_0} \) given in (8),

$$ {\hat{\Delta}}_{{\mathbf{x}}_0}=\mathit{\log}\left[\frac{{\hat{p}}_1}{{\hat{p}}_2}\right]+{\mathbf{x}}_0^{\hbox{'}}{\hat{\boldsymbol{\Sigma}}}^{-1}\left({\hat{\boldsymbol{\upmu}}}_1-{\hat{\boldsymbol{\upmu}}}_2\right)-\frac{1}{2}\left({\hat{\boldsymbol{\upmu}}}_1^{\hbox{'}}{\hat{\boldsymbol{\Sigma}}}^{-1}{\hat{\boldsymbol{\upmu}}}_1-{\hat{\boldsymbol{\upmu}}}_2^{\hbox{'}}{\hat{\boldsymbol{\Sigma}}}^{-1}{\hat{\boldsymbol{\upmu}}}_2\right). $$
(8)

Using this \( {\hat{\Delta}}_{{\mathbf{x}}_0} \) value a new observation x 0 is assigned to one of the two groups as follows:

$$ \mathrm{Assign}\ {\mathbf{x}}_0\ \mathrm{to}\ \left\{\begin{array}{l}\mathrm{Group}\ 1\kern1.25em \mathrm{if}\ {\hat{\Delta}}_{{\mathbf{x}}_0}>0\\ {}\mathrm{Group}\ 2\kern1.25em \mathrm{if}\ {\hat{\Delta}}_{{\mathbf{x}}_0}<0.\end{array}\right. $$

In cases where the assumption of homoscedasticity of the variance–covariance matrices is not justified and the more general underlying assumption is that (X| Y = g)∼N(μ g, Σ g) for g = 1, 2, a quadratic classifier (i.e., a quadratic function) is used to describe the boundary between the two classes. This procedure is known as quadratic discriminant analysis (QDA) (Hastie et al. 2001). In QDA, the separating surface can be obtained by solving (9) for r.

$$ {\displaystyle \begin{array}{l}\mathit{\log}\left[\frac{p_1}{p_2}\right]-\frac{1}{2}\mathit{\log}\frac{\left|{\boldsymbol{\Sigma}}_1\right|}{\left|{\boldsymbol{\Sigma}}_2\right|}+{\mathbf{r}}^{\hbox{'}}\left({\boldsymbol{\Sigma}}_1^{-1}{\boldsymbol{\upmu}}_1-{\boldsymbol{\Sigma}}_2^{-1}{\boldsymbol{\upmu}}_2\right)\\[8pt] {}\kern2.5em -\frac{1}{2}{\mathbf{r}}^{\hbox{'}}\left({\boldsymbol{\Sigma}}_1^{-1}-{\boldsymbol{\Sigma}}_2^{-1}\right)\mathbf{r}-\frac{1}{2}\left({\boldsymbol{\upmu}}_1^{\hbox{'}}{\boldsymbol{\Sigma}}_1^{-1}{\boldsymbol{\upmu}}_1-{\boldsymbol{\upmu}}_2^{\hbox{'}}{\boldsymbol{\Sigma}}_2^{-1}{\boldsymbol{\upmu}}_2\right)=0\end{array}} $$
(9)

As can be seen from (7) and (9), QDA requires more parameters to be estimated from the training dataset, roughly (2 + 2p + 2p²) parameters for QDA compared to (2 + 2p + p²) for LDA. That can lead to a serious issue if the training dataset is small. To overcome this issue, Srivastava et al. (2007) proposed an effective Bayesian solution.

A simpler method under the assumption of homoscedastic variance–covariance matrices is to use the Mahalanobis distance (Mahalanobis 1936) for classification. For any new observation, x 0, a linear discriminant function LDF g is computed for each group (see (10)), where μ g and Σ, respectively, denote the unknown mean vector in group g and the common variance–covariance matrix of X.

$$ {LDF}_g\left({\mathbf{x}}_0\right)={\mathbf{x}}_0^{\hbox{'}}{\hat{\boldsymbol{\Sigma}}}_0^{-1}{\hat{\boldsymbol{\upmu}}}_g-\frac{1}{2}{\hat{\boldsymbol{\upmu}}}_g^{\hbox{'}}{\hat{\boldsymbol{\Sigma}}}_0^{-1}{\hat{\boldsymbol{\upmu}}}_g+\log \left({\hat{p}}_g\right) $$
(10)

where \( {\hat{\boldsymbol{\Sigma}}}_0 \) is the pooled estimate of the common variance–covariance matrix Σ and \( {\hat{p}}_g \) is the estimate of probability p g = P(Y = g) for g = 1, 2, … , G obtained from the training data. Then the new observation x 0 is assigned to the group with the highest discriminant function value, i.e., the group corresponding to max{LDF g(x 0), g = 1, 2,  … , G}.

Although LDA is quite effective in many situations (Hand 2006), in some cases the joint pdf of the explanatory variables differs considerably from the multivariate normal distribution. In such cases, the semi-parametric LDA technique derived by Mai and Zou (2015) under the assumption of a sparse variance–covariance matrix is more effective.
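As a brief illustration (a sketch using the iris data, not the exact computations of Sect. 4), both LDA and QDA are available in the MASS package; predict() returns the assigned class together with the estimated posterior probabilities.

```r
# LDA and QDA on the iris data with MASS (illustrative sketch)
library(MASS)

fit_lda <- lda(Species ~ ., data = iris)   # common Sigma assumed
fit_qda <- qda(Species ~ ., data = iris)   # group-specific Sigma_g

pred_lda <- predict(fit_lda, iris)
table(observed = iris$Species, predicted = pred_lda$class)
head(pred_lda$posterior)                   # estimated P(Y = g | x)
```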

3.2 K Nearest Neighbor

Although it is assumed to have originated long ago, the history of K (1 < K < n) nearest neighbor (KNN) classification is not well documented. In modern times, Sebestyen (1962) described this method as a proximity algorithm and Nilsson (1965) called it the minimum distance classifier. Cover and Hart (1967) were the first to call this algorithm the nearest neighbor rule, and that name became popular.

Although mostly used as a hard classifier, KNN can be used as a soft classifier too. The idea behind KNN is quite simple and no parametric assumption is required. Given a training dataset of size n(n > K), this classification algorithm starts when a new observation, x 0 = (x 01,  … , x 0p), is recorded with known values for all explanatory variables but unknown class. The first step is to calculate the K nearest neighbors in terms of the explanatory variables. Using some well-defined distance measure, distance d i = d(x 0, x i), i = 1, 2, … , n between this new observation and each observation from the training dataset is calculated and these distances are ordered as d (1) ≤ d (2) ≤  …  ≤ d (n). Considering the lowest K distances, {d (1), d (2),  … , d (K)}, the class membership of these closest K neighbors in the training dataset is determined. Then the new observation is placed in the class that has the largest number of these K neighbors. For example, suppose k g of the nearest K neighbors belong to group g(g = 1, 2,  … , G) such that \( {\sum}_{g=1}^G{k}_g=K \), then the new observation is placed in the group c if k c =  max {k g, g = 1, 2,  … , G}. Note that there is a possibility that no such unique maximum exists for a given new observation and a chosen K, thus resulting in ties. Although not exactly a group inclusion probability, these nearest neighbors can be used to provide a group membership indicator of the new observation using relative fractions (k g/K), g = 1, 2, … , G.

Now the question is: how should the value of K, the number of nearest neighbors, be chosen? Given a large dataset, one can always use cross-validation and choose the K value corresponding to the lowest misclassification rate in the validation dataset. Note that choosing too small a value of K means that the space generated by the explanatory variables is divided into many small subspaces, and the class membership of a new observation depends on which subspace it falls in. In that case, outliers in the training dataset can distort the predicted class of a new observation that lies close to an outlier, resulting in higher variance in prediction. Choosing a large value of K, on the other hand, divides the training data space into G smooth subspaces, which in turn leads to misclassification of observations near the edges of these subspaces and hence higher bias in prediction. As a rule of thumb, \( K=\sqrt{n} \) is considered a sensible choice for the number of neighbors in practice. If the number of groups in the data is 2 (i.e., G = 2), then K should be an odd number to avoid ties in the group membership indicators.

The most popular choice for a distance measure is the Euclidean distance which for a new observation, x 0 = (x 01,  … , x 0p), is calculated as,

$$ {d}_i=\sqrt{\sum \limits_{j=1}^p{\left({x}_{0j}-{x}_{ji}\right)}^2},\kern0.5em i=1,2,\dots, n. $$
(11)

Some other distance measures commonly used in practice are the Hamming distance (Hamming 1950) and the Chebyshev distance (Grabusts 2011). Chomboon et al. (2015) examined eleven different distance measures and found that the Euclidean, Chebyshev, and Mahalanobis distances perform well. For a synopsis of different distance measures, see Mulekar and Brown (2014). Different explanatory variables tend to have different ranges of possible values, and some distance measures such as the Euclidean distance are affected by the scale of measurement. Hence, in practice, datasets are typically normalized before classification to reduce the influence of explanatory variables with larger ranges. When a dataset has a large number of explanatory variables, a dimension reduction technique such as principal component analysis (PCA) is often used to reduce the computation time. To overcome the problem of choosing a value of K, Samworth (2012) suggested a weighted nearest neighbor algorithm in which, instead of choosing K nearest neighbors (i.e., essentially assigning a weight of 1/K to the K nearest neighbors and 0 to the remaining observations in the training dataset when assigning a class to the new observation), all observations in the training dataset are assigned a weight using some optimal weighting scheme. When dealing with very large datasets, the approximate nearest neighbor method proposed by Har-Peled et al. (2012) is useful.
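The sketch below illustrates these points with knn() from the class package on the iris data: the explanatory variables are normalized, the rule-of-thumb K ≈ √n is used, and the split into 99 training and 51 validation observations mirrors Example 1 of Sect. 4 (the seed is an arbitrary choice).

```r
# K nearest neighbor classification with the class package (illustrative sketch)
library(class)

set.seed(2020)                        # hypothetical seed
idx   <- sample(nrow(iris), 99)       # 99 training and 51 validation observations
train <- scale(iris[idx, 1:4])        # normalize the explanatory variables
valid <- scale(iris[-idx, 1:4],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))

K <- round(sqrt(nrow(train)))         # rule-of-thumb choice of K
pred_knn <- knn(train, valid, cl = iris$Species[idx], k = K)
mean(pred_knn != iris$Species[-idx])  # misclassification rate on the validation set
```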

3.3 Support Vector Machine

Support vector machine (SVM) is a class of hard classifiers. For binary classification with p explanatory variables, an SVM classifier constructs a (p − 1)-dimensional hyperplane in the p-dimensional space of the explanatory variables so as to maximize the margin. Here, the margin refers to the distance between the separating hyperplane and the observations closest to it in each group. Points on or closest to the boundary of the decision surface are called support vectors, and they are used in learning the model associated with the classification algorithm. The idea behind SVM is to find, among the infinitely many hyperplanes that can separate two completely separable groups, the one that provides the maximum margin from the support vectors. For binary classification, one such hyperplane, known as the maximum margin hyperplane, is constructed. For G > 2 groups, more than one maximum margin hyperplane needs to be created to separate the groups, and a combination of these hyperplanes is used for the classification of a new observation.

Consider the case of binary classification, and assume that there actually exists a linear hyperplane of the form

$$ W\left(\mathbf{X}\right)={w}_0+\sum \limits_{j=1}^p{w}_j{X}_j $$
(12)

that can perfectly differentiate between the two classes. Then the method described by Vapnik and Lerner (1963) can be used to find a maximum margin hyperplane, i.e., the hyperplane defined by W(X) = 0 with the largest margin. In SVM, only the support vectors obtained from the training data are used to estimate the coefficients of the explanatory variables in (12). Since the decision surface differentiates the classes completely, the linear function in (12) is positive for one group and negative for the other. Without loss of generality, assume that for the support vector(s) in group 1, \( \hat{W}(X)=-1 \), and for those in group 2, \( \hat{W}(X)=1 \). In order to maximize the margin, it is sufficient to minimize \( {\sum}_{j=1}^p{w}_j^2 \) subject to \( {v}_i\hat{W}\left({\mathbf{x}}_i\right)\ge 1,i=1,2,\dots, n \), where v i =  − 1 if y i = 1 and v i = 1 if y i = 2. This hyperplane can be obtained by minimizing the Lagrangian formulation,

$$ L=\frac{1}{2}\sum \limits_{j=1}^p{w}_j^2-\sum \limits_{i=1}^n{a}_i\left({v}_iW\left({\mathbf{x}}_i\right)-1\right) $$

where a i (i = 1, 2,  … , n) are Lagrange multipliers. Once this hyperplane is estimated, \( \hat{W}\left({\mathbf{x}}_0\right) \) is computed for any new observation x 0, and the new observation is assigned to group 1 if \( \hat{W}\left({\mathbf{x}}_0\right)<0 \) and to group 2 if \( \hat{W}\left({\mathbf{x}}_0\right)>0 \).

In many practical situations, a perfectly differentiating hyperplane does not exist. For such situations, Cortes and Vapnik (1995) proposed a modification of the maximum margin hyperplane to differentiate between the two groups. They proposed estimating the hyperplane with the help of a hinge loss function, \( h\left(\mathbf{x}\right)=\max \left(0,1-v\hat{W}\left(\mathbf{x}\right)\right) \). Note that unlike the linearly separable case, where \( {\nu}_i\hat{W}\left({\mathbf{x}}_i\right)\ge 1 \) \( \forall i \), in linearly non-separable cases \( {\nu}_i\hat{W}\left({\mathbf{x}}_i\right)<1 \) for some observations. The hinge loss is positive for such observations, and this is used to penalize them while estimating the decision boundary. Thus, instead of minimizing \( {\sum}_{j=1}^p{w}_j^2 \) subject to \( {v}_i\hat{W}\left({\mathbf{x}}_i\right)\ge 1,i=1,2,\dots, n \), the function

$$ \theta \sum \limits_{j=1}^p{w}_j^2+\frac{1}{n}\sum \limits_{i=1}^n\max \left(0,1-{v}_i\hat{W}\left({x}_i\right)\right) $$
(13)

is minimized, where θ is a penalty parameter. This is known as the soft-margin version of SVM, although it is not a soft classifier.

Since a linear classifier does not always exist, researchers extended the idea of SVM to find non-linear classifiers. Using ideas first promoted by Aizerman et al. (1964), Boser et al. (1992) proposed the use of kernelization to obtain a non-linear classifier that improves upon the classifier obtained via linear SVM. The trick is to use a transformation Z of X such that the transformed explanatory variables Z(X) provide a better classifier than the one provided by X, and then to obtain an SVM based on the transformed explanatory variables Z. A carefully chosen transformation Z can possibly result in a linear classifier in the transformed space. Note that Z is not observed or calculated from the data; it is replaced by the kernel function κ such that \( \kappa \left({\mathbf{x}}_{\boldsymbol{i}},{\mathbf{x}}_{\boldsymbol{l}}\right)={\mathbf{z}}_{\boldsymbol{i}}^{\hbox{'}}{\mathbf{z}}_{\boldsymbol{l}} \) for i ≠ l = 1, 2, … , n. There are many kernel functions in use, but the most commonly used, the Gaussian kernel (Schölkopf et al. 1997), is given by,

$$ \kappa \left({x}_i,{x}_l\right)=\exp \left(-\xi \sum \limits_{j=1}^p{\left({x}_{ij}-{x}_{lj}\right)}^2\right). $$

To develop a multiclass SVM classifier (i.e., for G > 2), a few options are available. In the method known as one-against-all, G SVM classifiers are obtained, one for each class against the rest, and a new observation is assigned to the class chosen by the maximum number of these classifiers (Bottou et al. 1994). In another method known as one-against-one (Kressel 1998), G(G − 1)/2 SVM classifiers, each separating a pair of classes, are obtained, and a new observation is assigned to the class predicted by the most classifiers. Hsu and Lin (2002), who compared their performance, concluded that one-against-one performs better than one-against-all in most of the situations they studied. Many modifications of SVM have been proposed by researchers from different fields that work better in specific situations. Typically, SVM works very well when there is good separation between the classes or when the number of explanatory variables is large compared to the sample size of the training dataset. SVM is not computationally efficient for very large training datasets.
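An SVM with the Gaussian (radial) kernel can be fit with svm() in the e1071 package, as sketched below for the iris data; the cost and gamma values are illustrative tuning choices only (cost plays a role roughly analogous to the inverse of the penalty θ in (13), and gamma to ξ).

```r
# Support vector machine with a Gaussian (radial) kernel (illustrative sketch)
library(e1071)

fit_svm <- svm(Species ~ ., data = iris, kernel = "radial",
               cost = 1, gamma = 0.25)       # illustrative tuning values

pred_svm <- predict(fit_svm, iris)
table(observed = iris$Species, predicted = pred_svm)

# For G > 2, e1071/libsvm builds the G(G - 1)/2 one-against-one classifiers internally
```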

3.4 Classification Trees

Classification trees (CT) are methods that partition the space of explanatory variables into disjoint subsets and assign a class to each subset by minimizing some measure of misclassification, also known as impurity. It is a visually appealing method and can be described easily and effectively to non-technical audiences. CT can handle large datasets as well as missing data, and it can easily ignore uninformative explanatory variables. However, depending on the dataset, CT can sometimes produce a poor partition of the space of explanatory variables, leading to high misclassification rates.

CT produces a flowchart or tree structure starting with a root node (one explanatory variable) and then proceeds with splits (internal nodes) until no further split is deemed necessary (leaf nodes). Each leaf node is assigned to a class. There are many algorithms for selecting the root node, splitting a node, deciding how many splits of each node are needed, and deciding when to stop splitting a node and make it a leaf node. An example of a classification tree is shown in Fig. 3. It shows the classification of a random sample of n = 78 observations from the iris data of R.A. Fisher (Anderson 1935), obtained using JMP 12, into one of the three classes using two explanatory variables, petal-width and petal-length.

Fig. 3
figure 3

Classification tree using a random sample of iris data (n = 78)
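A tree similar to the one in Fig. 3 can be grown with rpart in R, as sketched below; Fig. 3 was produced with JMP 12, so the sample and splits need not match exactly, and the seed is arbitrary.

```r
# A classification tree for a random subsample of the iris data (illustrative sketch)
library(rpart)

set.seed(42)                                  # hypothetical seed
samp   <- iris[sample(nrow(iris), 78), ]      # random sample of n = 78
fit_ct <- rpart(Species ~ Petal.Width + Petal.Length,
                data = samp, method = "class")

print(fit_ct)                                 # splits, node sizes, and leaf classes
table(observed  = samp$Species,
      predicted = predict(fit_ct, samp, type = "class"))
```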

The root node is typically chosen as the explanatory variable that provides the lowest rate of misclassification. This is easily achieved when the number of explanatory variables is small. For example, let there be two classes (G = 2) and one explanatory variable (p = 1). Consider the rule based on two complementary subgroups A g created by a split on the explanatory variable such that y i = g if x 1i ∈ A g, g = 1, 2. For each split, a Gini impurity measure is computed as,

$$ I(CT)=\sum \limits_{g=1}^2\left(1-\sum \limits_{j=1}^2{\left(\widehat{q_g(j)}\right)}^2\right) $$

where

$$ \widehat{q_g(j)}=\frac{\sum \limits_{i=1}^nI\left({y}_i=j;{x}_{1i}\in {A}_g\right)}{\sum \limits_{i=1}^nI\left(\ {x}_{1i}\in {A}_g\right)} $$

is the proportion of observations in subgroup A g that belong to class j, and the splits are chosen such that the Gini impurity measure is minimized (Witten et al. 2011). As shown in Fig. 3, the LogWorth of each split, defined as −log10(p‐value), is another measure used to decide where to split. The p‐value can be based on a chi-square test, and the split that maximizes the LogWorth value is chosen.

The first classification tree algorithm was proposed by Messenger and Mandell (1972). However, Breiman et al. (1984) provided what became the most popular classification tree algorithm, namely classification and regression trees (CART). Several improved versions of it were proposed later and are still used in practice. Many modifications of the CART method have been proposed for various reasons, but mainly because CART produces biased and high-variance trees, i.e., changing the training set can drastically change the tree diagram. A few Bayesian versions of this algorithm are also available in the literature (Chipman et al. 1998; Denison et al. 1998). Loh (2009) provides a classification algorithm called GUIDE, which is computationally faster and incorporates the nearest neighbor algorithm to improve the CT.

To reduce the variance of CT, two methods were proposed: bagging (Breiman 1996) and random forests (Breiman 2001). Note that these methods do not produce a single tree diagram; instead, they grow many classification trees from the training data, and a new observation is placed in the class suggested by the majority of these trees. Bagging is achieved simply by growing trees on bootstrap samples drawn with replacement from the training data. Random forests are similar to bagging, but in each tree only a randomly chosen subset of, typically, \( \sqrt{p} \) explanatory variables is considered when determining the nodes.
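Both ideas are conveniently available in the randomForest package (not among the packages used in Sect. 4); in the sketch below, setting mtry to the full number of explanatory variables reduces the random forest to bagging.

```r
# Bagging and a random forest as ensembles of classification trees (illustrative sketch)
library(randomForest)

# Random forest: about sqrt(p) variables tried at each split (the classification default)
fit_rf  <- randomForest(Species ~ ., data = iris, ntree = 500)

# Bagging: the same algorithm with all p = 4 variables considered at every split
fit_bag <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 4)

fit_rf$confusion    # out-of-bag confusion matrix and class error rates
```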

Freund and Schapire (1997) introduced the concept of boosting, which aims to reduce the bias of a CT. This is achieved by refitting trees with higher weights for misclassified data points. When the first CT is fit, all data points are given equal weight. The weights are then updated after each iteration, with higher weights assigned to the misclassified observations in the impurity measure. The final classifier is a weighted average of the trees.

4 Assessment and Comparison of the Performance of Classifiers

The performance of a classifier is typically judged by cross-validating the classification rule with a separate dataset of size s, called the validation data. Sometimes cross-validation is also used to choose unknown tuning parameters such as the number of neighbors in the KNN method. In the absence of separate validation data, the idea of Jackknife sampling (Quenouille 1949, 1956; Tukey 1958) is used to obtain a K-fold cross-validation. In this special case of cross-validation, the training dataset is divided into K smaller datasets of roughly equal size; (K − 1) of them are used as the training data and the remaining one as the validation dataset. This process is repeated K times so that each of them is used exactly once as the validation data.

The simplest performance measure of a classifier is the misclassification rate R (0 ≤ R ≤ 1), which is the proportion of the validation sample that is misclassified. Hence, a small value of R (R → 0) indicates more accurate classification. Although a very simple metric, it is an effective measure of performance. It works well as long as the costs of misclassification for, and the sample sizes from, all classes are relatively similar. If the sample sizes differ considerably, then the use of an uncertainty coefficient is recommended (Mills 2011). The uncertainty coefficient U is calculated as U = (H − H c)/H (0 ≤ U ≤ 1), where

$$ H=-\sum \limits_{g=1}^GP\left(Y=g\right)\log \left(P\left(Y=g\right)\right) $$

and

$$ {H}_c=-\sum \limits_{g=1}^G\sum \limits_{l=1}^GP\left(Y=g,\hat{Y}=l\right)\log \left(P\left(Y=g|\hat{Y}=l\right)\right) $$

can be estimated from the training data. Larger values of U indicate a better classifier. If the accuracy of classification for only one class is of primary importance, then one can calculate the sensitivity (also known in medicine as the true positive rate and in machine learning as the recall) for that class. Sensitivity also takes values between 0 and 1, and a good classifier is expected to have high sensitivity. The sensitivity for class g can be calculated as,

$$ {sen}_g=\frac{\sum_{i=1}^nI\left({y}_i=g,\widehat{y_i}=g\right)}{\sum_{i=1}^nI\left({y}_i=g,\widehat{y_i}=g\right)+{\sum}_{i=1}^nI\left({y}_i=g,\widehat{y_i}\ne g\right)} $$

where I is the indicator variable taking values 1 or 0 depending on whether the condition is satisfied or not.
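These quantities are straightforward to compute once a classifier has been cross-validated; the sketch below does so for LDA on the iris data with a K-fold split (the helper functions, fold assignment, and seed are illustrative choices, not the procedure used in Sect. 4).

```r
# Misclassification rate, per-class sensitivity, and K-fold cross-validation (sketch)
library(MASS)

misclass_rate <- function(obs, pred) mean(obs != pred)
sensitivity_g <- function(obs, pred, g) sum(obs == g & pred == g) / sum(obs == g)

set.seed(99)                                   # hypothetical seed
K    <- 5
fold <- sample(rep(1:K, length.out = nrow(iris)))
rate <- numeric(K)
for (k in 1:K) {
  fit     <- lda(Species ~ ., data = iris[fold != k, ])
  pred    <- predict(fit, iris[fold == k, ])$class
  rate[k] <- misclass_rate(iris$Species[fold == k], pred)
}
mean(rate)                                     # K-fold estimate of R

# Sensitivity for one class on a single training/validation split
pred1 <- predict(lda(Species ~ ., data = iris[fold != 1, ]), iris[fold == 1, ])$class
sensitivity_g(iris$Species[fold == 1], pred1, "versicolor")
```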

Besides misclassification rate, sensitivity, and uncertainty coefficient, there are many other performance measures available to judge classification methods, a detailed discussion of which is provided by Hand (2012).

Some articles dedicated to the comparison of different classification methods are available. The earlier research was mostly focused on comparing logistic regression against LDA (Hosmer et al. 1983; McLachlan and Byth 1979). Their findings indicate that LDA performs better when the explanatory variables are normally distributed, but the advantage diminishes as the sample size becomes larger. Meshbane and Morris (1996) recommended that QDA be used instead of LDA if the distributions of the explanatory variables are skewed. After comparing classification trees and KNN, Liu and White (1995) concluded that KNN performs better than a classification tree unless the number of explanatory variables is large. A study by Bhattacharya et al. (2011) compared SVM to logistic regression for detecting credit card fraud and found no advantage in using a more complicated method like SVM over simpler logistic regression. Finch and Schneider (2006) compared the performance of logistic regression, discriminant analysis, and classification trees on simulated data, while Kiang (2003) compared logistic regression, LDA, KNN, and classification trees on a separate set of simulated data. Asparoukhov and Krzanowski (2001) compared all the classifiers mentioned in this paper except SVM using five real-life datasets for binary classification; they also discussed the effect of the size of the training set and of the number of explanatory variables. Steel et al. (2000) argue that simply comparing methods is not completely meaningful unless the model selection process (i.e., the choice of explanatory variables in the final prediction model) is included. The only conclusion that can be drawn from all these studies is that no single method works most effectively in every situation; the performance of a classifier depends very much on the dataset being classified.

In this section, we compare the performance of the six classifiers discussed earlier using two simulated and three real-life datasets with respect to misclassification rates and the uncertainty measure U. All computations were done using the available R packages rpart, e1071, class, naivebayes, and MASS. Of the two simulated datasets, one is visually separable, albeit not linearly, while the other is not separable using any reasonable curve and provides a challenge in terms of classification (see Fig. 4). Both datasets have two explanatory variables, presented in the scatterplots in Fig. 4, where each class is represented by a separate point type and color. For the NB classifier, Gaussian (normal) distributions were assumed for the explanatory variables. For the KNN classifier, the smallest odd integer larger than \( \sqrt{n} \) was used as the value of K, except in one real data example (the skin data), where this value was too large because of the size of the dataset. The observed misclassification rates for the six methods and five examples are listed in Table 1, and the uncertainty coefficients are listed in Table 2.

Fig. 4
figure 4

Visualization of the simulated datasets

Table 1 Comparison of misclassification rates for different classifiers for five datasets
Table 2 Comparison of uncertainty measure for different classifiers for five datasets

Example 1 (Iris)

Consider the famous iris dataset of R.A. Fisher (Anderson 1935) described in the Introduction. Fifty observations are available for each type of iris. Of the 150 measurements available, a training dataset of 99 observations was created with 33 observations from each of the three groups. The misclassification rate was estimated based on the remaining 51 observations, which constituted the validation sample. Misclassification rates lower than 0.10 (Table 1) and uncertainty measures over 0.70 (Table 2) show that all the methods did a commendable job of correct classification for these data. However, NB, KNN, and LDA are slightly better than the other classifiers.

Example 2 (Skin)

Refer to the skin segmentation dataset from the UCI machine learning repository (Bhatt et al. 2009). This dataset contains 245,057 observations randomly sampled from photos of faces of people of different age groups, genders, and skin colors. Of those, 50,589 observations are samples of skin, while the rest are samples of non-skin parts of the face. The three explanatory variables in this example form the RGB triplet, i.e., the red, green, and blue color values used in displaying images. RGB values are typically given as integers in the range 0–255, and together they determine the color of the sampled image region. The distribution of RGB pixels for the skin data is presented in a 3-dimensional plot in Fig. 5. Without rotating the plot around the three axes, it is difficult to tell whether there is a clear distinction or some overlap between the two groups, skin and non-skin. A training sample of size 150,000 was used, of which 30,000 were skin samples. Tables 1 and 2 show that KNN and SVM perform best for these data, followed by NB and CT. To save computation time, K = 19 was used for the KNN algorithm.

Fig. 5
figure 5

Visualization of RGB pixel distribution for skin data

Example 3 (Glass)

Consider the glass dataset from the UCI machine learning repository (Lichman 2013). The original dataset describes six types of glass samples along with the refractive index and the weight percent of oxides of sodium, magnesium, aluminum, silicon, potassium, calcium, barium, and iron in each sample. Since some of the classes have small sample sizes, only three types of glass were used for classification in this example: float-processed building window glass (70 measurements), nonfloat-processed building window glass (76 measurements), and non-window headlamp glass (29 measurements). Fifty samples from each of the two classes of building window glass and 20 samples from the headlamp glass were used as the training data. The iterative estimation of the multinomial logistic regression parameters failed to converge for this dataset. The outcomes in Tables 1 and 2 indicate that KNN performs best, followed by CT and LDA.

Example 4 (Bivariate Uniform)

Consider two independent uniform random variables, namely X i ∼ Uniform(−1.25, 1.25) for i = 1, 2. A sample of 3000 observations was generated with seed 1234. With the unit circle providing the class boundary, the ith observation is assigned to group 1 if \( {x}_{1i}^2+{x}_{2i}^2\le 1 \) and to group 2 otherwise. The first 2000 observations generated were used as the training data and the remaining 1000 as the validation sample. In the training dataset, 1012 observations were from group 1 and the remaining 998 from group 2. In the validation dataset, 520 observations were from group 1 and the remaining 480 from group 2. Note that in this situation a linear classifier is not expected to perform well, because the true boundary between the classes is a circle rather than a straight line (and the explanatory variables are far from normal); this is reflected in the misclassification rates listed in Table 1 and the uncertainty measures listed in Table 2. Logistic regression and LDA seem to be only as good as a coin toss in this situation, whereas KNN and SVM perform admirably well.
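For readers who wish to reproduce a dataset of this kind, the sketch below generates bivariate uniform data with a circular class boundary and compares a linear and a non-linear classifier on the validation half; the exact group counts and error rates depend on the random number stream, so they need not match Tables 1 and 2.

```r
# Bivariate uniform data with a circular class boundary (illustrative sketch)
library(class)

set.seed(1234)
n  <- 3000
x1 <- runif(n, -1.25, 1.25)
x2 <- runif(n, -1.25, 1.25)
group <- factor(ifelse(x1^2 + x2^2 <= 1, 1, 2))
dat   <- data.frame(x1, x2, group)
train <- dat[1:2000, ]
valid <- dat[2001:3000, ]

# A linear classifier (logistic regression) versus KNN on the validation sample
fit_lr  <- glm(group ~ x1 + x2, data = train, family = binomial)
pred_lr <- ifelse(predict(fit_lr, valid, type = "response") > 0.5, 2, 1)
pred_kn <- knn(train[, 1:2], valid[, 1:2], cl = train$group, k = 45)
c(logistic = mean(pred_lr != valid$group), knn = mean(pred_kn != valid$group))
```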

Example 5 (Bivariate Normal)

Now consider two bivariate normal populations. A sample of 1500 observations from each of two homoscedastic bivariate normal distributions that differ only in their mean vectors was generated using the mvtnorm package in R with seed 5678, resulting in a total sample of size 3000. The difference in the mean vectors and the common variance–covariance matrix used in the simulation were, respectively,

$$ {\mu}_1-{\mu}_2=\left[\begin{array}{l}1.0\\ {}1.0\end{array}\right]\kern1em \mathrm{and}\kern1em \Sigma =\left[\begin{array}{cc}1.0& 0.5\\ {}0.5& 1.0\end{array}\right]. $$

The first 1000 observations from each group were used as the training sample (total size 2000) and the remaining 500 from each group as the validation sample (total size 1000). Tables 1 and 2 show KNN as the clear winner, while the other classifiers are almost equally bad. Although we expected LDA to perform better, to our surprise the results say otherwise.
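A dataset of this form can be generated with rmvnorm() from the mvtnorm package as sketched below; since only the difference μ 1 − μ 2 is specified above, the individual mean vectors (1, 1)′ and (0, 0)′ are an illustrative assumption.

```r
# Two homoscedastic bivariate normal classes (illustrative sketch)
library(mvtnorm)

set.seed(5678)
Sigma <- matrix(c(1.0, 0.5, 0.5, 1.0), nrow = 2)
g1 <- rmvnorm(1500, mean = c(1, 1), sigma = Sigma)   # assumed mu_1
g2 <- rmvnorm(1500, mean = c(0, 0), sigma = Sigma)   # assumed mu_2, so mu_1 - mu_2 = (1, 1)'

dat <- data.frame(rbind(g1, g2), group = factor(rep(1:2, each = 1500)))
names(dat)[1:2] <- c("x1", "x2")

# First 1000 rows of each group for training, remaining 500 of each for validation
train <- dat[c(1:1000,    1501:2500), ]
valid <- dat[c(1001:1500, 2501:3000), ]
```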

5 Concluding Remarks

In this paper, the basic ideas that dominate the world of statistical classification were described. Detailed discussions of them are scattered in different textbooks, but none discusses them all together. For example, logistic regression is discussed in detail by Kleinbaum and Klein (2010), LDA by McLachlan (2004), SVM by Steinwart and Christmann (2008), classification trees by Breiman et al. (1984), and different classification methods by Izenman (2008) and James et al. (2013).

For data with highly correlated explanatory variables or a large number of explanatory variables, the use of a dimension reduction technique such as principal component analysis, a low variance filter, or a high correlation filter before classification is recommended (Farcomeni and Greco 2015). In cases where p > n, dimension reduction becomes necessary. Alternatively, although the random forests method is not a dimension reduction technique for the explanatory variables, in cases where \( \sqrt{p}<n \) it can be used effectively without reducing the dimension of the explanatory variables.

A very basic question on this topic is whether any particular classification method should be preferred over the others. It depends on the circumstances; no single method stands out as the best. Typically, for complex problems in which the misclassification rate is high for all classifiers, the use of soft classifiers is recommended. However, hard classifiers remain popular because their outcomes are easier to interpret in practice. Hard classifiers like SVM and KNN also generally provide good outcomes, as seen in the situations discussed here. In this age of computation, recent research emphasizes effective ways of implementing bagging and random forests (James et al. 2013), which can be computationally more efficient than other classifiers like KNN. Liu et al. (2011) describe a large-margin unified machine that combines margin-based hard and soft classifiers and note that hard classifiers tend to perform better than soft classifiers when the classes are easily separable or when the training sample size is relatively small compared to the number of explanatory variables.

Research over the years has led to the development of many classifiers. As a result, the toolbox from which a classifier can be chosen provides an extensive list of options, which to some extent depends on the software used by, and the computing power available to, the researcher. The comparative performance of different classifiers also changes with changing technology, and the results of past studies might lead to different conclusions with current technology. Thus, one can even entertain the idea of using all available classifiers and assigning a new observation to the class chosen by most of them.