1 Introduction

Consider the following standard two-class classification problem. Let \((\mathbf{Z}, Y)\) be a random pair with an underlying distribution \(F_{\mathbf{Z},Y}\), where \(\mathbf{Z}\in \mathbb {R}^s\), \(s\ge 1\), is a vector of observed covariates, and \(Y\in \{0,1\}\) is the unobserved class membership of \(\mathbf{Z }\). The problem is then to predict Y based on \(\mathbf{Z}\). More specifically, in classification, one seeks to find a function (classifier) \(\psi :\mathbb {R}^s \rightarrow \{0,1\}\) for which the misclassification error probability, \(P\{\psi (\mathbf{Z })\ne Y\}\), is as small as possible (Devroye et al. 1996). The optimal or best classifier, denoted by \(\psi _B\), is given by

$$\begin{aligned} \psi _B(\mathbf{z })= {\left\{ \begin{array}{ll} 1 \quad \text { if } E(Y\vert \mathbf{Z }=\mathbf{z } )>\frac{1}{2}\\ 0 \quad \text { otherwise, } \end{array}\right. } \end{aligned}$$
(1)

and is sometimes called the Bayes classifier or the Bayes decision rule in the literature on statistical classification [see, for example, (Devroye 1981; Devroye et al. 1996; Györfi et al. 2002)].
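To make the rule in (1) concrete, here is a minimal sketch (ours, not part of the original article) of the Bayes classifier for a toy model in which the class-conditional densities of \(\mathbf{Z}\) are known bivariate Gaussians with equal class probabilities; in that case \(E(Y\vert \mathbf{Z}=\mathbf{z})\) is simply the posterior probability of class 1. All parameter values and function names below are illustrative assumptions.

```python
# Illustrative sketch (not from the paper): the Bayes rule (1) when F_{Z,Y} is fully known.
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical toy model: Z | Y=j is bivariate Gaussian, P{Y=1} = P{Y=0} = 0.5.
f1 = multivariate_normal(mean=[1.0, 1.0], cov=np.eye(2))  # density of Z given Y = 1
f0 = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))  # density of Z given Y = 0

def psi_B(z):
    """Bayes rule (1): classify as 1 iff E(Y | Z=z) = P{Y=1 | Z=z} > 1/2."""
    posterior1 = 0.5 * f1.pdf(z) / (0.5 * f1.pdf(z) + 0.5 * f0.pdf(z))
    return int(posterior1 > 0.5)

print(psi_B([1.2, 0.9]))   # 1: the point is closer to the class-1 mean
print(psi_B([-0.3, 0.1]))  # 0
```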

When the distribution \(F_{\mathbf{Z},Y}\) is completely known, finding \(\psi _B\) does not pose a challenge. In practice, however, \(F_{\mathbf{Z},Y}\) is usually unknown (fully or partially), and thus finding \(\psi _B\) directly is virtually impossible. One only has access to a random sample (data), denoted by \(D_n=\{(\mathbf{Z}_i, Y_i); i=1,\ldots , n\}\), where the \((\mathbf{Z}_i, Y_i)\) are independent and identically distributed (iid) according to \(F_{\mathbf{Z},Y}\). The goal is then to construct a sample-based classifier \(\psi _n\) whose error rate is in some sense small. A data-based rule \(\psi _n\) is said to be strongly consistent if

$$\begin{aligned} P\{\psi _n({ \mathbf Z})\ne Y\,|\,D_n\} \overset{a.s.}{\longrightarrow }P\{\psi _B(\mathbf{Z})\ne Y\}. \end{aligned}$$

If the convergence holds in probability, we say \(\psi _{n}\) is weakly consistent. Several nonparametric classifiers such as general partitioning methods (Glick 1973; Gordon and Olshen 1978, 1980), nearest neighbor rules (Devroye and Wagner 1982), and kernel rules (Devroye and Krzyzak 1989; Devroye and Wagner 1980; Krzyzak 1986) have been proposed in the literature with strong consistency properties.

In this paper we focus on the important problem where there may be missing covariates in \(\mathbf{Z}_i, ~i=1,\ldots ,n\), and in the new unclassified covariate vector \(\mathbf{Z}\). We consider kernel classifiers and propose methods to estimate the smoothing parameters of the kernels when missing covariates are present.

Much of the research on statistical estimation in the presence of missing covariates has relied on assumptions such as missingness at random (MAR); see, for example, Hu and Zhang (2012), Hirano and Ridder (2003), Wang et al. (2004), and Hazelton (2000). Roughly speaking, the MAR assumption requires that the probability that a covariate is missing does not depend on the (possibly unobserved) value of that covariate itself, given the observed data. In classification, most of the existing results only deal with the case where covariates can be missing in the data (i.e., in \(\mathbf{Z}_i,~i=1, \ldots , n\)), but not in the new observation \(\mathbf{Z}\), which has to be classified; see, for example, Chung and Han (2000) for the parametric case and Pawlak (1993) for the nonparametric case. One source of difficulty is the fact that the optimal classifier in (1), which corresponds to the case with no missing covariates, is not necessarily the best when there are missing covariates in the new unclassified observation \(\mathbf{Z }\). In fact, Mojirsheibani and Montazeri (2007) derive the optimal classifier in the presence of missing covariates, which is in general different from (1) and which works without imposing any MAR-type assumptions. Another representation of this classifier is given in Mojirsheibani (2012). Furthermore, Mojirsheibani and Montazeri (2007) propose nonparametric kernel estimators of the optimal classifier for this setup. Although these authors establish the strong consistency of their proposed kernel classifiers, they do not provide any directions as to how to estimate the unknown smoothing parameter of the kernels used. Finding good data-driven estimates of the kernel smoothing parameter is not just an important theoretical consideration; from an applied point of view, a carefully selected bandwidth can help to minimize the error probability.

In this paper, we propose methods for selecting kernel smoothing parameters in the complicated case where covariates can be missing in both the data and in the new unclassified observation. In Sect. 2.1, we revisit kernel-based estimators of the optimal classifier corresponding to this setup. The proposed classifiers do not require any MAR-type assumptions on the missingness probability mechanism. In Sects. 2.2 and 2.3, we turn to the question of bandwidth selection by considering the methods of data-splitting and resubstitution. To evaluate the performance of the corresponding classification rules, exponential performance bounds are established on the deviations of their error probabilities from those of the optimal classifier. In Sect. 3, we present several numerical examples highlighting the effectiveness of the proposed methods. Proofs are postponed to the end of the paper.

2 Main results

2.1 Kernel classifier with missing covariates

Our discussion and results for classification with missing covariates are based on the following setup. Let \(\mathbf{Z }=(\mathbf{X }',\mathbf V ')'\in \mathbb {R}^{d+p}\) be the vector of covariates to be used to predict the class membership \(Y\in \{ 0,1\}\), where \(\mathbf{X }\in \mathbb {R}^d, d\ge 1\) is always observable but \(\mathbf V \in \mathbb {R}^p, p\ge 1\) can be missing. Also, let \(\delta \) be a \(\{0,1\}\)-valued random variable defined by

$$\begin{aligned} \delta = {\left\{ \begin{array}{ll} 0 \quad \text { if } \mathbf V \text { is missing}\\ 1\quad \text { otherwise.} \end{array}\right. } \end{aligned}$$

Mojirsheibani (2012) and Mojirsheibani and Montazeri (2007) show that in this case, the optimal classifier is given by

$$\begin{aligned} \phi _{B}(\mathbf{z }, \delta ):= {\left\{ \begin{array}{ll} 1 \quad \text { if } ~ \delta \dfrac{E(\delta Y\vert \mathbf{Z } = \mathbf{z })}{E(\delta \vert \mathbf{Z } = \mathbf{z })}+(1-\delta ) \dfrac{E[ (1-\delta ) Y \vert \mathbf{X } = \mathbf{x }]}{E[(1-\delta )\vert \mathbf{X } = \mathbf{x } ]}>\dfrac{1}{2}\\ 0\quad \text { otherwise, } \end{array}\right. } \end{aligned}$$
(2)

with the convention \(0/0=0\). The corresponding probability of misclassification is given by

$$\begin{aligned} L(\phi _B) = P \{ \phi _B(\mathbf{Z }, \delta ) \ne Y \}. \end{aligned}$$
(3)

Observe that the classifier \(\psi _B(\mathbf{z })\) in (1) is a special case of \(\phi _{B}(\mathbf {z}, \delta )\) in (2); to see this, simply note that \(\phi _{B}(\mathbf {z}, \delta )\) reduces to \(\psi _B(\mathbf{z })\) whenever \(P\{\delta =1\}=1\) (i.e., whenever there are no missing covariates).

Since the joint distribution of \((\mathbf{Z },Y)\) is almost always unknown, the classifier \(\phi _{B}\) in (2) is not available in practice and must be estimated by some sample-based classifier. One typically has access to only some iid data \(D_n=\{ (\mathbf{Z }_1,Y_1),\ldots , (\mathbf{Z }_n,Y_n)\},\) which can also be represented by

$$\begin{aligned} \{(\mathbf{X }_1,\mathbf V _1,Y_1,\delta _1),\ldots ,(\mathbf{X }_n,\mathbf V _n,Y_n, \delta _n)\}. \end{aligned}$$

To define the kernel classification rule corresponding to (2), we replace the quantities \(E(\delta Y \vert \mathbf{Z }=\mathbf{z })/E(\delta \vert \mathbf{Z }=\mathbf{z })\) and \(E[(1-\delta ) Y \vert \mathbf{X }=\mathbf{x }]/E[(1-\delta ) \vert \mathbf{X }=\mathbf{x }]\) that appear in (2) by their corresponding kernel estimates given by

$$\begin{aligned} \hat{\tau }_{1}(\mathbf {z}):=\dfrac{\sum \limits _{i=1}^{n}\delta _{i} Y_{i} K _{1}\left( \dfrac{\mathbf{Z }_{i}-\mathbf{z }}{h_1} \right) }{\sum \limits _{i=1}^{n}\delta _{i} K _{1}\left( \dfrac{\mathbf{Z }_{i}-\mathbf{z }}{h_1} \right) } \text { and } \hat{\tau }_{0}(\mathbf {x}):=\dfrac{\sum \limits _{i=1}^{n}(1-\delta _{i}) Y_{i} K _{0}\left( \dfrac{\mathbf{X }_{i}-\mathbf{x }}{h_0} \right) }{\sum \limits _{i=1}^{n}(1-\delta _{i}) K _{0}\left( \dfrac{\mathbf{X }_{i}-\mathbf{x }}{h_0} \right) }, \end{aligned}$$
(4)

respectively. Here, the kernels \(K_1\) and \(K_0\) are maps of the form \( K_{j}: \mathbb {R}^{d+jp} \rightarrow \mathbb {R}^{+}\) for \(j=0,1\), and \(h_1\in \mathbb {R}^{+}\) and \(h_0\in \mathbb {R}^{+}\) are called the smoothing parameters or bandwidths of the kernels \(K_1\) and \(K_0\) respectively. The idea of replacing the unknown regression functions with their kernel regression estimates is fairly common in the literature on regression function estimation with missing data. For example, Mojirsheibani and Reese (2015) prove the strong consistency of such kernel regression estimates when the response variables can be missing; see also Karimi and Mohammadzadeh (2012) and Toutenburg and Shalabh (2003) for more on the estimation of regression functions for correlated data in the presence of missing response variables.
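As a concrete illustration of the estimators in (4), the sketch below (ours, not code from the paper) evaluates \(\hat{\tau }_{1}(\mathbf{z})\) and \(\hat{\tau }_{0}(\mathbf{x})\) with a spherical Gaussian kernel, handling the convention \(0/0=0\) explicitly. The function and array names, and the choice of kernel, are assumptions made only for this example.

```python
# Hedged sketch of the kernel estimates in (4); kernel choice and names are ours.
import numpy as np

def gaussian_kernel(U):
    """Spherical Gaussian kernel K(u) = exp(-||u||^2 / 2), applied to each row of U."""
    return np.exp(-0.5 * np.sum(U ** 2, axis=1))

def tau1_hat(z, Z, Y, delta, h1):
    """Estimate of E(delta*Y | Z=z) / E(delta | Z=z), with the convention 0/0 = 0.

    Z is an (n, d+p) array of covariate vectors; z has shape (d+p,)."""
    w = delta * gaussian_kernel((Z - z) / h1)   # weights of the complete cases
    denom = w.sum()
    return (w @ Y) / denom if denom > 0 else 0.0

def tau0_hat(x, X, Y, delta, h0):
    """Estimate of E[(1-delta)*Y | X=x] / E[(1-delta) | X=x], with 0/0 = 0.

    X is an (n, d) array of the always-observed covariates; x has shape (d,)."""
    w = (1.0 - delta) * gaussian_kernel((X - x) / h0)  # weights of the incomplete cases
    denom = w.sum()
    return (w @ Y) / denom if denom > 0 else 0.0
```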

The kernel classification rule corresponding to (2) is then given by

$$\begin{aligned} {\phi }_{n}(\mathbf {z}, \delta )= {\left\{ \begin{array}{ll} 1 \quad \text { if } \delta \hat{\tau }_{1}(\mathbf{z }) + (1-\delta ) \hat{\tau }_{0}(\mathbf {x}) > \frac{1}{2}\\ 0 \quad \text { otherwise. } \end{array}\right. } \end{aligned}$$
(5)

The classifier in (5) would be useful if the values of \(h_0\) and \(h_1\) were known. As for the choice of the unknown parameters \(h_1\) and \(h_0\), we propose a number of methods to construct the estimates \(\hat{h}_1\) and \(\hat{h}_0\); these include a data-splitting approach and the resubstitution method.
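Building on the illustrative tau1_hat and tau0_hat above, the plug-in rule (5) can be sketched as follows; when \(\delta =0\), only the always-observed part \(\mathbf{x}\) of \(\mathbf{z}\) enters the decision. This is a sketch under the same assumptions as before, not the authors' implementation.

```python
# Sketch of the kernel rule (5), reusing the illustrative tau1_hat / tau0_hat above.
def phi_n(z, x, delta_new, Z, X, Y, delta, h1, h0):
    """Return 1 if delta*tau1_hat(z) + (1 - delta)*tau0_hat(x) > 1/2, and 0 otherwise.

    Z is (n, d+p) and X is (n, d); z and x are the new covariate vectors, and
    delta_new is the missingness indicator of the new observation."""
    if delta_new == 1:
        score = tau1_hat(z, Z, Y, delta, h1)   # V observed: condition on the full z
    else:
        score = tau0_hat(x, X, Y, delta, h0)   # V missing: condition on x only
    return int(score > 0.5)
```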

2.2 Data splitting

We start by randomly splitting \(D_n\) into a training sequence \(D_m\) of size \(m \equiv m(n)\) and a testing sequence \(T_\ell =D_n-D_m\) of size \(\ell \equiv \ell (n)\), where \(m+\ell =n\). Here, \(D_n=D_m \cup T_\ell \) and \(D_m \cap T_\ell = \emptyset .\) The training sequence \(D_m\) is used to construct a class \(\Phi _m\) of kernel classifiers of the form

$$\begin{aligned} {\phi }_{m}(\mathbf {z}, \delta )= {\left\{ \begin{array}{ll} 1 \quad \text { if } \delta \hat{\tau }_{1,m}(\mathbf {z}) + (1-\delta ) \hat{\tau }_{0,m}(\mathbf {x}) > \frac{1}{2}\\ 0 \quad \text { otherwise, } \end{array}\right. } \end{aligned}$$
(6)

where

$$\begin{aligned} \hat{\tau }_{1,m}(\mathbf {z}):=\dfrac{\sum \limits _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}\delta _{i} Y_{i} K _{1}\left( \dfrac{\mathbf{Z }_{i}-\mathbf{z }}{h_1} \right) }{\sum \limits _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}\delta _{i} K _{1}\left( \dfrac{\mathbf{Z }_{i}-\mathbf{z }}{h_1} \right) } ~\text { and }~ \hat{\tau }_{0,m}(\mathbf {x}):=\dfrac{\sum \limits _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}(1-\delta _{i}) Y_{i} K _{0}\left( \dfrac{\mathbf{X }_{i}-\mathbf{x }}{h_0} \right) }{\sum \limits _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}(1-\delta _{i}) K _{0}\left( \dfrac{\mathbf{X }_{i}-\mathbf{x }}{h_0} \right) }, \end{aligned}$$
(7)

and where \(K_1\), \(K_0\), \(h_1\), and \(h_0\) are defined as before. Observe that the class \(\Phi _m\) is indexed by the values of \(h_1\) and \(h_0\). In what follows, we consider two cases: the case where the cardinality of \(\Phi _{m}\) is finite (this is the case where the free parameters \(h_{1}\) and \(h_{0}\) can take a finite number of values), and the case where \(\Phi _{m}\) has an infinite number of classifiers (i.e., the case where \(h_{1}\) and \(h_{0}\) can take an infinite number of values).

Case (i): \(\Phi _m\) is finite

In the case where \(\Phi _m\) is a finite class of such kernel rules, the parameters \(h_1\) and \(h_0\) will be chosen from a finite set of possible values. Specifically, assume that \(h_1\in H_1= \{ h_{1,1}, h_{1,2}, \ldots , h_{1,N_{1}}\}\) and \(h_0\in H_0= \{ h_{0,1},h_{0,2},\ldots ,h_{0,N_{0}}\}\). Observe that in this setup, the cardinality of \(\Phi _m\) is \(\vert \Phi _m \vert =N_1N_0.\) Next, we use the testing sequence \(T_\ell \) to select a classifier from \(\Phi _m\) that minimizes the following estimate of the error probability

$$\begin{aligned} \widehat{L}_{m,\ell }(\phi _m)=\dfrac{1}{\ell } \sum _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in T_\ell } I\{ \phi _m(\mathbf{Z }_{i}, \delta _{i})\ne Y_{i}\}. \end{aligned}$$
(8)

Let \(\hat{\phi }_n\) denote the classifier selected from \(\Phi _m\) that minimizes (8), i.e., \(\widehat{L}_{m,\ell }(\hat{\phi }_n)\le \widehat{L}_{m,\ell }(\phi _m)\) for all \(\phi _m\in \Phi _m\). Equivalently, \(\hat{\phi }_n\) is the classifier that minimizes (8) as \(h_1\) and \(h_0\) vary over \(H_1\) and \(H_0\) respectively. Here, the subscript n of \(\hat{\phi }_{n}\) emphasizes the fact that it depends on the entire data set of size n.
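The selection step just described is straightforward to implement; the sketch below (our illustration, reusing the hypothetical phi_n from Sect. 2.1) computes the empirical test error (8) for every pair \((h_1,h_0)\) in the finite grids \(H_1\times H_0\) and returns the minimizing pair. The grid values echo those used later in Sect. 3 and are assumptions made for the example.

```python
# Illustrative finite-grid bandwidth selection via data splitting (Case (i)).
import numpy as np

def empirical_error(h1, h0, train, test):
    """Estimate (8): fraction of the testing sequence misclassified by phi_m(.; h1, h0)."""
    Zm, Xm, Ym, dm = train            # training sequence D_m (arrays)
    Zt, Xt, Yt, dt = test             # testing sequence T_ell (arrays)
    mistakes = sum(phi_n(z, x, d, Zm, Xm, Ym, dm, h1, h0) != y
                   for z, x, y, d in zip(Zt, Xt, Yt, dt))
    return mistakes / len(Yt)

def select_bandwidths_split(train, test, H1, H0):
    """Return the pair (h1, h0) in H1 x H0 minimizing the empirical test error (8)."""
    grid = [(h1, h0) for h1 in H1 for h0 in H0]
    errors = [empirical_error(h1, h0, train, test) for h1, h0 in grid]
    return grid[int(np.argmin(errors))]

# Hypothetical grids, as in the numerical examples of Sect. 3:
H1 = np.linspace(0.05, 1.2, 10)
H0 = np.linspace(0.08, 1.2, 10)
```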

How good is \(\hat{\phi }_n\) for predicting Y? To answer this question, let

$$\begin{aligned} L({\phi }_{m}) = P\{ {\phi }_{m}(\mathbf{Z }, \delta )\ne Y\ \vert D_m \} \text { and } L(\hat{{\phi }}_{n}) = P\{ {\hat{\phi }}_{n}(\mathbf{Z }, \delta )\ne Y |D_{n}\} \end{aligned}$$

denote the error probabilities of \(\phi _m\) and \(\hat{\phi }_n\), respectively. The following result gives exponential performance bounds for \(\hat{\phi }_n\):

Theorem 1

Let \(\phi _m\) and \(\hat{\phi }_n\) be the data-based classifiers defined as above. Then for any distribution of \((\mathbf{Z }, Y)\) and every \(\epsilon >0\),

$$\begin{aligned} P\left\{ L(\hat{\phi }_{n}) - \inf \limits _{\phi _{m}\in \Phi _{m}} L(\phi _{m})>\epsilon \right\} \le 2N_{1}N_{0}e^{-\ell \epsilon ^2/2}. \end{aligned}$$

Remark 1

Of course, when \(\ell \rightarrow \infty \) fast enough that the bound in Theorem 1 is summable in n (for example, when \(\ell /\log n \rightarrow \infty \)), the Borel–Cantelli lemma yields

$$\begin{aligned} L(\hat{\phi }_{n}) - \inf _{\phi _{m} \in \Phi _{m}} L(\phi _{m}) \rightarrow 0 \end{aligned}$$

with probability one. But this falls short of achieving strong consistency for \(\hat{\phi }_{n}\). To appreciate this, consider the decomposition

$$\begin{aligned} L(\hat{\phi }_{n}) - L(\phi _{B}) = \left\{ L(\hat{\phi }_{n}) - \inf _{\phi _{m} \in \Phi _{m}} L(\phi _{m})\right\} + \left\{ \inf _{\phi _{m} \in \Phi _{m}} L(\phi _{m}) - L(\phi _{B})\right\} . \end{aligned}$$

Then although the first bracketed term goes to zero, there is no guarantee that the term \(\{ \inf _{\phi _{m} \in \Phi _{m}} L(\phi _{m}) - L(\phi _{B}) \}\) can become small unless the class \(\Phi _{m}\) can capture the classifier \(\phi _{B}\) as \(m \rightarrow \infty \). To overcome such difficulties, we next consider the case where \(\Phi _{m}\) has an infinite cardinality.

Case (ii): \(\Phi _m\) is infinite

Now consider the case where \(\Phi _m\) is the class of all kernel rules of the form in (6), but with parameters \(h_1\) and \(h_0\) chosen from an infinite set of possible values. Once again, let \(\hat{\phi }_n\) denote the classifier that minimizes the empirical misclassification error in (8). To study the performance of \(\hat{\phi }_n\), we first need to recall the definition of the shatter coefficient of a class of sets. Let \(\mathcal {A}\) be a class of measurable sets in \(\mathbb {R}^s\), where \(s \ge 1\). The \(n^{\text {th}}\) shatter coefficient of \(\mathcal {A}\), denoted by \(S(\mathcal {A},n)\), is defined by

$$\begin{aligned} S(\mathcal {A},n)=\max _{\mathbf{z }_1,\ldots ,\mathbf{z }_n\in \mathbb {R}^s} \{ \text {number of different sets in } \{ \{ \mathbf{z }_1,\ldots ,\mathbf{z }_n \} \cap A \vert A\in \mathcal {A} \} \}. \end{aligned}$$

The shatter coefficient \(S(\mathcal {A},n)\) measures the richness of the class \(\mathcal {A}\). Let \(\mathcal {A}_{\Phi }\) be the class of all sets of the form

$$\begin{aligned} A=\{ \{ \mathbf{z }\vert \phi (\mathbf{z })=1\} \times \{ 0\} \} \cup \{ \{ \mathbf{z }\vert \phi (\mathbf{z })=0\} \times \{ 1\} \},\quad \phi \in \Phi \end{aligned}$$
(9)

and define the \(n^{th}\) shatter coefficient of the class of classifiers \(\Phi \) to be \(S(\Phi ,n)=S(\mathcal {A}_{\Phi },n)\). Note that the size of \(S(\mathcal {A}_{\Phi },n)\) depends on the class \(\Phi \). If, for example, we take \(\Phi \) to be the class of all linear classifiers of the form

$$\begin{aligned} \phi (\mathbf{z })= {\left\{ \begin{array}{ll} 1\quad \text { if } a_0 + a_1z_1 +\cdots + a_{d+p}z_{d+p}>0\\ 0\quad \text { otherwise, } \end{array}\right. } \end{aligned}$$

where \(a_0,a_1,\ldots ,a_{d+p}\in \mathbb {R}\), then \(S(\mathcal {A}_{\Phi },n) \le n^{d+p+1}\); see, for example, Devroye et al. (1996) and Pollard (1984).

To state our main results, we shall assume that the chosen kernels are regular: a nonnegative kernel K is said to be regular if there are constants \(b>0\) and \(r>0\) for which \(K(\mathbf{x })\ge bI\{ \mathbf{x }\in S_{0,r}\}\) and \(\int \sup _{\mathbf{y }\in \mathbf{x }+ S_{0,r}}K(\mathbf{y })d\mathbf{x }<\infty ,\) where \(S_{0,r}\) is the ball of radius r centered at the origin. For more on this see, for example, Györfi et al. (2002).

Theorem 2

Let \(\phi _B\) be as in (2) and let \(\hat{\phi }_n\) be the classifier that minimizes (8) as \(h_0\) and \(h_1\) vary over sets of the form \(H_0=[0,A_0]\) and \(H_1=[0,A_1]\), where \(A_0>0\) and \(A_1>0\) are arbitrary real numbers. Also, let \(K_0\) and \(K_1\) be regular kernels. Then for any distribution of \((\mathbf{Z }, Y)\) and every \(\epsilon >0\), there is an integer \(n_0\equiv n_{0}(\epsilon )>0\) such that for all \(n>n_0\)

$$\begin{aligned} P\{L(\hat{\phi }_n)-L(\phi _{B})>\epsilon \} \le 4e^{8}E\left[ S(\Phi _m,\ell ^2)\right] e^{-\ell \epsilon ^2/8} + 2e^{-c_{1}m\epsilon ^2}, \end{aligned}$$

where \(c_{1}\) is a positive constant that does not depend on n.

In passing we also note that the bound in Theorem 2, along with the Borel–Cantelli lemma, immediately provides the following strong consistency result:

Corollary 1

If \(\ell ^{-1}\log \{E[S({\Phi _m},\ell ^2)]\} \rightarrow 0,\) as \(n \rightarrow \infty ,\) then under the conditions of Theorem 2,

$$\begin{aligned} L(\hat{\phi }_{n}) - L(\phi _B)\overset{a.s.}{\longrightarrow } 0 ~~~ \text {as} ~~~ n \rightarrow \infty . \end{aligned}$$

In other words, the error probability of the classifier \(\hat{\phi }_n\) converges, with probability one, to that of the optimal classifier.

Some special kernels

In most situations, the term \(S(\Phi _m,\ell ^2)\), which appears in the bound of Theorem 2, may be difficult to compute, in which case an upper bound on the shatter coefficient may be convenient. Fortunately, we can bound \(S(\Phi _m,\ell ^2)\) based on the notion of kernel complexity. Doing so will allow us to find computable performance bounds for several widely used classes of kernels including the Gaussian kernel. To present such bounds, first note that when \(\delta = 1\) (i.e., when the new observation \(\mathbf{Z }\) has no missing components), the kernel estimator (6) of the optimal classifier \(\phi _B\) can be written as

$$\begin{aligned} \phi _{m,1}(\mathbf{z }):= {\left\{ \begin{array}{ll} 1\quad \text { if } \sum \limits _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}(2Y_i-1)\delta _{i} K_{1}\left( \dfrac{\mathbf{Z }_i-\mathbf{z }}{h_1}\right) >0\\ 0 \quad \text { otherwise. } \end{array}\right. } \end{aligned}$$
(10)

If \(\delta =0\), however, the kernel classifier in (6) becomes

$$\begin{aligned} \phi _{m,0}(\mathbf{x }):= {\left\{ \begin{array}{ll} 1 \quad \text { if } \sum \limits _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}(2Y_i-1)(1-\delta _{i}) K_{0}\left( \dfrac{\mathbf{X }_i-\mathbf{x }}{h_0}\right) >0\\ 0 \quad \text { otherwise. } \end{array}\right. } \end{aligned}$$
(11)

Next, we borrow the following definitions from Devroye et al. (1996, Chap. 25). Define the quantities \(\kappa _{m}^{(1)}\) and \(\kappa _{m}^{(0)}\) as follows:

$$\begin{aligned} \kappa _{m}^{(1)}&= \sup _{\mathbf{z },(\mathbf{z }_{1},y_{1}),\ldots ,(\mathbf{z }_{m},y_{m})}\left\{ \text {Number of sign changes of} \right. \\&\left. \sum \limits _{i:(\mathbf{z }_i,y_i,1)\in D_m}(2y_i-1)\delta _{i} K_{1}\left( \dfrac{\mathbf{z }_i-\mathbf{z }}{h_1}\right) \text { as }h_1 \text { varies from 0 to infinity}\right\} , \end{aligned}$$
$$\begin{aligned} \kappa _{m}^{(0)}&= \sup _{\mathbf{x },(\mathbf{x }_{1},y_{1}),\ldots ,(\mathbf{x }_{m},y_{m})}\left\{ \text {Number of sign changes of }\right. \\&\left. \sum \limits _{i : (\mathbf{z }_i,y_i,0)\in D_m}(2y_i-1)(1-\delta _{i}) K_{0}\left( \dfrac{\mathbf{x }_i-\mathbf{x }}{h_0}\right) \text { as }h_0 \text { varies from 0 to infinity}\right\} . \end{aligned}$$

Finally, define the kernel complexity to be

$$\begin{aligned} \kappa _{m}^{*}=\max (\kappa _{m}^{(1)},\kappa _{m}^{(0)}). \end{aligned}$$
(12)
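To build some intuition for these suprema, one can fix a sample and a query point and simply track the sign of the training sum in (10) over a fine grid of bandwidths. The sketch below (our illustration, using a Gaussian kernel) does this; because the grid is finite, the count it returns is only a lower bound on the number of sign changes entering the definition of \(\kappa _{m}^{(1)}\).

```python
# Illustrative, grid-based count of sign changes of the sum in (10) as h1 varies.
import numpy as np

def sign_changes_lower_bound(z, Z, Y, delta, h_grid):
    """Count sign changes of sum_i (2*Y_i - 1) * delta_i * K1((Z_i - z)/h1) over h_grid.

    Z is an (m, d+p) array; a finite grid can only give a lower bound on the
    kernel-complexity count, which is a supremum over all samples and all h1 > 0."""
    signs = []
    for h1 in h_grid:
        s = np.sum((2 * Y - 1) * delta
                   * np.exp(-0.5 * np.sum(((Z - z) / h1) ** 2, axis=1)))
        signs.append(np.sign(s))
    signs = np.array(signs)
    return int(np.sum(signs[1:] * signs[:-1] < 0))
```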

The kernel complexity of a classifier is closely related to the shatter coefficient of the class \(\Phi _m\). To appreciate this, suppose we have a rule \(\phi _m\) with complexity \(\kappa _{m}^{*}\). Then, as \(h_1\) and \(h_0\) vary from 0 to infinity, the binary \(\ell \)-vectors

$$\begin{aligned}&\left( \text {sign}\left[ \sum \limits _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}(2Y_i-1)\delta _{i} K_{1}\left( \dfrac{\mathbf{Z }_i-\mathbf{Z }_j}{h_1}\right) \right] \right) _{j=m+1}^{m+\ell }\\&\quad \text { and } \left( \text {sign}\left[ \sum \limits _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}(2Y_i-1)(1-\delta _{i}) K_{0}\left( \dfrac{\mathbf{X }_i-\mathbf{X }_j}{h_0}\right) \right] \right) _{j=m+1}^{m+\ell } \end{aligned}$$

change at most \(\ell \kappa _{m}^{*}\) times each. Therefore, they can take at most \(\ell \kappa _{m}^{*}+1\) different values, which implies that

$$\begin{aligned} S(\Phi _{m},\ell )\le \ell \kappa _{m}^{*}+1. \end{aligned}$$
(13)

The following corollary is an immediate consequence of Theorem 2 and the bound in (13):

Corollary 2

Suppose the kernel classifier \(\phi _m\) in (6) has complexity \(\kappa _m^{*}\) as defined above. Then, under the conditions of Theorem 2, one has, for large n,

$$\begin{aligned} P\{ L(\hat{\phi }_{n}) - L(\phi _B)>\epsilon \} \le 4e^{8}(\ell ^{2}\kappa _{m}^{*}+1)e^{-\ell \epsilon ^2/8}+2e^{-c_{3}m\epsilon ^2} \end{aligned}$$

where the positive constant \(c_{3}\) depends only on the choice of kernels used.

Corollary 2 applies to a broad range of kernels. In what follows, we examine such bounds for two popular classes: the exponential kernels and the polynomial kernels [also, see (Devroye et al. 1996, Sect. 25)].

(i) Exponential kernels

Consider exponential kernels of the form

$$\begin{aligned} K(\mathbf u )=e^{-\Vert \mathbf u \Vert ^{\alpha }} \end{aligned}$$

where \(\alpha >0\), \(\mathbf u \in \mathbb {R}^s\), and \(\Vert \cdot \Vert \) is any norm on \(\mathbb {R}^s\) (the popular Gaussian kernel falls into this category). If \(\delta =1\) then (10) becomes

$$\begin{aligned} {\phi }_{m,1}(\mathbf {z})= {\left\{ \begin{array}{ll} 1 \quad \text { if } \sum \limits _{i: (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}(2Y_i-1)\delta _{i} e^{-(\Vert \mathbf{Z }_i-\mathbf{z }\Vert /{h_1})^{\alpha _1}} >0\\ 0 \quad \text { otherwise } \end{array}\right. } \end{aligned}$$

for some positive constant \(\alpha _1\). Similarly, when \(\delta =0\), the expression in (11) becomes

$$\begin{aligned} {\phi }_{m,0}(\mathbf {x})= {\left\{ \begin{array}{ll} 1 \quad \text { if } \sum \limits _{i: (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}(2Y_i-1)(1-\delta _{i}) e^{-(\Vert \mathbf{X }_i-\mathbf{x }\Vert /{h_0})^{\alpha _0}} >0\\ 0 \quad \text { otherwise } \end{array}\right. } \end{aligned}$$

for some positive constant \(\alpha _0\). We can now state the following version of Theorem 2 when \(K_{0}\) and \(K_{1}\) are exponential kernels.

Theorem 3

Let \(\Phi _m\) be the class of kernel classifiers \(\phi _m\) in (6), where \(K_0\) and \(K_1\) are exponential kernels, as defined above. Then under the conditions of Theorem 2, and for any distribution of \((\mathbf{Z }, Y)\), one has, for large n,

$$\begin{aligned} P\{ L(\hat{\phi }_{n}) - L(\phi _B)>\epsilon \} \le 4e^{8}(\ell ^{2}m+1)e^{-\ell \epsilon ^2/8}+2e^{-c_{4}m\epsilon ^2} \end{aligned}$$

where the positive constant \(c_{4}\) depends only on the choice of kernels used.

Once again, the above bound (in conjunction with the Borel–Cantelli lemma) yields \(L(\hat{\phi }_{n}) - L(\phi _{B}) \overset{a.s.}{\longrightarrow }0\) as \(n \rightarrow \infty \), provided that \(\ell ^{-1} \log (\ell ^{2}m+1) \rightarrow 0\), as n (and thus \(m, \ell \)) \(\rightarrow \infty \).
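For instance, with the splitting ratio used later in Sect. 3 (so that, say, \(m=\lceil 0.65\,n\rceil \) and \(\ell =n-m\approx 0.35\,n\)), this condition is easily verified:

$$\begin{aligned} \frac{\log (\ell ^{2}m+1)}{\ell } \le \frac{\log (n^{3}+1)}{\ell } \approx \frac{3\log n}{0.35\,n} \rightarrow 0 \quad \text {as } n\rightarrow \infty . \end{aligned}$$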

(ii) Polynomial kernels

Consider kernels of the form

$$\begin{aligned} K(\mathbf{z })=\left( \sum _{i=1}^{t}a_{i}\Vert \mathbf{z } \Vert ^{b_{i}}\right) I{\{ \Vert \mathbf{z } \Vert \le 1 \}}, \end{aligned}$$

where \(a_i \in \mathbb {R}\), \(b_i \ge 1\), and t is the number of terms in the above sum; below we write \(r_1\) and \(r_0\) for the number of terms in the kernels \(K_1\) and \(K_0\), respectively. When \(\delta =1\), the expression in (10) becomes

$$\begin{aligned} {\phi }_{m,1}(\mathbf {z})= {\left\{ \begin{array}{ll} 1 \quad \text { if } \sum \limits _{i: (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}(2Y_i-1)\delta _{i} \left( \sum \limits _{k=1}^{r_1}a_{k}\left\| \dfrac{\mathbf{Z }_i-\mathbf{z }}{h_1}\right\| ^{b_{k}}\right) I\left\{ \left\| \dfrac{\mathbf{Z }_i-\mathbf{z }}{h_1}\right\| \le 1\right\} >0\\ 0 \quad \text { otherwise. } \end{array}\right. } \end{aligned}$$

Similarly, when \(\delta =0\), we find

$$\begin{aligned} {\phi }_{m,0}(\mathbf {x})= {\left\{ \begin{array}{ll} 1 \quad \text { if } \sum \limits _{i: (\mathbf{Z }_i,Y_i,\delta _i)\in D_m}(2Y_i-1)(1-\delta _{i}) \left( \sum \limits _{k=1}^{r_0}a_{k}'\left\| \dfrac{\mathbf{X }_i-\mathbf{x }}{h_0}\right\| ^{b_{k}'}\right) I\left\{ \left\| \dfrac{\mathbf{X }_i-\mathbf{x }}{h_0}\right\| \le 1\right\} >0\\ 0 \quad \text { otherwise. } \end{array}\right. } \end{aligned}$$
The following performance bound holds for the class of polynomial kernels:

Theorem 4

Let \(\Phi _m\) be the class of kernel classifiers \(\phi _m\) in (6), where \(K_0\) and \(K_1\) are polynomial kernels, as defined above. Then under the conditions of Theorem 2, and for any distribution of \((\mathbf{Z }, Y)\), one has, for large n,

$$\begin{aligned} P\{ L(\hat{\phi }_{n}) - L(\phi _B)>\epsilon \} \le 4e^{8}(\ell ^{2}rm+1)e^{-\ell \epsilon ^2/8}+2e^{-c_{5}m\epsilon ^2}\,, \end{aligned}$$

where \(r=\max (r_0,r_1)\) and where the positive constant \(c_{5}\) depends only on the choice of kernels used.

Once again, the strong consistency of \(\hat{\phi }_{n}\) follows from the above theorem, provided that \(\ell ^{-1} \log (\ell ^{2}rm+1) \rightarrow 0\), as \(n \rightarrow \infty \).

2.3 The resubstitution method

In the previous section, we considered methods based on data-splitting to find a data-driven value of the kernel bandwidth that would yield strongly consistent classifiers. One problem with this approach, however, is that it is not always clear how one should choose the splitting ratio \(m/n\) (i.e., what m should be). In this section, we consider an alternative approach to choosing the bandwidths, based on the resubstitution method. Let \(\Phi _n\) be the collection of classifiers \(\phi _n\equiv \phi _{n,h_1,h_0}\) defined via (4) and (5), as \(h_1\) and \(h_0\) vary over sets of positive real numbers (possibly infinite sets). Then for any \(\phi _n\in \Phi _n\), the resubstitution estimate of the error of \(\phi _n\) is simply

$$\begin{aligned} \widehat{L}_{n}^{(R)}(\phi _n)&:= \frac{1}{n}\sum _{i=1}^{n} I\{ \phi _{n} (\mathbf{Z }_{i}, \delta _{i})\ne Y_i\} \nonumber \\&=\frac{1}{n}\sum _{i=1}^{n}I\left\{ I \left\{ \delta _i \widehat{\tau }_{h_1}(\mathbf{Z }_i)+(1-\delta _i)\widehat{\tau }_{h_0}(\mathbf{X }_i) >\frac{1}{2} \right\} \ne Y_i \right\} , \end{aligned}$$
(14)

where \(\widehat{\tau }_{h_1}:=\widehat{\tau }_{1}\) and \(\widehat{\tau }_{h_0}:=\widehat{\tau }_{0}\) are as in (4). In other words, the resubstitution estimates of \((h_1,h_0)\), which we shall denote by \((\tilde{h}_1,\tilde{h}_0)\), satisfy

$$\begin{aligned} (\tilde{h}_1,\tilde{h}_0)= \mathrm{argmin}_{\phi _{n,h_1,h_0} \in \Phi _n} \widehat{L}_{n}^{(R)}(\phi _{n,h_1,h_0}). \end{aligned}$$
(15)

Note that the data has been used twice here: once to construct \(\phi _n\) and a second time to estimate the error of \(\phi _n\) (error committed on the same data). Let \(\tilde{\phi }_n \in \Phi _n\) be the classifier corresponding to \((\tilde{h}_1,\tilde{h}_0)\), i.e., \(\widehat{L}_{n}^{(R)}(\tilde{\phi }_n)\le \widehat{L}_{n}^{(R)}(\phi _n)\) for all \(\phi _n \in \Phi _n\). Also, let
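Over a finite grid of candidate bandwidths, the selection in (14)–(15) can be sketched as follows (our illustration, again reusing the hypothetical phi_n from Sect. 2.1); every observation is classified by the rule built from the full sample \(D_n\), and the same sample is then used to score it.

```python
# Illustrative resubstitution selection of (h1, h0), as in (14)-(15), over a finite grid.
import numpy as np

def resubstitution_error(h1, h0, Z, X, Y, delta):
    """Estimate (14): the in-sample error of phi_n(.; h1, h0) on the data used to build it."""
    mistakes = sum(phi_n(z, x, d, Z, X, Y, delta, h1, h0) != y
                   for z, x, y, d in zip(Z, X, Y, delta))
    return mistakes / len(Y)

def select_bandwidths_resub(Z, X, Y, delta, H1, H0):
    """Return the pair (h1, h0) minimizing the resubstitution estimate over H1 x H0."""
    grid = [(h1, h0) for h1 in H1 for h0 in H0]
    errors = [resubstitution_error(h1, h0, Z, X, Y, delta) for h1, h0 in grid]
    return grid[int(np.argmin(errors))]
```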

$$\begin{aligned} L(\phi _n)= P\{ \phi _n(\mathbf{Z }, \delta )\ne Y \vert D_n\} \quad \text{ and }\quad L(\tilde{\phi }_n)= P\{ \tilde{\phi }_n(\mathbf{Z }, \delta )\ne Y \vert D_n\} \end{aligned}$$

denote the error probabilities of \(\phi _n\) and \(\tilde{\phi }_n\), respectively. The following theorem establishes strong consistency of the resubstitution method in the case where the components of the random vector \(\mathbf{Z }\) are discrete:

Theorem 5

Suppose that the kernels \(K_1\ge 0\) and \(K_0 \ge 0\) satisfy the conditions \(K_{1}(\mathbf{z })\rightarrow 0\) as \(\Vert \mathbf{z }\Vert \rightarrow \infty \) and \(K_{0}(\mathbf{x })\rightarrow 0\) as \(\Vert \mathbf{x }\Vert \rightarrow \infty \). Let \(\tilde{\phi }_{n}\) be the classifier defined by (4) and (5) with \(\tilde{h}_1\) and \(\tilde{h}_0\) selected via (15). Then \(L(\tilde{\phi }_{n})\rightarrow L(\phi _B)\), with probability one, whenever the random vector \(\mathbf{Z }\) has a discrete distribution.

Remark 2

While consistency is guaranteed when the random vector \(\mathbf{Z }\) is discrete, Theorem 5 draws no conclusions about the performance of the procedure when \(\mathbf{Z }\) is continuous. It is true that the resubstitution procedure can be inconsistent when the random vector \(\mathbf{Z }\) is not discrete [see (Devroye et al. 1996, Sect. 25.6) for a counter-example], yet there may be cases where consistency can still be achieved.

3 Numerical examples

In this section, we carry out several numerical studies to assess the performance of our proposed classification methods.

Example 1

Consider predicting the class membership, \(Y=0\) or \(Y=1\), of an entity based on the covariates \(\mathbf{Z }=(X,V)'\), where \(\mathbf{Z }\sim N_2(\varvec{\mu }_0,\Sigma _{0})\) if \(Y=0\) and \(\mathbf{Z }\sim N_2(\varvec{\mu }_1,\Sigma _{1})\) otherwise. The unconditional class probabilities were taken to be \(P\{Y=1\}=P\{Y=0\}=0.5\). The parameters were chosen as follows:

$$\begin{aligned} \varvec{\mu }_0=(0.7,0)', \Sigma _0 = \begin{pmatrix} 1 &{} 0.2 \\ 0.2 &{} 2 \\ \end{pmatrix}, \end{aligned}$$
$$\begin{aligned} \varvec{\mu }_1=(1,1)', \Sigma _1 = \begin{pmatrix} 1 &{} 0.2 \\ 0.2 &{} 1 \\ \end{pmatrix}. \end{aligned}$$

Here X is observable but V could be missing. The missingness probability mechanism was taken to be

$$\begin{aligned} p(\mathbf{z },y)&=p(x,v,y):=P\{\delta =1 \vert X=x, V=v, Y=y\}\\&=\exp \{-a(1-0.6y)(x-1)^2-b(1+0.6y)v^2-cy\},\nonumber \end{aligned}$$
(16)

where \(a,b,c\ge 0\) are constants. We considered three different choices for \((a,b,c)\): (0.5, 0.5, 0), (0.45, 0, 1.3), and (0, 0, 0); the choice (0, 0, 0) corresponds to the case of no missing data. We considered two sample sizes: \(n=100\) and \(n=300\). As for the choice of the kernels, \(K_1:\mathbb {R}^2\rightarrow \mathbb {R}\) and \(K_0:\mathbb {R}\rightarrow \mathbb {R}\) were taken to be standard Gaussian, and the bandwidths \(h_1\) and \(h_0\) were chosen from grids of 10 equally spaced values in [0.05, 1.2] and [0.08, 1.2], respectively. We employed four different methods to estimate the smoothing parameters \(h_1\) and \(h_0\) of the kernel classifiers: (i) the data-splitting procedure based on (6) and (7) with a splitting ratio of \(65\%\), i.e., \(m=0.65 n\) and \(\ell =0.35 n\); (ii) Breiman's out-of-bag procedure (Breiman 1996); (iii) the resubstitution method described in Sect. 2.3; and (iv) the method that selects \(h_1\) and \(h_0\) as the minimizers of the asymptotic mean integrated squared error (AMISE) of the corresponding kernel density estimator, as given in Wand and Jones (1995). Further properties of such procedures are discussed in Hall and Marron (1987); see also Bontemps et al. (2009) for results pertaining to the selection of bandwidths for kernel-based conditional density estimates. In passing, we also note that the optimal choice of the smoothing parameters in kernel density estimation is not necessarily optimal for kernel classification. For the out-of-bag method, we first generated a bootstrap training sample of size n from the original sample (that is, a sample of size n drawn with replacement from the original sample); the observations not appearing in the bootstrap sample were then used as the testing sequence. The bootstrap sample was employed to construct the family \(\Phi _m\) of classifiers of the form (6), whereas the testing sequence was used to choose the empirically best classifier from \(\Phi _m\).
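The following sketch (ours; the paper does not provide code) generates \((X, V, Y, \delta )\) from the mixture model above with the missingness mechanism (16). The seed and helper names are our own, purely illustrative choices.

```python
# Illustrative data generator for Example 1 (our sketch, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

mu0, Sigma0 = np.array([0.7, 0.0]), np.array([[1.0, 0.2], [0.2, 2.0]])
mu1, Sigma1 = np.array([1.0, 1.0]), np.array([[1.0, 0.2], [0.2, 1.0]])

def generate_example1(n, a, b, c):
    """Draw (X, V, Y, delta) with P{Y=1} = 0.5 and missingness mechanism (16)."""
    Y = rng.integers(0, 2, size=n)
    Z = np.where(Y[:, None] == 1,
                 rng.multivariate_normal(mu1, Sigma1, size=n),
                 rng.multivariate_normal(mu0, Sigma0, size=n))
    X, V = Z[:, 0], Z[:, 1]
    p = np.exp(-a * (1 - 0.6 * Y) * (X - 1) ** 2 - b * (1 + 0.6 * Y) * V ** 2 - c * Y)
    delta = rng.binomial(1, p)      # delta = 1 means V is observed
    return X, V, Y, delta

X, V, Y, delta = generate_example1(100, a=0.5, b=0.5, c=0.0)
```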

Finally, to assess the error rates of these four classifiers, we also generated an additional 1000 observations (in the same way as the original data) to be used as our test sample. The entire process above was repeated a total of 500 times, and the misclassification error estimates were averaged over the 500 training and testing samples. The results appear in Table 1 for the case \(n=100\); the reported error rates are averages over the 500 runs, and the numbers in parentheses are the standard errors of those averages.

Table 1 Error rates for \(\hat{\phi }_{n}\) (data-splitting), \(\tilde{\phi }_{n}\) (resubstitution), \(\phi _{n}^{G}\) (out-of-bag), and \(\phi _{n}^{D}\) (density) when \(n=100\) for Example 1

As Table 1 shows, the proposed resubstitution method appears to outperform the other three procedures for most values of the constants a, b, and c. These results lend further support to our earlier remark that the resubstitution method can perform well even when the random vector \(\mathbf{Z }\) is continuous. It is also worth noting that the approach based on density estimation performs nearly as well as the other procedures, at least in the case of missing data. These observations are consistent with the results in Hall and Kang (2005), which demonstrate that, in the multivariate setting and under suitable conditions, the bandwidths selected to minimize the mean squared error between the kernel density estimators and the true densities are of the same order of magnitude as the optimal bandwidths for classification.

The results for the case where \(n=300\) are given in Table 2. Once again, it is clear that the resubstitution method is superior (followed by the data-splitting classifier \( \hat{\phi }_{n}\)).

Tables 1 and 2 also show that, for some choices of the constants a, b, and c, the misclassification error of \(\hat{\phi }_n\) can be smaller than that of the case with no missing covariates. See, for example, the third row of Table 2, where \(a=0.45\), \(b=0\), and \(c=1.3\), in which case the error of \(\hat{\phi }_n\) is 0.263, compared with 0.344 in the case with no missing covariates. This illustrates, somewhat counter-intuitively, that classification with missing covariates can sometimes have a lower misclassification error than classification with no missing covariates. This phenomenon, which is also noted in Mojirsheibani and Montazeri (2007), is partially explained by the relationship between the correlation of Y and V, and that of Y and \(\delta \). To appreciate this, note that when \(a=b=c=0\), the correlation between V and Y is 0.378. When \(a=0.45\), \(b=0\), and \(c=1.3\), however, the correlation between \(\delta \) and Y is \(-0.476\). The fact that \(\vert -0.476\vert > 0.378\) implies that the random variable \(\delta \), which is always observable, can sometimes do better than the covariate V at predicting Y.

Table 2 Error rates for \(\hat{\phi }_{n}\) (data-splitting), \(\tilde{\phi }_{n}\) (resubstitution), \(\phi _{n}^{G}\) (out-of-bag), and \(\phi _{n}^{D}\) (density) when \(n=300\) for Example 1

Example 2

In this example, we consider both continuous and discrete covariates for predicting the class Y. More specifically, we consider covariate vectors of the form \(\mathbf{Z }=(X_1,X_2,V)'\), where \(X_1 = Z_1\), \(X_2 = I\{ |Z_2| < 2 \}\) (so that \(X_2\) is a discrete covariate), and \(V = Z_3\). When \(Y=0\), \((Z_1,Z_2,Z_3)'\sim N_3(\varvec{\mu }_0,\Sigma _{0})\), where

$$\begin{aligned} \varvec{\mu }_0=(0.7,0.7,0.7)', \Sigma _0 = \begin{pmatrix} 1 &{} \quad 0.4 &{} \quad 0.16 \\ 0.4 &{} \quad 1 &{} \quad 0.4 \\ 0.16 &{} \quad 0.4 &{} \quad 1 \\ \end{pmatrix}. \end{aligned}$$

When \(Y=1\), the vector \((Z_1,Z_2,Z_3)'\) has a standard Cauchy distribution with independent components, i.e., \(Z_j,~j=1,2,3\) are independent standard Cauchy random variables. The missingness probability mechanism was taken to be

$$\begin{aligned} p(\mathbf{z },y)&=p(x_1,x_2,v,y):=P\{\delta =1 \vert X_1=x_1, X_2=x_2, V=v, Y=y\}\nonumber \\&=\exp \{-a(1-0.6y)(x_1-0.7)^2-b(1-0.4y)(v-0.5)^2\\&~~~~~~~~~~-c(1+0.6y)^{2}(x_2-0.5)^2-dy\},\nonumber \end{aligned}$$
(17)

where \(a,b,c,d\ge 0\) are constants. We considered three different choices for \((a,b,c,d)\), namely (0.15, 0, 0.5, 1.2), (0.25, 0.5, 0.25, 0.6), and (0, 0, 0, 0); the choice (0, 0, 0, 0) corresponds to no missing data. The kernels, bandwidths, and sample sizes are as in Example 1. Once again, we considered data-splitting, resubstitution, the out-of-bag method, and the procedure based on density estimation to select a kernel classifier from a class indexed by the values of the bandwidths \(h_1\) and \(h_0\). The results appear in Table 3 for the case of \(n=100\) and in Table 4 for \(n=300\) (see Example 1 for more details on how these values were calculated).

Table 3 Error rates for \(\hat{\phi }_{n}\) (data-splitting), \(\tilde{\phi }_{n}\) (resubstitution), \(\phi _{n}^{G}\) (out-of-bag), and \(\phi _{n}^{D}\) (density) when \(n=100\) for Example 2
Table 4 Error rates for \(\hat{\phi }_{n}\) (data-splitting), \(\tilde{\phi }_{n}\) (resubstitution), \(\phi _{n}^{G}\) (out-of-bag), and \(\phi _{n}^{D}\) (density) when \(n=300\) for Example 2

The results show that the proposed resubstitution method outperforms all of the other procedures for every choice of the constants a, b, c, and d appearing in (17), and for both sample sizes, \(n=100\) and \(n=300\). Once again, as noted in Mojirsheibani and Montazeri (2007), we see that classification with missing covariates can sometimes perform better than the case with no missing covariates; see, for example, row 2 of Table 4 (corresponding to \(a=0.15, b=0, c=0.5, d=1.2\)). This is explained in part by the fact that the correlation between \(\delta \) and Y is \(-0.37\), whereas the correlation between V and Y in row 1 (corresponding to no missing data) is only 0.04.

Example 3

(Mammogram data)

We now turn to a real-data example involving the classification of mammographic masses in screening for breast cancer; this data set contains many discrete covariates. The data set consists of 961 patients, 516 of whom have benign mammographic masses (class 0) and the remaining 445 of whom have malignant masses (class 1). The covariates used to predict the class to which a patient belongs are \(x_1=\) 'patient's age' (in years), \(x_2=\) 'mass shape' (nominal value \(\in \{1,2,3,4 \}\)), \(x_3=\) 'mass margin' (nominal value \(\in \{1,2,3,4,5 \}\)), and \(v=\) 'mass density' (nominal value \(\in \{1,2,3,4 \}\)). A full description of this data set is available from the University of California, Irvine, machine learning repository at http://archive.ics.uci.edu/ml/. In this example, we focus on one dominant missingness pattern: for 56 patients, the value of \(v=\) 'mass density' was missing, and we consider only this particular missingness pattern in order to keep the presentation simple.

For each of the four procedures mentioned in the previous examples, a kernel classifier was constructed using two-thirds of the data (randomly selected), and the performance of the chosen classifiers was tested on the remaining portion (the kernels were taken to be as in Example 1). This entire process was repeated 500 times, producing the results in Table 5. It is interesting to note that the resubstitution method and the procedure based on density estimation perform quite similarly, with estimated error rates that are roughly half of those of data-splitting and the out-of-bag method. In this example, all of the covariates used to predict Y were discrete, which might partly explain the superior performance of the resubstitution procedure.

Table 5 Error rates for \(\hat{\phi }_{n}\) (data-splitting), \(\tilde{\phi }_{n}\) (resubstitution), \(\phi _{n}^{G}\) (out-of-bag), and \(\phi _{n}^{D}\) (density) for Example 3

Remark 3

Although we have stated our results for the two-class classification problem, they can readily be extended to the more general M-class setup, where \(M\ge 2\). More specifically, if we put

$$\begin{aligned} \phi _B(\mathbf{z},\delta )= \delta \psi _1(\mathbf{z}) + (1-\delta ) \psi _0(\mathbf{x}), \end{aligned}$$

where for \(1\le j \le M\)

$$\begin{aligned} \psi _1(\mathbf{z})=j&\text{ if }~~ E[\delta I\{Y=j\}|\mathbf{Z}=\mathbf{z}] = \max _{1\le j\le M} E[\delta I\{Y=j\}|\mathbf{Z}=\mathbf{z}] \\ \psi _0(\mathbf{x})=j&\text{ if }~~ E[(1-\delta ) I\{Y=j\}|\mathbf{X}=\mathbf{x}] = \max _{1\le j\le M} E[(1-\delta ) I\{Y=j\}|\mathbf{X}=\mathbf{x}],\ \end{aligned}$$

then it follows from Mojirsheibani and Montazeri (2007) that \(\phi _B\) is indeed the optimal classifier. Now, for \(j=1,\dots ,M\), define \(\hat{\psi }_{1,m}(\mathbf{z})=j\)  if

$$\begin{aligned}&\sum _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in D_m} \delta _i I\{Y_i=j\} K _{1}\left( \frac{\mathbf{Z }_{i}-\mathbf{z }}{h_1} \right) \\&\quad = \max _{1\le j\le M} \sum _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in D_m} \delta _i I\{Y_i=j\} K _{1}\left( \frac{\mathbf{Z }_{i}-\mathbf{z }}{h_1} \right) . \end{aligned}$$

Also, for \(j=1,\dots ,M\), define \(\hat{\psi }_{0,m}(\mathbf{x})=j\)  if

$$\begin{aligned}&\sum _{i : (\mathbf{X }_i,Y_i,\delta _i)\in D_m} (1-\delta _i) I\{Y_i=j\} K _{0}\left( \frac{\mathbf{X }_{i}-\mathbf{x }}{h_0} \right) \\&\quad = \max _{1\le j\le M} \sum _{i : (\mathbf{X }_i,Y_i,\delta _i)\in D_m} (1-\delta _i) I\{Y_i=j\} K _{0}\left( \frac{\mathbf{X }_{i}-\mathbf{x }}{h_0} \right) , \end{aligned}$$

where \(K_1\), \(K_0\), \(h_1\), and \(h_0\) are as in Sect. 2, and put

$$\begin{aligned} \psi _m(\mathbf{z}, \delta ) = \delta \hat{\psi }_{1,m}(\mathbf{z }) + (1-\delta ) \hat{\psi }_{0,m}(\mathbf{x }). \end{aligned}$$

Finally, letting \(h_1\) and \(h_0\) vary over a set of prescribed values, the selected classifier \(\hat{\psi }_n\) is the one that minimizes the empirical error committed on the testing sequence \(T_\ell \), i.e., that minimizes \(\ell ^{-1}\sum _{i : (\mathbf{Z }_i,Y_i,\delta _i)\in T_{\ell }} I\{\psi _{m}(\mathbf{Z}_i,\delta _i)\ne Y_i\}\).
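A minimal sketch of this M-class rule (our illustration, with a Gaussian kernel and hypothetical argument names; ties in the maximum are broken toward the smallest class label, an arbitrary convention):

```python
# Illustrative M-class kernel rule from Remark 3 (our sketch, Gaussian kernel assumed).
import numpy as np

def psi_m_multiclass(z, x, delta_new, Z, X, Y, delta, h1, h0, M):
    """Return the class j in {1, ..., M} maximizing the appropriate kernel-weighted vote."""
    if delta_new == 1:
        w = delta * np.exp(-0.5 * np.sum(((Z - z) / h1) ** 2, axis=1))
    else:
        w = (1 - delta) * np.exp(-0.5 * np.sum(((X - x) / h0) ** 2, axis=1))
    votes = np.array([np.sum(w * (Y == j)) for j in range(1, M + 1)])
    return int(np.argmax(votes)) + 1    # ties broken toward the smallest class label
```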