1 Introduction

Cross validation (CV) was described as early as Stone (1974), and there has been tremendous interest in characterizing why and how a CV method works. In statistics, most theoretical work on CV concentrates on regression rather than classification. Well-cited works include Efron (1983, 1986), Shao (1993, 1996, 1998), and Zhang (1992, 1993a, b). One result of particular interest is Zhang's distributional description of the CV error for linear regression models. For the problems of model selection and error prediction in linear models, certain forms of CV are shown to be equivalent to well-known model selection criteria such as the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Cp statistic. Based on this framework, good performance and asymptotic convergence of CV can be established.

In regression, the risk function is continuous, so it is relatively easy to study the behavior of CV. In classification problems, however, the discontinuity introduced by categorical responses makes it hard to establish an equivalence between CV and existing criteria. Despite this difficulty, a number of studies have explored the performance of CV (in particular, leave-one-out CV) in classification problems. Leave-one-out CV reserves one data point and uses the remaining m − 1 points to train a model, where m is the number of data points. This process is repeated for all m data points to obtain an estimate of the true error. Most works provide bounds on the accuracy of the leave-one-out CV estimate. Rogers and Wagner (1978) and Devroye and Wagner (1979a, b) obtained exponential bounds for the leave-one-out CV estimate of a k-nearest neighbor algorithm within the Probably Approximately Correct framework. More precisely, they provided a bound on the probability that the leave-one-out CV error departs from the true error. Kearns and Ron (1999) derived sanity-check bounds for the leave-one-out CV estimator, showing that the bounds for the leave-one-out CV estimate are not worse than those for the training error estimate.

Despite its popularity, leave-one-out CV has some shortcomings. The most obvious disadvantage is its high computational cost, because the process requires training the model once for every data point. Moreover, the leave-one-out CV estimate has high variance despite its low bias, mainly because nearly identical training sets are used in each CV step. Therefore, leave-one-out CV is not recommended when the learning algorithm is unstable. Hastie et al. (2001) pointed out that the CV estimate in tree-based models can underestimate the true error, because the reserved testing set strongly affects the choice of the optimal tree, and they recommended five- or ten-fold CV as a good compromise between variance and bias. Kohavi (1995), Zhang (1992), and Breiman and Spector (1992) also showed that 10-fold CV produces smaller variance than leave-one-out CV. Thus, for unstable algorithms (such as tree-based models), 10-fold CV is more desirable than leave-one-out CV for estimating the true error.

This study focuses on the behavior of the 10-fold CV estimate in tree-based models. The main questions addressed in this paper are (1) how well does CV in tree-based models estimate the test error? and (2) how does the CV performance vary across different situations? We answer these questions using an experimental approach. The following is a synopsis.

  1. Bayes classifier: Given the distribution of a point cloud, an optimal classifier can be derived based on the Neyman–Pearson likelihood-ratio approach. Such a classifier is known as a Bayes classifier and is, in theory, the best possible classifier.

  2. Cross-validated classifier: Given a training set, a classifier can be trained by minimizing the average error rate computed via CV. The CV error rate is described in Sect. 2. This classifier is called a cross-validated classifier.

  3. Training and testing errors: Both classifiers (Bayes and cross-validated) can be applied to the training and testing sets. A smaller error rate on the training set does not necessarily indicate optimality, because it may result from over-fitting. For a classifier, closeness between the training error and the testing error may be desirable. Moreover, if the Bayes classifier is applied to both the training and testing sets, the difference between the two error rates should be small, since it is affected only by sampling error. On the other hand, if the testing-to-training error difference is large, the randomly sampled data do not reflect the underlying distribution, which suggests that the selected classifier is inappropriate.

  4. Methodology evaluation: Based on the above, the following procedure is deployed to evaluate the performance of a cross-validated classifier. The difference between the training error and the testing error is calculated for the cross-validated classifier. Let e1,A and e2,A denote the training errors of the Bayes and the cross-validated classifiers, respectively, and let e1,B and e2,B denote the two corresponding testing error rates, where

    • “1” stands for the Bayes classifier,

    • “2” stands for the cross-validated classifier,

    • “A” stands for the training set, and

    • “B” stands for the testing set.

    We consider the differences:

      $$ D_2 = e_{2,B} - e_{2,A} \quad {\hbox{vs.}} \quad D_1 = e_{1,B} - e_{1,A}. $$
  5. Main observation: The main observation is that the above two quantities have a roughly linear statistical relationship; this is most evident in Fig. 5. We have

    $$ D_1 = a + c \cdot D_2 + \varepsilon, $$
    (1)

where the constant c, with |c| ≤ 1, depends on the underlying distribution and a is a constant intercept. The random variable ɛ has zero mean and appears to be normally distributed.

Note that the asymptotic behavior of D1 and D2 is partly known. The cross-validated classifier converges to the Bayes classifier in most cases (Anthony and Holden 1998); see also Sect. 3. Hence D1 and D2 tend to be equal. This paper studies their behavior in the finite-sample case, i.e., when the sample size is not large. It is interesting to find that a simple linear regression model seems to characterize the relation between D1 and D2 nicely. The impact of the geometry of the decision boundary, the parameters of the underlying distribution, and the sample size is also examined in the simulations.

The rest of the paper is organized as follows. Section 2 describes the tree-based models and their relation with CV. Section 3 presents some theoretical analysis. Section 4 describes the simulation results. Section 5 ends the paper with concluding remarks and suggestions for future study.

2 Cross validation in tree-based models

2.1 The cross-validation principle

Suppose we have two disjoint sets: a training set and a testing set. The former is used to learn the model and the latter to evaluate the performance of the trained model. Let A denote the training set of size NA and B the testing set of size NB. Consider k-fold CV, and let α be an algorithmic parameter of the model. If \(e_{\alpha}^{(-i)}\) denotes the error rate when the ith fold is excluded during CV, the cross-validation error at α is given as

$$ CV(A; \alpha) = \frac{1}{k}\sum_{i=1}^k e_{\alpha}^{(-i)}. $$

The principle of CV is to choose an α such that CV(A; α) is minimized:

$$ \alpha_0 = \hbox{argmin}_{\alpha}CV(A; \alpha). $$

Let \(T_{\alpha_{0}} (A)\) denote the model that is built using α = α0 together with the training sample A. The training error based on CV can be expressed as \(e_{CV}(T_{\alpha_{0}} (A), A)\) (identical to e2,A), which denotes the error rate when the model \(T_{\alpha_{0}} (A)\) is applied to the training data A. The testing error can be represented as \(e_{T}(T_{\alpha_{0}} (A), B)\) (identical to e2,B), which denotes the error rate when the model \(T_{\alpha_{0}} (A)\) is applied to the testing data B.
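To make the selection of α concrete, the following is a minimal sketch of the k-fold CV principle described above; the helper functions train_tree and error_rate are hypothetical placeholders for a tree-fitting routine and a 0-1 error evaluator, not part of the original paper.

```python
import numpy as np

def cv_error(X, y, alpha, k, train_tree, error_rate, rng):
    """Estimate CV(A; alpha) = (1/k) * sum_i e_alpha^(-i)."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)          # k roughly equal folds
    errs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.hstack([folds[j] for j in range(k) if j != i])
        model = train_tree(X[train_idx], y[train_idx], alpha)  # fit without fold i
        errs.append(error_rate(model, X[test_idx], y[test_idx]))
    return float(np.mean(errs))

def select_alpha(X, y, alphas, k=10, *, train_tree, error_rate, seed=0):
    """alpha_0 = argmin_alpha CV(A; alpha)."""
    rng = np.random.default_rng(seed)
    scores = [cv_error(X, y, a, k, train_tree, error_rate, rng) for a in alphas]
    return alphas[int(np.argmin(scores))]
```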

2.2 Complexity-penalized loss function

Tree-based models are very popular in the data mining community because they provide interpretable rules and logic statements that enable more intelligent decision making. In general, tree modeling involves two major steps: tree growing and tree pruning. Tree growing searches over the whole dataset for the splitting points that yield the greatest improvement in a specified score function. Once the tree is fully grown and no further improvement is possible, one prunes it back to identify the right-sized tree, i.e., the one that yields the minimum error when applied to the testing dataset (Hastie et al. 2001).

CV is adopted in the tree-pruning step. Among the many tree-pruning algorithms, the cost-complexity pruning (CCP) (Breiman et al. 1984) and frontier-based pruning (FBP) (Huo et al. 2006) algorithms utilize the principle of CV. The main idea of both CCP and FBP is to consider a complexity-penalized loss function (CPLF) and to search over the possible values of a penalizing parameter to find the optimal tree using CV. The CPLF has the form

$$ L(T_b) + \alpha |T_b|, $$
(2)

where L(Tb) is the loss function associated with the tree Tb, |Tb| is the size of the tree, defined as the number of terminal nodes, and α is a parameter that controls the size of the tree. We then solve the following:

$$ T_{b_0}(\alpha) = \hbox{argmin}_{T_b} \quad L(T_b) + \alpha |T_b|. $$
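As a small illustration (not the authors' implementation), the minimization above can be carried out by enumerating a family of candidate subtrees; the subtree list and its loss values below are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Subtree:
    name: str
    loss: float        # L(T_b): e.g., misclassification count on the growing data
    n_leaves: int      # |T_b|: number of terminal nodes

def best_subtree(subtrees, alpha):
    """Return argmin over T_b of L(T_b) + alpha * |T_b|, as in (2)."""
    return min(subtrees, key=lambda t: t.loss + alpha * t.n_leaves)

# Hypothetical candidates: a larger tree fits better but pays a complexity penalty.
candidates = [Subtree("root-only", 40.0, 1), Subtree("small", 22.0, 3), Subtree("large", 15.0, 9)]
print(best_subtree(candidates, alpha=0.5).name)   # small alpha favors the larger tree
print(best_subtree(candidates, alpha=5.0).name)   # large alpha favors a smaller tree
```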

2.3 Integration with cross validation

Although they share the same objective function, CCP and FBP use different algorithms to find an optimal tree. FBP is more advantageous for studying the behavior of CV because it implements the principle of cross validation more faithfully. We begin with a brief description of how the FBP algorithm can be used to implement CV. Suppose the observations are

$$ \{(x_1,y_1), (x_2,y_2), (x_3,y_3), \ldots, (x_N,y_N)\}, $$

where N is the number of observations, the xi's are predictor variables, and the yi's are responses. Suppose the above set is equally partitioned into k subsets:

$$ S_1 \cup S_2 \cup \cdots \cup S_k. $$

At each step, we reserve one subset (say, Si) for testing and use the remaining subsets to prune the tree. The core idea of the FBP algorithm is illustrated in Fig. 1. For a given α, when the target size of the tree is m, the minimum value of the CPLF is cm + mα, where cm is a constant. The first step of FBP is to list the CPLFs in each node of the tree using a bottom-up tree-pruning algorithm. All this information is then summarized at the root node as a list of CPLFs; hence the number of CPLFs at the root node equals the number of terminal nodes of the tree. The next step is to plot all the CPLFs at the root node in a Cartesian plane (Fig. 1), with α on the x-axis and the value of the CPLFs on the y-axis. The lower bound of these CPLFs is a piecewise linear function, denoted \(f_{-i}(\alpha)\), where \(f_{-i}(\alpha)\) is the minimum value of (2) when the testing subset Si is withheld.

Fig. 1 An illustration of the frontier-based tree-pruning algorithm
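The frontier at the root node is the lower envelope of the lines cm + mα over the candidate sizes m. Below is a minimal sketch of evaluating this envelope on a grid of α values, assuming the per-size minimum losses cm (hypothetical constants here) have already been collected by the bottom-up pass.

```python
import numpy as np

def frontier(c, alphas):
    """Lower envelope min_m (c[m] + m * alpha) over tree sizes m = 1..len(c).

    c[m-1] is the minimum loss achievable by a pruned subtree with m terminal
    nodes; the envelope f(alpha) is piecewise linear in alpha.
    """
    sizes = np.arange(1, len(c) + 1)
    values = np.asarray(c)[None, :] + np.outer(alphas, sizes)  # one line per size
    best_size = sizes[np.argmin(values, axis=1)]
    return values.min(axis=1), best_size

# Hypothetical constants c_m for subtrees of sizes 1..4 (smaller trees lose more).
c = [30.0, 18.0, 12.0, 10.0]
alphas = np.linspace(0.0, 10.0, 101)
f_vals, sizes = frontier(c, alphas)   # sizes shows which subtree size wins at each alpha
```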

For each value of the parameter α, the optimal subtree can be obtained. Each such model is then applied to the reserved testing set, and the resulting testing error rate is denoted by \(e_{-i}(\alpha)\). Note that \(f_{-i}(\alpha)\) and \(e_{-i}(\alpha)\) are functions of the same variable. Because \(f_{-i}(\alpha)\) is piecewise linear, it is not hard to show that \(e_{-i}(\alpha)\) is a piecewise constant (step) function. The principle of CV is to find the α that minimizes the average of the \(e_{-i}\)'s,

$$ \frac{1}{k}\sum_{i=1}^k e_{-i}(\alpha). $$

The subtree corresponding to the optimal α is taken as the final tree. Figure 2 shows how the CV error rate CV(A; α) varies with α; the lowest part of the step function indicates the optimal α.

Fig. 2 The range of the optimal α that produces the smallest error rate. Note that the minimum is not unique

3 Analysis

3.1 Error difference for the Bayes classifier

Some distributional analysis is given here. The dataset is divided into a training set and a testing set. An oracle that knows the underlying distribution can derive a classifier that is statistically optimal, i.e., has the minimum expected prediction error rate overall, following the idea of Neyman–Pearson. Such a classifier is known as the Bayes classifier (BC). Note that this classifier does not depend on the sample data. Let e1,* denote the error rate obtained by applying the BC to dataset *. We can obtain e1,A and e1,B using the following equations:

$$ e_{1,A} = \frac{1}{N_{A}} \sum_{i \in {A}} I(\hat{Y}_{BC}(X_i), Y_{i}), $$
(3)
$$ e_{1,B} = \frac{1}{N_{B}}\sum_{i \in {B}} I(\hat{Y}_{BC}(X_i), Y_{i}), $$
(4)

where N* is the size of dataset * and I is the 0-1 loss function defined as follows:

$$ I(\hat{Y}(X),Y) = \left\{ \begin{array}{ll} 0 & \hbox{if}\quad Y = \hat{Y}(X),\\ 1 &\hbox{if}\quad Y \neq \hat{Y}(X),\end{array}\right. $$

and ŶBC(·) is the Bayes classifier. We can compute their difference:

$$ D_1 = e_{1,B} - e_{1,A}. $$
(5)
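As a small illustration (not from the original paper), the quantities in (3)–(5) are empirical 0-1 error rates of a fixed rule on the two datasets. The rectangular rule below is a hypothetical stand-in for the Bayes classifier of Case 1 in Sect. 4; the majority label inside the rectangle is passed explicitly because it depends on the labeling convention.

```python
import numpy as np

def zero_one_error(y_hat, y):
    """Empirical error rate: (1/N) * sum_i I(y_hat_i, y_i) with 0-1 loss."""
    return float(np.mean(np.asarray(y_hat) != np.asarray(y)))

def d1(bayes_clf, X_A, y_A, X_B, y_B):
    """D_1 = e_{1,B} - e_{1,A} as in (3)-(5); bayes_clf maps inputs to 0/1 labels."""
    e_1A = zero_one_error(bayes_clf(X_A), y_A)
    e_1B = zero_one_error(bayes_clf(X_B), y_B)
    return e_1B - e_1A

def rect_bayes(X, inside_label=0):
    """Hypothetical Bayes rule for a rectangular region (cf. Case 1 in Sect. 4):
    predict the majority class inside the rectangle and its complement outside."""
    inside = (0.2 <= X[:, 0]) & (X[:, 0] <= 0.8) & (0.3 <= X[:, 1]) & (X[:, 1] <= 0.8)
    return np.where(inside, inside_label, 1 - inside_label)
```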

The distribution of D1 can be characterized by the following proposition, which follows directly from the central limit theorem.

Proposition 1

Suppose that the minimum misclassification error is a constant p within the state space. When errors e1,A and e1,B are defined as in (3) and (4), we have \(e_{1,A} \sim {\mathcal{N}}(p,\sigma^2_{e_{1,A}})\) and \(e_{1,B} \sim {\mathcal{N}}(p,\sigma^2_{e_{1,B}})\) where p is the true risk. Moreover, we have \(D_1 = e_{1,B} - e_{1,A} \sim {\mathcal{N}}(0,\sigma^2_{D_1})\).

Proof

For \(i \in A\), \(I(\hat{Y}_{BC}(X_i), Y_{i})\) can be considered an independent and identically distributed Bernoulli random variable with success probability p, where p is also the true risk, owing to the fixed decision boundary of the BC. Therefore \(\sum_{i \in{A}} I(\hat{Y}_{BC}(X_i), Y_{i})\) follows a binomial distribution with parameters NA and p, which can be approximated by a normal distribution, \({\mathcal{N}}(N_A \cdot p, N_A \cdot p(1-p))\). Thus e1,A can be described as \({\mathcal{N}} \left(p, \frac{p(1-p)} {N_{A}} \right)\). Similarly, we have \(e_{1,B} \sim {\mathcal{N}} \left(p, \frac{p(1-p)}{N_{B}}\right)\). Furthermore, because the difference of two normal random variables is also normal, D1 = e1,B − e1,A approximately follows a normal distribution with mean 0 and variance \(\frac{p(1-p)} {N_{A}}+\frac{p(1-p)}{N_{B}}\).

The assumption of a constant p is stringent; however, it is consistent with the settings in the simulations. The same result holds in more general cases; for illustration purposes, we choose not to pursue more general results. Note that in the above setting, D1 attains the minimum asymptotic variance among all unbiased estimators. We choose not to articulate this either, as such a result is known in mathematical statistics.
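As a quick numerical sanity check of Proposition 1 (not part of the original analysis), one can simulate the Bernoulli error indicators directly and compare the empirical variance of D1 with p(1 − p)(1/NA + 1/NB):

```python
import numpy as np

rng = np.random.default_rng(0)
p, N_A, N_B, n_rep = 0.1, 300, 300, 20000

# Each replicate: D_1 = e_{1,B} - e_{1,A} with i.i.d. Bernoulli(p) error indicators.
e_A = rng.binomial(N_A, p, size=n_rep) / N_A
e_B = rng.binomial(N_B, p, size=n_rep) / N_B
D1 = e_B - e_A

print(D1.mean())                                     # approximately 0
print(D1.var(), p * (1 - p) * (1 / N_A + 1 / N_B))   # empirical vs. theoretical variance
```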

3.2 Error difference for the cross-validated classifier

Incorporating CV in tree-based models yields a classifier, called a cross-validated tree (CVT) classifier. In the present study, we constructed CVTs based on the FBP algorithm, which is described in Sects. 2 and 3.

Let e2,A denote the training error based on CV, i.e., the error rate obtained by applying the CVT to the training data:

$$ e_{2,A} = \frac{1}{N_A} \sum_{i \in {A}} I(\hat{Y}_{CVT}(X_i), Y_{i}). $$
(6)

Let e2,B denote the testing error when the CVT is applied to the testing data:

$$ e_{2,B} = \frac{1}{N_B} \sum_{i \in {B}} I(\hat{Y}_{CVT} (X_{i}), Y_{i}). $$
(7)

A quantity similar to the one in (5) is

$$ D_2=e_{2,B} - e_{2,A}. $$
(8)

Similar to the aforementioned proposition, one can see that e2,B is approximately normally distributed because it can be viewed as a sum of i.i.d. Bernoulli random variables. It is less evident that e2,A also approximately follows a normal distribution, because the CVT depends on the training data. Our simulations show that this dependence is negligible, so e2,A can also be considered approximately normally distributed. Figure 3 displays the normal probability plots of e2,A, e2,B, and D2 generated from a relatively small number of sample points (i.e., 200). None of the three plots departs substantially from linearity, suggesting that the error distributions are approximately normal.

Fig. 3 Normal probability plots for the errors from the CVT based on 200 sample points: a training errors from the CVT (e2,A), b testing errors from the CVT (e2,B), and c differences (D2 = e2,B − e2,A)
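For readers who wish to reproduce plots like Fig. 3, a minimal sketch using SciPy's probplot is given below; the arrays e2A and e2B stand in for CVT error rates collected over repeated simulations (placeholder normal draws are used so the snippet runs on its own).

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# e2A, e2B: arrays of CVT training/testing error rates over repeated simulations.
rng = np.random.default_rng(0)
e2A, e2B = rng.normal(0.11, 0.02, 500), rng.normal(0.12, 0.02, 500)
D2 = e2B - e2A

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, data, title in zip(axes, [e2A, e2B, D2], ["e_{2,A}", "e_{2,B}", "D_2"]):
    stats.probplot(data, dist="norm", plot=ax)   # normal probability plot
    ax.set_title(title)
plt.tight_layout()
plt.show()
```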

3.3 Convergence of CVT

We will argue that under amenable conditions, the CVT should converge to the optimal Bayes classifier. Given this convergence, and similar to the argument in Sect. 3.1, one can show that D2 asymptotically follows a normal distribution with zero mean. In fact, asymptotically, D1 and D2 should be equal!

We now present our convergence analysis. Since we consider a tree-based binary classification model, the statistical formulation can be summarized as follows. Consider a univariate response random variable Y = 0 or Y = 1 and a random vector X (called the predictor). Recall that (Xi, Yi) denotes the ith observed pair of predictor and response. Suppose there is a geometrically manageable subset Ω (this notion is specified below). If X ∈ Ω, we have Pr(Y(X) = 1) = p = 1 − Pr(Y(X) = 0), where p > 1/2; if \(X\,{\notin}\,\Upomega\), then we have Pr(Y(X) = 0) = p = 1 − Pr(Y(X) = 1). Under this model, the Bayes classifier is clearly

$$ T_{BC}(X) =\left\{\begin{array}{ll} 1, & X \in \Omega, \\ 0, & X \notin\Omega. \end{array}\right. $$

Moreover, by definition, we have that TBC minimizes the following

$$ E I[T(X) \neq Y] =Pr[T(X) \neq Y], $$

where E stands for expectation, I[*] is an indicator function of event *, and Pr[*] stands for the probability of event *.

Denote a partition of the training set XA into 10 folds as \(X_A = X_{(1)} \cup \cdots \cup X_{(10)}\), where the X(k), \(k=1, 2, \ldots, 10\), are 10 mutually exclusive and roughly equally sized subsets of XA. Denote by \(X_{-k} = X_A \setminus X_{(k)}, 1 \le k \le 10\), the subset of XA after eliminating X(k). Let \(T(X_{-k}; \alpha)\) be the tree-based binary classifier obtained by applying the algorithm described in Sect. 2 to the subset \(X_{-k}\) with parameter α. The cross-validation criterion picks the α that minimizes the following:

$$ \sum^{10}_{k=1} \sum_{i\in X_{(k)}} I[T(X_{-k}; \alpha)(X_i) \neq Y_i] / |X_A|, $$
(9)

where \(T(X_{-k}; \alpha)(X_i)\) is the predicted value of the classifier \(T(X_{-k}; \alpha)\) at Xi, and |XA| denotes the cardinality of the set XA.

We consider two properties under which we argue that the CVT classifier converges to the Bayes classifier. First, recall that in the algorithm of Sect. 2, we first build a big tree and then adopt a fast algorithm to prune it. The underlying geometric region Ω should be representable by a subtree model; i.e., a classifier corresponding to an admissible subtree of the initial large tree should be close to the classifier defined by Ω, e.g., TBC. Otherwise, the tree-based approach has no hope of converging to the optimal classifier.

Property 1

When the sample size is large enough, an admissible subtree of the largest possible tree (which is built on the entire sample) leads to a classifier whose performance is statistically similar to that of a classifier based merely on whether or not the predictor X lies inside Ω, e.g., the Bayes classifier TBC.

We must also assume that the distribution of the predictor X is controllable, so that when the sample size is large, the optimal classifier is achieved at similar values of α.

Property 2

The distribution of the predictor X is “reasonable” such that, when the total sample size of the training set XA is large, the classifiers \(T(X_{-k}; \alpha)\) for 1 ≤ k ≤ 10 and \(T(X_A; \alpha)\) lead to similar classification rules for similar values of α. This ensures that the optimal classifier is achieved simultaneously (with respect to the value of α) for the \(T(X_{-k}; \alpha)\) with different k.

The reasonableness in the above property means that the sets XA and \(X_{-1}, \ldots, X_{-10}\) have similar statistical behavior, so that the tree classifiers built from them are similar as well. Note that the tree classifier, under Property 1, would converge to the Bayes classifier; this property ensures that the convergence occurs at a similar value of α. The reasonableness can be justified by noticing that XA and \(X_{-1}, \ldots, X_{-10}\) are samples from an identical source with about the same sizes.

The actual proof of the above two properties would be tedious and is perhaps more suitable for a mathematically oriented paper; however, they are true at least intuitively. In this paper, we choose to sacrifice mathematical rigor in order to focus on the simulation study that comes later. We now show that the aforementioned two properties lead to the convergence of the CVT to the Bayes classifier. According to Property 1, there exists an α such that \(T(X_{-k}; \alpha)\) becomes a classifier that depends purely on the membership of the predictor X in the underlying set Ω. Call this classifier T. From Property 2, such a classifier can be achieved simultaneously for all k, 1 ≤ k ≤ 10. Note that as the sample size goes to infinity, the value of the expression in (9) satisfies:

$$ \sum^{10}_{k=1} \sum_{i\in X_{(k)}} I[T(X_i) \neq Y_i] / |X_A| \Rightarrow E I[T(X) \neq Y], $$

where “⇒” stands for “converging to,” and the right-hand side is minimized by the Bayes classifier. From the definition of the Bayes classifier, the minimizer T should be the Bayes classifier. This demonstrates that the CVT converges to the Bayes classifier as the sample size goes to infinity.

Remark 1 Property 1 is to ensure that the geometric region Ω is manageable by a binary tree model.

Remark 2 Property 2 says that when the sample is large, the sets \(X_{A}, X_{-1}, \ldots, X_{-10}\) have similar statistical behavior, so they lead to similar tree classifiers for a similar and moderate α.

Remark 3 The above argument generalizes to any K-fold cross validation as long as K is finite. It cannot be applied to leave-one-out cross validation, because the number of folds then grows to infinity along with the sample size.

3.4 D1 versus D2

As mentioned above, in an asymptotic sense the CVT should converge to the Bayes classifier. Hence we should have D1 ≈ D2 for a large sample size. In the small-sample case, this may not hold. It is interesting to observe that the two (approximately) normally distributed random variables D1 and D2 seem to fit nicely into a simple linear regression model estimated by ordinary least squares. In particular, we conjecture that the following is approximately true in finite-sample situations:

$$ D_1 = a + c \cdot D_2 + \varepsilon, $$
(10)

where the error ɛ has zero mean and constant variance and the y-intercept a is a constant. Moreover, we have 0 < c < 1, and the value of c together with the variance of ɛ depends on the underlying geometry, the sample size, and the ratio between the numbers of training and testing samples. A direct application is as follows: a combination of a larger c and a smaller variance of ɛ indicates a scenario amenable to the CV approach. Such an observation can be utilized to develop guidelines on when to apply CV when the data size is small. Evidently, such guidelines are important in applications. This paper focuses on the framework instead of deriving exact guidelines for specific models.
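A minimal sketch (not the authors' code) of fitting the conjectured relation (10) by ordinary least squares, given paired values of D1 and D2 collected over repeated simulations:

```python
import numpy as np

def fit_d1_on_d2(D1, D2):
    """OLS fit of D_1 = a + c * D_2 + eps; returns intercept a, slope c, and R^2."""
    X = np.column_stack([np.ones_like(D2), D2])
    (a, c), *_ = np.linalg.lstsq(X, D1, rcond=None)
    resid = D1 - (a + c * D2)
    r2 = 1.0 - resid.var() / D1.var()
    return a, c, r2

# Placeholder data standing in for simulation output (paired D1, D2 values).
rng = np.random.default_rng(0)
D2 = rng.normal(0.0, 0.03, 200)
D1 = 0.8 * D2 + rng.normal(0.0, 0.01, 200)
print(fit_d1_on_d2(D1, D2))   # slope near 0.8 on this synthetic example
```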

4 Simulations

4.1 The setting

Data points X are uniformly generated in a unit square. Figure 4 illustrates 200 data points. Decision boundaries are as follows.

  • Case 1: \(X \in B_r\), where \(B_r\) is the rectangular decision boundary defined by \(0.2 \le X_1 \le 0.8\) and \(0.3 \le X_2 \le 0.8\).

  • Case 2: \(X \in B_c\), where \(B_c\) is the circular decision boundary defined by \((X_1-0.5)^2 + (X_2-0.5)^2 < 0.2^2\).

  • Case 3: \(X \in B_t\), where \(B_t\) is the triangular decision boundary defined by \(X_2 > 0.2\), \(X_2 < 2X_1 - 0.2\), and \(X_2 < -2X_1 + 1.8\).

Fig. 4 Illustration of 200 simulated data points with three different decision boundaries: a rectangular decision boundary, b circular decision boundary, and c triangular decision boundary

The success probability p plays the following role:

  • If input \(X \in B_x\) for \(x \in \{r, c, t \}\), then we have response

    $$ Y = \left \{\begin{array}{ll} 1 & {\hbox{with}\;\hbox{probability}\; (\hbox{w.p}.)} \quad p, \hfill \\ 0 & {\hbox{w.p}.} \quad 1-p. \end{array}\right.$$
  • If input \(X \notin B_x\), then we have response

    $$ Y = \left \{ \begin{array}{ll} 1 & {\hbox{w.p.}} \quad 1-p, \hfill \\ 0 & {\hbox{w.p.}} \quad p. \end{array}\right.$$

For each generated dataset, we compute the aforementioned error rates (e1,A, e2,A, e1,B, and e2,B). Recall that 10-fold CV is used to obtain e2,A and e2,B.
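A minimal sketch of the data-generating mechanism described above; the boundary definitions follow Cases 1–3, and the label is drawn with probability p of being 1 inside the boundary and the complementary probability 1 − p outside.

```python
import numpy as np

def in_boundary(X, shape):
    """Membership indicator for the rectangular, circular, or triangular boundary."""
    x1, x2 = X[:, 0], X[:, 1]
    if shape == "rectangle":
        return (0.2 <= x1) & (x1 <= 0.8) & (0.3 <= x2) & (x2 <= 0.8)
    if shape == "circle":
        return (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2 < 0.2 ** 2
    if shape == "triangle":
        return (x2 > 0.2) & (x2 < 2 * x1 - 0.2) & (x2 < -2 * x1 + 1.8)
    raise ValueError(shape)

def generate(n, p, shape, rng):
    """Uniform X in the unit square; Y = 1 w.p. p inside the boundary, 1 - p outside."""
    X = rng.uniform(size=(n, 2))
    prob_one = np.where(in_boundary(X, shape), p, 1 - p)
    Y = (rng.uniform(size=n) < prob_one).astype(int)
    return X, Y

X, Y = generate(300, p=0.1, shape="rectangle", rng=np.random.default_rng(0))
```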

4.2 Relation between D1 and D2 and effects of the geometry of decision boundaries

To identify a statistical relation between D1 and D2, we utilize linear regression analysis, with D1 as the response variable and D2 as the predictor variable. Figure 5 shows the fitted regression lines between D1 and D2 for the three types of decision boundaries. Our experimental results with finite sample sizes show that the differences between testing and training errors from the Bayes classifier (D1) and the CVT (D2) are not zero but can be modeled by a simple linear regression model.

Fig. 5 Regression plots between D2 and D1 with rectangular, circular, and triangular boundaries. Regression lines are generated with p = 0.1 and sample size 300. The diagonal lines (D1 = D2) are plotted as references

The slopes of the regression lines are shown in Table 1. Note that the sample size does not significantly affect the slope. It can also be seen that the slopes of the regression models through the origin (values in parentheses in Table 1) are not significantly different from those of the regression models with intercepts. This small difference, however, does not imply that the y-intercepts are statistically zero.

Table 1 Slopes of the regression models with different decision boundaries and dataset sizes

Table 1 shows that the relationship between the testing-to-training error differences of the BC and the CVT is affected by the geometry of the decision boundaries. The closer the slope is to one, the smaller the difference between D1 and D2. It is not hard to see why a rectangular decision boundary has a larger slope than the other decision boundaries: recursive binary splitting of the feature space in tree-based models produces axis-parallel rectangular regions. Furthermore, Table 2 shows the coefficient of determination (R2) for each boundary. A rectangular boundary yields a larger R2 than the others, which suggests a stronger linear association between D1 and D2. In other words, a CVT with a rectangular decision boundary behaves more like the Bayes classifier than one with a circular or triangular boundary.

Table 2 Coefficient of determination (R 2) with different decision boundaries and sample sizes

4.3 The effect of the parameters in an underlying distribution

Recall that the underlying distribution in our simulation is Bernoulli(p), where p is the constant misclassification rate over the entire state space (see Sect. 3.1). Table 3 lists the slopes of the regression lines for different values of p with a rectangular decision boundary.

Table 3 Slopes in a regression line with different parameters

The other geometries of decision boundaries give similar results. Figure 6 displays boxplots of the slopes for different parameter values. It shows that parameter values between 0.1 and 0.2 produce a strong linear relationship between the BC and the CVT; however, this relation becomes weaker as the parameter value becomes either extremely small or close to 0.5. It is not difficult to explain why D1 and D2 have a weak linear relationship as p approaches 0.5. If p equals 0.5, the two classes are equally likely both inside and outside the decision boundary. In this case, classification outcomes are driven mostly by randomness rather than by the decision rule, and this randomness weakens the relationship between the two classifiers. For small p (e.g., p = 0.01), the relationship between the two classifiers is very sensitive to changes in error rates, because both classifiers yield very small error rates; such high sensitivity results in a weak relationship between the two classifiers. The R2 values for the above regression analyses show patterns similar to those of the slopes (this result is not reported).

Fig. 6 Boxplots of slopes in a regression line with different parameters. Plots are generated with the rectangular decision boundary

4.4 The effect of the sample size

We study how the relationship between the testing and training errors of both classifiers varies with sample size. First we consider five different total sample sizes (training plus testing): 100, 200, 300, 400, and 500. For each sample size, we consider five different ratios of testing to training samples: 1:3, 1:2, 1:1, 2:1, and 3:1. Table 4 shows the slopes of the regression lines for the different ratios of training and testing sample sizes. Again, because the intercepts of the regression lines are not statistically significant, we consider the slopes with zero intercept, shown in parentheses in Table 4. Figure 7 illustrates a contour plot in which the x- and y-axes represent the total sample size and the ratio of testing to training samples. For example, if the values on the x- and y-axes are 300 and 2, the experiment is performed using 100 training samples and 200 testing samples. The z-axis (the values on the contour plot) indicates the slope of the corresponding regression line. This plot can be used to determine the ratio of testing to training sample size needed to achieve a targeted performance. For instance, if we want our CVT classifier to behave like \(\frac{1}{0.84677}\) of the Bayes classifier (i.e., a slope of 0.84677), then for each total sample size on the x-axis the corresponding contour suggests the required testing-to-training ratio. The results also indicate that the slopes stabilize when the sample size exceeds 300, which may indicate the borderline between the large-sample (asymptotic) regime and the finite-sample regime.

Table 4 Slopes in a regression line with different sizes and ratio of training and testing sets
Fig. 7 Contour plot of slopes with different sizes and ratios of training and testing sets. Plot is generated with the rectangular decision boundary

5 Conclusions

An experimental study is presented to measure the performance of CV in tree-based models. We compare a CVT with a Bayes classifier, the latter derived from knowledge of the underlying distribution, and we focus on the finite-sample case. The main observation is that the differences between testing and training errors from the CVT and from the Bayes classifier follow a simple linear regression model. The slope of the regression line and the variance of the random error can serve as measures of how well CV works in a particular situation for tree-based models. R2 is employed to validate the relation: a slope and an R2 both close to one suggest a strong relationship between the two classifiers. In addition, it is demonstrated that the above relation is influenced by other factors, such as the shape of the decision boundary, the probabilistic parameter of the underlying distribution, and the sample size. It should be noted that because our current study was conducted solely with 10-fold CV, the results may not be generally applicable to CV with a different number of folds. There are interesting directions for future research: one can extend our study to other learning algorithms, such as support vector machines and neural networks. The authors believe that the finite-sample behavior of the cross-validation error is a fascinating research topic.