1 Introduction

Cross validation is a widely used technique to estimate how well the performance of a prediction model on a given in-sample dataset generalizes to an unseen out-of-sample dataset. Take multiple linear regression for example, which predicts the response of given explanatory data \(x \in \mathbb {R}^{n \times p}\) with a linear function: \(\hat{y} = x \beta \). For a given in-sample dataset \((x^\text {in}, y^\text {in})\), the parameter \(\beta \) can be trained to fit the in-sample data with minimal distance between actual and predicted responses: \(\beta ^* \in \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _\beta \{||y^\text {in} - x^\text {in} \beta ||\}\). However, the model \(\hat{y} = x \beta ^*\) is not expected to fit an unseen out-of-sample dataset \((x^\text {out}, y^\text {out})\) as well as it fits the in-sample dataset it was trained on, a phenomenon known as overfitting [1, 2]. The essence of overfitting is to mistakenly extract noise as representative of the underlying model structure [3]. This phenomenon is particularly common [4] when the number of explanatory variables, p, far exceeds the number of data points, n. Bartlett et al. [5] analyzed the mysterious phenomenon of benign overfitting, which allows deep neural networks to achieve good prediction accuracy despite a perfect fit to noisy training data. They found that, for linear regression, overfitting cannot be benign unless the unimportant directions in parameter space significantly outnumber the sample size. When this characterization is not satisfied, overfitting often leads to large out-of-sample prediction errors. As a remedy to this challenge, cross validation can be used to estimate the extent of overfitting and to select prediction models that are less overfitted [1, 6, 7].

In order to estimate the performance of a prediction model on out-of-sample data, a common practice in cross validation is to partition the given in-sample data into training and validation subsets, train the model on the training data, calculate the prediction error on the validation data, and use the validation error as an estimate of the out-of-sample performance [8].

Cross validation criteria are either exhaustive or non-exhaustive. Exhaustive criteria evaluate the performance of a prediction model on all possible partitions of training and validation datasets subject to cardinality constraints. As perhaps the most popular exhaustive cross validation criterion, leave-k-out (LKO) consists of three steps. Step 1, enumerate all possible partitions that divide a given in-sample dataset of n data points into a training subset with \(n-k\) data points and a validation subset with the remaining k. Step 2, for each partition, train the prediction model and calculate its corresponding validation error. Step 3, use the average validation error as the LKO criterion [9, 10]. Due to the computational complexity, only the special case of leave-1-out is widely used [8, 11, 12]. Commonly used non-exhaustive cross validation criteria include m-fold [13, 14], holdout [15], random sampling [16], etc. It is worth mentioning that leave-k-out is the exhaustive counterpart of (n/k)-fold when n is an integer multiple of k. As a special case, leave-1-out and n-fold are equivalent.
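As a concrete illustration of these three steps, the following Python sketch enumerates all \(C_n^k\) partitions and averages the validation errors. It assumes x and y are NumPy arrays and uses generic fit and error callables standing in for the prediction model and the error norm; all names are illustrative and not taken from the original paper.

```python
from itertools import combinations
import numpy as np

def leave_k_out(x, y, k, fit, error):
    """Leave-k-out criterion: average validation error over all C(n, k) partitions.

    fit(x_train, y_train) -> trained model
    error(model, x_val, y_val) -> scalar validation error
    """
    n = len(y)
    errors = []
    for val_idx in combinations(range(n), k):          # Step 1: every validation subset of size k
        val = np.array(val_idx)
        train = np.setdiff1d(np.arange(n), val)        # the remaining n - k training points
        model = fit(x[train], y[train])                # Step 2: train the model and ...
        errors.append(error(model, x[val], y[val]))    # ... compute its validation error
    return float(np.mean(errors))                      # Step 3: average over all partitions
```

With \(k=1\) the loop runs only n times, which is why leave-1-out remains tractable; for general k the number of partitions grows combinatorially.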

Exhaustive and non-exhaustive cross validation criteria have complementary strengths and limitations. Exhaustive criteria, on the one hand, take all possible partitions of training and validation subsets into consideration, at the cost of computational intractability. Non-exhaustive criteria, on the other hand, trade solution quality for computational tractability by sampling only some of the partitions.

Recent studies in the literature have compared different cross validation methods, proposed new variants, and applied these methods in different application fields. Magnusson et al. [17] proposed a new method for model comparison by combining fast approximate leave-1-out surrogates with exact leave-1-out sub-sampling. Xu et al. [18] proposed to improve the effectiveness of m-fold by splitting the training dataset before applying m-fold to each split and pooling the prediction errors. Jung [19] suggested a variant of m-fold, which uses \(m-1\) folds of data for model validation and the remaining fold for model construction. Ramezan [20] compared three cross validation methods (leave-1-out, m-fold, and Monte Carlo) using high spatial resolution remotely sensed datasets and found “minimal differences in accuracy for the different cross-validation tuning methods.” Duarte [21] found cross validation to be more effective than internal metrics for tuning SVM hyperparameters. Recent applications of cross validation include detecting Alzheimer's disease [22], predicting basketball game outcomes [23], cryo-EM map reconstruction [24], and wind energy prediction [25].

The proposed leave-worst-k-out (LWKO) criterion can be defined as a variant of LKO. With the first two steps of the aforementioned LKO definition unchanged, LWKO uses the largest (as opposed to the average) of the \(C_n^k\) validation errors in the third step. As such, LWKO has two advantages over the LKO and m-fold criteria. First, LWKO is computationally more tractable than LKO. To compute LKO, all \(C_n^k\) validation errors need to be calculated, whereas LWKO only requires identifying the largest error, without having to calculate the exact values of the others. In Sect. 2.2, we present a random sampling algorithm for calculating LWKO approximately, whose computational efficiency is lower than, but comparable to, that of leave-1-out and m-fold. In Sect. 2.3, for the special case of using multiple linear regression as the prediction model under the \({\mathcal {L}}_1\) norm, we present a mixed integer linear programming (MILP) formulation for computing the exact value of the LWKO criterion. Second, LWKO was found to be more effective than LKO and m-fold in identifying data points that are outliers with respect to the prediction model, which may be caused by underfitting or overfitting. The LKO and m-fold criteria favor models that fit the validation data well on average, allowing a large number of good fits to compensate for the bad performance of a few outliers. In contrast, the LWKO criterion favors models whose worst outliers are not too outlying, even though the average fit may not be the best possible.
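Using \(e(\mathcal {T}, \mathcal {V})\) as a shorthand (ours, not the paper's formal notation of Sect. 2.1) for the validation error of the model trained on subset \(\mathcal {T}\) and validated on subset \(\mathcal {V}\), the two criteria differ only in how the \(C_n^k\) validation errors are aggregated:

$$\begin{aligned} \text {LKO}_k = \dfrac{1}{C_n^k} \sum _{(\mathcal {T}, \mathcal {V})} e(\mathcal {T}, \mathcal {V}), \qquad \text {LWKO}_k = \max _{(\mathcal {T}, \mathcal {V})} e(\mathcal {T}, \mathcal {V}), \end{aligned}$$

where the sum and the maximum range over all partitions of the n in-sample data points into a training subset \(\mathcal {T}\) of size \(n-k\) and a validation subset \(\mathcal {V}\) of size k.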

2 Methods

2.1 Definition of LWKO as a bilevel optimization model

The LWKO criterion is defined as the largest possible validation error when the model is trained using \(n-k\) data points and validated using the remaining k data points. Since there are \(C_n^k\) possible such partitions of training and validation subsets, a brute force algorithm for calculating LWKO would require training the prediction model and calculating its validation error \(C_n^k\) times. In this section, we present the following bilevel optimization model (1)–(6), whose optimal solution yields the LWKO criterion:

$$\begin{aligned} \max _{w, \beta , \mathcal {T}, \mathcal {V}}&||y_\mathcal {V} - h(x_\mathcal {V}|\beta )||_\mathcal {L} \end{aligned}$$
(1)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}&\sum _j w_j = n - k&\end{aligned}$$
(2)
$$\begin{aligned}&w \in \mathbb {B}^{n}&\end{aligned}$$
(3)
$$\begin{aligned}&\mathcal {T} = \{j: w_j = 1\}&\end{aligned}$$
(4)
$$\begin{aligned}&\mathcal {V} = \{j: w_j = 0\}&\end{aligned}$$
(5)
$$\begin{aligned}&\beta \in {{\,\mathrm{arg\,min}\,}}_b ||y_\mathcal {T} - h(x_\mathcal {T}|b)||_\mathcal {L}.&\end{aligned}$$
(6)

Here, the upper level objective function (1) maximizes the validation error under norm \(\mathcal {L}\). Function \(h(x_\mathcal {V}|\beta )\) predicts the response from validation data \(x_\mathcal {V}\) using parameter \(\beta \). Variable w encodes the partition decision: \(w_i=1\) or \(w_i=0\) means that data point i is used in the training set, \(\mathcal {T}\), or the validation set, \(\mathcal {V}\), respectively. Constraints (2) and (3) require w to be a binary vector with exactly \(n-k\) elements equal to 1 and k equal to 0. Constraints (4) and (5) define the training set, \(\mathcal {T}\), and the validation set, \(\mathcal {V}\), respectively. The lower level (6) finds the optimal parameter \(\beta \) that minimizes the training error under norm \(\mathcal {L}\).

This bilevel optimization model (1)–(6) suggests three underlying approaches to computing the LWKO criterion, exactly or approximately. First, train the prediction model for all \(C_n^k\) partitions in a brute force manner, compute their corresponding validation errors, and use the largest one as LWKO. Second, train the prediction model on a random sample of partitions and use the largest validation error to approximate LWKO. Third, take advantage of special (such as linear) properties of the function \(h(x|\beta )\) to identify one training-validation partition that results in the largest validation error, without having to compute the exact values of all other validation errors. The latter two approaches are discussed in more detail in the following two subsections.

2.2 Random sampling algorithm for computing LWKO approximately

We present a random sampling algorithm for calculating LWKO approximately, which is applicable to a general prediction function \(h(x|\beta )\) under a general norm \(\mathcal {L}\). This algorithm trains the prediction model for a randomly sampled subset of partitions and uses the largest validation error among the samples as an approximate (and under-estimated) value of LWKO. The premise of this algorithm is that approximate values of LWKO can still be informative for model selection as long as all prediction models are cross validated with the same approximate algorithm.

Algorithm 1: Random sampling algorithm for computing LWKO approximately (pseudocode given as a figure in the original article)
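As a hedged illustration of the idea (not a reproduction of the pseudocode), the following Python sketch samples s random partitions and keeps the largest validation error; the parameter name s, the sampling scheme, and the function names are ours.

```python
import numpy as np

def leave_worst_k_out_sampled(x, y, k, s, fit, error, seed=None):
    """Approximate LWKO: largest validation error over s randomly sampled partitions.

    Because only s of the C(n, k) partitions are examined, the returned value
    under-estimates the exact LWKO criterion.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    worst = -np.inf
    for _ in range(s):
        val = rng.choice(n, size=k, replace=False)     # random validation subset of size k
        train = np.setdiff1d(np.arange(n), val)        # remaining n - k training points
        model = fit(x[train], y[train])
        worst = max(worst, error(model, x[val], y[val]))
    return worst
```

Passing the same seed when scoring different candidate models keeps the comparison on a common set of sampled partitions, in the spirit of the premise stated above; this design choice is ours.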

2.3 LWKO for multiple linear regression under the \(\mathcal {L}_1\) norm

In this section, we consider the special case of using multiple linear regression as the prediction function, \(h(x|\beta ) = x \beta \), under the \(\mathcal {L}_1\) norm and present an approach for calculating the LWKO criterion exactly. In this special case, model (1)–(6) reduces to the following bilevel optimization model:

$$\begin{aligned} \max _{w, \beta }&\dfrac{1}{k} (1-w)^\top |y-x \beta | \end{aligned}$$
(7)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}&1^\top w = n - k&\end{aligned}$$
(8)
$$\begin{aligned}&w \in \mathbb {B}^{n}&\end{aligned}$$
(9)
$$\begin{aligned}&\beta \in \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _b \left\{ \dfrac{1}{n-k} w^\top |y-x b| \right\} . \end{aligned}$$
(10)

Then, we reformulate model (7)–(10) as the following MILP, which can be solved efficiently by numerous algorithms [26,27,28] and commercial solvers such as CPLEX [29] and GUROBI [30]:

$$\begin{aligned} \max _{w, l, u, \lambda , \mu , r, \beta }&\dfrac{1}{k} \left[ 1^\top r - (n-k) y^\top (\mu - \lambda ) \right] \end{aligned}$$
(11)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}&r \le M l + x \beta - y \end{aligned}$$
(12)
$$\begin{aligned}&r \le M u + y -x \beta \end{aligned}$$
(13)
$$\begin{aligned}&l + u \le 1&\end{aligned}$$
(14)
$$\begin{aligned}&1^\top w = n - k&\end{aligned}$$
(15)
$$\begin{aligned}&r \ge x \beta - y&\end{aligned}$$
(16)
$$\begin{aligned}&r \ge y - x \beta&\end{aligned}$$
(17)
$$\begin{aligned}&x^\top (\mu - \lambda ) = 0&\end{aligned}$$
(18)
$$\begin{aligned}&\lambda + \mu = \dfrac{w}{n-k}&\end{aligned}$$
(19)
$$\begin{aligned}&w, l, u \in \mathbb {B}^{n}; \lambda , \mu \ge 0; r, \beta \text { free}.&\end{aligned}$$
(20)

Theorem 1

Bilevel model (7)–(10) and MILP (11)–(20) are equivalent in the sense that \((w,\beta )\) is an optimal solution to (7)–(10) if and only if there exists \((l, u, \lambda , \mu , r)\) such that \((w, l, u, \lambda , \mu , r, \beta )\) is an optimal solution to (11)–(20). As such, the LWKO criterion can be calculated by solving the MILP (11)–(20).

Proof

This proof consists of three steps of equivalent reformulations. First, we reformulate (7)–(10) as follows to linearize the absolute value functions in both upper and lower levels:

$$\begin{aligned} \max _{w, l, u, r, \beta }&\dfrac{1}{k} (1-w)^\top r \end{aligned}$$
(21)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}&r \le M l + x \beta - y \end{aligned}$$
(22)
$$\begin{aligned}&r \le M u + y -x \beta \end{aligned}$$
(23)
$$\begin{aligned}&l + u \le 1&\end{aligned}$$
(24)
$$\begin{aligned}&1^\top w = n - k&\end{aligned}$$
(25)
$$\begin{aligned}&w, l, u \in \mathbb {B}^{n}; r \text { free}&\end{aligned}$$
(26)
$$\begin{aligned}&\beta \in \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{s,b} \left\{ \dfrac{1}{n-k} w^\top s: x b - y \le s; y - x b \le s \right\} \end{aligned}$$
(27)

Linearization of the absolute value under minimization at the lower level is straightforward. At the upper level, which is a maximization problem, the linearization requires additional binary variables l and u, a free variable r, a big-M parameter M, and new constraints to establish \(r = |y - x \beta |\). Constraints (22)–(24) ensure that \(r \le |y - x \beta |\) (for instance, when \(y_i \ge x_i \beta \), setting \(l_i = 1\) and \(u_i = 0\) relaxes (22) while (23) reads \(r_i \le y_i - x_i \beta = |y_i - x_i \beta |\)), and the maximization objective closes any remaining gap between r and \(|y - x \beta |\).

Second, we replace constraint (27) with the optimality conditions of the underlying linear program:

$$\begin{aligned} r \ge x \beta - y \end{aligned}$$
(28)
$$\begin{aligned} r \ge y - x \beta \end{aligned}$$
(29)
$$\begin{aligned} x^\top (\mu - \lambda ) = 0 \end{aligned}$$
(30)
$$\begin{aligned} \lambda + \mu = \dfrac{w}{n-k} \end{aligned}$$
(31)
$$\begin{aligned} \lambda , \mu \ge 0, \end{aligned}$$
(32)

where \(\beta \) and r correspond to, respectively, decision variables b and s at optimality.
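For readability, we note (as a standard linear programming duality step, not spelled out in the original) that \(\lambda \) and \(\mu \) are the dual multipliers of the lower-level constraints \(x b - y \le s\) and \(y - x b \le s\), respectively, so that the dual of the lower-level linear program reads

$$\begin{aligned} \max _{\lambda , \mu \ge 0} \left\{ y^\top (\mu - \lambda ): \; x^\top (\mu - \lambda ) = 0, \; \lambda + \mu = \dfrac{w}{n-k} \right\} . \end{aligned}$$

Constraints (28)–(29) are thus primal feasibility and constraints (30)–(32) are dual feasibility of the lower-level problem.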

Finally, according to strong duality, we have \(\dfrac{1}{n-k} w^\top r = y^\top (\mu - \lambda )\), which helps convert the bilinear objective function \(\dfrac{1}{k} (1-w)^\top r\) into an equivalent linear one \(\dfrac{1}{k} \left[ 1^\top r - (n-k) y^\top (\mu - \lambda ) \right] \). \(\square \)
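The authors' own implementation of model (11)–(20) is in Matlab with Gurobi (Sect. 3.3). Purely as a sketch, assuming the Python interface gurobipy and its matrix-variable API are available, the MILP could be assembled as follows; the big-M value, the 10-minute time limit, and all variable names are our choices, not the authors'.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def lwko_milp(x, y, k, big_m=1e4):
    """Exact LWKO for linear regression under the L1 norm via MILP (11)-(20)."""
    n, p = x.shape
    model = gp.Model("LWKO")
    w = model.addMVar(n, vtype=GRB.BINARY, name="w")
    l = model.addMVar(n, vtype=GRB.BINARY, name="l")
    u = model.addMVar(n, vtype=GRB.BINARY, name="u")
    lam = model.addMVar(n, lb=0.0, name="lambda")
    mu = model.addMVar(n, lb=0.0, name="mu")
    r = model.addMVar(n, lb=-GRB.INFINITY, name="r")
    beta = model.addMVar(p, lb=-GRB.INFINITY, name="beta")

    # Objective (11): linearized worst validation error
    model.setObjective((1.0 / k) * (r.sum() - (n - k) * (y @ mu - y @ lam)), GRB.MAXIMIZE)
    # Constraints (12)-(14): big-M upper bounds so that r can reach |y - x beta|
    model.addConstr(r <= big_m * l + x @ beta - y)
    model.addConstr(r <= big_m * u + y - x @ beta)
    model.addConstr(l + u <= 1)
    # Constraint (15): exactly n - k training points
    model.addConstr(w.sum() == n - k)
    # Constraints (16)-(17): r >= |y - x beta|
    model.addConstr(r >= x @ beta - y)
    model.addConstr(r >= y - x @ beta)
    # Constraints (18)-(19): dual feasibility of the lower-level linear program
    model.addConstr(x.T @ mu - x.T @ lam == 0)
    model.addConstr((n - k) * (lam + mu) == w)

    model.Params.TimeLimit = 600  # 10-minute cap, mirroring the experiments
    model.optimize()
    return model.ObjVal, w.X, beta.X
```

The returned objective value is the LWKO criterion. Note that a valid big-M must upper-bound \(|y_i - x_i \beta |\) at optimality, so in practice it should be chosen from the data rather than fixed a priori.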

3 Computational experiments

3.1 Data

We collected two data sets from publicly available sources for our computational experiments. The first dataset was from regression problem 3 of the CoEPrA (Comparative Evaluation of Prediction Algorithms) 2006 competition [31]. The explanatory data x has 133 rows and \(p=5,787\) columns, representing 133 nonapeptides, each described by 5,787 descriptors; the response data y represents the property of the 133 nonapeptides.

The second dataset was from the DRYAD repository: DOI:10.5061/dryad.fc55k [32]. After the removal of incomplete data points, the explanatory data x has 2,034 rows and \(p=353\) columns, representing 2,034 mouse specimens, each genotyped at 353 single nucleotide polymorphisms on the 19 autosomal chromosomes; total body weight was used as the response data y. The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

3.2 Design of experiments

3.2.1 Experiment 1

Experiment 1 used the first dataset to represent the \(n \ll p\) scenario. Eight cross validation criteria were selected to test the performance of the LWKO criterion in comparison with its counterpart LKO or (n/k)-fold for a range of k values under the \(\mathcal {L}_1\) norm. For \(k=1\) and \(k=2\), LKO was calculated with brute force enumeration; for \(k=4\) and \(k=8\), enumeration became computationally expensive, so (n/k)-fold was used instead as the counterpart. This experiment consisted of the following six steps.

  • Step 1 Randomly partition the data into an in-sample subset that consists of n = 24 data points and an out-of-sample subset that consists of the remaining 109 data points.

  • Step 2 Build ten models by solving the best subset problem with the parameter \(t \in \{2, 4, 6, 8, 10, 12, 14, 16, 18, 20\}\) (a solver-based sketch of this step is given after this list):

    $$\begin{aligned} \min \limits _\beta \left\{ \dfrac{1}{n} ||y-x \beta ||_{\mathcal {L}_1}: ||\beta ||_0 \le t \right\} . \end{aligned}$$

  • Step 3 Cross validate the ten models from Step 2 with the following eight criteria using the in-sample data:

    ❶ Leave-1-out (L-1-O)

    ❷ Leave-worst-1-out (LW-1-O)

    ❸ Leave-2-out (L-2-O)

    ❹ Leave-worst-2-out (LW-2-O)

    ❺ 6-fold

    ❻ Leave-worst-4-out (LW-4-O)

    ❼ 3-fold

    ❽ Leave-worst-8-out (LW-8-O)

    The LW-1-O and LW-2-O criteria were calculated with brute force enumeration; LW-4-O and LW-8-O were calculated using both the MILP model (11)–(20) (with a time limit for the solver) and the random sampling algorithm 1 to compare the performances of exact and approximate algorithms.

  • Step 4 For each cross validation criterion, select one model with the smallest cross validation error.

  • Step 5 Calculate the mean absolute errors of the eight selected models using out-of-sample data.

  • Step 6 Repeat Steps 1–5 three hundred times with different random partitions of in-sample and out-of-sample subsets.
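The paper does not spell out how the best subset problems in Step 2 were solved. One standard route, sketched below under the same gurobipy assumption as before, is a big-M mixed integer formulation in which a binary variable \(z_j\) switches coefficient \(\beta _j\) on or off; the bound big_m on \(|\beta _j|\) is an assumption of the sketch.

```python
import gurobipy as gp
from gurobipy import GRB

def best_subset_l1(x, y, t, big_m=1e4):
    """min (1/n) * ||y - x beta||_1  subject to  ||beta||_0 <= t  (big-M sketch)."""
    n, p = x.shape
    model = gp.Model("best-subset-L1")
    beta = model.addMVar(p, lb=-big_m, ub=big_m, name="beta")
    z = model.addMVar(p, vtype=GRB.BINARY, name="z")    # z_j = 1 allows beta_j to be nonzero
    e = model.addMVar(n, lb=0.0, name="e")              # e_i >= |y_i - x_i beta|

    model.setObjective((1.0 / n) * e.sum(), GRB.MINIMIZE)
    model.addConstr(e >= x @ beta - y)
    model.addConstr(e >= y - x @ beta)
    model.addConstr(beta <= big_m * z)                  # link beta_j to its switch z_j
    model.addConstr(beta >= -big_m * z)
    model.addConstr(z.sum() <= t)                       # at most t nonzero coefficients

    model.optimize()
    return beta.X
```

Looping t over \(\{2, 4, \ldots , 20\}\) would then produce the ten candidate models of Step 2.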

3.2.2 Experiment 2

Experiment 2 was designed to use the second dataset to test the performance of LWKO for a general prediction model, random forest, under the \(\mathcal {L}_2\) norm using realistic explanatory and response data. This experiment had the following six steps.

  • Step 1 Randomly partition the data into an in-sample subset that consists of n = 1000 data points and an out-of-sample subset that consists of the remaining 1034 data points.

  • Step 2 Build ten random forest models with the number of variables selected at random for each decision split, q, taking values \(q \in \{20, 40, 60, 80, 100, 120, 140, 160, 180, 200\}\). All other hyper-parameters were kept the same (100 trees in the forest, each created using 200 random data points); an illustrative implementation sketch is given after this list.

  • Step 3 Cross validate the ten models from Step 2 with the following eight criteria using the in-sample data:

    ❶ Leave-1-out (L-1-O)

    ❷ Leave-worst-1-out (LW-1-O)

    ❸ 200-fold

    ❹ Leave-worst-5-out (LW-5-O)

    ❺ 100-fold

    ❻ Leave-worst-10-out (LW-10-O)

    ❼ 50-fold

    ❽ Leave-worst-20-out (LW-20-O)

    The LW-1-O criterion was calculated with brute force enumeration; LW-5-O, LW-10-O, and LW-20-O were calculated using the random sampling algorithm 1, in which the parameter s was set to \(\dfrac{10n}{k}\) so that each data point is used ten times on average to build binary decision trees (for \(k \in \{5, 10, 20\}\) and \(n = 1,000\), this corresponds to \(s = 2,000\), \(1,000\), and 500 sampled partitions, respectively).

  • Step 4 For each cross validation criterion, select one model with the smallest cross validation error.

  • Step 5 Calculate the root mean square errors of the eight selected models using out-of-sample data.

  • Step 6 Repeat Steps 1–5 three hundred times with different random partitions of in-sample and out-of-sample subsets.
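The models in Step 2 were built in Matlab (Sect. 3.3 mentions fitrtree.m). Purely for illustration, an analogous setup in Python with scikit-learn could look as follows; the library choice is ours, and its bootstrap sub-sampling of 200 points per tree is only an approximation of "each tree created using 200 random data points."

```python
from sklearn.ensemble import RandomForestRegressor

Q_VALUES = (20, 40, 60, 80, 100, 120, 140, 160, 180, 200)

def build_forests(x_in, y_in):
    """Ten random forests: q variables tried per split, 100 trees, 200 points per tree."""
    models = []
    for q in Q_VALUES:
        forest = RandomForestRegressor(
            n_estimators=100,   # 100 trees in the forest
            max_features=q,     # q candidate variables at each decision split
            max_samples=200,    # each tree grown on 200 randomly drawn data points
            bootstrap=True,     # required for max_samples; sampling is with replacement
            random_state=0,
        )
        models.append(forest.fit(x_in, y_in))
    return models
```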

Experiment 2 had three major differences from experiment 1. First, the linear regression models in experiment 1 were created by solving the best subset problem, whereas random forest was used as the prediction model in experiment 2. Second, experiment 1 represented a scenario with a small number of data points (\(n=24\)) and a large number of variables (\(p=5,787\)). In contrast, experiment 2 had a larger number of data points (\(n=1,000\)) and a smaller number of variables (\(p=353\)). Third, in experiment 1, both the MILP formulation (11)–(20) and the random sampling algorithm 1 were used to compute LWKO, whereas only the random sampling algorithm 1 was used under the \(\mathcal {L}_2\) norm in experiment 2.

3.3 Simulation results

The two experiments were carried out in Matlab on a laptop with 16 GB of RAM and a 2.8 GHz CPU. Gurobi with default settings was used as the MILP solver for model (11)–(20); a time limit of 10 minutes was imposed on the solver, although it was able to terminate ahead of time in the majority of instances. The Matlab code for solving the MILP (11)–(20) was uploaded to https://github.com/lzwang2017/LWKO. The Matlab function fitrtree.m was used to fit binary decision trees for regression.

3.3.1 Results from experiment 1

Table 1 Results from a sample simulation of experiment 1

Table 1 gives an example simulation result from experiment 1. Each column corresponds to a multiple linear regression model with t non-zero variable effects. The in-sample error, error\(_\text {in}\), decreases as t increases from 2 to 20. In contrast, the out-of-sample errors, error\(_\text {out}\), of the ten models do not have such a monotonic property; the errors may be high for very small or very large values of t due to underfitting or overfitting, respectively. An ideal cross validation criterion would be able to estimate the out-of-sample performance of the models using only in-sample data. The ten models were validated against the eight cross validation criteria using the in-sample data, and the validation errors are given in the table in four groups of rows, allowing comparison between LKO or (n/k)-fold and the proposed LWKO criteria, as well as between the exact and approximate algorithms for computing LWKO. For each of the eight criteria (ten rows in total, since LW-4-O and LW-8-O are each reported for both the MILP and the random sampling algorithm), the smallest validation error is shown in bold, and the corresponding \(t^*\) that achieved it is recorded. As a benchmark, the smallest error\(_\text {out}\) and its corresponding \(t^*\) are also recorded, representing the best possible performance of any cross validation criterion in this experiment.

The sample simulation of Table 1 was repeated 300 times randomly and independently, each time with a different partition of the in-sample and out-of-sample data sets. Average computation times for the ten models under the LW-4-O (MILP) and LW-8-O (MILP) criteria are shown in Table 2. The histograms of \(t^*\) for the selected models and of the corresponding error\(_\text {out}\) are plotted in the two subfigures of Fig. 1. We make three observations. First, when used with a very small k value (e.g., \(k=1\) or \(k=2\)), LKO may not be an effective indicator of the model's out-of-sample performance. Second, compared with LKO for the same k value, the LWKO criterion is more likely to lead to a smaller out-of-sample error. Third, the approximate LWKO calculated using the random sampling algorithm could be an even more effective indicator of the model's out-of-sample performance than the exact LWKO.

Table 2 Average computation time (in seconds) for ten different models under LW-4-O (MILP) and LW-8-O (MILP) criteria in experiment 1
Fig. 1

Left Histograms of \(t^*\) for the selected models of eight cross validation criteria using the in-sample data over 300 random simulations in experiment 1. Results for the “optimal” criterion were obtained after observing the out-of-sample data, which represent the best possible performance of any cross validation criterion. Right Histograms and means of error\(_\text {out}\) for the selected models of eight cross validation criteria using the in-sample data over 300 random simulations in experiment 1

Direct comparisons between LKO and LWKO are presented in Table 3, which summarizes the percentages of times that the models selected by the LWKO cross validation criteria led to a lower, higher, or the same out-of-sample error in experiment 1, compared with the counterpart LKO or (n/k)-fold criteria. Results suggested that LWKO (both exact and approximate) outperformed its counterparts LKO and (n/k)-fold for all four compared cases of \(k \in \{1, 2, 4, 8\}\), with \(k=4\) showing the most significant advantage.

Table 3 Relative performance of LWKO and the counterpart LKO or (n/k)-fold in experiment 1, in terms of the percentages of times that LWKO led to a lower (better), higher (worse), or the same error\(_\text {out}\) in 300 random simulations

3.3.2 Results from experiment 2

The histograms of \(q^*\) for the selected models and of the corresponding error\(_\text {out}\) are plotted in the two subfigures of Fig. 2. Direct comparisons between LKO and LWKO are presented in Table 4. The observations from these results are similar to those from experiment 1, suggesting a consistent performance of the LWKO criterion across different data set sizes, prediction models, and norms. Results from experiment 2 also suggest that, even when the optimality of the worst validation dataset cannot be guaranteed, the LWKO criteria calculated using the random sampling algorithm 1 could still be more effective than their counterparts LKO or (n/k)-fold in estimating the performance of general prediction models on out-of-sample data.

Fig. 2

Left Histograms of \(q^*\) for the selected models of eight cross validation criteria using the in-sample data over 300 random simulations in experiment 2. Results for the “optimal” criterion were obtained after observing the out-of-sample data, which represent the best possible performance of any cross validation criterion. Right Histograms and means of error\(_\text {out}\) for the selected models of eight cross validation criteria using the in-sample data over 300 random simulations in experiment 2

Table 4 Relative performance of LWKO and the counterpart LKO or (n/k)-fold in experiment 2, in terms of the percentages of times that LWKO led to a lower (better), higher (worse), or the same error\(_\text {out}\) in 300 random simulations

Table 5 compares the computation time for ten different models under eleven cross validation criteria for a random sample simulation in experiment 2. Results suggested that, using algorithm 1, leave-worst-k-out has a computational speed that is lower than, but comparable to, that of leave-1-out and (n/k)-fold for a wide range of k. Moreover, the parameter s in algorithm 1 can be adjusted to balance the tradeoff between speed and solution quality.

Table 5 Computation time (in seconds) for ten different models under eight cross validation criteria for a random sample simulation in experiment 2

4 Conclusion

We proposed leave-worst-k-out as a new criterion for cross validation. This work makes three major contributions. First, we proved that, for the special case of using multiple linear regression as the prediction model under the \(\mathcal {L}_1\) norm, the LWKO criterion equals the optimal objective value of a mixed integer linear program, which can be solved by numerous algorithms and commercially available solvers more efficiently than by brute force enumeration. Second, we proposed a random sampling algorithm for computing LWKO approximately for general prediction models and norms. Third, we demonstrated with two computational experiments that LWKO not only is more effective than its counterparts LKO and (n/k)-fold but also can be computed much more efficiently than LKO and almost as fast as (n/k)-fold.

As a new cross validation criterion, LWKO faces challenges similar to those of LKO. First, even with the random sampling algorithm 1, LWKO has a computational cost that is higher than, but comparable to, that of leave-1-out and (n/k)-fold, which may be considered computationally intractable under certain circumstances. When effectiveness is more important than speed, our simulation results suggested that using the worst validation error instead of the average could be more informative, even if the search for the k worst validation data points is approximate and non-exhaustive. Second, there is no guaranteed correlation between LWKO and the out-of-sample performance, although such correlation appeared to be higher than that of LKO according to our computational experiment results. Third, the selection of the value of k is more of an art than a science, in the sense that the extent to which LWKO outperformed LKO did not exhibit a clear pattern with respect to k. However, it was clear from both experiments that the popular choice of leave-1-out was not the most effective criterion for cross validation. Since algorithm 1 applies to nonlinear models and more general norms, testing the LWKO criterion on more prediction models (such as neural networks) appears to be an interesting follow-up research direction.