
1 Introduction

Splitting a dataset into subsets for training and validation is a fundamental part of machine learning and appears in multiple tasks, such as model evaluation, model comparison, and hyperparameter tuning. Traditional methods for splitting datasets into training and test sets include holdout, bootstrap, and cross-validation (CV).

Although cross-validation is arguably the most popular partitioning method, it has relevant drawbacks that have been studied over the last decades. Given its stochastic nature, CV may lead to poor estimates because of a partition-induced dataset shift [14], that is, some of the generated folds are not representative of the data. This is generally handled by using repeated cross-validation. However, since applying cross-validation is already computationally expensive, repeating it multiple times may be prohibitive.

The aforementioned issues are related to the randomized steps of CV. Therefore, a few methods have been proposed that attempt to improve cross-validation estimates by introducing a more deterministic process for generating folds. In one of the first works on the topic, Diamantidis et al. [9] introduced a clustering-based technique that relies on k-means (here referred to as CBDSCV). However, k-means can be expensive when used as a splitting strategy for CV. Around the same time, Zeng et al. [24] proposed distribution-balanced stratified cross-validation (DBSCV), which was later adapted by Moreno-Torres et al. [14] into distribution optimally balanced stratified cross-validation (DOBSCV). Although DBSCV and DOBSCV have been compared before, there has been no direct comparison between them and the cluster-based methods.

In our work, we propose the use of mini-batch k-means as a way of reducing the computational cost of CBDSCV. Furthermore, we compare CBDSCV, DOBSCV, and DBSCV, in addition to the traditional cross-validation techniques, on 20 datasets of various sizes, class imbalance levels, numbers of features, and numbers of classes. Our experiments aim to assess whether any cross-validation splitting strategy tends to outperform the others in terms of bias, variance, or computational cost.

The rest of the paper is structured as follows. In Sect. 2 we present the theoretical background of our work, followed by a description of our experiments in Sect. 3. Next, in Sect. 4, we present and discuss our results for balanced and imbalanced datasets. Section 5 reviews other papers that proposed cross-validation splitting strategies but were not directly compared in our experiments. Finally, Sect. 6 presents our conclusions and directions for future work.

2 k-fold Cross-validation Partitioning Methods

The traditional k-fold cross-validation (CV) [10] consists of dividing the given dataset into k folds. Each fold is then used once as the validation set, while the remaining \(k-1\) folds are used for training. Finally, the average performance over the folds is the performance estimate of the k-fold CV. In general, k is set to 5 or 10, which makes it much more computationally tractable than leave-one-out cross-validation (LOOCV), while also showing less variance than LOOCV estimates. Furthermore, it is less biased than the holdout method, since it is able to use more instances for training. K-fold cross-validation can also be used in a stratified fashion (k-fold SCV) to guarantee that the proportion of instances of each class is the same across folds.
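For concreteness, a minimal sketch of these two baseline splitters in scikit-learn is shown below; the toy dataset and classifier are placeholders, not the ones used in our experiments.

```python
# Minimal sketch: estimating performance with (stratified) k-fold CV in scikit-learn.
# The dataset and classifier are illustrative placeholders.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

for splitter in (KFold(n_splits=10, shuffle=True, random_state=0),
                 StratifiedKFold(n_splits=10, shuffle=True, random_state=0)):
    scores = cross_val_score(clf, X, y, cv=splitter)   # one score per fold
    print(type(splitter).__name__, scores.mean())      # k-fold CV estimate
```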

However, the instances assigned to each fold by traditional k-fold CV and SCV are selected randomly, which can cause some folds not to be good representatives of the whole dataset [14]. For instance, it is not guaranteed that all regions of the input space will be appropriately represented in every fold. This phenomenon may impact performance estimates and has thus been considered in various works (see Sect. 5 for references), leading to new splitting strategies based on the features of the data and not only on their class labels. The methods we considered in this work are reviewed in the following sections, where we also describe the adaptation we developed.

2.1 Distribution-Balanced Stratified Cross-validation

Following Moreno-Torres et al. [14], the distribution-balanced stratified cross-validation (DBSCV) [24] attempts to generate folds representative of the full dataset by assigning neighboring instances to different folds. Specifically, DBSCV randomly selects an instance and assigns it to a fold; it then jumps to the nearest instance of the same class and assigns it to the next fold. These steps are repeated until all instances of that class have been assigned to a fold. The same process is applied to the other classes so that the folds have approximately the same number of instances per class. Assuming a balanced distribution of the instances, building pairwise distance matrices for each class has complexity \(\mathcal {O}(C (\frac{N}{C})^2) = \mathcal {O}(\frac{N^2}{C})\), where N and C are the numbers of instances and classes. The search-and-hop step has complexity \(\mathcal {O}(C \frac{N}{C} \frac{N}{C})\), so that the final complexity of the algorithm is \(\mathcal {O}(\frac{N^2}{C})\).
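The following sketch illustrates our reading of the DBSCV fold-assignment idea: within each class, hop to the nearest not-yet-assigned instance and place it in the next fold, round-robin. Function and variable names are ours, not from the original papers, and this is an illustration rather than the reference implementation.

```python
# Hedged sketch of DBSCV-style fold assignment: per class, hop to the nearest
# unassigned instance and assign it to the next fold (round-robin).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dbscv_folds(X, y, n_folds=10, seed=None):
    rng = np.random.default_rng(seed)
    folds = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)                  # instances of class c
        dist = squareform(pdist(X[idx]))              # O(N_c^2) distance matrix
        unassigned = set(range(len(idx)))
        current = int(rng.integers(len(idx)))         # random starting instance
        fold = 0
        while True:
            folds[idx[current]] = fold
            unassigned.discard(current)
            if not unassigned:
                break
            fold = (fold + 1) % n_folds
            remaining = np.fromiter(unassigned, dtype=int)
            current = remaining[np.argmin(dist[current, remaining])]  # nearest hop
    return folds
```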

2.2 Distribution Optimally Balanced Stratified Cross-validation

The distribution optimally balanced stratified cross-validation (DOBSCV) is a modification of DBSCV. It also starts on a random instance of the dataset, but instead of hopping to the closest one of the same class, DOBSCV finds the (k-1) nearest neighbors of the current instance belonging to the same class and assigns the instance and each of its neighbors to a different fold. This process is repeated independently for each class, similarly to DBSCV, until all instances have been assigned to a fold. Our implementation of DOBSCV also uses a pairwise distance matrix for each class. Assuming balanced classes, building the matrices has complexity \(\mathcal {O}(\frac{N^2}{C})\) and searching the k-NN for the selected instances in each class can be done in \(\mathcal {O}(C \frac{N}{kC} \frac{kN}{C})\), resulting in an overall asymptotic complexity of \(\mathcal {O}(\frac{N^2}{C})\).
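A hedged sketch of this variant is given below: the selected instance and its (k-1) nearest unassigned same-class neighbors each go to a different fold. Again, names are ours and the code is illustrative rather than the reference implementation.

```python
# Hedged sketch of DOBSCV-style fold assignment: a random unassigned instance and
# its (k-1) nearest unassigned neighbours of the same class go to different folds.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dobscv_folds(X, y, n_folds=10, seed=None):
    rng = np.random.default_rng(seed)
    folds = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        dist = squareform(pdist(X[idx]))
        unassigned = list(range(len(idx)))
        while unassigned:
            pick = unassigned[int(rng.integers(len(unassigned)))]
            rest = np.array([i for i in unassigned if i != pick])
            # (k-1) nearest unassigned neighbours of the picked instance
            neigh = rest[np.argsort(dist[pick, rest])[:n_folds - 1]] if len(rest) else []
            group = [pick, *neigh]
            for fold, i in enumerate(group):          # one instance per fold
                folds[idx[i]] = fold
                unassigned.remove(i)
    return folds
```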

2.3 Clustering-Based Approaches

Diamantidis et al. [9] introduced unsupervised stratification for cross-validation, based on dataset clustering. Although they also explored hierarchical clustering, their main proposed algorithm uses k-means to cluster the dataset into M clusters. The instances inside each cluster are then sorted by their distances to the cluster center in ascending order. Finally, they assign adjacent instances to different folds, i.e., they make a pass over the sorted list of instances assigning each to a different fold. Note that the number of folds K, clusters M, and classes C need not be equal. We refer to this method as cluster-based stratified cross-validation (CBDSCV). The unsupervised stratification process, however, does not guarantee that the classes are stratified in the usual sense, i.e., the method does not necessarily generate folds with the same proportion of instances per class as the original dataset. K-means has an average complexity of \(\mathcal {O}(M N T)\), where T is the number of iterations, and sorting each cluster can be done in \(\mathcal {O}(M N \log {N})\). Therefore, CBDSCV has a complexity given by \(\mathcal {O}(M N ( T + \log {N}))\).
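A rough sketch of this idea follows: cluster with k-means, sort each cluster by distance to its center, then deal adjacent instances to different folds. This is our reading of the description above; names and the choice to let the fold counter run across clusters are illustrative assumptions.

```python
# Hedged sketch of CBDSCV-style unsupervised stratification.
import numpy as np
from sklearn.cluster import KMeans

def cbdscv_folds(X, n_folds=10, n_clusters=5, random_state=0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    dist_to_centre = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    folds = np.empty(len(X), dtype=int)
    fold = 0
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        for i in members[np.argsort(dist_to_centre[members])]:  # ascending distance
            folds[i] = fold
            fold = (fold + 1) % n_folds   # adjacent instances go to different folds
    return folds
```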

Mini-Batch CBDSCV. The running time of the CBDSCV algorithm is generally dominated by k-means. Therefore, we propose the use of mini-batch k-means [21] as a way of reducing the cost of performing CBDSCV. Mini-batch k-means is an adaptation of k-means with two major differences. First, at each iteration, it selects only a batch of samples instead of the whole dataset. These samples are then assigned to the nearest centroid. Then, instead of computing the new centroid as the mean of all instances assigned to a cluster, it iterates over the instances of the cluster, updating the centroid at each instance using a learning rate \(\eta \) inversely proportional to the number of times this centroid has been updated previously. Mini-batch k-means converges faster than k-means while producing results that are only slightly worse [2, 21]. In the following sections, we will refer to our adaptation of CBDSCV as CBDSCV_Mini.
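Under the same assumptions as the CBDSCV sketch above, the adaptation amounts to swapping the clustering step, as illustrated below; the batch size of 100 matches the setting reported in Sect. 3, and the data is a random placeholder included only to make the snippet runnable.

```python
# Mini-batch variant of the clustering step in the cbdscv_folds sketch above.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.default_rng(0).normal(size=(1000, 8))    # placeholder data
km = MiniBatchKMeans(n_clusters=5, batch_size=100, n_init=3, random_state=0).fit(X)
labels, centres = km.labels_, km.cluster_centers_      # used exactly as in CBDSCV
```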

3 Experiments

The experiments performed here were designed to evaluate whether there is a cross-validation splitting strategy that generally outperforms the others in terms of bias, variance, or computational cost. Moreover, we also wish to study whether dataset imbalance may influence the quality of the splitters' estimates. Since the estimates they produce may depend on the dataset, the classifier, and the metric being estimated, we experimented with 20 different datasets (from PMLB [16]) and 4 different classifiers. The datasets were selected so that two groups would be apparent, one with balanced and the other with imbalanced datasets. The complete list is shown in Table 1.

Table 1. List of datasets used in the experiments. Datasets with Imbalance higher than 0.20 were considered imbalanced.

Note that we use the same class imbalance measure \(I \in [0, 1]\) as in [16], defined by \(I = \frac{K}{K-1} \sum _{i=1}^{K} \left( \frac{n_i}{N} - \frac{1}{K}\right) ^2\), where K is the number of classes, \(n_i\) is the number of instances in class i, and N is the dataset size. Imbalance is 0 when the classes are equally distributed and approaches 1 when almost all instances belong to the same class. When analyzing balanced datasets, we evaluated the splitters in terms of their accuracy estimates, as this is the most common and traditional metric. However, when handling imbalanced datasets, we used the F1 score, since accuracy is inappropriate in these cases. We used the average of the F1 scores computed for each class, i.e., the macro average.
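The imbalance measure and the macro-averaged F1 score translate directly into code, as in the small sketch below; the labels and predictions are placeholders used only to show the computations.

```python
# The class-imbalance measure I defined above, plus the macro-averaged F1 used for
# the imbalanced group. Labels and predictions are illustrative placeholders.
import numpy as np
from sklearn.metrics import f1_score

def imbalance(y):
    _, counts = np.unique(y, return_counts=True)
    k, n = len(counts), len(y)
    return k / (k - 1) * np.sum((counts / n - 1 / k) ** 2)

y = np.array([0] * 50 + [1] * 50)                         # perfectly balanced labels
pred = np.where(np.arange(len(y)) % 10 == 0, 1 - y, y)    # flip every 10th label
print(imbalance(y))                                       # 0.0 for balanced classes
print(f1_score(y, pred, average="macro"))                 # macro-averaged F1
```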

We chose learning algorithms that present different bias and variance levels. Specifically, we experimented with Logistic Regression (LR), Decision Trees (DT), Support Vector Machines (SVC), and Random Forests (RF). We used only the RBF kernel with SVC, since linear decision functions can already be represented by Logistic Regression. To avoid overfitting when handling class-imbalanced datasets, the weights associated with the instances of each class during training were set to be inversely proportional to the class frequencies in the training set. Prior to the experiments, we tuned each classifier to each dataset using the entire data and grid search. The performance of each hyperparameter set was evaluated using 5-fold cross-validation, and the hyperparameters with the highest F1 score were chosen. The selected hyperparameters for a classifier-dataset pair were then fixed for the experiments, so that the classifiers were always trained with the same hyperparameters independently of the splitting strategy being analyzed. With this approach, we aim to capture performance differences caused by variation in the splitting strategy rather than by variation in the hyperparameter values.
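The tuning protocol can be sketched as follows; the hyperparameter grid shown is illustrative, not the exact grid we used.

```python
# Sketch of the per-dataset tuning step: grid search scored by 5-fold CV on the
# full data, with class weights inversely proportional to class frequencies.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}    # illustrative grid
search = GridSearchCV(SVC(kernel="rbf", class_weight="balanced"),
                      grid, scoring="f1_macro", cv=5)
search.fit(X, y)
print(search.best_params_)    # fixed for all subsequent splitter comparisons
```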

Finally, we compared three splitting strategies, CBDSCV, DBSCV, and DOBSCV, against traditional k-fold cross-validation and stratified k-fold cross-validation. We also included our adaptation of CBDSCV, which uses mini-batch k-means for faster computation of the clusters, using batches of size 100. Therefore, six different cross-validation splitting strategies were compared in terms of their bias and variance, as well as computational cost. Our implementations of the splitting strategies, the selected hyperparameters, and the code used in the experiments are available online (see Footnote 1).

3.1 Estimating the Bias and the Variance

The cross-validation methods considered here attempt to estimate the test performance of the learning algorithms fitted to the datasets. The bias of a cross-validation method is defined as the difference between the expected estimate and the (true) test performance [11]. Since we are working with real datasets, it is infeasible to obtain the true test performance. However, we can estimate it by repeating holdout a large number of times, similarly to [3, 11]. Specifically, we estimated the true performance for each dataset and classifier by repeating a stratified holdout 100 times, using \(90\%\) of the dataset for training, and taking the mean value. We chose a small test set in order to reduce the bias caused by using smaller training sets, while we expect the high number of repetitions to attenuate the variance of the holdout.
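A minimal sketch of this "true performance" proxy is given below: 100 repetitions of a stratified 90/10 holdout, averaging the test scores. Dataset and classifier are placeholders.

```python
# Sketch of the repeated stratified holdout used as a proxy for the true performance.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)
splitter = StratifiedShuffleSplit(n_splits=100, train_size=0.9, random_state=0)
scores = [clf.fit(X[tr], y[tr]).score(X[te], y[te]) for tr, te in splitter.split(X, y)]
true_perf = np.mean(scores)    # the estimate \hat{P} used in the bias computation below
```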

The expected estimate of each cross-validation technique was computed for each dataset by resampling \(90\%\) of the dataset without replacement 20 times and applying the cross-validation technique to obtain estimates of the true performance. The average of the 20 estimates was used as the expected estimate of the cross-validation method. That is, letting \(CV_i\) be the performance estimate of running k-fold cross-validation on a given dataset and learning algorithm with a chosen splitting strategy, we approximated the expected cross-validation estimate as

$$\begin{aligned} \overline{CV} = \frac{1}{20}\sum _{i=1}^{20} {CV}_i. \end{aligned}$$
(1)

Finally, we computed the bias using \(b_{CV} = \overline{CV} - \hat{P}\), where \(\hat{P}\) is the estimation of the true performance that was computed using 100-times repeated stratified holdout, as described above.

The other important quantity that determines the quality of an estimator is its variance. We computed the variance of the cross-validation estimates using

$$\begin{aligned} s^2_{CV} = \frac{1}{20 - 1} \sum _{i=1}^{20} (CV_i - \overline{CV})^2. \end{aligned}$$
(2)

In this paper, however, we work with the standard deviation (std) s, since we believe it is more easily readable. Note that an estimator with high variance may give poor results even if it has low bias, since one may not be lucky enough to obtain one of the estimates closer to the true value.
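Putting Eqs. (1) and (2) together, the sketch below shows the resampling loop, the mean estimate, the bias against the holdout proxy, and the standard deviation; dataset, classifier, and the value of \(\hat{P}\) are placeholders.

```python
# Sketch of the bias/std computation: 20 resamples of 90% of the data, one CV
# estimate per resample, then Eq. (1), the bias, and the square root of Eq. (2).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)
true_perf = 0.94                               # stand-in for \hat{P} from repeated holdout

cv_estimates = []
for seed in range(20):
    X_s, _, y_s, _ = train_test_split(X, y, train_size=0.9, stratify=y, random_state=seed)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    cv_estimates.append(cross_val_score(clf, X_s, y_s, cv=folds).mean())

cv_bar = np.mean(cv_estimates)                 # Eq. (1)
bias = cv_bar - true_perf                      # b_CV
std = np.std(cv_estimates, ddof=1)             # s, the square root of Eq. (2)
```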

We evaluated the bias and variance of the six different dataset partitioning strategies over 20 different datasets and four classifiers. For each k-fold cross-validation strategy, we experimented with 2, 5, and 10 folds. Finally, we used accuracy and F1 as the performance metrics.

3.2 Defining the Number of Clusters

The cluster-based methods require the number of clusters to be given as input. Ideally, we would compute the number of clusters right before each split is performed. However, this would be too computationally expensive, since the number of experiments performed is already large. Therefore, we chose to estimate the number of clusters for each dataset prior to the main experiments and use this number (rounded to the nearest integer) for all cluster-based splitting methods. We followed the same strategy as Diamantidis et al. [9] to estimate the number of clusters, which is based on repeatedly applying hierarchical clustering to small samples of the dataset and using a threshold on the similarity between the clusters being merged to determine the number of clusters. The resulting number of clusters for each dataset is shown in Table 1.
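A hedged sketch of this heuristic is shown below: agglomerative (hierarchical) clustering under a merge-distance threshold is applied to small random samples and the resulting cluster counts are averaged. The threshold, sample size, and number of repetitions shown are illustrative assumptions, not the exact settings used.

```python
# Sketch of the cluster-count heuristic based on hierarchical clustering of samples.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
counts = []
for _ in range(10):
    sample = X[rng.choice(len(X), size=50, replace=False)]
    agg = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0).fit(sample)
    counts.append(agg.n_clusters_)
n_clusters = int(round(np.mean(counts)))   # value then used by the cluster-based splitters
```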

4 Results and Discussion

The experiments described in the previous section result in 80 samples of bias and variance for each k-fold splitter, with \(k=2\), 5, and 10. Each of the 80 samples corresponds to a dataset-classifier pair. In the next sections, we describe the results grouped into datasets that are balanced and imbalanced in terms of class labels, resulting in 40 dataset-classifier pairs per group. We focus mainly on the results with 2 and 10 folds; the figures for the 5-fold case are available in our Git repository (see Footnote 1).

4.1 Balanced Datasets

The bias and standard deviations of each 10-fold cross-validation splitting strategy for all datasets and classifiers are summarized in Fig. 1. All the methods showed a general tendency toward very low bias and similar standard deviations, indicating that no solution consistently performs better than all others.

Fig. 1. (a) Bias and (b) standard deviation of each splitter method across all balanced datasets and classifiers. Each splitter uses 10 folds.

Note, however, that this does not imply that the accuracy (or F1) estimates produced by each partitioning strategy are not different. The p-values for the Friedman tests [7] comparing the estimates of the splitters are shown in Table 2. In particular, the p-value for the estimates considered here is 0.0279, suggesting that the bias estimates differ depending on the splitting strategy. However, there is no significant difference in terms of the standard deviations. Table 3 shows the number of times each method performed best. For 10 folds, accuracy, and balanced datasets, stratified 10-fold CV had the most wins for both bias and std.
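As an illustration of how the comparisons in Table 2 can be computed, the snippet below applies the Friedman test to a matrix with one row per dataset-classifier pair and one column per splitter; the values are random placeholders, not our results.

```python
# Friedman test across splitters, as in Table 2 (placeholder data).
import numpy as np
from scipy.stats import friedmanchisquare

estimates = np.random.default_rng(0).normal(0.9, 0.02, size=(80, 6))  # 80 pairs x 6 splitters
stat, p_value = friedmanchisquare(*estimates.T)    # one group of 80 values per splitter
print(p_value)                                     # compared against 0.05 as in the text
```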

Reducing the number of folds increases the bias (in absolute terms) and the standard deviations, as shown in Fig. 2. However, the methods still perform similarly overall. We note, nevertheless, that DOBSCV and CBDSCV increased the number of times they achieved the best results, while stratified CV showed worse results compared with its performance in the 10-fold scenario, particularly with respect to the standard deviation of the estimates. This indicates that DOBSCV and CBDSCV can be useful when a reduced number of folds is desired, so that the computational cost of training multiple models can be reduced.

Fig. 2. (a) Bias and (b) standard deviation of each splitter method across all balanced datasets and classifiers. Each splitter uses 2 folds.

Table 2. p-values for the Friedman tests comparing whether the estimates produced by the splitters for each dataset-classifier pair differ. Smaller values mean that the hypothesis that the splitters produce similar estimates for the datasets and classifiers is unlikely. Values below 0.05 are shown in bold.
Table 3. Number of times each method had the best result in terms of bias or standard deviation, for various metrics, numbers of folds, and dataset imbalance levels. The words balanced and imbalanced are abbreviated to bal. and imb., respectively.

4.2 Imbalanced Datasets

The bias and standard deviations for 10 folds and imbalanced datasets are shown in Fig. 3. Since accuracy is not appropriate for studying imbalanced datasets, the bias and std were calculated for the F1-score estimates. Stratified 10-fold CV showed the best results for both bias and standard deviation. Table 2 shows that the difference between the splitters is significant in the imbalanced cases, and Table 3 shows that stratified 10-fold CV indeed presents the least biased and most consistent estimates for most datasets and classifiers. It is interesting to note that DBSCV and DOBSCV deal with class stratification by performing their splitting strategies per class. The fact that their performance was worse than stratified cross-validation suggests that there may be more appropriate ways to develop stratified versions of DBSCV and DOBSCV. The CBDSCV techniques, in contrast, do not handle class imbalance directly.

Fig. 3. (a) Bias and (b) standard deviation of each splitter method across all imbalanced datasets and classifiers. Each splitter uses 10 folds and the metric observed is the F1 score.

When one reduces the number of folds to 2, both the bias (in absolute value) and the std of the estimates increase. More interestingly, the advantage held by stratified cross-validation almost disappears. This pattern is similar to the one observed for the balanced datasets. Table 3 shows that the number of times SCV performs best indeed decreases when compared to the 10-fold case (Fig. 4).

Fig. 4. (a) Bias and (b) standard deviation of each splitter method across all imbalanced datasets and classifiers. Each splitter uses 2 folds and the metric observed is the F1 score.

4.3 Running Times

We also compared the running times of each splitting strategy. The running time of a k-fold splitting strategy for a dataset was obtained by averaging the running times of the splitting process over the 20 runs and the 4 classifiers. We noticed that the running times obtained for 2, 5, and 10 folds were very similar, and therefore we consider only the 10-fold case in the following discussion. Figure 5 shows the running times of each splitting method for the 20 datasets considered. One can see that the classical strategies, KFold and StratifiedKFold, have negligible running times when compared to the others. Furthermore, we note that DBSCV and DOBSCV show higher variability depending on the dataset, reaching the highest running times of all methods. In comparison, CBDSCV and CBDSCV_Mini have more consistent running times across datasets.

Fig. 5. Average running times in seconds across all the 20 datasets. The running times correspond to the use of 10 folds.

Ignoring KFold and StratifiedKFold, however, DOBSCV was actually the fastest method in 14 out of 20 datasets, while CBDSCV_Mini was the quickest in the other six. Specifically, CBDSCV_Mini was fastest in ‘analcatdata_germangss’, ‘movement_libras’, ‘analcatdata_dmft’, ‘appendicitis’, ‘page_blocks’, and ‘postoperative_patient_data’, which are the datasets with more instances. This is expected if one considers the algorithmic complexity of each method: DOBSCV scales quadratically with the number of samples.

4.4 Cluster-Based Splitters

When comparing only the cluster-based approaches, CBDSCV and CBDSCV_Mini, we observed that the running times of the mini-batch version were smaller for all datasets, as expected. Specifically, it ran on average 2.4 times faster than CBDSCV. Furthermore, we detected no significant difference between the estimates given by CBDSCV and CBDSCV_Mini. Table 4 shows the number of times each method performed better than the other, for all datasets and classifiers, and the p-value computed using the Wilcoxon test.
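The paired comparison reported in Table 4 can be reproduced with the two-sided Wilcoxon signed-rank test, as sketched below; the values are random placeholders, not our results.

```python
# Paired comparison of CBDSCV vs CBDSCV_Mini estimates (placeholder data).
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
cbdscv = rng.normal(0.90, 0.02, size=80)                  # one estimate per dataset-classifier pair
cbdscv_mini = cbdscv + rng.normal(0.0, 0.005, size=80)    # paired estimates for the mini-batch version
stat, p_value = wilcoxon(cbdscv, cbdscv_mini, alternative="two-sided")
print(p_value)
```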

Table 4. Number of times each cluster-based method had the best result in terms of bias and variance. Only the cluster-based methods are considered here. The last column shows the p-value for the two-sided Wilcoxon signed-rank test.

5 Related Work

Besides more recent theoretical works on traditional cross-validation estimates [4, 20, 22], various works have proposed cross-validation splitting strategies for specific scenarios that are not directly related to ours. Motl et al. [15] developed a technique based on linear programming for performing label-based stratification when instances have more than one label at the same time, i.e., multi-label datasets. Specific methods have also been developed for data drift situations, where some instances become obsolete over time [13], and for the particular case of dataset shift in credit card validation [19]. Cross-validation over graphs [8, 12] has also seen some recent work, as have methods for reducing the computational cost of cross-validation in deep learning [6]. Cross-validation adaptations have been developed to handle duplicate data in medical records [1], for calibration models in chemistry [23], and for infrared and mass-spectroscopy images [17, 18]. All the methods cited above, however, were developed for different specific scenarios, whereas the methods explored here target tabular data and single-label datasets. Closer to the methods explored in this work are the proposals by Budka et al. [3] and Cervellera et al. [5]. In both works, the methods attempt to partition a dataset by generating samples whose distributions are as similar as possible to the distribution of the original data. We were not able to include them in this work due to a lack of compatibility with the framework we developed for comparing cross-validation methods. Nevertheless, we will include them in future work in this area of research. We note also that these two approaches have not yet been compared with each other.

6 Conclusion and Future Work

In this work, we proposed an adaptation of a cluster-based technique for splitting a dataset for cross-validation. We also compared various CV strategies using different classifiers for balanced and imbalanced datasets. We found that no method consistently outperforms all others in terms of bias or standard deviation when estimating accuracy using 10 folds and balanced datasets. In these cases, traditional stratified cross-validation remains a good choice. When the number of folds is reduced to 2, however, stratified cross-validation may produce accuracy estimates with higher variance than DOBSCV and the cluster-based techniques.

When considering F1 score estimates, traditional stratified cross-validation produced the best results in terms of bias and variance for most datasets and classifiers when used with 5 and 10 folds, for both balanced and imbalanced datasets. When the number of folds is reduced, however, F1 scores on balanced datasets may be better estimated by other methods such as DOBSCV and the cluster-based splitter. For imbalanced datasets, SCV remained the most frequent winner. In particular, the advantage of traditional SCV was most significant when estimating the F1 score on imbalanced datasets. This suggests that better class-based stratification adaptations can be developed for DBSCV and DOBSCV. The development of a supervised version of CBDSCV is also an interesting topic for further work. Finally, we found no significant change in the quality of the estimates produced by CBDSCV and CBDSCV_Mini, whereas the mini-batch version is significantly less expensive in terms of computational cost.

We have not studied dataset characteristics such as the presence of subconcepts in the input space, as this kind of information is not easily extracted from a dataset. Those characteristics, however, may be relevant in determining which performance estimator should be used and may provide deeper insight into the use cases of each method. Similarly, we have not analyzed in depth whether some splitting strategies work better for particular classifiers or for datasets with different sample sizes. Finally, in future work, we intend to expand our experimental comparison and explore other approaches proposed in the literature [3, 5].