Introduction

Advances in analytical instrumentation have made it possible to examine thousands of genes, peptides, or metabolites in parallel. Technologies such as microarrays or mass spectrometry provide insight into systems biology, yielding large amounts of complex biological data [1, 2]. However, the costly and time-consuming data acquisition process leads to a generalized scarcity of samples. From a data analysis perspective, omics data are characterized by high dimensionality and small sample counts [3]. Consequently, the “curse of dimensionality” [4] plays a key role, and pattern recognition methods must be scrutinized for their ability to deal with small sample-to-dimensionality ratios [5, 6]. An additional factor in small-sample conditions is the increased chance that the selected samples are not representative of the target population. In these conditions, the models become very dependent on the particular set of samples used for model building [7] and sometimes show poor generalization capabilities.

Model validation is an extremely important part of predictive model development. Beyond the numerical aspects, diverse levels of validation may serve to test for repeatability, reproducibility, instrumental shifts, or background (matrix) effects. A correct validation design, including a proper partition of the dataset that takes into account the data distribution in the input space, stratification issues, and the replication of the future operational scenario of the model, is essential to check the robustness of the models in the operational phase. These conceptual aspects of validation have been covered recently by Westad et al. [8] and Marco [9].

The simplest validation method is often known as “hold-out,” and it refers to a simple split of the data into a training set and a validation set. Even for this simple case, there are several alternatives for dataset splitting. Random sampling is the most popular, but it does not guarantee that samples at the borders of the data space end up in the training set. The Kennard-Stone (KS) algorithm [10] aims to sample the data space uniformly by maximizing the Euclidean distances between the selected samples. Recently, updates to the KS algorithm have been proposed to take into account the distribution of the dependent variable; an example is the SPXY splitting method by Galvao et al. [11]. This method was proposed in the context of building multivariate calibration models, that is, in a regression scenario. For classification problems, the distribution of the dependent variable is taken into account by balancing the partition so that the number of examples for the different classes is similar in both training and validation.

Partial least squares-discriminant analysis (PLS-DA) is a common technique in multivariate analysis for classification or for class-oriented dimensionality reduction. It has been the classifier of choice in a multitude of applications in diverse fields [12,13,14,15,16,17,18]. Lately, the PLS-DA algorithm has become a standard in omics research [19,20,21,22,23,24,25,26,27,28]. Some advantages behind the popularity of PLS-DA as a classifier are its ability to cope with collinear and noisy variables, which is often the case in omics datasets [29], as well as the possibility of visualizing results by means of scores and loadings plots [30, 31]. Additionally, variable importance can be interpreted from PLS-DA results [32, 33].

The propensity of PLS-DA to provide overoptimistic results (so-called “overfitting”), and consequently poor generalization to samples outside the study, has been reported by several authors [31, 34, 35]. While there have been recent algorithmic proposals to reduce overfitting in PLS when the noise among variables is uncorrelated [36], the usual approach is to optimize the number of latent variables (LV) so that they show the best performance in validation.

A basic concept in this framework is the distinction between internal and external validation. While in some publications internal validation (IV) is simply referred to as validation and external validation (EV) as test or blind samples, we think the IV/EV terminology is less prone to confusion. IV refers to validation using the samples available for model building, while EV consists of fresh samples used to test the performance of the model. The best practice is to use IV for model selection (e.g., optimizing the number of latent variables (LV)), while performance estimation has to be carried out on EV samples.

Beyond simple hold-out, for small sample counts, cross-validation (CV) methods aim at a more efficient use of the available data. In fact, many studies only report internal CV results, skipping EV [30, 37]. However, it is well known that internal CV generally provides overoptimistic performance estimations, and an unbiased performance estimation should be obtained on an external validation set (also referred to as “blind samples”) [9, 38]. Despite this well-known fact, there is a very rich literature on different CV methods and their relative merits [39,40,41]. Over the years, many CV methods have been described in the literature, but only a handful of them are popular in omics data analysis practice (see “Cross-validation”).

On the other hand, methodologies have also been proposed to make optimum use of the available samples together with EV. One such method is double cross-validation [42]. A more general approach is known as cross-model validation (CMV) [43] and is often combined with jack-knifing of model parameters [44]. Despite these well-known recommendations and methods, simple CV is still the norm in most preliminary omics studies.

The scarce use of EV techniques in omics research is an issue that has been pointed out previously [45, 46]. Moreover, comparisons of CV methods for omics data [45,46,47,48,49,50,51,52] have already been published. Braga-Neto et al. compared linear discriminant analysis (LDA), which can be considered a particular case of PLS-DA [35], three-nearest neighbors (NN), and decision trees in an internal CV scheme; they concluded that CV had undesirable features, such as the presence of outliers in the accuracy estimation [53]. Fu et al. studied prediction errors with distinct CV strategies for random datasets of different sample sizes, but only a few cases of different dimensionality were considered [54]. Westerhuis et al. compared single CV with CMV and advocated the use of permutation tests to compute the null-hypothesis distribution of different figures of merit [34]. Finally, Varma et al. also highlighted the need for an independent set to estimate a model’s performance, since their CV and true error estimations differed in internal validation [55].

In this work, we explore the magnitude of PLS-DA overfitting in internal CV. In particular, we describe how overfitting depends on the chosen CV method, the sample count, the number of dimensions, and the correlation among features. The study is based initially on synthetic data and then on real data from mass spectrometry and microarray studies.

Methods

PLS-DA

In untargeted omics research, sample j is described by a feature vector \( {\mathbf{x}}_j\in {\mathbb{R}}^D \), where D often takes large values. Feature vectors are acquired for n samples, providing a data matrix \( \mathbf{X}\in {\mathbb{R}}^{n\times D} \). Each sample has a phenotype or class label under study and can thus be described by a binary vector \( {\mathbf{y}}_j\in {\mathbb{R}}^q \), where q is the number of categories (in the particular case of two classes it is enough to take q = 1). Then, for n samples, a categorical matrix \( \mathbf{Y}\in {\mathbb{R}}^{n\times q} \) can be defined as \( \mathbf{Y}={\left[{\mathbf{y}}_1,{\mathbf{y}}_2,\dots, {\mathbf{y}}_n\right]}^{\mathrm{T}} \). In many omics studies, the aim is to build a suitable model that allows the accurate prediction of \( {\mathbf{y}}_{\mathrm{new}} \) from the measurement of \( {\mathbf{x}}_{\mathrm{new}} \).

PLS-DA can be understood as a partial least squares (PLS) regression between a set of predictors X and label responses Y, with a binary outcome. PLS defines a new subspace of LV through an iterative process, seeking a compromise between maximum variance in X and maximum correlation with Y [56, 57]. We focused on the PLS1 algorithm, i.e., one response variable and multiple predictors, and employed the plsr function from the pls R package [58]. For PLS regression models with one y-variable, no iterations are needed. The projection of the X-matrix onto the defined hyperplane is given by the X-scores (T), which are defined in Eq. (1).

$$ \mathbf{T}=\mathbf{XW} $$
(1)
$$ \mathbf{X}={\mathbf{T}\mathbf{P}}^{\mathbf{T}}+\mathbf{E} $$
(2)
$$ \mathbf{Y}={\mathbf{T}\mathbf{C}}^{\mathbf{T}}+\mathbf{F} $$
(3)

The X-scores T result from the linear combination of the original variables with the weights W, and they model X (Eq. 2): when multiplied by the loadings P, the X-scores are good summaries of X and the X-residuals, E, are small. On the other hand, Y can be predicted in terms of the X-scores and the matrix C (Eq. 3); the Y-residuals, F, are the deviations between the observed and modeled responses. Finally, the relationship between X and Y specified by PLS is given by Eq. (4), where B is the matrix of PLS regression coefficients (Eq. 5).

$$ \mathbf{Y}=\mathbf{XB}+\mathbf{F} $$
(4)
$$ \mathbf{B}=\mathbf{W}{\left({\mathbf{P}}^{\mathbf{T}}\mathbf{W}\right)}^{-1}{\mathbf{C}}^{\mathbf{T}} $$
(5)
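As an illustration of this formulation, the following minimal sketch (hypothetical data and variable names, not the code used in this study) fits a binary PLS-DA model with the plsr function mentioned above by regressing a 0/1 class label on X, and converts the continuous prediction of Eq. (4) into a class assignment by thresholding at 0.5:

```r
# Minimal PLS-DA sketch with the pls package (hypothetical data)
library(pls)

set.seed(1)
n <- 50; D <- 30
X <- matrix(rnorm(n * D), nrow = n)   # n x D predictor matrix
y <- rep(c(0, 1), each = n / 2)       # binary class label (q = 1)

fit <- plsr(y ~ X, ncomp = 10)        # PLS1 model with up to 10 latent variables

# Predictions are continuous (Eq. 4); assign class 1 when the output exceeds 0.5
y_hat <- as.numeric(predict(fit, ncomp = 5)[, 1, 1] > 0.5)
mean(y_hat == y)                      # apparent (resubstitution) classification rate
```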

Cross-validation

Supervised algorithms require an estimation stage (also known as “training”) with labeled data examples, and most classifiers also need tuning of hyper-parameters, such as k in k-NN or the number of LV in the case of PLS-DA. We will refer to the data split used for hyper-parameter optimization as “internal validation,” and to the blind samples used to assess the generalization capability or performance of the model as “external validation.” Figure 1 shows the scheme of a three-way data split and the objectives of each data subset.

Fig. 1 Three-way data split. The full dataset is split into a calibration subset and an external validation subset, the latter being used for predictive model performance assessment. The calibration subset is split into a training subset for parameter estimation and an internal validation subset for model selection

CV considers a number of iterations or folds with distinct training and test data partitions. For every fold, a model is built with the training set and tested over a range of hyper-parameter values. Finally, the selected hyper-parameter value is the one that provides the best average result across all the folds or partitions [59, 60]. For IV, we applied different CV strategies, namely K-Fold, leave-one-out (LOO), random subsampling (RS), Bootstrap, and bootstrapped Latin partitions (BLP) [1, 60, 61].

K-Fold

A dataset with n samples is split into k equal-sized partitions. Each partition contains n/k validation samples, and every sample appears in the validation set only once. We considered k = 4, i.e., a 75–25% split between training and test. Other choices of k are possible, but there is no clear consensus on a preferred value for k.
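For concreteness, a fourfold partition of this kind can be generated as in the following sketch (assumed sample size; not the exact implementation used here):

```r
# Sketch: assign each of n samples to one of k = 4 folds at random, so that
# every sample appears in the validation set exactly once
k <- 4; n <- 48
folds <- sample(rep(1:k, length.out = n))   # fold label for each sample

for (i in 1:k) {
  val_idx   <- which(folds == i)            # n/k validation samples
  train_idx <- which(folds != i)            # remaining samples for training
  # ... fit the PLS-DA model on train_idx and evaluate it on val_idx ...
}
```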

Leave one out

LOO is the most extreme case of K-Fold. The training subset is composed of all samples but one, which is used for validation. The procedure is repeated k = n times.

Random subsampling

In this strategy, the user decides both the number of iterations and the percentages of the training and validation subsets. Training samples are chosen randomly without replacement, and the remaining samples are used for validation; a given sample may appear in the validation subset several times across iterations. We implemented RS with 80 and 20 iterations for the simulated and real datasets, respectively, and a 75–25% split between training and test. To keep the class distribution balanced between the training and validation samples, the two classes are split independently and then merged; in this manner, we obtain a stratified partition.
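A single stratified RS iteration along these lines might look as follows (a sketch with an assumed 0/1 label vector):

```r
# Sketch: one stratified random-subsampling iteration (75% training, 25% validation);
# each class is split independently and the parts are then merged
y <- rep(c(0, 1), each = 25)                          # assumed 0/1 labels (n = 50)
idx0 <- which(y == 0); idx1 <- which(y == 1)

tr0 <- sample(idx0, round(0.75 * length(idx0)))       # 75% of class 0 for training
tr1 <- sample(idx1, round(0.75 * length(idx1)))       # 75% of class 1 for training

train_idx <- c(tr0, tr1)                              # stratified training subset
val_idx   <- setdiff(seq_along(y), train_idx)         # the rest is used to validate
```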

Bootstrap

Bootstrap is a resampling technique in which the user decides the number of iterations (about one hundred are recommended). In each iteration, n samples are chosen for training with replacement, and the validation subset is formed by the remaining samples. Thus, the training and test percentages depend on the resampling of each iteration. We performed bootstrapping with 80 iterations for the simulated datasets and with 20 and 100 iterations for the real datasets [61,62,63]. Again, the two classes are split independently and then merged in order to obtain a stratified partition.
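A single stratified Bootstrap iteration could be sketched as follows (again with an assumed 0/1 label vector); the validation subset is simply whatever was not drawn for training:

```r
# Sketch: one stratified bootstrap iteration; training samples are drawn with
# replacement within each class, and the unselected samples form the validation set
y <- rep(c(0, 1), each = 25)                            # assumed 0/1 labels (n = 50)
idx0 <- which(y == 0); idx1 <- which(y == 1)

tr0 <- sample(idx0, length(idx0), replace = TRUE)       # resample class 0
tr1 <- sample(idx1, length(idx1), replace = TRUE)       # resample class 1

train_idx <- c(tr0, tr1)                                # may contain duplicates
val_idx   <- setdiff(seq_along(y), unique(train_idx))   # samples never drawn for training
```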

Bootstrapped Latin partitions

In chemometrics, BLP has been considered a preferred approach to estimate the performance of predictive models, as it takes into account some typical characteristics of analytical instrumentation datasets [64,65,66]. A review of this technique has been published recently [67]. The method is a form of repeated K-Fold CV with some constraints. First, data are split so that replicates of the same chemical sample are never contained in the prediction and training sets at the same time; in fact, the blind use of conventional CV methods in some works leads to the presence of very similar samples in training and validation and increases overoptimism. Second, the class proportions are automatically maintained between the validation set and the training set. One of the advantages of BLP is that, in each CV repetition, every sample is used for validation a single time. Selected examples of the use of this CV method can be found in Harrington et al. [68, 69] and Rearden et al. [71]. BLP combined with PLS and PLS-DA has been named super-PLS and super-PLS-DA, respectively, by Aloglu et al. [70]. A total of 80 and 20 iterations were performed for the synthetic and real sets, respectively.
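The cited references give the full BLP algorithm; as a rough, simplified sketch of its core idea (repeated stratified partitions in which every sample is validated exactly once per repetition), one could write something like the following. Replicate handling, a key feature of BLP, is deliberately omitted here.

```r
# Rough sketch of repeated stratified Latin-style partitions (replicate handling omitted).
# In each repetition, the samples of every class are distributed at random over k
# partitions, so class proportions are preserved and each sample is validated once.
y <- rep(c(0, 1), each = 25)            # assumed 0/1 labels
k <- 4; n_rep <- 20

for (r in 1:n_rep) {
  folds <- integer(length(y))
  for (cl in unique(y)) {
    idx <- which(y == cl)
    folds[idx] <- sample(rep(1:k, length.out = length(idx)))
  }
  for (i in 1:k) {
    val_idx   <- which(folds == i)
    train_idx <- which(folds != i)
    # ... fit PLS-DA on train_idx, evaluate on val_idx, and pool results over r ...
  }
}
```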

Analysis

By overfitting, we understand the difference between the estimated classification accuracy in CV and the true accuracy. For simulated data, the true accuracy is known by design (50% in this study), while for real data the true accuracy has to be estimated by external validation. We built simulated datasets with distinct sample-to-dimensionality ratios and feature correlations in order to evaluate PLS-DA overfitting under different conditions. Moreover, a mass spectrometry dataset and a microarray dataset were used to ascertain whether the simulation results were consistent with real data and how internal CV and external validation estimations differ when the data distribution departs from normality or the expected accuracy is not 50%.

Simulated dataset

We created non-discriminative datasets from multivariate normal distributions using the mvrnorm function of the MASS R package. All samples were identically distributed irrespective of the class label, so that the theoretical discrimination power is null; in other words, for the synthetic datasets we know the true accuracy: 50%, i.e., random choice. Datasets were composed of Gaussian noise with mean μ = 0 and covariance matrices with different correlations between features (σii = 1, σij = {0.00, 0.50, 0.90, 0.99}). For each dataset, two classes were arbitrarily defined by creating a random binary label vector with equal probability for both classes. PLS-DA overfitting was quantified in internal CV, and the LOO, K-Fold, RS, BLP, and Bootstrap methods were compared in terms of performance estimation. The number of LV was optimized according to the maximum average accuracy (classification rate, CR) across the folds, and the result of the best model is given as the final performance estimation.
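The generation of one such null dataset can be sketched as follows (illustrative parameter values; the actual study scans n, D, and the correlation as described below):

```r
# Sketch: a non-discriminative dataset; all samples share the same multivariate
# normal distribution and class labels are assigned at random (true CR = 50%)
library(MASS)

n <- 50; D <- 30; rho <- 0.9
Sigma <- matrix(rho, D, D); diag(Sigma) <- 1       # unit variances, correlation rho

X <- mvrnorm(n, mu = rep(0, D), Sigma = Sigma)     # n x D Gaussian data matrix
y <- sample(c(0, 1), n, replace = TRUE)            # random labels, equal probability
```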

In order to study the influence of the sample count n and the number of dimensions D, we scanned both parameters in the ranges n ∈ [14, 118] and D ∈ [2, 100], both for training and internal CV. In all the considered cases, the two classes have the same number of instances and n is their sum. For each case of n, D, covariance matrix, and CV technique, we generated at least 1000 populations, and for every population the procedure to obtain the magnitude of overfitting was repeated. The estimator bias and its root mean square error (RMSE) were used as figures of merit (Eq. 6). According to the well-known bias-variance decomposition [5], this error has two sources: bias and variance. Bias refers to the expected difference between the true (CR0) and the estimated (\( \hat{\mathrm{CR}} \)) classification rates, whereas the RMSE also takes the variance (Var) into account.

$$ \mathrm{RMSE}\left(\hat{\mathrm{CR}}\right)=\sqrt{E\left[{\left(\hat{\mathrm{CR}}-{\mathrm{CR}}_0\right)}^2\right]}=\sqrt{{\left(E\left[\hat{\mathrm{CR}}\right]-{\mathrm{CR}}_0\right)}^2+\mathrm{Var}\left(\hat{\mathrm{CR}}\right)} $$
(6)

It is important to remark that RMSE in this work does not refer to a mean square error between the numerical output of the PLS-DA model and the target label.
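Given a vector of CR estimates collected over repeated populations, the bias and the RMSE of Eq. (6) can be computed as in this sketch (cr_hat is a placeholder for the collected estimates; the true CR is 0.5 for the simulated data):

```r
# Sketch: bias and RMSE of the CR estimator (Eq. 6) from repeated estimates
cr_hat  <- runif(1000, 0.5, 0.9)      # placeholder for CR estimates over repetitions
cr_true <- 0.5                        # known true CR for the null simulated data

bias <- mean(cr_hat) - cr_true        # expected difference to the true CR
rmse <- sqrt(bias^2 + var(cr_hat))    # bias-variance decomposition of Eq. (6)
```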

Microarray dataset

We employed an RNA microarray dataset which contains 295 samples from patients with good (110 samples) or poor (185 samples) prognosis (recurrence, distant metastasis, or death) after a mean follow-up of 6.7 years. In 2002, van’t Veer et al. defined a gene expression profile for breast cancer prognosis using this dataset. We used the 70-gene signature found by the authors to build PLS-DA models; this signature had previously been used to assess the validity of CV for small-sample microarray classification [72, 73].

K-Fold, LOO, RS, BLP, and Bootstrap strategies were used to internally validate the PLS-DA models. An unbiased estimation of the CR was obtained from external validation using 50 samples per class. We fixed the same number of training samples for all CV strategies, which implies a different total sample count for each of them. Models were balanced by considering the same number of samples per class, so that training and test sets had the same class distribution (50–50%). To obtain a better estimate of the probability distribution of each CR estimator, we implemented a resampling strategy over the existing data: both internal and external validations were repeated 1000 times. From the obtained results, the estimator bias and variance were calculated for each internal validation method. Moreover, Wilcoxon tests were performed to assess the differences among validation methods and between IV and EV.

Mass spectrometer dataset

The public-domain ARCENE dataset contains mass spectrometry data from patients with ovarian or prostate cancer and from control subjects. The data come from two sources: the National Cancer Institute (NCI) and the Eastern Virginia Medical School (EVMS). Ovarian cancer data were obtained from NCI and prostate cancer data from both sources. Spectra were pre-processed to minimize the disparity between data sources. The resulting training data comprised 503 controls, 398 cancer samples, and 10,000 features (3000 of which were randomly permuted values added so that the data could be used to benchmark feature selection methods). The employed version of the dataset was prepared by Isabelle Guyon and is available in the UCI Machine Learning Repository [74].

For this dataset, only LOO was used for validation. The candidate values for the sample count n and the number of dimensions D were n = {14, 24, 54} and D ∈ [2, 100]. Again, n was set to even values so that both classes had the same sample size. Features were selected at random, and for each (n, D) pair, 500 independent trials were performed.

Results and discussion

Simulated dataset

Evaluation of overfitting using cross-validation

Since the occurrence of overfitting in the absence of validation is well known, many authors resort to internal validation for performance estimation. In this work, the number of LV has been optimized according to the maximum average CR across all CV folds. The performance of the best model (i.e., with optimized complexity) is reported as the final accuracy estimation. Hence, the same validation data are used both for optimizing the model’s hyper-parameters and for estimating its performance.
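This practice can be made explicit with the following sketch (an assumed helper, not the study’s actual code): for each fold, models with 1 to max_lv latent variables are evaluated, the LV count with the highest mean CR across folds is selected, and that same mean CR is reported as the performance estimate. Note that max_lv must not exceed the number of training samples minus one.

```r
# Sketch: LV selection and performance estimation on the same CV folds
library(pls)

cv_accuracy <- function(X, y, folds, max_lv) {
  k   <- max(folds)
  acc <- matrix(NA_real_, k, max_lv)                 # folds x candidate LV counts
  for (i in 1:k) {
    tr   <- folds != i
    d_tr <- data.frame(y = y[tr]);  d_tr$X <- X[tr, , drop = FALSE]
    d_va <- data.frame(y = y[!tr]); d_va$X <- X[!tr, , drop = FALSE]
    fit  <- plsr(y ~ X, ncomp = max_lv, data = d_tr)
    pred <- predict(fit, newdata = d_va)             # n_val x 1 x max_lv array
    for (a in 1:max_lv)
      acc[i, a] <- mean((pred[, 1, a] > 0.5) == d_va$y)
  }
  mean_acc <- colMeans(acc)
  list(best_lv = which.max(mean_acc),                # selected model complexity
       cr_hat  = max(mean_acc))                      # reported CR estimate
}
```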

Figure 2 includes accuracy estimations with K-Fold CV (k = 4) for increasing dimensionality and different sample sizes. The figure shows a significant overfitting that peaks when the dimensionality matches the number of training samples minus one. The magnitude of this overfitting increases in small-sample conditions. Beyond the peak, and contrary to intuition, overfitting decreases as the number of dimensions becomes larger than the number of samples. This behavior cannot be observed when each condition (n, D) is executed only a few times; it emerges when averaging thousands of repetitions. Furthermore, it is closely related to the complexity of the PLS-DA models: the frequency distribution of the optimum LV number follows the same tendency. Therefore, the sample-to-dimensionality ratio of the training data strongly affects the complexity of the optimal model, which is maximal when the number of training samples approaches the number of features (i.e., a determined system). Hence, at the overfitting peak, the maximum number of LV is frequently selected.

Fig. 2 CR estimation in fourfold CV for simulated data. Cases with 0.9 correlation (solid line) and without correlation (dashed line). Mean after 1000 runs for each D (n = 50)

On the other hand, Fig. 2 shows that data with covariance matrix Σ = (σii = 1, σij = 0.9) lead to more overoptimistic results than data without correlated features; that is, stronger multicollinearity among the independent variables causes more overfitting. In any case, the qualitative behavior is the same and the magnitude of overfitting is large in both cases. We would like to remark that, at the peak, overfitting can reach a mean CR of 70% or 67% for the cases with 0.9 and 0.0 correlation, respectively. Since omics data usually contain correlated features, this highlights the need to examine results with a critical eye and even to consider the adequacy of using independent subsets of variables.

For a more comprehensive comparison, we should consider not only the bias of the estimator but also its variance; thus, we computed the spread of the accuracy estimated with fourfold CV. Figure 3 shows the distributions for n = 18 and 118 at approximately the peak of overfitting, i.e., D = 13 and 89 features (D = ntraining − 1). Both CR estimations are biased and have a certain variance, but a higher sample count provides lower bias and variance. Furthermore, our results indicate that, even with a true CR of 50%, internal K-Fold CV may give classification rates over 90% when the sample size is small (e.g., n = 18). When K-Fold CV is executed only once, the estimation may yield an extremely good or bad result merely by chance. Consequently, this highlights the need for permutation tests in order to know the accuracy distribution of a random classification and to assess the significance of the results against chance.
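As an illustration, the null distribution of the CV accuracy can be obtained with a label permutation test as in the following sketch, which reuses the hypothetical cv_accuracy helper sketched above together with illustrative data:

```r
# Sketch: permutation test for the CV accuracy, reusing the cv_accuracy helper
# sketched above. Labels are shuffled many times and the whole CV procedure is
# repeated, giving the CR distribution expected by chance.
set.seed(1)
n <- 50; D <- 30
X <- matrix(rnorm(n * D), nrow = n)               # hypothetical data matrix
y <- rep(c(0, 1), each = n / 2)                   # hypothetical 0/1 labels
folds <- sample(rep(1:4, length.out = n))         # fourfold CV assignment

n_perm  <- 1000
cr_obs  <- cv_accuracy(X, y, folds, max_lv = 10)$cr_hat
cr_null <- replicate(n_perm,
                     cv_accuracy(X, sample(y), folds, max_lv = 10)$cr_hat)

# Fraction of permutations reaching the observed CR (with the +1 correction)
p_value <- (sum(cr_null >= cr_obs) + 1) / (n_perm + 1)
```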

Fig. 3 CR distributions in fourfold CV for simulated data N(0, 1). Cases with (n, D) = (18, 13) and (n, D) = (118, 89). Mean after 5000 runs for each pair (n, D)

Distinct CV strategies were compared for many sample-to-dimensionality ratios in synthetic data. Figure 4 shows the mean CR after repeating a thousand times the population generation and the internal CV estimations for ntotal = 50 and each case of D. This figure suggests that overfitting follows the same qualitative behavior for every CV method, i.e., overfitting peaks when the dimensionality approaches the number of training samples, except for Bootstrap, which shows a flatter curve. The training sample size is 49 for LOO and 38 for K-Fold and RS, whereas Bootstrap does not define a single number of test samples per iteration. However, it is known that on average training contains 63.2% of the distinct samples, i.e., approximately 32 different samples for training and 18 for test. Precisely for this reason, Bootstrap is the only method that does not show a sharp overfitting peak. Our results show that the magnitude of the bias changes among CV strategies. Specifically, LOO and K-Fold seem to be the most biased, while RS, BLP, and Bootstrap give more accurate estimations; these latter approaches have a large number of resampling iterations and seem to make better use of the available data. When the number of features approaches the training sample size, LOO produces higher overfitting than K-Fold; on the contrary, K-Fold introduces more bias than LOO far from the peak. Moreover, K-Fold gives more overoptimistic estimations than RS or BLP. RS and BLP give essentially the same results and closely follow the Bootstrap approach. Bootstrap gives the least biased estimation, whose maximum bias is only 3% for the simulated dataset. The current results indicate that, for high-dimensionality scenarios, RS and BLP are less biased than Bootstrap, while this trend reverses when the number of features decreases.

Fig. 4 Mean CR of PLS-DA models with five cases of internal CV: LOO, K-Fold, RS, Bootstrap, and BLP. Mean after 1000 runs for each D with simulated data N(0, 1) and n = 50

In each CV approach, the peaks are located at different dimensionalities since the training subsets have different sizes for a given total number of samples. Systematic computation of the CR under different conditions suggests that a number of training samples similar to the dataset dimensionality is a condition of analysis that should be avoided, since it appears to be the worst scenario in terms of overfitting.

We also evaluated the RMSE of the CR estimator in order to consider both bias and variance in the overfitting assessment. After repeating a thousand generations and evaluations of datasets with distinct (n, D) pairs, we represented the average RMSE (Fig. 5). These models were optimized and evaluated with the five presented CV methods. The risk of obtaining overoptimistic estimations was shown to depend on the sample-to-dimensionality ratio and the CV strategy employed, as previously hypothesized. A trade-off between bias and variance was observed, such that conditions of minimum bias corresponded to maximum variance and vice versa. This evaluation allows us to clearly rank the CV methods in terms of their ability to provide accurate estimations of the CR; in this sense, these results advocate the use of Bootstrap, followed by BLP, RS, K-Fold, and finally LOO. Note the pronounced peak of LOO in terms of the RMSE of the estimator. To put these results into context, the reader should remember that omics datasets are often characterized by small sample counts and that, in those conditions, LOO is the validation technique preferred by many authors [75,76,77,78,79,80]. Finally, while Bootstrap seems to be the best CV procedure according to the RMSE criterion, it is important to realize that it will always introduce some overoptimism in the estimation of the CR.

Fig. 5 Mean RMSE of PLS-DA models with five cases of internal CV: LOO, K-Fold, RS, Bootstrap, and BLP. Mean after 1000 runs for each D with simulated data N(0, 1) and n = 50

Microarray dataset

We utilized the RNA microarray dataset containing 70 gene-expression features to build PLS-DA binary classification models for breast cancer prognosis. For this dataset, the expected CR is not 50%. We set the sample size to the maximum-overfitting condition according to the conclusions derived from the previous experiments, i.e., the training sample count minus one equaled the number of features. Since the number of genes is 70, we set the number of training samples to 71 in every case, which implies a different total number of examples for every CV method. This choice of sample size allowed us to compare the CV techniques in a scenario of significant overfitting, namely at the overfitting peak.

To evaluate the overfitting introduced by the CV method, we have to resort to EV as an unbiased estimation of the CR. The main problem here is that, since the EV set has a finite number of samples, the estimated CR has some unavoidable variance. To obtain comparable test results, the number of EV samples was always 50 per class, which corresponded to the minimum number of remaining samples among all the CV cases.

Table 1 shows the results of the CR in internal and external validation, as well as the RMSE of the estimator. The ascending rank of RMSE, which considers both bias and variance, is Bootstrap, BLP, RS, K-Fold, and LOO, which coincides with the one obtained for the simulated datasets. Comparing K-Fold with LOO, we observe that a larger k leads to an increment in variance. K-Fold, RS, and BLP have the same number of training samples, but in RS and BLP, which have more folds or iterations than K-Fold, the bias diminishes.

Table 1 Comparison of CV strategies in terms of CR in IV and EV and RMSE of the estimator in IV

On the other hand, it is interesting to highlight that different CV techniques produce PLS-DA models with different accuracies. The table shows that Bootstrap provides the best CR in EV, whereas LOO leads to the worst predictive accuracy. Thus, Bootstrap appears to be the best approach to optimize model complexity, not only in terms of better performance estimation but also in terms of producing the most accurate model. From the inspection of this table, we would like to remark that models where the overfitting is larger typically give poorer results in EV. In other words, the selection of the CV method is key not only to obtaining a small bias but also to obtaining the most accurate model. The results of the 1000 trials were used to test whether the IV and EV results were statistically different for all the CV methods. Bootstrap with 100 iterations was the only case in which the null hypothesis of equality between the internal and external CR distributions could not be rejected. Moreover, when comparing the estimations between the different methods, all of them were statistically different except RS and BLP (p values of 0.762 and 0.227 for IV and EV, respectively). The external CRs of BLP and Bootstrap also had a p value close to the 5% significance level (p value = 0.041), but the rest of the p values were below 0.01, indicating that under these conditions the magnitude of overfitting of the methods is statistically different.

Figure 6 shows the accuracy distributions of internal CV and EV for the cases of LOO (Fig. 6a) and Bootstrap (Fig. 6b). These plots clearly depict the extent of the overfitting in a case where real discrimination between the two classes does exist. In the case of LOO (Fig. 6a), the internal distribution is clearly shifted to the right and has a tail towards very high accuracies. This effect is much less evident for Bootstrap, as Fig. 6b shows. In Fig. 6b, Bootstrap was computed with 100 iterations, since this is a typically used value.

Fig. 6 CR distributions of internal CV and EV hold-out for real microarray data. a Internal LOO CV and b internal Bootstrap

Together with a larger bias, LOO CV yields an estimation with more variance than Bootstrap. In fact, the peaks of the internal and external density distributions for Bootstrap almost fully overlap. The latter feature is highly desirable, because it means that internal CV provides almost the same statistics as EV.

In summary, this study with real microarray data confirms the dependence of the overfitting on the CV technique implemented. It confirms that LOO can be considered a weak validation practice, while Bootstrap provides more accurate performance estimations with less bias and variance.

Mass spectrometry data

First, we report the results obtained for randomly selected features. Due to the complexity of the dataset, randomly selected features have null discriminant power when this is estimated in EV. In IV, however, we again observe a clear overoptimistic bias, as expected.

The mean classification rates computed over the different feature selections are shown in Fig. 7. As already observed for simulated data, a peak appears when the number of features approaches the number of samples. Before that peak, overoptimism increases with the number of features; beyond the peak, the overoptimism decreases slightly and then saturates. In this region, we can observe a counter-intuitive behavior: for a fixed number of features, overoptimism increases with the number of samples in the dataset, and, similarly, for a given number of samples, overoptimism decreases as the number of features increases.

Fig. 7 CR estimation by LOO for the ARCENE dataset with random feature selection. The mean CR after 500 trials for each D is shown for n = 14, 24, and 54 samples

This behavior is obviously related to the average complexity of the models. In Fig. 8, we observe that the model complexity (in terms of number of LV) also peaks when the number of features equals the number of samples.

Fig. 8 Optimum number of LV selected by LOO in the ARCENE dataset. The mean number of LV after 500 trials for each D is shown for n = 14, 24, and 54 samples

Beyond the evolution of the bias, we can observe how the full probability density function of the estimator behaves. For the case of only 14 samples, Fig. 9 shows the probability distribution of the CR estimator in internal and external validation: Fig. 9a corresponds to three features, while Fig. 9b corresponds to 13 features. In both cases, the LOO estimator is biased and has a much larger variance than the estimator in EV. Figure 9b clearly shows how the bias increases when the number of features approaches the number of samples.

Fig. 9 Distribution of the CR in internal CV (LOO) and EV after 500 repetitions for the case of 14 samples and a 3 and b 13 features

Conclusions

PLS-DA is a preferred predictive model for the analysis of omics data, particularly in metabolomics. Metabolomics data are usually characterized by large dimensionality and small sample counts. In these conditions, PLS-DA models have a very strong propensity to overfit the training data. Although external validation is the recommended practice, many researchers still rely on simple cross-validation in most studies with small datasets.

The need for stronger validation practices, such as CMV and permutation tests, has already been highlighted in the literature. The main contribution of this work is a full characterization of the impact of the sample size, the dataset dimensionality, and the CV method on overfitting.

Therefore, we have shown a strong dependence of PLS-DA overfitting on the sample-to-dimensionality ratio. For the first time, a full scan of the impact of the dataset size, the dimensionality, and the CV technique has been performed. In extreme cases, for small datasets and using LOO, the mean overfit may exceed 20% in a case where there is no discriminant information. We have observed that, for a given number of samples, increasing the dimensionality leads to more complex models that are obviously easier to overfit. However, the maximum number of LV is limited to the number of training samples minus one. From that point on, increasing the feature vector dimensionality does not allow for more complex models; instead, the additional features act as a source of noise and provide a regularization effect on the complexity of the models, leading to simpler models with less overfit. We have shown that PLS-DA overfitting in CV peaks when the sample size in the training set approaches the number of dimensions of the dataset, and that it decreases away from the peak even if the number of dimensions is much larger than the number of samples. This should be chiefly considered in dimensionality reduction prior to PLS-DA modeling, since a number of training samples matching the number of dimensions appears to be a scenario to avoid. As suggested previously, permutation tests help in determining whether the obtained results are likely to have been obtained by chance.

Among all the CV techniques, Bootstrap provides the most accurate estimator in terms of RMSE, followed by BLP and RS. In fact, for the microarray case under study, the internal and external validation estimations were almost equal in the case of Bootstrap. Resampling validation techniques provide the most efficient use of the available data. In contrast, LOO provided estimations with large variance and also with more bias than the other CV strategies. This result is important in omics research because of the popularity of LOO for the small-sample datasets typically encountered. Additionally, the models obtained with LOO show degraded performance in external validation compared with the other CV techniques.

This work highlights the need for strong validation methodologies to be used in conjunction with PLS-DA, since the uncritical use of these techniques may lead to overoptimistic results and contribute to the irreproducibility problem in omics research. A rigorous validation strategy is key to avoiding overfitting. Hence, we strongly encourage the use of external validation (e.g., double CV or CMV) to obtain an unbiased estimation of the model’s predictive performance, and of permutation tests to evaluate whether the obtained CR is statistically significant.