Abstract
Statistical machine learning models should be evaluated and validated before being put to work. The conventional k-fold Monte Carlo cross-validation (MCCV) procedure uses a pseudo-random sequence to partition instances into k subsets, which usually causes subsampling bias, inflates generalization errors and jeopardizes the reliability and effectiveness of cross-validation. Based on ordered systematic sampling theory in statistics and low-discrepancy sequence theory in number theory, we propose a new k-fold cross-validation procedure that replaces the pseudo-random sequence with a best-discrepancy sequence, which ensures low subsampling bias and leads to more precise expected-prediction-error (EPE) estimates. Experiments with 156 benchmark datasets and three classifiers (logistic regression, decision tree and naive Bayes) show that, in general, our cross-validation procedure can reduce the subsampling bias in the MCCV, lowering the EPE by around 7.18% and the variances by around 26.73%. In comparison, the stratified MCCV can reduce the EPE and variances of the MCCV by around 1.58% and 11.85%, respectively. The leave-one-out (LOO) procedure can lower the EPE by around 2.50%, but its variances are much higher than those of any other cross-validation (CV) procedure. The computational time of our cross-validation procedure is just 8.64% of the MCCV, 8.67% of the stratified MCCV and 16.72% of the LOO. Experiments also show that our approach is more beneficial for datasets characterized by relatively small size and large aspect ratio, which makes it particularly pertinent to bioscience classification problems. Our proposed systematic subsampling technique could be generalized to other machine learning algorithms that involve a random subsampling mechanism.
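To make the core idea concrete, here is a minimal sketch (not the authors' implementation) of a best-discrepancy fold assignment: each instance i is mapped to a fold via the fractional parts of i·α, using the golden-ratio sequence as an illustrative low-discrepancy sequence. The function name `bds_folds` and the choice of α are assumptions for illustration only.

```python
import numpy as np

def bds_folds(n, k, alpha=(5 ** 0.5 - 1) / 2):
    """Assign each of n instances to one of k folds using the
    fractional parts of i * alpha, a low-discrepancy sequence.

    With alpha the golden ratio conjugate, the fractional parts
    fill [0, 1) very evenly, so the k folds stay balanced without
    pseudo-random shuffling.
    """
    frac = np.modf(np.arange(1, n + 1) * alpha)[0]  # fractional parts of i*alpha
    return np.floor(frac * k).astype(int)           # fold index in {0, ..., k-1}

# Example: 10 instances split into 5 folds -- each fold receives
# exactly 2 instances, spread systematically across the ordered data.
folds = bds_folds(10, 5)
```

If the instances are first ordered (e.g., by response value), this assignment plays the role of ordered systematic sampling: neighboring instances land in different folds, which is the mechanism that suppresses subsampling bias.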
Acknowledgements
The first author was supported by the Qilu Youth Scholar Project of Shandong University. The second author was supported by the National Natural Science Foundation of China (Grant No. 11531008), the Ministry of Education of China (Grant No. IRT16R43) and the Taishan Scholar Project of Shandong Province. The authors thank the referees for helpful suggestions on an earlier draft of the paper.
Cite this article
Guo, L., Liu, J. & Lu, R. Subsampling bias and the best-discrepancy systematic cross validation. Sci. China Math. 64, 197–210 (2021). https://doi.org/10.1007/s11425-018-9561-0
Keywords
- subsampling bias
- cross validation
- systematic sampling
- low-discrepancy sequence
- best-discrepancy sequence