Abstract
Benchmarks are an integral part of machine learning development. However, the most common benchmarks share several limitations: the difference in performance between two models has no probabilistic interpretation, such differences cannot be meaningfully compared across data sets, and there is no reference point that indicates a significant performance improvement. Here we introduce an Elo-based predictive power (EPP) meta-score that is built on top of other performance measures and allows for interpretable comparisons of models. Differences between EPP scores have a probabilistic interpretation and can be compared directly between data sets. Furthermore, this meta-score allows for an assessment of ranking fitness. We prove the properties of the Elo-based predictive power meta-score and support them with empirical results on a large-scale benchmark of 30 classification data sets. Additionally, we propose a unified benchmark ontology that provides a uniform description of benchmarks.
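The probabilistic interpretation mentioned above rests on the Elo/Bradley–Terry relation: the probability that one model beats another depends only on the difference between their scores, through a logistic link. A minimal sketch of that relation (the function name `win_probability` is illustrative, not part of the EloML package API):

```python
import math

def win_probability(epp_a: float, epp_b: float) -> float:
    """Probability that model A outperforms model B in a single
    comparison, given their Elo-like scores.

    Bradley-Terry / logistic link: only the score *difference*
    epp_a - epp_b matters, which is what makes score differences
    directly comparable across data sets."""
    return 1.0 / (1.0 + math.exp(-(epp_a - epp_b)))

# Equal scores give a coin flip; a one-point gap gives roughly a
# 73% chance that the higher-rated model wins the comparison.
print(win_probability(2.0, 2.0))                  # 0.5
print(round(win_probability(3.0, 2.0), 3))        # 0.731
```

Note that the probabilities for the two orderings always sum to one, so the relation is symmetric in the two models.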
Data availability
The data sets generated during the current study are available in the EPP meta-score GitHub repository at https://github.com/agosiewska/EPP-meta-score36. Source data are provided with this paper.
Code availability
An implementation of the EPP score is available at https://github.com/ModelOriented/EloML. The code generated during the current study is available in the EPP meta-score GitHub repository at https://github.com/agosiewska/EPP-meta-score36.
References
Wang, A. et al. GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: analyzing and interpreting neural networks for NLP (eds. Linzen, T., Chrupała, G. & Alishahi, A.), 353–355 (Association for Computational Linguistics, 2018).
Wang, A. et al. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Adv. Neural Inform. Process. Syst. 3261–3275 (2019).
Zhai, X. et al. A large-scale study of representation learning with the Visual Task Adaptation Benchmark. Preprint at https://arxiv.org/abs/1910.04867 (2020).
Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP) – round XIII. Proteins 87, 1011–1020 (2019).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 1–23 (2019).
Lensink, M. F., Nadzirin, N., Velankar, S. & Wodak, S. J. Modeling protein–protein, protein–peptide, and protein–oligosaccharide complexes: CAPRI 7th edition. Proteins 88, 916–938 (2020).
Vanschoren, J., van Rijn, J. N., Bischl, B. & Torgo, L. OpenML: networked science in machine learning. SIGKDD Explor. Newsl. 15, 49–60 (2014).
Martínez-Plumed, F., Barredo, P., hÉigeartaigh, S. Ó. & Hernández-Orallo, J. Research community dynamics behind popular AI benchmarks. Nat. Mach. Intell. 3, 581–589 (2021).
Powers, D. Evaluation: from precision, recall and F-factor to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2, 37–63 (2008).
Sokolova, M. & Lapalme, G. A systematic analysis of performance measures for classification tasks. Inform. Process. Manage. 45, 427–437 (2009).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Abadi, M. et al. Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16), 265–283 (USENIX Association, 2016).
Bischl, B. et al. mlr: machine learning in R. J. Mach. Learn. Res. 17, 1–5 (2016).
Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006).
Dietterich, T. G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10, 1895–1923 (1998).
Alpaydin, E. Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Comput. 11, 1885–1892 (1999).
Bouckaert, R. R. Choosing between two learning algorithms based on calibrated tests. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning (eds. Fawcett, T. & Mishra, N.), ICML'03, 51–58 (AAAI Press, 2003).
Salzberg, S. L. On comparing classifiers: pitfalls to avoid and a recommended approach. Data Min. Knowl. Discov. 1, 317–328 (1997).
Guerrero Vázquez, E., Yañez Escolano, A., Galindo Riaño, P. & Pizarro Junquera, J. in Bio-Inspired Applications of Connectionism (eds. Mira, J. & Prieto, A.), 88–95 (Springer, 2001).
Pizarro, J., Guerrero, E. & Galindo, P. L. Multiple comparison procedures applied to model selection. Neurocomputing 48, 155–173 (2002).
Hull, D. Information Retrieval Using Statistical Classification. PhD thesis, Stanford Univ. (1994).
Brazdil, P. B. & Soares, C. A comparison of ranking methods for classification algorithm selection. In Machine Learning: ECML 2000 (eds. López de Mántaras, R., Plaza, E.), 63–75 (Springer, 2000).
Elo, A. & Sloan, S. The Rating of Chess Players, Past and Present (Ishi, 2008).
Bischl, B. et al. OpenML benchmarking suites. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (eds. Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S. & Wortman Vaughan, J.), vol. 1 (Curran Associates, Inc., 2021).
Kretowicz, W. & Biecek, P. MementoML: performance of selected machine learning algorithm configurations on OpenML100 datasets. Preprint at https://arxiv.org/abs/2008.13162 (2020).
Probst, P., Boulesteix, A.-L. & Bischl, B. Tunability: importance of hyperparameters of machine learning algorithms. J. Mach. Learn. Res. 20, 1–32 (2019).
Bradley, R. A. & Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 39, 324–345 (1952).
Clark, A. P., Howard, K. L., Woods, A. T., Penton-Voak, I. S. & Neumann, C. Why rate when you could compare? Using the “EloChoice” package to assess pairwise comparisons of perceived physical strength. PLOS ONE 13, 1–16 (2018).
Agresti, A. Categorical Data Analysis Vol. 482, Ch. 6 (Wiley, 2003).
Shimodaira, H. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51, 492–508 (2002).
Shimodaira, H. Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling. Ann. Stat. 32, 2616–2641 (2004).
Suzuki, R. & Shimodaira, H. Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics 22, 1540–1542 (2006).
Agresti, A. Categorical Data Analysis Vol. 482, Ch. 4 (Wiley, 2003).
Gosiewska, A., Bakała, M., Woźnica, K., Zwoliński, M. & Biecek, P. EPP: interpretable score of model predictive power. Preprint at https://arxiv.org/abs/1908.09213 (2019).
Gosiewska, A. & Woźnica, K. agosiewska/EPP-meta-score: EPP paper. Zenodo https://doi.org/10.5281/zenodo.6949519 (2022).
Acknowledgements
Work on this project is financially supported by NCN Opus grant 2017/27/B/ST6/01307.
We thank L. Bakała and D. Rafacz for inspiring ideas, W. Kretowicz and M. Zwoliński for preliminary work35 and P. Teisseyre, E. Sienkiewicz, H. Baniecki and B. Rychalska for useful comments.
Author information
Authors and Affiliations
Contributions
A.G. and K.W. designed and implemented the EPP method as an R package, and studied and described the theoretical properties of the EPP. A.G. produced the EPP Leaderboard on the VTAB benchmark and developed the unified benchmark ontology. K.W. produced the EPP Leaderboard on the OpenML benchmark and designed and performed the simulations in the Supplementary Materials. P.B. supervised the project, provided technical advice, helped design the method and analysed the experiments. All authors participated in the conceptualization and preparation of the paper.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Philipp Probst and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 A unified Ontology of ML Benchmarks.
The violet dashed rectangle shows a minimal setup for any benchmark.
Supplementary information
Supplementary information
Supplementary Figs. S1–S5, Discussion S1–S5 and Tables 1 and 2.
Supplementary Data 1
Source Data Supplementary Material Fig. 1.
Supplementary Data 2
Source Data Supplementary Material Fig. 2.
Supplementary Data 3
Source Data Supplementary Material Fig. 3.
Supplementary Data 4
Source Data Supplementary Material Fig. 4.
Supplementary Data 5
Source Data Supplementary Material Fig. 5.
Source data
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gosiewska, A., Woźnica, K. & Biecek, P. Interpretable meta-score for model performance. Nat Mach Intell 4, 792–800 (2022). https://doi.org/10.1038/s42256-022-00531-2