Interpretable meta-score for model performance

  • Article

From Nature Machine Intelligence

A preprint version of the article is available at arXiv.


Benchmarks are an integral part of machine learning development. However, the most common benchmarks share several limitations. For example, the difference in performance between two models has no probabilistic interpretation, such differences cannot be meaningfully compared between data sets, and there is no reference point that indicates a significant performance improvement. Here we introduce an Elo-based predictive power (EPP) meta-score that is built on other performance measures and allows for interpretable comparisons of models. Differences in this score have a probabilistic interpretation and can be compared directly between data sets. Furthermore, this meta-score allows for an assessment of ranking fitness. We prove the properties of the EPP meta-score and support them with empirical results on a large-scale benchmark of 30 classification data sets. Additionally, we propose a unified benchmark ontology that provides a uniform description of benchmarks.
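To illustrate what a probabilistic interpretation of score differences can look like, the following is a minimal sketch in Python. It assumes a logistic link of the kind used in Elo-type and Bradley-Terry ratings, where the difference between two scores is the log-odds that one model beats the other. That link, and the function names win_probability and score_difference_from_wins, are assumptions made for this illustration; they are not taken from this page or from the authors' R package.

```python
import math


def win_probability(score_i: float, score_j: float) -> float:
    """Probability that model i beats model j.

    Assumption: a logistic link, so the score difference is read
    as the log-odds of i winning against j.
    """
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def score_difference_from_wins(wins: int, losses: int) -> float:
    """Empirical score difference implied by a head-to-head record.

    This is the log-odds of the observed win rate. It is undefined
    when either count is zero, so that case raises an error.
    """
    if wins <= 0 or losses <= 0:
        raise ValueError("both wins and losses must be positive")
    return math.log(wins / losses)


if __name__ == "__main__":
    # Hypothetical record: model A beats model B in 75 of 100 rounds.
    diff = score_difference_from_wins(75, 25)
    print(f"implied score difference: {diff:.3f}")
    print(f"recovered win probability: {win_probability(diff, 0.0):.2f}")
```

Under this assumption, equal scores correspond to a win probability of 0.5, and the same score difference implies the same win probability on any data set, which is what makes differences comparable across data sets.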

Fig. 1: A diagram of the EPP benchmark.
Fig. 2: Boxplots of EPP scores split by different algorithms across data sets.
Fig. 3: Actual empirical probability and predicted probability of winning computed on the basis of the EPP meta-score value for data sets ‘banknote-authentication’ and ‘wdbc’.
Fig. 4: Boxplots of scores split by four selected models from the VTAB.

Data availability

The data sets generated during the current study are available in the EPP meta-score GitHub repository. Source data are provided with this paper.

Code availability

An implementation of the EPP score is publicly available. The code generated during the current study is available in the EPP meta-score GitHub repository.


Work on this project is financially supported by NCN Opus grant 2017/27/B/ST6/01307.

We thank L. Bakała and D. Rafacz for inspiring ideas, W. Kretowicz and M. Zwoliński for preliminary work (ref. 35) and P. Teisseyre, E. Sienkiewicz, H. Baniecki and B. Rychalska for useful comments.

Author information

Authors and Affiliations



A.G. and K.W. designed and implemented the EPP method as an R package, and studied and described the theoretical properties of the EPP. A.G. prepared the EPP Leaderboard on the VTAB benchmark and developed the unified benchmark ontology. K.W. prepared the EPP Leaderboard on the OpenML benchmark, and designed and performed the simulations in the Supplementary Materials. P.B. supervised the project, provided technical advice, helped design the method and analysed the experiments. All authors participated in the conceptualization and preparation of the paper.

Corresponding authors

Correspondence to Alicja Gosiewska, Katarzyna Woźnica or Przemysław Biecek.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Philipp Probst and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 A unified Ontology of ML Benchmarks.

The violet dashed rectangle shows a minimal setup for any benchmark.

Extended Data Table 1 Example Schemes for EPP Benchmark
Extended Data Table 2 The descriptions of the EPP Benchmark components that extend the Unified Benchmark Ontology
Extended Data Table 3 EPP of selected models for ada_agnostic data set. AUC values are averaged. The numbers of models are IDs from the MementoML benchmark
Extended Data Table 4 The best models in algorithm class for mozilla4 data set. AUC values are averaged. The numbers of models are IDs from the MementoML benchmark
Extended Data Table 5 The best models in algorithm class for credit-g data set. AUC values are averaged. The numbers of models are IDs from the MementoML benchmark
Extended Data Table 6 Springleaf Marketing Response Kaggle Competition.
Extended Data Table 7 IEEE-CIS Fraud Detection Kaggle Competition.

Supplementary information

Supplementary Figs. S1–S5, Discussion S1–S5 and Tables 1 and 2.

Reporting summary

Supplementary Data 1

Source Data Supplementary Material Fig. 1.

Supplementary Data 2

Source Data Supplementary Material Fig. 2.

Supplementary Data 3

Source Data Supplementary Material Fig. 3.

Supplementary Data 4

Source Data Supplementary Material Fig. 4.

Supplementary Data 5

Source Data Supplementary Material Fig. 5.

Source data

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gosiewska, A., Woźnica, K. & Biecek, P. Interpretable meta-score for model performance. Nat Mach Intell 4, 792–800 (2022).


