Introduction

As a subfield of artificial intelligence, machine learning (ML) is the study of computer algorithms that learn from input data and make predictions on unseen instances [1, 2]. ML algorithms are designed to operate without specific rule-based instructions, improving themselves by learning and self-correcting through experience [1,2,3,4]. ML is broadly grouped into two categories, distinguished by the use of labels in model development: supervised learning requires labels for training, whereas unsupervised learning discovers patterns in data sets without them. Numerous ML algorithms exist, spanning a wide range of complexity. Among these, deep learning is a distinct subfield of ML that is capable of handling very large amounts of data, even without the need for traditional feature extraction.

Owing to advances in deep learning, ML has been expanding rapidly in recent years, reaching into nearly every corner of the radiology environment [5, 6]. Mostly based on supervised learning, ML applications in radiology cover a vast area, ranging from image acquisition to outcome prediction [1]. Common ML tasks in radiology are worklist prioritisation [7], classification of reports [8], risk assessment [9], screening [10], detection [11], segmentation [12], histopathologic diagnosis [13], radiogenomics [14, 15], and image acquisition improvement [16]. Although ML offers various opportunities for radiology, it also brings many methodological challenges and pitfalls [6, 17, 18]. Interestingly, many of these are not specific to radiology and have already been addressed in other fields such as genomics, biostatistics, and bioinformatics [19,20,21,22,23].

Despite the widespread interest in ML, the methodology of ML papers often remains complex and elusive to the radiology community. Without a good understanding of the key methodological concepts, it may be very hard for radiologists to properly assess and critique such works with respect to their validity, reliability, effectiveness, and clinical applicability.

In this paper, we aimed to provide radiologists with a fresh perspective on how to evaluate the published literature and manuscript drafts that use ML in radiology. To achieve this, we concentrated on sixteen key methodological quality concepts of ML.

Key methodological concepts

ML-based pipelines vary to a large extent [24]. Nevertheless, the core concepts usually remain the same. The important steps of an ML pipeline can be grouped as follows: design, data handling, modelling, and reporting. A simplified illustration of these steps is given in Fig. 1. Key concepts that need attention in the evaluation of ML papers are given in Fig. 2. Before going into detail, a glossary of basic ML terminology is provided in Table 1.

Fig. 1 Machine learning-based study pipeline

Fig. 2 Key methodological concepts according to four main study steps

Table 1 Basic terminology of machine learning

Key concepts of study design

Common pitfalls and recommendations for study design are summarised in Table 2.

Table 2 Common pitfalls and recommendations for the key concepts of study design and data handling

Database size

ML projects need large and heterogeneous data sets to ensure generalisability. However, this is rarely achieved in radiology research, for a variety of reasons. A common but avoidable pitfall is training a model on an extremely small data set. Such a premature strategy poses many challenges, for instance, overfitting and sensitivity to noise and outliers.

To the best of our knowledge, there is currently no well-adopted method for determining the optimal database size for ML, and all proposed strategies are empirical. Statistical power calculations might call for thousands of instances, even for establishing the testing set alone, which is hardly achievable for many radiology tasks. To minimise the effects of overfitting and improve the quality of predictive performance metrics, the inclusion of at least 50 instances might be sufficient for initial research [5, 25,26,27]. On the other hand, this number would be inappropriate for the development of highly generalisable and clinically useful real-world ML applications. Another common recommendation is to have a data size more than ten times the number of features [28, 29]. Furthermore, the complexity of the algorithm (e.g., k-nearest neighbours versus deep learning) and of the task (e.g., a substantially heterogeneous population or subtle discriminative features) should always be considered when judging the appropriateness of database size. Aside from these recommendations, another well-known strategy is to plot a learning curve of error or accuracy values against training data size [30].
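As a hedged illustration of the learning-curve strategy, the following sketch uses scikit-learn's learning_curve on synthetic data; the data set, model, and training sizes are placeholders rather than a prescribed setup.

```python
# Minimal sketch: learning curve to judge whether more data would help.
# Synthetic data stand in for a real radiomics feature table.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="roc_auc")

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # A persistent gap between training and validation AUC at the largest
    # size suggests that more data (or a simpler model) is needed.
    print(f"n={n:3d}  train AUC={tr:.2f}  validation AUC={va:.2f}")
```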

Robustness of reference standard

The reference standard is usually an accepted test, a gold standard, or an expert diagnosis. The source of and rationale for the reference standard must be clearly stated in ML papers.

Robustness of the reference standard refers to the stability of labels under varying conditions such as different readers, scanners, or technical protocols, which is critical not only for high-quality model development but also for the overall success of the project [31]. Strategies for reducing such variability include consensus evaluation by experts, majority voting, or selection of a reference standard that is less sensitive to variability. Notably, these concerns about the robustness of reference standards are much more important in medicine than in other fields, because even a small difference in predictive performance might affect a large patient population.
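For illustration, a minimal sketch of majority voting over hypothetical reader labels (1 = positive, 0 = negative) is given below; the reader arrays and values are invented for the example.

```python
# Minimal sketch: deriving a consensus label from three readers by majority vote.
import numpy as np

reader_1 = np.array([1, 0, 1, 1, 0])
reader_2 = np.array([1, 0, 0, 1, 0])
reader_3 = np.array([1, 1, 1, 1, 0])

votes = np.vstack([reader_1, reader_2, reader_3])
consensus = (votes.mean(axis=0) >= 0.5).astype(int)  # label given by the majority
print(consensus)  # -> [1 0 1 1 0]
```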

Key concepts of data handling

Common pitfalls and recommendations for data handling are summarised in Table 2.

Information leakage

Information leakage is one of the most significant pitfalls in ML modelling. It can simply be defined as the transfer of information among the training, validation, and testing data sets due to incomplete separation of the data.

Information leakage may occur at any stage of the ML pipeline. One should be vigilant in detecting this pitfall because its occurrence may not be obvious and can easily be missed [32], even when separate validation and test partitions are reported. Information leakage is frequently encountered in the following data handling steps: feature scaling, dimension reduction or feature selection, and hyperparameter tuning. It can be minimised or completely avoided through careful data separation [33, 34].

It should also be kept in mind that data leakage usually originates in the first steps of the pipeline. Therefore, the data split must be performed at the very beginning, immediately after designating the raw data inputs, because even image preprocessing (e.g., grey-level discretisation according to bin width) applied before feature extraction might lead to leakage and optimistically biased results.
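As a hedged sketch of leakage-safe data handling, the following example keeps scaling and feature selection inside a scikit-learn Pipeline so that they are re-fitted on each training fold only, while a locked test set is split off before any preprocessing; all data and settings are illustrative.

```python
# Minimal sketch: preprocessing kept inside a Pipeline to prevent leakage
# into validation folds; synthetic data stand in for real features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Lock the test set away before any preprocessing decisions are made.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),            # fitted on training folds only
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", SVC(kernel="rbf")),
])

# Cross-validation on the training set; the locked test set stays untouched.
print(cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc").mean())
```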

Feature scaling

Feature values are usually presented on different scales, which needs to be considered in many ML classification tasks because the parameters of some algorithms are influenced by the scale of the features. In particular, distance-based algorithms such as support vector machines, k-nearest neighbours, and artificial neural networks benefit significantly from feature scaling, whereas tree-based algorithms such as random forest have no such requirement. Feature scaling can be done in several ways; the most common approaches are standardisation, normalisation, and logarithmic transformation [35]. It is also important to note that scaling is an integral part of neural network and deep learning architectures [36, 37].

It is worth mentioning that feature scaling is a completely different task from the scaling of images or image intensities [24]. The latter is commonly used to mitigate scanner- or site-specific variability in the newly emerging field of radiomics [24]. Neglecting feature scaling may lead to overrepresentation or underrepresentation of some features and cause bias in the analysis.
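A minimal sketch of the two most common scaling approaches follows, with the scalers fitted on the training set and merely applied to the test set; the small arrays are hypothetical feature values.

```python
# Minimal sketch: standardisation and min-max normalisation, fitted on the
# training set and applied unchanged to the test set.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[10.0, 200.0], [12.0, 180.0], [9.0, 260.0]])
X_test = np.array([[11.0, 240.0]])

std = StandardScaler().fit(X_train)   # z = (x - mean) / standard deviation
X_train_std, X_test_std = std.transform(X_train), std.transform(X_test)

mm = MinMaxScaler().fit(X_train)      # x' = (x - min) / (max - min)
X_train_mm, X_test_mm = mm.transform(X_train), mm.transform(X_test)

print(X_test_std, X_test_mm)
```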

Reliability of features

The reliability of features refers to their reproducibility across the extraction process. Reliable features are highly resistant to changing conditions of feature extraction, for instance, segmentation margin differences [38], use of different image slices (i.e., slice selection bias in 2D analysis) [39, 40], and scanning protocol differences [41,42,43]. When analysing medical images, some preprocessing steps (e.g., pixel/voxel resampling, intensity normalisation) are necessary to obtain reliable features [44]. However, despite these measures, reliability remains a challenge [42, 45, 46]. Reliability can be assessed with several approaches, such as intra-reader and inter-reader agreement analysis for detection of manual and semi-automatic segmentation differences [39, 47, 48], test-retest analysis for automatic methods [49], reproducibility analysis with different scanners or scanning protocols [46], and phantom or simulation measurements [43].
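As an illustration of inter-reader agreement analysis, the following sketch filters features by a hand-rolled two-way random-effects ICC(2,1); the two "readers" and the 0.75 threshold are assumptions for the example rather than part of any specific study.

```python
# Minimal sketch: keeping only features with high inter-reader reliability,
# assessed with an intraclass correlation coefficient ICC(2,1).
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1); ratings has shape (n_subjects, n_readers)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(0)
reader_a = rng.normal(size=(30, 5))                        # 30 lesions, 5 features
reader_b = reader_a + rng.normal(scale=0.2, size=(30, 5))  # re-segmentation

# Keep only features whose ICC exceeds a common reliability threshold.
iccs = np.array([icc_2_1(np.column_stack([reader_a[:, j], reader_b[:, j]]))
                 for j in range(reader_a.shape[1])])
reliable = np.where(iccs >= 0.75)[0]
print(iccs.round(2), reliable)
```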

Developing models without a reliability assessment may lead to significant generalisability problems [50]. Nevertheless, some very interesting works make no attempt to select reliable features for their models [51,52,53]. Disregarding the reliability of features may be acceptable only on the condition that the work has a truly external and independent validation cohort that is large enough to avoid bias and misleading conclusions.

High dimensionality

Advances in radiomics approaches have led to the extraction of a very high number of features, that is, high dimensionality. High dimensionality is considered a challenge to be dealt with in ML because it may induce multicollinearity, overfitting, and false discovery. Hence, redundant features should be eliminated through certain dimension reduction strategies.

Several methods can be used for dimension reduction [54]. Intra-reader and inter-reader feature reliability analysis, multicollinearity analysis, clustering, principal component analysis, and independent component analysis are the most common unsupervised methods. On the other hand, algorithm-based feature selection (e.g., wrapper, embedded, or filtering methods) is the most widely used supervised method [55, 56].
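As a hedged sketch of unsupervised dimension reduction, the example below combines a pairwise-correlation (multicollinearity) filter with principal component analysis; the correlation threshold, variance target, and synthetic feature matrix are illustrative choices.

```python
# Minimal sketch: multicollinearity filter followed by PCA on a synthetic
# feature matrix (e.g., 100 patients, 40 radiomic features).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))

# Drop one feature from every pair correlated above |r| = 0.90.
corr = np.corrcoef(X, rowvar=False)
upper = np.triu(np.abs(corr), k=1)
keep = [j for j in range(X.shape[1]) if not np.any(upper[:j, j] > 0.90)]
X_reduced = X[:, keep]

# Retain the principal components explaining 95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X_reduced)
print(X_reduced.shape, X_pca.shape)
```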

Perturbations in feature selection

The algorithm-based feature selection process has some inherent susceptibilities related to the data structure, such as data size, the order of the input data, and random initialisation. These susceptibilities are particularly apparent with small data sets. A common pitfall is therefore to select features without considering possible perturbations in feature selection, which may lead to inappropriate feature selection and, in turn, generalisability problems [57, 58]. The easiest way to minimise such susceptibilities is to select features over multiple samplings, folds, or random initialisations.
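A minimal sketch of such a stability check follows: a univariate selector is repeated over bootstrap resamples and only features selected in most repetitions are retained; the selector, number of resamples, and 80% threshold are illustrative assumptions.

```python
# Minimal sketch: assessing perturbation of feature selection by repeating
# the selector over bootstrap resamples and keeping frequently chosen features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=120, n_features=30, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)

counts = np.zeros(X.shape[1])
for _ in range(100):                               # 100 bootstrap resamples
    idx = rng.integers(0, len(y), size=len(y))     # sample with replacement
    selector = SelectKBest(f_classif, k=8).fit(X[idx], y[idx])
    counts += selector.get_support()

stable = np.where(counts >= 80)[0]                 # selected in >= 80% of runs
print(stable)
```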

Key concepts of modelling

Common pitfalls and recommendations for modelling are summarised in Table 3.

Table 3 Common pitfalls and recommendations for the key concepts of modelling and reporting

Class balance

Class imbalance is an important issue in ML [45]. In case of severe imbalance, some algorithms tend to vote for the majority class, producing unrealistic outcomes and, in turn, very poor generalisability [59]. Ignoring class imbalance is a major pitfall in ML modelling. For this reason, ML-based analyses with severe imbalance should include certain countermeasures such as oversampling (synthetic or original), undersampling, or training with a trade-off between sensitivity and specificity.

It is also worth noting that sampling strategies are usually recommended only for the training set; the rule of thumb is that no sampling should be done on the testing set. This is particularly important in the medical context because balancing the classes in the test data might distort the actual disease prevalence, yielding poor clinical risk predictions. Undersampling should also be used cautiously in the medical context because it might increase the risk of overfitting [60].
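As a hedged illustration, the sketch below randomly oversamples the minority class of the training split only, leaving the test split at its original prevalence; the imbalance ratio and data are synthetic.

```python
# Minimal sketch: random oversampling applied to the training set only;
# the test set keeps its original class prevalence.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

rng = np.random.default_rng(0)
minority = np.where(y_train == 1)[0]
majority = np.where(y_train == 0)[0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)

X_bal = np.vstack([X_train, X_train[extra]])
y_bal = np.concatenate([y_train, y_train[extra]])
print(np.bincount(y_train), np.bincount(y_bal), np.bincount(y_test))
```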

Bias-variance trade-off

Finding the trade-off between bias and variance is a vital task in ML for obtaining models that generalise well (Fig. 3). The bias-variance trade-off can be addressed by several methods. First, different algorithms spanning a wide range of complexity levels, together with different penalisation and regularisation strategies, can be evaluated with systematic validation methods, and the one that minimises the total prediction error is chosen. Second, strong resampling techniques can be incorporated into modelling; for instance, bootstrap aggregating (bagging) can be used to reduce variance. Third, models can be optimised or tuned by adjusting their hyperparameters; the number of hyperparameters may also be changed in this context. Fourth, the data size can be altered to establish the optimal trade-off.

Fig. 3 Bias-variance trade-off and related concepts. (a) In simple terms, bias and variance are prediction errors. Bias is the difference between predictions (black dots) and actual values (light blue areas) that occurs when prediction models are prejudiced. Variance, on the other hand, is the level of variability and spread of predictions (black dots) around the actual values (light blue areas). (b) If an algorithm is too complex, it may learn the noise in the training data, leading to good training and poor test performance, which is called overfitting. If an algorithm is too simple, it may not learn important aspects of the data, leading to poor performance in both training and testing, which is called underfitting. Bias and variance have an inverse relationship: if one increases, the other decreases. A suboptimal bias-variance trade-off leads to overfitting or underfitting; high variance leads to overfitting, whereas high bias leads to underfitting. Finding the trade-off between bias and variance is an important task in machine learning modelling to obtain models that generalise well. Usually, what matters is the total error rather than the individual components of bias and variance. In practice, there is no single analytical method for finding the optimal trade-off zone; it is critical to experiment with different model complexity levels to find the one that minimises the overall error
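The sketch below illustrates one practical way to explore this trade-off, assuming tree depth as the complexity hyperparameter: training and cross-validated scores are compared across depths, with a widening gap indicating increasing variance (overfitting); the data and depth range are illustrative.

```python
# Minimal sketch: probing the bias-variance trade-off by varying one
# complexity hyperparameter (tree depth) on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

depths = np.arange(1, 11)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="roc_auc")

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Low train and validation scores at small depths suggest underfitting
    # (high bias); a large train-validation gap at high depths suggests
    # overfitting (high variance).
    print(f"depth={d:2d}  train AUC={tr:.2f}  validation AUC={va:.2f}")
```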

Hyperparameter tuning

ML models include parameters and hyperparameters, with which modelling behaviour is configured for a given task. Parameters (e.g., support vectors in a support vector machine and weights in an artificial neural network) are calculated internally from the input data, whereas hyperparameters (e.g., C of a support vector machine and the learning rate of an artificial neural network) are configured externally. Practitioners of ML cannot directly interfere with model parameters while the model is being trained. However, the selection of some settings (e.g., the type of loss function) before training depends entirely on the practitioner and the peculiarities of the data being studied (e.g., organ, disease).

Determining the best hyperparameters, known as hyperparameter tuning or optimisation, is also a crucial task in ML-based modelling [61]. The aim of hyperparameter tuning is to find the optimal set of hyperparameters that reduces a predefined loss function and increases the predictive performance of the model on independent test data. In this context, one should always question whether a paper relied on default hyperparameter configurations or simply copied settings from previous related works. The most common hyperparameter tuning strategies are manual configuration, automated random search, and grid search.
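A minimal grid-search sketch follows, tuning illustrative SVM hyperparameters on the training split only and scoring the held-out test split once at the end; the grid values are assumptions for the example.

```python
# Minimal sketch: hyperparameter tuning with a grid search, run only on the
# training split; the held-out test split is scored once at the end.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC(probability=True))])
grid = {"clf__C": [0.1, 1, 10, 100], "clf__gamma": ["scale", 0.01, 0.001]}

search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc").fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```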

Performance metrics

The discriminative performance of ML models is generally evaluated using accuracy or the area under the receiver operating characteristic curve. In the medical context, sensitivity, specificity, positive predictive value, negative predictive value, and the confusion matrix should be the minimum requirements for reporting predictive performance. The confusion matrix itself is important not only for calculating various other metrics but also for eligibility in future meta-analyses. The concordance index and Dice coefficient are other common discriminative performance metrics, used for survival analysis and segmentation, respectively. For regression models with continuous outputs, the following metrics should be included: R squared, mean squared error, root mean squared error, root mean squared logarithmic error, and mean absolute error.

For both classification and regression tasks, all performance metrics should be reported separately for the training and testing sets because they are informative about the fitting status of the models. Furthermore, in comparative studies such as ML versus human expert reading, care should be taken to report the same metrics for all methods being compared.

In the case of class imbalance, the Matthews correlation coefficient, F1 measure, area under the receiver operating characteristic curve, and area under the precision-recall curve are important metrics to be included in the results. Furthermore, a detailed evaluation of the confusion matrix and “no-information rate” is also helpful in the assessment of any work that suffers from class imbalance.

Point estimates of performance metrics might be misleading, particularly when dealing with small data. Thus, the variability of performance metrics should be reported as well. The confidence interval, standard deviation, and standard error are common indicators of performance variability.
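As a hedged example of such reporting, the sketch below derives sensitivity, specificity, PPV, and NPV from the confusion matrix and adds a bootstrap 95% confidence interval for the AUC; the model, data, and 0.5 decision threshold are illustrative.

```python
# Minimal sketch: confusion-matrix metrics plus a bootstrap CI for the AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]
pred = (prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("sensitivity", tp / (tp + fn), "specificity", tn / (tn + fp))
print("PPV", tp / (tp + fp), "NPV", tn / (tn + fn))

# 2000 bootstrap resamples of the test set for a 95% CI around the AUC.
rng = np.random.default_rng(0)
aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_te), len(y_te))
    if len(np.unique(y_te[idx])) == 2:     # both classes must be present
        aucs.append(roc_auc_score(y_te[idx], prob[idx]))
print("AUC", roc_auc_score(y_te, prob),
      "95% CI", np.percentile(aucs, [2.5, 97.5]).round(3))
```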

Generalisability

Generalisability in ML can be defined as the adaptability of a model to previously unseen examples. It is assessed with two strategies: internal validation and independent validation. Because internal validation may lead to an overestimation of performance, assessment of generalisability on an independent data set is important. For a true generalisability assessment, the independent validation set must correctly represent the actual population of interest, for instance in terms of disease prevalence and demographics. Owing to inconsistent terminology, it is very common to encounter a lack of transparency as to whether the independent validation set was truly independent. Independent validation is ideally achieved through the participation of external institutions; nonetheless, scanner-based independent validation within the same institution can be as valuable as institution-based external independent validation. Validation terminology and simplified strategies are summarised in Fig. 4.

Fig. 4 Simplified validation strategies in machine learning. In general, machine learning projects include three data partitions: training, validation, and testing. The training set is used iteratively to establish the optimal parameter values specific to each machine learning algorithm. Internal performance of the model is evaluated on a validation set (i.e., tuning set). Following many iterations of training and validation, the model is applied to unseen test data for its final performance evaluation. (a) Splitting data into training and testing sets. If the testing set includes instances from the same institution or the same scanner, the method is called hold-out; if it comes from another institution or another scanner, the method is called independent testing. The training set includes further validation or, sometimes, testing parts. The training part should be used for dimension reduction, model development, and hyperparameter tuning. The testing part must be locked at the beginning of the study to prevent bias in performance evaluation. (b) Cross-validation. This method has no overlap among the validation parts. The validation part can be a proportion of the data (e.g., ten-fold cross-validation) or a single instance (i.e., leave-one-out cross-validation) in each sampling. (c) Random sampling. This method has overlaps among validation parts; on the other hand, its major strength is that the number of iterations is much higher than in simple cross-validation. The most common random sampling techniques are bootstrap validation and random subsampling, which differ mainly in whether sampling is performed with replacement. (d) Nested cross-validation. Although rather complex, this method includes separate, non-overlapping testing parts and thus simulates the previously described hold-out method. V, validation
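The nested cross-validation strategy in panel (d) can be sketched as follows, with hyperparameter tuning in the inner loop and performance estimation in the outer loop; the algorithm and grid are illustrative assumptions.

```python
# Minimal sketch: nested cross-validation, i.e., an inner tuning loop wrapped
# by an outer performance-estimation loop.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

inner = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("clf", SVC())]),
    {"clf__C": [0.1, 1, 10]}, cv=3, scoring="roc_auc")

# The outer 5-fold loop never sees the data used to pick the hyperparameters.
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(outer_scores.mean().round(3), outer_scores.std().round(3))
```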

Clinical utility

ML papers usually focus on performance metrics when assessing the diagnostic value of the proposed method, whereas clinical utility is often disregarded in ML-based classification tasks. As a result, claims about the improved predictive performance of ML tools remain uncertain and weak. The most common tools for assessing clinical utility are calibration statistics [62] and decision curve analysis [63].

Calibration statistics refers to the process of determining whether the predicted probability scores match the observed probabilities. Rather than categorical outputs of ML models such as benign versus malignant, probability scores for each target class may be much more useful in radiological decision-making, providing confidence in the diagnosis. A clinically useful model should be well calibrated, with a balance between observed and predicted probability scores. A calibration plot can be used to better present the calibration of the models (Fig. 5).

Fig. 5 Calibration curve for classification tasks. The 45° line of the plot represents perfect calibration. Curves of well-calibrated models (a) lie close to the 45° line, whereas the opposite holds for poorly calibrated models (b)
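As a hedged illustration, the sketch below computes the points of a calibration plot with scikit-learn's calibration_curve; the model, data, and number of bins are placeholders.

```python
# Minimal sketch: observed event rate versus mean predicted probability per
# bin; a well-calibrated model yields points close to the 45-degree line.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

obs, pred = calibration_curve(y_te, prob, n_bins=10)
for p, o in zip(pred, obs):
    print(f"predicted {p:.2f}  observed {o:.2f}")
```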

Decision curve analysis provides complementary information about the net benefits of the model proposed [63, 64]. This is a powerful clinical tool because it takes into account both discriminatory predictive performance and calibration of the models. A simple decision curve example and its basic interpretation are presented in Fig. 6.

Fig. 6 Decision curve analysis for classification tasks. The higher the curve, the higher the sensitivity; the flatter the curve, the higher the specificity. Each model curve is interpreted over a reasonable probability range. The standard line of no medical action (e.g., intervention, surgery, drug therapy, additional diagnostic test) (a) for all instances. The standard line of full medical action (b) for all instances, regardless of diagnosis. Line of a model with high sensitivity and specificity (c), which is better than the other two models (d, e). Line of a model with high sensitivity and low specificity (d). Line of a model with low sensitivity and high specificity (e). TP, true positive
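A minimal decision curve sketch follows, computing the net benefit NB = TP/n - (FP/n) * pt/(1 - pt) over a range of threshold probabilities pt for an illustrative model, together with the "treat all" and "treat none" reference strategies; the data and thresholds are synthetic.

```python
# Minimal sketch: decision curve analysis via the net-benefit formula.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

n, prevalence = len(y_te), y_te.mean()
for pt in np.arange(0.05, 0.65, 0.10):
    pred = prob >= pt
    tp = np.sum(pred & (y_te == 1))
    fp = np.sum(pred & (y_te == 0))
    nb_model = tp / n - fp / n * pt / (1 - pt)
    nb_all = prevalence - (1 - prevalence) * pt / (1 - pt)   # treat everyone
    print(f"pt={pt:.2f}  model={nb_model:+.3f}  treat-all={nb_all:+.3f}  treat-none=0")
```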

Comparison with traditional tools

As with all newly emerging techniques, the usefulness of ML in radiology should be assessed through comparisons with traditional methods. Unless a new ML technique offers improvements over traditional methods, there is little rationale for proposing it for clinical use. Therefore, ML papers should include relevant comparisons with traditional statistical modelling or clinical tools; otherwise, reporting ML results in isolation neither reflects nor influences clinical practice, limiting deployment in real-world health care. Potential comparators include traditional modelling techniques such as logistic regression and clinical tools (e.g., qualitative expert readings) that are already used in daily radiology practice. Such comparisons should be made on the same data sets. In these comparisons, negative results are as valuable as positive results and should be reported.
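As a hedged sketch of such a comparison, the example below evaluates an illustrative ML model against a logistic regression baseline on the same split and with the same metric; the models and data are placeholders rather than a recommended pairing.

```python
# Minimal sketch: comparing an ML model with a traditional logistic
# regression baseline on the same training/testing split and metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(name, round(roc_auc_score(y_te, prob), 3))
```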

Key concepts of reporting

Common pitfalls and recommendations for reporting are summarised in Table 3.

Sharing data

Sharing data is important for replicability, proper quality assessment, and improvement of the proposed methodology. However, most research papers do not share their relevant data. There are several possible reasons: the authors may not be aware of the importance of data transparency, may want to protect their data from potential misuse, or may even fear falsification attempts or negative comments from other researchers.

Authors of ML papers in radiology should consider sharing their image data, feature data, scripts used for modelling, and resultant model file. Sharing image data might be difficult due to the high volume and technical issues along with ethical and privacy-related concerns [65]. However, feature data, code scripts, and model files can be easily shared using online repositories.

Transparent reporting

Considering the abundance of easy-to-use and open-source toolboxes, it has never been easier to develop an ML model for a given medical task. In such an environment, transparent reporting of every part of a study is key to maintaining quality and replicability. Moreover, factors that limit the generalisability of an ML model to a given use case should not be ignored and must be transparently reported.

Adhering to checklists or guidelines is best practice for transparent reporting. A recent seminal work produced an important checklist, CLAIM (Checklist for Artificial Intelligence in Medical Imaging), designed specifically for reporting artificial intelligence-based research in medical imaging [66]. The following references can also be consulted for the same purpose [67,68,69].

Conclusions

In this paper, we systematically presented the key methodological concepts of ML to improve the academic reading and peer-review experience of the radiology community. Although the recommendations given in this paper are not exhaustive and do not guarantee an error-free evaluation, we hope they will serve as a guide for high-quality assessment.