Introduction

Radiomics research has been rapidly expanding ever since Gillies et al declared that “images are data” [1]. Sophisticated bioinformatics tools are applied to reduce data dimensionality and select features from high-dimensional data, and models with potential diagnostic or prognostic utility are typically developed [1,2,3]. Although radiomics research shows great potential, its current use is confined to the academic literature, without real-world clinical application. High quality of both the science and the reporting may enable radiomics to become an effective imaging biomarker that crosses the “translational gap” [4, 5] and guides clinical decisions.

The quality of scientific research articles consists of two elements: the quality of the science and the quality of the report [6], and deficiencies in either may hamper the translation of biomarkers to patient care [7]. With regard to the quality of the science, a system of metrics in the form of the radiomics quality score (RQS) was developed from the expert opinion of Lambin et al [2] to determine the validity and completeness of radiomics studies. The RQS consists of 16 components that consider radiomics-specific high-dimensional data and modeling, and it accounts for image protocol and feature reproducibility, biologic/clinical validation and utility, performance indices, level of evidence, and open science. With regard to the quality of reporting, radiomics research is a model-based approach, so reporting according to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) initiative [8] is desirable.

To our knowledge, the quality of the science and reporting in radiomics research studies is largely unknown. An RQS study from the score developers [3] reported an average score of less than 50% over 41 radiomics studies, but the RQS remains underutilized because many investigators and peer reviewers are unfamiliar with it. Prediction model studies in the clinical domain have shown suboptimal reporting quality according to TRIPOD [9], but the reporting quality of the radiomics domain has not been studied. In this study, we evaluated the scientific quality of radiomics studies using the RQS and TRIPOD items and assessed whether the score and degree of adherence depend on the study design or journal type. The purpose of the study was therefore to evaluate the quality of the science and reporting of radiomics studies according to the RQS and TRIPOD.

Materials and methods

Article search strategy and study selection

A search was conducted for all potentially relevant original research papers using radiomics analysis published up until December 3, 2018. The search terms used to find radiomics studies were “radiomic” OR “radiogenomic” in the MEDLINE (National Center for Biotechnology Information, NCBI) and EMBASE databases. Eligible articles were those published in medical journals with an impact factor higher than 7.0 according to the 2018 edition of the Journal Citation Reports, as well as those published in the radiology journals Radiology and European Radiology. The impact factor threshold of 7.0 was chosen because articles published in journals above it were considered representative of the reporting of high-quality clinical studies on radiomics analysis. The two imaging journals were chosen because they are the highest-ranked US and non-US general radiology journals, given their impact and status. The inclusion process is shown in Fig. 1. Study selection and data extraction are shown in Supplementary Materials 1.

Fig. 1

Flow diagram of the study selection process. Note: “Non-relevant to radiomics” indicates that the analytic methods were volumetric or location-based measurements not categorized as radiomics analysis

Analysis of method quality based on RQS

The RQS, with its 16 components, is defined in Supplementary Table 1 [2]. The reviewers extracted the data using a predetermined RQS evaluation according to six domains. Domain 1 covers protocol quality and reproducibility of imaging and segmentation: well-documented image protocols (1 point) and/or usage of public image protocols (1 point), multiple segmentations (1 point), phantom study (1 point), and test-retest analysis with imaging at multiple time points (1 point). Domain 2 covers the reporting of feature reduction and validation: feature reduction or adjustment for multiple testing (3 or −3 points) and validation (−5 to 5 points). Domain 3 covers the reporting of the performance index: reporting of discrimination statistics (1 point) with resampling (1 point), calibration statistics (1 point) with resampling (1 point), and application of cut-off analyses (1 point). Domain 4 covers the reporting of biological/clinical validation and utility: multivariable analysis with non-radiomics features (1 point), biological correlates (1 point), comparison with the gold standard (2 points), and potential clinical utility (2 points). Domain 5 covers the demonstration of a higher level of evidence, through a prospective study (7 points) or a cost-effectiveness analysis (2 points). The final domain (domain 6) covers open science, with open availability of source code and data (4 points).
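
To illustrate how these per-item points aggregate into a total, the following minimal sketch (our illustration, not the authors' scoring code; the item names and example values are hypothetical) tallies one study's RQS:

```python
# Illustrative tally of a total RQS from per-item points; the item names
# and example scores below are hypothetical, not from a reviewed study.

def total_rqs(item_scores: dict) -> int:
    """Sum the RQS item scores; the possible total ranges from -8 to 36."""
    total = sum(item_scores.values())
    assert -8 <= total <= 36, "RQS total out of range"
    return total

example = {
    "documented_protocol": 1, "public_protocol": 0,             # domain 1
    "multiple_segmentations": 1, "phantom_study": 0,
    "test_retest": 0,
    "feature_reduction": 3, "validation": 2,                    # domain 2
    "discrimination": 1, "discrimination_resampling": 0,        # domain 3
    "calibration": 0, "calibration_resampling": 0,
    "cutoff_analysis": 0,
    "multivariable_nonradiomics": 1, "biologic_correlates": 0,  # domain 4
    "gold_standard": 2, "clinical_utility": 0,
    "prospective_study": 0, "cost_effectiveness": 0,            # domain 5
    "open_science": 0,                                          # domain 6
}
score = total_rqs(example)
print(f"RQS = {score} ({score / 36:.1%} of the ideal score)")  # RQS = 11 (30.6% ...)
```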

The six domains and the topics that were subject to further discussion until a consensus was reached are described in Supplementary Materials 1.

Analysis of reporting completeness based on TRIPOD statement

The TRIPOD checklist was applied to each article to determine the completeness of reporting. The details of the checklist are described elsewhere [8]; briefly, it consists of 22 main criteria with 37 items. First, the type of prediction model was determined: development only (type 1a), development and validation using resampling (type 1b), random split-sample validation (type 2a), nonrandom split-sample validation (type 2b), validation using separate data (type 3), or validation only (type 4). Details of the TRIPOD checklist and data extraction are shown in Supplementary Materials 1.
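
The model-type assignment can be summarized as a simple decision rule, sketched below (our illustration of the TRIPOD typology, not code used in the study; the function and argument names are hypothetical):

```python
# Illustrative mapping from a study's validation strategy to its TRIPOD
# prediction-model type; names and labels follow the types listed above.

def tripod_type(developed_model: bool, validation: str = None) -> str:
    """Assign a TRIPOD model type from how the model was validated."""
    if not developed_model:
        return "4 (validation only)"
    return {
        None: "1a (development only)",
        "resampling": "1b (development and validation using resampling)",
        "random_split": "2a (random split-sample validation)",
        "nonrandom_split": "2b (nonrandom split-sample validation)",
        "separate_data": "3 (validation using separate data)",
    }[validation]

print(tripod_type(True, "random_split"))  # 2a (random split-sample validation)
```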

Analysis of the role of radiologists

To demonstrate the role of radiologists in the radiomics studies, we calculated the position and number of radiologists in the author lists. Radiologists included both general radiologists and nuclear medicine radiologists. First, we checked whether a radiologist was a main author, that is, the first or corresponding author. When radiologists were not main authors, the position and number of radiologists in the author list were recorded, with the position taken at the first appearance (e.g., if the 3rd and 5th of 8 authors are radiologists, the position is recorded as 3/8 = 0.375).
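
A minimal sketch of this relative-position calculation (our illustration; the function name and example values are hypothetical):

```python
# Relative position of the first-appearing radiologist in the author
# list, as described above; names and values are illustrative only.

def relative_position(radiologist_ranks: list, n_authors: int) -> float:
    """Rank of the first radiologist divided by the number of authors."""
    return min(radiologist_ranks) / n_authors

# e.g., the 3rd and 5th of 8 authors are radiologists
print(round(relative_position([3, 5], 8), 3))  # 0.375
```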

Statistical analysis

For the six domains in the RQS (protocol quality and segmentation, feature selection and validation, biologic/clinical validation and utility, model performance index, high level of evidence, and open science and data), basic adherence was assigned when a score of at least 1 point was obtained without minus points. Basic adherence to the RQS (range, 0–16 criteria) and each scored TRIPOD item (range, 0–35 items) were counted and reported in a descriptive manner using proportions (%). TRIPOD item 5c (an “if done” item) and the validation items 10c, 10e, 12, 13c, 17, and 19a were excluded from both the numerator and denominator when the overall adherence rate was calculated. For all included articles, the total RQS was calculated (score range, −8 to 36) and expressed as mean ± standard deviation. A graphical display for the proportion of studies was adopted from the suggested graphical display for Quality Assessment of Diagnostic Accuracy Studies-2 results [10].
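
The overall adherence rate can be computed as in the following minimal sketch (our illustration, not the study's code; the item identifiers follow the TRIPOD checklist, and the example values are hypothetical):

```python
# Overall TRIPOD adherence for one study: the "if done" and validation
# items are excluded from both numerator and denominator, as stated above.

EXCLUDED_ITEMS = {"5c", "10c", "10e", "12", "13c", "17", "19a"}

def adherence_rate(reported: dict) -> float:
    """Fraction of applicable TRIPOD items (item id -> reported?) adhered to."""
    applicable = {k: v for k, v in reported.items() if k not in EXCLUDED_ITEMS}
    return sum(applicable.values()) / len(applicable)

# Hypothetical extraction for one article (a few items shown for brevity)
reported = {"1": True, "2": True, "3a": False, "5c": True, "10c": False}
print(f"{adherence_rate(reported):.0%}")  # 67% -- 5c and 10c are excluded
```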

Subgroup analyses were performed to determine whether the reporting quality differed according to intended use (diagnostic or prognostic), journal type (clinical or imaging journal), and imaging modality (CT or MRI). Additionally, we compared the RQS between radiogenomics and non-radiogenomics studies. Before subgroup analysis, the RQS was plotted for each journal to observe whether there was a systematic difference between journals (Supplementary Figure 2). As no systematic difference was observed, journal was not adjusted for in the analysis. The nonparametric Mann-Whitney U test was used to compare the RQS in each group. Fisher’s exact test was used to compare proportions of RQS and TRIPOD items, given the small sample sizes in each group. All statistical analyses were performed using SPSS (SPSS version 22; SPSS) and R (R version 3.3.3; R Foundation for Statistical Computing), and a p value < .05 was considered statistically significant.
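
Although the analyses were run in SPSS and R, the two group comparisons reduce to standard calls, sketched here in Python for concreteness (a minimal illustration; the variable names and all values are hypothetical):

```python
# Group comparisons as described above: Mann-Whitney U for RQS totals and
# Fisher's exact test for item-level proportions; all values are made up.

from scipy.stats import fisher_exact, mannwhitneyu

rqs_clinical = [16, 12, 9, 14, 7, 18, 11]   # hypothetical per-study totals
rqs_imaging = [8, 5, 11, 6, 9, 4, 10]

u_stat, p_rqs = mannwhitneyu(rqs_clinical, rqs_imaging, alternative="two-sided")

# e.g., 12 of 25 clinical vs. 10 of 52 imaging studies reporting one item
table = [[12, 25 - 12], [10, 52 - 10]]
odds_ratio, p_item = fisher_exact(table)

print(f"Mann-Whitney U p = {p_rqs:.3f}; Fisher's exact p = {p_item:.3f}")
```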

Results

Characteristics of the included studies

Seventy-seven articles [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87] were finally analyzed. The journal, impact factor, study topic, intended use, imaging modality, number of patients, and model type are summarized in Supplementary Table 2. The number and characteristics of the included radiomics studies are provided in Table 1 and Fig. 2. The mean number of patients was 232 (standard deviation, 248.7; range, 38–2029). The studies were published in 2014 (1 article), 2016 (8 articles), 2017 (16 articles), and 2018 (52 articles). There were 25 articles published in high-IF clinical journals and 52 articles in imaging journals (14 in Radiology and 38 in European Radiology). Most articles were oncologic studies (90.9%). Radiomics analysis was most frequently studied as a diagnostic biomarker (80.5%), followed by a prognostic biomarker (19.5%). MRI was the most studied modality (66.0%), followed by CT (26.0%) and PET or US (each 4.0%). Analysis of the validation methods revealed that external validation was missing in 63 of the 77 studies (81.8%). In the oncologic studies, the study purposes most frequently included histopathologic grade and differential diagnosis (51.9%), followed by molecular or genomic classification (21.4%), survival prediction (12.8%), and assessment of treatment response (11.4%).

Table 1 Characteristics of the 77 included radiomics studies with diagnostic or prognostic utility
Fig. 2

Summary charts of the 77 included radiomics studies are displayed according to disease, biomarker design, imaging type, and topic in oncological studies

RQS according to the six key domains

Table 2 summarizes the results. The average RQS of the 77 studies, expressed as a percentage of the ideal score according to the six key domains, is shown in Fig. 3. The mean RQS of the 77 studies was 9.40 (standard deviation, 5.60), which was 26.1% of the ideal score of 36. The lowest score was −5, and the highest score was 21 (58.3% of the ideal quality score).

Table 2 Radiomics quality score according to the six key domains
Fig. 3

Radiomics quality scores (RQSs) of the 77 included studies expressed as percentage of the ideal score according to the six key domains

In domain 1, all studies except one reported well-documented image acquisition protocols or the use of publicly available image databases. Multiple segmentations by two readers were performed in 35 of the 77 studies (45.4%), including six studies with automatic segmentation. Notably, only five studies [11, 12, 20, 26, 47] conducted imaging at multiple time points and tested feature robustness. No articles conducted a phantom study.

In domain 2, most studies adopted appropriate feature reduction or adjustment for multiple testing (74/77, 96.1%). The studies used the false discovery rate with univariate logistic regression or two-sample t tests (for binary outcomes), as well as a variety of statistical and machine learning methods such as the lasso, elastic net, random forest, recursive feature elimination, and support vector machines. Validation without retraining, on data from the same or a different institution, was performed in 70.1% of studies (54 of 77).
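
For readers less familiar with these techniques, the following minimal sketch illustrates one of them, lasso-penalized logistic regression for feature reduction (our illustration assuming scikit-learn; the data are synthetic stand-ins, not drawn from any reviewed study):

```python
# L1-penalized (lasso) logistic regression drops features by shrinking
# their coefficients to exactly zero; the data here are synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))      # 100 patients, 500 radiomics features
y = rng.integers(0, 2, size=100)     # binary outcome

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(StandardScaler().fit_transform(X), y)

selected = np.flatnonzero(lasso.coef_)   # indices of retained features
print(f"{selected.size} of {X.shape[1]} features retained")
```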

In domain 3, all studies reported discrimination statistics, although one study [23] provided hazard ratios and p values from a log-rank test for survival analysis instead of the C-index.

In domain 4, half of the studies (50.6%) evaluated relationships between radiomics and non-radiomics features, but only 28.6% identified biological correlates of radiomics to provide a more holistic model and imply biological relevance. Less than half of the studies (46.7%) compared results with an existing gold standard. In terms of clinical utility, only 15 studies (19.5%) analyzed a net improvement in health outcomes using decision curve analysis or other statistical tools.

In domain 5, studies were surprisingly deficient in demonstrating a high level of evidence such as a prospective design or cost-effectiveness analysis. Only three studies [21, 48, 55] (3.9%) included prospective validation, and no studies conducted a cost-effectiveness analysis. In domain 6, only three studies [33, 34, 46] (3.9%) made their code and/or data publicly available.

Both feature reduction and validation were missing from the study [25] with the lowest score. Meanwhile, the seven studies with the highest scores [12, 14, 21, 26, 48, 54, 55] (three articles with an RQS of 16, one with 18, one with 19, and one with 21) earned additional points by using publicly available images [12], multiple segmentation [12, 14, 26], test-retest analysis [12, 26], and validation using three or more datasets [12, 21, 26, 55], by demonstrating potential clinical utility using decision curve analysis [14, 54], and by conducting prospective validation [21, 48, 55], with all of them fulfilling the requirements for image protocol quality, feature reduction, and use of a discrimination index.

Completeness in reporting a radiomics-based multivariable prediction model using TRIPOD

The mean number of TRIPOD items reported was 18.51 ± 3.96 (standard deviation; range, 11–26) when all 35 items were considered. The adherence rate for TRIPOD was 57.8% ± 10.9% (standard deviation; range, 33–78%) when “if relevant” and “if done” items were excluded from both the numerator and denominator. The completeness of reporting individual TRIPOD items is shown in Table 3. The detailed results are shown in Supplementary Materials 2.

Table 3 Adherence to individual TRIPOD items in radiomics studies

Subgroup analysis

The results of the subgroup analysis are shown in Table 4. Prognostic studies showed a trend toward a higher RQS than diagnostic studies (11.83 ± 5.03 vs. 8.93 ± 5.52), but the difference was not statistically significant. Prognostic studies scored higher than diagnostic studies on comparison with a “gold standard” (p < .001) and on cut-off analysis (p < .001). This was reflected in the TRIPOD items, with the prognostic studies showing higher adherence rates for describing risk groups (p = .007) and reporting the unadjusted association between predictors and outcome (if done, p = .017).

Table 4 Subgroup analysis of RQS and TRIPOD items in radiomics studies according to the intended use, impact factor, and imaging modality

Studies in clinical journals also showed significantly higher RQS scores than those in imaging journals (12.2 ± 5.23 vs. 8.03 ± 5.17, p = .001). They achieved higher scores for protocol quality (p = .018), test-retest analysis (p < .001), validation (p = .012), multivariable analysis with non-radiomics features (p = .036), finding biologic correlates (p = .009), and prospective design (p = .011). With regard to reporting quality, studies in clinical journals more completely reported the study design or source of data (p = .047) and the unadjusted associations. Meanwhile, studies in imaging journals more frequently reported blind assessment of predictors (p = .038), the flow of participants (p = .012), and the number of predictors and outcomes (p = .001).

Studies utilizing CT tended to have a higher RQS than those using MRI (11.8 ± 3.71 vs. 9.0 ± 5.2), but this trend was not statistically significant. Studies using CT scored higher on test-retest analysis (p = .028), discrimination statistics with resampling or cross-validation (p = .011), and cut-off analysis (p = .039) than those using MRI. Among the TRIPOD items, studies using CT more often clearly stated the study objective and setting in the abstract (p = .006), described both discrimination and calibration (p = .017), and provided the full prediction model (p = .014) than those using MRI.

Fifteen articles studied radiogenomics (19.6% of all studies; 21.4% of oncologic studies). There was no significant difference in RQS (Mann-Whitney U test, p = .862) between non-radiogenomics studies (median 10.5, interquartile range 5.0–13.0) and radiogenomics studies (median 10.0, interquartile range 4.25–14.7), overall or in any of the six domains.

Role of radiologists in radiomics studies

The results are shown in Supplementary Table 3. In 18 articles (23.4%), radiologists were not the main authors. Three articles (3.9%) had no radiologists in the author list. When radiologists were not the main authors, the relative position of the first radiologist in the author list was 0.5, indicating the middle of the author list.

Discussion

In this study, radiomics studies were evaluated with respect to the quality of both the science and the reporting, using the RQS and TRIPOD guidelines. Radiomics studies were insufficient in both regards, with an average score of 26.1% of the ideal RQS and a 57.8% adherence rate to the TRIPOD reporting guidelines. No study conducted a phantom study or cost-effectiveness analysis, few demonstrated a high level of evidence, and openness of data and code was a further limitation. Half of the items that the TRIPOD statement deems necessary to report in multivariable prediction model publications were not completely reported in the radiomics studies. Our results imply that radiomics studies require significant improvement in both scientific and reporting quality.

The six key RQS domains used in this study were designed to support the integration of the RQS into radiomics approaches. Adopted from the consensus statement of the FDA-NIH Biomarker Working Group [4], the three aspects of technical validation, biological/clinical validation, and assessment of cost-effectiveness for imaging biomarker standardization were covered by domains 1 (image protocol and feature reproducibility), 4 (biologic/clinical validation), and 5 (high level of evidence), respectively. With regard to technical validation, radiomics approaches are yet to become a reliable measure for the testing of hypotheses in clinical cancer research, with insufficient data supporting their precision or characterizing their technical bias. Precision analysis using repeatability and reproducibility testing was conducted in one study [47], but reproducibility needs to be tested across different geographical sites and different equipment. Furthermore, none of the evaluated studies reported an analysis of technical bias using a phantom study, which describes the systematic difference between the measurements of a parameter and its real values [88]. For clinical validation, prospective testing of an imaging biomarker in clinical populations is required [89], and only three reports covered a prospective study, all in the field of neuro-oncology. After biological/clinical validation, the cost-effectiveness of radiomics needs to be studied to ensure that it provides good value for money compared with the other currently available biomarkers. From the current standpoint, the time when radiomics will achieve this end seems far away, and technical and clinical validation is still required.

Validation in TRIPOD means external validation, which was performed in only 18.2% of the reports covered in the present study, while the RQS accepts independent validation including internal validation, which was achieved in 70.1% of the studies. Several TRIPOD items were highly problematic in the reporting of radiomics studies. In the title, only two radiomics studies explicitly used the term “development” or “validation” together with the target population and outcome. Several elements should be present in the abstract, and at least one of these was missing in 92.2% of studies, which did not explicitly describe “development” or “validation” in the study objective or did not state whether the study design was a randomized controlled trial, a cohort study, or a case-control design. Furthermore, the sample size calculation and the handling of missing data were often poorly reported. These results are similar to the findings of a previous systematic review of TRIPOD adherence that examined publications using clinical multivariable prediction models [9]. In contrast to the findings on clinical multivariable prediction models, the radiomics studies were excellent (100%) on the criterion covering the definition of all predictors and the handling of quantitative features. However, the reporting of blinding to the outcome was insufficient (32.1%). The results for blinding were similar to those on adherence to the Standards for Reporting of Diagnostic Accuracy Studies (STARD) [90, 91], which showed that blinding to both predictors and outcome is still insufficiently reported in both clinical and imaging studies.

Subgroup analysis may provide more specific guidance for radiomics research. Studies in clinical journals showed significantly higher RQS scores, especially for test-retest analysis, multivariable analysis with non-radiomics features, finding biologic correlates, and pursuit of a prospective study design. Among the TRIPOD items, they more clearly defined the data source and study design, i.e., a consecutive retrospective design or case-control design, and the study setting, i.e., tertiary hospital or general population. In terms of validation, all three prospective studies were published in clinical journals, and both external and independent validation were performed more frequently in the clinical journal group. These findings imply that high-impact clinical journals pursue precision in research, clarification of the epidemiological background, and independent validation, and they demonstrate the room for improvement in radiomics studies.

Of note, radiologists played the main role, as either first or corresponding authors, in most radiomics studies. In 23.4% of articles, radiologists were not the main authors, but most studies were collaborations among radiologists, other clinicians, and physicists.

This study has some potential limitations. The first is the relatively small sample size, particularly the exclusion of clinical journals with an impact factor below 7.0; this restriction was imposed to permit an in-depth analysis of the radiomics applications. Second, radiomics is still a developing imaging biomarker, and the suggested RQS may be too “ideal”; the criteria of a phantom study and multiple image acquisitions may be unrealistic in clinical situations. Third, applying the TRIPOD items to radiomics studies can be rather strict. For example, most studies are case-control and retrospective in design, and a clear description of “case-control” is not commonly given in imaging journals. Nonetheless, clearly stating the participants and study setting is important for study transportability, and studies in imaging journals need to pursue this. Fourth, we considered internal validation with a random or split sample to be independent validation, while the TRIPOD statement only considers external validation to be validation of a pre-existing model. As the rates of open science and open data increase in the field of radiomics, the true validation of a model should become easier to perform.

In conclusion, the overall scientific quality and reporting of radiomics studies are insufficient, with the scientific quality showing the greatest deficiencies. Scientific improvements need to be made in feature reproducibility, analysis of clinical utility, and open science. Better reporting of study objectives, blind assessment, sample size, and missing data is also necessary. Our intention is to promote the quality of radiomics research studies as diagnostic and prognostic prediction models, and the above criteria and items need to be pursued for radiomics to become a viable tool for medical decision-making.