Introduction

Cardiac magnetic resonance imaging (CMR) comprehensively evaluates cardiac structure and function and characterizes myocardial tissue. It is considered the gold standard for assessing ventricular function and volume [1]. The ability of CMR to distinguish myocardial tissue characteristics facilitates assessing myocardial viability, narrowing the differential diagnosis, and predicting prognosis in several cardiomyopathies [2, 3]. In addition to qualitative interpretation of CMR images by visual inspection, the use of quantitative parameters, including parametric mapping techniques, is growing [4]. However, these parameters still show broad overlap between health and various diseases [5], necessitating methods that extract the maximum information from CMR images.

Radiomics is an emerging field of research that converts digital medical images into mineable data to extract quantitative features, expanding the utility of images into decision-making and personalized healthcare management [6]. Radiomics studies have traditionally been conducted mainly in oncology [7] and much less extensively in cardiology. However, the application of radiomics to cardiac imaging has been gradually increasing, especially with CMR [8, 9].

Early echocardiography studies applied quantitative texture analysis for differential diagnosis [10, 11], but experience with radiomics in echocardiography remains limited owing to low reproducibility. In cardiac CT, several studies have reported promising results of radiomics for the characterization of coronary plaque [12, 13], perivascular adipose tissue [14, 15], myocardium [16, 17], and cardiac masses [18, 19]. In CMR, several studies have demonstrated the feasibility and potential clinical utility of radiomics analysis for diagnostic or prognostic purposes [8]. Because radiomics analysis provides additional objective data from existing images, it has the potential to be added to the routine clinical workflow [9].

Despite promising results, CMR radiomics is rarely applied in clinical practice. An important reason for this gap is that the methodology has not been standardized and the reproducibility and robustness of the features have not been sufficiently verified [20]. Therefore, generating radiomics data with high-quality science and reporting practices is essential for its clinical application [21].

The radiomics quality score (RQS) was developed to assess the methodology and analysis of radiomics studies [22] and has been applied to oncology and dementia studies [23,24,25,26,27,28,29,30]. The Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines are used to evaluate studies that apply diagnostic or prognostic prediction models [31]. Transparent reporting reduces the risk of bias and enhances the clinical utility of a prediction model. Standards defined by the Image Biomarker Standardization Initiative (IBSI) provide uniformity for image biomarker nomenclature and definitions, image processing workflows, tools for verifying radiomics software implementations, and the reporting of radiomics studies [32].

To the best of our knowledge, no previous study has comprehensively evaluated CMR radiomics studies using these three sets of guidelines. Therefore, the purpose of this study was to assess the quality of radiomics studies using CMR with RQS, TRIPOD guidelines, and IBSI standards to identify areas in need of improvement.

Materials and methods

Article search strategy

Two cardiac radiologists with 6 and 9 years of experience, respectively, designed the search strategy. Each radiologist independently performed systematic searches of PubMed and EMBASE on March 10, 2021, to identify relevant original research articles reporting CMR radiomics studies. The search terms are listed in Supplementary Materials.

Study selection

Two radiologists independently reviewed the search results. Articles were initially screened by title and abstract against the inclusion criteria. A study was selected for inclusion if it analyzed radiomics using CMR images, involved human participants, collected data in vivo, was written in English, and had full text available. Full-text articles were then reviewed for eligibility. A study was excluded if it was a test-retest study that did not investigate diagnostic or prognostic utility or if it did not evaluate diagnostic or prognostic performance (Fig. 1).

Fig. 1 Flow diagram of the study selection process

Data extraction

Two radiologists independently extracted data, and any disagreements were resolved by consensus. The extracted parameters included (a) article information: authors, year of publication, journal type (clinical, imaging, or computer science journal), disease, study topic, intended use of the radiomics features (diagnostic or prognostic), and the sequence(s) used for feature extraction; and (b) participant characteristics: number of participants and primary diagnosis.

Method quality based on RQS

The RQS consists of 16 items divided into six domains [22]. The details of the RQS are presented in the Supplementary Materials. Two radiologists independently assessed the RQS of each article. If the two reviewers disagreed, a final decision was reached by consensus through discussion. The items that required such discussion to reach consensus included feature reduction, discrimination statistics, non-radiomics features, biological correlates, and comparison to the “gold standard”.

Reporting completeness based on TRIPOD guidelines

The TRIPOD checklist consists of 22 main criteria assessed by a total of 37 items. Items 21 and 22 were excluded from this study because they refer to supplementary and funding information. Therefore, each article was evaluated against 35 items to assess reporting completeness [31]. Furthermore, the type of radiomics model in each article was categorized as development only (type 1a), development and validation using resampling (type 1b), random split-sample validation (type 2a), non-random split-sample validation (type 2b), validation using separate data (type 3), or validation only (type 4). The details of the TRIPOD checklist and the data extraction method are presented in the Supplementary Materials.

Image preprocessing quality and radiomics feature extraction based on IBSI standards

Two radiologists evaluated whether the following preprocessing and processing items were described in the Methods of each article, based on the IBSI guidelines (https://ibsi.readthedocs.io/en/latest/): non-uniformity correction, image interpolation, grey-level discretization, signal intensity normalization, radiomics feature extraction software, and segmentation method.
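Two of these items, signal intensity normalization and grey-level discretization, are purely numerical operations on the segmented region. The following minimal Python sketch illustrates one common choice for each (z-score normalization and fixed-bin-number discretization); it is a generic illustration under these assumptions, not the implementation used by any of the included studies.

import numpy as np

def zscore_normalize(roi):
    # Signal intensity normalization: rescale ROI intensities to zero mean and unit variance
    return (roi - roi.mean()) / roi.std()

def discretize_fixed_bins(roi, n_bins=32):
    # Grey-level discretization: map ROI intensities to integer grey levels 1..n_bins
    lo, hi = roi.min(), roi.max()
    levels = np.floor((roi - lo) / (hi - lo) * n_bins) + 1
    return np.clip(levels, 1, n_bins).astype(int)

# Toy myocardial ROI (hypothetical intensity values) run through both steps
rng = np.random.default_rng(0)
roi = rng.normal(400.0, 50.0, size=(16, 16))
grey_levels = discretize_fixed_bins(zscore_normalize(roi))
print(grey_levels.min(), grey_levels.max())  # 1 32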

Statistical analysis

The RQS was calculated for each article (score range, −8 to 36), and the RQS across all articles was expressed as the mean ± standard deviation and the median with interquartile range. For the six domains of the RQS (protocol quality and segmentation, feature selection and validation, biologic/clinical validation and utility, model performance index, high level of evidence, and open science and data), basic adherence was assigned to an item when a score of at least 1 point was obtained without minus points. Basic adherence to the RQS criteria (range, 0–16) and to each TRIPOD item (range, 0–35) was counted and calculated as a proportion (%). TRIPOD item 5c (an “if done” item) and validation items 10c, 10e, 12, 13c, 17, and 19a were excluded from the numerator and denominator when calculating overall adherence. The six IBSI standards were scored, and adherence to each standard was calculated as a proportion (%). Subgroup analyses were performed by categorizing articles by year of publication (before January 1, 2019 [n = 12], or on or after January 1, 2019 [n = 20]) and by journal impact factor (4.0 or higher [n = 16] or lower than 4.0 [n = 16]) according to the 2020 Journal Citation Reports; the cutoff of 4.0 was the median impact factor of the included articles. The non-parametric Mann-Whitney U test was used to compare RQS scores between subgroups. Fisher’s exact test was used to compare proportions of adherence to RQS items, TRIPOD items, and IBSI standards, given the small sample sizes. All statistical analyses were performed using SPSS version 24.0 (IBM), and a p-value < 0.05 was considered statistically significant.
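As an illustration of the subgroup comparisons described above, the following Python sketch uses SciPy in place of SPSS; the per-study RQS values and the 2 × 2 adherence table are hypothetical placeholders, not the actual study data.

from scipy.stats import mannwhitneyu, fisher_exact

# Hypothetical total RQS per study, split by publication date
rqs_before_2019 = [1, -2, 4, 6, 3, 0, 5, 2, 7, 1, 3, 4]                                # n = 12
rqs_since_2019 = [8, 5, 12, 3, 6, 9, -1, 7, 10, 4, 6, 2, 11, 5, 8, 3, 7, 9, 6, 4]      # n = 20

u_stat, p_value = mannwhitneyu(rqs_before_2019, rqs_since_2019, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")

# Adherence to a single RQS/TRIPOD/IBSI item in each subgroup, as a 2 x 2 table of
# [adherent, non-adherent] counts (hypothetical), compared with Fisher's exact test
table = [[4, 8],     # before 2019: 4 of 12 studies adherent
         [14, 6]]    # since 2019: 14 of 20 studies adherent
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher's exact test: OR = {odds_ratio:.2f}, p = {p_value:.3f}")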

Results

Characteristics of radiomics studies using CMR

The characteristics of the 32 included radiomics studies [33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64] are summarized in Table 1 and Fig. 2. The journal, impact factor, primary diagnosis, study topic, biomarker, number of participants, and model type are summarized in Supplementary Table 1. The mean number of patients per study was 244.5 (standard deviation, 850.9; range, 23–4891).

Table 1 Characteristics of the included 32 cardiac magnetic resonance imaging radiomics articles with diagnostic or prognostic utility
Fig. 2 Summary charts of the 32 included radiomics studies displayed according to disease, study topic, sequence used, and year of publication

The proportions of studies by journal type were as follows: imaging journals (59.4%), clinical journals (28.1%), and computer science journals (12.5%). Radiomics analyses were conducted for diagnostic purposes (50%) and to identify prognostic biomarkers (50%). The two most frequent primary diagnoses were non-ischemic cardiomyopathy (50%) and ischemic cardiomyopathy (31.3%). The study purposes included differential diagnosis (46.9%), adverse event prediction (34.4%), late gadolinium enhancement (LGE) prediction (9.4%), functional recovery prediction (6.3%), and genomic classification (3.1%). Mapping was the most studied sequence (40.6%), followed by LGE (31.3%), cine (18.8%), T1-weighted imaging (3.1%), T2-weighted imaging (3.1%), and two or more sequences (6.3%). Most studies performed manual segmentation (90.6%), whereas three (9.4%) performed automatic segmentation. Studies were more frequently conducted at 1.5T (62.5%) than at 3T (25%). Three studies (9.4%) included both 1.5T and 3T scans, and one study did not report the magnetic field strength. Three studies were published in 2010–2015, nine in 2016–2018, and twenty in 2019–2021. Four of the 32 studies were published in journals ranked in the top 10% of their field by impact factor [44, 46, 50, 52].

RQS according to the six key domains

The RQS evaluation is summarized in Table 2. The RQS of the 32 studies combined, expressed as a percentage of the ideal score according to the six key domains, is shown in Fig. 3. The mean overall RQS is 5.16 ± 5.85 (range, −5 to 21), or 14.3 ± 16.3% of the ideal score of 36. Key domain 2 (feature selection and validation) has the lowest percentage of the ideal score (−21.5%), followed by domain 6 (open science and data, 3.1%) and domain 5 (high level of evidence, 19.1%). Domain 4 has the highest percentage of the ideal score (34.9%).

Table 2 Radiomics quality score according to the six key domains
Fig. 3 Radiomics quality scores (RQS) of the 32 studies combined and expressed as a percentage of the ideal score according to the six key domains

Basic adherence to the RQS according to the six key domains

The basic adherence to the 16 RQS criteria is documented in Table 2. The overall basic adherence to the RQS is 35.5%.

Thirty studies (93.8%) followed the domain 1 criterion of having a well-documented image protocol, and one study used a public protocol [60]. Only one study used a test-retest approach at different times to evaluate robustness to temporal variability [43], and no study performed a phantom study. Multiple segmentations for the same region were performed in 19 studies (59.4%) [33,34,35,36,37,38,39, 41, 42, 44,45,46, 48,49,50,51,52, 59, 63].

Twenty-three studies (71.9%) met the domain 2 criterion of feature reduction or adjustment for multiple testing [33,34,35,36,37,38,39, 41, 42, 44, 46,47,48,49,50,51,52, 54, 55, 59,60,61, 63]. Nine studies (28.1%) performed validation using data from the same institution [35, 37, 39, 41, 44, 47, 56, 57, 64]. No study performed external validation.

Twelve studies (37.5%) met the domain 3 criterion of cutoff analysis. Twenty-nine studies (90.6%) applied discrimination statistics using receiver operating characteristic curves and/or the area under the curve, and only one study (3.1%) adopted calibration statistics without a resampling method [35].

Twenty-two studies (68.8%) met the domain 4 criterion of discussing the biological correlates of the radiomics features, and eighteen studies (56.3%) compared the radiomics features to gold standard methods. Seven studies (21.9%) performed multivariable analysis with non-radiomics features [36, 40, 45, 48, 49, 58, 63], and one study (3.1%) assessed potential clinical utility [35].

Seven articles (21.9%) met the domain 5 criterion of a prospective study design [36, 41, 45, 46, 52, 57, 63]. No study conducted a cost-effectiveness analysis. Three studies (9.4%) met the domain 6 criterion of using open-access scans or open-source code [38, 41, 60].

Reporting completeness of radiomics-based multivariable prediction models using TRIPOD

The mean number of TRIPOD items reported is 16.5 ± 4.0 (standard deviation; range, 9–25) out of the 35 items considered. The adherence rate for TRIPOD is 55.9 ± 12.2% (standard deviation; range, 32.1–78.6%) when “if relevant” and “if done” items are excluded from the numerator and denominator. The reporting completeness of individual TRIPOD items is shown in Table 3.

Table 3 Number of the radiomics studies that adhered to the individual items of the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines

Quality of image preprocessing and radiomics feature extraction according to IBSI

Most studies did not report the preprocessing steps in detail (Table 4). Two studies (6.3%) reported performing non-uniformity correction, four (12.5%) image interpolation, six (18.8%) grey-level discretization, and twelve (37.5%) signal intensity normalization. The software packages used for radiomics feature extraction were MaZda (43.8%), MATLAB (34.4%), TexRAD (12.5%), PyRadiomics (3.1%), and AVIEW (3.1%); one study did not mention the software used.

Table 4 Quality of image processing and radiomics feature extraction according to IBSI
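For studies that use an IBSI-compliant package, the preprocessing choices discussed above can be declared explicitly in the extraction configuration, which makes them straightforward to report. The sketch below assumes PyRadiomics with placeholder NIfTI file names and illustrative settings; it does not reproduce the pipeline of any included study.

from radiomics import featureextractor

settings = {
    "normalize": True,                          # signal intensity normalization
    "resampledPixelSpacing": [1.0, 1.0, 8.0],   # image interpolation to a common voxel grid (mm)
    "binWidth": 25,                             # fixed-bin-width grey-level discretization
}
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)

# Placeholder paths: a CMR image and its myocardial segmentation mask
features = extractor.execute("cmr_image.nii.gz", "myocardium_mask.nii.gz")
print(len(features), "feature and diagnostic entries extracted")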

Subgroup analysis

The results of the subgroup analysis are shown in Table 5. The total RQS is higher for the twenty studies published since 2019 than for the twelve studies published before 2019, but the difference is not statistically significant (mean 6.4 ± 6.4 vs. 3.08 ± 4.25; p = 0.114). The studies published since 2019 score higher for domain 1 (p = 0.02) and domain 4 (p = 0.005) items. The most significant difference is observed for the “comparison to gold standard” parameter (p = 0.007). The more recently published studies also have a higher adherence rate for the TRIPOD criterion of “describing eligibility criteria” (p = 0.049). No significant differences are observed between the two groups for the IBSI standards.

Table 5 Subgroup analysis of RQS and TRIPOD items in CMR radiomics studies according to the publication dates and journal impact factor

The total RQS is higher for studies published in journals with a higher (≥ 4.0) impact factor than for those published in journals with a lower (< 4.0) impact factor (mean 7.56 ± 4.29 vs. 2.75 ± 6.32; p = 0.007). The studies published in higher impact factor journals score higher for domain 5 (p = 0.035). They also have higher adherence rates for the TRIPOD criteria of “key dates” (p = 0.043), “modeling” (p = 0.037), and “unadjusted association” (p = 0.033). No statistically significant differences are observed between the two groups for the IBSI standards.

Discussion

Our systematic review evaluated the quality of science in CMR radiomics studies using the RQS, TRIPOD guidelines, and IBSI standards. The included studies scored poorly in scientific quality and reporting completeness, achieving 14.3% of the ideal RQS and 55.9% adherence to the TRIPOD guidelines. No study performed a phantom study or cost-effectiveness analysis, and approximately half of the TRIPOD items were not adequately reported. Most studies did not report preprocessing steps according to the IBSI standards. These results imply that CMR radiomics studies need significant improvement in scientific quality.

The number of radiomics studies performed using CMR is relatively small. However, the studies published to date demonstrate promising results for differential diagnosis and prognosis. Because radiomic features extracted from CMR images can be affected by the equipment, protocol, or personnel, it is vital to establish reproducibility and reduce variability for clinical application [9]. Most CMR radiomics studies focused on the left ventricle for myocardial tissue characterization in non-ischemic and ischemic cardiomyopathies, and mapping was the most frequently used sequence for feature extraction.

The overall mean RQS for the CMR studies was 5.16 ± 5.85 (14.3 ± 16.3%), which is lower than the RQS reported for some oncology studies (mean score 9.4–11) [23, 24] but similar to that reported in other radiomics studies on various tumors and dementia (mean score 3.6–6.9) [25,26,27,28,29,30]. The basic adherence rate was low for domain 5 (10.9%) and domain 6 (9.4%); these rates are similar to those of oncology studies [23,24,25, 27], suggesting that there are limitations to pursuing higher levels of evidence and open science. No phantom study or cost-effectiveness analysis was performed in the included CMR studies, which is consistent with non-cardiac studies [23, 24, 26, 27] and indicates low-level evidence. The basic adherence rate for calibration statistics (3.1%) was lower than that of oncology studies (range 9.9–29.9%) [23,24,25, 27], suggesting that improvement is needed. The low basic adherence rates for test-retest analysis (3.1%) and potential clinical utility (3.1%) were consistent with other studies [24, 26]. The basic adherence rate for validation (28.1%) was lower than in studies on tumors or dementia (range 46.2–70.1%) [23, 24, 26, 27] but similar to that of one sarcoma study (26.9%) [25]. The nine CMR studies that performed validation used datasets from the same institution, and no study performed external validation, resulting in low RQS scores. Because radiomic features extracted from CMR images can be strongly affected by the acquisition protocol, standardization, compensation for multicenter effects, and external validation are crucial for the clinical application of radiomics.

Our results are in line with a recently published study that assessed the RQS of cardiac MRI and CT radiomics studies [65]. The median total RQS in that study was 7, which was higher than in our study (mean 5.16, median 4.5). The difference could be due to the different ranges of included articles and to interobserver variability in RQS assessment. For example, our study included fewer articles with internal validation (28.1% vs. 45%) and more articles with a prospective design (21.9% vs. 4%). The greatest difference was found in the potential clinical utility item (3.1% vs. 94%). In our study, an article earned points only if clinical utility was objectively assessed, for example with a decision curve analysis demonstrating net improvement; discussing the possible utility of radiomics without an adequate analysis did not earn points. Despite these differences, both studies revealed that improvement is needed in the validation, calibration, cost-effectiveness, and open science items.

The overall adherence to the TRIPOD guidelines was 55.9%, consistent with previous radiomics studies [23, 27]. Specifically, adherence to the TRIPOD guidelines was low for reporting the title (3.1%), missing data (0%), discrimination/calibration (3.1%), and how to use the prediction model (3.1%). Most articles did not explicitly describe development or validation in the title (item 1), abstract (item 2), or introduction (item 3b). Only one study [35] conducted calibration statistics (item 10d) and constructed a nomogram for clinical use (item 15b), resulting in low adherence rates for these items. Improving adherence to the TRIPOD guidelines in both research conduct and reporting could improve the quality of radiomics models.

The image preprocessing steps performed before extracting radiomics features affect the repeatability and reproducibility of the results [66, 67]. However, most CMR radiomics studies failed to describe the preprocessing steps covered by the IBSI standards, namely non-uniformity correction, image interpolation, grey-level discretization, and signal intensity normalization (reported in 6.3%, 12.5%, 18.8%, and 37.5%, respectively), and these adherence rates are lower than those of radiomics studies in cancer patients [25, 27]. More concerted efforts to conduct and report image preprocessing according to the IBSI standards could improve the reproducibility of CMR radiomics studies.

The subgroup analysis revealed that the total RQS was higher in studies published since 2019 than in studies published before 2019; however, the difference did not reach statistical significance. The more recent studies were conducted with better protocol quality and stability (domain 1) and biologic/clinical validation and utility (domain 4). The adherence rates for most TRIPOD items also tended to be higher in the more recent studies. However, even the more recent studies showed inadequate adherence to the IBSI standards.

When grouped by impact factor, studies published in journals with a higher (≥ 4.0) impact factor had higher RQS than studies published in journals with a lower (< 4.0) impact factor. The studies published in higher impact factor journals had a higher level of evidence (domain 5) because more of them were prospective. Among the TRIPOD items, they more clearly reported the key study dates, the type of model, and the unadjusted associations between candidate predictors and outcome. However, no significant differences were observed for the IBSI standards between articles grouped by journal impact factor.

Our results suggest that the RQS, TRIPOD guidelines, and IBSI standards were not adequately followed even in recently published studies or articles published in more highly ranked journals. We conclude that additional effort is needed to improve the scientific quality and reporting completeness in CMR radiomics studies.

There are several limitations to our study. First, the number of included CMR radiomics studies was small. Second, some RQS and TRIPOD items may be too idealistic for most studies to meet. For example, it is not easy to perform multiple CMR scans in clinical practice, and it is challenging to meet all the TRIPOD guidelines because most CMR radiomics studies are retrospective. Nevertheless, it is necessary to be as transparent as possible about protocol parameters and data extraction methods. Lastly, because the scores were assigned by consensus, we did not assess inter-reader reproducibility.

In conclusion, the overall scientific quality and reporting completeness of CMR radiomics studies are inadequate. Improvements are needed in the areas of validation, calibration, clinical utility, and open science. Complete reporting of study objectives, missing data, discrimination/calibration, and how to use the prediction model is necessary, and image preprocessing steps must be reported in detail.