Introduction

Osteosarcoma is the most common primary malignant osseous sarcoma with most cases developing in children and adolescents [1]. Radiologic examinations are useful tools in osteosarcoma diagnosis [2,3,4,5]. Osteosarcoma is often detected on plain radiograph with a contrast-enhanced MRI scan as the next step in the diagnostic work-up; a chest CT scan is essential for lung metastases detection; PET examination or a bone scan is generally recommended for initial staging in osteosarcoma patients [1].

Chemotherapy is considered essential for the treatment of high-grade osteosarcoma [5]. Surgery of the primary tumor after chemotherapy is the conventional approach [6], and the histologic response to neoadjuvant chemotherapy, evaluated from tumor necrosis in the excision specimen [7], is crucial for treatment strategy and is related to patient prognosis [8]. Although aggressive treatment plans improve the prognosis of patients who are likely to exhibit poor survival, not all patients benefit from these approaches. In clinical settings, expert radiologists may provide informative reports that help clinicians decide on a treatment strategy [2,3,4], and if patients could be stratified by radiologic examinations, a personalized medicine strategy might be realized [9]. However, image interpretation relies largely on individual radiologists; therefore, reports vary because of subjective factors that are difficult to control.

Radiomics, a set of methods for extracting quantitative, minable, high-dimensional data from medical images, is capable of generating imaging biomarkers that may not be visible to the naked eye [9,10,11,12]. Quantitative, reader-independent imaging biomarkers could support clinical decisions and increase diagnostic, predictive, and prognostic accuracy [13]. In recent years, extensive research using radiomic methods and even artificial intelligence has succeeded in linking radiologic images to lesion characterization, treatment response, and patient outcome; nonetheless, translation into clinical practice has not yet been realized [14]. For radiomics to cross the translational gap between an exploratory research method and a valued addition to precision medicine workflows, challenges including technical and biological validity, regulatory and ethical problems, and cost-effectiveness still need to be overcome. In this process, the radiomics quality score (RQS) may be employed as a useful tool (Fig. 1) [9, 15, 16].

Fig. 1 The radiomics research workflow and the role of RQS. A typical radiomics workflow includes image acquisition and post-processing; manual, semi-automatic, or automatic segmentation; model definition using classical machine learning algorithms or deep learning methods; external and prospective validation; and finally, clinical application. RQS is a useful tool for assessing the methodological quality of this workflow and for revealing challenges and insufficiencies in radiomics studies, such as lack of prospective design, absence of external validation, and unwillingness to share data. On the other hand, modification of RQS is deemed necessary, either according to other predictive model reporting checklists or in response to actual practical needs

Furthermore, no previous study has undertaken a systematic review of radiomics in osteosarcoma. The factors affecting the performance of radiomics in osteosarcoma should be identified to further improve its clinical translation. Thus, the aim of our study was to establish whether the methodological quality of published studies on radiomics in osteosarcoma, conducted for multiple purposes, poses barriers to effective clinical application. A meta-analysis of the utility of radiomics in predicting the response of osteosarcoma to neoadjuvant chemotherapy was performed to evaluate the ability of the proposed models to answer this clinically relevant question.

Materials and methods

Protocol and registry

This systematic review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [17]. A review protocol was drafted [18] and has been submitted to the International Prospective Register of Systematic Reviews (PROSPERO).

Literature search and study selection

A structured search of PubMed, Embase, and Web of Science covering records up to 30 Apr 2020 was performed independently by two reviewers, each with 2 years of experience in radiology. Disagreements were resolved by discussion or with the help of a third reviewer with 4 years of experience. This review included primary research assessing the role of radiomics in patients with osteosarcoma for diagnostic, prognostic, or predictive purposes. The two reviewers selected potential studies by screening titles and abstracts. Articles that met the inclusion criteria were obtained in full, and the full text was assessed for eligibility by the same two reviewers. In the case of uncertainty, a third reviewer was consulted to reach a final consensus. The reference lists of included studies were screened for additional, potentially eligible articles. Detailed search strategies and selection criteria can be found in the supplementary materials.

Data extraction and quality assessment

The eligible articles were assessed with the RQS for methodological quality [9] and with the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool for risk of bias and applicability concerns [19]. The RQS is a recently accepted tool for measuring the methodological rigor of the radiomics workflow [20]. The RQS checklist is described in Table S1 [9]. The assessment covers 16 components; the summed score ranges from −8 to 36 points, with scores of −8 to 0 defined as 0% and the maximum of 36 points defined as 100%. The QUADAS-2 tool was employed to assess bias in patient selection, index test, reference standard, and flow and timing. The tool was tailored to our research question through signaling questions for risks of bias specific to the current study [19].
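The conversion from a summed RQS to a percentage can be illustrated with a minimal Python sketch. It assumes only what is stated above (totals of 0 or below count as 0%, and 36 points count as 100%); the function name and the example totals are hypothetical and are not taken from the included studies.

```python
RQS_MAX_POINTS = 36  # maximum summed score stated in the text


def rqs_percentage(total_score: float) -> float:
    """Convert a summed RQS to a percentage of the 36-point maximum.

    Totals of 0 or below (the scale reaches down to -8) are mapped to 0%.
    """
    if total_score > RQS_MAX_POINTS:
        raise ValueError("summed RQS cannot exceed 36 points")
    return max(total_score, 0) / RQS_MAX_POINTS * 100


# Hypothetical summed scores, for illustration only
for score in (-5, 5, 16):
    print(score, round(rqs_percentage(score), 1))
```

Because negative totals are floored at 0% before averaging, the mean percentage across studies need not equal the mean raw score divided by 36.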

We developed a data collection instrument for study data, RQS, and QUADAS-2 scores based on previous articles [9, 19, 21]. To test and adjust the tool, two reviewers independently extracted study data into the instrument from two randomly chosen articles that fully met the inclusion criteria. Disagreements were discussed with the third reviewer to achieve a shared understanding of each parameter. The data collection instrument is described in Table S2. The same two reviewers then rated each study independently and recorded the data for further analysis.

Data synthesis and analysis

Statistical analysis was performed with SPSS and with R using the raters package, while the meta-analysis was performed with Stata using the metan, midas, and metandi packages [20,21,22]. The summed RQS rating per study was calculated, and the average rating of all raters is reported. Inter-rater agreement for single items of the RQS was calculated with the modified Fleiss kappa statistic, while the intraclass correlation coefficient (ICC) was determined to describe inter-rater agreement for the summed RQS [20, 21].
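To make the item-level agreement statistic more concrete, the following Python sketch computes the classical Fleiss kappa from a table of rating counts. This is an illustration only: the study used the modified Fleiss kappa from the R raters package, which is not reproduced here, and the function name and input layout are assumptions.

```python
def fleiss_kappa(counts):
    """Classical Fleiss kappa.

    counts[i][j] is the number of raters who assigned study i to
    category j of one RQS item; every study must have the same
    number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each study needs the same number (>= 2) of ratings")

    # Observed agreement per study, then averaged over studies
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Agreement expected by chance from the overall category proportions
    n_categories = len(counts[0])
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)

    if p_e == 1:
        # All ratings fall in one category: the classical kappa is undefined
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)
```

The undefined case in the last branch, where every rating falls into the same category, is one situation in which the classical statistic is uninformative despite complete agreement.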

In the current systematic review, the prediction of the response of osteosarcoma to neoadjuvant chemotherapy was addressed repeatedly; therefore, these studies were included in the meta-analysis. Two-by-two tables were extracted, if documented, or reconstructed based on published data. Sensitivity, specificity, positive and negative likelihood ratios, and the diagnostic odds ratio (DOR), with their 95% confidence intervals (95% CIs), were calculated as effect sizes. A hierarchical summary receiver operating characteristic (HSROC) curve was plotted to show the diagnostic accuracy.
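The per-study effect sizes follow directly from a two-by-two table, as the Python sketch below illustrates. It is not the pooling procedure used in the study (that was done in Stata); the function name, the 0.5 continuity correction for zero cells, and the Wald-type interval for the DOR are assumptions made for illustration.

```python
import math


def diagnostic_metrics(tp, fn, fp, tn, z=1.96):
    """Accuracy measures for one 2x2 table.

    tp / fn: good responders classified correctly / incorrectly.
    fp / tn: poor responders classified incorrectly / correctly.
    """
    cells = [tp, fn, fp, tn]
    if any(c < 0 for c in cells):
        raise ValueError("cell counts must be non-negative")
    # Assumption: add 0.5 to every cell when any cell is zero,
    # so that the ratios below stay finite.
    if any(c == 0 for c in cells):
        tp, fn, fp, tn = (c + 0.5 for c in cells)

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    dor = (tp * tn) / (fp * fn)

    # Wald-type 95% CI computed on the log-DOR scale
    se_log_dor = math.sqrt(1 / tp + 1 / fn + 1 / fp + 1 / tn)
    dor_ci = (
        math.exp(math.log(dor) - z * se_log_dor),
        math.exp(math.log(dor) + z * se_log_dor),
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": dor,
        "dor_ci": dor_ci,
    }
```

Note that the DOR equals the positive likelihood ratio divided by the negative likelihood ratio, which offers a quick consistency check on reported values.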

For heterogeneity assessment, Cochran’s Q test and the Higgins inconsistency index (I²) were used to estimate the heterogeneity among the studies included in the meta-analysis. The HSROC curve was drawn to visually assess the difference between the 95% confidence region and the 95% prediction region. A funnel plot and a Deeks funnel plot were constructed to visually assess the risk of publication bias, and Egger’s test and the Deeks funnel plot asymmetry test were performed. The trim and fill method was used to estimate the number of missing studies. Detailed statistical methods are described in the supplementary materials.
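The relation between Cochran's Q and the inconsistency index can be sketched in a few lines of Python. The function name is hypothetical, and the degrees of freedom must be supplied by the caller, since they depend on the model that produced Q and are not stated in the text.

```python
def i_squared(q: float, df: int) -> float:
    """Higgins inconsistency index, in percent, from Cochran's Q.

    I^2 = max(0, (Q - df) / Q) * 100, where df is the number of
    degrees of freedom of the Q statistic.
    """
    if q < 0 or df < 0:
        raise ValueError("Q and df must be non-negative")
    if q == 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


# Illustrative values only
print(round(i_squared(10.0, 4), 1))
print(round(i_squared(3.0, 4), 1))
```

When Q does not exceed its degrees of freedom, the index is truncated at 0%, meaning that the observed variation is no larger than expected by chance.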

Results

Literature search

The search strategy yielded 30 studies from PubMed, 21 from Embase, and 24 from Web of Science. After exclusion of 32 duplicates, the titles and abstracts of 43 unique records were screened. Among these, fourteen were selected for possible inclusion and their full texts were retrieved. Review of the full texts resulted in the final inclusion of twelve articles in the systematic review [23,24,25,26,27,28,29,30,31,32,33,34]. No additional study was identified by hand search of their reference lists. Five studies [25, 26, 28, 29, 31] that investigated the response to neoadjuvant chemotherapy were included in the meta-analysis (Fig. 2).

Fig. 2 Flow diagram of the study selection process for this systematic review and meta-analysis

Study characteristics

Tables S3 and S4 summarize the aims and characteristics of the included studies. Five studies investigated treatment response prediction, three investigated survival prediction, and two attempted to answer both clinical questions with radiomics methods, while the remaining two studies explored stratification of metastatic risk and differentiation of benign and malignant pulmonary nodules in osteosarcoma patients, respectively. In terms of imaging modality, seven studies used metabolic imaging methods, including PET and advanced MRI sequences. Among the studies using MRI, one used conventional MRI sequences and contrast-enhanced T1-weighted imaging, and three used advanced MRI sequences (two DWI and one IVIM).

Quality assessment

The twelve studies reached a mean ± standard deviation RQS of 6.92 ± 6.00, median 5, and range − 5 to 16. The average percentage RQS was 20.4% with a maximum of 44.4%. Average RQS rating per component and inter-rater agreement are presented in Table 1.

Table 1 Average rating and inter-rater agreement per component of RQS

Most studies reported well-documented image acquisition protocols; however, eleven studies relied on retrospectively acquired data, and only one included a plan for radiomics analysis in its prospective study protocol. Ten studies acquired images using the same equipment, while two studies included images from three CT scanners. However, none of them performed a phantom study. Multiple segmentation was conducted in seven studies: in six, segmentation was performed by two or more readers, and in the remaining one, the tumor was identified using a region-growing algorithm and then confirmed by a physician. Three studies conducted imaging at multiple time points and extracted radiomics features at each time point.

The twelve studies in this review included a total of 964 patients. They extracted between 10 and 474 features from 16 to 191 patients; one study investigated 42 pulmonary nodules from sixteen patients. The ratio between features and patients ranged from 3.1 times more patients than features to 3.2 times more features than patients. Feature reduction and adjustment were performed in ten studies, eight of which addressed multiple testing. Five studies combined radiomics with clinical information or human objective assessment of the images. Radiomics signatures were validated on internal datasets in five studies; none employed external datasets. For model assessment, discrimination statistics were usually provided, calibration statistics were reported less often, and none of the studies performed cutoff analysis.

Concerning biological validation and clinical utility, most studies compared their model with the gold standard. The correlation between tumor biology and radiomics features was discussed in three studies to provide a more holistic model. Only two studies used decision curve analysis to evaluate whether the model was sufficiently robust for clinical practice, and none performed cost-effectiveness analysis. Notably, only one study made its data partially available to the public.

Risk of bias and applicability concerns were assessed by QUADAS-2 and are summarized in Fig. 3. Most included studies were regarded as having a moderate risk of bias. Risk of bias and applicability concerns relating to the index test were frequently observed. Some studies did not provide enough observations per predictor variable to produce reasonably stable estimates. Feature reduction and adjustment processes were not described in sufficient detail to allow replication.

Fig. 3 Quality assessment of included studies by QUADAS-2 tool. The authors’ judgments for each domain of each included study were reviewed. The proportion of included studies that indicated low, unclear, or high risk and applicability concerns is shown in green, yellow, and red, respectively

The reproducibility of the RQS and QUADAS-2 was calculated. The ICC for the RQS was 0.95 (95% CI 0.85–0.99). Moderate agreement was achieved in evaluating image protocol, discrimination statistics, and gold standard; substantial or almost perfect agreement was reached for the remaining elements of the RQS. Absolute agreement of the seven indicator questions of the QUADAS-2 ranged from 66.7 to 91.7%. RQS score and QUADAS-2 assessment per study per element are presented in Tables S5 and S6.

Prediction of response to chemotherapy

Since only one of the five included studies had a validation dataset, the meta-analysis was performed only on the training datasets, with a total sample size of 328 patients. The individual studies showed high DORs for predicting response to neoadjuvant chemotherapy, ranging from 25.46 to 470.59, and the pooled DOR was 43.68 (95% CI 13.50–141.31; Fig. 4). Furthermore, the pooled sensitivity and specificity were 86% (95% CI 65–95%) and 88% (95% CI 79–94%), respectively (Fig. S1). The pooled positive likelihood ratio and negative likelihood ratio were 7.16 (95% CI 3.96–12.94) and 0.16 (95% CI 0.06–0.43), respectively (Fig. S1). The AUC was 0.91 (95% CI 0.89–0.94), which indicates a high diagnostic performance (Fig. S2).

Fig. 4 Forest plot of the effect size calculated as diagnostic odds ratio for studies investigating the diagnostic accuracy of radiomics in neoadjuvant chemotherapy response prediction in osteosarcoma patients. The numbers are pooled estimates with 95% CIs in parentheses; horizontal lines indicate 95% CI. TP number of good responders correctly diagnosed, FN number of good responders diagnosed as poor, FP number of poor responders diagnosed as good, TN number of poor responders correctly diagnosed

Cochran’s Q test implied that heterogeneity was present across the studies (Q = 10.137, p = 0.003), and the Higgins I² statistic also demonstrated that heterogeneity was high (I² = 80%). The difference between the 95% confidence region and the 95% prediction region was large, indicating a high possibility of heterogeneity across the studies (Fig. S2). However, the funnel plot with Egger’s test (p = 0.277) and the Deeks funnel plot (p = 0.79) revealed that the likelihood of publication bias was low (Figs. S3 and S4). Trim and fill analysis estimated that no study was missing (Fig. S5).

Discussion

The current review using RQS found that the overall scientific quality of radiomics studies in osteosarcoma is insufficient, with an average RQS rating of 20.4% and a rating of 44.4% for the best performing study. Although the meta-analysis showed that radiomics had an excellent diagnostic performance (AUC 0.91, 95% CI 0.89–0.94) in predicting patients’ response to neoadjuvant chemotherapy, radiomics is far from being a clinically applicable tool because of the poor methodological quality of the underlying studies.

The mean RQS rating was acceptable compared with previous reviews (20.4% vs 10.8 to 36.1%), as was the rating of the best performing study (44.4% vs 41.2 to 55.6%; Table S7) [20, 21, 35,36,37,38,39,40,41,42]. However, the mean rating was lower than that of a study using a modified RQS checklist, which included patient selection-related criteria from QUADAS [43], and higher than that of a recent review including studies without a systematic approach [44]. In our review, the main reasons for low RQS ratings were lack of validation, absence of prospective study design, and unavailability of open data. Further insufficiencies in the scientific quality of radiomics studies were found in feature reproducibility and in the analysis of clinical utility. Although the guidelines for machine learning model reporting have not strongly emphasized publicly available code [45, 46], open data and code would be preferable for assessing the reproducibility of findings [47].

Despite the promising results of the meta-analysis, the repeatability and clinical adoption of those models remain uncertain. Only five studies were included, and most of them lacked independent validation. Moreover, the studies provided neither publicly available imaging data with segmentation nor the code employed for data preparation, feature extraction, and model construction. Both Cochran’s Q test and the Higgins I² statistic showed high heterogeneity, but subgroup analysis was not performed because of the limited sample size. The likelihood of publication bias was low, although no negative results were identified in our review. On the other hand, prognostic studies concerning patient survival and metastasis risk were not pooled because of varied outcomes. Further analysis may be possible if future studies report their results using similar measurements.

Among the reviewed studies, radiomics analysis was employed mainly for treatment response and prognosis prediction, and only one study fell into the diagnostic field, differentiating benign from malignant pulmonary nodules. Some of the studies were accomplished with conventional imaging data, indicating that radiomics may provide novel quantitative imaging markers without new acquisition equipment or tracers. Our study demonstrated that radiomics may be useful in helping radiologists answer clinical questions closely related to practice. To translate these excellent results into clinical radiology, well-designed and properly conducted studies are indispensable. Therefore, the shortcomings in study design, validation, and open science detected by RQS should be avoided. RQS should be used not only as a tool for assessing the scientific and reporting quality of published research but also as a routine self-checklist before manuscript submission, and even as a guideline for radiomics study design.

During the application of RQS, varying inter-rater agreement has been observed [20]. To address this, one later study developed a data extraction instrument and introduced a training phase to reach a shared understanding of each parameter before the formal assessment [21]. As a result, agreement for the summed RQS rating (ICC = 0.96) and for most items improved. Other studies discussed items with initial disagreements and tailored the RQS to the specific research question during the data extraction phase to reach a more reproducible assessment [38, 39]; however, the agreement was not reported. Our study repeated those processes and demonstrated that these efforts allowed a high inter-rater agreement for the RQS (ICC = 0.95) and a shared understanding of most items. Therefore, a similar procedure is deemed essential.

However, modifications of RQS in response to practical needs are necessary. Two previous reviews attempted to integrate RQS with six key domains to facilitate its use in radiomics approaches [38, 39], aiming at a more precise assessment and appropriate methodological improvement. One of them compared RQS with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) checklist [48] and pointed out that there was room for improvement in stating the study objective in the abstract and introduction, blind assessment of outcome, sample size, and missing data, in order to promote more standardized reporting of radiomics research [39]. Another guideline emphasized reliable assessment of model validity and consistent interpretation of model outputs, and provided a more clearly defined checklist for the assessment of model establishment than RQS, to enable consistent reporting and correct application of model specifications and results [45]. Both RQS and TRIPOD emphasized validation of the imaging biomarkers [42], but they mainly concerned the dataset used during this process. A recent statement [49] further proposed a structured validation pipeline based on a three-step technical validation followed by clinical validation, and pointed out the need for regular updating of validated imaging biomarkers. This process may provide a more practicable and more standardized roadmap for translating radiomics models into clinically applicable tools [9]. Other guidelines concerning artificial intelligence methods also provide valuable references for RQS improvement [50, 51].

Some inherent limitations exist in this review. First, radiomics studies investigating osteosarcoma are limited. Hence, only twelve studies were included and five were meta-analyzed. However, osteosarcoma is a rare disease with an incidence of a few cases per million; our review is sufficient to represent the status of this highly specialized field. Second, only one study included in the meta-analysis was validated with an internal dataset. Our results may therefore overestimate the performance of radiomics models. Third, the meta-analysis showed high overall heterogeneity, but subgroup analysis was not performed, since the sample size of the studies was too small to draw any reliable conclusions. Future reviews including more studies and larger sample sizes may assess the influence of heterogeneity. Finally, the RQS has limitations. Radiomics is still a developing field, and so is RQS. It is necessary to improve RQS items in response to actual practical needs. Still, RQS is a timely tool for methodological quality assessment of radiomics research.

In conclusion, radiomics models showed promise for answering clinical questions related to osteosarcoma patients. In particular, for the prediction of response to neoadjuvant chemotherapy, the meta-analysis implied moderate performance of radiomics. However, prospectively designed, well-validated radiomics trials with open data are needed to demonstrate their effectiveness and clinical validity. Moreover, RQS with ongoing improvements may serve as a useful tool to facilitate the clinical translation of radiomics.