Introduction

Colorectal cancer ranks as the third most frequent malignancy worldwide, and 30%–50% of colorectal cancer occurs in the rectum [1, 2]. Currently, neoadjuvant chemoradiotherapy (N-CRT) followed by total mesorectal excision (TME) is considered the recommended treatment for locally advanced rectal cancer (LARC) [3, 4], with N-CRT playing an important role in the tumor treatment. However, the therapeutic outcomes following N-CRT for LARC vary considerably, ranging from no tumor regression to intermediate regression, and even pathological complete response (pCR) [5,6,7,8].

Early identification of good and poor responders to N-CRT plays an important role in choosing an appropriate therapeutic strategy. If poor responders could be identified accurately before TME, alternative strategies such as timely or more extensive surgery could be considered for them, avoiding the ineffective use of high radiation doses or high doses of chemotherapy drugs [9,10,11]. However, the accurate identification of good and poor responders to N-CRT currently relies on surgical specimens and therefore involves a substantial time delay. A method that identifies good and poor responders before surgery would thus be highly valuable for determining therapeutic strategies.

By extracting high-dimensional data from medical imaging, radiomics may reveal hard-to-visualize information that is closely associated with tumor heterogeneity [12,13,14,15]. Recent advances in radiomics have shown that it can be useful in tumor prognosis and therapy guidance [16], and previous articles have demonstrated that MRI radiomics features can be used to predict pCR to N-CRT in patients with LARC [17,18,19]. To the best of our knowledge, few articles have reported on the use of radiomics to predict good and poor responders after N-CRT, and the role of radiomics feature analysis in the prediction of good responders and poor responders has not yet been established.

Therefore, the purpose of our study was to generate MRI-based radiomics models for the preoperative prediction of poor and good responders to N-CRT in patients with LARC.

Materials and methods

Patient population

Our institutional review board approved this retrospective study and waived the requirement for informed patient consent. The picture archiving and communication system of our institution was searched for consecutive patients who underwent rectal MRI between March 2014 and May 2019. The inclusion criteria for this study were: (1) rectal adenocarcinoma confirmed by pathological examination; (2) stage T3/T4 LARC; (3) no evidence of distant metastasis; and (4) completion of standard N-CRT. The exclusion criteria were: (1) incomplete N-CRT for any reason; (2) incomplete MR imaging; and (3) no postoperative pathological results. The patient selection process is summarized in Supplementary item S1.

Neoadjuvant-CRT

All patients underwent radiation therapy using a three-dimensional conformal multiple-field technique with a total dose of 45 Gy (daily fractions of 1.8 Gy, delivered 5 days a week for 5 weeks). The chemotherapy regimen, given on radiation therapy days, consisted of intravenous oxaliplatin at 60 mg/m² weekly and oral capecitabine at 825 mg/m² daily.

Reference standard

The reference standard for the response to N-CRT was based on the histopathologic reports for the surgically resected specimens. All histopathologic analyses were performed by two dedicated gastrointestinal pathologists with 9 and 15 years of experience, who were blinded to the clinical and MRI findings. The histopathological analysis of the resected specimens included the N-CRT response according to the tumor regression grading (TRG) proposed by Mandard et al. [20]. The criteria for the TRG are illustrated in Supplementary item S2. The enrolled patients were divided into two groups according to the TRG: good responders (pTRG1–2) and poor responders (pTRG3–5).
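The dichotomization described above can be sketched in a few lines. This is a minimal illustration, assuming TRG is recorded as an integer from 1 to 5; the function name and the example grades are hypothetical and are not study data.

```python
def response_group(trg: int) -> str:
    """Map a Mandard tumor regression grade (1-5) to the study's two groups.

    Grades 1-2 are treated as good responders and grades 3-5 as poor
    responders, following the grouping stated in the text.
    """
    if trg not in (1, 2, 3, 4, 5):
        raise ValueError(f"TRG must be an integer from 1 to 5, got {trg!r}")
    return "good" if trg <= 2 else "poor"


# Hypothetical grades for five patients (illustrative only).
example_grades = [1, 3, 2, 5, 4]
example_groups = [response_group(g) for g in example_grades]
```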

MRI protocol

The majority (66.7%, 122 of 183) of the rectal MRI examinations were acquired on a Signa 3.0-T MRI scanner (GE Medical Systems, Milwaukee, USA) in the Southern District of our hospital and were used as the training cohort, while the remaining patients (33.3%, 61 of 183) were scanned on a Signa HDX 3.0-T MRI scanner (GE Medical Systems) in the West District of our hospital and were used as the validation cohort. The MRI protocol included axial T2-weighted imaging (T2WI), diffusion-weighted imaging (DWI; from which apparent diffusion coefficient [ADC] maps were generated), and multiphase contrast-enhanced T1-weighted imaging (CE-T1WI) obtained with four timeframes: one pre-contrast (L1) and three post-contrast, at 15 s (L2), 60 s (L3), and 120 s (L4). For the CE-T1WI, 90–100 mL of the gadolinium-based contrast agent dimeglumine gadopentetate (Magnevist; Schering Diagnostics AG, Berlin, Germany) was administered intravenously at a rate of 2.5 mL/s through a high-pressure injector (Optistar LE, Liebel-Flarsheim Company, OH, USA). The MRI acquisitions and parameters are summarized in Supplementary item S3.

Image segmentation and radiomic feature extraction

Two radiologists with 7 and 15 years of abdominal imaging experience who were blinded to the patients’ pathological outcomes delineated the entire volume of interest (VOI) around the tumor in a slice-by-slice manner. T2WI, ADC maps, and the L2 timeframe of the CE-T1WI were used for VOI segmentation. Freely available open-source software (3D-Slicer, version 3.4.2, https://www.slicer.org/) was used for the imaging segmentation and feature extraction. Resampling to a voxel size of 1 × 1 × 1 mm and Gaussian filtering with sigma values of 0.5, 1.0, and 1.5 were performed with 3D-Slicer. All extracted radiomics features are shown in Supplementary item S4.

Intra-observer and inter-observer agreement

Thirty samples were randomly chosen from the T2WI, ADC maps, and CE-T1WI acquired on the two MRI scanners, and the intra-observer and inter-observer agreements for the radiomics features extracted by the two radiologists were measured. To evaluate intra-observer reproducibility, radiologist 1 performed VOI delineations twice within 1 week. To evaluate inter-observer reproducibility, radiologist 2 independently delineated the VOIs once following the same procedure, and the resulting radiomics features extracted from the VOIs of radiologist 2 were compared with the radiomics features extracted by radiologist 1. Radiologist 1 performed all of the VOI delineations that were used to extract radiomics features.
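As an illustration of how such agreement can be quantified, the sketch below computes an intraclass correlation coefficient (ICC) for one feature and keeps only features that reach a threshold. The text does not state which ICC form was used; the two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1), is assumed here. The function names and the idea of passing readings as a dictionary are hypothetical.

```python
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `ratings` is a list of rows, one row per delineated lesion and one
    column per reading of the same feature (e.g. radiologist 1 and 2).
    """
    n = len(ratings)       # number of lesions
    k = len(ratings[0])    # number of readings per lesion
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def reproducible_features(feature_ratings, threshold=0.75):
    """Return the names of features whose ICC is at least `threshold`."""
    return [name for name, r in feature_ratings.items()
            if icc_2_1(r) >= threshold]
```

A feature read almost identically by both observers yields an ICC close to 1 and is kept, whereas a feature on which the readings disagree yields a value near 0 and is dropped. The default threshold of 0.75 mirrors the cutoff reported in the Results.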

Radiomics feature selection and dimensionality reduction

To avoid overfitting and to select the most informative radiomics features for development of the predictive model, a radiomics feature selection process involving two main steps was performed. First, minimum redundancy maximum relevance feature selection (mRMR) was performed to eliminate redundant and irrelevant features. Second, the least absolute shrinkage and selection operator (LASSO) with tenfold cross-validation was applied to the remaining features for the regression of the high-dimensional data. A radiomics score (Rad-score) was then calculated for each patient in the training and validation cohorts using a linear combination of the selected radiomics features weighted by their respective coefficients.
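The Rad-score calculation amounts to a weighted sum, which can be sketched as follows. The feature names and coefficient values below are hypothetical placeholders (the actual selected features and coefficients are those shown in Fig. 2), and the optional intercept term is an assumption, since the text mentions only the weighted features.

```python
def rad_score(features, coefficients, intercept=0.0):
    """Linear combination of selected features weighted by their coefficients.

    `features` maps feature name -> value for one patient;
    `coefficients` maps feature name -> LASSO coefficient.
    Only features that have a coefficient contribute to the score.
    """
    return intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )


# Hypothetical coefficients and one hypothetical patient (illustrative only).
example_coefficients = {"t2_texture_x": 0.5, "adc_firstorder_y": -0.25}
example_patient = {"t2_texture_x": 2.0, "adc_firstorder_y": 4.0, "unused": 9.9}
example_score = rad_score(example_patient, example_coefficients, intercept=0.1)
```

Features that LASSO shrinks to zero simply do not appear in the coefficient dictionary, so they drop out of the sum.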

MRI findings including the structured report template, ADC values, clinical characteristics, and feature selection

Clinical data including sex, age, carcinoembryonic antigen (CEA), CA199, and CA125 were assessed. All MRI examinations were reviewed by the same radiologists who delineated the VOIs. Decisions on MRI findings, including the structured report template, were made by consensus. The two radiologists recorded the following: (1) T stage, (2) N stage, (3) circumferential resection margin (CRM), and (4) extramural vascular invasion (EMVI). The criteria for these MRI characteristics are listed in Supplementary item S5. The method used for ADC measurement is shown in Supplementary item S6.

Model building, machine learning classifiers, and construction of the nomogram

To develop a clinical model, clinical factors for predicting the response to N-CRT, including MRI findings from the structured report template, ADC value, sex, age, CEA, CA199, and CA125, were selected using univariate and multivariate logistic regression. First, univariate logistic regression was performed on the clinical risk factors, and those predictive factors with a P value < 0.05 were introduced into a multivariate logistic regression model with backward stepwise selection and Akaike's information criterion as the stopping rule. Then, the features identified in this multivariate regression were used to develop a clinical model. The imaging features selected by mRMR and LASSO were used to build a radiomics model. Random Forest and logistic regression classifiers were applied in the clinical and radiomics models. The clinical factors selected by the multivariate logistic regression and the radiomics signature were used to build a nomogram for predicting the response to N-CRT. The workflow for imaging data processing and analysis is displayed in Fig. 1.
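To make the stopping rule concrete, the sketch below computes Akaike's information criterion (AIC) for a binary-outcome model from its predicted probabilities and compares a full and a reduced model. It does not fit the regression itself; the outcomes, predicted probabilities, and parameter counts are hypothetical and serve only to show the comparison made at each backward step.

```python
import math


def bernoulli_log_likelihood(y_true, p_pred):
    """Log-likelihood of binary outcomes given predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    )


def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical outcomes and fitted probabilities (illustrative only).
y = [1, 1, 0, 0]
p_full = [0.8, 0.7, 0.3, 0.2]      # model with 4 parameters
p_reduced = [0.8, 0.6, 0.3, 0.2]   # same model after dropping one variable

aic_full = aic(bernoulli_log_likelihood(y, p_full), n_params=4)
aic_reduced = aic(bernoulli_log_likelihood(y, p_reduced), n_params=3)

# In backward selection the variable is dropped when the AIC decreases.
drop_variable = aic_reduced < aic_full
```

In this toy case the reduced model fits slightly worse but uses one fewer parameter, so its AIC is lower and the variable would be removed.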

Fig. 1 The workflow of imaging processing and data analysis

Follow-up and survival analysis

Progression-free survival (PFS) was defined as the interval from surgery to tumor progression, with tumor progression including local recurrence and/or metastasis or death. The enrolled patients were followed up by MRI every 6–12 months for the first year after surgery, and then annually. Local recurrence or metastasis was confirmed by medical imaging and/or gastroscopic biopsy. Survival curves were plotted according to the Kaplan–Meier method with R software (version 3.4.2, http://www.Rproject.org) and compared using the log-rank test.
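For readers unfamiliar with the Kaplan–Meier method, the product-limit estimate can be sketched as below. This is a generic illustration in Python rather than the R code used in the study, and the follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    `times` are follow-up times; `events[i]` is 1 if progression or death
    was observed at `times[i]` and 0 if the patient was censored.
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events at time t
        leaving = 0    # patients leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up in months; the second patient is censored.
example_curve = kaplan_meier([2, 3, 5, 7], [1, 0, 1, 1])
```

Censored patients reduce the number at risk for later time points without producing a step in the curve, which is why the second patient above lowers the denominator at month 5 but adds no entry at month 3.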

Statistical analysis

All statistical analyses were performed with R software (version 3.4.2, http://www.Rproject.org). Continuous variables were analyzed by the independent t-test or the Mann–Whitney U test. Categorical variables were analyzed by Fisher's exact test or the χ2 test. The area under the receiver operating characteristic (ROC) curve (AUC) with 95% confidence intervals (CI) and a decision curve were used to assess the prediction of poor and good responders using the combined nomogram. The LASSO regression, mRMR, and ROC curve analyses were performed using the R packages "glmnet", "mRMRe", and "pROC", respectively. The Random Forest classifier was built using the "randomForest" package in R. The agreement between the predicted risks and the actual results was evaluated using calibration curves and the Hosmer–Lemeshow test.
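The AUC has a simple rank interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; it is a generic Python illustration, not the pROC implementation, and the scores are hypothetical.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical model scores for three positive and three negative cases.
example_auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Here eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9.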

Results

The relevant characteristics of the training and validation cohorts are summarized in Table 1. The overall distributions of age, sex, CEA, CA199, CA125, MR T stage, CRM, EMVI, and ADC values showed no significant differences between the two patient cohorts (all P > 0.05). The training set consisted of 55/122 (45.08%) good responders and 67/122 (54.92%) poor responders, whereas the validation set consisted of 28/61 (45.90%) good responders and 33/61 (54.10%) poor responders.

Table 1 Clinical and MRI data

A total of 942 imaging features were extracted from the three modalities of T2WI, ADC mapping, and CE-T1WI. The inter-observer assessment showed a satisfactory agreement rate of 87.9% (mean ICC = 0.867, ranging from 0.001 to 0.985) for all 942 features, while the intra-observer agreement showed a satisfactory agreement rate of 95.6% (mean ICC = 0.922, ranging from 0.005 to 0.997). Eighty-seven imaging features were excluded because of low intra- and inter-observer agreement (ICC < 0.75), and the remaining 855 imaging features were used for further analyses. From the remaining 855 features, 30 features were retained after the mRMR feature selection procedure, and these 30 retained features were then entered into the LASSO regression and were reduced to 13 potential predictors, including 2 features derived from ADC maps, 2 from T2WI, and 9 from CE-T1WI. The selected radiomics features and their coefficients are illustrated in Fig. 2. The Rad-score of each patient in the training and validation cohorts was calculated using these 13 features (Fig. 3). A radiomics model was constructed using the 13 features selected by mRMR and LASSO.

Fig. 2 The selected radiomics features and their coefficients

Fig. 3 Plots a, b show the radiomics Rad-score for each patient, while the heatmap shows the contributions of the selected features to the Rad-score

In the training cohort, good responders had significantly higher radiomics scores than poor responders (P < 0.001; Fig. 4a), a finding that was also confirmed in the validation cohort (P < 0.001; Fig. 4b). In the training cohort, there was a significant difference in ADC values between the poor responders and good responders (P = 0.002; Fig. 4c), although this finding was not confirmed in the validation cohort (P = 0.56; Fig. 4d). The Rad-score achieved significantly better predictive capacity (AUCs of 0.869 in the training set and 0.842 in the validation set) than the ADC values (AUCs of 0.661 in the training set and 0.456 in the validation set).

Fig. 4 Boxplots showing differences between the good responders and poor responders. Plots a, b show the difference in Rad-score between the good responders and poor responders in the training and validation sets. Plots c, d show the difference in ADC values between the good responders and poor responders in the training and validation sets

The multivariate logistic regression analysis showed CRM, T stage, and ADC value to be independent predictors of a good or poor response (Table 2). A clinical model was constructed incorporating CRM, T stage, and ADC value, and a nomogram was then constructed using CRM, T stage, ADC value, and the selected radiomics features to differentiate poor from good responders.

Table 2 Univariate and multivariate logistic regression analysis of factors for prediction of a good or poor response

Table 3 shows the prediction performance of the logistic regression (LR) and Random Forest (RF) classifiers of the predictive models. The nomogram showed strong predictive ability in both cohorts, with AUCs of 0.923 (95% CI, 0.872–0.975) and 0.898 (95% CI, 0.819–0.978) in the training and validation sets, respectively (Fig. 5). The DeLong test indicated that there was no significant difference between the ROCs of the nomogram for the training and validation sets (P = 0.884). The nomogram yielded high accuracy, sensitivity, and specificity in both the training cohort (0.861, 0.836, and 0.881, respectively) and the validation cohort (0.803, 0.714, and 0.879, respectively). Moreover, the calibration of the nomogram showed a nonsignificant Hosmer–Lemeshow test statistic in the training (P = 0.349; Fig. 6b) and validation sets (P = 0.538; Fig. 6c). The decision curve analysis illustrated in Fig. 6d shows that the nomogram was superior to the clinical factors when the threshold probability was within the range of 0.1–1.0.
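The reported accuracy, sensitivity, and specificity can be related to the cohort sizes with a short worked example. The confusion-matrix counts below are not reported in the text; they are back-calculated here as integers consistent with the reported rounded values and the cohort compositions (55 good and 67 poor responders in training, 28 and 33 in validation), under the assumption that good responders are the positive class. They should be read as an illustration, not as study data.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # positives correctly identified
        "specificity": tn / (tn + fp),   # negatives correctly identified
    }


# Back-calculated (assumed) counts; positive class = good responders.
training = classification_metrics(tp=46, fn=9, tn=59, fp=8)     # 55 + 67 patients
validation = classification_metrics(tp=20, fn=8, tn=29, fp=4)   # 28 + 33 patients

training_rounded = {k: round(v, 3) for k, v in training.items()}
validation_rounded = {k: round(v, 3) for k, v in validation.items()}
```

Under these assumed counts the rounded training values are 0.861, 0.836, and 0.881, and the rounded validation values are 0.803, 0.714, and 0.879, matching the figures quoted above.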

Table 3 Ability of the models with different classifiers to distinguish between good responders and poor responders

Fig. 5 Scatterplots of areas under the ROC curve (AUC) of classifiers

Fig. 6 The nomogram, its corresponding calibration curves for each dataset, and the decision curve analysis of the nomogram. Plots (b) and (c) present the calibration curves of the nomogram for the training and validation sets, respectively. Plot (d) presents the decision curve analysis for the nomogram
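Decision curve analysis compares models by their net benefit at a given threshold probability. The sketch below uses the standard net-benefit definition, true-positive rate minus false-positive rate weighted by the odds of the threshold; the counts, prevalence, and threshold are hypothetical and are not taken from the study.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of a model at a threshold probability (0 < threshold < 1)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of classifying every patient as positive."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical example: 100 patients, 45 true positives in the population,
# of which the model flags 40, plus 10 false positives, at a threshold of 0.2.
model_nb = net_benefit(tp=40, fp=10, n=100, threshold=0.2)
treat_all_nb = net_benefit_treat_all(prevalence=0.45, threshold=0.2)
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all value and zero, the net benefit of treating no one.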

For both the training and validation cohorts, the Kaplan–Meier curves of PFS for the prediction models are shown in Fig. 7. The pathologic results (good responders vs. poor responders) significantly stratified patients with respect to PFS in both the training and validation sets (P = 0.00091 and P = 0.0039, respectively). The nomogram achieved separation of the survival curves with P values of 0.0036 in the training set and 0.024 in the validation set.

Fig. 7 Kaplan–Meier survival curves of patients in the training and validation sets, displaying patient stratification according to the pathology (a, c) and the nomogram model (b, d)

Discussion

In this study, we investigated two machine learning algorithms for identifying good responders and poor responders in patients with LARC who underwent N-CRT. The nomogram model showed excellent discrimination and calibration for differentiating the two groups, with AUCs of 0.923 in the training cohort and 0.898 in the validation cohort, and it achieved separation of survival curves in both the training and validation sets, with P values of 0.0036 and 0.024, respectively.

Morphological information on MRI is widely used to evaluate the response to N-CRT. Cui et al. [17] built a model based on radiomics features to predict pCR in patients with LARC after they had undergone N-CRT, and showed that the T stage according to MRI was a positive predictive factor. Tang et al. [21] reported that a nomogram using MRI findings taken from a structured report template and clinical factors performed well in identifying poor responders after N-CRT, with AUCs of 0.823 on the training set and 0.820 on the validation set, and demonstrated that T stage according to the MRI structured report template was a significant predictor. Our result is in accord with their findings, although our nomogram showed better discrimination of good and poor responders in patients with LARC who underwent N-CRT, with AUCs of 0.923 on the training set and 0.898 on the validation set. Our nomogram also contained radiomics factors, which can provide nonvisual information on the underlying tumor heterogeneity [22,23,24]. Recent advances in radiomics have demonstrated that the approach has great potential for predicting tumor response [25,26,27,28]. Previous studies added imaging features extracted from DCE-MRI to radiomics analysis to identify pCR in LARC patients having undergone N-CRT, and obtained high AUCs [17, 29]. DCE-MRI may provide comprehensive information with smaller image intervals and higher image contrast than T2-weighted imaging and diffusion-weighted imaging. The addition of these radiomics predictive factors may further improve the predictive performance of our model.

After multivariate logistic regression analysis, CRM, which was previously found to be an independent prognostic factor for local recurrence after surgery [30, 31], was also introduced into our nomogram. Owing to its high resolution for soft tissue, MRI is considered to be the most reliable imaging method for identifying potential CRM involvement [32, 33], with Glynne-Jones et al. [34] finding that CRM status could predict outcome after surgery following preoperative radiotherapy and chemoradiation, and Depypere et al. [35] demonstrating that CRM was correlated with the biological behavior of the tumor and therapy response after N-CRT. Studies have demonstrated that MRI-detected EMVI positivity is closely related to tumor recurrence after surgery in patients with LARC [36, 37]. Chand et al. [38] showed that detection of EMVI after N-CRT was of independent prognostic significance for disease-free survival (DFS). However, while Lee et al. [39] found that EMVI was the only significant factor correlating with patients' DFS in a univariate analysis, it was not significant in a multivariate analysis. Following the univariate and multivariate analysis, EMVI was not included in our nomogram. This lack of significance may be related to the low number of EMVI-positive cases.

In previous studies, DWI and DWI-derived ADC values were used to assess the response to N-CRT [40,41,42], although the results remain controversial. Birlik and colleagues reported that the ADC values of good responders were significantly higher than those of poor responders [43], but Cai et al. could not confirm this result [44]. In our study, the difference in ADC values between good and poor responders was significant in the training set (P = 0.002), but not in the validation set (P = 0.56). The role of ADC values as an independent predictive factor for assessing the response to N-CRT therefore remains highly uncertain. Compared with the ADC values, the Rad-score model demonstrated more satisfactory results, with AUCs of 0.869 and 0.842 for predicting a good response in the training and validation sets, respectively. One possible explanation for the improved performance is that high-dimensional imaging features hold more detailed information on tumors, and that this information allows for more sensitive differentiation between good-responding and poor-responding LARC patients.

The TRG system has been widely used to assess the response to N-CRT in patients with LARC who underwent TME, and to stratify LARC patients after N-CRT according to DFS [45,46,47]. In our study, we evaluated the relationship between TRG and prognosis, and found that good responders with TRG 1–2 had longer PFS than poor responders with TRG 3–5, a finding in accord with the results of Huh et al. [48]. Few articles have reported using a nomogram to stratify LARC patients after N-CRT according to PFS. In our study, the nomogram also achieved separation of survival curves.

Machine learning based on radiomics has been widely used to identify pCR in patients with LARC after N-CRT. Horvat et al. [28] used an RF classifier based on radiomics to predict pCR, with an AUC of 0.93; however, their study lacked a validation group. In our study, the RF classifier based on radiomics showed good classification performance for identifying good and poor responders, with AUCs of 0.883 and 0.827 in the training and validation sets, respectively. Moreover, the nomogram is based on logistic regression analysis, which is itself a machine learning algorithm. Compared with other machine learning algorithms, the significant advantage of a nomogram is that it is visual, which makes it easy to use.

Study limitations

Our study has some limitations. First, this was a retrospective study, and there is the potential for selection bias because patients on nonsurgical management were not included. Second, the VOI segmentation was performed manually, and differences in the extraction of imaging features may have occurred; we therefore performed a consistency check to control for variation in the variables. Third, this study used a dichotomous grading of good or poor responders, rather than the five grades of the TRG; further studies are needed to provide a better generalization. Furthermore, the operation type, surgical approach, and treatments after surgery may significantly influence the prognosis of LARC patients [33,34,35], but as a preoperative predictive method, our nomogram involved only preoperative factors.

Conclusions

The nomogram combining multiparametric MRI-based radiomics features with CRM, T stage, and ADC values showed high performance in distinguishing between good responders and poor responders in patients with LARC who underwent N-CRT, and may help in predicting PFS.