1 Introduction

Hepatocellular carcinoma (HCC) is the third leading cause of cancer-related death worldwide [1]. It is the second and sixth most common cancer worldwide in men and women, respectively. Every year, there are approximately 905,677 new cases and 830,180 deaths globally [2]. Transcatheter arterial chemoembolization (TACE) is a routine, standard treatment for unresectable liver cancer and has proven effective for intermediate-stage disease [3,4,5,6,7]. However, because the biological behavior of tumor cells is highly heterogeneous, the results of TACE treatment vary between individuals [8]. The objective response rates of progressive disease following TACE range from 15 to 61% [9,10,11,12]. Therefore, it is necessary to select ideal candidates for TACE therapy quickly and accurately before the procedure, in order to improve treatment efficacy and overall survival. Previous studies have reported that genes and proteins might serve as biomarkers of TACE treatment response [13, 14]. However, obtaining these biomarkers is time-consuming and expensive, which adds to the financial burden on patients. Radiomics, a high-dimensional quantitative feature analysis approach, can extract high-throughput features from medical images and quantify tumor heterogeneity. Several studies have shown that radiomics features can characterize the tumor and the tumor microenvironment (TME) [15,16,17,18,19,20,21]. These features are closely related to specific microscopic characteristics at the genetic, protein and molecular levels. Radiomics features have been suggested for predicting molecular subtype, tumor gene expression, pathological classification, treatment response and survival [17,18,19,20,21,22,23,24,25].
Deep learning is a type of machine learning that can extract a large number of higher-level features from the deep hidden layers of a convolutional neural network (CNN) and has been widely successful in image recognition and classification. Compared with handcrafted features, these deep features contain more abstract medical image information and provide more insight into predictive patterns. Incorporating deep learning into a radiomics model can enrich the information available to the model and improve its prediction performance. Although many studies have used radiomics or deep learning to predict the response to TACE in HCC patients, relatively few have combined deep learning and radiomics features to predict TACE treatment response preoperatively.

In this study, we propose an integrated model for predicting the TACE treatment response of HCC patients that combines deep learning and radiomics features from MRI scans, and we carry out a quantitative analysis of the features. Our goal is to help doctors make pre-treatment decisions. We use deep learning only as a feature extractor and then integrate the deep learning features into the radiomics analysis model, which enriches the information available to the model and improves its prediction performance with limited training data. Patients predicted to respond well can be selected for TACE treatment, while the treatment plan of patients with poor predicted outcomes can be adjusted while there is still time.

2 Materials and Methods

2.1 Patients

The data in this study consisted of 71 HCC patients who received TACE treatment in our center from February 2016 to August 2020. This is a retrospective study from a single center. The hospital ethical review committee approved our research protocol and waived the requirement for informed consent (B2019-336). Table 1 shows the clinical information of the patients. The patients were followed up once in the first month after the operation and subsequently every 2–3 months. The modified RECIST criteria were used to determine whether there was a treatment response in a 6-month follow-up venous phase MRI scan. The criteria for patient inclusion are as follows: (1) A multifunctional MR test was performed within one week before surgery; (3) The patient has a confirmed clinical diagnosis of HCC; (4) TACE was used as the initial treatment protocol; (5) Complete clinical information on the patient was available. The criteria for exclusion of patients were as follows: (1) Poor image quality (blurry images); (2) The follow-up time was less than six months; (3) There were other malignant tumors; (4) Other surgical or chemotherapy interventions were received before TACE; (5) Expected survival time is less than 12 months. Finally, as shown in Fig. 1, 20 patients with progressive disease (PD) response and 51 patients with non-progressive disease (N-PD) response after TACE treatment were selected.

Table 1 Clinical characteristics of the patients
Fig. 1 A flowchart describing how the authors enrolled and excluded patients

2.2 TACE Procedure

The TACE procedure was performed by six interventional radiologists (IR) with more than 5 and 10 years of clinical experience with TACE respectively. A 5F catheter (RH catheter) was used to selectively perform diagnostic angiography of the celiac trunk and superior mesenteric artery. Super-selective catheterization of the tumor-feeding artery was performed using a 2.7-F microcatheter (Progreat; Terumo). Thereafter, a mixture of 75 mg/m² oxaliplatin and 500 mg/m² 5-fluorouracil was infused. This was mixed with 5–20 mL of iodized oil (Lipiodol Ultrafluido; Guerbet), 30–50 mg/m² epirubicin (Pfizer Inc.) and gelatin sponge (Bi-Trumed Biotech Co., Ltd.) for chemical embolization. The embolic material was applied under fluoroscopic guidance. The chemical embolization process concluded when the contrast agent stopped being removed from the blood vessel after 10 consecutive heartbeats.

2.3 Image Acquisition and Preprocessing

MRI scans were acquired on a 3.0 T MRI scanner (Verio; Siemens, Erlangen, Germany). The MRI protocol consisted of a T2-weighted TSE sequence (TR/TE = 3320 ms/83 ms; acquired resolution = 0.74 × 0.74 mm²; slice thickness = 6 mm; matrix = 320 × 320). The entire tumor volume of interest (VOI) was manually delineated slice by slice on the T2-weighted images (T2-WI) by a radiologist with 10 years of experience. Thereafter, each segmented slice was reviewed and modified by a chief radiologist with over 20 years of experience in HCC MRI analysis. The VOIs covered the whole tumor. The Medical Imaging Interaction Toolkit (MITK) software was used to draw the tumor VOI. Figure 2 shows an example of the tumor VOI in a sequence. The dimension of each MRI image was 256 × 256 × 40. The T2-WI intensities were corrected with N4BiasFieldCorrection [26], and the intensity range was standardized using histogram matching [27].

Fig. 2 An example of the tumor ROI segmentation on one slice of T2WI

2.4 Feature Extraction

2.4.1 Radiomics Features (RsF)

A total of 1595 3D radiomics features were extracted from each VOI with PyRadiomics [28]. These radiomics features can be divided into three categories: texture, intensity and geometry characteristics. The texture characteristics of the VOI were described by 16 gray-level size zone matrix (GLSZM) features, 24 gray-level co-occurrence matrix (GLCM) features, 5 neighboring gray tone difference matrix (NGTDM) features, 16 gray-level run length matrix (GLRLM) features and 14 gray-level dependence matrix (GLDM) features. The intensity characteristics within the tumor were reflected by 18 first-order statistical features. The geometry characteristics of the tumor were described using 14 three-dimensional shape features. In addition, eight image filters were applied to the original image to yield derived images: gradient, wavelet, square, square root, logarithm, exponential, Laplacian of Gaussian (LoG), and local binary pattern 3D (LBP-3D). The above radiomics features, excluding the shape features, were also extracted from these derived images.

2.4.2 Deep Learning Features (DLF)

A total of 1024 deep learning features for each patient were extracted from a 3D CNN consisting of two 3D convolution layers and two fully connected layers. The deep features were taken from the outputs of the first fully connected layer after the rectified linear unit (ReLU) [29] activation function, which sets negative values to 0. The number of deep learning features, 1024, was chosen after a series of trials. The architecture and parameters are described in Table 2. The 3D CNN was implemented in TensorFlow [30]. The training batch size was set to 20, and the model was trained for 50 epochs using the Adam optimizer [31] with a learning rate of 10⁻³.
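The way deep features are read from a fully connected layer after ReLU can be illustrated with a minimal pure-Python sketch. The layer sizes, random weights and function names (`dense`, `relu`) are hypothetical and chosen only for illustration; the actual network is the TensorFlow 3D CNN described in Table 2, whose first fully connected layer has 1024 outputs.

```python
import random

def relu(values):
    """Set negative activations to 0 and keep non-negative ones unchanged."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

rng = random.Random(0)
n_in, n_out = 8, 4  # toy sizes (assumption); the paper's layer has 1024 outputs

# Stand-in for the flattened output of the convolution layers.
flattened = [rng.uniform(-1, 1) for _ in range(n_in)]
weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
biases = [0.0] * n_out

# The deep features are the post-ReLU activations of this layer.
deep_features = relu(dense(flattened, weights, biases))
```

Each entry of `deep_features` corresponds to one neuron of the layer, which is how a name such as deep_feature_151 maps to a single activation.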

Table 2 3D CNN architecture for extraction of deep learning features

2.4.3 Radiomics + Deep Learning Features

The output features of the first fully connected layer of the 3D CNN framework were connected with the latter half of the radiomics process so that the extracted 1595 radiomics features and 1024 deep learning features could be merged into an integrative feature space. The total number of features after the fusion was 2619 as shown in Fig. 3. After feature reduction, these were input into the machine learning classifier for joint training.
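As a rough sketch of the fusion step, the two feature sets can be concatenated per patient. The function name `fuse_features` and the zero-filled toy vectors are assumptions for illustration only; the feature counts are those stated in the text.

```python
def fuse_features(radiomics, deep):
    """Concatenate per-patient radiomics and deep learning feature vectors."""
    if len(radiomics) != len(deep):
        raise ValueError("both feature sets must cover the same patients")
    return [list(r) + list(d) for r, d in zip(radiomics, deep)]

n_patients = 3  # toy cohort size (assumption)
radiomics = [[0.0] * 1595 for _ in range(n_patients)]  # placeholder values
deep = [[0.0] * 1024 for _ in range(n_patients)]       # placeholder values

fused = fuse_features(radiomics, deep)
n_fused = len(fused[0])  # 1595 + 1024 features per patient
```

Keeping the radiomics block first and the deep block second makes it straightforward to trace a selected column back to its source feature.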

Fig. 3 The predictive models from the integration of radiomics and deep features

2.4.4 Feature Selection and Classifier Modeling

We first normalized the feature matrix: the mean of each feature vector was subtracted from it, and the result was divided by the vector's length. Because of the high dimensionality of the feature space, we used Pearson correlation coefficient (PCC) [32] analysis to identify redundant features. When the absolute value of the PCC between two features was larger than 0.86, one feature of the pair was considered redundant and removed. This reduced the dimensionality of the feature space and left no pair of features with an absolute PCC above 0.86. Before building the classifier, support vector machine-recursive feature elimination (SVM-RFE) was used to select features. The SVM-RFE method has proven very effective in finding significant features that improve classification performance [33, 34]. It selects features with an SVM classifier by recursively considering smaller and smaller feature sets. In each cycle, the algorithm eliminated the single feature with the least impact on the prediction of the SVM model, which yielded a ranking of all features [35, 36]. The first item in the ranking was the most relevant feature and the last item was the least relevant. Finally, the top N features in the ranking were selected to build the SVM model. We used a linear kernel for the SVM classifier [37], which made it easier to interpret the feature coefficients of the final model [38]. Once the SVM classifier was built, each selected feature had a corresponding coefficient, and the predicted RsF + DLF_Score was calculated as the linear weighted sum of the selected features. Since the number of samples was limited, we also applied the Synthetic Minority Oversampling Technique (SMOTE), a strategy for handling class imbalance, in the training process [39].
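The normalization and PCC-based redundancy screening can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: "length" is read as the L2 norm of the centered vector, and the earlier feature of a highly correlated pair is the one kept (the text does not say which one is removed). The function names and toy data are hypothetical.

```python
import math

def normalize(column):
    """Subtract the mean, then divide by the L2 length of the centered vector."""
    mean = sum(column) / len(column)
    centered = [v - mean for v in column]
    length = math.sqrt(sum(v * v for v in centered))
    return [v / length for v in centered] if length > 0 else centered

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant feature is treated as uncorrelated here
    return cov / math.sqrt(va * vb)

def drop_redundant(columns, threshold=0.86):
    """Return indices of features whose |PCC| with every kept feature is <= threshold."""
    kept = []
    for idx, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) <= threshold for k in kept):
            kept.append(idx)
    return kept

# Toy feature matrix stored column-wise: one list per feature (assumption).
features = [
    [1.0, 2.0, 3.0, 4.0],
    [2.1, 3.9, 6.2, 8.0],  # nearly a multiple of feature 0, so redundant
    [4.0, 1.0, 3.0, 2.0],
]
normalized = [normalize(col) for col in features]
kept = drop_redundant(normalized)
```

The surviving columns would then be passed to SVM-RFE for ranking, which is not shown here.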

2.5 Statistical Analysis

The receiver operating characteristic (ROC) curve was used to assess the predictive ability of the model, quantified by the area under the ROC curve (AUC). The accuracy, recall, specificity, precision and F1-score were also calculated at the cutoff value that maximized the AUC value. To validate the performance of the model, we used fivefold cross-validation on the data set. In fivefold cross-validation, the data set was randomly divided into five unique subsets S = [s1, s2, s3, s4, s5] to train five independent models. The first model was trained using the subsets [s2, s3, s4, s5] and tested using s1, while the second model was trained using [s1, s3, s4, s5] and tested using s2. This procedure was repeated until all five subsets had been tested. We ensured that there was no patient overlap between the subsets. The mean and standard deviation of the above-mentioned metrics were estimated to evaluate overall model performance. The P value was calculated by univariable association analyses between clinical parameters and TACE treatment response status, with the statistical significance level set at 0.05. Keras was used to conduct feature selection, feature extraction, classification and statistical analysis. The experiment code was run on an Nvidia GeForce GTX 1070 GPU with 8 GB of GDDR5 memory.
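The fivefold partition and the AUC computation described above can be sketched in plain Python. The helper names, the fixed random seed and the unstratified shuffle are assumptions made for illustration; the study's own splitting code is not given in the text. The AUC is computed here through its rank interpretation (the probability that a positive case scores higher than a negative one).

```python
import random

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and yield (train, test) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Training set: every sample that is not in the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 71 patients, as in the cohort; each patient appears in exactly one test fold.
splits = list(five_fold_splits(71))
```

Per-fold metrics computed on each `test` list would then be summarized by their mean and standard deviation.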

3 Results

Table 3 shows the predictive performance of the radiomics-based model, the deep learning-based model and the integrated model on the data set after fivefold cross-validation. The integrated model had the best predictive ability, with an AUC of 0.947 ± 0.069, an accuracy of 0.893 ± 0.088, an F1-score of 0.700 ± 0.245, a specificity of 0.700 ± 0.245, a precision of 0.700 ± 0.245 and a recall of 0.600 ± 0.279. This was followed by the deep learning-based model with an AUC of 0.867 ± 0.121 and lastly the radiomics-based model with an AUC of 0.848 ± 0.128. The ROC curves for the three models are shown in Fig. 4. The blue line represents the radiomics feature-based model, the green line the deep learning feature-based model, and the orange line the model with radiomics and deep learning features (the best predictive ability). We also performed a DeLong test [40]. The P-value between the AUC values of RsF + DLF and DLF was 0.036, which is less than 0.05. However, the P-value between RsF + DLF and RsF was 0.214, and the P-value between DLF and RsF was 0.896.

Table 3 The performance of radiomics-based model, deep learning-based model and integrated model on 5-Fold cross-validation
Fig. 4 ROC curves of the prediction models

In Table 1, none of the clinical characteristics is statistically significant except sex, for which the P value is less than 0.05. However, a correlation between sex and TACE efficacy has not been established clinically, so this factor was not included in the model analysis. Consequently, no clinical characteristics were taken into consideration when building the model. For the integrated model, 864 features remained after PCC screening. The top 30 features were selected by SVM-RFE for model building. The integrated predictive model with the best AUC used 20 selected features; the details are shown in Table 4. By linearly combining the 20 features, the integrated score can be computed as

$$
\begin{aligned}
\text{RsF + DLF\_Score} ={} & (0.379) \times \text{wavelet-LHL\_glcm\_Idmn} \\
& + (0.600) \times \text{wavelet-HLH\_firstorder\_Skewness} \\
& + (-0.356) \times \text{wavelet-HHL\_firstorder\_Mean} \\
& + (0.385) \times \text{wavelet-HHL\_glcm\_MCC} \\
& + (0.764) \times \text{wavelet-HHH\_firstorder\_Kurtosis} \\
& + (0.233) \times \text{wavelet-HHH\_ngtdm\_Complexity} \\
& + (0.991) \times \text{exponential\_ngtdm\_Busyness} \\
& + (-0.993) \times \text{lbd-3D-m2\_glszm\_GrayLevelVariance} \\
& + (-0.618) \times \text{lbd-3D-k\_ngtdm\_Contrast} \\
& + (0.981) \times \text{lbd-3D-k\_ngtdm\_Busyness} \\
& + (-0.481) \times \text{deep\_feature\_15} \\
& + (1.423) \times \text{deep\_feature\_151} \\
& + (0.999) \times \text{deep\_feature\_231} \\
& + (-0.550) \times \text{deep\_feature\_253} \\
& + (0.242) \times \text{deep\_feature\_387} \\
& + (0.692) \times \text{deep\_feature\_454} \\
& + (0.662) \times \text{deep\_feature\_508} \\
& + (0.863) \times \text{deep\_feature\_594} \\
& + (0.762) \times \text{deep\_feature\_611} \\
& + (1.019) \times \text{deep\_feature\_757}
\end{aligned}
$$
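For readers who want to evaluate the score programmatically, the linear combination above can be written as a small lookup table and a weighted sum. The coefficients and feature names are copied from the equation; the function name `rsf_dlf_score` and the example input are hypothetical, and the sketch assumes the feature values have already been normalized in the same way as during model building.

```python
# Coefficients of the 20 selected features, as listed in the score equation.
COEFFICIENTS = {
    "wavelet-LHL_glcm_Idmn": 0.379,
    "wavelet-HLH_firstorder_Skewness": 0.600,
    "wavelet-HHL_firstorder_Mean": -0.356,
    "wavelet-HHL_glcm_MCC": 0.385,
    "wavelet-HHH_firstorder_Kurtosis": 0.764,
    "wavelet-HHH_ngtdm_Complexity": 0.233,
    "exponential_ngtdm_Busyness": 0.991,
    "lbd-3D-m2_glszm_GrayLevelVariance": -0.993,
    "lbd-3D-k_ngtdm_Contrast": -0.618,
    "lbd-3D-k_ngtdm_Busyness": 0.981,
    "deep_feature_15": -0.481,
    "deep_feature_151": 1.423,
    "deep_feature_231": 0.999,
    "deep_feature_253": -0.550,
    "deep_feature_387": 0.242,
    "deep_feature_454": 0.692,
    "deep_feature_508": 0.662,
    "deep_feature_594": 0.863,
    "deep_feature_611": 0.762,
    "deep_feature_757": 1.019,
}

def rsf_dlf_score(features):
    """Weighted sum of the selected (already normalized) feature values."""
    return sum(coef * features[name] for name, coef in COEFFICIENTS.items())

# Hypothetical patient whose 20 normalized feature values are all 1.0.
example = {name: 1.0 for name in COEFFICIENTS}
score = rsf_dlf_score(example)
```

A missing feature raises a `KeyError`, which is a deliberate choice in this sketch so that incomplete inputs are not silently scored.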
Table 4 A multivariate analysis of preoperative factors

4 Discussion

Previous studies have shown that early assessment of the response to TACE is crucial for successful treatment, as it may allow the treatment plan to be modified in time [41]. Fast and accurate prediction of treatment response before TACE is of great significance for improving overall patient prognosis. Radiomics features have been shown to be closely linked to clinical prognosis and tumor genomic patterns [42,43,44,45,46]. Recently, the use of radiomics features for predicting TACE response in HCC patients before surgery has attracted considerable attention in the literature. However, very few studies integrate radiomics and deep learning features for preoperative assessment of TACE response. In our study, we presented an integrated model for quickly and accurately predicting the response of unresectable HCC patients before TACE therapy based on both radiomics and deep learning features.

As assessed using fivefold cross-validation, the integrated model achieved the highest AUC of 0.947 ± 0.069, while the deep learning-based and radiomics-based models had AUC values of 0.867 ± 0.121 and 0.848 ± 0.128, respectively. In the DeLong test, the P-value between the AUC values of RsF + DLF and DLF was 0.036, which is less than 0.050. However, the P-value between RsF + DLF and RsF was 0.214, and the P-value between DLF and RsF was 0.896. In theory, feature information from more dimensions should improve model performance. However, the difference between the integrated model and the radiomics model was not statistically significant; only the difference between the integrated model and the deep learning model was. This deviation from the theoretical expectation could be caused by the small amount of data. More data will be collected in the future to verify this hypothesis. The experimental results may therefore not be sufficient to demonstrate that a feature set integrating radiomics and deep learning features is more effective than either feature type alone in predicting response to TACE treatment. Nonetheless, the integrated model still achieved good performance in predicting TACE treatment response from preoperative MRI scans of HCC patients.

Additionally, we formulated the predictive score equation based on the 20 features of the integrated model with the highest AUC, of which 10 were radiomics features and 10 were deep learning features. In our integrated model, both the deep learning and radiomics features contribute to the prediction of TACE treatment efficacy. Among the deep learning features, deep_feature_151 and deep_feature_253 are strongly associated with efficacy prediction (P = 0.019 and P = 0.010, respectively), followed by deep_feature_387 (P = 0.048) and deep_feature_757 (P = 0.047). The number in the feature name is the index of the neuron in the fully connected layer. For example, deep_feature_151 is taken from the 151st neuron of the fully connected layer. Among the radiomics features, wavelet-HLH_firstorder_Skewness and lbd-3D-m2_glszm_GrayLevelVariance have the strongest association with efficacy prediction (P = 0.007 for both), followed by lbd-3D-k_ngtdm_Contrast (P = 0.036).

Wavelet filters are mainly used to optimize radiomics features and can quantify the heterogeneity of tumors at different scales. Previous studies have demonstrated that wavelet features have a strong ability to predict treatment outcomes and can be important predictors when constructing radiomics features [47,48,49], which is consistent with the results of our study. First-order radiomics features reflect the distribution of voxel intensities within the tumor ROI. A significant difference is observed between the wavelet-HLH_firstorder_Skewness values of the PD and N-PD cohorts (Fig. 5, left panel), and the median in the PD group is higher than that in the N-PD group. This result suggests that voxel intensity information in the tumor area may be related to the TACE treatment response and that with stronger voxel intensity, the therapeutic response may be weaker. The neighborhood gray tone difference matrix (NGTDM) textural features describe the differences between the gray value of a voxel and the average gray value of its neighboring voxels [50]. NGTDM features have been applied to tumor heterogeneity analysis [51,52,53]. In our results, the two response groups are mostly distributed on opposite sides of zero for the lbd-3D-k_ngtdm_Contrast feature (Fig. 5, middle panel). This may indicate that NGTDM texture features are a good predictor that can provide doctors with more information about TACE treatment response. The Gray Level Size Zone Matrix (GLSZM) quantifies the number of connected voxels that share the same gray level intensity in the tumor area [54]. GLSZM features are particularly efficient at characterizing texture homogeneity, non-periodicity or speckle-like textures, and could provide better characterizations than granulometry for medical image analysis [55,56,57,58].
Figure 5 (right panel) shows that the PD group has a lower median for the lbd-3D-m2_glszm_GrayLevelVariance feature than the N-PD group. The variance in gray level intensities for tumor zones is significantly different between the PD and N-PD groups, which indicates that this feature may have an impact on TACE response prediction. The Rs_DL Score is statistically significant (P = 0.005), as seen in Table 4. Figure 6 shows the distribution of the Rs_DL Score in the N-PD and PD groups. The gap between the distributions is clear: the scores in the N-PD group are mainly concentrated between −0.5 and 0.0, while the scores in the PD group are above 0. This indicates that the two groups differ in their Rs_DL Score, which can therefore be used to predict the therapeutic effect of TACE.

Fig. 5 The violin plot of three radiomics features value distributions between the PD response and N-PD response

Fig. 6 The violin plot of Rs_DL Score features value distributions between the PD response and N-PD response

However, there are some limitations to this study. First, it is a retrospective study from a single center without additional validation by other hospitals. In addition, the number of HCC cases was limited, especially the proportion of patients with treatment responses. Therefore, the validity of our research results may have been impaired. In future research, more cases and multi-center research are needed. Second, although we have provided the features that contribute to predicting TACE efficacy in the combined model, only the radiomics features can be interpreted, while interpretation of the deep learning features still poses a major problem. The possible correlation between the tumor biological mechanisms and these deep learning features will be investigated from molecular, protein and genetic levels in future work. Also, due to data collection issues, only the T2-WI sequence of MRI scans was available for analysis in this study, which may have omitted information from other kinds of sequences.

In conclusion, we proposed a predictive model to integrate the radiomics and deep learning features for quick and accurate prediction of TACE treatment response for unresectable HCC patients before TACE therapy. The experiment results demonstrate that although a feature set that combines radiomics and deep learning features tends to be effective in predicting response to TACE treatment, further validation studies are needed using multi-center data to support this study in the future.