Introduction

Colorectal cancer remains the second leading cause of cancer death worldwide despite expansion of screening programs and advances in treatment [1, 2]. Significant mortality is attributable to disease recurrence in colorectal cancer survivors, which occurs in an estimated 29–63% of patients diagnosed with stage II–III disease [3]. The American Society of Clinical Oncology has highlighted the importance of identifying new prognostic factors associated with disease recurrence in order to improve surveillance guidelines [4]. Identifying such risk factors may allow clinicians to tailor surveillance strategies for patients at higher risk of recurrence, particularly among those with stage II–III disease. Advances in computing power have given radiomics, the process of using quantitative image features as clinical data, the potential to identify metastatic recurrence earlier, which could improve outcomes and help guide post-treatment surveillance strategies [5].

Radiomic features of the liver, the most common site of distant metastatic spread in patients with colorectal cancer, may provide information about underlying physiology [6]. Hepatic textural features at staging portal venous-phase contrast-enhanced computed tomography (CT) appear to correlate with hepatic perfusion values in colorectal cancer patients [7, 8]. For example, entropy, a textural feature that increases with overall image “disorder” [9], appears to be inversely correlated with hepatic perfusion indices [8]. Such physiologic measurements may be altered in the setting of early occult liver metastases, where reduced portal venous blood flow and increased hepatic arterial blood flow have been observed in mouse and rat models [10, 11].

Given that CT texture features may reflect changes in hepatic perfusion, monitoring changes in these features could potentially alert radiologists to the imminent development of hepatic metastases. In other words, micro-metastatic hepatic disease may be present and detectable by CT texture features before it becomes visible at imaging as discrete lesions. However, to our knowledge, no studies have reported whether changes in hepatic texture features at CT during post-treatment surveillance of colorectal cancer patients portend the subsequent development of hepatic metastases. The purpose of this study was twofold. The first aim was to determine whether CT texture changes of the hepatic parenchyma are present in patients shortly before the development of hepatic metastases, compared with control patients with colorectal cancer who did not develop hepatic metastases, and whether these changes can predict their development. The second aim was to determine whether such predictive hepatic texture changes are present at the initial staging CT.

Materials and methods

This study was HIPAA-compliant and IRB-approved; the need for informed consent was waived.

Patient population

A flowchart of patient participation and sample selection is given in Fig. 1. A database of 923 patients treated with FOLFOX, FOLFOX + bevacizumab, FOLFIRI or FOLFIRI + bevacizumab at our institution from 2003 to 2016 was searched to identify potential patients with colorectal cancer. Colorectal cancer cases were identified through manual review of pathology reports in the electronic health record. Date of diagnosis was taken as the date of colonoscopy biopsy or surgical pathology report of the resected tumour. TNM staging information was collected from the surgical pathology report of the primary tumour resection. Imaging follow-up was obtained through the picture archiving and communication system (PACS). Staging CT images were reviewed by three board-certified abdominal radiologists (DK, MGL, PJP) to exclude the presence of liver metastases at the time of initial cancer diagnosis. Then, the earliest CT images demonstrating visible hepatic metastases were identified. We then searched for the most recent surveillance CT study prior to the development of CT-detectable liver metastases (no earlier than 1 year prior to demonstrable metastatic disease). Sixty control patients were selected such that the distribution of TNM stages at diagnosis and timing of surveillance scans were similar to the case cohort, yielding a final sample size of 89. Chart review was performed by a single reader (SJL).

Fig. 1 Flow diagram for case and control patient selection

CT imaging

All texture measurements were performed on anonymised contrast-enhanced portal venous-phase exams. CT imaging studies were performed on a variety of scanners (predominantly GE, but also Siemens and Toshiba in a small minority of cases). Seven scans were performed at 140 kV, two were performed at 130 kV and one was performed at 100 kV; all others were performed at 120 kV. Our standard IV contrast protocol consists of weight-based contrast dosing (range, 80–150 ml) with iohexol (300 mg I/ml), followed by a 40-ml saline chaser, all at 3 ml/s. Portal venous phase imaging is initiated by a liver enhancement threshold of 50 HU, typically 60–70 s after initiation of contrast injection. The CT series were reconstructed at a slice thickness of 5 mm at 3-mm intervals.

Texture analysis

A single image slice at the level of the porta hepatis was selected for texture analysis of the liver parenchyma, similar to prior studies (Fig. 2) [12]. All images were reviewed and appropriate slices were selected by an abdominal radiologist. These anonymised images were then uploaded to a commercially available texture analysis program (TexRAD Ltd, Somerset, UK). A region of interest was manually drawn around the margin of the liver parenchyma by a single reader (SJL), excluding the large proximal branches of the portal vein (Fig. 2). To perform texture analysis, the software uses a filtration–histogram method in which an initial filtration step is performed that highlights image features of a specified size, followed by histogram analysis of the filtered image. The initial filtration step uses a Laplacian of Gaussian (LoG) spatial band-pass filter to selectively extract features of different sizes. Informed in part by prior work, the following histogram-based texture features were calculated to characterise intrahepatic heterogeneity and complexity:

Standard deviation of pixel attenuation histogram (SD)

Entropy of pixel attenuation histogram

Skewness of pixel attenuation histogram

Kurtosis of pixel attenuation histogram

Fig. 2 Images showing a the region of interest utilised for CT texture analysis, and the subsequent output from b fine, c medium and d coarse filtering of the image. The anatomic level shown was selected for segmentation in all patients, performed on the last negative CT prior to the development of hepatic metastases for cases, and on a matched CT for controls

The number of variables was deliberately limited to minimise type I error (multiple hypothesis testing). Each of these features was calculated at six filter sizes, ranging from fine (spatial scaling factor [ssf] = 0–2) through medium (ssf = 3, 4) to coarse (ssf = 5, 6), yielding a total of 24 texture features.
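The histogram step of the filtration–histogram method can be illustrated with a minimal sketch. It assumes population (divide-by-n) moments, Shannon entropy in bits over integer-binned values, and excess kurtosis; the commercial software's exact definitions are not given in the text and may differ. The Laplacian of Gaussian filtration step is omitted, and the function and constant names are hypothetical.

```python
import math
from collections import Counter

N_FEATURE_TYPES = 4   # SD, entropy, skewness, kurtosis
N_FILTER_SCALES = 6   # six filter sizes, as stated in the text


def histogram_features(pixels):
    """Return SD, entropy, skewness and kurtosis for a list of pixel values.

    Assumed definitions (not confirmed by the source): population moments,
    entropy in bits over integer-rounded values, excess kurtosis.
    """
    n = len(pixels)
    mean = sum(pixels) / n
    # Central moments, dividing by n
    m2 = sum((p - mean) ** 2 for p in pixels) / n
    m3 = sum((p - mean) ** 3 for p in pixels) / n
    m4 = sum((p - mean) ** 4 for p in pixels) / n
    sd = math.sqrt(m2)
    # Shannon entropy of the histogram of integer-rounded values
    counts = Counter(round(p) for p in pixels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    skewness = m3 / sd ** 3 if sd > 0 else 0.0
    kurtosis = m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0
    return {"sd": sd, "entropy": entropy,
            "skewness": skewness, "kurtosis": kurtosis}


# Toy region of interest with two equally frequent values (hypothetical data)
features = histogram_features([0, 0, 1, 1])
print(features)
print("total features:", N_FEATURE_TYPES * N_FILTER_SCALES)
```

In a full pipeline, this function would be applied once per filtered image, giving four values at each of the six filter sizes.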

Statistical analysis

Texture features for cases and controls were separately compared at staging and surveillance CT with Mann–Whitney U tests. Bonferroni correction for type I error was applied. Hypothesis tests for 24 texture features were performed at both staging and surveillance CT; the corrected threshold for statistical significance was p = 0.05/48.
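The corrected threshold follows directly from the number of tests. The short sketch below reproduces that arithmetic; the helper names and the two example p values are hypothetical and serve only to show how the threshold is applied.

```python
ALPHA = 0.05
N_FEATURES = 24
N_TIME_POINTS = 2  # staging CT and surveillance CT


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def is_significant(p_value, threshold):
    """True if a single test survives the corrected threshold."""
    return p_value < threshold


threshold = bonferroni_threshold(ALPHA, N_FEATURES * N_TIME_POINTS)
print(f"corrected threshold: {threshold:.5f}")

# Hypothetical p values, for illustration only
for p in (0.01, 0.0005):
    print(p, is_significant(p, threshold))
```

With 48 tests, a result that is nominally significant at p = 0.01 does not pass the corrected threshold of roughly 0.001.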

For Cox models predicting freedom from metachronous metastatic liver disease, time-to-event was defined as the number of months between the date of tissue diagnosis of colorectal cancer and the date of CT-detectable liver metastases. Control patients were right-censored at their last follow-up date at our institution, or at date of death from any cause (if recorded). For models predicting overall survival, time-to-event was defined as the number of months between date of tissue diagnosis and date of death from any cause; control patients in overall survival models were right-censored at their last follow-up date at our institution. We also noted the use of oxaliplatin in each patient’s chemotherapy regimen. Colorectal cancer patients treated with oxaliplatin frequently develop sinusoidal obstruction syndrome, which affects hepatic haemodynamics and thus may affect texture measurements. We tested for a difference in proportion of patients receiving oxaliplatin between the liver metastases group and control group with a Pearson’s chi-squared test.
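The time-to-event definition for the liver metastasis models can be sketched as follows. This is an illustration under stated assumptions, not the study's code: months are approximated as 365.25/12 days, a recorded death date is taken as the censoring date in preference to last follow-up, and all function names and dates are hypothetical.

```python
from datetime import date

DAYS_PER_MONTH = 365.25 / 12  # assumption: average month length


def months_between(start, end):
    """Elapsed time between two dates, in approximate months."""
    return (end - start).days / DAYS_PER_MONTH


def metastasis_free_record(diagnosis, metastasis=None,
                           last_follow_up=None, death=None):
    """Return (time_in_months, event) for the liver metastasis model.

    event = 1 if liver metastases were detected at CT, 0 if right-censored.
    """
    if metastasis is not None:
        return months_between(diagnosis, metastasis), 1
    # Censor at death if recorded, otherwise at last follow-up (assumption)
    censor_date = death if death is not None else last_follow_up
    return months_between(diagnosis, censor_date), 0


# Hypothetical patients
case = metastasis_free_record(date(2010, 1, 1), metastasis=date(2011, 1, 1))
control = metastasis_free_record(date(2010, 1, 1),
                                 last_follow_up=date(2012, 1, 1))
print(case, control)
```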

To test our specific hypotheses that hepatic entropy values are associated with the development of metachronous liver metastases or overall survival, we built separate Cox proportional hazards models using staging and surveillance CT entropy values as the predictor variables. Linear predictor values for each patient were obtained by leave-one-out cross-validation (LOOCV) and were used to create a prognostic index based on their median value. For a Cox model, the linear predictor is the sum of covariate values weighted by the regression coefficients, and it represents the log(relative hazard) compared to a hypothetical observation whose linear predictor value is 0. When the outcome of interest is an adverse event, higher linear predictor values indicate a greater risk of occurrence [13]. Using this paradigm, we assigned patients into “low risk” or “high risk” groups based on the median value of the linear predictors. Kaplan–Meier estimators were then fit to the data using the linear predictor categorisation, and log-rank statistics were calculated to determine if the survival functions were significantly different between the two groups [14].
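The risk-group assignment and Kaplan–Meier step can be sketched in a few lines. The sketch assumes the linear predictors have already been obtained (the cross-validated Cox fit itself is not shown) and that values equal to the median are placed in the low-risk group, a tie-handling choice the text does not specify. All names and data are hypothetical.

```python
from statistics import median


def linear_predictor(coefs, covariates):
    """Sum of covariate values weighted by the regression coefficients."""
    return sum(b * x for b, x in zip(coefs, covariates))


def risk_groups(linear_predictors):
    """Split patients at the median linear predictor (ties go to 'low')."""
    cut = median(linear_predictors)
    return ["high" if lp > cut else "low" for lp in linear_predictors]


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times: follow-up times; events: 1 = event observed, 0 = right-censored.
    """
    curve = []
    surv = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        surv *= 1.0 - n_events / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical linear predictors and follow-up data
groups = risk_groups([0.1, 0.5, -0.2, 0.9])
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(groups)
print(curve)
```

In practice, a curve would be estimated separately for the high-risk and low-risk groups and the two compared with a log-rank test, which is not implemented here.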

For models using all 24 texture features as predictors, we utilised least absolute shrinkage and selection operator (LASSO) regression and LOOCV to perform model selection [15]. LASSO penalises the sum of the absolute values of the regression coefficients, with the strength of the penalty set by a tuning parameter, λ. By varying λ, any number of coefficients in the model may be shrunk to zero, effectively removing them. The optimal value for λ, and thus the number of features included in the final model, is determined by calculating the partial likelihood deviance at each λ value through LOOCV and selecting the model with the minimal value [16]. This process reduces type I error rates by reducing the probability of overfitting the model to the “noise” in the data set, which becomes a significant concern when there is a large number of predictor variables relative to the number of observations [17]. LASSO regression and cross-validation were implemented using the software package glmnet in R [18].
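For reference, the penalised fit can be written out as follows. This assumes the standard Cox partial likelihood without tied event times; the text does not state how ties were handled. The symbols are notation introduced here: x_i is the vector of texture features for patient i, δ_i is the event indicator, R(t_i) is the set of patients still at risk at time t_i, and p is the number of features.

```latex
\ell(\beta) = \sum_{i:\,\delta_i = 1} \left[ x_i^{\top}\beta
  - \log \sum_{j \in R(t_i)} \exp\!\left( x_j^{\top}\beta \right) \right]

\hat{\beta}(\lambda) = \arg\min_{\beta} \left\{ -\ell(\beta)
  + \lambda \sum_{k=1}^{p} \lvert \beta_k \rvert \right\}, \qquad p = 24
```

Implementations differ in how they scale the likelihood term (for example, by the number of observations), which changes the numerical value of λ but not the shape of the solution path. When λ is large enough, every coefficient is exactly zero, which corresponds to a model with no texture features.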

Results

Patient characteristics are listed in Table 1. The median interval between staging and surveillance CT scans from which texture features were calculated was 14 months for patients who developed liver metastases (on the subsequent CT scan) and 22 months for patients who did not develop metastases. For cases, the median interval between surveillance CT and the CT showing hepatic metastases was 6 months (IQR 4.5–11.25 months). The distribution of stage at diagnosis was well-matched between cases and controls, with the majority being stage III (69% for cases, 72% for controls). There was no evidence that the proportion of patients receiving oxaliplatin in their chemotherapy regimen differed between cases and controls (p = 0.81 by chi-squared test). The distributions of entropy values for cases and controls at staging and surveillance CT are shown in Fig. 3.

Table 1 Patient demographics

Fig. 3 Box-and-whisker plots show the liver parenchyma entropy values at both the initial staging and the surveillance CT scans before imminent development of metastasis amongst cases (Y) and matched controls (N). Each plot represents entropy values at different spatial scaling factors (ssf): fine (ssf = 0–2), medium (ssf = 3, 4) and coarse (ssf = 5, 6). The whiskers extend to the furthest measurement within 1.5 × the interquartile range. a, b Staging and surveillance entropy, respectively. Note the lack of separation for entropy values between cases and controls for any filter at either time point. This demonstrates the lack of predictive ability for identifying those patients who subsequently went on to develop identifiable metastases at the next CT scan after the surveillance scan depicted

Table 2 displays the results of Mann–Whitney U tests comparing texture features between cases and controls at staging and surveillance CT. Two of the tests suggested that kurtosis values (ssf = 3, 4) differed between cases and controls (p = 0.01, p = 0.03), but these were no longer statistically significant after applying Bonferroni correction.

Table 2 Results of Mann–Whitney U tests comparing texture features between patients who developed liver metastases and controls

For models using entropy values as predictors, cross-validated Kaplan–Meier curves and corresponding log-rank statistics are shown in Fig. 4. At staging CT, using linear predictor values to categorise patients into groups at high or low risk for liver metastases resulted in inaccurate prognostic predictions; patients categorised as high risk had a significantly greater survival function than those categorised as low risk (p = 0.02). This suggests that leaving out a single patient’s hepatic entropy values during model LOOCV significantly affected the parameter estimates of the Cox model. Incorrect prognostic categorisation was also seen when using entropy values at surveillance (pre-metastasis) CT to predict liver metastasis-free survival and overall survival.

Fig. 4 a Cross-validated Kaplan–Meier curves displaying liver metastasis-free survival for patients split into high- and low-risk groups based on linear predictor values at staging CT. b Liver metastasis-free survival for patients split into high- and low-risk groups based on linear predictor values at the “pre-metastasis” surveillance CT. c, d Overall survival for patients split into high- and low-risk groups based on linear predictor values at staging and pre-metastasis surveillance CT, respectively. Linear predictor values are calculated through leave-one-out cross-validation (LOOCV) of Cox survival models, using hepatic entropy values as covariates. In a, b and d, using linear predictor values to categorise patients into high- and low-risk groups resulted in incorrect predictions of patient prognoses (i.e. “low-risk” groups show decreased metastasis-free survival)

For LASSO models using all 24 texture features as predictors, the partial likelihood deviance was still decreasing when the value for λ shrank all coefficients to 0, which suggests that none of the calculated hepatic texture features from either staging or pre-metastasis CT scans is predictive of future occurrence of liver metastasis or overall survival.

Discussion

In this study of patients with colorectal cancer, we have shown that CT texture features of the hepatic parenchyma shortly before the development of liver metastases are similar to those of matched colorectal cancer patients who do not develop liver metastases. We compared these CT texture features at both the initial cancer staging CT and at similar time intervals after biopsy-confirmed diagnosis (and immediately before CT-detectable metastases in the case cohort). Using survival analysis methods developed for high-dimensional data sets, we showed that these texture features were poor predictors of the occurrence of liver metastases and overall survival. Lastly, we tested the hypothesis that hepatic entropy values are predictive of liver metastases when measured on CT scans just before their development.

We were not able to predict overall survival or time to liver metastasis using survival models based on hepatic entropy values at staging and surveillance pre-metastatic CT. In some cases, using the models to assign patients to high- and low-risk groups resulted in categorisations that were worse than random assignment to those groups.

Prior studies using CT texture analysis have identified either uniformity or entropy of the liver parenchyma as potential predictors for clinical outcomes in patients with colorectal cancer [8, 12]. One study has suggested that entropy values of the liver parenchyma at staging CT may differ between patients who eventually develop liver metastases and those who do not, but this study had a limited sample size (total n = 29) [19]. We sought to test these intriguing hypotheses that entropy values at CT texture analysis might be predictive of either the development of hepatic metastases or overall survival. To improve our investigation, we not only included the initial staging CT but also identified the last nominally normal CT (in terms of hepatic evaluation) before the development of liver metastases. Our results differ from these studies, and this difference may be due to a variety of reasons. Most notably, we used a more robust statistical methodology for feature selection and cross-validation for estimating model test error rates. These methods reduce data overfitting, leading to more conservative estimates of prediction accuracy. They are also less affected by the problems associated with multiple hypothesis testing, such as false positive associations between covariates and outcomes [20].

The statistical problems of multiple comparisons are inherent when performing analyses with large numbers of potential predictor variables, and independent validation studies of previously generated hypotheses are imperative before large-scale prospective studies of texture features’ clinical utility can be carried out [21]. This issue has caused criticism of texture analysis studies in the recently published literature. For example, one group conducted a systematic review of CT texture analysis studies and applied p value correction to their results using the Benjamini–Hochberg method; none of the included studies’ results remained statistically significant after the corrections were applied [22]. Furthermore, they simulated 100 quantitative random variables in place of the original image-derived indices from one of the included studies and found that 10% of these variables were associated with clinical outcomes. The statistical methodologies of future texture analysis studies can be improved and false positive associations reduced by utilising machine learning and data mining techniques developed for analysing data in other “-omics” disciplines [23, 24].

The results of texture analysis studies have been difficult to compare because of variability in acquisition, pre-processing and reconstruction of images. For example, a recent phantom study demonstrated that texture features significantly varied across scanner models, and the authors suggested that researchers should develop a standardised acquisition technique when collecting images to be used in texture analysis studies [21]. Another study involving patients with non-small cell lung cancer tumours showed significant variations in over half (13/23) of the calculated CT texture features after simulating a decrease in tube current (mA), and also when comparing feature values from whole-tumour or largest cross-sectional slices of tumours [25]. Finally, the reconstruction algorithm most likely varied within our sample since our scans were obtained over a wide range of time, which also could have affected the texture parameters used in the study. Overall, these sources of variation make it difficult to validate results of texture studies performed at other institutions. The development of a texture-specific protocol that leads to minimal feature variation is necessary if predictive models are to be validated across different healthcare settings and translated into clinical practice.

There have been increasing efforts to produce guidelines and software that facilitate standardisation, reproducibility and collaboration between radiomics research groups. A recent review has proposed a set of potential guidelines for future texture analysis studies to facilitate reproducibility and pooling of results [21]. Additionally, open-source software platforms have been developed specifically for reproducible feature calculation [26]. These radiomics software platforms were developed with an emphasis on transparency and reproducibility of the computational methods used to perform texture analysis. It is paramount that future radiomics studies are as transparent as possible when describing the methods used to acquire, process and analyse texture features from images.

There are several limitations to our study. Partly because of the retrospective method of data collection, the CT scans included were performed at multiple institutions on a wide variety of scanner models, including those from different manufacturers, in order to reach an adequate sample size. This could have biased the results and could even have masked actual differences between cases and controls. Also, we included patients with stage I–IV disease, although the majority of the patients in this study who developed liver metastases were diagnosed with stage III disease. Three patients had metastases at sites other than the liver at diagnosis, and this may have affected liver texture measurements, which could conceivably have biased the measurements within the case group. However, by design, there was a similar distribution of disease stage at diagnosis within the control group, and this may have sufficiently reduced the bias of the parameter confidence intervals.

In summary, we compared select CT texture features of the liver parenchyma in patients with colorectal cancer who developed metachronous liver metastases against those who did not develop liver metastases. These features were derived from staging CT images and from CT images just before the development of liver metastases. We were unable to demonstrate the predictive utility of the 24 calculated texture features using a machine learning algorithm developed for high-dimensional survival analysis. We also were unable to replicate the utility of entropy alone for predicting overall survival or development of liver metastases. Despite these findings, CT texture analysis still holds great promise for other clinical applications and is being investigated in the setting of response to therapeutics and evolution of the molecular profile of cancers. To advance the field of radiomics into clinical practice, future texture analysis studies must make efforts to reduce feature variation during image acquisition and computation so that generated models are easily replicable across research settings.