1 Introduction

Colorectal cancer (CRC) is the second leading cause of cancer death, with mortality approaching 35% among CRC patients [1]. In recent years, new therapeutic approaches have been introduced into clinical practice but, given the high mortality, genomic-driven drugs are under evaluation. In particular, the advent of immunotherapy has represented a promising approach for many tumours (e.g., melanoma, non-small cell lung cancer), but clinical trials in CRC have revealed that patients do not benefit from such therapeutic approaches. The ability to molecularly classify this tumour could lead to a better assessment of the regimen to be administered. Many research groups are focusing on these aspects, and a multilayer approach could lead to a substantial improvement in clinical outcomes.

The advent of different computational models makes it possible to perform multilayer analyses, including in-depth study of histological images. Such an approach relies on the automatic assessment of tissue types.

The classical pipeline for building an image classifier involves handcrafted feature extraction followed by statistical classification. Typical choices were Support Vector Machines (SVMs) or Artificial Neural Networks (ANNs), possibly preceded by preprocessing and dimensionality reduction stages.

Linder et al. addressed the problem of classifying epithelium versus stroma in digitized tumour tissue microarrays (TMAs) [2]. The authors exploited Local Binary Patterns (LBP), together with a contrast measure C (they referred to their union as LBP/C), as input for their classifier, an SVM. Finally, they compared the LBP/C classifier with classifiers based on Haralick texture features and on Gabor filtered images, and the LBP/C classifier proved to be the best model (area under the Receiver Operating Characteristic – ROC – curve of 0.995).

In the context of colorectal cancer histology, the multi-class texture analysis work of Kather et al. [3] is worthy of note. It combined different features (treating the original RGB images as grey-scale ones), namely: lower-order and higher-order histogram features, Local Binary Patterns (LBP), the Grey-level co-occurrence matrix (GLCM), Gabor filters and Perception-like features. As statistical classifiers, they considered: 1-nearest neighbour, linear SVM, radial-basis-function SVM and decision trees. Even though they obtained good performance, by repeating the experiment with the same features we noted that adopting the red channel leads to better results than using grey-scale images (data not shown). This consideration no longer holds once stain normalization techniques are applied.

Later works exploited the power of Deep Learning (DL), in particular of Convolutional Neural Networks (CNNs), for classifying histopathological images.

Kather et al. employed CNNs to perform automated tissue segmentation of Hematoxylin-Eosin (HE) images from 862 whole slide images (WSIs) of The Cancer Genome Atlas (TCGA) cohort. Then, they exploited the output neuron activations of the CNN to calculate a "deep stroma score", which proved to be an independent prognostic factor for overall survival (OS) in a multivariable Cox proportional hazard model [4].

Kassani et al. proposed a Computer-Aided Diagnosis (CAD) system, composed of an ensemble of three pre-trained CNNs – VGG-19 [5], MobileNet [6] and DenseNet [7] – for binary classification of HE-stained histological breast cancer images [8]. They concluded that their ensemble performed better than single models and widely adopted machine learning algorithms.

Bychkov et al. introduced a DL-based method for directly predicting patient outcome in CRC, without intermediate tissue classification. Their model extracts features from tiles with a pretrained model (VGG-16 [5]) and then applies an LSTM [9] to these features [10].

In this work, we extensively compare three classes of approaches for the multi-class tissue classification task: (1) extraction of handcrafted features followed by a statistical classifier; (2) extraction of deep features under the transfer learning paradigm, then exploiting ANN or SVM classifiers; (3) fine-tuning of deep classifiers. We also propose a feature combination methodology in which we concatenate the features of different pretrained deep models, and we investigate the effect of dimensionality reduction techniques. We identify the best feature set and classifier to perform inference on external datasets. Finally, we investigate the explainability of the considered models by looking at t-distributed Stochastic Neighbour Embedding (t-SNE) plots and saliency maps generated by Gradient-weighted Class Activation Mapping (Grad-CAM).

2 Materials

The effort of Kather et al. resulted in the development and diffusion of different datasets suitable for multi-class tissue classification [3, 4, 11, 12].

The datasets described in [3, 11] comprise N = 5,000 histological images of size 150 × 150 pixels (corresponding to 74 × 74 μm).

The dataset introduced in [4, 12] contains N = 100,000 image patches from HE-stained histological images of human colorectal cancer (CRC) and normal tissue. Images have a size of 224 × 224 pixels, corresponding to 112 × 112 μm. This dataset is the designated training set in their experiments, whereas a dataset of N = 7,180 patches has been used as validation set. We denote the former with T and the latter with V1. For the training set, they provide both the original version and a version normalized with Macenko's method [13].

In order to harmonize some differences between the class names of the two collections, we considered the following classes:

  • TUM, which represents tumour epithelium.

  • MUSC_STROMA, which represents the union of SIMPLE_STROMA (tumour stroma, extra-tumour stroma and smooth muscle) and COMPLEX_STROMA (single tumour cells and/or few immune cells).

  • LYM, which represents immune-cell conglomerates and sub-mucosal lymphoid follicles.

  • DEBRIS_MUCUS, which represents necrosis, hemorrhage and mucus.

  • NORM, which represents normal mucosal glands.

  • ADI, which represents adipose tissue.

  • BACK, which represents background.

Starting from the dataset of [11], SIMPLE_STROMA and COMPLEX_STROMA have been merged, resulting in a MUSC_STROMA class. For the dataset of [12], the DEB and MUC classes have been merged, resulting in a DEBRIS_MUCUS class, and the MUS and STR classes have been merged, resulting in a MUSC_STROMA class. Of note, the merging procedure has been performed according to the class definitions of the training dataset T. After the merge, our training dataset is reduced to N = 77,805 images: we kept half of the images of each of the two combined classes, thus maintaining the balance across classes. After the same merge, the external validation set V1 contains N = 5,988 images.

An additional dataset of N = 5,984 HE histological image patches, provided by IRCCS Istituto Tumori Giovanni Paolo II, has been used as another independent test set. The institutional Ethics Committee approved the study (Prot n. 780/CE). This dataset, hereinafter denoted with V2, has been made publicly available [14] in order to ease the development and comparison of computational techniques for CRC histological image analysis. Patches were assigned to the classes listed above by an expert pathologist, thereby establishing the ground truth of the V2 dataset.

Some test images from both the V1 and V2 datasets can be seen in Fig. 1.

Fig. 1. Test dataset example patches for each class. Left: V1 dataset; right: V2 dataset. All images have been pre-processed with Macenko's method.

3 Methods

3.1 Image Features

Different features can be extracted from the single-channel histogram of an image. In [3], the authors considered only the grey-scale version of the image, but other color channels may also be considered. For HE images, the red channel can be more informative.

Following the convention used in [3], we consider two sets of features from the histogram: histogram-lower, which contains the mean, variance, skewness and kurtosis indices, and histogram-higher, composed of the image moments from the 5th to the 11th.
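
As an illustration, the following sketch (our own, not the authors' code; we assume the higher-order moments are central moments) computes both feature sets from a single channel:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def histogram_features(channel: np.ndarray) -> np.ndarray:
    """Lower-order (mean, variance, skewness, kurtosis) plus higher-order
    (5th to 11th central moments) histogram features of a single channel."""
    x = channel.ravel().astype(np.float64)
    lower = [x.mean(), x.var(), skew(x), kurtosis(x)]
    mu = x.mean()
    higher = [np.mean((x - mu) ** k) for k in range(5, 12)]  # 7 moments
    return np.asarray(lower + higher)  # 4 + 7 = 11 features

# Usage, e.g. on the red channel of an RGB patch:
# feats = histogram_features(rgb_patch[:, :, 0])
```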

Another set of features used was Local Binary Patterns (LBP). An LBP operator considers the probability of occurrence of all the possible binary patterns that can arise from a neighbourhood of predefined shape and size. A neighbourhood of eight equally spaced points arranged along a circle of radius 1 pixel has been considered. The resulting histogram was reduced to the 38 rotationally-invariant Fourier features proposed by [15], which are frequently used for histological texture analysis. This set of features can be extracted with the MATLAB tool from the Center for Machine Vision and Signal Analysis (CMVS), available at http://www.cse.oulu.fi/CMV/Downloads/LBPMatlab [16, 17].
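
For illustration, a basic rotation-invariant LBP histogram with the same neighbourhood (eight points, radius 1) can be computed with scikit-image, as sketched below; note that this plain histogram is not the 38-element LBP-HF descriptor of [15], which requires the CMVS toolbox:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(channel: np.ndarray, P: int = 8, R: int = 1) -> np.ndarray:
    """Normalized histogram of rotation-invariant uniform LBP codes."""
    codes = local_binary_pattern(channel, P, R, method="uniform")
    n_bins = P + 2  # P+1 uniform patterns plus one bin for non-uniform codes
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist
```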

Kather et al. also considered the Grey-level co-occurrence matrix (GLCM); in particular, they considered four directions (0°, 45°, 90° and 135°) and five displacement vectors (from 1 to 5 pixels). To make this texture descriptor invariant with respect to rotation, the GLCMs obtained from all four directions were averaged for each displacement vector. From each of the resulting co-occurrence matrices the following four global statistics were extracted: contrast, correlation, energy and homogeneity, as described by Haralick et al. in [18], thereby obtaining 20 features for each input image.
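
A minimal sketch of this computation with scikit-image (assuming an 8-bit single-channel input) follows:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(channel: np.ndarray) -> np.ndarray:
    """20 GLCM features: 4 Haralick statistics x 5 displacements,
    with GLCMs averaged over the four directions (channel: uint8)."""
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]  # 0, 45, 90, 135 degrees
    distances = [1, 2, 3, 4, 5]
    glcm = graycomatrix(channel, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    glcm_ri = glcm.mean(axis=3, keepdims=True)  # average over directions
    props = ["contrast", "correlation", "energy", "homogeneity"]
    return np.concatenate([graycoprops(glcm_ri, p).ravel() for p in props])
```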

As the last set of features, Kather et al. considered Perception-like features, i.e., features based on image perception. Tamura et al. [19] showed that the human visual system discriminates texture through several specific attributes, which were later refined and tested by Bianconi et al.; the features considered in [3] were the following five: coarseness, contrast, directionality, line-likeness and roughness [20].

This procedure leads to the extraction of a feature vector with 74 elements (4 lower-order histogram + 7 higher-order histogram + 38 LBP + 20 GLCM + 5 perception-like features).

Fig. 2. Stain normalization with Macenko's method [13] and tiling, analogous to Kather et al. [4, 12]. This procedure has been followed to generate the test patch images for our dataset.

3.2 Stain Normalization

Stain normalization is necessary due to pre-analytical biases specific to different laboratories, which can lead to misclassification of images by ANNs or CNNs. Techniques for handling stain color variation can be grouped into two categories: stain color augmentation, which mimics a vast assortment of realistic stain variations during training, and stain color normalization, which aims to match training and test color distributions in order to reduce stain variation [21].

In order to make images coming from different datasets comparable, we exploited Macenko's normalization method [13], as reported by Kather et al. [4, 12].

The procedure adopted for the stain normalization is depicted in Fig. 2.
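
As a hedged illustration of this step, the third-party staintools package provides a Macenko normalizer (the exact API may vary across versions; the file names below are placeholders):

```python
import staintools

# Reference tile defining the target stain appearance (placeholder paths).
target = staintools.read_image("target_tile.png")
source = staintools.read_image("tile_to_normalize.png")

normalizer = staintools.StainNormalizer(method="macenko")
normalizer.fit(target)                      # estimate the target stain matrix
normalized = normalizer.transform(source)   # map source to target colours
```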

3.3 Deep Learning Models

Deep Learning refers to the adoption of hierarchical models to process data, extracting representations with multiple levels of abstraction [22]. Convolutional Neural Networks (CNNs) have a prominent role in image recognition problems, and a huge body of literature regards the construction of DL-based classifiers for images [5, 23,24,25,26,27,28,29]. Examples of applications to histological images include classification of breast biopsy HE images [30] and semantic segmentation, detection and instance segmentation of glomeruli from kidney biopsies [31, 32].

An important concern about CNNs is that training a network from scratch requires very large amounts of data. One interesting possibility is offered by transfer learning, a methodology for training models using data that are more easily collected than the data of the problem under consideration. We refer to [33] for a comprehensive survey of the transfer learning paradigm; here we consider models pre-trained on ImageNet as feature extractors for histological images, as done also in [10, 34,35,36,37,38]. The paradigm of DL-based transfer learning has led to the term Deep Transfer Learning [39]. It has been noted that, although histopathological images differ from RGB images of everyday life, they share common basic structures such as edges and arcs [40]. Early layers of a CNN capture this kind of elementary pattern, so transfer learning may be useful also for digital pathology images.
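
As a minimal sketch of this setting (using torchvision; preprocessing constants are the standard ImageNet ones), ResNet18 can be turned into a 512-dimensional feature extractor by dropping its final fully connected layer:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
extractor = torch.nn.Sequential(*list(model.children())[:-1])  # drop fc layer
extractor.eval()

preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    x = preprocess(tile).unsqueeze(0)   # `tile`: a 224x224 RGB PIL image
    features = extractor(x).flatten(1)  # shape (1, 512)
```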

One potential drawback of deep feature extractors is the high dimensionality of their output. Cascianelli et al. attempted to solve this problem by considering different dimensionality reduction techniques [38]. We investigated combinations of deep features extracted by pretrained models, also considering different levels of compression after applying Principal Component Analysis (PCA). In particular, we concatenated the features coming from the ResNet18, GoogLeNet and ResNet50 models, obtaining a feature set of 3584 elements. Then, different numbers of features, ranging from 128 to 3584, have been considered for training our classifiers. To ensure that deep features are relevant for the problem under consideration, we compared them to smaller sets of handcrafted features. In particular, we checked: (1) that they tend to map similar tissue types into defined regions of the feature space, by considering a 2D scatter plot after applying t-SNE [41] to the deep and handcrafted features; (2) that they lead to the training of an accurate model, without overfitting problems; (3) the saliency maps highlighted by Grad-CAM [42]. t-SNE can both capture the local structure of high-dimensional data, such as image features in this case, and reveal global structure at several scales (e.g. the presence of clusters). Grad-CAM is a class-discriminative localization technique that makes CNN-based models more transparent by producing visual explanations.
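
A minimal sketch of the combination step (assuming the per-tile features have already been extracted as above) is:

```python
import numpy as np
from sklearn.decomposition import PCA

# f18, fgoog, f50: arrays of shape (n_tiles, 512), (n_tiles, 1024),
# (n_tiles, 2048), extracted with the respective pretrained models.
combined = np.hstack([f18, fgoog, f50])   # (n_tiles, 3584)

pca = PCA(n_components=256)               # one of the tested sizes (128-3584)
compressed = pca.fit_transform(combined)  # (n_tiles, 256)
```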

We considered three different deep network topologies: ResNet18, ResNet50 [28] and GoogLeNet [25]. For each architecture, we compared the ImageNet [43] pretrained version (in which case the network works only as a feature extractor) with the version fine-tuned on our data.
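
A minimal sketch of the fine-tuned setting follows (hyperparameters are illustrative assumptions, not the values used in the paper):

```python
import torch
import torchvision.models as models

NUM_CLASSES = 7  # TUM, MUSC_STROMA, LYM, DEBRIS_MUCUS, NORM, ADI, BACK

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, NUM_CLASSES)  # new head

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
# ...a standard training loop over mini-batches of normalized tiles follows.
```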

Fig. 3. Training procedure: starting from a subset of the dataset T, we compared three kinds of models; 10-fold cross-validation was performed to find the best model. Validation procedure: we externally validated the models found as best in internal cross-validation on two datasets, V1 and V2. T refers to the training set from Kather et al.; V1 is the test set from Kather et al.; V2 is the test set from IRCCS Istituto Tumori Giovanni Paolo II.

4 Experimental Results

We considered three types of experiments: (1) training of ANN and SVM classifiers after handcrafted feature extraction; (2) training of ANN and SVM classifiers after deep feature extraction with models pretrained on ImageNet; (3) fine-tuning of deep classifiers. The workflow is depicted in Fig. 3. For the ANN and SVM classifiers trained after handcrafted or pretrained deep feature extraction, we performed 10-fold cross-validation (90% train, 10% test for each iteration) on the training dataset T, after having pre-processed it as described in Sect. 2.
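
A minimal sketch of this cross-validation for one feature set and an SVM classifier (hyperparameters are illustrative):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (n_tiles, n_features) feature matrix; y: (n_tiles,) tissue class labels.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```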

Table 1. Results of 10-fold cross-validation on the train dataset T. Performances for SVM and ANN are expressed as accuracy: mean ± std.

Then, we tested the best classifier of each category on the validation datasets V1 and V2. Performances reported in Tables 1, 2, 3 and 4 are assessed in terms of accuracy.

Table 2. Results on the V1 dataset. Performances are reported as accuracy measure.
Table 3. Results on the V2 dataset. Performances are reported as accuracy measure.
Table 4. Proposed methodology. The feature set is given by the concatenation of pretrained ResNet18, GoogLeNet and ResNet50 features, considering different numbers of principal components after PCA. Results are shown on both the V1 and V2 datasets. Percentages represent accuracies.

For the best classifier of each category (handcrafted features, pretrained deep features, fine-tuned deep model), we computed the confusion matrix to assess how errors are distributed across the different classes. Confusion matrices are reported in Tables 5, 6 and 7.

Table 5. Confusion matrix on the V2 dataset for the best handcrafted model.
Table 6. Confusion matrix on the V2 dataset for the best pre-trained deep features model.
Table 7. Confusion matrix on the V2 dataset for the best deep fine-tuned model.

4.1 Discussion and Explainability

Looking at the confusion matrices, we observed that handcrafted features do not generalize well on our dataset, whilst deep features are better suited to the task. In particular, the model trained with handcrafted features is not able to recognize any LYM tissue in our V2 dataset. For the proposed method, which combines features of different deep architectures, we showed that PCA can be a useful tool for reducing dimensionality without incurring a decrease in accuracy. Among the pretrained models on the V1 dataset, the proposed methodology slightly outperforms the best pretrained model alone, ResNet18, while also using fewer features. For the SVM classifiers on the V1 dataset, using more than 256 features after PCA does not yield measurable improvements.

Fig. 4. t-SNE of the best classifiers' features: a) t-SNE of the fine-tuned ResNet50 features on the V1 dataset; b) t-SNE of the pretrained ResNet18 features on the V2 dataset.

We observed that frequent misclassification errors involved NORM and MUSC_STROMA patches being predicted as TUM or DEBRIS_MUCUS.

In order to assess the explainability of the obtained results, we considered different techniques. First, we looked at the t-SNE embeddings, to understand whether deep features, including those obtained by pre-training on ImageNet, are meaningful for the problem under consideration. Figure 4a shows that clusters are much better defined for the V1 dataset. It is important to highlight that Kather et al. considered tiles clearly belonging to only one class, whereas we also allowed the inclusion of patches that are more difficult to classify.

A sub-cluster of TUM tiles can be seen within the MUSC_STROMA cluster. As stated above, MUSC_STROMA derives from the merging of the simple and complex stroma classes, the latter also including sparse tumour cells. Thus, the TUM sub-cluster and the misclassification can be explained both by the class definition and, from a biological perspective, by the fact that tumour tissue invades the surrounding stroma. Moreover, it can be observed in Fig. 4b that the NORM cluster includes a DEBRIS_MUCUS sub-cluster. Such a result makes sense because in this case mucus containing exfoliated epithelial cells is mainly produced by the glands of the normal tissue component at the periphery of the tissue sample.
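
For reference, a plot in the style of Fig. 4 can be produced with scikit-learn and matplotlib (a sketch, assuming `features` and `labels` are already available):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# features: (n_tiles, n_dims) deep feature matrix; labels: class indices.
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4, cmap="tab10")
plt.title("t-SNE of deep features")
plt.show()
```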

Then, we inspected the activations of the fine-tuned deep models using the Grad-CAM method [42]. Figures 5a-c and 5e-g show the highlighted regions of sample images from the V1 and V2 datasets. Figures 5d and 5h represent patches which have not been included in the V2 dataset since they were not clearly classifiable. In particular, Fig. 5d contains both the MUSC_STROMA and TUM classes, whereas Fig. 5h contains both DEBRIS_MUCUS and NORM.
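
A self-contained sketch of Grad-CAM for a fine-tuned ResNet follows (our own re-implementation of the technique in [42], using PyTorch hooks; for ResNets, `target_layer` would typically be `model.layer4`):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, class_idx, target_layer):
    """x: (1, 3, H, W) preprocessed tile; returns an (H, W) heatmap in [0, 1]."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    model.eval()
    score = model(x)[0, class_idx]  # score of the class to explain
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    # Weight each feature map by its globally averaged gradient, then ReLU.
    weights = grads[0].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts[0]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                        align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()
```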

Fig. 5. Grad-CAM of the deep fine-tuned classifiers on the test sets; a, c, f are from dataset V1 and b, e, g from dataset V2; d, h are patches not belonging to only one class. Labels are those output by the classifier.

5 Conclusions and Future Work

In this work, three different methods have been compared for multi-class histological tissue classification in CRC. The most promising approach proved to be the extraction of pretrained ResNet18 deep features from tiles, combined with classification through an SVM; in this way the classifier is able to generalize well on external datasets with good accuracy.

We also investigated the explainability of our trained deep models, observing that some misclassification issues are related to the biology of CRC. Multi-class tissue classification is a useful task in CRC histology, in particular for exploiting a multi-layer approach including genomic data (mutational and transcriptional status).

The present paper can be considered a proof of concept: the multi-class tissue classification of digital histological images could not only be extended to other malignancies, but also be considered a preliminary step to explore, e.g., the relationship between the tumour, its microenvironment and genomic features.