
1 Introduction

Traditional techniques for extracting features from a set of images are becoming obsolete and, as a result, works claiming to use them are taken lightly unless the authors make a thorough comparison with deep architectures. It is also true, however, that no one has yet come up with a handcrafted or object-level feature descriptor that outperforms deep neural networks. One thing that is remarkably overlooked is the diversity of the datasets used to produce high accuracies in this domain. Histopathological and medical data in general have challenged researchers around the world due to their heterogeneity and the uncertain nature of their patterns. In this work, we apply some well-known deep learning methods to a classification task using transfer learning and highlight the fact that even the most efficient networks may fail to improve classification accuracy when the dataset involved is highly complex in nature. AlexNet [1], VGG16, and VGG19 [2], three of the most widely used architectures owing to their low computational requirements and high performance, are used in this experimental study.

The histopathological dataset used here is a set of different types of nuclei found in colon cancer, e.g., fibroblasts, inflammatory cells, lymphocytes, neutrophils, eosinophils, etc. The presence of each type of cell nucleus indicates the nature of the cancer [4]. This information is crucial for pathologists to diagnose the severity and type of cancer. The factors that influence the diagnostic decision are the size, structure, density, chromatin texture, and intensity (which depends on the staining dye) of the nuclei present in the affected tissues [5]. These factors change with the type of nucleus, and hence pathologists need to know the type beforehand to make an informed decision. The properties of this dataset are important because such datasets change with the technique used to stain the nuclei and stroma of the tissues; different staining techniques give different color and texture features. To summarize, color and texture features play a very important role in classifying different structures, even more than shape and size features.

With deep learning architectures we can easily extract features after each layer, but the lack of interpretability prevents us from knowing the actual type of features extracted in the process. Therefore, it is difficult to firmly establish whether the quality of the features a deep network extracts will yield good classification performance.

In past years, much work has been done on nuclei classification in histopathological images using handcrafted feature descriptors such as morphological, texture, shape, and color features. The DTW-Radon based shape descriptors of [3] and [6] established a rotation- and scaling-invariant lossless transform for detecting various shape properties in several numeral, character, and symbol datasets. Their methods could be used to detect shape features in nuclei datasets for classification purposes, but since our dataset contains a large number of samples, their method would incur a huge computational cost. Liu et al. [7] tested the various types of features that can be extracted from images and, through feature selection, identified the most relevant features for cell nuclei classification. However, they did not mention the kind of dataset used for extracting features, so it is hard to say whether their findings generalize to all types of datasets. The authors in [8] studied various nuclei classification methods on different cancer types such as prostate, breast, renal clear cell, and renal papillary cell cancer. They showed that the classification methods gave different accuracies on each of these datasets, which proves that there cannot be one definitive method that gives the best results across all types of cancer. They also applied deep learning methods such as LeNet, EncoderNet, and Color-EncoderNet to their datasets, but only Color-EncoderNet gave the best results among the 10 methods tested, and only on 3 of the 4 datasets. These studies hint that even a deep learning framework does not guarantee good results on complex histopathological images. Other effective methods, such as [10] and [11], detect candidate regions of interest and extract features for further processing of biomedical images using handcrafted feature descriptors and local variations within the images. These methods are effective for small datasets with greater inter-class and lower intra-class separability. In our case, where the dataset is complex and has low inter-class separability, relying only on handcrafted features is not a feasible approach. To demonstrate this, we have tested a few state-of-the-art algorithms based on handcrafted feature descriptors and compared the results with deep learning methods. The deep learning method used in [12] could have been an initial approach for classification, but it uses grayscale images and a dataset that is not from the biomedical domain, so its complexity and feature relations deviate strongly from our intensity- and color-centric RGB histopathological dataset.

2 Experiments

We have taken some recent deep architectures and trained them on our dataset to assess their performance.

2.1 Dataset

The image dataset from which nuclei patches are extracted is taken from [9]. The dataset came with annotated nuclei and their locations. We prepared our own data points using the method in [14]. In each image of the dataset (Fig. 1 shows examples), the nuclei were annotated by pathologists, and the center pixel coordinates of the annotated nuclei were recorded for each image along with the corresponding labels. Using this recorded information, a total of 22,444 RGB nuclei patches of height and width 27, centered on the recorded pixel coordinates, were collected. These 22,444 nuclei were segregated into four classes, viz. epithelial nuclei, inflammatory nuclei, fibroblast nuclei, and miscellaneous other types as the fourth category. The number of samples in each class affects the final results by a great margin. In our dataset, class 1 (epithelial) has 7,722 nuclei, class 2 (inflammatory) has 5,712 samples, class 3 (fibroblast) has 6,971 samples, and the miscellaneous category has mixed data totaling 2,039 samples. These raw nuclei images were then divided into train and test sets: 70% of the samples from each class were used for training and the remaining 30% for testing. Each image had to be resized to \(224 \times 224 \times 3\), since this is the input size that the AlexNet, VGG16, and VGG19 architectures expect. Figure 2 shows samples from the nuclei dataset.
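The following is a minimal sketch of this patch-extraction step. The `extract_patches` helper and the coordinate list are illustrative assumptions; only the 27 \(\times\) 27 RGB patch size centered on an annotated nucleus and the resizing to \(224 \times 224\) follow the description above.

```python
import numpy as np
from PIL import Image

PATCH = 27          # patch height/width around each annotated nucleus
HALF = PATCH // 2   # 13 pixels on each side of the center coordinate

def extract_patches(image_path, centers_and_labels):
    """Cut 27x27 RGB patches centered on annotated nucleus coordinates.

    `centers_and_labels` is a hypothetical list of (row, col, label)
    tuples recorded from the pathologists' annotations.
    """
    img = np.asarray(Image.open(image_path).convert("RGB"))
    patches, labels = [], []
    for r, c, label in centers_and_labels:
        # skip nuclei whose 27x27 window would fall outside the image
        if HALF <= r < img.shape[0] - HALF and HALF <= c < img.shape[1] - HALF:
            patch = img[r - HALF:r + HALF + 1, c - HALF:c + HALF + 1]
            # resize to the 224x224x3 input expected by AlexNet/VGG16/VGG19
            patch = np.asarray(Image.fromarray(patch).resize((224, 224)))
            patches.append(patch)
            labels.append(label)
    return np.stack(patches), np.array(labels)
```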

Fig. 1.
figure 1

\(500 \times 500\) H&E stained histology image samples of colorectal adenocarcinomas

Fig. 2.
figure 2

Example of nuclei dataset. Row 1: epithelial nuclei, Row 2: inflammatory nuclei, Row 3: fibroblasts, Row 4: miscellaneous

2.2 AlexNet

AlexNet by Krizhevsky et al. [1], inspired by LeCun et al. [15], is the first such architecture to gain wide popularity, after the 2012 ImageNet challenge. It has 5 convolutional layers followed by 3 fully connected layers. We divided the dataset in a 7:3 ratio for training and testing. Initially, we used a stepwise-varying learning rate, moving from 0.01 down to 0.00001, i.e. 1e−05. With this schedule, the mini-batch accuracy was very low and the overall accuracy of the pre-trained AlexNet was 0, so a varying learning rate did not work with our dataset. Hence, we kept the learning rate constant.
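As an illustration of this final configuration, the sketch below fine-tunes an ImageNet-pretrained AlexNet for the four nuclei classes with a constant learning rate. The choice of optimizer and the specific constant value are assumptions, since the text only states that the rate was kept constant.

```python
import torch
import torch.nn as nn
from torchvision import models

# load an ImageNet-pretrained AlexNet and adapt the head to our 4 nuclei classes
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 4)

# constant learning rate (illustrative value, not taken from the text)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```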

2.3 VGG16

We investigated the effect of increasing the depth of convolutional layers by testing the performance of VGG16 [2] on our dataset. The number of parameters increases with depth, and hence so do the computational requirements. We trained on our dataset using the pre-trained model, because learned features are often transferable to different data, and training then takes less time than training the model from scratch [13]. Training any deep learning architecture from scratch is not feasible in terms of either accuracy or time, since the network has to relearn trivial features like edges and lines, a redundant task if the accuracy does not improve as training progresses. Using the concept of transfer learning propagates the generic features through the model, so that only the features specific to the dataset are learned during training.
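A minimal sketch of this transfer-learning setup for VGG16, under the assumption (not stated explicitly above) that the pre-trained convolutional layers are frozen so that only dataset-specific features are learned in the classifier head:

```python
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained VGG16; the convolutional "features" block already
# encodes generic edges and lines, so we freeze it and reuse those weights
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in model.features.parameters():
    param.requires_grad = False

# replace the final fully connected layer to output our 4 nuclei classes
model.classifier[6] = nn.Linear(4096, 4)
```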

2.4 VGG19

VGG19 [2] is deeper than VGG16, with 19 weight layers (16 convolutional and 3 fully connected), and hence is expected to perform better. Working on this premise, we trained VGG19 on our dataset and made a few observations, which are included in Sect. 3.

Apart from transfer learning, we varied the batch size and the number of epochs to select the optimal hyperparameters. We selected a batch size of 300 and trained the architectures for 100 epochs.
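A sketch of the resulting training configuration; the `train_dataset` object, the device handling, and the loop structure are illustrative assumptions, while the batch size of 300 and the 100 epochs follow the text.

```python
from torch.utils.data import DataLoader

# hyperparameters selected in the text
BATCH_SIZE = 300
EPOCHS = 100

def train(model, optimizer, criterion, train_dataset, device="cuda"):
    """One possible training loop over the 70% training split."""
    loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    model.to(device)
    model.train()
    for epoch in range(EPOCHS):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)  # cross-entropy over 4 classes
            loss.backward()
            optimizer.step()
```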

3 Results and Discussions

We evaluated classifier performance using Precision (Positive Predictive Value, PPV), Recall (True Positive Rate, TPR), F1 score, Accuracy, and the time taken by the three architectures; a sketch of how these metrics can be computed is given after this paragraph. Accuracy and time comparisons among the three architectures are shown in Table 1. With deep architectures, large batches can be parallelized across many machines, reducing training time significantly. A large batch size also reduces the number of parameter updates required to train a model, which in turn reduces model training time. Therefore, we kept the batch size high. To validate this design choice, we varied the batch size, starting from 64. We noticed no change in accuracy, but the training time roughly doubled: each epoch took around 20 min with a batch size of 300, which increased to 42 min when the batch size was reduced to 64 and the number of epochs to 30. This happens because a smaller batch size requires more iterations, and hence more weight updates, per epoch. It was therefore more time-efficient to train our dataset with a larger batch size. A recent article [21] studied the effect of increasing batch sizes on the ImageNet and CIFAR10 datasets using state-of-the-art deep learning models such as ResNet and Inception-ResNet-V2, and confirmed that large batch sizes reduce training times significantly and are preferable to decaying the learning rate when the effect on accuracy is insignificant. Figures 3a, b, and c show the ROC curves of the three networks; each figure has four curves representing the four nuclei classes, i.e., epithelial, fibroblast, inflammatory, and miscellaneous. We also compared the deep architectures with the handcrafted descriptors we used in [14] to measure their retrieval performance on our dataset. Table 5 clearly shows that the handcrafted descriptors are no match for the deep learning algorithms, since there is a huge difference in the classification metrics. While the same descriptors performed well in retrieving CT, MRI, and ultrasound images, as in [16,17,18], they performed very poorly on our dataset when the same feature subset was used for classification. It is notable that the feature descriptors specially designed to retrieve medical images [16,17,18] performed even worse than those designed for retrieving color images [19, 20]. This establishes that color information is an important feature for histopathological images; handcrafted features that work on grayscale images will not give optimal performance on such datasets.
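A minimal sketch, using scikit-learn, of how these metrics and the confusion matrices in Tables 2, 3, and 4 can be derived from the test-split predictions. The variable names and the macro averaging over the four classes are illustrative assumptions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def report(y_true, y_pred):
    """Summarize test-set performance for the four nuclei classes."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro")  # macro-average over the 4 classes
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision, " Recall:", recall, " F1:", f1)
    print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```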

Table 1. Accuracy and time comparison
Table 2. Confusion matrix of AlexNet
Fig. 3.
figure 3

ROC curves of AlexNet, VGG16 and VGG19

We made the following observations from the results obtained.

1. The ROC curves in Fig. 3 show the performance of each architecture: AlexNet (Fig. 3a), VGG16 (Fig. 3b), and VGG19 (Fig. 3c). To compare these curves, we took the True Positive Rate (TPR) at 90% in all three and noted the corresponding False Positive Rate (FPR) for each class; the FPR should be minimized per class (see the sketch after this list for how such readings can be taken). For class 1 (epithelial), the minimum FPR is given by VGG19 and the maximum by AlexNet, whereas for class 4 (miscellaneous), the minimum and maximum FPR are given by VGG16 and VGG19, respectively. For the inflammatory category, the FPR is almost the same across all three methods, and for the fibroblast class, VGG16 gives the minimum FPR and VGG19 the maximum. After analyzing the three ROC curves, we infer that there is no unique pattern by which to declare the best classifier across all 4 classes; the methods show different patterns for each class, which complicates determining the best classifier among the three. However, due to the imbalance in the data samples, the fourth class, which has the fewest samples, is expected to perform the worst. This suggests a criterion: the method that performs best on the minority class should be considered the best classifier. VGG16 has the minimum FPR on the minority class, and hence VGG16 is the best classifier among the three. This is also reflected in the classification metrics in Table 5.

2. Comparing our results with the ImageNet accuracies of these networks (AlexNet: top-1 accuracy 56.1% [1] and top-5 accuracy 80%; VGG16: top-1 and top-5 accuracies of 70.6% and 89.9%; VGG19: 68% and 85.5%, respectively [2]), we see a significant improvement of at least 12% in top-1 accuracy and a 6% increase in top-5 accuracy between AlexNet and VGG19.

3. Hence, from the accuracy changes across datasets, we can say with confidence that our dataset was indeed difficult for these architectures to classify.

4. We also compared class-wise accuracies. The confusion matrices uniformly show that class 1 (epithelial nuclei) scored best, with a highest percentage of 84.9% in the case of VGG19 (Table 3); class 2 (inflammatory nuclei) came second, with a highest percentage of 65.9% in AlexNet (Table 2); class 3 (fibroblasts) came third, with a highest of 76.6% in VGG16 (Table 4); and the miscellaneous nuclei of class 4 scored the lowest accuracy across all three architectures, with a best value of 42.7% in VGG16 (Table 4).

5. This variation in accuracies reflects the structure of the nuclei in the database. The miscellaneous class contained all the other small groups of nuclei found in colon cancer, so it had no dominant pattern, and the classifiers therefore could not make reliable decisions for this class.

6. We observed from Table 5 that VGG19, despite having the deepest network, did not perform better than VGG16; however, the margin is not significant, as VGG16 is only 1% more sensitive (recall) than VGG19 (Table 5). Moreover, looking at the training times in Table 1, VGG19 took 6 times longer than VGG16. Hence, if we have to choose between VGG16 and VGG19, VGG16 is the better choice in terms of both accuracy and time.

7. From the comparison of the handcrafted and deep learning methods in Table 5, it is straightforward to deduce that the deep architectures performed better than the handcrafted descriptors used in this study.
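As referenced in observation 1, below is a minimal sketch of reading the per-class FPR at the TPR = 90% operating point from one-vs-rest ROC curves; the score matrix and its shape are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve

CLASSES = ["Epithelial", "Inflammatory", "Fibroblast", "Miscellaneous"]

def fpr_at_tpr90(y_true, y_score):
    """y_true: integer labels; y_score: hypothetical (n_samples, 4) softmax outputs."""
    for k, name in enumerate(CLASSES):
        # one-vs-rest ROC curve for class k
        fpr, tpr, _ = roc_curve((np.asarray(y_true) == k).astype(int), y_score[:, k])
        # interpolate the FPR at the 90% TPR operating point
        print(f"{name}: FPR at TPR=0.90 is {np.interp(0.90, tpr, fpr):.3f}")
```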

Table 3. Confusion matrix of VGG19
Table 4. Confusion matrix of VGG16
Table 5. Comparison between methods through performance parameters

4 Conclusion

Through this experimental work, our objective was to establish that state-of-the-art deep learning networks perform better than handcrafted features but may not produce great results on all kinds of datasets, such as histopathological data, even though AlexNet, VGG16, and VGG19 achieve better classification accuracy on the ImageNet dataset, as mentioned in point 2 of Sect. 3. Histopathological data is highly complex and incomprehensible to non-experts. Without consulting the domain expertise of experienced pathologists, one can never be sure of the nature of the objects present in the images; hence, proper classification of such images is a complex task even for humans. Deep learning algorithms do not address the dataset heterogeneity problem or their varying performance across data domains. Therefore, with this work we have tried to show that the widely used deep learning algorithms for classifying histopathological data are not, on their own, the best feasible methodology. Neither handcrafted features nor deep architectures alone are enough for classifying complex histopathological data; their combination should be exploited to achieve better performance.