
1 Introduction

In 2015, 2.4 million cases of breast cancer were reported worldwide, with 523,000 deaths. In the US, it is estimated that approximately 260,000 new cases of invasive breast cancer will be diagnosed in 2018 [1, 2], with about 40,920 deaths among women. Worldwide, breast cancer has the highest mortality rate among all cancers afflicting women.

Early screening and diagnosis can improve treatment and survival rates [47]. Initial screening is generally done by breast palpation and regular check-ups using mammography or ultrasound imaging, followed by detailed diagnosis with breast tissue biopsy, histopathology analysis, and clinical screening. Hematoxylin and eosin (H&E) stained biopsy tissues are analyzed under the microscope for parameters such as nuclear atypia, tubules, and mitotic counts. Visual identification from H&E stained biopsies is non-trivial, tedious, and can be highly subjective, with an average diagnostic concordance between pathologists of approximately 75% [3]. Whole slide imaging (WSI) scanners are increasingly being used to digitize histopathology slides, enabling automated image processing and machine learning methods for image enhancement, normalization, tissue localization, segmentation, quantitative analysis, detection, and diagnosis.

Convolutional neural networks [4,5,6,7,8, 49] are the de facto choice for researchers in this field and have outperformed conventional machine learning algorithms in many other medical imaging applications [9,10,11,12], including diabetic retinopathy, bone disease detection [44], bone fracture detection [45, 46], and pneumonia detection. However, deep networks require large amounts of training data to generalize, and publicly available annotated breast cancer datasets are small, so special methods are needed to make them viable. Data augmentation techniques such as flipping, rotation, and patching, as well as transfer learning approaches, are promising here. Conventional machine learning with handcrafted features [13,14,15,16,17] for medical image diagnosis does not generalize well in the real world, because variations in tissue preparation, staining, and slide digitization have a significant impact on tissue/image appearance. Pre-trained deep networks [18] have been used as feature extractors in many real-world applications, including diabetic retinopathy [19], handwritten digit recognition [20, 21], image retrieval [22, 23], remote sensing [24, 42], and mammography breast cancer image classification [25, 26].

In the ICIAR 2018 Grand Challenge [27], 400 microscopy and whole-slide images from the BreAst Cancer Histology (BACH) extended dataset were classified into normal, benign, in-situ carcinoma, and invasive carcinoma. Rakhlin et al. [28] report deep feature classification with multiple pre-trained deep networks, with a best accuracy of 93.8% on this dataset. They also report that deep feature classification outperforms the fine-tuning approach on the ICIAR 2018 Grand Challenge dataset.

Habibzadeh et al. [29] use fine-tuning of pre-trained Inception (V1, V2, V3, and V4) and ResNet (V1 50, V1 101, and V1 152) networks to classify H&E stained microscopy images from the BreaKHis dataset as benign or malignant. Their best result, 98.4%, is obtained with ResNet V1 101 fine-tuned across all layers. Despite the many studies on transfer learning and fine-tuning ConvNets [30], to the best of our knowledge there is no literature evaluating or comparing these two approaches, pre-trained deep feature classification and fine-tuning of ConvNets, on the same breast cancer dataset. In this paper, we evaluate both approaches on the BreaKHis dataset [31].

2 Dataset

We use the Breast Cancer Histopathological Database (BreaKHis) [31], which contains 7,909 microscopic images of breast biopsy tissue collected from 82 patients at multiple magnification factors (40x, 100x, 200x, and 400x). The dataset has two classes: 2,480 benign and 5,429 malignant images. Each image is 700 × 460 pixels, 3-channel RGB, with 8-bit depth per channel, in PNG format. The dataset was provided to us by Spanhol et al. [31] from the P&D Laboratory, Parana, Brazil (Table 1).

Table 1. Image distribution by magnification factor and class [31]

3 Data Augmentation and Pre-processing

Data augmentation is an important step to create a diverse, supplemented training dataset from small datasets to train deep networks. The training images are augmented by flipping them along their horizontal and vertical axes and by rotating them by 90, 180, and 270°. In the pre-processing step, the mean image is computed by averaging the training images and subtracted from all training and test images for brightness normalization. After mean subtraction, all images are resized to 224 × 224 × 3, the recommended input size for the InceptionV2, ResNet-50, and DenseNet-169 architectures.
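A minimal Python sketch of this augmentation and normalization pipeline is given below. For simplicity it resizes the images before mean subtraction (whereas the description above subtracts the mean at full resolution), and train_paths/test_paths are hypothetical lists of image file paths.

import numpy as np
from PIL import Image

INPUT_SIZE = (224, 224)  # input size used for InceptionV2, ResNet-50 and DenseNet-169

def augment(img):
    """Return the image plus horizontal/vertical flips and 90/180/270-degree rotations."""
    return [img, np.fliplr(img), np.flipud(img),
            np.rot90(img, 1), np.rot90(img, 2), np.rot90(img, 3)]

def load_image(path):
    """Load a BreaKHis PNG (700 x 460, 8-bit RGB) and resize it to the network input size."""
    return np.asarray(Image.open(path).resize(INPUT_SIZE), dtype=np.float32)

# Augment the training set, then normalize brightness by subtracting the mean image.
train_imgs = np.stack([a for p in train_paths for a in augment(load_image(p))])
mean_img = train_imgs.mean(axis=0)                 # per-pixel mean over training images
train_imgs -= mean_img
test_imgs = np.stack([load_image(p) for p in test_paths]) - mean_img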

4 Methods

4.1 Deep Feature Extraction and Classification

We use pre-trained deep networks trained on ImageNet [32], a 1000-class object recognition dataset with 1.2 million training images. These pre-trained ConvNet models serve as generalized feature extractors, since their convolutional layers learn discriminative features such as edges and textures. We remove the last fully connected output layer from the pre-trained deep network and extract feature vectors, called deep features, from the truncated network. A similar approach was used in [48]. These deep features are then fed to standard classifiers such as Random Forest and Logistic Regression; this approach is known as deep feature extraction and classification.

We use standard pre-trained DenseNet-169 [33], ResNet-50 [34], and InceptionV2 [35] networks from the Keras distribution [36], trained on ImageNet. These pre-trained networks are used as fixed deep feature extractors for the breast cancer dataset by removing the last fully connected (bottleneck) and softmax classifier layers.
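As an illustration, a minimal tf.keras sketch of ResNet-50 deep feature extraction is given below; global average pooling is assumed here to obtain the 1 × 2048 CNN codes, and the input images are assumed to be pre-processed as described in Sect. 3.

from tensorflow.keras.applications import ResNet50

# ResNet-50 without its fully connected/softmax head; global average pooling
# turns the final feature maps into a 2048-dimensional deep feature vector.
extractor = ResNet50(weights="imagenet", include_top=False,
                     pooling="avg", input_shape=(224, 224, 3))

def deep_features(images):
    """images: (N, 224, 224, 3) array of pre-processed histology images."""
    return extractor.predict(images)   # -> (N, 2048) CNN codes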

The extracted deep feature vectors (CNN codes), InceptionV2 (1 × 38400), ResNet-50 (1 × 2048), and DenseNet-169 (1 × 94080), are then classified by traditional machine learning classifiers. We split the dataset into 70% for training and 30% for testing, and build three machine learning models to classify the deep features: Logistic Regression [37], LightGBM [38], and Random Forest [39]. The models were trained on an NVIDIA Quadro K630 GPU [43] (Fig. 1 and Table 2).
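A sketch of the classification step is shown below; the hyperparameters are illustrative defaults rather than the settings of Table 2, and features/labels stand for the deep feature matrix and the benign/malignant labels.

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 70/30 train/test split of the deep features
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=42)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LightGBM": lgb.LGBMClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=500),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))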

Fig. 1. Deep feature classification flow chart

Table 2. Classifier with model parameters

4.2 DenseNet-169 Fine Tuning

Fine-tuning is another promising transfer learning technique for medical image classification: Habibzadeh et al. [29] report a fine-tuned ResNet classification accuracy of 98.7%, and Spanhol et al. report 90.0% accuracy with AlexNet fine-tuning. Continuing these techniques, we select DenseNet-169 [33] for fine-tuning, since DenseNets are easier and faster to train with no loss of accuracy, owing to improved gradient flow compared to other networks [40, 41]. We take DenseNet-169 pre-trained on ImageNet, freeze the pre-trained layers because they capture universal features, remove the final softmax layer, and replace it with a sigmoid output layer (binary classification). We then fine-tune the last layer with a small learning rate on the cancer images, as shown in Fig. 2. The dataset is divided into three parts: training (60%), validation (20%), and testing (20%). During training, data augmentation is applied to increase the number of training images. We use the Stochastic Gradient Descent (SGD) optimizer with learning rate = 0.0005, decay = \( 10^{-6} \), and momentum = 0.9. Mini-batches of 16 images are randomly sampled from the training set, and the network is trained for 12 epochs. The models are trained on an NVIDIA Quadro K630 GPU [43].
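A minimal tf.keras sketch of this fine-tuning setup is given below; x_train, y_train, x_val, and y_val are assumed to be the pre-processed images and labels, and the decay term is noted only in a comment since its argument name varies across Keras versions.

from tensorflow.keras.applications import DenseNet169
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

# DenseNet-169 pre-trained on ImageNet, without its 1000-way softmax head.
base = DenseNet169(weights="imagenet", include_top=False,
                   pooling="avg", input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False            # freeze the pre-trained layers

# Replace the removed softmax with a single sigmoid unit for benign/malignant.
output = Dense(1, activation="sigmoid")(base.output)
model = Model(inputs=base.input, outputs=output)

# Small learning rate as described above; a decay of 1e-6 is also used in the paper.
model.compile(optimizer=SGD(learning_rate=0.0005, momentum=0.9),
              loss="binary_crossentropy", metrics=["accuracy"])

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=16, epochs=12)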

Fig. 2. DenseNet-169 fine-tuning

5 Results

We report standard classification metrics including classification accuracy, F1 score, sensitivity (SN), and specificity (SP). Sensitivity, also called the true positive rate, measures the proportion of actual positives (malignant) that are correctly identified as such, and represents the model's ability to not overlook actual positives (Tables 3 and 4).

Table 3. Accuracy of deep features classification in percentage
Table 4. Accuracy of DenseNet-169 fine-tuned model

Specificity, also called the true negative rate, measures the proportion of actual negatives (benign) that are correctly identified as such, and represents the model's ability to not overlook actual negatives. ResNet-50 features with the Logistic Regression classifier consistently outperform the other deep feature classification models across all magnification factors, while deep feature classification performs worse at higher magnification factors. The fine-tuned DenseNet-169 with last-layer tuning achieves the best accuracy among all models, 99.25 ± 0.4% (Figs. 3, 4 and Tables 5, 6).
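For reference, these metrics can be computed from the confusion matrix as in the following sketch, where y_true and y_pred are illustrative ground-truth and predicted label arrays (1 = malignant, 0 = benign).

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate: malignant cases correctly detected
specificity = tn / (tn + fp)   # true negative rate: benign cases correctly detected
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)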

Fig. 3. Overall classification accuracies for deep feature classification and fine-tuned DenseNet-169. Logistic regression accuracies are shown for all deep features, as logistic regression outperforms the other classifiers; the fine-tuning approach performs better than deep feature classification at all magnification factors.

Fig. 4. ROC curves for 400x magnification images: (a) ResNet-50 features with LightGBM classifier, (b) DenseNet-169 features with logistic regression, (c) InceptionV2 features with LightGBM classifier, (d) fine-tuned DenseNet-169

Table 5. F1 score, specificity (SP) and sensitivity (SN) for deep feature classification
Table 6. F1 Score, specificity and sensitivity for DenseNet-169 fine-tuning

6 Conclusion

In this paper, we benchmark two transfer learning approaches using popular pre-trained networks, namely ResNet, Inception, and DenseNet, for benign/malignant breast cancer classification. Deep features extracted from pre-trained ResNet-50 and classified with logistic regression perform best among the deep feature classification models, with an accuracy of 94 ± 1%. In a second experiment, continuing the line of work in [29], we fine-tuned DenseNet-169 with strong data augmentation. The average accuracy of the fine-tuned DenseNet-169 model is 99.3%, an improvement of 3% to 5% over deep feature classification and better than other results reported in the literature.

According to [28], deep feature classification performs better when the dataset is small. Our experiments show that the fine-tuning approach with strong augmentation outperforms deep feature classification when the dataset is of moderate or large size. In future work, we plan to evaluate these outcomes more comprehensively and to apply the fine-tuned DenseNet-169 model to semantic segmentation of whole-slide histopathology images.