
1 Introduction

Breast cancer (BC) is the most common type of cancer and the fifth most common cause of cancer mortality among women globally [1]. While different types of imaging technologies have been employed for the diagnosis of BC, histopathology biopsy imaging has remained the 'gold standard' for diagnosing breast cancer, because it captures a comprehensive view of the effect of the disease on the tissues [2].

However, image examination by pathologists is often subjective and may not be easily quantified. Thus, computer-aided diagnosis (CADx) systems provide valuable assistance for physicians and specialists: they help overcome subjective interpretation and relieve some of the workload of pathologists. An important part of such CADx systems is the automation of image analysis to determine whether a tissue sample is malignant or benign. With automated image analysis, some tasks involved in the diagnostic workflow can be made more efficient and precise.

However, automated image analysis can be challenging, as inconsistencies in histopathology slide preparation (such as differences in fixation, staining protocol and non-standard imaging conditions) can cause variability in tissue appearance (colour and texture). The texture variation is typically captured by classifiers employing traditional texture features. To mitigate the effect of colour variability, a straightforward approach is to use gray-scale images [3, 4]. Alternatively, a stain (or colour) normalization preprocessing step can be performed, which is typically a more sophisticated process involving methods such as histogram matching, colour transfer, colour map quantile matching and spectral matching [5, 6].
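For illustration, the simplest of these normalization approaches, histogram matching, can be sketched with scikit-image as follows. This is only an illustrative sketch and not part of our pipeline (our method does not normalize); the file names are hypothetical.

    from skimage import io
    from skimage.exposure import match_histograms

    # Hypothetical file names; any pair of H&E-stained slides would do.
    source = io.imread("slide_to_normalize.png")
    reference = io.imread("reference_slide.png")

    # Match each colour channel's histogram to that of the reference slide,
    # reducing inter-slide stain variability before feature extraction.
    normalized = match_histograms(source, reference, channel_axis=-1)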

However, it has been observed that some inter-image colour variation might be informative [5]. Similarly, recent research in digital histopathology has indicated the significance of colour information in quantitative analysis of histopathology images [7, 8]. As can be seen from Fig. 1, along with texture, colour information is also available in the images, and can be utilized to obtain a more discriminating representation.

Fig. 1. Sample of histopathological images (first row: benign tumor, second row: malignant tumor) from the BreakHis dataset at a magnification factor of 40X.

From a machine learning perspective, various methods which do not employ normalization have also been proposed [9,10,11]. Our proposed method falls into this category, as features are extracted directly from the image (without normalization). This follows the philosophy that instead of reducing the colour variation, we learn the colour variation (along with the texture variation) as part of the classification process.

We believe that the colour-texture variability can be better captured with joint colour-texture features [12]. Such features consider the mutual dependency between colour channels and texture information. These features can be defined with individual colour channels, or with correlated pairs of colour channels. Such jointly defined colour-texture features can locally adapt to the variation in the image content [13].

In addition, different from existing works, where a small set of classifiers was utilized, here a total of 22 classification frameworks are experimented with. These classification frameworks include Quadratic Discriminant, Subspace Discriminant, RUSBoosted Trees, Boosted Trees, Coarse Gaussian SVM, Weighted KNN, etc. We argue that such an exploration of joint colour-texture features and various classifiers leads to the selection of features and classifiers better suited to this specific problem. Due to space constraints, we report the features and classifiers which correspond to the top five results for each image magnification.

1.1 Related Work

In recent years, a number of methods have been investigated for BC histopathology classification. However, most of these methods use traditional morphology and texture features. Kowal et al. [14] utilized four different clustering algorithms for nuclei segmentation and extracted a 21-dimensional feature vector. In [14], three different classifiers are reported for each clustering algorithm separately. This was carried out on a dataset of 500 images of cytological samples extracted from 50 patients. Filipczuk et al. [15] presented a diagnosis system where nuclei were estimated by the Hough transform. Four different classifiers trained on a 25-dimensional feature vector were used for classification, using 737 images of cytological samples drawn from the same source as [14]. Based on the above methods, it is clear that an accurate system requires the nuclei to be segmented properly, as the subsequent analysis is based on the segmentation. However, segmentation of histological images is not a trivial task and is prone to mistakes. Instead of relying on accurate segmentation, [16] investigated multiple image descriptors along with random subspace ensembles, and proposed a two-stage cascade framework with a rejection option, using a dataset composed of 361 images. In another work [17], ensembles of one-class classifiers were assessed by the same authors on the same dataset.

The works in [9, 10] also propose the use of colour information in addition to texture. In Milagro et al. [9], combinations of traditional texture features and colour spaces are considered. Furthermore, they also consider different classifiers such as AdaBoost learning, bagging trees, random forests, Fisher discriminant and SVM. In [10], the authors utilize colour and differential invariants to assign class posterior probabilities to pixels and then perform a probabilistic classification. While our intuition of using colour information and a set of classifiers is similar to [9], our integrated joint colour-texture features also consider the dependency between colour channels and texture, rather than extracting traditional texture features independently from colour channels. Moreover, unlike ours, and similar to the works discussed above, [9, 10] do not consider experimentation with respect to different optical magnifications, which is an important aspect [4]. Furthermore, unlike all the above approaches, we report our results on a public benchmark dataset.

With regard to the concern of benchmarking, it is observed that the datasets used in the above works are not publicly available to the scientific community, and contain a rather small number of images. Spanhol et al. [4] introduced the BreakHis dataset, intended to remove this impediment of the lack of a publicly available dataset. The BreakHis dataset contains a fairly large number of microscopic biopsy images (7909), collected from 82 patients at four different magnifications (40x, 100x, 200x, 400x). The details of the dataset are provided in Table 1. Figure 1 shows sample images of benign and malignant tumors.

Table 1. Detailed description of the BreakHis dataset.
Fig. 2. An integrated model for BC image classification.

In the same study, a series of experiments utilizing six different texture descriptors and four different classifiers was evaluated, and the accuracy was reported at the patient level. In [18], AlexNet [19] was used for extracting features, and classification on this dataset was reported at the image level as well as the patient level. Bayramoglu et al. [20] proposed a magnification-independent model utilizing deep learning, and reported accuracies both for a multi-task network which predicts the magnification factor and the malignancy (benign/malignant) simultaneously, and for a single-task network which predicts the malignancy.

1.2 Salient Aspects of This Work

Considering that the area of BC histopathology image analysis is still an emerging one, as new approaches are developed, the evaluation of and comparison among such frameworks is of increasing importance from a clinical perspective. In this context, we consider the following aspects of methodology and evaluation, which drive our work.

  (a) As implied above, there is scope for further exploration of suitable features and classifiers for this problem, which can better capture the discriminative information needed for the classification task. Thus, in this work we look into employing joint colour-texture features for this task. Motivated by [4], where conventional texture features (GLCM, CLBP, PFTAS etc.) along with a small set of well-known classifiers were utilized, in this work we explore a relatively larger set of classifiers for the joint colour-texture features. This provides their comparative performance under one roof, and indeed, for some classifiers we demonstrate an improvement over the state-of-the-art.

  (b) The above methods yield a continuous value as a score, rather than a single value for making decisions. In the discussed methods [4, 18], patient-level and image-level scores were used as performance measures. However, it is also important to convert a patient score to a decision (benign or malignant), using a decision threshold on the patient-level scores, and finally to comment on the quality of the diagnostic test in terms of the accuracy of such patient-level decisions. For such a quality check, the receiver operating characteristic (ROC) curve, which includes all decision thresholds, offers a more comprehensive assessment. In diagnostic test assessment, the area under the ROC curve (AUC) can be used to judge the quality of approaches. The AUC lies in the range 0 to 1, where 0 and 1 correspond to a completely inaccurate and a completely accurate test, respectively. An AUC of 0.5 indicates no discrimination, 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered good (sometimes excellent), and more than 0.9 is considered outstanding [21]. In light of this, we suggest an integrated model using all magnifications (as elaborated below), which uses the AUC as a performance measure.

  (c) In some previous works [4, 18], a model corresponding to each magnification was built independently, based on different combinations of features and classifiers. We believe that instead of relying only on the individual scores corresponding to each magnification, assessing an overall score, calculated as the ratio of the total number of correctly classified images to the total number of images of a patient pooled over all magnifications, can also yield useful information for a reliable decision. For instance, for a patient with large variation in scores across magnifications, the decision cannot be made reliably by just looking at the highest score. In this work, while we report results on individual magnifications, we also suggest an integrated model that makes use of all magnifications, and can yield a more reliable system in terms of the AUC. Figure 2 depicts the proposed integrated model, wherein x1, x2, x3, and x4 are the total numbers of input images at the four different magnifications, and y1, y2, y3, and y4 are the corresponding numbers of classified images. A minimal sketch of the resulting pooled score is given after this list.
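As promised above, the pooled patient-level score of the integrated model can be sketched as follows (a minimal sketch under the notation of Fig. 2; the function name is ours):

    def integrated_patient_score(x, y):
        # x[m]: number of a patient's test images at magnification m (40x, 100x, 200x, 400x)
        # y[m]: number of those images the magnification-m model classifies correctly
        assert len(x) == len(y) == 4
        return sum(y) / sum(x)

    # Example: a patient whose per-magnification scores vary strongly.
    x = [20, 20, 20, 20]
    y = [19, 12, 18, 11]
    print(integrated_patient_score(x, y))  # 0.75, pooled over all magnifications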

2 Methodology

In this section, we briefly discuss the image descriptors and classifiers utilized in this study.

2.1 Joint Colour-texture Features

In order to find suitable features for each magnification, various features are evaluated. Due to space constraints, we provide only a short introduction to the features which appear in the combinations that yield the top results. For more details, please refer to [12].

  1. Normalized colour space representation [22]: A matrix of complex numbers \(C_{1}+iC_{2}\), where \(C_{1}\) and \(C_{2}\) are normalized colour channels chosen based on the range and average values of the colour channels, is used to extract textural (Gabor filter) features.

  2. Multilayer coordinate clusters representation [23]: To describe the textural and colour content of an image, this method splits the original colour image into a bundle of binary images, where each binary image represents a colour code based on a predefined palette (quantized colour space). Patches of these binary patterns are then clustered, and the method computes histograms of occurrence of the binary patches based on the cluster centres. This process is repeated for each layer, and the resulting histograms are concatenated. Depending on how many samples (n) are taken on each axis of the colour space, the resulting palette (\(N = n^{3}\)) has 8, 27 or 64 levels.

  3. Gabor features on Gaussian colour model [24]: The following two stages are used to extract colour-texture features: (1) measurement of colour in a transformed space (based on a Gaussian colour model); (2) utilization of a Gabor filter bank for texture measurement.

  4. Complex wavelet features and chromatic features [25]: The Dual-Tree Complex Wavelet Transform (DT-CWT) is applied to each colour channel separately. The final feature vector is a concatenation of the DT-CWT features from the different channels.

  5. Opponent colour local binary pattern (OCLBP) [26]: This is an extension of the standard Local Binary Pattern (LBP), developed as a joint colour-texture operator for colour images. It is a concatenation of the LBPs extracted from the individual colour channels (intra-channel) and from the opponent colour channel pairs ((\(c_{1},c_{2}\)), (\(c_{1},c_{3}\)) and (\(c_{2},c_{3}\))) jointly. A minimal sketch of this descriptor is given after this list.
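As promised above, a minimal numpy sketch of the OCLBP descriptor follows, using the basic 8-neighbour, radius-1 LBP formulation. This is our simplified reading of [26] for illustration, not necessarily the exact implementation used in the experiments:

    import numpy as np

    def opponent_lbp(center_ch, neigh_ch):
        # 8-neighbour LBP code map: neighbours taken from neigh_ch are
        # thresholded against the centre pixel taken from center_ch
        # (standard intra-channel LBP when the two channels coincide).
        c = center_ch[1:-1, 1:-1].astype(float)
        h, w = neigh_ch.shape
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                   (1, 1), (1, 0), (1, -1), (0, -1)]
        codes = np.zeros(c.shape, dtype=np.int32)
        for bit, (dy, dx) in enumerate(offsets):
            n = neigh_ch[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx].astype(float)
            codes |= (n >= c).astype(np.int32) << bit
        return codes

    def oclbp_histogram(img):
        # Concatenate 256-bin histograms for the three intra-channel maps and
        # the three opponent pairs (c1,c2), (c1,c3), (c2,c3): 6 x 256 = 1536-D.
        chans = [img[..., i] for i in range(3)]
        pairs = [(0, 0), (1, 1), (2, 2), (0, 1), (0, 2), (1, 2)]
        hists = [np.histogram(opponent_lbp(chans[a], chans[b]),
                              bins=256, range=(0, 256), density=True)[0]
                 for a, b in pairs]
        return np.concatenate(hists)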

2.2 Classifiers

We explore various supervised classifiers, for which we provide a short description below [27]; a rough scikit-learn approximation of some of these presets is sketched after the list.

  1. Support Vector Machine (SVM): A supervised machine learning algorithm that learns a hyperplane separating the samples of one class from those of the other with maximum margin. Depending on the type of kernel and the kernel scale used to distinguish between classes, a variety of SVMs exists:

    (a) Linear SVM
    (b) Quadratic SVM (quadratic kernel)
    (c) Cubic SVM (cubic kernel)
    (d) Fine Gaussian SVM (Radial Basis Function (RBF) kernel, kernel scale set to \(\sqrt{P}/4\))
    (e) Medium Gaussian SVM (RBF kernel, kernel scale set to \(\sqrt{P}\))
    (f) Coarse Gaussian SVM (RBF kernel, kernel scale set to \(2\sqrt{P}\))

    where P is the number of predictors.

  2. Decision Tree: A top-down approach that uses a tree-like graph of possible solutions, including resource costs and utility. Several variations of trees exist, based on the maximum number of splits used in the tree:

    (a) Simple Tree (maximum number of splits is 4)
    (b) Medium Tree (maximum number of splits is 20)
    (c) Complex Tree (maximum number of splits is 100)

  3. Nearest Neighbors Classifier: It makes no underlying assumptions about the distribution of the data. An unclassified point is assigned to the class for which it has the highest probability of membership, based on a distance metric to its neighbors. Depending on the number of neighbors and the metric used, a variety of k-NN classifiers exists:

    (a) Fine KNN (number of neighbors set to 1, Euclidean metric)
    (b) Medium KNN (number of neighbors set to 10, Euclidean metric)
    (c) Coarse KNN (number of neighbors set to 100, Euclidean metric)
    (d) Cosine KNN (number of neighbors set to 10, cosine distance metric)
    (e) Cubic KNN (number of neighbors set to 10, cubic distance metric)
    (f) Weighted KNN (number of neighbors set to 10, distance-based weights)

  4. Discriminant Analysis: It assumes that different classes generate data from different Gaussian distributions, and predicts membership of a group or category based on the observed values. We consider two types of discriminant analysis, based on the type of boundary formed between classes:

    (a) Linear Discriminant (linear boundaries)
    (b) Quadratic Discriminant (non-linear boundaries such as ellipse, parabola)

  5. Ensemble Classifier [28]: A set of classifiers trained to solve the same problem, whose outputs are combined to classify a new sample. Different combination schemes lead to different ensemble methods:

    (a) Boosted Tree
    (b) Bagged Tree
    (c) RUSBoosted Trees
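As indicated above, the classifier variants follow the presets of a standard classification toolbox. A rough scikit-learn approximation of a few of them could look as follows; the mapping of a kernel scale s to gamma = 1/s^2 and the split-to-leaf-node conversion are our assumptions, so this is a sketch rather than an exact reproduction:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

    def classifier_bank(P):
        # P: number of predictors (feature dimensionality).
        s_coarse = 2 * np.sqrt(P)  # 'Coarse Gaussian' kernel scale
        return {
            "Linear SVM": SVC(kernel="linear"),
            "Quadratic SVM": SVC(kernel="poly", degree=2),
            "Cubic SVM": SVC(kernel="poly", degree=3),
            "Coarse Gaussian SVM": SVC(kernel="rbf", gamma=1.0 / s_coarse**2),
            "Fine KNN": KNeighborsClassifier(n_neighbors=1),
            "Weighted KNN": KNeighborsClassifier(n_neighbors=10, weights="distance"),
            "Medium Tree": DecisionTreeClassifier(max_leaf_nodes=21),  # ~20 splits
            "Quadratic Discriminant": QuadraticDiscriminantAnalysis(),
            "Boosted Tree": AdaBoostClassifier(),
            "Bagged Tree": BaggingClassifier(DecisionTreeClassifier()),
        }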

3 Experimental Results and Discussion

For a fair comparison with existing approaches [4, 18, 20], we randomly choose 58 patients (70%) for training and the remaining 24 for testing (30%). We train the above-mentioned classifiers using the image representations of the chosen 58 patients, and test the trained models on the image representations of the remaining patients. Due to the disproportionate ratio of normal and abnormal cases, the same procedure is repeated for five trials (each time different patients are chosen for training and testing) and average results are reported. This protocol is followed for all magnifications, i.e. the same patients are used for training at all magnifications. In the subsequent subsections, we discuss the evaluation metrics used in the present work, the performance evaluation for each magnification as well as for the integrated model, the AUC evaluation, and the performance comparison.
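A minimal sketch of this patient-wise protocol is given below (function and variable names are ours):

    import numpy as np

    def patient_wise_split(patient_ids, train_frac=0.7, rng=None):
        # Split at the patient level so that all images of a patient, across
        # all four magnifications, fall entirely in training or in testing.
        rng = rng or np.random.default_rng()
        patients = np.unique(patient_ids)
        n_train = int(round(train_frac * len(patients)))  # 58 of 82 patients
        train_patients = rng.choice(patients, size=n_train, replace=False)
        train_mask = np.isin(patient_ids, train_patients)
        return train_mask, ~train_mask

    # Five trials: redraw the split each time and average the reported scores.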

3.1 Evaluation Metric

There can be various ways to evaluate a model when the observed variable lies in a continuous range (as discussed in the introduction). In some previous works [4, 18], the patient recognition rate (PRR), which in turn depends on the patient score (PS), and the image recognition rate (IRR) were used to report results. The first measure takes the decision at the patient level, while the second at the image level (i.e. without using patient information). These measures are defined as follows:

$$\begin{aligned} PRR = \frac{\sum ^{N}_{i=1}PS_{i}}{N} \end{aligned}$$
(1)

where N is the total number of patients (available for testing). The patient score is defined as follows,

$$\begin{aligned} PS = \frac{N_{rec}}{N_{P}} \end{aligned}$$
(2)

where \( N_{rec}\) and \( N_{P} \) are the number of correctly classified images and the total number of images of patient P, respectively.

$$\begin{aligned} IRR = \frac{TCCI}{TI} \end{aligned}$$
(3)

where TCCI and TI are the total number of correctly classified images and the total number of images, respectively.
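These measures are straightforward to compute; a small numpy sketch follows (the function names are ours):

    import numpy as np

    def patient_score(y_true, y_pred):
        # PS = N_rec / N_P for a single patient (Eq. 2).
        return np.mean(np.asarray(y_true) == np.asarray(y_pred))

    def patient_recognition_rate(per_patient_scores):
        # PRR = mean of the patient scores over the N test patients (Eq. 1).
        return np.mean(per_patient_scores)

    def image_recognition_rate(y_true, y_pred):
        # IRR = TCCI / TI, pooling all test images regardless of patient (Eq. 3).
        return np.mean(np.asarray(y_true) == np.asarray(y_pred))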

In addition, we also employ the ROC curve and AUC computation [29] to grade the quality of the framework as a system for patient-level diagnosis.

3.2 Performance Evaluation

Tables 2, 3, 4 and 5 illustrate the performance of the models corresponding to each magnification. For each magnification, results are reported for the five best combinations, ranked by the obtained patient score.

In the proposed study, we compute the AUC based on the ROC obtained using the patient scores. Hence, in Tables 2, 3, 4 and 5, we give more prominence to the patient score. In each table, the fourth and fifth rows show the patient and image scores obtained for the top combinations, and the corresponding features and classifiers are given in the second and third rows.

It is observed from the tables that not all features are appropriate for the same classifier. Hence, suitable combinations of features and classifiers are more advantageous for quantifying the images at different magnifications.

We also suggest an integrated model, where we consider the best feature-classifier combination (based on the patient score) for each magnification. The integrated model yields a patient-level score of 88.40% and an image-level score of 88.09%. We note that the integrated model performs similarly to the individual magnifications in terms of score. However, as we demonstrate next, the integration can be considered more reliable, based on the AUC analysis.

Table 2. Top five feature-classifier combinations for 40x magnification.
Table 3. Top five feature-classifier combinations for 100x magnification.
Table 4. Top five feature-classifier combinations for 200x magnification.
Table 5. Top five feature-classifier combinations for 400x magnification.
Table 6. AUC comparison for different magnifications.
Table 7. Performance comparison.

3.3 AUC Evaluation

As discussed in Sect. 1.2, it is important to make decisions on patients (rather than images), and the ROC and the related AUC are an effective way to rate such diagnostic systems. Here, we consider these in the context of the reliability of the test for patient-level decisions, obtained by thresholding patient-level scores. Note that this ROC computation on patient-level scores is different from the traditional ROC analysis for pattern classifiers (e.g. for image-level classification).

Table 6 details the AUC values obtained for all magnification levels as well as for the integrated model. A threshold on the real-valued scores determines a final label (benign or malignant), and the ROC curve is computed using different values of this threshold. We also compute the optimal threshold on the ROC curve [30]. Table 6 also reports the range of this optimal threshold estimated over the five trials.
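A sketch of this patient-level ROC analysis is given below. Here a patient's score is taken as, e.g., the fraction of that patient's images classified as malignant, and we use Youden's J statistic as the optimal-threshold criterion, which is a common choice but may differ from the exact criterion of [30]:

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    def patient_level_roc(patient_scores, patient_labels):
        # patient_labels: 1 = malignant, 0 = benign (ground truth per patient).
        fpr, tpr, thresholds = roc_curve(patient_labels, patient_scores)
        best = np.argmax(tpr - fpr)              # Youden's J = TPR - FPR
        return auc(fpr, tpr), thresholds[best]   # AUC and optimal threshold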

From the results reported in Table 6, it is clear that the AUC for the models corresponding to a single magnification is lower than that of the integrated model, thus ascertaining the good quality of inference of the integrated model. The AUC value of 81.92 for the integrated model signifies a good-quality test [21]. We also note that the variation of the optimal threshold across the five trials is among the lowest for the integrated model. This suggests that the integrated model yields a stable value of the optimal threshold.

3.4 Performance Comparison

Table 7 compares the proposed method with state-of-the-art methods which use the same dataset and the same protocol. We can observe from the table that, except for the 40x magnification case, the proposed framework outperforms the other approaches. Furthermore, one can also observe that the proposed work yields the least variance in scores. Thus, we demonstrate that suitable joint colour-texture feature and classifier combinations are effective for BC histopathology image classification.

4 Conclusion

This study proposes an integrated model over multiple magnifications for breast cancer histopathological image classification. In this work, we employ a wide range of joint colour-texture features and classifiers, and demonstrate that some of these features and classifiers are indeed effective in achieving a superior classification performance. In addition, the present study also focuses on measuring the performance of the integrated model based on the AUC criterion, and deduces that this yields better results than classification at individual magnifications.