Keywords

1 Introduction

Breast cancer is considered to be the major cause of death among women after lung cancer. It is the result of the unrestricted growth of breast cells. According to GLOBOCAN cancer survey [1] about 1.67 million new cases of breast cancer were diagnosed in the year 2012 which constituted about 25% of all the cancers. Moreover, an approximate figure of 266,120 new cases of breast cancer is anticipated in men and women in the year 2018. Early detection and treatment are necessary in order to combat the mortality rate due to breast cancer. Mammography is one of the most genuine methods for screening and detection of breast cancer as compared to other methods such as breast self-examination (BSE), surgery and clinical breast examination(CBE). It uses X-rays for analysis of breasts in order to locate suspicious lesions. It results in the formation of an X-ray image called a mammogram which is studied by a radiologist. Computer-aided diagnosis (CAD) systems assist the radiologists in the understanding of breast images in order to detect the suspicious regions. The CAD system helps in increasing the diagnostic accuracy and thus improves the mammogram interpretation rate.

Talha [2] used discrete wavelet transform (DWT) along with discrete cosine transform (DCT) for extracting features. The obtained features were classified as normal or abnormal using SVM. Beura et al. [3] used two dimensional DWT and gray level co-occurrence matrix (GLCM) for extracting the relevant features from the ROI, followed by the selection of a subset of the extracted features using F-test and t-test and used backpropagation neural network for classification. Pratiwi et al. [4] presented a classification of mammograms using radial basis function neural network (RBFNN) based on GLCM texture based features. A CAD system has been proposed by Mohamed et al. [5] wherein GLCM is used for feature extraction along with three different classifiers, namely, SVM, ANN, and KNN. Dong et al. [6] used dual contourlet transform for feature extraction and an improved KNN classifier. Reyad et al. [7] showed a comparison of statistical, local binary pattern (LBP) and multi-resolution features based on DWT and contourlet transform and SVM as a classifier. Wang et al. [8] presented a mass classification scheme which utilized hidden features of mass to expose the hidden distribution pattern. Phadke et al. [9] proposed a CAD system which utilized a combination of local and global features to find out the abnormalities in the mammograms with the help of SVM. Liu et al. [10] combined a support vector machine based recursive feature elimination technique along with normalized mutual information to eliminate singular disadvantages. Zhang et al. [11] developed an ensemble system for the classification of the region of interest as benign or malignant with the help of SVM by using mass shape features. Gedik [12] introduced a new method for extracting features based on fast finite shearlet transform and used SVM for classification. Elmoufidi et al. [13] used dynamic K-means clustering algorithm for regions of interest (ROI) detection on the mini-MIAS dataset. Hariraj et al. [14] used wiener filter for noise removal, GLCM for feature extraction and SVM and KNN for classification. From the literature, it is realized that the improvement in the modules like feature extraction, feature reduction and classification leads to improvement in the overall performance of a CAD system. There exists an enormous scope to develop an improved CAD system to correctly diagnose the mammograms. Hence, keeping this in mind, authors are motivated to propose a CAD system using the compound local binary pattern for feature extraction, principal component analysis for feature reduction and different classifiers like SVM, KNN, ANN, C4.5, and Naive Bayes. Further, as per the best knowledge of the authors, this is the first attempt to propose a CAD system with this combination (CLBP+PCA+SVM, KNN, ANN, C4.5, and Naive Bayes).

2 Proposed CAD Framework

The proposed CAD system comprises of mainly three modules, namely, feature extraction using compound local binary pattern (CLBP), feature reduction using principal component analysis (PCA) and classification using SVM, KNN, ANN, C4.5, and Naive Bayes. The complete design of the presented scheme is represented in Fig. 1.

Fig. 1.
figure 1

Framework of CAD

2.1 Preprocessing and ROI Extraction

Noise and unwanted pectoral muscles are removed from the mammograms in the preprocessing stage. The mammograms are provided with information regarding the size of the abnormality. Hence to extract the ROI, a suitable cropping mechanism is used. Figures 2 and 3 represents the ROIs of the MIAS and DDSM databases respectively.

Fig. 2.
figure 2

ROI of MIAS dataset

Fig. 3.
figure 3

ROI of DDSM dataset

2.2 Feature Extraction Using Compound Local Binary Pattern

The output of a classifier is determined by the quality of the extracted features. The local binary pattern (LBP) is a simple and efficient texture feature extraction technique. However, it does not take into consideration the difference in magnitude between the center and neighboring pixel values. Therefore, this method produces conflicting results. In order to incorporate the magnitude information along with the sign, a new technique called compound local binary pattern (CLBP) which is an extension of LBP is introduced [15, 16]. CLBP allocates a code of 2P-bit to the middle pixel depending on the P number of neighboring pixels. Each of the P neighbors gets encoded with two bits. The first bit encodes the sign information while the second bit encodes the magnitude of difference with respect to a threshold value. This is illustrated in Eq. (1).

$$\begin{aligned} s(i_n,i_m)={\left\{ \begin{array}{ll} 00 &{} { i_n-i_m<0, ~~\left| i_n-i_m \right| \le Avg} \\ 01 &{} i_n-i_m<0, ~~\left| i_n-i_m \right| >Avg \\ 10 &{} i_n-i_m\ge 0,~~\left| i_n-i_m \right| \le Avg \\ 11 &{} \text { otherwise } \end{array}\right. } \end{aligned}$$
(1)

where, \(i_{m}\) is the pixel intensity of the middle pixel, \(i_{n}\) is the pixel intensity of the surrounding pixel and Avg is the average magnitude of the difference between \(i_{n}\) and \(i_{m}\) in the local neighborhood.

Fig. 4.
figure 4

CLBP example

For example, in a 3\(\,\times \,\)3 neighborhood with 8 neighboring pixels, the center pixel is assigned a 16-bit code. This increases the number of features. Thus the two 8 bit patterns which are obtained by dividing the 16-bit pattern helps in reducing the number of features. The first one is generated by joining the bit values in the up, right, down, and left directions of the center pixel, respectively and the other one is formed by combining the bit values in the north-east, south-east, south-west, and north-west directions of the center pixel respectively. Figure 4 illustrates a CLBP example. Therefore, each pixel gets two 8-bit binary codes after the application of the CLBP operator on all pixels followed by dividing the obtained 16-bits into two 8-bits. Thus, two encoded images are obtained for an image from which two histograms are generated. These two histograms are then combined to obtain a histogram which serves as a feature vector for the whole image.

2.3 Feature Reduction Using Principal Component Analysis

PCA converts the features into a set of linearly uncorrelated variables called principal components [17]. It helps in reducing the dimensionality of the original feature set. It maps the data from a higher dimensionality space to a lower dimensionality space thus reducing the number of redundant features. The obtained reduced set contains maximum variability of the original data.

2.4 Classification

SVM is a supervised learning model which is used for classification and regression purposes [5]. It constructs a hyperplane that has the maximum distance from the data. ANN imitates the biological neural networks. It has an input layer, one or more hidden layers, and an output layer [5]. It is a supervised learning model. The generated output is compared with the actual output and an error (difference) is generated. Based on this error, the weights are adjusted unless and until the desired output is obtained. KNN is used for classification and regression [5]. The unknown sample is given a label which is most common among its k neighbors. C4.5 is used for generating decision trees [18]. It is an extension of ID3. It is also called a statistical classifier as the decision tree generated by it can be used for classification. Naive Bayes is based on Bayes’ theorem and is used in medical imaging [19]. It belongs to a family of probabilistic classifiers. Based on training, it classifies features and gives them labels taken from a finite set. In all the above classifiers, training is carried out with 70% data and the rest 20% data is utilized for testing.

In the proposed scheme, SVM, KNN, ANN, C4.5, and Naive Bayes are used for segregating the images into normal, benign or malignant.

3 Results

MATLAB 2017a environment is used for carrying out the experiments. All images are taken from Mammographic Image Analysis Society (MIAS) [20] and Digital Database for Screening Mammography (DDSM) [21] repositories. MIAS dataset comprises of 319 images out of which 207 are normal, 64 are benign and 48 are malignant ones. A total of 291 images are collected from DDSM dataset out of which 180 are normal, 55 are benign and 56 are malignant images. The ROIs are extracted by cropping the original images and resizing them to 256\(\,\times \,\)256. Then from each of the ROIs, texture features are extracted using CLBP. A feature vector consisting of 512 features is generated. It may be possible that all the 512 features which are extracted do not contribute towards the overall performance of the proposed model. Hence, to reduce the feature set and to curb the curse of dimensionality problem, PCA is applied which reduces the feature vector length to 20 keeping 95% variance of the original data. The reduced feature set is thus fed to different classifiers to classify the mammograms.

Table 1 lists the values of different performance metrics like accuracy (Acc), sensitivity (Sn) and specificity (Sp) obtained with the proposed model for different classifiers for MIAS dataset.

Table 1. Performance measure of MIAS dataset (A-Abnormal, N-Normal, B-Benign, M-Malignant)

From the table, it is noticed that SVM has the highest accuracy of 100% followed by C4.5 with an accuracy of approximately 95.92%, ANN with 88.1%, and KNN and Naive Bayes both with an accuracy of 83.3856% for normal and abnormal images. In the case of Benign-Malignant, SVM has an accuracy of 100%, followed by C4.5 with an accuracy of approximately 91.07%, ANN with 80.4%, KNN with an accuracy of 76.7857%, and Naive Bayes with an accuracy of 71.4286%. Similarly, the results obtained for DDSM dataset are shown in Table 2.

Table 2. Performance measure of DDSM dataset

It is observed that SVM and ANN both have an accuracy of 100% followed by KNN with an accuracy of 99.66%, C4.5 with an accuracy of 98.9691%, and Naive Bayes with an accuracy of 98.6254% for normal and abnormal images. In the case of Benign-Malignant, SVM has an accuracy of 100%, followed by C4.5 with an accuracy of 95.4955%, ANN with an accuracy of 93.7%, KNN with an accuracy of 81.08%, and Naive Bayes with an accuracy of 80.18%. The performance of the proposed scheme is matched with some of the recent approaches with respect to accuracy as depicted in Table 3.

Table 3. Comparison of Accuracy of Diferent Models (A-Abnormal, N-Normal, B-Benign, M-Malignant)

4 Conclusion

Detection and diagnosis of breast cancer at an early stage helps in reducing the fatality rate to a greater extent. Hence, it becomes utmost important to develop an efficient and reliable CAD system which can classify the mammograms accurately. In this article, a model CAD system (CLBP+PCA+SVM, KNN, ANN, C4.5, and Naive Bayes) is proposed. In the presented scheme, compound local binary pattern (CLBP) which is a texture feature extraction technique is used. A total of 512 features are extracted which are then converted to a reduced feature set of size 20, with the help of PCA. The reduced feature set is fed to various classifiers like SVM, KNN, ANN, C4.5 and Naive Bayes to evaluate the performance measures.

It has been observed that SVM obtains the highest accuracy rate among all the classifiers for both Normal-Abnormal and Benign-Malignant classification. Further, it has also been observed that in the majority of the cases, the proposed model achieves better results than that of the competent schemes.

The proposed work can be extended towards the formulation of alternative feature extraction, feature reduction, and classification schemes to obtain an improved classification accuracy.