1 Introduction

Breast cancer is the second deadliest cancer affecting women worldwide and ranks among the major health problems. Statistics from the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) program indicate that the lifetime risk of developing breast cancer among American women is 12.2 % (i.e., one in eight), exceeded only by lung cancer [2, 45]. In the European Community, breast cancer represents 19 % of cancer deaths and 24 % of all cancer cases [14, 24]. 25 % of all breast cancer deaths occur in women diagnosed between the ages of 40 and 49 years; in the United States, for instance, breast cancer remains the leading cause of death for women in their forties [24]. The World Health Organization's International Agency for Research on Cancer (IARC) has estimated that more than one million new cases of breast cancer occur annually, and has reported that more than 400,000 women die each year from this disease [28]. Cancer is staged from 0 to 4 according to how far it has spread, determined using surgical procedures; lower stage numbers indicate an early-stage cancer that can more easily be treated. It is therefore essential to detect breast cancer at an early stage in order to reduce fatalities [28]. However, detecting breast cancer at its early stages is difficult, as it usually produces no symptoms at the beginning. The mortality of breast cancer has declined in women of all ages [28], and this fortunate reduction is attributed to increased awareness of the disease, self-examination, the widespread use of mammographic screening, and improvements in treatment.

Due to its reliability, mammography (an X-ray examination of the breast) is considered the most effective screening method for the detection of breast cancer. The mammograms are first digitized and then filtered/analyzed with powerful image analysis techniques in order to develop computer-aided diagnosis (CAD) systems that effectively assist radiologists. A CAD system is a set of automatic or semi-automatic tools developed to assist radiologists in the detection and/or evaluation of mammographic images [24]. There are three types of breast lesions: mass, calcification, and architectural distortion [45]. The target of this research work is to identify an optimized feature extraction strategy that learns the structure of each suspicious abnormality in the ROIs and then assigns a malignancy risk degree using an efficient classification method.

We cannot ignore the importance of biopsy, in medical terms, for detecting masses most accurately. It is, however, an expensive procedure that involves some risks, e.g., patient discomfort, post-biopsy side effects, and the chance of missing cancerous tissue depending on the biopsy method; it is therefore recommended only as a last resort for mass detection. CAD systems, on the other hand, are inexpensive, easy-to-use tools that analyze digital mammograms and can effectively assist radiologists in their decision making (as a second expert opinion). The idea of using CAD systems for breast cancer detection is not new: they have been used for this task before and have proved useful in screening digital mammograms and thereby detecting early-stage malignancies [24, 28, 45]. However, controversial results and views exist against the use of CAD systems, mainly because of their high false positive and false negative rates in breast cancer detection, which undermines radiologists' trust in them [24]. A false negative occurs when a CAD system declares a mammogram normal even though breast cancer is present. The main cause of false negatives is breast density: both dense tissue and tumors appear as white regions in the mammogram, which makes them difficult to distinguish. As women get older their breasts become fattier, and false negatives become less likely. A false positive is a benign region in the mammogram that the CAD system interprets as suspicious. High false positive rates occur most commonly when analyzing the mammograms of younger women, for the same reason of dense breast tissue. In this research work, our motivation is to investigate six different feature extraction mechanisms to optimize the performance of CAD systems.

For the detection of masses in mammograms, we can identify three main stages that constitute a CAD system: 1) detection and segmentation of potentially abnormal areas, 2) false positive reduction, and 3) discrimination of benign and malignant masses. The detection and segmentation stage identifies potential mass regions and detects their precise outlines. The ROIs detected at this stage include not only masses but also suspicious normal tissue. The false positive reduction stage classifies the detected ROIs into mass and normal ROIs, and the detected mass ROIs are further discriminated as benign or malignant in the final stage. Many efforts have been made for false positive reduction and benign-malignant classification, but both remain challenging problems. In this research work, we investigate and compare robust, optimized, and discriminative feature extraction mechanisms for false positive reduction and benign-malignant classification to effectively address these challenges.

The feature extraction technique proposed in [20] performs well when used to distinguish normal tissue from true malignant masses. This task is, however, simpler than false positive reduction and benign-malignant classification, because normal tissue can easily be discriminated from true malignant masses owing to their highly dissimilar patterns. In this paper, we are interested in the performance of this feature extraction method and its variants (based on two state-of-the-art feature transformation strategies), together with other Gabor feature extraction techniques from the literature, on these two more complex classification problems. The variants of the method in [20], combined with feature transformation strategies, further ensure that only the most representative properties are used (by removing redundant responses of the filter bank) to discriminate between normal and abnormal tissue. All of these methods analyze the textural properties of masses using a bank of several Gabor filters (discussed later). The key idea behind using several Gabor filters is to improve the performance of the breast cancer recognition system: filters at different orientations and scales respond strongly to the features that best distinguish normal from abnormal tissue. Depending on the size of the mammogram region filtered by the bank, textural patterns can be extracted either locally (a sub-region of the ROI is filtered) or globally (the entire ROI is filtered). Both local and global textural descriptors characterize the micro-patterns (e.g., edges, lines, spots, and flat areas) in digital mammograms that are very helpful for mass detection [24]. Local textural descriptors additionally preserve the spatial information of masses and other regions in the mammogram, which makes them the more attractive choice for this task.

The filters in the Gabor bank are initialized with different scales and orientations to extract any patterns in the ROIs that might help discriminate normal from abnormal tissue. Although Gabor filters have been used for breast cancer detection before (see, e.g., [20, 45] and references therein), this work proposes further variants of the existing feature extraction strategies [20] that offer much better performance than their original form. The feature transformation algorithms used in this paper effectively remove redundant or irrelevant Gabor filter responses and are thus extremely helpful for improving the performance of a CAD system. Manually extracted normal/abnormal tissue regions are filtered with the Gabor filters to extract directional features, which are eventually used for the classification of digital mammograms.

The remainder of this paper is organized as follows. In the next section, we review the related research work. In Section 3, we present the methodology, with a brief discussion of the feature extraction strategies and the classification algorithm. Subsequently, in Section 4, we present experimental results to show the effectiveness of the feature extraction techniques. Finally, Section 5 concludes this work.

2 Related work

The mass detection problem has attracted the attention of many researchers, and many detection techniques have been proposed [20]. For a detailed review of these methods, the interested reader is referred to the review papers [10, 13, 31, 38]. In the following paragraphs, we give an overview of the most closely related recent mass detection methods.

Most of the existing methods differ in the types of features used for mass detection and in the way these features are extracted. Different types of features, such as texture, gradient, grey-level, and shape features [31], have been employed for mass detection. Texture is an important characteristic that helps to discriminate and identify objects. In addition to other identification/detection tasks, texture descriptors have been used for detecting normal and lesion regions in mammograms [29, 37, 42]. Wei et al. [43] extracted multiresolution texture features from wavelet coefficients and used them to discriminate masses from normal breast tissue on mammograms, with linear discriminant analysis classifying the ROIs as mass or non-mass. The method was tested on 168 ROIs containing biopsy-proven masses and 504 ROIs containing normal parenchyma, and achieved Az (area under the ROC curve) values of 0.89 and 0.86 for the training and test groups, respectively.

If texture is described accurately, texture descriptors can outperform other descriptors [24]. Lladó et al. [24] used a spatially enhanced Local Binary Pattern (LBP) descriptor, which is essentially a texture descriptor, to represent the textural properties of masses and to reduce false positives; the method achieved an overall accuracy of Az = 0.94 ± 0.02 (area under the ROC curve) on 512 ROIs (256 normal and 256 masses) extracted from mammograms in the DDSM database. The LBP-based method outperforms other CAD methods for mass detection, but the LBP descriptor builds statistics on local micro-patterns (dark/bright spots, edges, flat areas, etc.) and is not robust against noise. The scheme proposed by Sampaio et al. [36] used geo-statistical functions for extracting texture features and an SVM for classification, obtaining an accuracy of Az = 0.87.

Gabor wavelets are among the methods that have been used for texture description in various image processing and analysis approaches [17, 40]. Gabor filters decompose an image into multiple scales and orientations, which makes the analysis of texture patterns easy. Mammograms contain a lot of texture, so Gabor filters are suitable for their texture analysis as well [3, 35]. Texture description techniques using Gabor wavelets differ in the way the texture features are extracted. Gabor wavelets have also been used to extract features for mass detection [23, 45]. Zheng [45] employed Gabor filters to create 20 Gabor images, which were then used to extract a set of edge histogram descriptors, with KNN and fuzzy c-means clustering as the classifier. The method was evaluated on 431 mammograms (159 normal cases and 272 containing masses) from the DDSM database using tenfold cross validation, and achieved a true positive (TP) rate of 90 % at 1.21 false positives per image. The data set used for validation is biased toward abnormal cases, which surely favors the mass cases, so the evaluation cannot be regarded as fair. Moreover, the method extracts edge histograms, which are holistic descriptors and do not represent the local textures of masses.

Lahmiri and Boukadoum [23] used Gabor filters along with the discrete wavelet transform (DWT) for mass detection. They applied a Gabor filter bank at different frequencies and spatial orientations to the HH high-frequency sub-band image obtained with the DWT, and extracted statistical features (mean and standard deviation) from the Gabor images. For classification, they used an SVM with a polynomial kernel. The method was tested on 100 mammograms from the DDSM database using tenfold cross validation and achieved an accuracy of 98 %. Costa et al. [7] explored the use of Gabor wavelets together with principal component analysis (PCA) for feature extraction, independent component analysis (ICA) for efficient encoding, and linear discriminant analysis (LDA) for classification. The success rate of this method with Gabor wavelet feature extraction was 85.05 % on 5090 ROIs extracted from mammograms in the DDSM database.

Geralodo et al. [22] used Moran's index and Geary's coefficient as input features for an SVM classifier and tested their approach on two cases, i.e., normal vs. abnormal and benign vs. malignant region classification. They obtained an accuracy of 96.04 % and Az of 0.946 with Geary's coefficient, and an accuracy of 99.39 % and Az of 1 with Moran's index, for the classification of normal vs. abnormal cases. For the second case (benign vs. malignant), an accuracy of 88.31 % and Az of 0.804 with Geary's coefficient and an accuracy of 87.80 % and Az of 0.89 with Moran's index are reported. The method was tested on 1394 ROI images collected from the DDSM database using tenfold cross validation. In the work of Buciu et al. [21], raw magnitude responses of 2D Gabor wavelets are investigated as features for a proximal SVM. A total of 322 mammogram images from the Mammographic Image Analysis Society (MIAS) database are used for three experimental cases, i.e., discrimination among the three classes normal, benign, and malignant (using one-against-all SVM classification), normal vs. tumor (benign and malignant), and benign vs. malignant, with 80 % of the data used for training and 20 % for testing. The feature dimension in this case equals the number of pixels in the downsampled mammogram images (for a single Gabor filter); PCA is later used for dimensionality reduction. The best accuracies for the three experimental cases are 75, 84.37, and 78.26 %, respectively. To assess the robustness of the method, ROI images corrupted with quantum noise are used for feature extraction, and the method achieves results comparable (only a small decrease in recognition rate) to those on noise-free ROI images.

The aforementioned research works on 2D Gabor wavelets are mostly concerned with a generic (non-optimized) setting of the filters in the bank [1, 4, 21, 33, 45]. In this context, the main contributions of this paper are as follows:

  • Comparison of feature extraction methods for false positive reduction and benign-malignant classification.

  • A new Gabor feature extraction method named Statistical Magnitude Gabor Response (SMGR) is proposed which significantly reduces the feature size for classification.

  • The variants of the windows-based SMGR method (proposed in our earlier work [20]) are supported with two state-of-the-art feature reduction algorithms, which reduce erroneous predictions to a significant degree and thus make these methods very attractive for radiologists.

  • With tenfold cross-validation experiments, the methods are shown to perform either robustly or weakly when trained with different ratios of normal and abnormal ROIs.

  • Detailed experiments using common machine learning evaluation methodologies and measures, e.g., area under the ROC curve, sensitivity, specificity, and accuracy, are provided for a more general performance comparison.

3 Methods

In this section, we discuss the feature extraction strategies, one by one, for mass classification in digital mammograms. We review commonly used feature reduction techniques (in Sections 3.2 and 3.3), which we employ for extracting different types of Gabor features. We have observed in our experiments that different methods have quite different impacts on the recognition rate: some perform poorly, while others are extremely accurate. The section is organized as follows. First, a brief overview of the Gabor filter bank is provided. Then the feature transformation algorithms that help achieve better performance are discussed, followed by the feature extraction methods. In the final subsection, we review the SEL-based weighted support vector machine used for classification.

3.1 Gabor filter

Texture is an important part of the visual world of animals and humans, who can successfully detect, discriminate, and segment texture using their visual systems [32]. Textural properties in an image convey different kinds of information, e.g., micro-patterns such as edges, lines, spots, and flat areas. Masses in an ROI contain strong edges and local spatial patterns at different frequencies and orientations. These micro-patterns help a CAD system recognize cancerous regions. Gabor filters can effectively detect such micro-patterns, and this research work aims to validate this statement. A brief overview of Gabor filters is given next.

Gabor filters are biologically motivated convolution kernels [8] that have enjoyed wide usage in a myriad of computer vision and image processing applications, e.g., face recognition [44], facial expression recognition, iris recognition, optical character recognition, and vehicle detection [46]. To extract local/global spatial textural micro-patterns in ROIs, Gabor filters can be tuned with different orientations and scales, providing powerful statistics that can be very useful for breast cancer detection. The general function g(x, y) of the 2D Gabor filter family can be represented as a Gaussian kernel modulated by an oriented complex sinusoidal wave [46]:

$$ g\left(x,y\right)=\frac{1}{2\pi {\sigma}_x{\sigma}_y}.{e}^{\left[-\frac{1}{2}\left(\frac{{\tilde{x}}^2}{\sigma_x^2}+\frac{{\tilde{y}}^2}{\sigma_y^2}\right)\right]}.{e}^{\left(2\pi jW\tilde{x}\right)}. $$
(1)
$$ \tilde{x}=x. \cos \theta +y. \sin \theta \kern0.84em and\kern0.84em \tilde{y}=-x. \sin \theta +y. \cos \theta . $$
(2)

Here, $\sigma_x$ and $\sigma_y$ are the scaling parameters of the filter and describe the neighborhood of a pixel where the weighted summation takes place, W is the central frequency of the complex sinusoid, and θ ∈ [0, π) is the orientation of the normal to the parallel stripes of the Gabor function.

A generic strategy for constructing the Gabor filter bank is adopted from [26]. A bank of Gabor filters contains multiple individual Gabor filters adjusted with different parameters (scale, orientation, and central frequency). In this paper, different configurations of the Gabor filter bank are used, e.g., banks containing 6 filters (2 scales {S} × 3 orientations {O}), referred to as GS2O3, 15 filters (GS3O5), 24 filters (GS4O6), and 40 filters (GS5O8), with the initial maximum frequency set to 0.2 and the initial orientation set to 0. The orientations and frequencies for a bank are calculated using the following equations [46]:

$$ orientation(i)=\frac{\left(i-1\right)\pi }{O},\kern1em i=1,2,\dots, O\;\left( total\; orientations\right) $$
(3)
$$ frequency(i)=\frac{f_{\max }}{{\left(\sqrt{2}\right)}^{i-1}},\kern1em i=1,2,\dots, S\;\left( total\; scales\right),\kern1em {f}_{\max }=0.2. $$
(4)
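As an illustration, Eqs. (1)-(4) can be sketched in a few lines of numpy. The kernel size and the scaling parameters σx, σy below are illustrative assumptions, not values fixed by the paper:

```python
import numpy as np

def gabor_kernel(size, sigma_x, sigma_y, W, theta):
    """One filter g(x, y) from Eqs. (1)-(2)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_t = x * np.cos(theta) + y * np.sin(theta)    # Eq. (2)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    gauss = np.exp(-0.5 * (x_t ** 2 / sigma_x ** 2 + y_t ** 2 / sigma_y ** 2))
    wave = np.exp(2j * np.pi * W * x_t)            # complex sinusoid
    return gauss * wave / (2 * np.pi * sigma_x * sigma_y)

def bank_parameters(S, O, f_max=0.2):
    """Orientations (Eq. 3) and central frequencies (Eq. 4)."""
    thetas = [(i - 1) * np.pi / O for i in range(1, O + 1)]
    freqs = [f_max / np.sqrt(2) ** (i - 1) for i in range(1, S + 1)]
    return thetas, freqs

# GS5O8 bank: 5 scales x 8 orientations = 40 filters
thetas, freqs = bank_parameters(S=5, O=8)
bank = [gabor_kernel(31, 4.0, 4.0, f, t) for f in freqs for t in thetas]
print(len(bank))  # 40
```

Each (scale, orientation) pair yields one complex kernel, so the GS5O8 configuration produces 40 kernels with the first orientation at 0 and the first frequency at 0.2, as stated above.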

3.2 Principal component analysis

Principal component analysis (PCA, also known as the Karhunen-Loève transform) [11, 39] is a popular feature reduction technique that linearly projects a high-dimensional feature vector (e.g., a Gabor feature vector without class label, Eq. 13) onto a low-dimensional space whose components are uncorrelated. The low-dimensional space (eigenspace) is spanned by the principal components, which are linear combinations of the original dimensions. Given unlabeled Gabor feature vectors Γ i  ∈ ℝ J representing the ROIs, first an average Gabor feature vector ψ is computed over the N ROIs in the training data:

$$ \psi =\frac{1}{N}{\displaystyle \sum_{i=1}^N{\varGamma}_i}. $$
(5)

To ensure that the data samples have zero mean, the difference Φ i  = Γ i  − ψ of each Gabor feature vector from the average is calculated, and the covariance matrix C is estimated as follows:

$$ C\approx \frac{1}{N}{\displaystyle \sum_{i=1}^N{\varPhi}_i{\varPhi_i}^T=A{A}^T}. $$
(6)

Here, A = [Φ 1 Φ 2 ... Φ N ]. Since it is computationally intractable to find the J eigenvectors u i and eigenvalues of this high-dimensional covariance matrix C ∈ ℝ J × J for a typical ROI image size, the eigenvectors v i of the matrix A T A ∈ ℝ N × N are calculated first, where J ≫ N. The eigenvectors u i of the covariance matrix can then be obtained as follows [39]:

$$ {u}_i={\displaystyle \sum_{j=1}^N{v}_{ij}{\varPhi}_j} $$
(7)

Given a high-dimensional input Gabor feature vector Γ ∈ ℝ J, the mean is subtracted (Φ = Γ − ψ) and the projection onto the low-dimensional space is performed as follows:

$$ \overset{\sim }{\varPhi }={\displaystyle \sum_{i=1}^{R_k}{w}_i{u}_i,\kern1.32em \mathrm{where}\kern0.36em {w}_i={u_i}^T\varPhi }. $$
(8)

Here, the w i are the projection coefficients and R k denotes the first k-ranked eigenvectors, corresponding to the k largest eigenvalues. In our experiments, we used k = [5, 10, 15,..... All], where 'All' corresponds to all the eigenvectors.
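The eigenvector trick of Eq. (7) can be sketched as follows. This is a minimal illustration under assumed toy data (random vectors of WSMGR-like dimension 1080), not the paper's actual pipeline:

```python
import numpy as np

def pca_fit(Gamma, k):
    """Fit the eigenspace from N training vectors (rows of Gamma).

    Eigenvectors of the small N x N matrix A^T A are lifted to the
    J-dimensional space via Eq. (7), since J >> N.
    """
    N = Gamma.shape[0]
    psi = Gamma.mean(axis=0)                # Eq. (5): average vector
    A = (Gamma - psi).T                     # columns Phi_i = Gamma_i - psi
    small = A.T @ A / N                     # N x N surrogate of C (Eq. 6)
    vals, V = np.linalg.eigh(small)
    order = np.argsort(vals)[::-1][:k]      # k largest eigenvalues
    U = A @ V[:, order]                     # Eq. (7): lifted eigenvectors
    U /= np.linalg.norm(U, axis=0)          # normalize columns
    return psi, U

def pca_project(Gamma, psi, U):
    """Eq. (8): coefficients w_i = u_i^T (Gamma - psi)."""
    return (Gamma - psi) @ U

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1080))             # 50 toy ROIs, WSMGR-sized vectors
psi, U = pca_fit(X, k=10)
coeffs = pca_project(X, psi, U)
print(coeffs.shape)  # (50, 10)
```

Note that only an N × N eigenproblem is solved, even though the lifted eigenvectors live in the original J-dimensional feature space.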

3.3 Linear discriminant analysis

Linear discriminant analysis (LDA) [11, 15] is a supervised, linear-transformation-based feature reduction strategy. It projects a high-dimensional feature vector (e.g., a Gabor feature vector with class label, Eq. 13) onto a low-dimensional space such that the ratio between the inter-class (between-class) scatter S B and the intra-class (within-class) scatter S W is maximized. Using the same symbol definitions as in Section 3.2, these scatters are defined as follows for a multiclass classification problem with C class labels:

$$ {S}_B={\displaystyle \sum_{i=1}^C{N}_i\left({\varPsi}_i-\varPsi \right){\left({\varPsi}_i-\varPsi \right)}^T}. $$
(9)
$$ {S}_W={\displaystyle \sum_{i=1}^C{\displaystyle \sum_{x_k\in {y}_i}\left({x}_k-{\varPsi}_i\right){\left({x}_k-{\varPsi}_i\right)}^T}}. $$
(10)

Here, Ψ i is the average Gabor feature vector of class i, N i is the number of training samples belonging to class i, and x k is the kth instance of class i. We seek the optimal projection W optimal that maximizes the ratio between the inter-class scatter of the projected samples and the intra-class scatter of the projected samples:

$$ {W}_{optimal}=\underset{W}{ \arg \max}\frac{\left|{W}^T{S}_BW\right|}{\left|{W}^T{S}_WW\right|}. $$
(11)

For the selection of the most representative features in the projected space, the setting k = [5, 10, 15,..... All] is used, where k is the number of projected features used for classification. In our case, the number of training samples is much smaller than the number of features; the intra-class scatter matrix therefore tends to be singular, and the LDA computation becomes very demanding. To address this issue, PCA is used as a pre-processing step to project the original training data into a low-dimensional space before the LDA projection is performed.
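Equations (9)-(11) can be sketched as below. The two-class toy data and the dimensions are assumptions for illustration; the sketch presumes the input has already been PCA-reduced so that S_W is invertible, as described above:

```python
import numpy as np

def lda_fit(X, y, k):
    """Projection maximizing |W^T S_B W| / |W^T S_W W| (Eqs. 9-11)."""
    psi = X.mean(axis=0)
    d = X.shape[1]
    S_B, S_W = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - psi)[:, None]
        S_B += len(Xc) * diff @ diff.T            # Eq. (9)
        centered = Xc - Xc.mean(axis=0)
        S_W += centered.T @ centered              # Eq. (10)
    # generalized eigenproblem: eigenvectors of S_W^{-1} S_B
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(vals.real)[::-1][:k]
    return vecs[:, order].real                    # Eq. (11)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 8)),     # toy "mass absent" class
               rng.normal(2.0, 1.0, (30, 8))])    # toy "mass present" class
y = np.array([0] * 30 + [1] * 30)
W_opt = lda_fit(X, y, k=1)
scores = X @ W_opt                                # 1-D discriminant scores
```

For a two-class problem, S_B has rank one, so a single projected dimension already carries all the discriminative information.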

3.4 Feature extraction strategies

A detailed description of the six feature extraction methods follows.

Magnitude Gabor responses transformed with PCA and LDA

The feature extraction method presented in [21] produces Magnitude Gabor Responses (MGR) by applying Gabor filters to the entire ROI image and using the magnitude values of the filtered pixels directly as feature values, without any further post-processing. The feature dimension in this case equals the number of pixels in the ROI multiplied by the number of Gabor filters in the bank. For example, using MGR [21], the feature dimension is 40960 for an ROI resolution of 32 × 32 pixels with a bank of 40 Gabor filters. Clearly, such a huge dimension makes the classification task challenging due to the presence of several irrelevant and redundant Gabor responses. To handle this shortcoming of MGR, PCA has been used [21]. In addition to PCA, LDA can also be used to overcome the same problem. In this way, two feature extraction methods are formed: PCA_MGR, which uses PCA, and LDA_MGR, which uses LDA, to transform the features generated by MGR into a low-dimensional space.

First order statistics of magnitude Gabor responses

The third feature extraction strategy (proposed in this paper) further processes the MGR features: when a Gabor filter is applied to an ROI, the resulting magnitude Gabor responses are represented by only three statistical values (mean, standard deviation, and skewness); we therefore call it the Statistical Magnitude Gabor Response (SMGR) feature extraction strategy. SMGR reduces the dimension of the extracted features significantly compared with MGR and WSMGR (discussed below), while offering a recognition rate comparable to that of MGR. The feature dimension produced by SMGR is already low and therefore requires no further reduction.

When a Gabor filter is applied to a pixel, it generates a complex number with real and imaginary parts. The magnitude/absolute value of the complex number a + bi is calculated as follows:

$$ \left|a+ bi\right|=\sqrt{a^2+{b}^2}. $$
(12)

For SMGR, a single Gabor filter is applied to all the pixels of the ROI and the magnitude values are calculated. Then three statistical values (mean, standard deviation, and skewness) of the magnitudes are used as the features for that particular Gabor filter. The procedure is repeated for all the Gabor filters in the bank to generate the feature vector for the given ROI.
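The SMGR procedure can be sketched as follows; the tiny two-filter demo bank and the use of scipy's 2D convolution are illustrative assumptions:

```python
import numpy as np
from scipy.signal import convolve2d

def skewness(v):
    """Third standardized moment of a flat array."""
    m, s = v.mean(), v.std()
    return ((v - m) ** 3).mean() / s ** 3 if s > 0 else 0.0

def smgr_features(roi, bank):
    """SMGR: mean, standard deviation and skewness of the magnitude
    response of every Gabor filter applied to the whole ROI."""
    feats = []
    for g in bank:
        mag = np.abs(convolve2d(roi, g, mode='same'))  # |a+bi| (Eq. 12)
        feats.extend([mag.mean(), mag.std(), skewness(mag.ravel())])
    return np.array(feats)

# demo: a random ROI and a hypothetical 2-filter bank
y, x = np.mgrid[-3:4, -3:4].astype(float)
bank = [np.exp(-(x**2 + y**2) / 8) * np.exp(2j * np.pi * 0.2 * x),
        np.exp(-(x**2 + y**2) / 8) * np.exp(2j * np.pi * 0.2 * y)]
roi = np.random.default_rng(2).random((32, 32))
f = smgr_features(roi, bank)
print(f.shape)  # 2 filters x 3 statistics = (6,)
```

With a GS5O8 bank the same function would yield 40 × 3 = 120 features per ROI, regardless of the ROI resolution.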

The methods discussed so far are global feature extraction techniques; in the following, we discuss the local feature extraction techniques:

Windows based first order statistics of magnitude Gabor responses

In [20], moment-based magnitude values for groups of pixels in overlapping windows are used to construct the feature vector. Instead of applying the Gabor bank to the entire ROI image, the ROI is first segmented/partitioned into overlapping windows. In particular, each hypothesized ROI image is first divided into equal-sized square patches/blocks, and several overlapping windows are then formed by combining these patches (for details, see [20]). By increasing/decreasing the patch size, the ROI image can thus be partitioned into windows of different sizes and numbers. Feature extraction is performed by convolving the ROI windows with a Gabor filter bank; the method therefore generates Windows-based Statistical Magnitude Gabor Responses (WSMGR) for the textural patterns present in the ROIs. Gabor filters are applied to the overlapping windows of the ROI, and statistically representative values of the filtered pixels in these windows are used as feature values. A slightly modified design strategy has already been used for texture-based feature extraction [26, 46]: Bhangale et al. [4] apply a Gabor filter to an entire ROI, divide the filtered ROI into non-overlapping blocks, and use the mean and standard deviation of the pixel intensities in each block as feature values. For WSMGR, the ROI is first partitioned into overlapping windows (i.e., small regions, as shown in Fig. 1), a single Gabor filter is applied to all the pixels of a window, and the magnitude values are calculated. Then three statistical values (mean, standard deviation, and skewness) of the magnitudes are used as the features for that particular Gabor filter. The procedure is repeated for all the Gabor filters in the bank on all the windows of the ROI to generate the feature vector for the given ROI.
Partitioning the ROI prior to filtering makes the filtering process highly parallelizable; e.g., with multi-core CPUs and GPUs, multiple windows can be filtered in parallel [20].

Fig. 1

Segmentation of ROI in blocks and overlapping sub-windows (left to right)

The raw responses of the Gabor filter bank (complex values) can also be used directly as classification features (as in MGR), but usually some post-processing is performed to obtain the most representative features, e.g., Gabor energy features, thresholded Gabor features, and moment-based Gabor features [17, 46]. For WSMGR, the magnitude responses of each Gabor filter in the bank are collected from all windows and represented by three moments: the mean μ i,j , the standard deviation σ i,j , and the skewness k i,j (where i indexes the filter in the bank and j the window) [20].

The moments correspond to statistical properties of a group of pixels in a window; the exact positions of the pixels are discarded, which compensates for any errors that might occur during the extraction/segmentation of ROIs into overlapping windows. Suppose we use a Gabor bank of 40 filters (i.e., GS5O8); applying this bank to the nine windows [20] of a single ROI yields a feature vector of size 1080 plus one class label. A row feature vector of this form is shown below:

$$ \left[{\mu}_{1,1},{\sigma}_{1,1},{k}_{1,1},{\mu}_{2,1},{\sigma}_{2,1},{k}_{2,1},\dots, {\mu}_{40,1},{\sigma}_{40,1},{k}_{40,1},{\mu}_{1,2},{\sigma}_{1,2},{k}_{1,2},.\dots, {\mu}_{40,9},{\sigma}_{40,9},{k}_{40,9}, class\right]. $$
(13)
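The assembly of the Eq. (13) vector can be sketched as follows. The windowing scheme below (2 × 2-block windows stepping one block at a time, giving nine 16 × 16 windows for a 32 × 32 ROI) is our reading of [20], and the two-filter demo bank is an assumption:

```python
import numpy as np
from scipy.signal import convolve2d

def moments(v):
    """mu, sigma, k: the three moments used per filter per window."""
    m, s = v.mean(), v.std()
    sk = ((v - m) ** 3).mean() / s ** 3 if s > 0 else 0.0
    return m, s, sk

def overlapping_windows(roi, block):
    """Windows of 2x2 blocks, stepping one block, so neighbors overlap."""
    h, w = roi.shape
    return [roi[r:r + 2 * block, c:c + 2 * block]
            for r in range(0, h - block, block)
            for c in range(0, w - block, block)]

def wsmgr_features(roi, bank, block=8):
    feats = []
    for win in overlapping_windows(roi, block):       # window index j
        for g in bank:                                # filter index i
            mag = np.abs(convolve2d(win, g, mode='same'))
            feats.extend(moments(mag.ravel()))        # Eq. (13) ordering
    return np.array(feats)

y, x = np.mgrid[-3:4, -3:4].astype(float)
bank = [np.exp(-(x**2 + y**2) / 8) * np.exp(2j * np.pi * 0.2 *
        (x * np.cos(t) + y * np.sin(t))) for t in (0.0, np.pi / 2)]
roi = np.random.default_rng(3).random((32, 32))
f = wsmgr_features(roi, bank)
print(f.shape)  # 9 windows x 2 filters x 3 moments = (54,)
```

With the GS5O8 bank in place of the demo bank, the same loop yields the 9 × 40 × 3 = 1080 features of Eq. (13).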

WSMGR significantly reduces the dimension of the extracted features compared with MGR (without any feature reduction strategy), as shown in Table 1. The feature dimension of SMGR is lower than that of WSMGR; however, the recognition performance of WSMGR is better than that of SMGR, as discussed in Section 4. We further combine WSMGR with the two feature transformation strategies (PCA and LDA) to observe any performance gain. In this way, two more feature extraction methods are formed: PCA_WSMGR, which uses PCA, and LDA_WSMGR, which uses LDA, to transform the features generated by WSMGR into a low-dimensional space.

Table 1 Feature dimension for MGR, SMGR and WSMGR for different experimental configurations, without applying feature transformation strategies

3.5 Classification

In this paper, we investigate the Successive Enhancement Learning based weighted Support Vector Machine (SELwSVM) for the classification of tumors present in the ROIs. In our case, we deal with a binary classification problem where the target is to build a classification model that accurately labels unseen data as belonging to either the 'mass present' or the 'mass absent' class. SVM classifiers [41] are among the most advanced and are generally designed to solve binary classification problems, so they suit our requirements perfectly. The only difference between SELwSVM [12, 27] and a standard SVM lies in the selection of the training samples used during the training phase, based on a weighting scheme assigned to the class labels. SELwSVM uses a subset of the entire training data to build the classifier and assigns unequal weights to the class labels (e.g., based on their frequencies), whereas a standard SVM exploits the complete training data and assigns equal weights to each class label. With this difference in mind, we first discuss the weighting scheme for the class labels used in our work, along with the successive enhancement learning strategy, followed by a brief review of the SVM classifier.

SELwSVM [12, 27] is recommended when classifying highly skewed datasets. In general, when extracting ROIs from different locations of mammograms, most of these ROIs are labeled as “mass absent” and only a few belong to the “mass present” class, resulting in a highly unbalanced dataset. This property of the mass classification dataset makes SELwSVM an ideal approach to investigate for the classification task. Moreover, misclassification of “mass present” cases is more dangerous and has severe consequences in terms of casualties. Hence, the accuracy of the “mass present” class is more important, and misclassification of this class should be penalized more heavily than that of the “mass absent” class. This can be achieved by assigning a higher weight to the “mass present” class, which in turn imposes a higher penalty for misclassifying samples belonging to it. We adopt the weighting scheme used in [27], which sets the ratio of penalties for the two classes to the inverse ratio of the training class sizes and gives the same weight to all samples belonging to the same class. The weight of each class is given as:

$$ \frac{W_1}{L_2}=\frac{W_2}{L_1},\qquad W_1+W_2=1. $$
(14)

Here, W 1 and W 2 denote the class weights and L 1 and L 2 the numbers of instances in the majority and minority classes, respectively. A potential concern when dealing with a highly skewed dataset is whether the randomly selected training samples are well representative of the majority class [12]. To address this issue, we use the successive enhancement learning strategy, whose basic idea is to iteratively select the most representative “mass absent” examples from all the available training images while keeping the total number of training examples small [12]. This method of learning resembles the bootstrap technique [27] and has been shown to improve the generalization performance of SVM [12, 27]. The pseudo code of SEL is given as follows; for further details, readers are referred to [12], where it is discussed how choosing “difficult” training samples from the majority class actually improves the recognition rate.

Input: Training data (Gabor textural features for the ROIs with the labels)

Output: Classification model

Select randomly an initial set of training examples ‘Z’ from the available training data

Classification model = Train the weighted SVM classifier with ‘Z’

REPEAT

Apply the Classification model to all the mammogram regions (except those already present in ‘Z’)

Record the “mass absent” locations that have been misclassified as “mass present”

Collect ‘N’ new input examples (randomly) from the misclassified “mass absent” locations

Update the set ‘Z’ by replacing ‘N’ “mass absent” examples that have been classified correctly by the weighted SVM with the newly collected “mass absent” examples

Classification model = Re-train the weighted SVM classifier with the updated set ‘Z’

UNTIL convergence is achieved (i.e., accuracy does not improve in three consecutive iterations)

Algorithm 1. Successive enhancement learning algorithm
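The SEL procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration assuming scikit-learn; the synthetic data, the SVM parameters and the replacement size N are placeholders, not the values used in our experiments. The class weights follow Eq. (14).

```python
# Sketch of successive enhancement learning (SEL) with a weighted SVM.
# Assumes scikit-learn; data and parameters are illustrative placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy skewed dataset: many "mass absent" (label 0), few "mass present" (label 1).
X = rng.normal(size=(500, 10))
y = np.zeros(500, dtype=int)
y[:40] = 1
X[y == 1] += 1.5  # shift the minority class so it is learnable

# Class weights per Eq. (14): W1/L2 = W2/L1, W1 + W2 = 1.
L1, L2 = np.sum(y == 0), np.sum(y == 1)          # majority, minority sizes
W1, W2 = L2 / (L1 + L2), L1 / (L1 + L2)          # minority gets the larger weight

def train(idx):
    clf = SVC(kernel="rbf", C=10.0, gamma=0.1, class_weight={0: W1, 1: W2})
    clf.fit(X[idx], y[idx])
    return clf

# Initial random training set Z: all minority samples plus a few majority ones.
Z = np.concatenate([np.where(y == 1)[0],
                    rng.choice(np.where(y == 0)[0], 60, replace=False)])
clf = train(Z)

for _ in range(10):  # REPEAT ... UNTIL accuracy plateaus (simplified stop rule)
    rest = np.setdiff1d(np.arange(len(y)), Z)
    pred = clf.predict(X[rest])
    # "mass absent" regions misclassified as "mass present" (hard examples)
    hard = rest[(y[rest] == 0) & (pred == 1)]
    if hard.size == 0:
        break
    n = min(10, hard.size)                        # N new examples per iteration
    new = rng.choice(hard, n, replace=False)
    # Replace correctly classified majority samples in Z with the hard ones.
    easy = Z[(y[Z] == 0) & (clf.predict(X[Z]) == 0)]
    drop = rng.choice(easy, min(n, easy.size), replace=False)
    Z = np.concatenate([np.setdiff1d(Z, drop), new])
    clf = train(Z)
```

The stopping rule here is simplified to a fixed iteration cap with early exit; the paper's criterion (no accuracy improvement in three consecutive iterations) would track validation accuracy across iterations instead.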

Considering the learning scheme of SVM, the aim is to find an optimal hyper-plane that can separate the data belonging to different classes with large margins in high dimensional space [5]. The margin is defined as the sum of distances to the decision boundary (hyper-plane) from the nearest points (support vectors) of the two classes. SVM formulation is based on statistical learning theory and has attractive generalization capabilities in linear as well as non-linear decision problems [6, 41]. SVM uses structural risk minimization as opposed to empirical risk minimization [41] by reducing the probability of misclassifying an unseen pattern drawn randomly from a fixed but unknown distribution.

Let D = {(x_i, y_i)}, i = 1, …, N ⊂ R^J × {+1, −1} be a training set, where x_i is the ith training instance containing J features and y_i ∈ {+1, −1} is the class label of x_i. Finding an optimal hyper-plane in the large-margin framework implies solving a constrained optimization problem using quadratic programming; the resulting decision function can be stated as:

$$ f(x)=\sum_{i=1}^{N}\alpha_i y_i k\left(x_i,x\right)+b $$
(15)

Where α_i > 0 are the Lagrange multipliers, k(x_i, x) is the kernel function, and the sign of f(x) gives the class membership of x. For linearly separable problems (linear SVM), the kernel function is simply the dot product of the two given points in the input space. For non-linear SVMs, however, the original input space is mapped to a higher dimensional space through a non-linear mapping function (possibly making the data linearly separable), using a suitable kernel defined (for computational efficiency) as a dot product in the new space and satisfying Mercer’s condition [41]. In this formulation, the misclassification penalty or error is controlled by a user-defined parameter C (the regularization parameter, controlling the tradeoff between the SVM error and margin maximization), which is tied to the kernel. Several kernels are available, e.g., linear, polynomial, sigmoid and radial basis function (RBF). In our experiments, the RBF kernel is used, given by:

$$ k\left({x}_i,x\right)= \exp \left(-\gamma {\left\Vert {x}_i-x\right\Vert}^2\right),\gamma >0. $$
(16)

Here γ is the width of the kernel function. Two parameters are now tied to the RBF kernel: γ and C. Tuning these parameters in an attempt to find a better hypothesis is called the model selection procedure. For model selection, we first perform a loose grid search (coarse, for computational efficiency) to find a promising region in the parameter space. A finer grid search is then conducted in the region found by the loose search. This model selection procedure is recommended in the work of Chih-Wei Hsu et al. [18]. The selected parameters are fed into the kernel and the SVM is finally applied to our data sets. A detailed discussion of the statistical formulation and computational aspects of SVM can be found in the work of Vapnik [41].
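The coarse-to-fine grid search described above can be sketched as follows, assuming scikit-learn. The dataset and the grid ranges are illustrative (exponential scales as recommended by Hsu et al. [18]), not the exact grids used in our experiments.

```python
# Coarse-to-fine grid search over (C, gamma) for an RBF-kernel SVM.
# Assumes scikit-learn; data and grid ranges are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Loose (coarse) grid on exponential scales, e.g., C = 2^-5 ... 2^15.
coarse = {"C": 2.0 ** np.arange(-5, 16, 4),
          "gamma": 2.0 ** np.arange(-15, 4, 4)}
gs = GridSearchCV(SVC(kernel="rbf"), coarse, cv=5).fit(X, y)
C0, g0 = gs.best_params_["C"], gs.best_params_["gamma"]

# Finer grid centered on the best coarse region.
fine = {"C": C0 * 2.0 ** np.arange(-2, 2.5, 0.5),
        "gamma": g0 * 2.0 ** np.arange(-2, 2.5, 0.5)}
gs = GridSearchCV(SVC(kernel="rbf"), fine, cv=5).fit(X, y)
best = gs.best_estimator_  # SVM with the selected (C, gamma)
```

The coarse pass keeps the number of cross-validated fits small; only the neighborhood of its winner is searched at finer resolution.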

4 Results & discussion

In this section, the experimental results for all the feature extraction strategies (discussed in Section 3) are presented and discussed in detail. We conducted experiments for two problems: false positive reduction, i.e., classifying ROIs into normal and mass (benign + malignant), and the classification of mass ROIs into benign and malignant. First, an overview of the database used for the validation of the methods is given. Then, an empirical evaluation of the methods on the two diagnosis problems is carried out. The extracted ROIs are of different sizes; to process them with the Gabor filter bank, it is necessary to resize them to the same resolution. We tested three resolutions: 128 × 128, 64 × 64 and 32 × 32. For extracting features (based on WSMGR), each ROI can be partitioned into blocks of different sizes for defining overlapping windows; we tested three block sizes: 32 × 32, 16 × 16 and 8 × 8. Afterwards, we perform a statistical comparison of the methods (in terms of recognition rate and area under the ROC curve) using the non-parametric Friedman test with the Holm post-hoc test [9, 16] to see whether the differences in performance between the methods are statistically significant.
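The resizing and block-partitioning step can be sketched as follows, assuming NumPy; the resampler and the window step are illustrative stand-ins (the Gabor filtering itself is omitted), while the resolutions and block sizes match the configurations listed above.

```python
# Sketch: resize ROIs to a common resolution, then partition into
# (possibly overlapping) windows. Assumes NumPy; resampler is illustrative.
import numpy as np

def resize_nearest(roi, size):
    """Nearest-neighbour resize to size x size (stand-in for any resampler)."""
    r = np.arange(size) * roi.shape[0] // size
    c = np.arange(size) * roi.shape[1] // size
    return roi[np.ix_(r, c)]

def blocks(img, bsize, step=None):
    """Yield bsize x bsize windows; step < bsize gives overlapping windows."""
    step = step or bsize
    for i in range(0, img.shape[0] - bsize + 1, step):
        for j in range(0, img.shape[1] - bsize + 1, step):
            yield img[i:i + bsize, j:j + bsize]

roi = np.random.rand(200, 170)          # ROIs come in varying sizes
roi64 = resize_nearest(roi, 64)         # tested: 128x128, 64x64, 32x32
wins = list(blocks(roi64, 16, step=8))  # tested block sizes: 32, 16, 8
```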

4.1 Database & evaluation methodology

The mammogram images used in our experiments are taken from the Mammographic Image Analysis Society (MIAS) [25] database; this database consists of more than 2000 cases and is commonly used as a benchmark for testing new proposals dealing with the processing and analysis of mammograms for breast cancer detection. Each case in this database is annotated by expert radiologists; the complete information is provided as an overlay file. The locations of masses in mammograms specified by the experts are encoded as chain codes. We randomly selected 109 cases from the database. Using the chain codes, we extracted 20 ROIs containing true masses; the sizes of these ROIs vary with the sizes of the mass regions. In addition, we extracted 54 ROIs containing normal but suspicious tissue and 35 benign ROIs. Some sample ROIs are shown in Fig. 2.

Fig. 2
figure 2

(top row) Normal but suspicious ROIs (middle row) Benign mass ROIs (bottom row) Malignant mass ROIs

The evaluation of the methods is performed using tenfold cross validation and area under the ROC curve (Az value) analysis. In particular, a data set is randomly partitioned into ten non-overlapping and mutually exclusive subsets. For the experiment of fold i, subset i is selected as the testing set and the remaining nine subsets are used to train the classifier. Using tenfold cross validation, the performance of the methods can be checked against any selection bias in the samples used for the training and testing phases. It also helps in determining the robustness of the methods when tested over different ratios of normal and abnormal ROIs in the training and testing sets (due to the random selection, the ratios will differ). The SVM classifier gives a membership value for each class when an unknown pattern is presented to it. The ROC (receiver operating characteristic) curve can be obtained by varying the threshold on this membership value. The area under the ROC curve (Az) is used as a performance measure. The other commonly used evaluation measures are accuracy or recognition rate (RR) = (TP+TN)/(TP+FP+TN+FN), sensitivity (Sn) = TP/(TP+FN) and specificity (Sp) = TN/(TN+FP), where TN is the number of true negatives, TP the number of true positives, FP the number of false positives and FN the number of false negatives.
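The evaluation measures above can be computed as follows; this is a small illustration with toy labels and membership scores (not our experimental data), assuming scikit-learn's `roc_auc_score` for the Az value.

```python
# Evaluation measures of Section 4.1 on toy labels and SVM membership scores.
# Assumes scikit-learn; the numbers below are purely illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])            # 1 = mass present
scores = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05])  # membership values
y_pred = (scores >= 0.5).astype(int)                    # one operating threshold

TP = np.sum((y_true == 1) & (y_pred == 1))
TN = np.sum((y_true == 0) & (y_pred == 0))
FP = np.sum((y_true == 0) & (y_pred == 1))
FN = np.sum((y_true == 1) & (y_pred == 0))

RR = (TP + TN) / (TP + FP + TN + FN)   # accuracy / recognition rate
Sn = TP / (TP + FN)                    # sensitivity
Sp = TN / (TN + FP)                    # specificity
Az = roc_auc_score(y_true, scores)     # area under the ROC curve (all thresholds)
```

Note that RR, Sn and Sp depend on the chosen threshold, whereas Az integrates over all thresholds on the membership value.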

4.2 False positive reduction

In this section, the classification for the diagnosis case (suspicious normal vs. masses) is investigated based on the proposed method. Mass ROIs are of two types: 1) benign and 2) malignant. Discriminating between the normal and mass ROIs is difficult mainly because the benign ROIs are structurally close to both the normal and the malignant ROIs. One major point in favor of WSMGR is its low-dimensional feature space compared to the huge space generated under the feature extraction strategy of MGR [21]; SMGR, on the other hand, results in the smallest feature size (see Table 1). Note that the block size is relevant only to the WSMGR strategy, as given in Table 1. In Tables 2 and 3, experimental results are given for this diagnosis case (normal vs. masses) for all the feature extraction strategies (discussed in Section 3) based on the performance measures accuracy and Az, respectively. It can easily be observed that the average number of features generated using WSMGR is substantially smaller than the number produced with MGR, as given in Table 1. This reduction in feature space not only makes the WSMGR method computationally less demanding but also improves the recognition rate of SELwSVM, as discussed below. For convenience and ease of reference, dataset names D1 to D12 are assigned to the different experimental configurations.

Table 2 Performance of feature extraction strategies over different resolutions of ROIs using tenfold cross validation based on accuracy (normal vs. masses)
Table 3 Performance of feature extraction strategies over different resolutions of ROIs using tenfold cross validation based on Az. (normal vs. masses)

The performance of LDA_WSMGR, in terms of all the reported average performance measures, is better than that of all the other feature extraction strategies, as can be observed from Figs. 3 and 4 and Tables 2 and 3. In fact, WSMGR (without any feature transformation strategy) is better than PCA_MGR and LDA_MGR in terms of average accuracy and average Az values across all the datasets. SMGR performs very poorly, with (11.80 ± 12.01) average sensitivity and (0.11 ± 0.11) average Az value, and therefore cannot be considered a robust and reliable method for the false positive reduction problem. The best case results are highlighted for each algorithm in Tables 2 and 3. We can see that LDA_WSMGR obtained an accuracy of 100 % with Az = 1 for D7, which corresponds to the experimental configuration of 64 × 64 ROI resolution, 16 × 16 WSMGR block size and the bank with 24 filters denoted GS4O6. This shows the importance of exploiting the discriminative power of features based on the statistics of the class labels associated with the data samples (as done by LDA). The average performance of LDA is observed to be consistent (on all performance measures), and it always gives the best results.

Fig. 3
figure 3

Performance of feature extraction strategies over all the datasets (D1-D12) using tenfold cross validation based on sensitivity (normal vs. masses)

Fig. 4
figure 4

Performance of feature extraction strategies over all the datasets (D1-D12) using tenfold cross validation based on specificity (normal vs. masses)

4.3 Discrimination of benign and malignant

This section summarizes the results for another difficult classification problem, i.e., the discrimination between benign and malignant masses. The discrimination task is relatively hard in this case due to the highly similar patterns and structures of the two classes (benign and malignant) present in the selected digital mammograms. Figures 5 and 6 show the comparison of the methods based on sensitivity and specificity. For the experimental case (benign vs. malignant), the best average percentage accuracy of (100.00 ± 0.00) and the best average Az value of (1.000 ± 0.000) correspond to a number of different configurations of input resolution, block size and Gabor bank, as shown in Tables 4 and 5 for LDA_WSMGR. LDA_WSMGR is again observed to be more accurate (in terms of all performance measures) and more consistent than its five competitors and is thus recommended for the mass classification problem. All the algorithms (except LDA_WSMGR) resulted in almost the same performance, with slight differences in the average accuracy and Az values observed across the datasets. Note also that the experimental configurations play an important role in achieving results with varying performance levels; e.g., the Az value of PCA_WSMGR is 0.371 ± 0.391 for D2 (32 × 32 ROI resolution, 8 × 8 WSMGR block size and the bank denoted GS3O5), which improves to 0.758 ± 0.325 using the same feature extraction strategy in the case of D5 (64 × 64 ROI resolution, 16 × 16 WSMGR block size and the bank denoted GS2O3).

Table 4 Performance of feature extraction strategies over different resolutions of ROIs using tenfold cross validation based on accuracy (benign vs. malign)
Table 5 Performance of feature extraction strategies over different resolutions of ROIs using tenfold cross validation based on Az. (benign vs. malign)
Fig. 5
figure 5

Performance of feature extraction strategies over all the datasets (D1-D12) using tenfold cross validation based on sensitivity (benign vs. malign)

Fig. 6
figure 6

Performance of feature extraction strategies over all the datasets (D1-D12) using tenfold cross validation based on specificity (benign vs. malign)

We have empirically shown that the WSMGR method is better than MGR in terms of feature complexity and discrimination power for both diagnosis cases (false positive reduction and discrimination of benign and malignant). Moreover, when WSMGR is compared with SMGR, both give the same performance for the (benign vs. malignant) case. However, the performance of SMGR is poor for the (normal vs. masses) case, and therefore this feature extraction strategy cannot be considered a robust method for the mass classification problem in general. WSMGR is a robust method for feature extraction and gives better results; it can be observed, however, that its performance is still poor in terms of average sensitivity and Az and therefore needs further refinement. As a refinement of WSMGR, its variant LDA_WSMGR (which uses the LDA transformation strategy) significantly improved the results and achieved a 100 % recognition rate for both diagnosis cases. In the next subsection, we elaborate on the statistical differences between the performance of the feature extraction methods based on average accuracy and Az values.

4.4 Discussion based on statistical comparison

In this study, to test whether the LDA_WSMGR based feature extraction method performs significantly better than the other five competitors, a non-parametric statistical test (Friedman) is conducted. The Friedman test is chosen because it does not assume that the underlying data are normally distributed (a requirement of the equivalent parametric tests), and because it is the recommended test for comparing a set of classification strategies over multiple performance values, according to the guidelines presented in [9, 16]. Table 6 presents the summary of the comparisons of the LDA_WSMGR feature extraction algorithm (the algorithm with the best average rank, taken as the control algorithm) with the remaining algorithms according to the non-parametric Friedman test with Holm’s post-hoc test [9, 16], in terms of percentage accuracy and Az values for all the datasets, as given in Tables 2, 3, 4 and 5 for the two diagnosis cases.

Table 6 Summary of the comparisons of the LDA_WSMGR feature extraction algorithm with the remaining algorithms according to the non-parametric Friedman test with the Holm’s post-hoc test in terms of (i) Az. value and (ii) percentage accuracy

For each algorithm, we report the average rank (the lower the average rank, the better the algorithm’s performance), the p-value obtained when its average rank is compared with that of the control algorithm (LDA_WSMGR, the algorithm with the best rank) and the Holm critical value obtained by Holm’s post-hoc test. Entries in a row are shown in bold when the p-value is lower than the critical value (at the 5 % significance level), i.e., when there is a significant difference between the average ranks of the algorithm and the control algorithm. The rows containing bold entries thus indicate the algorithms that the control algorithm has significantly outperformed.

According to the statistics of Table 6, LDA_WSMGR performs statistically significantly better than all of its competitors in terms of both average percentage accuracy and Az values in all of the experiments presented in Tables 4 and 5 for the benign vs. malignant classification problem. The situation is almost the same for the normal vs. masses case, where LDA_WSMGR is better than all of its competitors (except PCA_WSMGR) based on both average percentage accuracy and Az values. For the normal vs. masses classification problem, PCA_WSMGR seems to perform comparably to LDA_WSMGR, given its large p-value for both performance measures. Interestingly, based on the average ranking for both performance measures, WSMGR is placed 3rd for the normal vs. masses case, but for the benign vs. malignant case WSMGR performs the worst and is thus placed last. The difference between the performance of LDA_WSMGR and SMGR is highly significant, with the smallest p-values for the normal vs. masses case. With these statistics, LDA_WSMGR can be considered an attractive choice for the targeted problems. LDA_WSMGR greatly improves the performance of the proposed system and also significantly reduces the feature dimension (compared to MGR), which helps cope with the curse of dimensionality and offers better generalization ability for a classification scheme such as SELwSVM.
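The Friedman test with Holm's post-hoc correction used above can be sketched as follows, assuming SciPy; the score matrix is randomly generated (6 methods × 12 datasets) purely for illustration and does not reproduce the paper's Az values.

```python
# Sketch: Friedman test over k related methods, then Holm's step-down
# post-hoc comparison against the best-ranked (control) method.
# Assumes SciPy; the score matrix is illustrative, not the paper's data.
import numpy as np
from scipy.stats import friedmanchisquare, norm, rankdata

rng = np.random.default_rng(1)
k, n = 6, 12                          # number of methods and of datasets
scores = rng.random((n, k))           # e.g., Az values per (dataset, method)
scores[:, 0] += 0.5                   # bias method 0 to act as the "control"

# Global Friedman test over the k related samples.
stat, p = friedmanchisquare(*[scores[:, j] for j in range(k)])

# Average ranks per method (rank 1 = best score on a dataset).
R = np.mean([rankdata(-scores[i]) for i in range(n)], axis=0)
control = int(np.argmin(R))           # best average rank = control algorithm

# Pairwise z-statistics vs. the control, then Holm's step-down procedure.
se = np.sqrt(k * (k + 1) / (6.0 * n))
others = [j for j in range(k) if j != control]
pvals = np.array([2 * norm.sf(abs(R[j] - R[control]) / se) for j in others])
order = np.argsort(pvals)
reject = [False] * len(others)
for i, idx in enumerate(order):       # compare sorted p-values to alpha/(m - i)
    if pvals[idx] <= 0.05 / (len(others) - i):
        reject[idx] = True
    else:
        break                         # Holm stops at the first non-rejection
```

The `reject` flags correspond to the bold entries of Table 6: methods whose average rank differs significantly from that of the control algorithm at the 5 % level.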

4.5 Comparison with other methods

It is rather difficult to meaningfully compare the proposed method with other methods in the literature due to many factors. For example, which mammogram database was used for evaluation? Given that the same database was employed, were the same sample of mammograms selected for evaluation? How many samples were used? Which evaluation approach (validation methodology, training and testing set formation with different percentages of ROIs) was used? Were the ratios of ROIs for different classes (e.g., normal, malignant and benign) the same? Even if other methods are implemented and evaluated on the same dataset, it might still not be a fair comparison because the tuning of parameters involved in different methods are not necessarily the same.

In any case, to give a general trend of the performance of our method (LDA_WSMGR) and compare it with state-of-the-art methods in terms of accuracy and Az, we have compiled information from various studies, as shown in Table 7. Quantities that are not reported in the literature are indicated with a dash; for some methods, standard deviation values are not available. For the two problems (i.e., normal vs. masses and benign vs. malignant), only the best case mean and standard deviation results are reported for all the methods being compared. For the first problem (normal vs. masses), the proposed method performs better than all the other methods. Note that some entries in Table 7 represent the maximum (max) accuracy and Az values as reported in the respective papers, where the mean and standard deviation of each measure are not given.

Table 7 Comparison with state-of-the-art methods based on average Acc. and Az values

For the second problem (benign vs. malignant), the proposed method also outperforms all reported methods. In general, the proposed method performs better than state-of-the-art techniques for the two classification problems. Note that Ioan et al. [21], Daniel et al. [7] and Hussain [45] also used Gabor filter banks for the description of masses but their descriptors are different; in the first two methods, the descriptors are global while in the third method, the descriptor is local.

5 Conclusions

In this research article, we have discussed and compared six different directional feature extraction methods for the mass classification problem. These methods use a bank of Gabor filters to extract features from the textural micro-patterns (present in the ROIs) at different scales and orientations. The features extracted with the LDA_WSMGR strategy are shown to best discriminate between the three tissue types (normal, benign and malignant masses) used in the experiments and, in general, to improve the recognition rate of a breast cancer detection system significantly.

The comparison based on the Friedman statistical test reveals that the LDA_WSMGR method is statistically significantly better than its competitors at the 5 % significance level. All of the feature extraction methods are evaluated over ROI images extracted from the MIAS database, using an application-oriented fitness function based on the successive enhancement learning based weighted Support Vector Machine (SELwSVM) to handle the skewed/unbalanced dataset problem. The two state-of-the-art feature transformation algorithms reduced the dimension of the feature space remarkably. With a compact feature space, the recognition rate for cancerous tissues in digital mammograms improved. Model compactness implies a low-dimensional feature space, from which better computational efficiency and better generalization of the classification model are expected, and were indeed observed.

The methods are empirically analyzed over two diagnosis cases, i.e., discrimination between: (i) normal but suspicious tissue and masses (malignant and benign) and (ii) benign and malignant masses. For the two diagnosis cases, we achieved encouraging results, reported as (percentage mean accuracy, mean area under the ROC curve over tenfold cross validation): (100 %, 1.00) for normal vs. masses and (100 %, 1.00) for benign vs. malignant, based on LDA_WSMGR. LDA_WSMGR thus has the potential to be further explored in more complex recognition tasks related to the breast cancer detection problem. LDA_WSMGR is shown to outperform other state-of-the-art methods available in the literature and thus offers promising capabilities.

There are several future avenues for extending the LDA_WSMGR technique. It will be interesting to investigate its performance in more complex problem scenarios, e.g., the recognition and identification of further breast abnormalities such as micro-calcifications and architectural distortion. The preprocessing of mammogram images to enhance their quality is also an area that requires further investigation. The LDA_WSMGR method should also be tested over noisy mammogram images to assess its robustness. Other optimization strategies, e.g., Genetic Algorithms and Cuckoo optimization, are worth investigating for Gabor filter parameter optimization in the targeted area.