1 Introduction

Breast cancer is one of the leading causes of death in women, especially in Asia and also worldwide [26]. It is ranked as the second most dangerous cancer overall. An estimated 2.6 million or more new breast cancer cases will be diagnosed in 2019 [4].

Among all cancers, breast cancer is the most perilous disease in women and can be treated only when detected at its initial stage [52]. Breast cancer arises from abnormal growth of cells in body tissue, linked to changes in their genes [48]. Mammography is one of the most widely used imaging modalities [30] and plays an important role in the clinical examination for the diagnosis of breast disease [57]. Regular screening of mammograms for the detection of masses and of microcalcifications, i.e. small deposits of calcium in the breast which appear as small bright spots [2], is essential [6, 60]. During diagnosis, the identification of suspicious lesions, which may be small or ill-defined, is also important [8], and these lesions need to be segmented. Therefore, segmentation plays an important role in the diagnosis of breast cancer [19, 20, 49, 54].

Initially, mammogram images are manually diagnosed by the Radiologist and are referred for biopsy for final diagnosis. Due to the large size of mammogram databases, manual tracing of the lesions in all the images is tedious and can lead to observation errors [12]. To overcome this limitation, the opinion of more than one expert can be obtained; however, this can in turn lead to inter-observer variability [10, 37, 43]. To address these issues, computer-aided diagnosis (CAD) systems were introduced for segmentation and classification of the lesions, which can provide a second opinion to the Radiologist [7, 20, 21].

Segmentation systems are consistent [19, 32, 36, 53] and can thus provide a high level of accuracy. Segmentation is an important initial step of the CAD system [49]. In CAD systems, existing techniques such as expectation maximization (EM) [45, 47, 55], K-means (KM) [13, 17, 22, 31, 44], Fuzzy c-Means (FCM) [5, 15, 16, 23, 42, 44, 46], multilevel thresholding (MT) [18, 34, 38, 51], region growing (RG) (single seed and multi-seed) [1, 14, 28, 59], and particle swarm optimization (PSO) [9, 29, 34] are found to be fast and robust. These techniques are free of tuning parameters, can be easily manipulated with small storage space, and are very popular in solving statistical estimation problems [19].

Many studies in the literature have employed these segmentation techniques for the detection of masses in mammogram images. In 2015, Neto et al. [29] presented a segmentation technique based on Otsu thresholding and PSO. False positive reduction was also performed by graph clustering and an area filter. The study achieved 95% segmentation correctness for 100 images from the DDSM (Digital Database for Screening Mammography) dataset. Besides successfully segmenting masses, the technique also detected small false regions, which should be minimized. Raja et al. [34] proposed an improved PSO with multilevel thresholding using Otsu's function. With image histogram analysis from a chosen RGB image data set, a threshold from Otsu's function is processed with Improved Particle Swarm Optimization (IPSO), PSO [35], and Darwinian Particle Swarm Optimization (DPSO) techniques. IPSO demonstrated results with less computational time as compared to the other two techniques (0.037 min as compared to DPSO and 0.053187 min as compared to PSO). In the same year, Elmoufidi [13] proposed a hybrid segmentation using K-means and region growing to detect different breast tissue regions in mammograms. Using the MIAS database, the proposed method obtained a mean precision of 92.87% on 170 mammograms.

In 2016, Nurhasanah et al. [31] proposed a method combining fractal 2D Fourier analysis with K-means clustering. Fractal analysis was used to identify the density of normal and abnormal tissues, and K-means clustering was used to determine the area of lesions by segmenting the boundary of abnormal tissue. The fractal dimension and intercept value were interpreted with limits of 1 and 32, respectively, to distinguish normal from abnormal mammograms. K-means showed a better result for computing the area of an abnormal mammogram, characterized by a fractal dimension of less than 1 and an intercept value greater than 32.

In 2017, Vedanarayanan et al. [55] proposed an efficient technique for segmenting the breast cancer region with improved expectation maximization and a modified snake method. For feature extraction, abnormality, region, and boundary features were used, and a backpropagation network was employed to classify the abnormality.

Recently, in 2018, Sadad [42] proposed a CAD system by cascading Fuzzy c-Means and region growing for segmentation and pre-processing, respectively. Decision Tree and KNN classifiers were then applied to both MIAS and DDSM databases with hybrid features. For the MIAS and DDSM databases, the system achieved accuracies of 98.2% and 95.8%, respectively. In the same year, Melouah [28] studied two region growing methods on mammogram and IRM (Integrated Region Matching) datasets: (i) thresholding based region growing [1] and (ii) features similarity based region growing [59]. The methods were applied to 28 mammogram images acquired from the mini Mammographic Image Analysis Society (MIAS) dataset and 20 IRM images acquired from the RIDER (Reference Image Database to Evaluate Therapy Response) dataset. In this study, both region growing methods showed good results on mammographic images, but thresholding based region growing showed better results on IRM images than features similarity based region growing. It was observed that the thresholding based method (SSTT) does not give good results with high-intensity regions such as artifacts and labels. Also, in the features similarity based method (SSFS), the reference values must be chosen very carefully, and at the center of the mass region.

As mentioned above, many studies were performed on small datasets, and a comparison of the techniques on a common dataset was missing. Further, none of the previous studies performed a detailed performance evaluation and statistical analysis of the obtained results. Benchmarking against previous studies was also missing in many cases. All these limitations motivate us to implement and compare the above-mentioned segmentation techniques on a single dataset to identify the best technique for the detection of breast masses in mammograms, which is the novelty of the proposed work. In this study, performance is evaluated using the Jaccard Index (JI) and the Dice Similarity Coefficient (DSC). Bland Altman and regression plots show graphically the similarity between the segmentation and manual (ground truth) results. Statistical tests such as the t-test, Mann Whitney, Wilcoxon, Chi-Square, and Kolmogorov–Smirnov (KS) tests are also performed to assess the stability and reliability of the study [3, 24, 33, 39].

The paper is organized as follows. Section 2 gives a brief summary of the materials and of all the segmentation methods. Section 3 shows the binary results of all the segmentation techniques and the manual tracings performed by the two Radiologists. Section 4 presents a comparison of the segmentation results with the manual results using the JI and DSC measures, together with the results of the various statistical tests. Section 5 discusses the findings. Finally, the contribution of the work is summarized in Section 6.

2 Materials and methods

The mini-MIAS database is a widely used, publicly available database of mammographic data. It contains about 322 screening mammography cases, of which approximately 207, 64, and 51 are normal, benign, and malignant cases, respectively [50]. For this study, 20 images are randomly selected from the mini-MIAS database, comprising 10 benign and 10 malignant cases. The proposed methodology is a five-stage process, as shown in Fig. 1.

Fig. 1 Process flow diagram

2.1 Generation of ROI

In the first stage, input mammographic images from the database are manually cropped to obtain the desired ROI of fixed size (100 × 100). The manually cropped region encloses the tumor area. All manual cropping was performed by Expert 1, who has more than 20 years of clinical experience.

2.2 Pre-processing

Pre-processing enhances the intensity of the ROI image and removes noise. In the second stage, pre-processing is performed using the open and close morphological operations using a single MATLAB command. Morphological opening is erosion followed by dilation; morphological closing is the inverse, i.e. dilation followed by erosion. In a binary image, dilation increases the thickness of the objects and erosion shrinks it. The extent of thickening and shrinking is controlled by a structuring element. In our study, a disk-shaped structuring element was chosen. The radius of the structuring element is chosen as four and six for malignant and benign images, respectively, because higher or lower radii caused blurring of the ROI regions. The fixed parameters are found to be suitable for all the images.
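As a rough illustration of the opening and closing operations described above, the following Python sketch implements grayscale erosion and dilation with a disk-shaped structuring element using NumPy only. It is not the MATLAB command used in this study: the function names (disk, erode, dilate, opening, closing), the border handling, and the exact discretization of the disk are assumptions made for illustration.

```python
import numpy as np

def disk(radius):
    """Boolean disk-shaped structuring element (assumed discretization)."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return x * x + y * y <= radius * radius

def _morph(image, footprint, reduce_fn, pad_value):
    """Take the min (erosion) or max (dilation) over the footprint neighbourhood."""
    image = np.asarray(image)
    r = footprint.shape[0] // 2
    # Pad so that border pixels only see values that cannot change the result.
    padded = np.pad(image, r, mode="constant", constant_values=pad_value)
    h, w = image.shape
    shifted = [padded[dy:dy + h, dx:dx + w]
               for dy, dx in zip(*np.nonzero(footprint))]
    return reduce_fn(np.stack(shifted), axis=0)

def erode(image, footprint):
    return _morph(image, footprint, np.min, np.max(image))

def dilate(image, footprint):
    return _morph(image, footprint, np.max, np.min(image))

def opening(image, footprint):
    """Erosion followed by dilation: removes small bright details."""
    return dilate(erode(image, footprint), footprint)

def closing(image, footprint):
    """Dilation followed by erosion: fills small dark gaps."""
    return erode(dilate(image, footprint), footprint)
```

Under these assumptions, a malignant ROI would be processed with something like closing(opening(roi, disk(4)), disk(4)) and a benign ROI with radius 6; the order in which the two operations are combined is not stated in the text and is also an assumption.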

2.3 Segmentation

The third stage involves extraction of the desired mass lesions by various segmentation techniques, namely: EM, FCM, KM, MT, PSO, and RG. In the fourth stage, segmentation results are compared with the manual tracings obtained by two experienced radiologists. Finally, in the fifth stage, performance evaluation is performed. The segmentation techniques employed in the current study are discussed briefly in the sub-sections below.

2.3.1 Expectation maximization (EM)

The expectation maximization algorithm is an iterative process for computing maximum likelihood estimates when part of the data is missing or hidden. The EM algorithm consists of two processes: the E-process and the M-process [45, 47, 55]. In our work, a vector of class means, a vector of class variances, and a vector of class proportions are used for computing the Expectation and Maximization steps. These three vectors are computed from the grayscale image and the class value. The class value (which is the number of iterations) is empirically chosen as 10. The maximum likelihood condition (which is the termination condition) is the difference between the likelihoods of the Maximization and Expectation steps. In our study, the maximum likelihood condition is empirically chosen to be greater than 0.001.
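The E-process and M-process above can be sketched as a one-dimensional Gaussian mixture fitted to pixel intensities. This is a minimal sketch and not the implementation used in this study: the function name em_gmm_1d, the quantile-based initialization, the Gaussian class model, and the use of the change in log-likelihood (with tolerance 0.001) as the stopping rule are all assumptions.

```python
import numpy as np

def em_gmm_1d(pixels, n_classes=2, max_iter=10, tol=1e-3):
    """Fit class means, variances and proportions to grayscale values by EM."""
    x = np.asarray(pixels, dtype=float).ravel()
    # Assumed initialization: means on intensity quantiles, shared variance.
    means = np.quantile(x, (np.arange(n_classes) + 0.5) / n_classes)
    variances = np.full(n_classes, x.var() + 1e-6)
    props = np.full(n_classes, 1.0 / n_classes)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: posterior probability of each class for each pixel.
        dens = (props * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                / np.sqrt(2.0 * np.pi * variances))
        total = dens.sum(axis=1, keepdims=True) + 1e-300
        resp = dens / total
        log_lik = np.log(total).sum()
        # M-step: re-estimate the three parameter vectors.
        nk = resp.sum(axis=0) + 1e-12
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
        props = nk / x.size
        # Stop when the log-likelihood no longer changes by more than tol.
        if abs(log_lik - prev_ll) <= tol:
            break
        prev_ll = log_lik
    labels = resp.argmax(axis=1)
    return means, variances, props, labels
```

The returned labels can be reshaped to the ROI size to obtain a class map, from which the brightest class could be taken as the candidate mass region.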

2.3.2 Fuzzy c-means (FCM)

Fuzzy c-means is an unsupervised, iterative clustering method in which the dataset is divided into several clusters, with a membership function for each cluster. The method repeatedly updates the cluster centers and the associated membership functions, moving the center locations within the dataset [5, 15, 16, 23, 44, 46]. To compute the membership function and the updated cluster centers, the fuzziness coefficient, number of clusters, and maximum number of iterations were empirically chosen as 2, 10, and 50, respectively.
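The alternating update of cluster centers and memberships can be sketched as follows for grayscale intensities. The defaults mirror the values quoted above (fuzziness 2, 10 clusters, 50 iterations); the function name fuzzy_c_means, the random initialization, and the convergence tolerance are assumptions, and this is not the code used in this study.

```python
import numpy as np

def fuzzy_c_means(pixels, n_clusters=10, m=2.0, max_iter=50, tol=1e-5, seed=0):
    """Return cluster centers and the (n_clusters, n_pixels) membership matrix."""
    x = np.asarray(pixels, dtype=float).ravel()
    rng = np.random.default_rng(seed)
    # Random initial memberships, normalized so each pixel's memberships sum to 1.
    u = rng.random((n_clusters, x.size))
    u /= u.sum(axis=0)
    for _ in range(max_iter):
        # Center update: mean of the data weighted by membership ** m.
        um = u ** m
        centers = um @ x / um.sum(axis=1)
        # Membership update from inverse distances to the centers.
        dist = np.abs(x[None, :] - centers[:, None]) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=0)
        converged = np.max(np.abs(new_u - u)) < tol
        u = new_u
        if converged:
            break
    return centers, u
```

A hard segmentation can be obtained with u.argmax(axis=0), reshaped to the ROI size.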

2.3.3 K-means

K-means is an unsupervised clustering algorithm in which k clusters are defined, each with its own centroid. Points near a centroid are grouped with it, and the centroid of each group is then recomputed; this process is repeated until the centroid locations no longer change [13, 17, 22, 31, 44]. When computing the cluster centers, each pixel is assigned to the nearest center using the minimum Euclidean distance. The number of clusters is empirically chosen as 10.
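The assign-and-update loop above can be sketched for grayscale intensities as follows. The function name kmeans_1d and the quantile-based initialization are assumptions made so that the example is deterministic; the study does not state how the centroids were initialized.

```python
import numpy as np

def kmeans_1d(pixels, k=10, max_iter=100):
    """Cluster grayscale values into k groups; returns (centroids, labels)."""
    x = np.asarray(pixels, dtype=float).ravel()
    # Assumed initialization: centroids spread over the intensity quantiles.
    centroids = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(max_iter):
        # Assign each pixel to the nearest centroid (Euclidean distance in 1D).
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            x[labels == j].mean() if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```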

2.3.4 Multilevel thresholding (MT)

Multilevel thresholding is a segmentation technique in which multiple thresholds are determined for the gray image, segmenting it into several regions. The segmentation is performed according to the brightness of the different mass lesions. From the multiple thresholds, one threshold is eliminated each time: a larger desired region is obtained by merging a smaller segmented region with its adjacent region. Statistics such as the mean, standard deviation, and variance of the gray image are evaluated to separate objects with multiple thresholds [18, 34, 38, 51]. Histogram entropy is used to compute the threshold. For optimization, the maximum threshold value is empirically chosen as 10.
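To illustrate how a threshold can be derived from histogram entropy, the sketch below selects a single threshold by maximizing the sum of the entropies of the two classes it creates (a Kapur-style criterion). This is an assumption about the entropy criterion, covers only one threshold rather than the multilevel case, and the function name kapur_threshold is hypothetical.

```python
import numpy as np

def kapur_threshold(image, levels=256):
    """Return the gray level t maximizing the entropy of classes <= t and > t."""
    values = np.asarray(image, dtype=np.int64).ravel()
    hist = np.bincount(values, minlength=levels).astype(float)
    p = hist / hist.sum()
    cdf = np.cumsum(p)
    best_t, best_h = 0, -np.inf
    for t in range(levels - 1):
        w0, w1 = cdf[t], 1.0 - cdf[t]
        if w0 <= 0.0 or w1 <= 0.0:
            continue  # one class would be empty
        # Normalized, non-zero probabilities inside each class.
        p0 = p[:t + 1][p[:t + 1] > 0] / w0
        p1 = p[t + 1:][p[t + 1:] > 0] / w1
        h = -(p0 * np.log(p0)).sum() - (p1 * np.log(p1)).sum()
        if h > best_h:
            best_t, best_h = t, h
    return best_t
```

Extending this to several thresholds requires searching over tuples of gray levels, which is where an optimizer such as PSO is typically used.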

2.3.5 Particle swarm optimization (PSO)

In particle swarm optimization, each particle in the population is a candidate solution that moves through the search space with an adaptable velocity, so as to optimize the fitness (objective) function. The velocity and position of each particle are updated using two values: p-best (the best fitness found by the particle itself) and g-best (the best fitness found by its neighbours) [9, 29, 34, 35]. In our study, the minimum number of particles in the swarm is chosen as 10, the dimension of the fitness function is twice the number of particles, the swarm size is three times the dimension of the fitness function, the lower fitness bound is 0, the upper fitness bound is 255, the maximum speed is 255, the starting speed is 0, the inertia constant is 0.73, and the cognition and social constants are both 2.05. These values are the same as those used by Neto et al. [29]. Apart from these parameters, the number of iterations was empirically chosen as 10.
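The velocity and position updates can be sketched for a one-dimensional search over gray levels as below. The bounds, maximum speed, starting speed, and the constants 0.73 and 2.05 follow the values quoted above; applying 0.73 as a multiplier on the whole velocity update (the constriction form), the function name pso_maximize, and the toy fitness function in the usage note are assumptions, not the setup of this study.

```python
import numpy as np

def pso_maximize(fitness, n_particles=10, n_iter=10, lower=0.0, upper=255.0,
                 v_max=255.0, chi=0.73, c1=2.05, c2=2.05, seed=0):
    """Maximize a scalar fitness over [lower, upper]; returns (g_best, value)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(lower, upper, n_particles)
    vel = np.zeros(n_particles)                      # starting speed is 0
    pbest = pos.copy()                               # best position per particle
    pbest_val = np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmax(pbest_val)]              # best position in the swarm
    for _ in range(n_iter):
        r1 = rng.random(n_particles)
        r2 = rng.random(n_particles)
        # Assumed constriction form: chi scales the whole velocity update.
        vel = chi * (vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos))
        vel = np.clip(vel, -v_max, v_max)
        pos = np.clip(pos + vel, lower, upper)
        val = np.array([fitness(p) for p in pos])
        improved = val > pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = val[improved]
        gbest = pbest[np.argmax(pbest_val)]
    return gbest, pbest_val.max()
```

For example, pso_maximize(lambda t: -(t - 100.0) ** 2, n_particles=30, n_iter=100) searches for the gray level closest to 100; in a thresholding setting the fitness would instead be a criterion such as Otsu's between-class variance.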

2.3.6 Region growing (RG)

The region growing algorithm starts from manually located seed points placed over suspicious regions of masses in a mammogram. Starting from the initial seed points, the region grows progressively by adding neighbouring pixels whose properties match those of the seed pixel [1, 14, 28, 59]. In our study, a seed (pixel value) is selected in the ROI. The segmented region is evaluated using a density weighting function, which depends on the threshold and the mean value of the four neighbouring pixels. The algorithm terminates when the distance between the segmented region and the new four neighbouring pixels becomes higher than a threshold (the optimized intensity difference of the region). A value of 10,000 is specifically chosen as the maximum threshold.
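A simplified version of this procedure is sketched below: the region grows from one seed over 4-connected neighbours while each candidate pixel stays within a given intensity difference of the current region mean. The density weighting function of this study is not reproduced; the function name region_grow and the acceptance rule are assumptions.

```python
from collections import deque

import numpy as np

def region_grow(image, seed, max_diff):
    """Grow a region from seed (row, col); returns a boolean mask."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    total, count = img[seed], 1            # running sum and size of the region
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the four neighbours of the current pixel.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                # Accept the pixel if it is close to the current region mean.
                if abs(img[nr, nc] - total / count) <= max_diff:
                    mask[nr, nc] = True
                    total += img[nr, nc]
                    count += 1
                    queue.append((nr, nc))
    return mask
```

Growth stops on its own once no remaining neighbour satisfies the intensity criterion.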

2.4 Manual segmentation by experts

Manual segmentation is the process of tracing masses in mammogram images by a Radiologist or medical practitioner. Manual delineations by the Radiologists are taken as the gold standard in this study for the evaluation of the segmentation methods [25]. In this study, two experienced Radiologists, each with 20 years of experience, performed manual tracing of the mass lesions in all the mammogram images.

2.5 Statistical analysis

To evaluate the similarity between the different segmentation techniques, parametric and nonparametric statistical tests are performed [3, 11, 39, 40]. The t-test (commonly known as Student's t-test) is a parametric test used to assess the difference between groups of sample data with respect to the mean. The Mann Whitney, Wilcoxon, Chi-Square, and KS tests are nonparametric statistical tests [39]. Here, the Mann Whitney test is used to compare how similar the sample data are to the observed data by evaluating the median of the sample data [3]. The Wilcoxon test is used to evaluate the similarity between matched or repeated data in a single sample by ranking the differences, the Chi-Square test is used to calculate the dissimilarity between the measured data and the expected data computed under the null hypothesis, and lastly, the KS test is used to test for normality [41]. The significance of all tests depends on the p value (probability of chance), which should be greater than 0.05 for the null hypothesis to be retained. All the statistical tests are performed using MedCalc 16.8.4 software (Ostend, Belgium), and a p value less than 0.05 is considered statistically significant [27].

The Jaccard Index (also known as the Jaccard Similarity Index) and the Dice Similarity Coefficient (also known as the Sorensen Dice Coefficient) are computed to evaluate the similarity between the segmentation and the manual results [56, 58]. Bland Altman and regression plots are also studied to evaluate the difference and to identify the correlation between the segmentation and the manual results [33, 41]. Descriptions of all the statistical and performance tests are given in Table 6 in Appendix 1.
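For reference, both overlap measures can be computed from two binary masks as in the sketch below, using the standard definitions JI = |A ∩ B| / |A ∪ B| and DSC = 2|A ∩ B| / (|A| + |B|). The function names and the convention of returning 1 for two empty masks are assumptions.

```python
import numpy as np

def jaccard_index(seg, ref):
    """|A and B| / |A or B| for two binary masks of equal shape."""
    seg, ref = np.asarray(seg, dtype=bool), np.asarray(ref, dtype=bool)
    union = np.logical_or(seg, ref).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return np.logical_and(seg, ref).sum() / union

def dice_coefficient(seg, ref):
    """2 |A and B| / (|A| + |B|) for two binary masks of equal shape."""
    seg, ref = np.asarray(seg, dtype=bool), np.asarray(ref, dtype=bool)
    size_sum = seg.sum() + ref.sum()
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(seg, ref).sum() / size_sum
```

For a single pair of masks the two measures are related by DSC = 2 JI / (1 + JI), so they rank individual results identically even though their mean values over a dataset differ.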

3 Results

Table 1 shows the general statistical summary of the manual tracing results performed by both the Radiologists and the segmentation techniques for all the patients. All these statistical analyses are performed by taking mass lesion area as the parameter. Table 2 shows the detected area of mass regions (for both malignant and benign cases) observed by Radiologists and different segmentation techniques.

Table 1 Statistical summary of the detected area of mass lesions observed by different segmentation techniques and by both the Radiologists
Table 2 Detected area of mass regions observed by different segmentation techniques and by both the Radiologists

Figure 2 presents the ROI generation process. Figure 2a shows the original mammographic image, Fig. 2b shows the original mammographic image with the manually cropped region (yellow overlay), and Fig. 2c shows the manually cropped image (i.e. the ROI). Figs. 3 and 4 show the binary and overlay results of the detected mass lesions using the six different segmentation techniques. Similarly, Figs. 5 and 6 show the binary and overlay results of the detected mass lesions obtained from the manual tracings performed by Radiologist 1 and Radiologist 2, respectively. It can be observed that the results of all segmentation techniques are very similar to the manual results. Therefore, to identify the best segmentation technique, performance analysis is carried out in the next section.

Fig. 2 ROI generation process

Fig. 3 Binary results of detected mass lesions using six different segmentation techniques: a Expectation Maximization, b Fuzzy c-Means, c K-means, d Multilevel thresholding, e Region Growing, and f PSO

Fig. 4 Overlay results of six different segmentation techniques: a Expectation Maximization, b Fuzzy c-Means, c K-means, d Multilevel thresholding, e Region Growing, and f PSO. Red contour is the detected mass lesion

Fig. 5 Binary results of detected mass lesions obtained from the manual tracings performed by a Radiologist 1 and b Radiologist 2, respectively

Fig. 6 Overlay results obtained from the manual tracings performed by a Radiologist 1 and b Radiologist 2, respectively. Red contour is the detected mass lesion

4 Performance evaluation

The performance of all segmentation techniques against the manual tracing results is evaluated in three parts: (i) comparison using the JI and DSC measures; (ii) statistical analysis, namely the t-test, Mann Whitney, Wilcoxon, Chi-Square, and KS tests; and (iii) Bland Altman and regression plots.

4.1 Comparison of segmentation results with manual tracing results

The validation of the segmentation results is tested by comparing the observed results with the manual tracing results. The results of the JI and DSC test are shown in Table 3.

Table 3 Comparison between different segmentation techniques and manual tracing results performed by both Radiologist 1 and Radiologist 2

4.2 Statistical analysis

For statistical analysis, the t-test, Mann Whitney, Wilcoxon, Chi-Square, and KS tests are performed. Table 4 shows the statistical analysis and correlation coefficients of all six segmentation techniques against the two manual tracing results performed by the Radiologists.

Table 4 Statistical test summary between segmentation techniques and manual tracing results performed by both Radiologist 1 (R1) and Radiologist 2 (R2)

4.3 Graphical analysis

In Appendix 2, Figs. 7 and 8 present the Bland Altman plots and Figs. 9 and 10 present the regression plots between different segmentation techniques and the manual tracing results obtained from both the Radiologists.
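For readers unfamiliar with Bland Altman plots, the bias and the limits of agreement (mean ± 1.96 SD of the paired differences) can be computed as in the sketch below. The function name bland_altman_limits, the argument names, and the use of the sample standard deviation are assumptions; the plots in this study were produced with MedCalc.

```python
import numpy as np

def bland_altman_limits(auto_area, manual_area):
    """Return (bias, lower limit, upper limit) for paired area measurements."""
    auto_area = np.asarray(auto_area, dtype=float)
    manual_area = np.asarray(manual_area, dtype=float)
    diff = auto_area - manual_area          # paired differences
    bias = diff.mean()                      # mean difference
    sd = diff.std(ddof=1)                   # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

Points falling outside the two limits correspond to images on which a segmentation technique and a Radiologist disagree unusually strongly.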

5 Discussion

In our study, mass lesions in mammogram images are segmented by different segmentation techniques and the results are validated using the manual tracings, as shown in Fig. 1. The results of the study are discussed in the subsections below.

5.1 A note on statistical summary

The statistical summary (see Table 2) of breast lesion area was used to compare the segmentation techniques with both manual tracing results. From Table 2, it was observed that the medians of the lesion area for FCM (2.7264) and EM (2.7756) are very close to those of Radiologist 1 (2.6984) and Radiologist 2 (2.7421). This indicates a high degree of similarity between the segmentation and manual results. In Table 2, the median of the lesion area for region growing (3.3028) shows less similarity with both the Radiologists, which indicates poorer performance of the region growing method in the detection of breast masses.

5.2 A note on the comparison of segmentation results with manual tracing results

From Table 3, it can be observed that for JI, FCM and EM showed the highest mean values (0.73, 0.72) and lowest SD values (0.06, 0.08) with both the Radiologists, whereas region growing showed the lowest mean value (0.61) and highest SD value (0.17). Similarly, for DSC, KM and EM showed the highest mean values (0.84, 0.84) and lowest SD values (0.04, 0.05) with both the Radiologists, whereas region growing showed the lowest mean value (0.84) and highest SD value (0.15). FCM and EM are clustering algorithms that are less affected by inhomogeneity variation and are robust to noise. This allows them to retain more information than the other techniques, especially region growing. Region growing suffers from high inhomogeneity variation, as it merges neighbouring pixels with similar statistical properties.

The higher the values of JI and DSC, the greater the similarity between the two results. It can be observed that all segmentation techniques have good accuracy and hence the results are validated. In Table 3, the mean values of JI and DSC for both FCM and EM indicate their similarity with both manual tracing results. Hence, the segmentation accuracy of FCM and EM agrees well with manual segmentation as compared to the results obtained from the other segmentation techniques.

5.3 A note on statistical test

In Table 4, for all the combinations, we observed that the KS test, with a p value greater than 0.10, and the Chi-Square test, with a constant p value of 0.236, showed no statistically significant difference. In the t-test, the p value was greater than 0.05 for all the combinations, which indicates that there is no statistically significant difference between the two results. Similarly, in the Wilcoxon test, the p value also satisfies the condition, which indicates that the median of the paired differences is not statistically significantly different from zero. The region growing technique does not satisfy the condition of similarity with both Radiologists in either test. The higher p values of both EM and KM indicate their ability to perform accurate mass lesion segmentation as compared with the manual tracing results.

5.4 A note on graphical analysis

Bland Altman plots are presented in Figs. 7 and 8 in Appendix 2. From the plots, it was observed that the variability between EM, FCM, KM and Radiologists 1 and 2 is lower, since all the points are uniformly scattered within the interval (mean ± 1.96 SD), excluding a few outliers. The low variation indicates higher similarity between the segmentation and manual tracing results.

Similarly, regression plots are presented in Figs. 9 and 10 in Appendix 2. The regression line, 95% confidence interval, and 95% prediction interval allow the similarity between the two results to be assessed. For both EM and KM, most of the points lie within the confidence interval, which shows a high degree of similarity between the segmentation and manual tracing results.

Previous studies have used both FCM and EM for the segmentation of mass lesions in mammogram images. Pavan et al. [42] proposed a segmentation technique for mammograms using FCM. They employed various FCM features on breast tissues and reported 93% consistency with the Breast Imaging Reporting and Data System (BIRADS). Also, Vedanarayanan et al. [55] proposed an efficient technique with improved EM for mammograms, which showed 100% sensitivity. In the proposed study, we have used the same segmentation techniques and the observed results are consistent with the above studies.

5.5 Comparison against current literature and benchmarking

A lot of work has been done so far on breast segmentation. Table 5 compares the proposed study with the current literature using ten attributes, namely: year, number of frames, objective, technique used, validation, ground truth, number of manual tracers, accuracy, benchmarking, and performance evaluation. In 2014, Dheeba [9] proposed a Particle Swarm Optimized Wavelet Neural Network for the detection of breast abnormalities in digital mammograms. A pattern classifier was used for the classification of the suspicious regions. The proposed algorithm was tested on 216 mammograms, and a sensitivity and specificity of 94.167% and 92.105%, respectively, were observed. Elmoufidi [13], Neto [29], and Raja [34] performed segmentation of cancer masses in the affected breast using mammogram images. Elmoufidi [13] validated the results against one radiologist and obtained a mean precision of 92.87%. Neto [29] observed an accuracy of 95.2% but, like Raja [34], Nurhasanah [31], and Vedanarayanan [55], did not validate the results. In 2018, Sadad [42] performed classification of tumors from mammograms using several classifiers and observed an accuracy of 98.2%. The study also validated its results against the MIAS dataset. Very recently, Melouah [28] compared automatic seed generation methods using the region growing technique. The observed results were benchmarked against one radiologist. Ours is the first study to compare six different segmentation methods, namely EM, K-means, FCM, multilevel thresholding, RG, and PSO, on a single database, which is its novelty. The results are validated against two different radiologists, and performance evaluation was also performed using JI, DSC, Bland Altman, and regression plots.

Table 5 Survey of other breast cancer segmentation methods in literature and proposed work

5.6 Strength, weakness, and extensions

The main objectives of this study were to evaluate the performance of lesion detection in mammograms using different segmentation techniques and to compare it with manual delineations provided by expert Radiologists. For validation, the JI and DSC measures were employed. Similarly, extensive statistical and graphical tests were used to validate the findings of the study. However, the study also suffers from some limitations, such as the small data size and manual ROI generation. In the future, the proposed semi-automated system can be made fully automated by using an automated ROI generation system. The segmentation results can be tested on more real-time mammogram datasets. With a larger database, we will implement and compare the performance of the approaches discussed in this study with that of deep neural networks. Inter- and intra-observer variability analysis can also be performed in the future. Developing methods to enhance the interaction between CAD and medical experts is also important for its clinical acceptance.

6 Conclusion

This study reported a comparative evaluation of six different segmentation techniques, namely EM, FCM, K-means, multilevel thresholding, RG, and PSO, for the detection of mass lesions in mammographic images. The main aim was to compare the performance of these techniques with manual delineations provided by two experienced Radiologists. The results indicate that FCM and EM are the most accurate techniques and can be applied in clinical settings. Performance evaluation and statistical analysis further support the stability and reliability of the study. The results of this study indicate that computer-aided lesion detection systems can assist Radiologists in routine clinical practice for the detection of breast tumors in mammographic images.