1 Introduction

Despite advances in early detection and treatment [9, 20, 25, 26, 27], breast cancer remains a leading cause of death among women, largely because of the lack of early diagnosis. Mammography screening is the most widely adopted technique for the early detection of breast cancer. In mammography images, suspected breast cancer appears as white spots. Breast density, the presence of labels, artifacts, or the pectoral muscle all influence the sensitivity of mammography.

In the literature, several studies address the detection of regions of interest (ROI) in mammograms. Alayli et al. [2] employ a thresholding algorithm for the detection of breast cancer; the difficulty with this technique lies in determining the threshold. Abdo et al. [15] proposed a method based on K-means with a mixture of gamma distributions; Singh et al. [33] used K-means and Fuzzy C-means to detect the mass center in mammograms; Siddheswar et al. [4] proposed a method based on image processing functions, K-means and Fuzzy C-means clustering. Approaches based on K-means and Fuzzy C-means share the disadvantage that the number of clusters and the initial centers must be set in advance. Elmoufidi [13] combined LBP (Local Binary Pattern) with a dynamic k-means algorithm. Liu et al. [19] use the GVF snake algorithm to extract the extrapolated breast object. Mustra et al. [22] proposed adaptive histogram equalization and polynomial curvature estimation. Agrawal et al. [1] proposed saliency maps for ROI segmentation and classified the ROIs using entropy features. Jen et al. [16] proposed a method for detecting abnormal mammograms in mammographic datasets based on a novel abnormality detection classifier (ADC) that extracts a few discriminative features, namely first-order statistical intensities and gradients. Kwok et al. [17] used the Hough transform to identify the pectoral muscle. In their approach, the pectoral muscle edge is first estimated by a straight line and then refined to a curve. However, the pectoral muscle boundary cannot be found properly when the muscle region has a complex texture, and the segmentation may become inaccurate for small pectoral muscles. Other techniques of this kind are described in [6, 21, 24].

Nagi et al. [24] used morphological operations and seeded region growing as preprocessing to detect the pectoral muscle. However, the assumption that the edge of the pectoral muscle can be represented by a straight line segment is not always correct. Wang et al. [35] presented a method based on a discrete-time Markov chain (DTMC) and an active contour model to detect the edge of the pectoral muscle.

In this paper, we propose new techniques that address the problems cited above. Our approach is outlined as follows. We start with Otsu's thresholding method. Next, we estimate the number of classes in the image using the LBP (Local Binary Pattern) technique. To automate the initialization task, we apply a dynamic k-means classification improved by a Markov method. The tumor image is the class that yields the maximum correlation.

This paper is an extension of our work presented in [11] and is organized as follows. Section 2 presents the methods used. Section 3 describes the proposed approach. The results and their discussion are presented in Section 4, and the last section is dedicated to the conclusion.

2 Materials and methods

In this article, we use several methods chosen for their simplicity and for their effectiveness as demonstrated in the literature. This work addresses two major problems: the elimination of the pectoral muscle and the detection of the tumor in mammogram images. First, we apply the Otsu method, which is widely used in the literature, to erase unwanted areas and labels in mammogram images. Then we use the LBP method to estimate the average number of classes. To extract the optimal number of classes, we adopt the algorithm proposed by Elmoufidi et al. [13]. After that, we use the Markov method to improve the classes obtained by dynamic k-means. Finally, we use the correlation to identify the classes of the pectoral muscle and the tumor.

2.1 Otsu method

The principle of the Otsu method is to find an optimal threshold that maximizes the separation between two classes [30]. It is based on the variance. The optimal threshold \( {S}_{optimal} \) is the one that maximizes one of the following functions:

$$ \lambda (t)=\frac{\delta_B^2(t)}{\delta_W^2(t)},\kern0.5em \eta (t)=\frac{\delta_B^2(t)}{\delta_T^2(t)},\kern0.5em K(t)=\frac{\delta_T^2(t)}{\delta_W^2(t)} $$

If η(t) is chosen, then

$$ {S}_{optimal}=\underset{t\in \left[\min ,\max \right]}{\arg \max}\;\eta (t) $$
(1)

Where \( {\delta}_T^2,{\delta}_B^2,{\delta}_W^2 \) are, respectively, the total variance of the image, the between-class (inter-class) variance and the within-class (intra-class) variance.

$$ {\delta}_B^2(t)={\delta}_T^2(t)-{\delta}_W^2(t) $$
(2)
$$ {\delta}_T^2(t)=\sum \limits_{i=\min}^{\max}{\left(i-{m}_T\right)}^2\ast {P}_i $$
(3)

\( {m}_T={\sum}_{i=\min}^{\max}i\ast {P}_i \): The total average of all the image points

$$ {\delta}_W^2(t)={P}_{font}(t)\ast {\delta}_{font}^2(t)+{P}_{objet}(t)\ast {\delta}_{objet}^2(t) $$
(4)

Pi: The probability of occurrence of the gray level i in the image.

$$ {P}_i=\frac{number\ of\ pixels\ whose\ gray\ level=i\ }{number\ of\ pixels\ in\ image}=\frac{h(i)}{M\ast N} $$
(5)

Pfont(t), Pobjet(t): The sum of the probabilities of occurrence of gray levels of pixels of the background and that of the object by taking the threshold t.

$$ {P}_{objet}(t)=\sum \limits_{i=\min}^t{P}_i,\kern0.5em {P}_{font}(t)=\sum \limits_{i=t+1}^{\max}{P}_i=1-{P}_{objet}(t) $$
(6)

mfont, mobjet: The average of the pixels belonging to the background and that of the pixels of the object.

$$ {m}_{objet}(t)=\frac{\sum_{i=\min}^ti\ast {P}_i}{P_{objet}(t)},\kern0.5em {m}_{font}(t)=\frac{\sum_{i=t+1}^{\max}i\ast {P}_i}{P_{font}(t)} $$
(7)

\( {\delta}_{font}^2(t),{\delta}_{objet}^2(t) \): The variance of the class background and the variance of the class object.

$$ {\delta}_{objet}^2(t)=\frac{\sum_{i=\min}^t{\left(i-{m}_{objet}(t)\right)}^2\ast {P}_i}{P_{objet}(t)},\kern0.5em {\delta}_{font}^2(t)=\frac{\sum_{i=t+1}^{\max}{\left(i-{m}_{font}(t)\right)}^2\ast {P}_i}{P_{font}(t)} $$
(8)

[min, max] is the dynamic range of the image.
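The threshold search described in this subsection can be illustrated with a short Python sketch. It assumes an 8-bit grayscale image held in a NumPy array; the function name `otsu_threshold` and the `levels` parameter are illustrative choices, not part of the original method description. Because the total variance does not depend on t, maximizing η(t) amounts to maximizing the between-class variance, which the sketch evaluates from cumulative sums of the histogram.

```python
import numpy as np

def otsu_threshold(image, levels=256):
    """Return the gray level t that maximizes the between-class variance."""
    hist = np.bincount(image.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()              # P_i: probability of gray level i
    i = np.arange(levels)
    p_obj = np.cumsum(p)               # P_objet(t): mass of levels <= t
    p_bg = 1.0 - p_obj                 # P_font(t): mass of levels > t
    cum_mean = np.cumsum(i * p)        # partial sums of i * P_i up to t
    m_total = cum_mean[-1]             # m_T: global mean gray level
    denom = p_obj * p_bg
    # Between-class variance, equal to P_objet * P_font * (m_objet - m_font)^2
    with np.errstate(divide="ignore", invalid="ignore"):
        var_between = (m_total * p_obj - cum_mean) ** 2 / denom
    var_between[denom == 0] = 0.0      # thresholds that leave one class empty
    return int(np.argmax(var_between))
```

Pixels with a gray level above the returned threshold would then be assigned to one class and the remaining pixels to the other; which of the two is treated as background depends on the image.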

2.2 LBP (local binary pattern)

The LBP (Local Binary Pattern) descriptor was proposed by Ojala et al. [28, 29] in 1996 for texture classification.

We consider an image I(x, y), where gc denotes the gray level of the central pixel (x, y), gp the gray value of its pth neighbor, P the total number of neighbors considered and R the radius of the neighborhood:

$$ {g}_p=I\left({x}_p,{y}_p\right),\kern0.5em p=0,\dots ,P-1 $$
(9)
$$ {x}_p=x+R\cos \left(\frac{2\pi p}{P}\right) $$
(10)
$$ {y}_p=y-R\sin \left(\frac{2\pi p}{P}\right) $$
(11)

LBP operator is defined as follows:

$$ {\mathrm{LBP}}_{P,R}=\sum \limits_{p=0}^{P-1}S\left({g}_p-{g}_c\right){2}^p $$
(12)

The thresholding function S(x) is defined by:

$$ S(x)=\left\{\begin{array}{ll}1,& x\ge 0\\ {}0,& x<0\end{array}\right. $$
(13)
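As an illustration of the LBP operator, the following Python sketch computes the code of every interior pixel for P = 8 and R = 1. Two simplifications are assumptions of the sketch rather than statements from the text: neighbor coordinates are rounded to the pixel grid instead of being interpolated, and a neighbor equal to the center contributes 1, following the usual LBP convention. The function name `lbp_8_1` is illustrative.

```python
import numpy as np

def lbp_8_1(image):
    """LBP codes with P=8, R=1 for interior pixels (border pixels stay 0)."""
    img = image.astype(np.int32)
    h, w = img.shape
    P, R = 8, 1
    codes = np.zeros((h, w), dtype=np.int32)
    center = img[1:h - 1, 1:w - 1]
    for p in range(P):
        # Neighbor offsets: x_p = x + R cos(2 pi p / P), y_p = y - R sin(2 pi p / P),
        # rounded to the nearest pixel (assumption: no interpolation)
        dx = int(round(R * np.cos(2 * np.pi * p / P)))
        dy = int(round(-R * np.sin(2 * np.pi * p / P)))
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # S(g_p - g_c) weighted by 2^p, with S(0) = 1
        codes[1:h - 1, 1:w - 1] += (neighbor >= center).astype(np.int32) << p
    return codes
```

With these settings each code lies between 0 and 255, so a histogram of the codes gives a compact texture description of the image.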

2.3 K-means

k-means is one of the simplest unsupervised learning algorithms for solving the clustering problem.

It minimizes the sum of squared distances between all the points and the center of their class [31].

$$ J(V)=\sum \limits_{i=1}^k\sum \limits_{j=1}^{c_i}{\left(\left\Vert {x}_j-{v}_i\right\Vert \right)}^2 $$
(14)

Where:

  • k: The number of cluster centers;

  • ci: The number of data points in the ith cluster;

  • ‖xj − vi‖: The Euclidean distance between xj and vi;

  • vi: The mean of the points in the ith cluster, recomputed at each iteration as follows:

$$ {v}_i=\frac{\sum_{j=1}^{c_i}{x}_j^i}{c_i} $$
(15)

Let X = {x1, x2, x3, …, xn} be the set of data points and V = {v1, v2, v3, …, vk} be the set of centers.

The main steps of the k-means method can be summarized as follows:

  • Randomly select k objects as the initial class centers.

  • Assign each object to the nearest class, each class being characterized by its center.

  • Compute the new representatives (centers) of the classes.

  • Repeat the second and third steps until the centers cease moving.
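The four steps above can be sketched in Python as follows. This is a minimal illustration on an (n, d) array of points, not the dynamic k-means variant of [13]; the function name `kmeans`, the `n_iter` cap and the fixed random `seed` are assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means on an (n, d) array; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 1: pick k distinct data points as the initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest center
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute each center as the mean of its members;
        # an empty cluster keeps its previous center
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Step 4: stop once the centers cease moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

For an image, the points would typically be the pixel intensities reshaped to a column, for example `image.reshape(-1, 1)`.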

The intra-cluster distance is the mean squared distance from all points to their cluster centers (see Eq. 16).

$$ intra- cluster=\frac{1}{N}\sum \limits_{i=1}^k\sum \limits_{j=1}^{c_i}{\left\Vert {x}_j-{v}_i\right\Vert}^2 $$
(16)

Where N is the number of pixels in the image, k is the number of clusters, and vi is the cluster center of cluster ci.

The inter-cluster distance is the minimum squared distance between cluster centers (see Eq. 17).

$$ inter- cluster=\min \left({\left\Vert {v}_i-{v}_j\right\Vert}^2\right) $$
(17)

Where i = 1, 2, …, k − 1 and j = i + 1, …, k.

$$ Ratio=\frac{intra- cluster}{inter- cluster} $$
(18)
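The intra-cluster distance, the inter-cluster distance and their ratio can be computed directly from a clustering result, as in the Python sketch below. The function name `validity_ratio` and its argument layout are assumptions for illustration. A lower ratio is commonly read as indicating more compact and better-separated clusters; how the ratio is used to choose the number of classes is not detailed in this section.

```python
import numpy as np

def validity_ratio(points, labels, centers):
    """Return intra-cluster / inter-cluster for a given clustering."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # Intra-cluster: mean squared distance of each point to its own center
    intra = np.mean(np.sum((points - centers[labels]) ** 2, axis=1))
    # Inter-cluster: smallest squared distance between two distinct centers
    k = len(centers)
    inter = min(
        np.sum((centers[i] - centers[j]) ** 2)
        for i in range(k - 1)
        for j in range(i + 1, k)
    )
    return intra / inter
```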

2.4 Hidden Markov

Hidden Markov models (HMM) describe random phenomena that are assumed to consist of two levels: a first random process of transitions between unobservable (hidden) states and, at a second level, another random process that generates observable values in each state. Assume that Z is a 2D gray-level matrix (M ∗ N), where \( {Z}_i^T \) denotes the intensity measurement at pixel i. Given an image y = (y1, y2, …, yN), each yi at pixel i is associated with an unknown class label xi ∈ L, where L is the set of all possible labels. The Gaussian Hidden Markov Random Field (HMRF) can be specified as:

$$ P\left({y}_i\left|{X}_{N_i};\theta \right.\right)=\sum \limits_{l\in L}g\left({y}_i;{\theta}_l\right)q\left(l\left|{X}_{N_i}\right.\right) $$
(19)

Where X = (X1, X2, …, XN), \( g\left({y}_i;{\theta}_l\right) \) is a Gaussian probability density function with parameters \( {\theta}_l=\left({\mu}_l,{\sigma}_l^2\right) \) and \( q\left(l\left|{X}_{N_i}\right.\right) \) is a conditional probability mass function for the class label l.

We use the MAP and EM algorithms to estimate the labeling x and the parameter set θ.

  • MAP algorithm

We seek a labeling of an image, which is an estimate of the true labeling, according to the MAP criterion:

$$ \widehat{X}=\arg \underset{x\in X}{\max}\left\{p\left(y|x;\theta \right)f(x)\right\} $$
(20)

It is assumed that, given the labels Xi, the observations yi are conditionally independent, so

$$ P\left(y\left|X;\theta \right.\right)=\prod \limits_{i=1}^NP\left({y}_i\left|{X}_i\right.\right) $$
(21)

and the probability density function of x, the so-called Gibbs distribution (proposed by Geman [14]), is given by:

$$ f(X)=\frac{1}{Z}{e}^{-U(x)} $$
(22)

Where Z is a normalizing constant called the partition function, and U(x) is an energy function given by the form:

$$ U(x)=\sum \limits_{c\in C}{V}_c(x) $$
(23)

Where Vc(x) is the clique potential and C is the set of all possible cliques (see more details in [14]). In this paper, it is assumed that each pixel has at most 4 neighbors in the image domain. Then, on pairs of neighboring pixels, the clique potential is calculated by:

$$ {V}_c\left({X}_i,{X}_j\right)=\frac{1}{2}\left(1-{I}_{X_i,{X}_j}\right) $$
(24)

The MAP estimation is equivalent to minimizing the posterior energy function

$$ \widehat{X}=\arg \underset{x\in X}{\min}\left\{U\left(y|x\right)+U(x)\right\} $$
(25)

Where \( U\left(y|x\right)={\sum}_i\left[\frac{{\left({y}_i-{\mu}_{X_i}\right)}^2}{2{\sigma}_{X_i}^2}+\frac{1}{2}\log {\sigma}_{X_i}^2\right] \). To solve the MAP problem, we can use the same approach proposed in [36].

  • EM algorithm

We use the EM algorithm to estimate the parameters θ. It is briefly explained below:

  • At the kth iteration, we have θ(k), and we compute the EM functional:

$$ Q\left(\left.\theta \right|{\theta}^{(k)}\right)=E\left[\log p\left(y,x,\theta \right)\left|y,{\theta}^{(k)}\right.\right] $$
(26)
  • To obtain the next estimate, we maximize the EM functional:

$$ {\theta}^{\left(k+1\right)}=\arg \underset{\theta }{\max }Q\left(\theta |{\theta}^{(k)}\right) $$
(27)

More details can be found in [3, 36].
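To make the posterior energy minimization concrete, the Python sketch below performs a simple label update on a 2D image with fixed class parameters. The solver used here, iterated conditional modes (ICM), is an assumption of this example: the text defers to [36] for the actual optimization, and the EM update of θ is not shown. The names `icm_map`, `mu`, `sigma2` and `n_iter` are illustrative.

```python
import numpy as np

def icm_map(y, labels, mu, sigma2, n_iter=10):
    """Greedy minimization of U(y|x) + U(x) over the labels (ICM sketch)."""
    y = np.asarray(y, dtype=float)
    x = np.array(labels, dtype=int)
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    n_labels = len(mu)
    h, w = y.shape
    # Gaussian data term for every pixel and every candidate label
    data = (y[..., None] - mu) ** 2 / (2 * sigma2) + 0.5 * np.log(sigma2)
    for _ in range(n_iter):
        changed = False
        for r in range(h):
            for c in range(w):
                energy = data[r, c].copy()
                # 4-neighborhood: each neighbor with a different label costs 1/2
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        energy += 0.5 * (np.arange(n_labels) != x[rr, cc])
                best = int(np.argmin(energy))
                if best != x[r, c]:
                    x[r, c] = best
                    changed = True
        if not changed:      # no label changed during a full sweep
            break
    return x
```

The pure-Python loops keep the correspondence with the energy terms easy to follow but would be slow on a 1024 × 1024 mammogram; a practical version would vectorize the sweep.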

2.5 Cross-correlation

The zero-mean normalized cross-correlation, denoted ZNCC, is given by:

$$ ZNCC\left({f}_g,{f}_d\right)=\frac{\left({f}_g-\overline{f_g}\right).\left({f}_d-\overline{f_d}\right)}{\left\Vert {f}_g-\overline{f_g}\right\Vert .\left\Vert {f}_d-\overline{f_d}\right\Vert } $$
(28)

ZNCC(fg, fd) takes values in the interval [−1, 1]. This measure corresponds to the linear correlation coefficient of classical statistics. It is one of the most widely used measures, notably in [16]. It has the advantage of being invariant to gain and bias.
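The ZNCC of two images of the same size can be computed as in the following Python sketch. The function name `zncc` is illustrative, and returning NaN when one of the inputs is constant (zero denominator) is a choice made for this example.

```python
import numpy as np

def zncc(f_g, f_d):
    """Zero-mean normalized cross-correlation of two equally sized arrays."""
    a = np.asarray(f_g, dtype=float).ravel()
    b = np.asarray(f_d, dtype=float).ravel()
    a = a - a.mean()                  # remove the mean (bias invariance)
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return float("nan")           # a constant input has no defined correlation
    return float(np.dot(a, b) / denom)
```

Scaling an image by a positive gain or adding a constant bias leaves the value unchanged, which is the invariance mentioned above.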

2.6 Region growing

The region growing segmentation method [10, 32] is still used in many applications. This technique makes it possible to take previously found positions into account to accelerate the segmentation. The method begins by placing "seeds" in the image; these give rise to regions. The regions then grow and merge until stable regions are obtained. The initial pixels are called "seeds" or "primers". Starting from a seed, a region is extended by adding adjacent pixels that satisfy the homogeneity criterion.
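A single-seed version of region growing can be sketched in Python as follows. The homogeneity criterion used here, an absolute intensity difference to the seed of at most `tol`, and the 4-connectivity are assumptions of the example, since the text does not specify them; the function name `region_grow` is also illustrative.

```python
from collections import deque

import numpy as np

def region_grow(image, seed, tol=10):
    """Return a boolean mask of the region grown from `seed` = (row, col)."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_value = img[seed]
    mask[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the 4-connected neighbors of the current pixel
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and not mask[rr, cc]:
                # Assumed homogeneity criterion: close to the seed intensity
                if abs(img[rr, cc] - seed_value) <= tol:
                    mask[rr, cc] = True
                    queue.append((rr, cc))
    return mask
```

Calling `region_grow(image, (0, 0))` grows a region from the top-left corner pixel, for example.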

2.7 Database

To test our approach, we used the mini-MIAS database [34]. This database contains 322 digital mammogram images of size 1024 × 1024 pixels in PGM format. The images are in grayscale with pixel intensities in the interval [0, 255], and they are classified into three major cases: normal, benign and malignant. Fig. 1 shows the various components of an image from this database.

Fig. 1 Example of an image from the mini-MIAS database

3 Proposed approach

The problem we want to solve in this article is how to extract and detect the tumor region in mammogram images. As shown in Fig. 2, the proposed approach is based on the following steps:

  1. Step1:

    The preprocessing phase (applied to all images of the database): each image of the MIAS database is preprocessed using Otsu's method for binarization and removal of unwanted areas. The resulting images are stored in a new database (treated MIAS).

  2. Step2:

    The recovery phase of the average number of classes: in this step, we recover the average number of classes from the treated MIAS database for use in the next phase. To this end, we apply the LBP method.

  3. Step3:

    The recovery phase of the optimal number of classes: to extract the optimal number of classes, we use the algorithm proposed in [13]. This algorithm takes as input an image and the average number of classes recovered in the previous phase, and outputs the optimal number of classes for the input image.

  4. Step4:

    The class extraction phase: the optimal number of classes is used to initialize the k-means algorithm. Applying this algorithm yields a set of images, each representing one class.

  5. Step5:

    The class adjustment phase: to obtain a good classification, we adjust the classes obtained in the previous phase using the Markov method [12]. This method uses the original image as a reference image to correct the classes obtained in the class extraction phase.

  6. Step6:

    The tumor class selection phase: in this step, we want to choose the class that contains the tumor automatically. To this end, we compute the correlation of each class obtained in Step 5 for each image of the treated MIAS database (MIAS_treated). We observe that, for each image, the class with the highest cross-correlation is the tumor class (Table 1). This criterion is therefore used to select the tumor class.

  7. Step7:

    The pectoral muscle elimination phase: most of the selected tumor classes contain several objects, some of which are parts of tumors while others belong to the pectoral muscle. To isolate the tumor objects, in this phase we eliminate the pectoral muscle by applying the region growing method. The method first searches for its starting pixel, which is found in either the left or the right corner of the image. After the elimination of the muscle, the objects that remain in the tumor class represent only the tumor.
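The selection rule of Step 6 can be sketched in Python as follows. The sketch assumes that each class is available as an image of the same size as the reference image (here binary masks) and that classes whose correlation is undefined (NaN, zero denominator) are skipped; the names `zncc` and `select_tumor_class` are illustrative and not taken from the original implementation.

```python
import numpy as np

def zncc(f_g, f_d):
    """Zero-mean normalized cross-correlation; NaN if the denominator is zero."""
    a = np.asarray(f_g, dtype=float).ravel()
    b = np.asarray(f_d, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom != 0 else float("nan")

def select_tumor_class(image, class_images):
    """Return (index of the class with the highest ZNCC, all scores)."""
    scores = np.array([zncc(image, c) for c in class_images])
    if np.all(np.isnan(scores)):
        raise ValueError("correlation is undefined for every class")
    # nanargmax ignores classes whose correlation is NaN
    return int(np.nanargmax(scores)), scores
```

The returned index would then identify the class image passed on to the pectoral muscle elimination phase.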

Fig. 2 Proposed approach

Table 1 The correlation values of the classes (NaN means that the denominator used to calculate the correlation is zero)

4 Results and discussion

4.1 Results

The algorithm described above is used to segment breast tumors in a mammogram automatically. As mentioned in Section 2, the images used in this study were obtained from the mini-MIAS database. Figure 3 presents some examples of breast tumor detection based on the criteria presented above. This figure shows the results obtained for three images; the output of each step of our algorithm is shown in a separate row. These results show that the proposed method can segment and detect the tumor region with good quality.

Fig. 3 Examples of the results obtained on some images

The term pectoral relates to the chest. The pectoral muscle is a large fan-shaped muscle that covers much of the front upper chest, so it is also captured during mammogram acquisition. The pectoral muscle appears as a predominant density region and can therefore severely affect the result of image processing. For better detection accuracy, the pectoral region should be removed from the mammogram image. To remove it, the orientation of the breast must first be determined. After the artifacts are removed, the pectoral region is also removed using connected component labeling methods. Figure 3 shows the images after pectoral muscle removal. Table 2 shows the comparative analysis of pectoral muscle removal results. For the 322 mammograms evaluated, the mean values of accuracy and error are 91.92% and 8.07%, respectively.

Table 2 Comparative analysis of pectoral muscle removal

4.2 Discussion

Most of the works (see Table 3) that address the tumor extraction problem face two main problems: the removal of the pectoral muscle and the extraction of the tumor. In this work, we presented a solution that addresses both problems at the same time. We used the MIAS database to test our approach. Each image contains an abnormality, for which we have the center and an approximate radius of the circle enclosing it.

Table 3 The current literature versus the contribution of the study

5 Conclusion

In this article, we have proposed a method for the classification and automatic detection of tumors in mammogram images. To improve the quality of tumor detection, we first presented a preprocessing technique that uses the Otsu method to remove objects that do not belong to the breast. After the preprocessing step, we estimated the number of classes using the LBP (Local Binary Pattern) technique. We then performed a k-means classification and improved the resulting classes using the hidden Markov method. Finally, we calculated the correlations between these classes and the original image to automatically detect the class that contains the tumor and the pectoral muscle. To eliminate the pectoral muscle, we applied the region growing method. Experimental results on the mini-MIAS database, compared with previous state-of-the-art methods, showed that our method achieves a pectoral muscle removal accuracy of 91.92%.