1 Introduction

Image segmentation plays an important role in artificial intelligence and image understanding [1, 2]. Over the past decades, automatic image segmentation has attracted growing interest. Various categories of image segmentation models have been explored, such as edge detection, texture analysis and finite mixture models [3].

Among finite mixture models, the Gaussian mixture model (GMM) is the most common tool used for image segmentation [4, 5] or video object segmentation [6,7,8,9,10]. The expectation–maximization (EM) algorithm is often used to estimate the parameters of the distribution [11,12,13,14,15].

However, like any finite mixture model, GMM does not take image spatial information into account. The classical GMM considers each pixel as independent, whereas in an image the objects of interest are composed of connected pixels which share common statistical properties: values, colors, textures, etc. Many methods have been proposed to incorporate spatial information in order to improve the conventional GMM [16, 17]. A common way to handle dependencies between neighboring pixels is the use of a Markov random field (MRF) [18], and the incorporation of MRF-based spatial information into mixture models has been proposed for image segmentation [16, 17, 19, 20]. These methods suffer from two drawbacks: (1) in the parameter learning step, the model parameters cannot be estimated directly in the maximization step (M-step) of the EM algorithm, and (2) the use of MRF is computationally expensive. Although MRF-based GMM shows excellent segmentation results, its very high computational cost limits its use in practical applications. Another approach incorporates local spatial information directly by using a mean template (GMM-MT) [21]. GMM-MT was later extended by applying either a weighted arithmetic or a weighted geometric mean template to the conditional and the prior probabilities, yielding the ACAP, ACGP, GCGP and GCAP models [22, 23]. These four models are robust to noise and fast to implement. However, their weights are generally assigned equally to the neighboring pixels, without taking image content into account. A summary of these approaches is given in Table 1.

Table 1 Summary of methods including spatial information in GMM

Recently, visual saliency has become a popular topic for object recognition. This class of methods models the visual attention system, inspired by the neuronal architecture and behavior of the primate early visual system. When the goal of an application is object recognition in an image, a visual saliency map is constructed by combining multiscale low-level image features, such as intensities, colors and orientations. These features aim to identify the most informative parts of an image, which are candidates to belong to an object [24, 25]. Rensink used the saliency map to detect the region of interest in an image and introduced the notion of proto-objects [26,27,28]. Itti and Koch proposed a framework for saliency detection that breaks down the complex problem of image understanding into a rapid and computationally efficient selection of conspicuous locations [29, 30]. This group then extended the saliency model to object recognition tasks [31]. However, image feature-based saliency map extraction is computationally expensive. To overcome this shortcoming, Hou [32] proposed a simple way to extract a saliency map using the spectral residual in the spectral domain. Essentially, a saliency map reflects the visual importance of each pixel in an image. It therefore points out the most noteworthy regions and also introduces rough content-based information.

In this paper, we propose to use a saliency map to incorporate content-based spatial information into the conventional GMM for image segmentation. Our model, called GMM with spatial information extracted from a saliency map (GMM-SMSI), consists of two main steps. First, a saliency map is obtained by means of the image spectral residual. Second, the saliency map is incorporated as spatial information into the conventional GMM. This two-step approach allows us to adapt the neighborhood template of GMM-MT to the image content. The proposed model should improve on the classical GMM scheme because (1) the saliency map directly incorporates spatial information through specific weights assigned to the neighbors of the current pixel and (2) it is easy to implement, since the saliency detection is a step independent of the GMM.

The paper is organized as follows: Sect. 2 describes the proposed method in detail. In Sect. 3 we present and discuss experimental results. Finally, Sect. 4 concludes this paper.

2 GMM with saliency map (GMM-SMSI)

2.1 Saliency map

Based on the human visual system, the concept of a saliency map has been developed for image understanding and object recognition. The saliency map reflects the regions of an image which can be of interest in the sense of visual perception. It highlights the pixels which can potentially carry information to be used in a more complex image classification scheme. In computer vision, visual attention usually focuses on unexpected features in an image. The basic principle of saliency is to suppress the response of frequently occurring features and to keep abnormal ones. Some researchers have exploited the similarities that occur across training images to identify redundant information and thereby detect the saliency map; however, this leads to a heavy computational cost. To reduce the computational complexity, it is worth exploring solutions where a single image suffices for saliency detection [24,25,26,27,28,29,30,31]. Other researchers have extracted saliency maps by making use of information in the spectral domain instead of the spatial domain [32,33,34]. These methods are based on the fact that images share some statistically redundant average information and can be differentiated by their statistical singularities.

In the spectral domain, the statistically redundant average information can be estimated by the average spectrum \(\mathcal{A}(f)\) of an image, which exhibits a local linearity [32]. For an individual image, \(\mathcal{A}(f)\) can be approximated by filtering the image log spectrum \(\mathcal{L}(f)\) with a local average filter \(h_n(f)\):

$$\mathcal{A}(f) = h_{n}(f) * \mathcal{L}(f),$$
(1)

where \(*\) denotes convolution and \(h_n(f)\) is an \(n \times n\) mean convolution filter defined by: \(h_{n}(f) = \frac{1}{n^{2}}\begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}\).

The statistical singularities of each image are captured by the spectral residual \(\mathcal{R}(f)\), given by:

$$\mathcal{R}(f) = \mathcal{L}(f) - \mathcal{A}(f).$$
(2)

The saliency map \(\mathcal{S}(x)\) is then calculated from the residual spectrum as:

$$\mathcal{S}(x) = \mathcal{G}(x) * \mathcal{F}^{-1}\left\{\exp[\mathcal{R}(f) + \mathcal{P}(f)]\right\}^{2},$$
(3)

where \(\mathcal{G}(x)\) denotes a Gaussian filter used to smooth the resulting map, \(\mathcal{F}^{-1}\) is the inverse Fourier transform, \(\mathcal{R}(f)\) denotes the spectral residual, and \(\mathcal{P}(f)\) is the phase spectrum of the image.
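
To make the construction concrete, the following minimal Python sketch implements Eqs. (1)–(3) for a 2-D grayscale array. It follows the reference implementation of [32] in letting the phase enter the exponent as an imaginary part and taking the squared magnitude of the inverse transform; the filter size n and the smoothing width are illustrative choices, not values prescribed by the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(image, n=3, sigma=2.5):
    # Spectral-domain representation of the image
    F = np.fft.fft2(image.astype(float))
    L = np.log(np.abs(F) + 1e-12)        # log spectrum L(f)
    P = np.angle(F)                      # phase spectrum P(f)
    A = uniform_filter(L, size=n)        # average spectrum A(f), Eq. (1)
    R = L - A                            # spectral residual R(f), Eq. (2)
    # Eq. (3): back to the spatial domain; the squared magnitude gives a real map
    S = np.abs(np.fft.ifft2(np.exp(R + 1j * P))) ** 2
    return gaussian_filter(S, sigma=sigma)  # smoothing by the Gaussian G(x)
```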

Figure 1 shows some examples of image saliency maps. The first column shows the original images. The second column shows the log spectrum of each image (blue solid line) and its spectral residual (red solid line); note that the two curves are not plotted on the same scale. The third column shows the corresponding saliency maps. The saliency map is an explicit representation of proto-objects, and most authors apply a simple threshold to the saliency map to detect them. As shown in Fig. 2, this method achieves good results in extracting proto-objects from the saliency map.

Fig. 1

Saliency map construction. The first column shows the original images. The second column shows the log spectrum (blue solid line) and the spectral residual (red solid line) of the corresponding image. The saliency maps are shown in the third column

Fig. 2

Proto-object extraction based on the saliency map

In our approach, the saliency map is used to assess the regional context around a given pixel, i.e., the influence of its neighborhood. In this way, each pixel is no longer considered individually but is influenced by the neighborhood information.

2.2 Saliency-weighted GMM

GMM is a probabilistic model which represents a distribution by a simple linear combination of Gaussian densities. It can be used to cluster \(N\) pixels into \(L\) class labels [3]. Consider the following notation: \(i \in \{1, 2, \ldots, N\}\) denotes an image pixel index, \(x_i\) is the \(i\)th pixel of the image, and \(j \in \{1, 2, \ldots, L\}\) is the class label index. In the conventional GMM, the conditional probability that \(x_i\) belongs to class \(j\) is given by \(\Phi(x_i|\theta_j)\), the \(j\)th Gaussian component:

$$\Phi(x_{i}|\theta_{j}) = \frac{1}{(2\pi)^{D/2}|\Sigma_{j}|^{1/2}}\exp\left[-\frac{1}{2}(x_{i} - \mu_{j})^{T}\Sigma_{j}^{-1}(x_{i} - \mu_{j})\right],$$
(4)

where \(\theta_j = \{\mu_j, \Sigma_j\}\) contains the mean and the covariance matrix of the \(j\)th Gaussian component and \(D\) is the dimensionality of the pixel values.

In a conventional GMM, the pixel distribution can be described by the following equation:

$$f(x_{i}|\Pi, \Theta) = \sum\limits_{j=1}^{L}\pi_{j}\Phi(x_{i}|\theta_{j}),$$
(5)

where \(\Pi = \{\pi_1, \pi_2, \ldots, \pi_L\}\) denotes the set of prior probabilities (also called mixture component weights) and \(\Theta = \{\theta_1, \theta_2, \ldots, \theta_L\}\) is the set of parameters of all Gaussian components.
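
For illustration, Eq. (5) for a scalar (intensity) pixel value reduces to the following one-dimensional sketch, where pi, mu and sigma2 are hypothetical length-L arrays of mixing weights, means and variances:

```python
import numpy as np

def gmm_density(x_i, pi, mu, sigma2):
    # Eq. (5) in one dimension: a weighted sum of the Gaussian
    # components of Eq. (4), evaluated at the pixel value x_i
    p = np.exp(-0.5 * (x_i - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return np.sum(pi * p)
```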

Spatial information can be introduced into the GMM through a weighted template that replaces the conditional probability of \(x_i\) by a weighted combination of the probabilities of its neighbors [21, 23]. In our model, we use the saliency map \(\mathcal{S}(x)\) to assign the proper weights to the neighbors. The saliency-weighted GMM is then:

$$f(x_{i}|\Psi) = \sum\limits_{j=1}^{L}\pi_{ij}\left[\sum\limits_{m \in \mathcal{N}_{i}}\frac{\mathcal{S}(x_{m})}{R_{i}}p(x_{m}|\theta_{j})\right],$$
(6)

where \(\pi_{ij}\) denotes the probability that pixel \(x_i\) belongs to class \(j\), subject to the constraints \(\pi_{ij} \ge 0\) and \(\sum\nolimits_{j=1}^{L}\pi_{ij} = 1\); \(\mathcal{N}_i\) denotes the neighborhood of pixel \(x_i\); \(R_i\) is the sum of the saliency map values inside \(\mathcal{N}_i\); \(\Psi = \{\pi_{11}, \pi_{12}, \ldots, \pi_{1L}, \pi_{21}, \pi_{22}, \ldots, \pi_{2L}, \ldots, \pi_{N1}, \pi_{N2}, \ldots, \pi_{NL}, \theta_1, \theta_2, \ldots, \theta_L\}\) denotes the set of all parameters; and \(\mathcal{S}(x_m)\) is the saliency map value at location \(x_m\).
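
The inner bracket of Eq. (6) can be computed for all pixels at once as a normalized box convolution. The sketch below assumes a square neighborhood \(\mathcal{N}_i\) (e.g., 3 × 3) and a precomputed likelihood map p_j holding \(p(x_m|\theta_j)\) at every pixel; since uniform_filter returns local means rather than sums, the constant window factor cancels in the ratio with \(R_i\):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def weighted_neighborhood_likelihood(p_j, S, win=3):
    # sum_{m in N_i} S(x_m)/R_i * p(x_m|theta_j), for every pixel i at once
    num = uniform_filter(S * p_j, size=win)   # local mean of S * p_j
    den = uniform_filter(S, size=win)         # local mean of S (~ R_i)
    return num / np.maximum(den, 1e-12)
```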

We then apply the EM algorithm for parameter estimation in our model. According to [3], the complete-data log-likelihood function is:

$$Q = \sum\limits_{i=1}^{N}\sum\limits_{j=1}^{L}\gamma_{ij}\left[\sum\limits_{m \in \mathcal{N}_{i}}\frac{\mathcal{S}(x_{m})}{R_{i}}\log p(x_{m}|\theta_{j}) + \log\pi_{ij}\right].$$
(7)

In the expectation step (E-step), the posterior probability can be calculated as follows:

$$\gamma_{ij}^{(t)} = \frac{\pi_{ij}^{(t)}\sum\limits_{m \in \mathcal{N}_{i}}\frac{\mathcal{S}(x_{m})}{R_{i}}p\left(x_{m}|\theta_{j}^{(t)}\right)}{\sum\limits_{h=1}^{L}\pi_{ih}^{(t)}\sum\limits_{m \in \mathcal{N}_{i}}\frac{\mathcal{S}(x_{m})}{R_{i}}p\left(x_{m}|\theta_{h}^{(t)}\right)}.$$
(8)

In the maximization step (M-step), the mean and covariance are computed as follows:

$$\mu_{j}^{(t+1)} = \frac{\sum\limits_{i=1}^{N}\sum\limits_{m \in \mathcal{N}_{i}}\gamma_{ij}^{(t)}\frac{\mathcal{S}(x_{m})}{R_{i}}x_{m}}{\sum\limits_{i=1}^{N}\gamma_{ij}^{(t)}};$$
(9)
$$\Sigma_{j}^{(t+1)} = \frac{\sum\limits_{i=1}^{N}\sum\limits_{m \in \mathcal{N}_{i}}\gamma_{ij}^{(t)}\frac{\mathcal{S}(x_{m})}{R_{i}}\left(x_{m} - \mu_{j}^{(t)}\right)\left(x_{m} - \mu_{j}^{(t)}\right)^{T}}{\sum\limits_{i=1}^{N}\gamma_{ij}^{(t)}}.$$
(10)

And the prior probability is given by:

$$\pi_{ij}^{(t+1)} = \frac{\sum\limits_{m \in \mathcal{N}_{i}}\mathcal{S}(x_{m})\gamma_{mj}^{(t)}}{\sum\limits_{h=1}^{L}\sum\limits_{m \in \mathcal{N}_{i}}\mathcal{S}(x_{m})\gamma_{mh}^{(t)}}.$$
(11)
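
The sketch below assembles Eqs. (8)–(11) into one EM iteration, under the simplifying assumption of scalar (grayscale) pixel values so that \(\Sigma_j\) reduces to a variance. It reuses the normalized-convolution trick of the previous sketch; pi is a pixelwise prior array of shape H × W × L, and the neighborhood size win is an illustrative choice:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def em_step(x, S, pi, mu, sigma2, win=3):
    # One EM iteration of GMM-SMSI for a grayscale image x (H x W).
    L = len(mu)
    Rbar = np.maximum(uniform_filter(S, size=win), 1e-12)  # local mean of S (~ R_i)

    # E-step, Eq. (8): responsibilities from saliency-weighted likelihoods
    wlik = np.empty(x.shape + (L,))
    for j in range(L):
        p_j = np.exp(-0.5 * (x - mu[j]) ** 2 / sigma2[j]) / np.sqrt(2 * np.pi * sigma2[j])
        wlik[..., j] = uniform_filter(S * p_j, size=win) / Rbar
    gamma = pi * wlik
    gamma /= np.maximum(gamma.sum(axis=2, keepdims=True), 1e-12)

    # M-step, Eqs. (9) and (10): saliency-weighted means and variances
    xbar = uniform_filter(S * x, size=win) / Rbar  # inner sum of Eq. (9)
    mu_new, sig_new = np.empty(L), np.empty(L)
    for j in range(L):
        g = gamma[..., j]
        Nj = np.maximum(g.sum(), 1e-12)
        mu_new[j] = (g * xbar).sum() / Nj          # Eq. (9)
        vbar = uniform_filter(S * (x - mu[j]) ** 2, size=win) / Rbar
        sig_new[j] = (g * vbar).sum() / Nj         # Eq. (10), with the old mu_j

    # Eq. (11): priors from saliency-weighted responsibilities over N_i
    num = np.stack([uniform_filter(S * gamma[..., j], size=win)
                    for j in range(L)], axis=-1)
    pi_new = num / np.maximum(num.sum(axis=2, keepdims=True), 1e-12)
    return gamma, pi_new, mu_new, sig_new
```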

2.3 Flowchart of GMM-SMSI

The flowchart of the proposed model is described as follows:

Step 1: Saliency map extraction.

a. The image is converted to the spectral domain using the FFT. This gives the amplitude spectrum \(\mathcal{F}(f)\) and the phase spectrum \(\mathcal{P}(f)\) of the image.

b. The log spectrum \(\mathcal{L}(f)\) is obtained as the logarithm of \(\mathcal{F}(f)\).

c. The average spectrum \(\mathcal{A}(f)\) is estimated using (1).

d. The spectral residual \(\mathcal{R}(f)\) is calculated using (2).

e. The saliency map \(\mathcal{S}(x)\) is generated using (3).

Step 2: GMM incorporating the saliency map as spatial information.

a. The k-means algorithm is used to initialize the parameter set \(\Psi^{(0)}\).

b. The EM algorithm (Eqs. (8)–(11)) is applied for parameter estimation until convergence, yielding the final parameter set \(\Psi^{(c)}\).

c. Each image pixel is then labeled with the class of highest posterior probability.

A brief comparison of the flowcharts of the conventional GMM and GMM-SMSI is shown in Fig. 3. Unlike the conventional GMM, our algorithm does not segment the data directly based on the maximum posterior value, but first extracts a saliency map and incorporates it as a weight when running the GMM algorithm.

Fig. 3

Flowchart of the conventional GMM and GMM-SMSI
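
Putting the two steps together, a hypothetical end-to-end driver could look as follows, reusing the spectral_residual_saliency and em_step sketches above; a fixed iteration budget replaces a proper convergence test for brevity:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def gmm_smsi_segment(image, L=2, n_iter=50, win=3):
    x = image.astype(float)
    S = spectral_residual_saliency(x)                    # Step 1

    # Step 2a: k-means initialization of the parameter set Psi^(0)
    centers, labels = kmeans2(x.reshape(-1, 1), L, minit='++')
    mu = centers.ravel()
    sigma2 = np.array([x.ravel()[labels == j].var() + 1e-6 for j in range(L)])
    pi = np.full(x.shape + (L,), 1.0 / L)

    # Step 2b: EM iterations (Eqs. (8)-(11))
    for _ in range(n_iter):
        gamma, pi, mu, sigma2 = em_step(x, S, pi, mu, sigma2, win=win)

    # Step 2c: label each pixel by the highest posterior probability
    return gamma.argmax(axis=2)
```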

3 Experiments

In this section, experimental results of GMM-SMSI are compared with some classical methods: the Spatially Variant Finite Mixture Model (SVFMM) [20], Fuzzy Local Information C-Means (FLICM) [35], Hidden Markov Random Field with Fuzzy C-Means (HMRF-FCM) [18], and a GMM incorporating an arithmetic mean template into both the conditional and prior probabilities (ACAP) [23]. Our experiments were performed in MATLAB R2013a and executed on an Intel Core i5 2.8 GHz CPU with 12.0 GB RAM. We evaluate these methods on a set of real images from the Berkeley image dataset [36].

The segmentation performance of these methods was evaluated using the Probabilistic Rand (PR) index [37]. The PR index takes values between 0 and 1: a value close to 1 indicates a better segmentation, while a value close to 0 indicates a worse one. In order to analyze the behavior of the methods, we first performed experiments on four classes of image content: a tiny object in a large background region, a large object in a small background region, buildings, and human face images. We then compared the overall performance of the methods using the average PR index and the average computation time over all the images of the Berkeley image dataset.
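
For reference, the pairwise-agreement score underlying the PR index can be computed from a contingency table, as in the sketch below; for a single ground truth this reduces to the Rand index, and the PR index of [37] averages this score over the available human segmentations:

```python
import numpy as np

def rand_index(seg, gt):
    # Fraction of pixel pairs on which the segmentation and the ground
    # truth agree (grouped together in both, or separated in both).
    n = float(seg.size)
    k1, k2 = int(seg.max()) + 1, int(gt.max()) + 1
    cont = np.bincount(seg.ravel() * k2 + gt.ravel(),
                       minlength=k1 * k2).reshape(k1, k2).astype(float)
    sum_sq = (cont ** 2).sum()
    a = (cont.sum(axis=1) ** 2).sum()   # same-label pair mass in seg
    b = (cont.sum(axis=0) ** 2).sum()   # same-label pair mass in gt
    together = (sum_sq - n) / 2                  # pairs together in both
    apart = (n ** 2 - a - b + sum_sq) / 2        # pairs separated in both
    return (together + apart) / (n * (n - 1) / 2)
```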

3.1 Tiny object in large background region

In the first experiment, we chose an image (481 × 321) with two birds in the sky in order to show the ability to segment tiny objects in a large background region (Fig. 4a). The goal was to segment the image into two classes: the two birds and the sky. Comparing the results of the several methods, we noticed that for SVFMM (Fig. 4b, PR = 0.9835) there was a large misclassification of the sky and also of the region between the birds. FLICM (Fig. 4c, PR = 0.9834) showed a similar misclassification of the sky; however, the two birds were now disconnected. With HMRF-FCM (Fig. 4d, PR = 0.9853), the sky misclassification was smaller than with SVFMM and FLICM, but the two birds were difficult to separate. The segmentation accuracy of ACAP (Fig. 4e, PR = 0.9855) was better than that of the other three methods because the sky was well classified; however, the two birds were not separated. Our method, GMM-SMSI (Fig. 4f, PR = 0.9864), was able to distinguish the two birds, and compared to the other methods, the wings of the little bird (green square) showed more details. Furthermore, our algorithm obtained the highest PR index value.

Fig. 4

Tiny object in a large background region segmentation results. a Original image. b SVFMM, PR = 0.9835. c FLICM, PR = 0.9834. d HMRF-FCM, PR = 0.9853. e ACAP, PR = 0.9855. f GMM-SMSI, PR = 0.9864

3.2 Large object in small background region

In the second experiment, we chose an image (481 × 321) with a relatively large starfish on the seabed (Fig. 5a). We again tried to segment the image into two classes: the background and the starfish. As shown in Fig. 5b, SVFMM (PR = 0.6835) was not able to distinguish the upper part of the starfish from the background. Figure 5c shows that FLICM (PR = 0.6834) behaved similarly to SVFMM. Figure 5d shows that HMRF-FCM (PR = 0.7853) achieved a good separation of the background and object clusters, but it was not suitable for segmenting an object with a lot of details. The segmentation of ACAP (PR = 0.6855) was better than that of SVFMM, FLICM and HMRF-FCM, as shown in Fig. 5e, and the starfish was clearly separated from the background, although pixels at the edge of the starfish were still misclassified. As can be seen in Fig. 5f, our algorithm (PR = 0.6986) clearly separated the starfish from the background, and the details of the starfish can be better distinguished. Note, however, that HMRF-FCM obtained the highest PR index value for this image, at the cost of losing object details.

Fig. 5

Large object in a small background region segmentation results. a Original image. b SVFMM, PR = 0.6835. c FLICM, PR = 0.6834. d HMRF-FCM, PR = 0.7853. e ACAP, PR = 0.6855. f GMM-SMSI, PR = 0.6986

3.3 Building

In the third experiment, we tried to segment a building image (481 × 321) into two classes: the church and the background (Fig. 6a). As shown in Fig. 6b–d, for SVFMM (PR = 0.7204), FLICM (PR = 0.8977) and HMRF-FCM (PR = 0.8379), the top right corner of the background was misclassified as church region, with a much larger misclassification for SVFMM than for FLICM and HMRF-FCM. The segmentation accuracy of ACAP (PR = 0.8599) was better, and the sky and the church were clearly separated, as Fig. 6e shows. The segmentation of our method (PR = 0.8362) showed more details of the church, such as the stairs, the left window and the door (Fig. 6f), and was more consistent with human visual perception. It has to be noted, however, that the PR index value of our method was lower than that of FLICM, HMRF-FCM and ACAP, since the door, the stairs, the window and even some shadows were classified as background by our method but as object in the ground truth. Nevertheless, our algorithm showed more details and thus offers more information for image understanding.

Fig. 6

Building image segmentation results. a Original image. b SVFMM, PR = 0.7204. c FLICM, PR = 0.8977. d HMRF-FCM, PR = 0.8379. e ACAP, PR = 0.8599. f GMM-SMSI, PR = 0.8362

3.4 Human face

We also performed our evaluation on a human face image (481 × 321), as shown in Fig. 7a. The goal was to segment the image into two classes. In Fig. 7b, the human face can be detected with SVFMM (PR = 0.7204), but only part of the eyebrows was recovered. As shown in Fig. 7c, FLICM (PR = 0.8977) provided correct facial information; however, the textures of the clothes were not clear. HMRF-FCM (PR = 0.8379) achieved a better clustering than SVFMM and FLICM but lost a lot of information, as shown in Fig. 7d. Similar to HMRF-FCM, ACAP (PR = 0.8599) also achieved a better clustering, as shown in Fig. 7e. In Fig. 7f, it can be seen that our algorithm (PR = 0.8362) showed more details of the human face, such as the full eyebrows, as well as the texture of the clothes. The PR index value of our method was again lower than that of FLICM, HMRF-FCM and ACAP: these methods produced a cleaner clustering, but with a loss of details. Our algorithm showed more details and thus offers more information for face recognition and image understanding.

Fig. 7

Face image segmentation results. a Original image. b SVFMM, PR = 0.7204. c FLICM, PR = 0.8977. d HMRF-FCM, PR = 0.8379. e ACAP, PR = 0.8599. f GMM-SMSI, PR = 0.8362

3.5 Segmentation performance

In this subsection we propose some objective evaluations of SVFMM, FLICM, HMRF-FCM, ACAP and GMM-SMSI. Table 2 presents the PR index values obtained on a sample of nine different images, which vary in terms of the number of classes to be estimated. Our method obtained the best PR index (bold numbers) for almost all of these examples. This performance is confirmed by the mean of the PR indexes obtained by the several methods on all the images of the Berkeley image dataset: GMM-SMSI obtained the highest mean PR index (nearly 12.55% higher than SVFMM) and can thus be considered globally the most accurate algorithm. Figure 8 shows the boxplot of the PR indexes obtained by each method on the whole Berkeley image dataset. It confirms that GMM-SMSI has the highest median value, as well as the smallest interquartile range, which indicates a higher robustness.

Table 2 Comparison between the different methods applied on the Berkeley image dataset, PR index and mean computation time
Fig. 8

Boxplot of the PR index values for the several algorithms applied on the Berkeley image dataset

3.6 Computation time

In this subsection, we evaluate the computation time of SVFMM, FLICM, HMRF-FCM, ACAP and GMM-SMSI. Table 2 also presents the average computation time of each method when applied to the whole image set. GMM-SMSI had the lowest computation time, nearly 14.67% of that of HMRF-FCM. This result can be explained by several facts: the spatial information is computed only once, so no further spatial search is needed during the iterations; the spatial information is integrated explicitly in the EM scheme; and the spatial information helps the EM algorithm to converge faster. Based on these experiments, the proposed GMM-SMSI algorithm brings benefits in terms of accuracy, computation time and the capability to display more detailed information.

4 Conclusion

In this paper, we have proposed a new algorithm, the Gaussian mixture model with spatial information extracted from a saliency map (GMM-SMSI), built on the classical Gaussian mixture model. The saliency map helps to incorporate image content-based spatial information into the GMM: the conditional probability of an image pixel is replaced by a combination of the probabilities in its immediate neighborhood, weighted by the saliency map. The saliency map assigns proper weights to each pixel's neighborhood so as to enhance the role of significant pixels. Since the saliency map extraction is independent of the GMM, the proposed model is simple to implement, and the parameters of GMM-SMSI can be easily estimated by the EM algorithm. In experiments performed on the public Berkeley image dataset, we have shown that the proposed GMM-SMSI method outperforms state-of-the-art methods in terms of both classification accuracy and computation time. Moreover, these experiments indicate that our method can detect more object details in an image. In summary, the proposed GMM-SMSI is an accurate, robust and fast algorithm which can be easily implemented.