1 Introduction

Salient object detection [5] refers to the extraction of the dominant objects (salient objects) in an image, i.e. those that automatically attract visual attention. It is a challenging problem in the field of computer vision and has many real-time applications in surveillance systems, remote sensing and image retrieval. It is also helpful in automatic target detection, robotics, image and video compression, automatic cropping/centering to display objects on small portable screens, medical imaging, advertisement design, image enhancement and many more.

Salient object detection involves the transformation of the original image into a saliency map [14] such that the salient objects are highlighted while the background is suppressed. A saliency map generally takes values in [0, 1]; the higher the value of a pixel, the higher its chance of being a salient pixel. The approaches for salient object detection can be broadly classified into two main categories [7]: bottom-up and top-down. Bottom-up approaches involve the extraction of low-level features from the image, which are then combined into a saliency map; they are fast, stimulus-driven and task-independent. Top-down approaches, in contrast, exploit human observation behavior to accomplish specific goals and are task-dependent. Top-down approaches are usually combined with bottom-up approaches to detect salient objects.

Most research works have focused on the bottom-up aspect of visual attention. With the advancement of these bottom-up approaches, researchers started distinguishing two closely related problems: fixation prediction and salient object detection. Fixation prediction models try to mimic human vision, exploiting the observation that the human eyes focus on only a few points of a scene when it is shown for a few seconds; these points are helpful in predicting eye movements. Salient object detection models, the second category, detect the most salient object in an image by segmenting it into two regions, salient object and background, drawing an accurate silhouette of the salient object. Both categories of models construct saliency maps, which are useful for different purposes. The research community has suggested different schemes for combining a set of low-level features into a saliency map. The work of Itti et al. [14] is motivated by the neuronal activity of the receptive fields in the human visual system: three features, intensity, color and orientation, were considered equally important and were linearly combined into a saliency map. Liu et al. [19] proposed a supervised approach that learns a weight vector to combine the multi-scale contrast, center-surround histogram and color spatial distribution features into a saliency map. We also investigated other popular related models. Bruce and Tsotsos [6] modeled visual saliency using the concept of information maximization. Han et al. [10] applied region-growing techniques over the saliency map of Itti et al. [14] to extract salient regions. Meur et al. [22] used subband-decomposition-based energy of the chromatic as well as the achromatic channels to compute saliency. Harel et al. [11] extended the work of Itti et al. [14] into a graph-based visual saliency model. Hou and Zhang [12] gave a simple and fast method for visual saliency detection by extracting the spectral residual of the image. Yu and Wong [29] extracted salient objects at the grid level instead of the pixel level. Zhang et al. [30] used a Bayesian framework to compute the probability of a target at every location in the image. Achanta et al. [2] used an image subtraction technique to build a frequency-tuned saliency model. Achanta and Susstrunk [1] derived a visual saliency model from the maximum symmetric surround difference of every pixel in the image. Zhang et al. [31] combined position, area and intensity saliency based on the outcome of scalable subtractive clustering, and employed a Bayesian framework to classify each pixel as an attention pixel or a background pixel. Goferman et al. [9] proposed a context-aware saliency detection algorithm. Liu et al. [20] used kernel density estimation and a two-phase graph-cut approach to detect salient objects. Shen and Wu [24] decomposed a feature-space representation of the image into a low-rank matrix plus sparse noise to detect the salient object. Vikram et al. [27] randomly sampled the image into a number of rectangular regions and computed local saliency over these regions. İmamoğlu et al. [13] proposed a saliency detection model based on low-level features extracted with the wavelet transform. Singh and Agrawal [25] modified the Liu et al. [19] model at the feature level and employed a combination of Kullback-Leibler divergence and Manhattan distance to detect salient objects. Liu et al. [21] proposed a saliency tree approach to extract salient objects from the image. Zhu et al. [34] used a multisize superpixel approach based on multivariate normal distribution estimation for salient object detection. Peng et al. [23] suggested saliency-aware image-to-class distances for image classification. Jiang et al. [15] proposed a multi-level image segmentation technique that uses supervised learning to map a regional feature vector to a saliency score.

A few researchers have extended saliency detection to co-saliency detection, e.g. Fu et al. [8], who used two-layer clustering: one layer groups the pixels within each image (single image), while the other layer associates pixels across all images (multi-image).

Recently, researchers have also suggested models based on deep learning. Zhao et al. [33] proposed a multi-context deep learning framework using deep convolutional neural networks for salient object detection. Lin et al. [18] suggested a model that learns mid-level features on top of low-level k-means filters within a unified deep framework applied in a convolutional manner for saliency detection. Zhang et al. [32] proposed a co-saliency detection method based on intrasaliency prior transfer and deep intersaliency mining. Li and Yu [16, 17] suggested a deep contrast learning method for salient object detection using deep convolutional neural networks.

What the related models have in common is that they explore multiple low-level features of the image and then combine them using different strategies. The features involved are either of the same size as the image or of reduced size. The models are evaluated on publicly available datasets in terms of detection accuracy and computation time. Experimental results demonstrate that most of the models [1, 2, 12, 14, 22, 27, 29, 30, 31] take less computation time but deliver degraded detection accuracy, because of either the reduced image size or simpler combination strategies. On the other hand, models such as [9, 13, 19, 20, 24, 25] achieve better detection accuracy at the cost of higher computation time, because they either work on the full-resolution image or involve some kind of learning technique for combining the low-level features. There is therefore a need for a model that takes less computation time while simultaneously achieving high detection accuracy. One possible way to realize this objective is to use a single dominant feature that is sufficient to describe an image, instead of the multiple features commonly used in most state-of-the-art methods. In state-of-the-art models dealing with multiple features, we have observed experimentally that the color feature is most commonly used and dominates the remaining features. Snowden [26] also suggested that a purely chromatic signal is sufficient to capture visual attention. The color feature can be extracted either at the local or the global level; since colors are widely spread in an image, color as a global feature may be more appropriate.

In this paper we propose an approach that uses the color feature at the global level to detect the salient object. The model is motivated by the observation that an image can be viewed as being generated by several signals (say k), assumed to be Gaussian; a signal here corresponds to the various shades of one color present in the image. A mixture of Gaussians then needs to be built over these signals using a parametric estimation technique. The images present in typical datasets consist of a very large number of pixels, and estimating the parameters of the k Gaussians (weight, mean and covariance) from all of these pixels requires a huge computation time. If, instead, this large number of pixels is first reduced to a smaller number of regions of similar pixels, the parameters of the k signals can be estimated in much less time.

In the proposed model, the original RGB image of size W × H, where W and H represent the width and height of the image respectively, is first divided into m superpixels using the SLIC superpixel algorithm [3], which is fast and efficient. Since a superpixel comprises pixels of similar color, each superpixel is represented by the mean value of its pixels, thereby reducing the image to only m representative pixels. The colors of these m superpixels are further clustered into k color components using the k-means algorithm. The result of the clustering procedure is used to build a Gaussian mixture model, whose parameters are then refined using the Expectation-Maximization algorithm. Thereafter, the spatial variance of these color components is computed and a center-weighted saliency map is formulated.

Researchers have so far adopted superpixels for computing saliency at the local level [28] (i.e. within a specific neighborhood of a superpixel) rather than at the global level (i.e. considering the complete image as a whole). The problem at the local level is that only smaller objects are captured and receive high saliency values, while larger objects are discarded and receive low saliency values. To capture the details of larger objects as well, we use superpixels at the global level. The use of superpixels and a GMM to capture saliency at the global level in a computationally efficient manner is the innovation of the proposed method.

In order to check the efficacy of the proposed model, experiments are carried out on seven publicly available image datasets. The performance is evaluated in terms of precision, recall, F-measure, area under the curve and computation time, and is compared with seventeen other popular existing models.

The paper is organized as follows. Section 2 describes the proposed model. The experimental setup and results are presented in Section 3. Conclusion and future work are given in Section 4.

2 Proposed model

In general, humans can effortlessly detect salient objects with high accuracy in real time. It is a challenge to develop a model that mimics this behavior, achieving high detection accuracy with low computation time. One way of accomplishing this is to use a single dominant feature that best characterizes an image. We have investigated the features used in different state-of-the-art models and found that features computed in terms of color are the most common. Snowden [26] has also suggested that a purely chromatic signal is sufficient to capture visual attention. A feature for salient object detection can be extracted at two different levels, local or global: at the local level a certain region is picked within the image and saliency is computed over it, while at the global level the complete image is considered when computing saliency. Since color is widely spread across an image, using color as a global feature may be more appropriate.

The proposed model employs the concept of SuperPixels and a Gaussian Mixture Model (SP-GMM), which is discussed in detail below.

2.1 Gaussian mixture model construction

In the color space, clustering the RGB image I, i.e. \( \boldsymbol{I}(p)={\left[\boldsymbol{R}(p)\kern0.3em \boldsymbol{G}(p)\kern0.3em \boldsymbol{B}(p)\right]}^T \), of size W × H into k regions and then constructing a Gaussian mixture model is a time-consuming process. However, if the number of pixels is decreased to m such that m ≪ W × H, the computation time can be considerably reduced. So the input RGB image is first divided into m superpixels using the SLIC superpixel algorithm [3]. Let SP be the set containing the RGB values of the m superpixels, given by

$$ \mathbf{SP}={\left\{{\boldsymbol{S}\boldsymbol{P}}_i\right\}}_{i=1}^m;{\boldsymbol{S}\boldsymbol{P}}_i=\frac{1}{\left|{\boldsymbol{S}}_i\right|}\sum_{p\in {\boldsymbol{S}}_i}{\left[\boldsymbol{R}(p)\kern0.2em \boldsymbol{G}(p)\kern0.2em \boldsymbol{B}(p)\right]}^T $$
(1)

where \( {\boldsymbol{SP}}_i \) is the RGB value of the i-th superpixel, \( {\boldsymbol{S}}_i \) is the set of pixels in the i-th superpixel and \( \left|{\boldsymbol{S}}_i\right| \) represents its size, such that \( \sum_{i=1}^m\left|{\boldsymbol{S}}_i\right|= W\times H \). The set SP is then partitioned into k clusters using the k-means algorithm, and the resulting partition is used to initialize a Gaussian mixture model (GMM).
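As an illustration, the following Python sketch computes the superpixel colors of Eq. (1) with scikit-image's SLIC implementation and partitions them with scikit-learn's k-means. The function names, the [0, 1] image range and the default arguments are our assumptions, not part of the published method.

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.cluster import KMeans

def superpixel_colors(image, m=200):
    """Mean RGB value of each SLIC superpixel, as in Eq. (1).
    `image` is an (H, W, 3) float array with values in [0, 1]."""
    labels = slic(image, n_segments=m, start_label=0)  # (H, W) label map
    n = labels.max() + 1                               # SLIC may return < m segments
    SP = np.array([image[labels == i].mean(axis=0) for i in range(n)])
    return SP, labels

# Partition the superpixel colors into k clusters (k = 5 in Section 3):
# SP, labels = superpixel_colors(image)
# km = KMeans(n_clusters=5, n_init=10).fit(SP)
```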

The parameters of the GMM are the weights, means and covariances of the Gaussians. The initial weight \( {w}_i^0 \) of the i-th cluster is given as

$$ {w}_i^0=\frac{n_i}{m}\kern0.5em i=1,2,\dots, k $$
(2)

where \( {n}_i \) is the number of superpixels belonging to the i-th cluster. The initial mean \( {\boldsymbol{\mu}}_i^0 \) of the i-th cluster is given as

$$ {\boldsymbol{\mu}}_i^0=\frac{1}{n_i}\sum_{j\in {\boldsymbol{P}}_i}{\boldsymbol{SP}}_j\ i=1,2,\dots, k $$
(3)

where \( {\boldsymbol{P}}_i \) is the set of superpixels belonging to the i-th cluster. The initial covariances \( {\boldsymbol{\varSigma}}_i^0 \) are defined as

$$ {\boldsymbol{\varSigma}}_i^0=\frac{1}{n_i-1}\sum_{j\in {\boldsymbol{P}}_i}\left({\boldsymbol{SP}}_{\boldsymbol{j}}-{\boldsymbol{\mu}}_{\boldsymbol{i}}^0\right){\left({\boldsymbol{SP}}_{\boldsymbol{j}}-{\boldsymbol{\mu}}_{\boldsymbol{i}}^0\right)}^{\boldsymbol{T}}; i=1,2,\dots, k $$
(4)
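A minimal sketch of this initialization from the k-means partition might look as follows; the small ridge added to each covariance is our own numerical safeguard, not part of the paper.

```python
def init_gmm_params(SP, cluster_ids, k):
    """Initial weights, means and covariances (Eqs. 2-4).
    cluster_ids[j] is the k-means cluster of superpixel j; the sketch
    assumes each cluster contains at least two superpixels."""
    m = SP.shape[0]
    w = np.zeros(k)
    mu = np.zeros((k, 3))
    Sigma = np.zeros((k, 3, 3))
    for i in range(k):
        Pi = SP[cluster_ids == i]          # superpixels in cluster i
        n_i = Pi.shape[0]
        w[i] = n_i / m                     # Eq. (2)
        mu[i] = Pi.mean(axis=0)            # Eq. (3)
        d = Pi - mu[i]
        Sigma[i] = d.T @ d / (n_i - 1)     # Eq. (4), sample covariance
        Sigma[i] += 1e-6 * np.eye(3)       # ridge for stability (our addition)
    return w, mu, Sigma
```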

Thereafter, the expectation-maximization (EM) algorithm is applied to update the parameters of the GMM until convergence is achieved. Using the current parameters at the l-th iteration, the probability that superpixel j belongs to the i-th cluster is calculated as

$$ {Pr}^l\left( i|{\boldsymbol{SP}}_j\right)=\frac{w_i^l\mathcal{N}\left({\boldsymbol{SP}}_j|{\boldsymbol{\mu}}_i^l,{\boldsymbol{\varSigma}}_i^l\right)}{\sum_{t=1}^k{w}_t^l\mathcal{N}\left({\boldsymbol{SP}}_j|{\boldsymbol{\mu}}_t^l,{\boldsymbol{\varSigma}}_t^l\right)} $$
(5)

Then the weight, mean and covariance of each Gaussian are updated as

$$ \begin{array}{c}\hfill {w}_i^{l+1}=\frac{1}{m}\sum_{j=1}^m{Pr}^l\left( i|{\boldsymbol{SP}}_j\right)\hfill \\ {}\hfill {\boldsymbol{\mu}}_i^{l+1}=\frac{\sum_{j=1}^m{Pr}^l\left( i|{\boldsymbol{SP}}_j\right)\cdot {\boldsymbol{SP}}_j}{\sum_{j=1}^m{Pr}^l\left( i|{\boldsymbol{SP}}_j\right)}\hfill \\ {}\hfill {\boldsymbol{\varSigma}}_i^{l+1}=\frac{\sum_{j=1}^m{Pr}^l\left( i|{\boldsymbol{SP}}_j\right)\cdot \left({\boldsymbol{SP}}_j-{\boldsymbol{\mu}}_i^l\right){\left({\boldsymbol{SP}}_j-{\boldsymbol{\mu}}_i^l\right)}^T}{\sum_{j=1}^m{Pr}^l\left( i|{\boldsymbol{SP}}_j\right)}\hfill \end{array} $$
(6)
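One EM iteration can be sketched as below, using scipy's multivariate normal density. Note that Eq. (6) reuses the previous mean \( {\boldsymbol{\mu}}_i^l \) in the covariance update, and the sketch follows that choice.

```python
from scipy.stats import multivariate_normal

def em_step(SP, w, mu, Sigma):
    """One EM iteration: responsibilities of Eq. (5), updates of Eq. (6)."""
    m, k = SP.shape[0], w.size
    # E-step: R[j, i] = Pr^l(i | SP_j)
    R = np.column_stack([w[i] * multivariate_normal.pdf(SP, mu[i], Sigma[i])
                         for i in range(k)])
    R /= R.sum(axis=1, keepdims=True)
    # M-step
    Nk = R.sum(axis=0)                     # soft cluster sizes
    w_new = Nk / m
    mu_new = (R.T @ SP) / Nk[:, None]
    Sigma_new = np.empty_like(Sigma)
    for i in range(k):
        d = SP - mu[i]                     # Eq. (6) uses the old mean here
        Sigma_new[i] = (R[:, i, None] * d).T @ d / Nk[i]
    return w_new, mu_new, Sigma_new
```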

The log-likelihood at iteration l + 1 is computed as

$$ {loglik}^{l+1}=\sum_{j=1}^m\log\left(\sum_{i=1}^k{w}_i^{l+1}\mathcal{N}\left({\boldsymbol{SP}}_j|{\boldsymbol{\mu}}_i^{l+1},{\boldsymbol{\varSigma}}_i^{l+1}\right)\right) $$
(7)

Eqs. (5)-(7) are repeated until convergence is achieved, with the convergence condition given by

$$ \left|{loglik}^{l+1}-{loglik}^l\right|<{10}^{-3} $$
(8)
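The full EM loop of Eqs. (5)-(8) might then be sketched as follows, building on em_step above; the iteration cap is our safeguard against slow convergence and is not part of the paper.

```python
def fit_gmm(SP, w, mu, Sigma, tol=1e-3, max_iter=200):
    """Iterate em_step until the log-likelihood change (Eq. 7)
    satisfies the convergence condition of Eq. (8)."""
    prev_loglik = -np.inf
    for _ in range(max_iter):                # cap is our addition
        w, mu, Sigma = em_step(SP, w, mu, Sigma)
        mix = sum(w[i] * multivariate_normal.pdf(SP, mu[i], Sigma[i])
                  for i in range(w.size))
        loglik = np.log(mix).sum()           # Eq. (7)
        if abs(loglik - prev_loglik) < tol:  # Eq. (8)
            break
        prev_loglik = loglik
    return w, mu, Sigma
```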

Using the final parameter values of the GMM, every pixel p of the original RGB image I of size W × H is assigned to the i-th cluster with a probability given as

$$ {Pr}^{final}\left( i|\boldsymbol{I}(p)\right)=\frac{w_i\mathcal{N}\left(\boldsymbol{I}(p)|{\boldsymbol{\mu}}_i,{\boldsymbol{\varSigma}}_i\right)}{\sum_{j=1}^k{w}_j\mathcal{N}\left(\boldsymbol{I}(p)|{\boldsymbol{\mu}}_j,{\boldsymbol{\varSigma}}_j\right)} $$
(9)

where \( {w}_i \), \( {\boldsymbol{\mu}}_i \) and \( {\boldsymbol{\varSigma}}_i \) are the weight, mean and covariance matrix of the i-th cluster respectively.
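Assuming the fitted parameters from above, the per-pixel assignment of Eq. (9) can be sketched as:

```python
def pixel_posteriors(image, w, mu, Sigma):
    """Posterior Pr(i | I(p)) of Eq. (9) for every pixel of the
    original image; returns an (H, W, k) array."""
    H, W, _ = image.shape
    X = image.reshape(-1, 3)               # all W*H pixels as RGB rows
    P = np.column_stack([w[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                         for i in range(w.size)])
    P /= P.sum(axis=1, keepdims=True)
    return P.reshape(H, W, -1)
```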

2.2 Spatial variance and saliency map computation

The spatial variance measures how widely a color component is distributed across the image. The lower the spatial variance of a color component, the better its chance of being salient, and vice versa. In the spatial domain, the variance of the i-th cluster is computed in both the horizontal and the vertical direction. The horizontal variance \( {V}_i^h \) of the i-th cluster is given as

$$ {V}_i^h=\frac{\sum_{p\in \boldsymbol{P}}{Pr}^{final}\left( i|\boldsymbol{I}(p)\right).{\left({x}_p-{M}_i^h\right)}^2}{\sum_{p\in \boldsymbol{P}}{Pr}^{final}\left( i|\boldsymbol{I}(p)\right)} $$
(10)

where \( {M}_i^h=\frac{\sum_{p\in \boldsymbol{P}}{Pr}^{final}\left( i|\boldsymbol{I}(p)\right).{x}_p}{\sum_{p\in \boldsymbol{P}}{Pr}^{final}\left( i|\boldsymbol{I}(p)\right)} \), \( {x}_p \) is the x-coordinate of pixel p and P is the set of all pixels in the image. The vertical variance \( {V}_i^v \) is computed similarly. The total spatial variance is given by

$$ {V}_i={V}_i^h+{V}_i^v $$
(11)

\( {V}_i \) is normalized to [0, 1] as

$$ {V}_i=\frac{V_i-{\min}_j\left({V}_j\right)}{{\max}_j\left({V}_j\right)-{\min}_j\left({V}_j\right)} $$
(12)
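A sketch of Eqs. (10)-(12), with the vertical variance obtained by swapping coordinates, is given below; P is the (H, W, k) posterior array from the previous step.

```python
def spatial_variance(P):
    """Normalized spatial variance of each color component (Eqs. 10-12)."""
    H, W, k = P.shape
    ys, xs = np.mgrid[0:H, 0:W]               # pixel coordinates
    V = np.empty(k)
    for i in range(k):
        p = P[:, :, i]
        s = p.sum()
        Mh = (p * xs).sum() / s               # weighted horizontal mean M_i^h
        Mv = (p * ys).sum() / s               # vertical analogue
        Vh = (p * (xs - Mh) ** 2).sum() / s   # Eq. (10)
        Vv = (p * (ys - Mv) ** 2).sum() / s   # vertical variance
        V[i] = Vh + Vv                        # Eq. (11)
    return (V - V.min()) / (V.max() - V.min())  # Eq. (12)
```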

Thereafter, a center-weighted scheme is applied to give more weight to clusters near the center of the image. The position weight \( {D}_i \) of the i-th cluster is given by

$$ {D}_i=\sum_{p\in \boldsymbol{P}}{Pr}^{final}\left( i|\boldsymbol{I}(p)\right).{d}_p $$
(13)

where \( {d}_p \) is the L2 distance between pixel p and the image center. \( {D}_i \) is also normalized to [0, 1] as

$$ {D}_i=\frac{D_i-{\min}_j\left({D}_j\right)}{{\max}_j\left({D}_j\right)-{\min}_j\left({D}_j\right)} $$
(14)
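The center weighting of Eqs. (13)-(14) can be sketched as below; placing the center at ((W-1)/2, (H-1)/2) is our convention, since the paper does not specify the pixel indexing.

```python
def center_weight(P):
    """Normalized position weight of each cluster (Eqs. 13-14)."""
    H, W, k = P.shape
    ys, xs = np.mgrid[0:H, 0:W]
    d = np.hypot(xs - (W - 1) / 2.0, ys - (H - 1) / 2.0)      # L2 distance d_p
    D = np.array([(P[:, :, i] * d).sum() for i in range(k)])  # Eq. (13)
    return (D - D.min()) / (D.max() - D.min())                # Eq. (14)
```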

Finally, the pixel-wise saliency map SM is given as

$$ \mathbf{SM}(p)=\sum_{i=1}^k{Pr}^{final}\left( i|\boldsymbol{I}(p)\right)\cdot \left(1-{V}_i\right)\cdot \left(1-{D}_i\right) $$
(15)

The values of the saliency map SM are normalized to [0, 1] as

$$ \mathbf{SM}=\frac{\boldsymbol{SM}-\mathit{\min}\left(\boldsymbol{SM}\right)}{\mathit{\max}\left(\boldsymbol{SM}\right)-\mathit{\min}\left(\boldsymbol{SM}\right)} $$
(16)
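The final combination of Eqs. (15)-(16), together with a driver for the whole pipeline, might read as follows; the threshold value in the last line is hypothetical, since the paper does not fix it.

```python
def saliency_map(P, V, D):
    """Pixel-wise saliency map (Eqs. 15-16): low spatial variance and
    proximity to the image center both raise saliency."""
    SM = (P * (1.0 - V) * (1.0 - D)).sum(axis=2)    # Eq. (15), broadcast over k
    return (SM - SM.min()) / (SM.max() - SM.min())  # Eq. (16)

# Full pipeline on one image (m = 200, k = 5 as in Section 3):
# SP, _ = superpixel_colors(image)
# km = KMeans(n_clusters=5, n_init=10).fit(SP)
# P = pixel_posteriors(image, *fit_gmm(SP, *init_gmm_params(SP, km.labels_, 5)))
# SM = saliency_map(P, spatial_variance(P), center_weight(P))
# mask = SM > 0.5   # hypothetical threshold for the attention mask
```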

A threshold is applied to the saliency map to generate an attention mask. Fig. 1 depicts the working of the model on some sample images.

Fig. 1
a Original image b SLIC superpixels c Saliency map d Ground truth

3 Experimental setup and results

Great care has been taken in evaluating the related models: the parameters suggested in the corresponding papers have been used, and the saliency maps computed accordingly. Table 1 lists the parameter values of the various models. Both qualitative and quantitative evaluations are performed to measure the performance of the proposed model against the existing approaches. All experiments are carried out in a Windows 7 environment on an Intel(R) Xeon(R) processor running at 2.27 GHz with 4 GB RAM.

Table 1 Parameter values of various models

3.1 Salient object database

The performance of the proposed model and of seventeen other related models is examined on the following seven publicly available datasets (Table 2):

Table 2 Datasets used for salient object detection

The test dataset comprises all 12,500 of these images and is used for performance evaluation.

3.2 Qualitative evaluation

The qualitative evaluation of the proposed model and the seventeen related models is shown in Fig. 2. We have chosen images from the test dataset containing objects that differ in shape, size, position, type, etc. It can be clearly seen from Fig. 2 that the proposed model yields better saliency maps than the related methods.

Fig. 2
Saliency maps for different state-of-the-art models and the proposed model

3.3 Quantitative evaluation

The quantitative evaluation of the proposed model and the seventeen other models is done in terms of precision, recall, F-measure, area under the curve (AUC) and computation time. Using the ground truth G and the detection result R, precision, recall and F-measure are calculated as

$$ \begin{array}{c}\hfill Precision=\frac{TP}{TP+ FP}\hfill \\ {}\hfill Recall=\frac{TP}{TP+ FN}\hfill \\ {}\hfill {F}_{\beta}=\frac{\left(1+{\beta}^2\right)\times Precision\times Recall}{\beta^2\times Precision+ Recall}\hfill \\ {}\hfill TP={\sum}_{\mathbf{G}\left( x, y\right)=1}\mathbf{R}\left( x, y\right);\kern1em FP={\sum}_{\mathbf{G}\left( x, y\right)=0}\mathbf{R}\left( x, y\right)\hfill \\ {}\hfill FN={\sum}_{\mathbf{R}\left( x, y\right)=0}\mathbf{G}\left( x, y\right);\kern1em TN={\sum}_{\mathbf{R}\left( x, y\right)=0}\left(1-\mathbf{G}\left( x, y\right)\right)\hfill \end{array} $$
(17)

where β = 1, since we give equal weight to precision and recall. TP (true positives) is the number of salient pixels detected as salient, FP (false positives) is the number of background pixels detected as salient, and FN (false negatives) is the number of salient pixels detected as background.
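Under the continuous-map reading of Eq. (17), the three measures can be sketched as below for a saliency map R with values in [0, 1] and a binary ground-truth mask G.

```python
def precision_recall_f(R, G, beta2=1.0):
    """Precision, recall and F-measure of Eq. (17) with beta^2 = 1."""
    TP = R[G == 1].sum()       # saliency mass on ground-truth pixels
    FP = R[G == 0].sum()       # saliency mass on background pixels
    FN = G[R == 0].sum()       # object pixels given zero saliency
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
    return precision, recall, f
```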

AUC is computed from the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR). TPR and FPR are given by

$$ \begin{array}{c}\hfill TPR=\frac{TP}{\sum_{\left( x, y\right)}\mathbf{G}\left( x, y\right)}\hfill \\ {}\hfill FPR=\frac{FP}{W\times H-{\sum}_{\left( x, y\right)}\mathbf{G}\left( x, y\right)}\hfill \end{array} $$
(18)

where W and H represent the width and height of the image respectively. The saliency maps of the proposed model and of the state-of-the-art models are first normalized to [0, 255]. Then 256 thresholds are applied one by one, the values of TPR and FPR are computed at each threshold, the ROC curve is plotted and, finally, the area under the curve (AUC) is calculated. Table 3 shows the quantitative performance of the proposed method in comparison with the other state-of-the-art methods on all seven datasets, including the average computation time per image. The corresponding ROC curves are shown in Fig. 3.
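A sketch of this 256-threshold ROC sweep is given below; binarizing the map at each threshold before counting TP and FP is our reading of the procedure, and the trapezoidal rule is used for the integration.

```python
def roc_auc(SM, G):
    """AUC of the ROC curve (Eq. 18) over 256 thresholds."""
    R = np.round(255 * (SM - SM.min()) / (SM.max() - SM.min()))
    pos = G.sum()                        # sum of G(x, y)
    neg = G.size - pos                   # W*H minus sum of G(x, y)
    tpr = [((R >= t) & (G == 1)).sum() / pos for t in range(256)]
    fpr = [((R >= t) & (G == 0)).sum() / neg for t in range(256)]
    return np.trapz(tpr[::-1], fpr[::-1])  # integrate with FPR ascending
```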

Table 3 Quantitative comparison on seven datasets and their computation time
Fig. 3
ROC curves for the seven datasets (a) MSRA-B (b) ASD (c) SAA_GT (d) SOD (e) SED1 (f) SED2 (g) ECSSD

The number of superpixels (m) and the number of clusters (k) used to build the Gaussian mixture model play a vital role. The number of superpixels was varied from 50 to 500, and it was found that performance increases with the number of superpixels up to m = 200 and remains essentially constant thereafter. It can be observed from Figs. 4 and 5 that the best values of the performance measures are obtained at m = 200 and k = 5.

Fig. 4
Parameter analysis of the number of superpixels (m)

Fig. 5
Parameter analysis of the number of clusters (k)

Table 3 shows the quantitative evaluation of the proposed model in comparison with the seventeen related models; the best results are shown in bold.

  • MSRA-B

  • The proposed model captures fine shape information, which earns it the highest precision, recall and F-measure.

  • The proposed model achieves the best AUC value among all models except SA [25].

  • ASD

  • The proposed model gives the highest precision, recall, F-measure and AUC values.

  • SAA GT

  • The proposed model gives the highest precision, recall, F-measure and AUC values.

  • SOD

  • The proposed model achieves the highest precision, recall and F-measure.

  • The proposed model achieves the best AUC value among all models except COSAL [8].

  • SED1

  • The proposed model has the highest precision, recall and F-measure.

  • The proposed model achieves the best AUC value among all models except COSAL [8].

  • SED2

  • The proposed model gives the highest precision, recall, F-measure and AUC values.

  • ECSSD

  • The proposed model achieves the highest precision, recall, F-measure and AUC values.

  • Computation Time

  • The SR [12] model takes the least computation time.

  • Compared with models such as Liu [19], AIM [6], GBVS [11], SUN [30], Gof [9], Shen [24], WT [13], SA [25], DRFI [15], DCL [16] and MDF [17], the proposed model achieves better detection accuracy and requires much less time.

4 Conclusion and future work

Salient object detection can be achieved either by exploring bottom-up components alone or by integrating them with top-down components. The research community has mostly been fascinated by the bottom-up components, as these methods are fast and task-independent. Researchers have tried to improve detection accuracy at the cost of model complexity, which is computationally expensive; other efforts have reduced the computation time but degraded the detection accuracy. In the proposed model, we attempted to improve salient object detection accuracy with less computation time. The model employs SLIC superpixels, a Gaussian mixture model and the Expectation-Maximization algorithm to detect a salient object. Images in the datasets are typically of size 300 × 400, i.e. around 0.12 million pixels, and estimating the parameters of the Gaussians (weight, mean and covariance) from 0.12 million samples is time-consuming. We therefore reduce the number of samples, say to 200 superpixels, with little loss in the estimated parameter values, so that the computation time is reduced to a considerable extent.

Experimental results demonstrate that the proposed model outperforms the existing methods in terms of precision, recall and F-measure on all seven datasets, and in terms of AUC on four datasets. In comparison to many state-of-the-art models, the proposed model also requires less computation time.

There remain further challenges in detecting salient objects, including partial occlusion, background clutter and articulation. Moreover, the datasets used in our experiments contain images with only one salient object; the work may be extended to detect any number of salient objects, or no salient object at all.