1 Introduction

The human visual system and the brain process visual information remarkably quickly. The visual system achieves this speed by focusing on a "distinctive and attentive" object or action and processing it before the other regions. The human eye fixates on a distinctive region with higher priority and spends much of the processing time on it compared to the non-distinctive regions. In computer vision, such distinctive regions are called salient regions, and the map describing their distinctiveness is called a saliency map. The goal of saliency detection algorithms is to estimate human eye fixations according to this distinctiveness. The estimated saliency map is used in applications that mimic the human visual system, so that modifications of the scene induce as few visible artifacts as possible. It is used in many computer vision applications such as image segmentation [1], object detection [2], image compression [3] and image enhancement [4], to name a few. The effectiveness of these applications depends on the accuracy of the underlying saliency detection algorithm.

Rather than designing yet another saliency detection algorithm, can we improve the saliency maps generated by state-of-the-art algorithms using some iterative process? In this work, the saliency values produced by a given detection algorithm are strengthened in the salient region of an image by modifying the original image iteratively. Most saliency detection algorithms rely on local contrast or edge details as a first stage in estimating the saliency map. Therefore, if the local contrast of an image is modified such that edge details in the non-salient region are suppressed and those in the salient region are enhanced, the saliency map of the image can be modified and improved. This observation motivates the proposed framework for enhancing the saliency maps generated by existing saliency detection algorithms. The number of iterations required to improve a given saliency map depends on the image content, making the algorithm adaptive.

The paper is organized as follows. Section 2 surveys various saliency detection algorithms and the different criteria considered for defining saliency. Section 3 presents the proposed framework for improving the saliency maps generated by existing saliency detection algorithms. The results and comparisons are discussed in Sect. 4. The paper is concluded with pointers to future work in Sect. 5.

2 Related Work

Saliency detection algorithms use different image features to estimate human eye fixations. In the late 1990s, Itti et al. defined the saliency map using early visual features comprising image intensity contrast, color contrast, and local orientation contrast computed at different Gaussian scales [5]. Harel et al. proposed a graph-based visual saliency model that operates in feature space using a bottom-up approach [6]. These algorithms do not recover the exact boundaries of salient regions. Achanta et al. proposed a frequency-tuned saliency detection algorithm that yields exact boundaries of salient regions by retaining more frequency content across the boundaries, using color and luminance features [7]. In [8], the saliency map of an image is estimated using local color contrast features, sparse sampling, kernel density estimation, and a Bayesian model. Liu et al. proposed a learning-based salient object detection algorithm that trains a Conditional Random Field (CRF) using multi-scale contrast (local), center-surround histogram (regional), and color spatial distribution (global) features [9]. Detecting only eye fixations or salient regions discards the context of the salient region; in [10], local contrast, global features, visual organization rules, and some high-level features are used to preserve the context of the image alongside the salient region. Cheng et al. proposed a saliency detection algorithm combining histogram-based contrast and region-based contrast [11]. Murray et al. proposed a saliency model based on a low-level vision system with multi-scale decomposition of the color and luminance channels [12]. Li et al. proposed a learning-based saliency detection algorithm that combines eye fixation and segmentation models in order to segment the salient objects [13]. In [14], salient objects and distractors are separated by learning the distribution of projected features using principal component analysis. Borji et al. proposed a patch-based saliency detection algorithm that defines the saliency of a patch by how different it is from surrounding patches and how often it occurs in the RGB and Lab color spaces of the image [15]. Another patch-based algorithm defines the saliency of a patch by its distance from the average of all patches in color space and in pattern space along the principal component directions [16]. The weighted dissimilarity between patches, computed using multiple parameters, is used to form the saliency map in [17].

Improving the saliency maps generated by existing saliency detection algorithms is a relatively new research area. Lei et al. proposed a framework based on Bayesian decision theory applied to a rough saliency map obtained from different saliency detection algorithms [18]. They enhanced the saliency map using the conditional probability that pixels share similar color values with the pixels of higher saliency in the rough map. This framework fails if the salient object contains several colors and is not captured in the rough saliency map. Alternatively, we propose to improve the saliency map by iteratively modifying the image so as to enhance the saliency values in the salient region. From the saliency map generated by an existing saliency detection algorithm, foreground and background regions are found using image segmentation. The image is then modified differently in the foreground and background regions at each iteration. This enhances the saliency values in the salient (foreground) region and suppresses them in the non-salient (background) region.

3 Proposed Approach

The purpose of a saliency detection algorithm is to find the distinctive regions of an image on which the human eye fixates. The more distinctive a pixel, the higher the saliency value assigned to it. Most saliency detection algorithms fail to distinguish distinctive regions properly and sometimes assign the same value to an entire region of the image; they may also fail to assign consistent saliency values to the same object. Our goal is to concentrate the saliency values in the distinctive regions by iteratively enhancing the energy present in those regions. This is achieved by smoothing the non-distinctive region (background) and coarsening the distinctive region (foreground) while regenerating the saliency map at each iteration.

3.1 Methodology

The proposed framework for improving a given saliency map \(S_0\) of an image \(I_0\) of size \(M \times N\) is described below.

  1.

    The saliency map of the image \(I_i\), generated using an existing saliency detection algorithm, is \(S_i\). The energy \(E_i\) of the saliency map \(S_i\) is defined as the squared sum of the entries of its gray-level co-occurrence matrices (GLCM [19]) \(C_{\theta _j}\) computed in the 4 directions 0\(^\circ \), 45\(^\circ \), 90\(^\circ \), and 135\(^\circ \) with a distance of 1 between pixel pairs, as shown in Eq. (1).

    $$\begin{aligned} \begin{aligned} E_i = \sum \limits _{j=1}^{4} \sum \limits _{p=1}^{P} \sum \limits _{q=1}^{P} \big (C_{\theta _j}(p,q)\big )^2, \quad C_{\theta _j}(p,q) \propto \sum \limits _{x=1}^{M} \sum \limits _{y=1}^{N} {\left\{ \begin{array}{ll} 1, &{} \text {if } S_i(x,y) = p \text { and } S_i(x+x_{\theta _j},\, y+y_{\theta _j}) = q \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned} \end{aligned}$$
    (1)

    here, \(i = 0, 1, \ldots , K\), \(\theta _j \in \{0^\circ , 45^\circ , 90^\circ , 135^\circ \}\), and \((x_{\theta _j}, y_{\theta _j})\) with \(x_{\theta _j}, y_{\theta _j} \in \{-1,0,1\}\) is the unit pixel offset in direction \(\theta _j\); each \(C_{\theta _j}\) is normalized so that its entries sum to 1; K is the number of iterations, P is the number of intensity levels in the image, and \(E_i\) is the energy of the saliency map after the \(i\)th iteration.

  2.

    The image \(I_i\) is segmented into foreground (\(FG_i\)) and background (\(BG_i\)) using the kernel k-means algorithm described in [20]. This segmentation method requires a rectangular box \(R_i\) as a seed that includes the foreground region (\(FG_i\)), which is the salient region in our case. We find \(R_i\) from a binary map \(P_i\) derived from the saliency map \(S_i\) as shown in Eq. (2).

    $$\begin{aligned} \begin{aligned} P_i(x,y) = {\left\{ \begin{array}{ll} 1, &{} \text {if } S_i(x,y) \ge average(S_i) \\ 0, &{} \text {Otherwise.} \end{array}\right. } \end{aligned} \end{aligned}$$
    (2)

    \(R_i\) is the minimum-area rectangle within the \(M \times N\) image that contains all the 1's in \(P_i\). The segmentation method outputs an image \(G_i\) with the foreground region intact and the background region set to 1. We generate the binary mask \(B_i\) (indicating the foreground region) from \(G_i\) as shown in Eq. (3).

    $$\begin{aligned} \begin{aligned} B_i(x,y) = {\left\{ \begin{array}{ll} 1, &{} \text {if } G_i(x,y) \ne 1 \\ 0, &{} \text {Otherwise.} \end{array}\right. } \end{aligned} \end{aligned}$$
    (3)
  3.

    As our goal is to propagate energy towards the salient region (foreground), we exaggerate the details in the foreground region and smooth the background region of the image. We use the local Laplacian filter \(T_l\) described in [21] with \(\sigma _r = 0.1\) to exaggerate the details in the foreground. Low-pass filtering of the background region is performed using the guided filter \(T_g\) with filter size \(w = 5\), with the guide image being the same as the input image to be filtered. The modified image \(I_{(i+1)}\) is generated by combining the differently filtered foreground and background regions as described in Eq. (4).

    $$\begin{aligned} \begin{aligned} I_{(i+1)}&= FG_{(i+1)} + BG_{(i+1)} \\&= ( B_i \times T_l(I_i,\sigma _r)) + ((1-B_i) \times T_g(I_i, I_i, w)) \end{aligned} \end{aligned}$$
    (4)
  4.

    We repeat steps 1–3 with the modified input image \(I_{(i+1)}\). A block diagram summarizing the iterative process is shown in Fig. 1. The number of iterations to perform depends on the image content and is estimated as described in Sect. 3.2; a minimal code sketch of steps 1–3 follows this list.
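The sketch below illustrates one iteration of steps 1–3 in Python. It is a minimal illustration under stated assumptions rather than the authors' implementation: `cv2.grabCut` is a hypothetical stand-in for the rectangle-seeded kernel k-means segmentation of [20], unsharp masking stands in for the local Laplacian filter of [21], the guided filter comes from opencv-contrib (`cv2.ximgproc`), and the input image is assumed to be 8-bit BGR.

```python
# One iteration of the proposed framework (steps 1-3). Stand-ins and
# assumptions are flagged in the comments below.
import cv2                                 # opencv-contrib-python
import numpy as np
from skimage.feature import graycomatrix   # scikit-image >= 0.19


def glcm_energy(smap):
    """Step 1 (Eq. 1): energy of the saliency map as the squared sum of
    its normalized GLCMs in the 4 directions at distance 1, averaged
    over directions so that a constant map has energy 1."""
    glcm = graycomatrix(smap.astype(np.uint8), distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, normed=True)
    return float((glcm ** 2).sum(axis=(0, 1)).mean())


def seed_rectangle(smap):
    """Step 2 (Eq. 2): minimum-area rectangle R_i covering every pixel
    whose saliency is at least the mean saliency."""
    ys, xs = np.nonzero(smap >= smap.mean())
    return (int(xs.min()), int(ys.min()),
            int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1))


def foreground_mask(img, rect):
    """Step 2 (Eq. 3): binary mask B_i. GrabCut is a hypothetical
    stand-in for the rectangle-seeded kernel k-means method of [20]."""
    gc = np.zeros(img.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(img, gc, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    return ((gc == cv2.GC_FGD) | (gc == cv2.GC_PR_FGD)).astype(np.float32)


def iterate_once(img, smap):
    """Step 3 (Eq. 4): exaggerate foreground details (unsharp masking
    as a stand-in for the local Laplacian filter T_l of [21]) and
    smooth the background with the guided filter T_g (w = 5)."""
    b = foreground_mask(img, seed_rectangle(smap))[..., None]
    blur = cv2.GaussianBlur(img, (0, 0), 2.0).astype(np.float32)
    detail = 1.5 * img.astype(np.float32) - 0.5 * blur
    smooth = cv2.ximgproc.guidedFilter(img, img, radius=5, eps=100.0)
    out = b * detail + (1.0 - b) * smooth.astype(np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Any of the detectors listed in Sect. 4 can supply the saliency map consumed by `iterate_once`; the stopping rule that ties the steps together is sketched in Sect. 3.2.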

Fig. 1. The proposed saliency improvement framework.

Fig. 2. Energy of the improved saliency map as a function of the number of iterations. Each curve shows the energy variation of the saliency maps of one image as it is modified over the iterations.

Fig. 3. Effect of the iterative process on the saliency map. Top row: (a) original image, (b)–(e) modified images after \(i = 1, 2, 3 \text { and } 4\) iterations. Bottom row: saliency maps of the corresponding images generated using the existing saliency detection algorithm of [10].

3.2 Optimal Number of Iterations

The energy variation of the saliency maps of the modified images over the iterations is shown in Fig. 2. The energy \(E_i\) starts decreasing after a certain iteration. During the initial iterations, the smoothing drives the background region towards a constant intensity value, so the energy \(E_i\) increases towards 1 (the energy of a gray-level co-occurrence matrix of a constant-intensity image is 1). After some iterations, the exaggeration of the foreground region starts to dominate the background's effect on the energy, and the detail enhancement in the foreground reduces the energy value. As further enhancement of the foreground details leads to saturation of intensity values, we stop iterating when the energy starts decreasing. The same behavior can be observed in Fig. 3, which shows the modified images and their saliency maps after each iteration: after a certain number of iterations, the intrusion of constant values in the background region and the saturation of the exaggerated edges in the foreground region decrease the energy of the saliency map. Hence, the iterative process is stopped as soon as the energy value starts decreasing. The saliency map \(S_f\) of the modified image after the last iteration is the improved version of the saliency map \(S_0\) of the original image. In this way, the number of iterations required is adaptive to the given image; a sketch of the stopping rule is given below.
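Under the same assumptions as the sketch in Sect. 3.1 (`iterate_once` and `glcm_energy` from that sketch, plus a placeholder `saliency` function wrapping the chosen detector), the adaptive stopping rule can be written as follows.

```python
def improve_saliency(img, saliency, max_iters=10):
    """Iterate while the GLCM energy of the saliency map rises and
    return the map S_f from the last iteration before it drops."""
    smap = saliency(img)
    best, prev_e = smap, glcm_energy(smap)
    for _ in range(max_iters):        # cap of 10 iterations, as in Sect. 4
        img = iterate_once(img, smap)
        smap = saliency(img)
        e = glcm_energy(smap)
        if e < prev_e:                # energy starts decreasing: stop
            break
        best, prev_e = smap, e
    return best
```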

Fig. 4. Visual comparison of the improvement in saliency maps: (a) input original images and (b) their ground truths; (c, e, g, i, k, m, o) saliency maps generated using existing saliency detection algorithms and (d, f, h, j, l, n, p) the corresponding saliency maps modified using the proposed approach.

4 Results and Discussions

We have tested the proposed framework on the MSRA salient object dataset [9] and compared the improved saliency maps with their corresponding original saliency maps. The comparison covers several existing state-of-the-art saliency detection algorithms: graph-based visual saliency (GBVS) [6], spatially weighted dissimilarity (SWD) [17], non-parametric low-level vision (NPL) [12], context-aware saliency (CS) [10], distinct patch-based saliency (Patch) [16], discriminative subspaces (DSRC) [14], and kernel density estimation (KDE) [8]. The maximum number of iterations is capped at 10, well above the average of 4 iterations required to enhance a saliency map.

A visual comparison of the proposed approach on a number of images is shown in Fig. 4, which presents the original images (Image), their ground truth binary saliency maps (GT), the saliency maps produced by the existing saliency detection algorithms (X), and their corresponding improved versions (\(X_{SI}\)) obtained using the proposed framework. For every state-of-the-art algorithm considered, the saliency map is visibly improved after applying the proposed approach.

Fig. 5. Average precision-recall curves for saliency maps generated using existing saliency detection algorithms and for the modified saliency maps obtained using the proposed approach.

The objective evaluation of the results obtained using the proposed framework is carried out using two measures: the precision-recall measure and the recently proposed structure measure [22]. Precision (Pr) and recall (Re) with respect to a fixed threshold and the ground truth binary saliency map are calculated as shown in Eq. (5), where \(S_{T}\) is the thresholded binary saliency map and GT is the ground truth map provided with the dataset. For a fixed threshold, better performance is indicated by both higher precision and higher recall. For each threshold, precision and recall are averaged over the number of images. Figure 5 shows the average precision-recall curves for the saliency detection algorithms considered in this study; it can be observed that the proposed framework improves the saliency maps generated by all of them. We also evaluate the quality of the proposed framework using the F-measure F defined in Eq. (6).

$$\begin{aligned} \begin{aligned} Pr = \frac{\sum \limits _{x=1}^{M} \sum \limits _{y=1}^{N} \big (S_{T}(x,y) \cap GT(x,y)\big )}{\sum \limits _{x=1}^{M} \sum \limits _{y=1}^{N} S_{T}(x,y)}, \quad Re = \frac{\sum \limits _{x=1}^{M} \sum \limits _{y=1}^{N} \big (S_{T}(x,y) \cap GT(x,y)\big )}{\sum \limits _{x=1}^{M} \sum \limits _{y=1}^{N} GT(x,y)} \end{aligned} \end{aligned}$$
(5)
$$\begin{aligned} \begin{aligned} F = 2 \bigg (\frac{Pr \times Re}{Pr + Re}\bigg ) \end{aligned} \end{aligned}$$
(6)
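As a minimal sketch of the evaluation in Eqs. (5) and (6), assuming `s_t` and `gt` are boolean \(M \times N\) arrays holding the thresholded saliency map and the ground truth:

```python
import numpy as np

def pr_re_f(s_t, gt):
    """Precision and recall (Eq. 5) and F-measure (Eq. 6) between a
    thresholded saliency map s_t and a binary ground truth gt."""
    tp = np.logical_and(s_t, gt).sum()   # overlap between S_T and GT
    pr = tp / max(s_t.sum(), 1)          # precision
    re = tp / max(gt.sum(), 1)           # recall
    f = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f
```

Sweeping the threshold over all gray levels and averaging these values across images yields the average precision-recall curves in Fig. 5.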
Fig. 6. Precision, recall and F-measure for saliency maps generated using existing saliency detection algorithms and their corresponding improvements using the proposed approach.

Table 1. Structure measure for saliency maps generated using existing saliency detection algorithms and their corresponding improvements using the proposed approach.

Using Eqs. (5) and (6), the average Pr, Re and F values for the saliency maps obtained using the different algorithms, and for their improved counterparts obtained using the proposed framework, are shown in Fig. 6. The higher precision and recall values indicate that the proposed framework improves the saliency maps of all the methods considered; with better suppression of non-salient regions, the results tend towards salient object segmentation. Recently, Fan et al. proposed a structure measure that evaluates region-aware and object-aware similarities between a non-binary saliency map and the ground truth (GT) [22]. The region-aware and object-aware similarities capture the global structure and the global distribution of foreground objects, overcoming the limitations of pixel-wise comparison for assessing overall global structure. Table 1 reports the structure measure values: the first row shows the average values for the saliency maps generated by the existing saliency detection algorithms, and the second row shows those obtained with the proposed improvement framework. The S-values increase when the proposed framework is applied.

5 Conclusion

The proposed method introduces an iterative process that improves the saliency map obtained by an existing saliency detection algorithm. The saliency values are forced to concentrate in distinctive regions and are suppressed in non-distinctive regions. This is achieved by smoothing non-distinctive (background) regions and enhancing the details of distinctive (foreground) regions using edge-preserving filters. The effectiveness of the saliency improvement framework is demonstrated using precision, recall, F-measure and the recently proposed structure measure. The proposed technique can be used in computer vision applications that require salient object detection and segmentation. In the future, we would like to extend this work to improving salient object detection in videos.