1 Introduction

The human visual system (HVS) has the ability to orient rapidly toward the most salient objects through selective visual attention, which allows humans to semantically understand a visual input quickly and efficiently [18]. This selective visual attention has been intensively analyzed in psychology, neuroscience, and computer vision. In computer vision, several methods simulate selective visual attention by highlighting the salient object(s) in an image. Nowadays, saliency detection is widely applied in a variety of computer vision applications, including image segmentation [10], object recognition [45], image quality assessment [41], image editing and manipulation [9], information discovery [34], visual tracking [48], content-based image retrieval [6], etc.

Fig. 1

Examples of saliency aggregation where saliency maps from the state-of-the-art methods do not show similar detection characteristics on a variety of input images. From left to right: input images, the state-of-the-art methods, e.g., GMR [55], MC [20], DSR [30], DRFI [21], and RBD [62], our proposed saliency bagging framework, and ground truth (GT). Examples are selected from different datasets, e.g., the THUR15K [7], MSRA10K [8], ECSSD [54], and SED2 [2] datasets

Saliency detection methods can be roughly categorized as either bottom-up or top-down approaches. The bottom-up methods [1, 8, 14, 19, 20, 22, 24, 29, 33, 37, 42, 46, 52, 55, 57, 60] are data-driven and usually use low-level visual features such as intensity, orientation, or pattern from pixels or regions. These methods do not require a training stage and are therefore efficient, but their data-driven nature makes them sensitive to noisy regions. The contrast prior is one of the most widely applied principles in bottom-up methods. Based on the scope over which contrast is computed, these methods are often categorized into local [19, 24] and global [8] contrast prior methods. Local contrast effectively highlights the boundaries (e.g., edges and texture) of the object but often fails to highlight the entire object, whereas global contrast identifies the whole object but fails to preserve its inner details. To improve saliency detection performance, some methods jointly consider local and global contrast priors [36]. Furthermore, some bottom-up methods explore additional visual cues, such as the boundary prior [46] and central bias [60]. Recently, label propagation-based methods [20, 55] have drawn more attention. In these methods, seeds are first picked based on some prior information, and labels are then propagated from the seeds to the other nodes in a graph. However, if the object appears on the image boundary, these methods identify it inaccurately. In contrast, top-down methods [13, 16, 17, 21, 23, 39, 47, 54, 56, 59] are task-driven and require high-level information and learning mechanisms. These methods require a training stage, which involves data collection, and are hence more complex and less efficient to implement; they are also difficult to generalize due to their task-driven nature. Lately, deep learning methods [15, 26, 27, 31, 51, 61] have been widely and effectively applied to salient object detection.

Despite the development of several effective methods, none shows consistent superiority over the others for a variety of input images. Therefore, the saliency map computed using any one of these state-of-the-art saliency detection methods is not reliable, since its quality depends on the individual image. As depicted in Fig. 1, each state-of-the-art method performs well only on some images, or on some parts of an image, rather than on all images. To address this problem, saliency aggregation has drawn considerable attention recently. Saliency aggregation methods [4, 25, 38, 43, 53] combine several existing saliency maps according to diverse mechanisms. Accordingly, the aggregated result can be superior to each of the selected existing methods, as investigated in [4].

Consequently, saliency aggregation is proposed to combine the strengths of several existing saliency methods and alleviate the weaknesses of the individual methods, thereby improving the quality and robustness of the saliency map across a variety of images. Here, we propose a novel framework called saliency bagging in which multiple existing saliency detection methods are exploited to enhance the quality of the saliency map. The main contributions of the proposed work are threefold:

1. A novel, simple, and efficient framework called saliency bagging is proposed, which effectively incorporates the advantages of the selected saliency methods to improve the quality of saliency maps.

2. A novel integration logic is proposed to combine individual saliency maps into a final saliency map; it chooses the saliency value of an image pixel from among the individual saliency values based on the category (foreground or background) of that pixel.

3. An extensive experimental evaluation is conducted with 25 different methods, including 17 classic best-performing methods of the last decade, the five existing selected methods, and three aggregation methods, to verify the superiority of the proposed framework. The experimental results demonstrate that the proposed framework performs better and is robust across a variety of images. In addition, it is computationally more efficient than the compared aggregation methods.

The rest of the paper is organized as follows. Section 2 presents related work and state-of-the-art methods in saliency detection. In Sect. 3, the proposed model is discussed in detail along with the step-wise procedure. Experimental results are demonstrated in Sect. 4, while Conclusion and Future Work are given in Sect. 5.

2 Related work

A remarkably large set of saliency detection methods has been investigated over the last few years, obtaining admirable performance for detecting salient objects in images. Here, we broadly divide these methods into four classes according to the approaches used:

2.1 Bottom-up methods

Bottom-up methods are usually based on low-level visual attributes (e.g., color, orientation, and intensity) and prior knowledge (e.g., contrast, uniqueness, compactness, and boundary). The first biologically motivated computational saliency method was introduced by Itti et al. [19]. The method extracts multiscale center-surround contrast over different features to measure saliency. In [14], a graph hypothesis is used to compute activation maps from the raw features. Later, Achanta et al. [1] proposed a simple model based on center-surround contrast that computes the saliency of each pixel as the absolute difference between the mean pixel value of the image and the corresponding pixel value of a blurred image. Perazzi et al. [42] decompose an image into suitable elements that reduce unwanted details and measure the rarity and distribution of each element in the form of contrast; saliency values are then assigned to pixels from the element contrast. Klein and Frintrop [24] formulated saliency computation in an information-theoretic way using the Kullback–Leibler divergence (KLD). Cheng et al. [8] measured saliency via histogram-based contrast and region-based contrast using global knowledge. Uniqueness-, focusness-, and objectness (UFO)-based saliency computation is presented in [22]. Seo and Milanfar [46] compute local regression kernels and then determine saliency using a self-resemblance measure (SeR). Zhang et al. [60] proposed a method in which rarity is measured using the self-information of local visual cues derived from a Bayesian framework (SUN). Background priors are used with geodesic distance to compute saliency in [52]. Yang et al. [55] suggested graph-based manifold ranking (GMR) with background and foreground seeds to compute saliency values. Alternatively, saliency can be measured in terms of absorbed time, which is estimated by an absorbing Markov chain (MC) [20]. Lu et al. [37] proposed multi-graph-based rankings to compute a coarse saliency map and further refine it. Li et al. [29] determined saliency by aggregating various low-level visual features. Yuan et al. [57] proposed a saliency detection model that exploits a robust boundary prior to estimate saliency; regularized random walk ranking (RRWR) is then applied to the prior saliency estimation to generate the saliency maps. A foreground-center-background (FCB)-based saliency model is proposed by Guang-Hai et al. [33], in which three saliency measures, viz. foreground, center, and background, are computed and then combined to generate the final saliency map.

2.2 Top-down methods

Top-down methods use high-level knowledge and a supervised learning approach for certain objects. Kanan et al. [23] proposed an appearance-based saliency detection method that computes saliency in a Bayesian framework. Goferman et al. [13] proposed the context-aware (CA) saliency method, which identifies the salient object along with its surrounding context via four principles, namely local low-level features, global considerations, visual organization rules, and high-level information. Margolin et al. [39] proposed a method that determines distinctive patterns for saliency computation (PCA). Hou and Zhang [16] convert an image into the frequency domain and extract the spectral residual (SR) to compute saliency. Jiang et al. [21] suggested a discriminative regional feature integration method (DRFI) to measure saliency values. Yan et al. [54] suggested a saliency model that addresses the scaling problem with a hierarchical framework (HSD) combining significant information from three image layers at different scales; this work extends CSSD [47]. Zhang et al. [59] introduced a saliency computation method based on the minimum barrier distance (MBD) transform. Yang et al. [56] measure saliency by jointly learning a conditional random field (CRF) and a visual dictionary. Yuang et al. [17] proposed a saliency method that jointly considers object proposals and multiple instance learning (MIL) for the saliency measure (MILPS).

2.3 Deep learning methods

Recently, many deep learning methods have been developed for saliency analysis and have obtained good state-of-the-art performance. However, these methods require an extremely rich set of training data and a huge amount of computation time for deriving suitable network parameters (i.e., a set of learnable parameters). Hence, such methods may not be applicable to real-time applications and may degrade on unseen visual data. Li and Yu [27] proposed a method that measures saliency using multiscale features computed by convolutional neural networks (CNNs). He et al. [15] suggested a superpixel-based CNN that learns hierarchical contrast features and keeps contextual information to generate a saliency map. A multi-context deep CNN architecture jointly considers the global and local context to compute saliency values [61]. Li et al. [31] introduced a fully convolutional neural network that encodes prior knowledge and constitutes multi-task learning to explore the intrinsic correlation between saliency detection and semantic image segmentation. A salient instance segmentation approach generates a saliency map according to distinct object instances in an input image [26]. Wang et al. [51] proposed saliency computation based on recurrent fully convolutional networks (RFCNs). The network incorporates visual priors in a more appropriate manner, and the recurrent framework is able to automatically refine the saliency values by iteratively correcting its previous errors.

2.4 Saliency aggregation methods

Existing saliency detection methods vary in performance across different kinds of images (e.g., complex structure, boundary-touching objects, small objects, large objects, multiple objects, etc.). In order to take advantage of the strengths of each existing saliency method, many researchers have developed saliency aggregation mechanisms that suitably combine these methods so as to enhance performance universally on any kind of image. Borji et al. [4] suggested aggregation approaches based on statistical information and linear summation with three combination functions. Mai et al. [38] introduced three data-driven saliency aggregation methods to combine the outcomes of individual existing saliency methods, which enhances performance compared to the individual methods. Le et al. [25] suggested various unsupervised and supervised saliency aggregation methods, among them the simple average of the saliency maps generated by the two best methods. Qin et al. [43] suggested an aggregation method named cuboid cellular automata (CCA) to incorporate various saliency maps generated by existing saliency methods. Recently, Xu et al. [53] proposed an arbitrator model (AM) to integrate various existing saliency methods. The method derives a reference map by jointly considering majority voting over multiple existing saliency methods and external knowledge. Afterward, the method learns the expertise of the existing saliency methods. Lastly, a novel integration framework based on Bayesian inference is employed to integrate the existing saliency methods of varying expertise and the reference map.

3 Saliency bagging: proposed framework

Bagging [5] is an ensemble learning approach that improves machine learning outcomes by aggregating several models. In this approach, several learner models are applied to a dataset to generate multiple outcomes, which are then integrated to produce a better outcome than that of any single learner. The integration task is defined in two different ways: the first is to average the outcomes generated by the various learners when the outcomes are numerical values; the second is majority voting over the outcomes when they are class labels. The learner models produce errors independently, so the integration task reduces the individual error of each learner model and yields a better learner (a minimal sketch of both integration strategies is given after the step list below). Here, we propose a novel framework for salient object detection called saliency bagging which exploits the concept of bagging. In this framework, various existing saliency detection models are exploited to generate a saliency map that is more robust than those of the individual existing models. The proposed framework entails three main steps:

Step 1: Selection of existing salient object detection models and generation of initial saliency maps

Step 2: Generation of an integrated binary map using the initial saliency maps obtained in Step 1

Step 3: Generation of the final saliency map using the integration logic
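To make the two bagging integration strategies referred to above concrete, the following minimal sketch (Python/NumPy; the function names are ours and purely illustrative) averages numerical outcomes and majority-votes label outcomes:

```python
import numpy as np

def aggregate_numeric(outcomes):
    """Bagging for numerical outputs: average the predictions of all learners."""
    return np.mean(np.stack(outcomes, axis=0), axis=0)

def aggregate_labels(outcomes):
    """Bagging for binary labels: majority voting over the learners' predictions."""
    votes = np.stack(outcomes, axis=0)                   # shape (n_learners, ...)
    return (votes.sum(axis=0) > votes.shape[0] / 2).astype(np.uint8)
```

The second function is exactly the voting rule that the proposed framework applies to binary maps in Sect. 3.2.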

A schematic representation of the proposed framework is shown in Fig. 2. In the proposed framework, the initial saliency maps are first obtained using the existing saliency detection models. Next, binary maps corresponding to these initial saliency maps are derived by applying adaptive thresholding, as discussed later. Afterward, these binary maps are combined using majority voting to get the integrated binary map. Owing to bagging, the integrated binary map labels the image pixels better than each individual binary map. Based on the pixel labels of the integrated binary map, the final saliency map is generated by applying the integration logic, in which the foreground pixels and background pixels are assigned the highest and the lowest saliency values, respectively, available from the set of initial saliency maps.

Fig. 2

Schematic representation of proposed framework

3.1 Selection of existing saliency detection models and generation of initial saliency maps

The performance of the proposed saliency bagging framework depends on the selection of existing saliency detection models. Hence, the mechanism used to select these models plays an important role here: it must choose models that perform effectively on various categories of images. Therefore, we require a common environment in which to determine the performance of the existing saliency detection models on various datasets. One such study has been carried out by Borji et al. [3], in which several salient object detection models have been compared under the same environment. The performance of these methods has been measured in terms of Precision, Recall, F-measure, area under the curve (AUC), receiver operating characteristic (ROC), etc. Based on the benchmark given by Borji et al. [3], we have selected the existing saliency detection models whose performance is individually the best in each of the above-mentioned five performance measures. The objective of the proposed framework is to retain the advantages and suppress the disadvantages of the existing saliency detection models. Hence, the proposed model incorporates several effective existing saliency detection models with an ensemble learning mechanism, i.e., bagging, to improve the quality of the saliency map. We now briefly introduce the selected existing saliency detection models, which are used to generate the initial saliency maps, as follows.

3.1.1 Graph-based manifold ranking (GMR) model

Yang et al. [55] proposed a saliency model that estimates the saliency of each image element (superpixel) as its ranking score. The model represents the input image as a graph \(G= (V, E)\), where V is the set of nodes representing the elements of the image and E is the set of edges between pairs of nodes i and j. The weight \(w_{ij}\) between nodes i and j is computed as follows:

$$\begin{aligned} w_{ij}=e^{-\frac{\left\| c_{i}-c_{j} \right\| }{\sigma ^{2}}} \quad i,j \in V \end{aligned}$$
(1)

where \(c_{i}\) and \(c_{j}\) are the mean colors of the corresponding nodes and \(\sigma \) is a constant that regulates the strength of the weight. The objective of manifold ranking is to estimate the rank \(f_{i}\) of each node in the graph with respect to the query indicator \(y_{i}\), and the optimal ranking is computed as follows [55]:

$$\begin{aligned} \mathbf{f}^{*}=\arg \min _{f} \frac{1}{2}\left( \sum _{i,j=1}^{n} w_{ij}\left\| \frac{f_{i}}{\sqrt{d_{ii}}}-\frac{f_{j}}{\sqrt{d_{jj}}} \right\| ^{2} +\mu \sum _{i=1}^{n}\left\| f_{i}-y_{i} \right\| ^{2}\right) \end{aligned}$$
(2)

where \(d_{ii}=\sum \nolimits _{j} w_{ij}\), \(\mu \) controls the smoothness, and \({\mathbf{f}^{*}}\) is the optimized ranking vector. Equation (2) is simplified to obtain the following:

$$\begin{aligned} {\mathbf{f}}^{*}=(\mathbf {D}-\alpha \mathbf {W})^{-1}\mathbf {y} \end{aligned}$$
(3)

where \(\mathbf {D}\) is the degree matrix and \(\mathbf {W}\) is the affinity matrix. Using Eq. (3), the saliency map is obtained by ranking the nodes with respect to different sets of queries.
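As an illustration of Eqs. (1) and (3), the following minimal sketch (Python/NumPy; the superpixel mean colors, adjacency matrix, and query indicator vector are assumed to be supplied by a prior segmentation step, e.g., SLIC) builds the affinity matrix and solves for the ranking vector:

```python
import numpy as np

def manifold_ranking(mean_colors, adjacency, y, sigma=0.1, alpha=0.99):
    """mean_colors: (n, 3) mean color per superpixel; adjacency: (n, n) boolean
    matrix of connected superpixels; y: (n,) query indicator (1 for seed nodes)."""
    # Eq. (1): edge weights from color similarity, kept only for adjacent nodes
    dist = np.linalg.norm(mean_colors[:, None, :] - mean_colors[None, :, :], axis=-1)
    W = np.exp(-dist / sigma ** 2) * adjacency
    D = np.diag(W.sum(axis=1))
    # Eq. (3): closed-form optimal ranking f* = (D - alpha * W)^{-1} y
    return np.linalg.solve(D - alpha * W, y)
```

In GMR, this ranking is run with different query sets, first using boundary (background) superpixels as seeds and then the resulting foreground estimate.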

3.1.2 Absorbing Markov chain (MC) model

Similar to the model given in [55], Jiang et al. [20] also represent the input image as a graph but with different constraints on the edges; the edge weights are computed in a manner similar to [55]. Here, saliency detection is formulated as an absorbing Markov chain over the image graph. The superpixels at the boundary are duplicated as virtual boundary nodes, called absorbing nodes in the given model, and the rest of the nodes are termed transient nodes. Afterward, the transition matrix P is obtained as follows [20]:

$$\begin{aligned} \mathbf{P}= \mathbf{D}^{-1} \times \mathbf{A} \end{aligned}$$
(4)

where A is the affinity matrix and \(\mathbf{D}\) is the degree matrix defined as \(\mathbf{D}=\mathrm{diag}(\sum \nolimits _{j} a_{ij})\). The fundamental matrix N is derived from Eq. (4), and the absorbed time vector y is computed as follows:

$$\begin{aligned} \mathbf{y}=\mathbf{N} \times \mathbf{c} \end{aligned}$$
(5)

Next, the saliency map is generated with the help of the normalized absorbing time vector \( {\bar{\mathbf{y}}}\) as follows:

$$\begin{aligned} \mathbf{S}(i)= {\bar{\mathbf{y}}}(i) \qquad i=1,2,...,t \end{aligned}$$
(6)

Lastly, the saliency map obtained using Eq. (6) is further refined to get the final saliency maps.
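A minimal sketch of this computation is given below (Python/NumPy). We assume the affinity matrix is ordered so that the transient nodes come first, take the fundamental matrix as \(N=(I-Q)^{-1}\) with Q the transient-to-transient block of P, and take c as an all-ones vector so that y is the expected number of steps before absorption; these are illustrative assumptions rather than the exact implementation of [20]:

```python
import numpy as np

def absorbed_time_saliency(A, n_transient):
    """A: (n, n) affinity matrix with the first n_transient rows/columns
    indexing transient (inner) superpixels and the rest absorbing boundary nodes."""
    D = np.diag(A.sum(axis=1))
    P = np.linalg.solve(D, A)                        # Eq. (4): P = D^{-1} A
    Q = P[:n_transient, :n_transient]                # transitions among transient nodes
    N = np.linalg.inv(np.eye(n_transient) - Q)       # fundamental matrix
    y = N @ np.ones(n_transient)                     # Eq. (5) with c taken as all ones
    return (y - y.min()) / (y.max() - y.min() + 1e-12)   # normalized, Eq. (6)
```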

3.1.3 Dense and sparse reconstruction (DSR) model

Li et al. [30] designed a dense and sparse appearance model for finding salient regions in an image. The input image is segmented into N segments and represented as a matrix \( \mathbf{X}=[\mathbf{x}_{1}, \mathbf{x}_{2},..., \mathbf{x}_{N}] \in \mathbb {R}^{D \times N} \), where \( \mathbf{x}_{i}\) is the feature vector of the ith segment and D is the feature dimension. A background template set \(\mathbf{B}=[\mathbf{b}_{1}, \mathbf{b}_{2},...,\mathbf{b}_{M}]\) is constructed from the feature vectors of the background segments, where M is the number of background segments. Next, the dense and sparse reconstruction errors are calculated with the help of the background templates. The dense reconstruction error of segment i is calculated as:

$$\begin{aligned} \varepsilon ^{d}_{i}=\left\| \mathbf{x}_{i}-(\mathbf{U}_\mathbf{B} \beta _{i}+ {\bar{\mathbf{x}}} ) \right\| ^{2}_{2} \end{aligned}$$
(7)

where \( \mathbf{U}_\mathbf{B}\) contains the eigenvectors of the normalized covariance matrix of B, \(\beta _{i}\) is the reconstruction coefficient of segment i, and \( {\bar{\mathbf{x}}}\) is the mean feature of \(\mathbf{X}\). The sparse reconstruction error of segment i is calculated as:

$$\begin{aligned} \varepsilon ^{s}_{i}=\left\| \mathbf{x}_{i}-\mathbf{B} \alpha _{i} \right\| ^{2}_{2} \end{aligned}$$
(8)

where \( \alpha _{i}\) is the sparse code of segment i over the background templates. The reconstruction errors are smoothed based on the context obtained by K-means clustering. After that, a saliency value is assigned to each pixel by combining multiscale reconstruction errors, which are further refined with an object-biased Gaussian model. Finally, Bayes' formula is applied to integrate both reconstruction error measures for generating the final saliency maps.
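A minimal sketch of the two reconstruction errors in Eqs. (7) and (8) is shown below (Python with NumPy and scikit-learn). The segment features and background templates are assumed to be given, the number of principal components and the sparsity weight are illustrative choices, and scikit-learn's SparseCoder stands in for the sparse coding solver used in [30]:

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def reconstruction_errors(X, B, n_components=8, sparsity=0.01):
    """X: (N, D) feature vectors of all segments; B: (M, D) background templates."""
    # Dense reconstruction error, Eq. (7): project onto principal directions of B
    x_mean = X.mean(axis=0)
    Bc = B - B.mean(axis=0)
    _, _, Vt = np.linalg.svd(Bc, full_matrices=False)
    U = Vt[:n_components].T                                # (D, n_components)
    beta = (X - x_mean) @ U                                # reconstruction coefficients
    dense_err = np.sum((X - (beta @ U.T + x_mean)) ** 2, axis=1)

    # Sparse reconstruction error, Eq. (8): code each segment over B
    coder = SparseCoder(dictionary=B, transform_algorithm='lasso_lars',
                        transform_alpha=sparsity)
    alpha = coder.transform(X)                             # (N, M) sparse codes
    sparse_err = np.sum((X - alpha @ B) ** 2, axis=1)
    return dense_err, sparse_err
```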

Fig. 3

Initial saliency maps. a Images b GMR [55] saliency maps c MC [20] saliency maps d DSR [30] saliency maps e DRFI [21] saliency maps f RBD [62] saliency maps

3.1.4 Discriminative regional feature integration approach (DRFI) model

A supervised learning-based saliency model is proposed by Jiang et al. [21]; it extracts several features, such as regional contrast, regional property, and regional backgroundness, and constructs a feature vector \(\mathbf{x} \) for each region of the input image. In the learning phase, the training set contains a set of confident regions \( {\varvec{\mathcal {R}}}=\{R_{1}, R_{2},...,R_{Q} \}\) and corresponding saliency scores \( {\varvec{\mathcal {A}}}=\{a_{1},a_{2},...,a_{Q}\}\) obtained from multi-level segmentations of a set of images. A region is considered confident if at least \(80\%\) of its pixels belong to either the foreground or the background, and the corresponding saliency score is assigned as 1 or 0, respectively. A random forest regressor f is then learned from the training data \({\varvec{\mathcal {X}}} = \{\mathbf{x}_{1}, \mathbf{x}_{2},..., \mathbf{x}_{Q}\}\) and the saliency scores \( {\varvec{\mathcal {A}}} = \{a_{1}, a_{2},..., a_{Q}\}\). Thereafter, a set of multi-level saliency maps \(\{A_{1},A_{2},...,A_{M}\}\) (where M is the number of levels) is generated and integrated to get the final saliency map.
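This learning step can be sketched as follows (Python with scikit-learn). Region feature extraction and multi-level segmentation are assumed to be available; the forest size and the simple averaging used to fuse the level maps are our illustrative stand-ins for the choices made in [21]:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_regional_regressor(region_features, region_scores, n_trees=200):
    """region_features: (Q, d) descriptors of confident regions (contrast,
    property, backgroundness); region_scores: (Q,) saliency labels in {0, 1}."""
    f = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    f.fit(region_features, region_scores)
    return f

def multilevel_saliency(f, level_features, level_masks, image_shape):
    """Predict a score per region at each segmentation level, paint it into a
    map, and fuse the level maps by averaging."""
    maps = []
    for feats, masks in zip(level_features, level_masks):   # one entry per level
        scores = f.predict(feats)                            # (n_regions,)
        A = np.zeros(image_shape, dtype=float)
        for s, m in zip(scores, masks):                      # m: boolean region mask
            A[m] = s
        maps.append(A)
    return np.mean(maps, axis=0)
```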

3.1.5 Robust background detection (RBD) model

A robust background measure-based saliency model is proposed by Zhu et al. [62]; it characterizes the spatial layout of image regions with respect to the image boundaries in a robust manner. The model represents the input image as a graph \(G=(V, E)\), where the nodes V represent the superpixels of the image and the edges E connect pairs of adjacent superpixels p and q. The weight \(d_{\mathrm{app}}(p,q)\) is calculated as the Euclidean distance between the mean colors of nodes p and q in the CIE-Lab color space. After that, the boundary connectivity is calculated with the help of the length along the boundary and the spanning area of each superpixel. Next, the background-weighted contrast is computed as:

$$\begin{aligned} wCtr \left( p\right) =\sum _{i=1}^{N}d_{\mathrm{app}} \left( p,p_{i} \right) w_{\mathrm{spa}} \left( p,p_{i} \right) w_{i}^{\mathrm{bg}} \end{aligned}$$
(9)

where \(w_{\mathrm{spa}} \left( p,p_{i} \right) =\exp \left( - \frac{d_{\mathrm{spa}}^{2} \left( p,p_{i} \right) }{2 \sigma _{\mathrm{spa}}^{2}} \right) \), \(d_{\mathrm{spa}} \left( p,p_{i} \right) \) is the distance between the centers of superpixels p and \(p_{i}\), \(\sigma _{\mathrm{spa}}\) is set to 0.25, and \(w_{i}^{\mathrm{bg}}\), computed from the boundary connectivity value of superpixel \(p_{i}\), represents its background probability. Further, an optimization model is designed for optimizing the saliency values \(\{s_{i}\}^{N}_{1}\) of the superpixels, in which the cost function is defined as [62]:

$$\begin{aligned} \sum _{i=1}^{N}w_{i}^{\mathrm{bg}}s_{i}^{2}+\sum _{i=1}^{N}w_{i}^{\mathrm{fg}} \left( s_{i}-1 \right) ^2+\sum _{i,j}w_{ij} \left( s_{i}-s_{j} \right) ^2 \end{aligned}$$
(10)

where \(w_{i}^{\mathrm{fg}}\) is the foreground probability and \(w_{ij}\) controls the smoothness. Finally, the saliency map is generated by minimizing the cost function.
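Since the cost in Eq. (10) is quadratic in the saliency values, its minimizer can be obtained by solving a linear system. The sketch below (Python/NumPy) assumes the background probabilities, foreground probabilities, and pairwise smoothness weights have already been computed, and that the pairwise term sums over unordered pairs of superpixels:

```python
import numpy as np

def optimize_saliency(w_bg, w_fg, W):
    """w_bg, w_fg: (N,) background/foreground probabilities per superpixel;
    W: (N, N) symmetric smoothness weights w_ij (nonzero for adjacent pairs).
    Sets the gradient of Eq. (10) to zero and solves the resulting system."""
    L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian of the weights
    A = np.diag(w_bg) + np.diag(w_fg) + L
    s = np.linalg.solve(A, w_fg)                     # stationary point of the cost
    return np.clip(s, 0.0, 1.0)
```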

3.1.6 Generation of initial saliency maps

In this work, we exploit the bagging approach to measure the saliency of images. We first compute the initial saliency maps \(\mathbf {S}^j=\lbrace \mathbf {S}_{1}^j,\mathbf {S}_{2}^j,...,\mathbf {S}_{N}^j \rbrace \) of the j-th image, where N represents the number of selected existing saliency detection models. The initial saliency maps are computed with the help of the selected existing saliency detection models. A few sample images along with the initial saliency maps generated by the selected models are shown in Fig. 3.

3.2 Generation of integrated binary map

The initial saliency maps obtained in the previous step are transformed into corresponding binary maps in which each pixel has the value 0 or 1. Suppose \(\mathbf {B}_i^j\) is the binary map corresponding to the initial saliency map \(\mathbf {S}_i^j\) generated by the ith model on the jth image. Thus, the complete set of binary maps can be represented as \(\mathbf {B}^j=\lbrace \mathbf {B}_{1}^j,\mathbf {B}_{2}^j,...,\mathbf {B}_{N}^j \rbrace \). A binary map is obtained from a saliency map using the popular adaptive thresholding of Achanta et al. [1] as follows:

$$\begin{aligned} T_{i}^{j}=\frac{2}{W \times H}\sum _{x=1}^{W}\sum _{y=1}^{H}S_{i}^{j}(x,y) \qquad i=1,2,\ldots ,N \ \mathrm{and} \ j=1,2,\ldots ,n \end{aligned}$$
(11)

where W and H are the width and height of the input image and n is the total number of images in a particular dataset. Hence, the binary map \(\mathbf {B}_i^j\) corresponding to the initial saliency map \(\mathbf {S}_i^j\) is computed as follows:

$$\begin{aligned} \mathbf {B}_{i}^{j}\left( x,y \right) = {\left\{ \begin{array}{ll} 1 &{} \text {if }\mathbf {S}_i^j \left( x,y \right) \geqslant T_i^j \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(12)

where (x, y) represents the location of the pixel under consideration, \( 1 \le x \le W\), and \( 1 \le y \le H\). The binary maps thus contain only two values, 1 and 0, where 1 denotes the foreground region and 0 the background region in the given image. The binary maps are therefore labeled maps that classify the original image pixels into two classes, and we use them in the integration logic. Since the binary maps represent class labels, we apply the majority voting scheme to obtain an integrated binary map which classifies each pixel as background or foreground in a better manner. The integrated binary map \(\mathbf {B}^j\) corresponding to the j-th image is found using the following equation:

$$\begin{aligned} \mathbf {B}^j(x,y) = {\left\{ \begin{array}{ll} 1 &{} \text {if } \sum _{i=1}^{N}\mathbf {B}_i^j(x,y) > N/2 \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(13)

For Eq. (13) to work properly, N must be an odd number. In this research work, we have chosen \(N = 5\).
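A minimal sketch of Eqs. (11)–(13) is given below (Python/NumPy; the function name is ours):

```python
import numpy as np

def integrated_binary_map(saliency_maps):
    """saliency_maps: list of N (H, W) initial saliency maps of one image,
    values in [0, 1]. Returns the majority-voted binary map of Eq. (13)."""
    N = len(saliency_maps)                               # N should be odd, here 5
    binary_maps = []
    for S in saliency_maps:
        T = 2.0 * S.mean()                               # adaptive threshold, Eq. (11)
        binary_maps.append((S >= T).astype(np.uint8))    # Eq. (12)
    votes = np.sum(binary_maps, axis=0)
    return (votes > N / 2).astype(np.uint8)              # Eq. (13)
```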

3.3 Generation of final saliency map

The integrated binary map obtained in the previous step helps in the generation of the final saliency map \(\mathbf {S}^j_f\). For this purpose, we propose a novel integration logic in which, at the foreground locations of the integrated binary map, the saliency value is chosen as the maximum among the initial saliency maps, while at the background locations the minimum among the initial saliency maps is chosen. The mathematical formulation of the proposed integration logic, given the initial saliency maps \(\mathbf {S}^j\) corresponding to the j-th image, is as follows:

$$\begin{aligned} \mathbf {S}^{j}_{f} (x,y)= {\left\{ \begin{array}{ll} \mathrm{max} \{\mathbf {S}_1^j(x,y), \mathbf {S}_2^j(x,y),..., \mathbf {S}_N^j(x,y)\} &{} \text {if }\mathbf {B}^j(x,y) = 1 \\ \mathrm{min} \{\mathbf {S}_1^j(x,y), \mathbf {S}_2^j(x,y),..., \mathbf {S}_N^j(x,y)\} &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(14)

Thus, the final saliency map contains the maximum saliency value among all five initial saliency maps at the pixels classified as foreground, i.e., the salient region, so the salient region in the final saliency map is highlighted better than in any of the five initial saliency maps. Similarly, the final saliency map contains the minimum saliency value among all five initial saliency maps at the pixels classified as background, i.e., the non-salient region, so the non-salient region is suppressed in comparison with all five saliency maps under consideration.
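Equation (14) can be sketched in a couple of lines (Python/NumPy; the function name is ours, and the binary map B comes from the sketch in the previous subsection):

```python
import numpy as np

def final_saliency_map(saliency_maps, B):
    """Integration logic of Eq. (14): pixel-wise maximum of the initial maps
    on foreground pixels (B == 1) and pixel-wise minimum on background pixels."""
    stack = np.stack(saliency_maps, axis=0)   # (N, H, W)
    return np.where(B == 1, stack.max(axis=0), stack.min(axis=0))
```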

4 Experimental results

We have performed extensive experiments to determine the efficacy of the proposed method. The experiments have been carried out on six publicly available benchmark datasets: MSRA10K [8], DUT-OMRON [55], ECSSD [54], PASCAL-S [32], SED2 [2], and THUR15K [7]. The qualitative as well as quantitative performance of the proposed method is evaluated and compared with 25 state-of-the-art saliency detection models, viz. SR [16], SUN [60], SeR [46], CA [13], SEG [44], SWD [11], FES [49], SIM [40], COV [12], PCA [39], GMR [55], MC [20], DSR [30], DRFI [21], RBD [62], LPS [28], MST [50], HSD [47, 54], BMS [58], MILPS [17], RCRR [57], FCB [33], M-EST [25], CCA [43], and AML [53]. Next, we give details of the datasets, the evaluation metrics, the comparison of the proposed model with the above-mentioned models, and failure cases. All the experiments are carried out using a desktop PC with the following configuration: Intel(R) Core(TM) i7-4770 CPU @ 3.40 GHz.

4.1 Datasets

As discussed above, all the experiments are performed on the six publicly available datasets summarized in Table 1. The MSRA10K dataset [8] is constructed from the MSRA dataset [35] and contains 10,000 images with pixel-level saliency labels; it is a large dataset. The DUT-OMRON dataset [55] contains 5168 manually selected high-quality images. The images are more complex and contain one or more salient object(s), which makes the dataset more difficult and challenging for salient object detection models. The ECSSD dataset [54] contains 1000 natural images, which are semantically meaningful but structurally more complex. The PASCAL-S dataset [32] contains 850 natural images with multiple complex objects and cluttered backgrounds. The SED2 dataset [2] contains 100 images, each with two salient objects. The THUR15K dataset [7] is constructed from Flickr with 5 keywords, i.e., Butterfly, Coffee Mug, Dog Jump, Giraffe, and Plane. It contains 15,000 images, of which only 6000+ have pixel-accurate ground truth annotations for the salient object regions.

Table 1 Summary of datasets used in the experiments
Fig. 4

Qualitative comparison of the proposed framework with 17 classic saliency methods on six datasets. The images are arranged sequentially in pairs from the MSRA10K [8], DUT-OMRON [55], ECSSD [54], PASCAL-S [32], SED2 [2], and THUR15K [7] datasets

Fig. 5

Qualitative comparison of the proposed framework with the five existing selected saliency methods. The images are arranged sequentially in pairs from the MSRA10K [8], DUT-OMRON [55], ECSSD [54], PASCAL-S [32], SED2 [2], and THUR15K [7] datasets

4.2 Evaluation metrics

The performance of the proposed saliency bagging model and the 25 state-of-the-art saliency detection models is measured in terms of Precision, Recall, receiver operating characteristic (ROC), F-Measure, and mean absolute error (MAE). Precision, Recall, and ROC are computed based on the overlapping region between the ground truth and the saliency map. Precision indicates the ratio of correctly computed salient pixels to all the pixels in the computed salient region, while Recall indicates the ratio of correctly computed salient pixels to all the pixels in the ground truth. F-Measure is computed as the weighted harmonic mean of Precision and Recall. MAE is computed as the mean of the absolute difference between the saliency map and the ground truth. Let \(\mathbf {S}\) represent the saliency map whose intensity values are normalized between 0 and 1, and let \(\mathbf {G}\) represent the ground truth. A binary map (\(\mathbf {B}\)) is then obtained from the saliency map by using an adaptive threshold, as suggested in [1], for computing the average Precision, average Recall, and average F-Measure. The threshold is varied between 0 and 1 in equal steps to obtain the ROC performance measures. The notation |.| is used to represent the number of nonzero entries in the binary map and the ground truth annotation. The mathematical formulae for the various performance measures are given below [3]:

$$\begin{aligned}&\mathrm{Precision}= \frac{|\mathbf {B}\cap \mathbf {G}|}{|\mathbf {B}|} \end{aligned}$$
(15)
$$\begin{aligned}&\mathrm{Recall}= \frac{|\mathbf {B}\cap \mathbf {G} |}{|\mathbf {G}|} \end{aligned}$$
(16)

F-Measure is computed from Precision and Recall as follows [3]:

$$\begin{aligned} F_{\beta }=\frac{(1+\beta ^{2})\, \mathrm{Precision} \times \mathrm{Recall}}{\beta ^{2}\, \mathrm{Precision} + \mathrm{Recall}} \end{aligned}$$
(17)

For our experiments, \(\beta ^{2}\) is set to 0.3, as suggested in [1], to emphasize Precision more than Recall. The receiver operating characteristic (ROC) curve is the plot with the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis. The TPR and FPR are computed as follows [3]:

$$\begin{aligned}&\mathrm{TPR}=\frac{|\mathbf {B}\cap \mathbf {G} |}{|\mathbf {G}|} \end{aligned}$$
(18)
$$\begin{aligned}&\mathrm{FPR}=\frac{|\mathbf {B}\cap {\bar{\mathbf {G}}} |}{|{\bar{\mathbf {G}}}|} \end{aligned}$$
(19)

Mean Absolute Error (MAE) is found between saliency map and ground truth as follows [3]:

$$\begin{aligned} \mathrm{MAE}=\frac{1}{W\times H}\sum _{x=1}^{W}\sum _{y=1}^{H}|\mathbf {S}(x,y)-\mathbf {G}(x,y)| \end{aligned}$$
(20)
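A minimal sketch of these measures for a single image is given below (Python/NumPy; since the weighting of the F-Measure is sometimes stated for \(\beta\) and sometimes for \(\beta^{2}\), it is left as a parameter here):

```python
import numpy as np

def evaluate(S, G, beta_sq=0.3):
    """S: saliency map with values in [0, 1]; G: binary ground-truth mask.
    Returns Precision, Recall, F-measure, and MAE for one image, binarizing S
    with the adaptive threshold of Eq. (11)."""
    B = S >= 2.0 * S.mean()
    G = G.astype(bool)
    tp = np.logical_and(B, G).sum()
    precision = tp / (B.sum() + 1e-12)                               # Eq. (15)
    recall = tp / (G.sum() + 1e-12)                                  # Eq. (16)
    f_measure = ((1 + beta_sq) * precision * recall
                 / (beta_sq * precision + recall + 1e-12))           # Eq. (17)
    mae = np.abs(S - G.astype(float)).mean()                         # Eq. (20)
    return precision, recall, f_measure, mae
```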

4.3 Comparison with state-of-the-art models

4.3.1 Qualitative comparison

Here, we show a qualitative comparison of the proposed framework with the 25 state-of-the-art methods. Several visual examples on the six datasets are shown in Figs. 4, 5 and 6. Figure 4 provides a visual comparison of the proposed framework with the 17 classic saliency methods on all six datasets. It can be easily observed that the proposed framework produces more accurate saliency maps, while most classic saliency methods fail to generate good-quality saliency maps on complex backgrounds and foregrounds.

For example, for (a) homogeneous background and heterogeneous foreground images (columns 1, 2, 6 in Fig. 4), the proposed framework detects the salient object as a whole; (b) cluttered backgrounds (columns 3, 4, 8, 11, 12 in Fig. 4), where the proposed framework clearly removes background noise and highlights the salient regions; (c) multiple objects (columns 9, 10 in Fig. 4), where the proposed framework effectively identifies all salient objects present in the images, whereas the others fail to detect them either wholly or partially; (d) objects touching the image boundary (columns 7, 9 in Fig. 4), where the proposed framework detects the salient region that is connected to the image boundary; and (e) heterogeneous backgrounds (column 5 in Fig. 4), where our framework significantly suppresses all the different background regions and highlights the salient region. Figure 5 demonstrates the qualitative effectiveness of the proposed framework against the five existing selected saliency methods. It reveals that the proposed framework performs better on a variety of images, while the individually selected saliency methods fail to be effective in complex scenarios. For example, in complex scenarios (columns 1, 3, 4, 11, 12 in Fig. 5), the proposed framework is more effective than the individual selected saliency methods. In addition, the proposed framework also produces good-quality saliency maps on other varieties of images (columns 2, 5, 6, 7 in Fig. 5).

Figure 6 shows the qualitative comparison with the three aggregation saliency methods on the six datasets. It can be observed that the proposed framework is consistently superior to the M-EST [25] aggregation approach on a variety of images. Further, the proposed framework identifies the salient region(s) in a visually better manner than the AML [53] method on images containing a heterogeneous foreground, e.g., columns 3 and 9 in Fig. 6; here the proposed framework is superior to the AML method in terms of completeness, while its performance is comparable in the other scenarios. The proposed framework also removes foreground noise better than CCA [43], as depicted in columns 1 and 2, and highlights the salient region more effectively than CCA [43], as shown in column 3, while the results are similar to CCA [43] in the other scenarios. These visual examples of the different saliency methods demonstrate the superiority of the proposed framework on a variety of images. The proposed framework is effective on a variety of images because it accurately segments the salient object (majority voting) and effectively reduces background noise (integration logic).

Fig. 6

Qualitative comparison of the proposed framework with the three aggregation saliency methods. The images are arranged sequentially in pairs from the MSRA10K [8], DUT-OMRON [55], ECSSD [54], PASCAL-S [32], SED2 [2], and THUR15K [7] datasets

Table 2 Quantitative comparisons with the 17 classic saliency methods on six datasets (higher is better for Precision, Recall, and F-Measure, while smaller is better for MAE and time)
Table 3 Quantitative comparisons with the five existing selected saliency methods on six datasets (higher is better for Precision, Recall, and F-Measure, while smaller is better for MAE and time)

4.3.2 Quantitative comparison

To verify the effectiveness of the proposed framework, we also present quantitative comparisons in three segments, covering the classic, existing selected, and aggregation saliency methods, as illustrated in Tables 2, 3, and 4, respectively.

Classic saliency methods The quantitative results of the proposed framework and the 17 classic saliency methods are presented in Table 2 in terms of Precision, Recall, F-Measure, and MAE, together with the computational time in seconds on the MSRA10K [8] dataset. As can be observed, the proposed framework outperforms all the other methods across all the datasets in terms of Precision, F-Measure, and MAE, because it takes advantage of the combined existing saliency methods to accurately segment the salient and non-salient regions and aggregates the saliency values using the novel integration approach. It is comparable in terms of Recall with the top performers MILPS [17] (on MSRA10K [8] and SED2 [2]) and MST [50] (on DUT-OMRON [55], ECSSD [54], and PASCAL-S [32]).

Table 4 Quantitative comparisons with the three aggregation saliency methods on six datasets (higher is better for Precision, Recall, and F-Measure, while smaller is better for MAE and time)

As suggested by Liu et al. [36], Recall plays a less important role than Precision in salient object detection performance. On MSRA10K [8], the proposed framework improves the Precision and F-Measure scores by approx. 2.2% and 0.7% over the best method MILPS [17], and improves the MAE score by approx. 17% over the best method MST [50], among the compared classic methods. On DUT-OMRON [55], the proposed framework improves the Precision, F-Measure, and MAE scores by approx. 1%, 4.9%, and 19.5% over the best method LPS [28] among the compared classic methods. Next, on ECSSD [54], the proposed method improves the Precision and F-Measure scores by approx. 4.3% and 3.1% over the best method MILPS [17], and improves the MAE score by approx. 11.6% over the best method MST [50], among the compared classic methods. On PASCAL-S [32], the proposed framework improves the Precision score by approx. 4.7% over the best method MILPS [17], and improves the F-Measure and MAE scores by approx. 2.4% and 6.6% over the best method MST [50], among the compared classic methods. Further, on SED2 [2], the proposed method improves the Precision, F-Measure, and MAE scores by approx. 0.9%, 0.6%, and 5.1% over the best methods SEG [44], MILPS [17], and MST [50] among the compared classic methods, respectively. On THUR15K [7], the proposed framework improves the Precision score by approx. 4.4% over the best method RCRR [57], and improves the F-Measure and MAE scores by approx. 3.6% and 16.9% over the best method MST [50], among the compared classic methods. Hence, it is apparent that the proposed model improves Precision, F-Measure, and MAE by approx. 2.9%, 2.55%, and 12.78% on average over the best method among the compared classic methods across all six datasets. The above discussion shows that the proposed framework possesses effective performance in terms of correctness, completeness, and the suppression of non-salient regions in different image scenarios, which illustrates its robustness. In addition, the computational time of the proposed framework includes the computational times of all the combined saliency methods (i.e., 0.40, 0.09, 3.49, 5.64, and 0.15 s) and of the aggregation approach (i.e., 0.03 s). It is faster than several methods such as CA [13], COV [12], and MILPS [17].

Existing selected saliency methods In order to evaluate the performance of the proposed framework against the existing selected saliency methods, we also present quantitative comparisons with the combined saliency methods, as illustrated in Table 3, in terms of Precision, Recall, F-Measure, and MAE, together with the computational time in seconds on the MSRA10K [8] dataset.

Table 5 Computational time of aggregation component

The scores presented in Table 3 demonstrate that the proposed framework outperforms all the individual methods in terms of Precision, F-Measure, and MAE, because the proposed framework clearly suppresses background noise that is not suppressed by all of the individual saliency detection methods in the same regions; rather, these existing selected methods perform better on different regions of an image. We exploit this advantage of each individual method to find a saliency map that significantly improves the quality of salient object detection.

Fig. 7

ROC on the six datasets: a MSRA10K [8], b DUT-OMRON [55], c ECSSD [54], d PASCAL-S [32], e SED2 [2], f THUR15K [7]

In terms of Recall, the performance of the proposed framework is comparable with DRFI [21] (on MSRA10K [8], ECSSD [54], THUR15K [7], and PASCAL-S [32]) and RBD [62] (on DUT-OMRON [55] and SED2 [2]). Furthermore, on MSRA10K [8], the proposed framework improves the Precision, F-Measure, and MAE scores by approx. 3%, 2.4%, and 27.4% over the best methods, i.e., MC [20], DRFI [21], and RBD [62], among the existing selected methods, respectively. On the DUT-OMRON [55] dataset, the proposed framework improves the Precision and F-Measure scores by approx. 4.3% and 2.6% over the best method DRFI [21], and improves the MAE score by approx. 16.1% over the best method DSR [30], among the existing selected methods.

Fig. 8

Failure cases. a Images. b Ground truth. c Our saliency maps d GMR [55] saliency maps e MC [20] saliency maps f DSR [30] saliency maps g DRFI [21] saliency maps h RBD [62] saliency maps

On ECSSD [54], the proposed framework improves the Precision, F-Measure, and MAE scores by approx. 2.4%, 0.6%, and 18.6% over the best method DRFI [21] among the existing selected methods. On PASCAL-S [32], the proposed framework improves the Precision and F-Measure scores by approx. 2.2% and 1.2% over the best method DRFI [21] while improves MAE score by approx. 8.8% over the best method RBD [62] among the existing selected methods. Similarly, on SED2 [2], the proposed framework improves the Precision, F-Measure, and MAE scores by approx. 0.8%, 0.4%, and 8.8% over the best method RBD [62] among the existing selected methods. On THUR15K [7], the proposed framework improves the Precision and F-Measure scores by approx. 1.9 % and 0.3% over the best method DRFI [21] while improves MAE score by approx. 14% over the best method DSR [30] among the existing selected methods. Thus, the proposed model improves Precision, F-Measure, and MAE scores by approx. 2.4%, 1.7%, and 15.6% on average over the best-performing method among the existing selected saliency methods across all six datasets.

Aggregation saliency methods The quantitative performance of the proposed framework compared with the aggregation saliency methods is shown in Table 4 in terms of Precision, Recall, F-Measure, and MAE; Table 4 also presents the computational time on the MSRA10K [8] dataset. The statistical data in Table 4 illustrate that the proposed framework outperforms the aggregation methods on all six datasets except PASCAL-S [32] in terms of Precision, F-Measure, and MAE, while on PASCAL-S [32] it outperforms M-EST [25] in terms of Precision and MAE. In addition, the proposed framework is very close to CCA [43] on all metrics, i.e., the performance is almost equal. In contrast, the proposed framework outperforms AML [53] in terms of Recall across all datasets, while it is comparable on the other metrics. The proposed framework thus performs in a consistent and robust manner on a variety of images. Furthermore, the computational times of the M-EST [25], AML [53], and CCA [43] aggregation methods include the computation time of (i) the individual saliency methods and (ii) the aggregation approach. As demonstrated in Table 4, the proposed framework is very fast in processing an image and is computationally more efficient than all the compared aggregation methods, which verifies its superiority over them. Table 5 displays the run time of the aggregation component of the different aggregation methods. The proposed framework reduces the aggregation computational time by approx. 98.99%, 88.46%, and 97.20% compared to the aggregation methods M-EST [25], CCA [43], and AML [53], respectively.

Furthermore, we evaluate the effectiveness of the proposed framework under varying thresholds in terms of ROC against the 25 different saliency methods on the six datasets, as shown in Fig. 7. It can be seen that the proposed framework performs better than AML [53] and is comparable with M-EST [25] and CCA [43] on all datasets. Hence, the proposed framework separates the salient and non-salient regions at different thresholds more accurately than AML [53]. It can also be seen that DRFI [21] is the best performer among all the methods on DUT-OMRON [55], SED2 [2], and THUR15K [7], while CCA [43] performs better on the other datasets. Hence, the performance of the proposed framework is comparable with the better-performing saliency detection methods and better than several classic state-of-the-art methods.

4.4 Failure cases

In this paper, we formulate the saliency detection problem as saliency bagging over existing saliency detection models. The proposed model is highly effective on most saliency detection tasks. However, when the selected existing saliency detection models are simultaneously unable to detect a particular salient region, our model also fails to detect that region, as shown in the top row of Fig. 8. As shown in the middle row of Fig. 8, when the majority of the existing saliency detection models detect the salient region as background, this also influences the performance of our model. In the bottom row of Fig. 8, the above two cases appear simultaneously.

5 Conclusion and future work

In this paper, we have proposed a novel framework for saliency detection called saliency bagging. The proposed framework gives robust performance across a variety of images and works in three steps. First, existing saliency detection models are selected and a set of initial saliency maps is computed. Second, an integrated binary map is found from the initial saliency maps by applying adaptive thresholding and majority voting. Lastly, the final saliency map is obtained from the integrated binary map and the initial saliency maps using the proposed novel integration logic. The efficacy of the proposed framework is supported by extensive experiments on six publicly available datasets, and its performance is compared with 25 state-of-the-art methods, which include 17 classic best-performing methods of the last decade, the five existing selected methods, and three aggregation saliency methods. The experimental results indicate that the proposed method significantly enhances the quality of saliency maps. It outperforms all the compared classic and existing selected methods in terms of Precision, F-Measure, and MAE, while it is comparable to the best-performing methods in terms of Recall and the ROC curve across all six datasets. It is computationally more efficient than all the compared aggregation methods, while its performance is comparable on all datasets, which validates its superiority. In future work, we will look to improve the performance of salient object detection in terms of Recall and the ROC curve using machine learning techniques.