1 Introduction

Salient object detection is a challenging problem in computer vision that plays an important role in a variety of applications such as image and video segmentation [1], object recognition [2], object class discovery [3], image retargeting [4], image quality assessment [5], image super-resolution [6], person re-identification [7], and image deblurring [8], to name a few.

Most existing saliency approaches, such as CNTX [9], RC [10] and HDCT [11], focus only on static or dynamic 2D scenes. These methods extract visual features and cues such as color, intensity, texture and motion [12, 13] from 2D images for saliency detection. In HDCT, the saliency map of an image is represented as a linear combination of high-dimensional color space features in which salient regions and backgrounds can be distinctively separated. Boundary connectivity [14], proposed in RBD, is a robust background measure that characterizes the spatial layout of image regions with respect to the image boundaries. In addition, Lin et al. [12] proposed a macroblock classification method for various video processing applications involving motion, and a convex-hull-based process is proposed in [13] to automatically determine the regions of interest of the motions; the motion feature computed in both works is also a helpful cue for saliency detection. In general, these 2D saliency algorithms work well on simple images but perform poorly in challenging scenarios. Moreover, 2D saliency methods are inherently different from how the human visual system detects saliency. Recently, a few deep learning based frameworks such as [15] have been proposed and have obtained remarkable results and significant improvements. However, these algorithms are still aimed at 2D images: they not only ignore depth information but also require a large amount of labeled training data. The human eye can conduct dynamic refocusing that enables rapid sweeping over different depth layers, and humans use two eyes to estimate scene depth for more reliable saliency detection, whereas most existing 2D approaches, including deep learning based methods, assume that depth information is unavailable. For more details about 2D saliency models, please refer to Borji et al.'s studies [16, 17].

Fortunately, light field imaging, a new research field in recent years, has a unique capability of post-capture refocusing. A light field can be represented as a 4D function of light rays in terms of their positions and directions, and this information can be converted into focal slices focusing on different depth levels, all-focus images and depth maps using rendering and refocusing techniques. The availability of focal slices is in line with the focusness cue [18,19,20], which can be computed to separate in-focus and out-of-focus regions so as to identify salient objects. A moderately accurate depth map [21,22,23,24] can also greatly help distinguish the foreground from the background. In short, light field data provides a wealth of information such as the depth cue and the focusness cue, can be easily acquired in a single shot by commercial light field cameras such as Lytro and Raytrix, and has recently proved to be very useful for salient object detection [19, 20, 23, 25]. The pioneering work LFS [19], the first salient object detection method on light fields, explores these cues and can deal with many tricky scenarios, including cluttered backgrounds and similar foreground and background. However, it ignores the explicit use of the depth information of salient objects, and its framework differs significantly from previous 2D and 3D solutions, so it is hard to build on previous models to extend this research. Moreover, the saliency maps produced by LFS are not sufficiently sharp. Li et al. [20] put forward a unified saliency detection framework, WSC, for handling heterogeneous types of input data, including 2D, 3D and 4D data, in which many visual features such as color, texture, depth and focusness are incorporated to highlight the regions of interest. This algorithm can handle heterogeneous types of input data, but it does not fully exploit the rich information embedded in 3D and 4D data; in fact, directly appending depth to the feature vector is not a good choice for salient object detection. Zhang et al. [25] proposed a saliency detection method, denoted DILF, which examines focusness, depth and all-focus cues from light field data. It first computes color contrast saliency and depth-induced contrast saliency, and then combines the two saliency maps by a linear weighted sum to obtain the final saliency map. According to previous studies, linear fusion is probably not the best choice for saliency detection. Several visual results of these saliency algorithms are shown in Fig. 1.

Fig. 1

Saliency maps of different salient object detection methods. From left to right: we show the all-focus images and the saliency maps of LFS [19], WSC [20], DILF [25], Ours, and ground truth (GT)

Inspired by these outstanding models, this paper proposes an effective two-stage Bayesian integration framework for salient object detection on light field data. The main stages of the proposed method are illustrated in Fig. 2. Firstly, the all-focus image is segmented into a set of superpixels via the simple linear iterative clustering (SLIC) method [26], and boundary connectivity is calculated as an effective background measure; the background probability, obtained by fusing boundary connectivity with a location prior, is later used as a weighting factor. Secondly, a feature vector composed of multiple features in a high-dimensional color space is used to estimate color contrast saliency, while depth-induced contrast saliency is estimated on the depth map by computing the L2-norm distance between pairs of superpixels. Thirdly, the focusness map of the foreground slice is obtained by computing the focusness [19, 20] of the focal stack. All three maps are weighted by the background probability. Finally, we fuse these background-weighted saliency maps in a two-stage Bayesian framework to obtain the final saliency map.
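
To make the overall flow concrete, the following sketch expresses the fusion pipeline in NumPy-style Python. It assumes the three cue maps, the background probability and a pairwise Bayesian fusion function are already available; all names are illustrative, and the block is a structural sketch rather than the authors' (MATLAB) implementation.

```python
import numpy as np

def two_stage_pipeline(s_color, s_depth, s_focus, bg_prob, fuse):
    """Structural sketch of the proposed pipeline (names are illustrative).

    s_color, s_depth, s_focus : per-superpixel (or per-pixel) cue maps
    bg_prob                   : background probability of Eq. (4)
    fuse                      : any pairwise Bayesian fusion function (Sect. 2.5)
    """
    # Weight every cue by the background probability, Eq. (10)-(12)
    s_color, s_depth, s_focus = (np.asarray(s) * bg_prob
                                 for s in (s_color, s_depth, s_focus))
    # First stage: fuse color contrast with the focusness map, Eq. (16)
    s_cf = fuse(s_color, s_focus)
    # Second stage: fuse the intermediate map with depth-induced contrast
    return fuse(s_cf, s_depth)
```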

In brief, the main contributions of this letter are as follows. Firstly, a new computational framework for salient object detection is proposed, which explores how to effectively use existing visual features of an image to achieve better saliency detection performance. Secondly, a two-stage Bayesian integration strategy is adopted in the proposed framework, which further improves the performance of salient object detection.

2 Contrast-Based Saliency Computation

The proposed approach mainly integrates three saliency maps: the color contrast saliency of the all-focus image, the depth contrast saliency of the depth map and the focusness map of the foreground slice. As shown in Fig. 2, we first estimate high-level priors such as the background probability and focusness. Next, the three saliency maps are computed respectively. Finally, all three maps are weighted by the background probability to accurately highlight the informative objects of an image.

Fig. 2

Illustration of the main phases of our proposed salient object detection algorithm

2.1 Estimation of Background Probability

According to our observation, as in the all-focus image, foreground objects and background regions in the depth map differ in their structures and spatial layouts. Following RBD [14], the boundary connectivity prior is computed as a background measure for saliency detection. For a superpixel p, it is defined as follows:

$$\begin{aligned} bndCon\left( p \right) = \dfrac{Len_{bnd}(p)}{\sqrt{Area(p)}} \end{aligned}$$
(1)

where \(Len_{bnd}(p)\) is the length of superpixel p along the image boundary and Area(p) is its spanning area, which are computed as follows:

$$\begin{aligned} Area(p)= & {} \sum _{i=1}^{N}\mathrm {exp}\left( -{\dfrac{d_{geo}^2\left( p,p_i\right) }{2{\sigma _{clr}^2}}}\right) = \sum _{i=1}^{N}S\left( p,p_i\right) \end{aligned}$$
(2)
$$\begin{aligned} Len_{bnd}(p)= & {} \sum _{i=1}^{N}S\left( p,p_i\right) \cdot {\delta \left( p_i\in bnd\right) } \end{aligned}$$
(3)

Accordingly, background probability based on boundary connectivity is defined as:

$$\begin{aligned} bgPb(p) = 1 - \mathrm {exp}\left( -\dfrac{bndCon^2\left( p \right) }{2 {\sigma }_{bndCon}^2} \right) \end{aligned}$$
(4)
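
As a concrete illustration of Eqs. (1)-(4), the sketch below computes boundary connectivity and the background probability from a precomputed matrix of pairwise geodesic color distances and a boolean indicator of boundary superpixels; the variable names and default parameter values are assumptions made for this sketch.

```python
import numpy as np

def background_probability(d_geo, is_bnd, sigma_clr=10.0, sigma_bndcon=1.0):
    """Boundary connectivity (Eq. 1-3) and background probability (Eq. 4).

    d_geo  : (N, N) array of pairwise geodesic color distances d_geo(p, p_i)
    is_bnd : (N,) boolean array, True for superpixels on the image boundary
    The parameter values are illustrative defaults, not the paper's settings.
    """
    # S(p, p_i) = exp(-d_geo^2 / (2 * sigma_clr^2)), Eq. (2)
    S = np.exp(-d_geo ** 2 / (2.0 * sigma_clr ** 2))

    area = S.sum(axis=1)                   # Area(p),    Eq. (2)
    len_bnd = S[:, is_bnd].sum(axis=1)     # Len_bnd(p), Eq. (3)
    bnd_con = len_bnd / np.sqrt(area)      # bndCon(p),  Eq. (1)

    # bgPb(p) = 1 - exp(-bndCon^2 / (2 * sigma_bndCon^2)), Eq. (4)
    return 1.0 - np.exp(-bnd_con ** 2 / (2.0 * sigma_bndcon ** 2))
```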

2.2 Focusness Estimation of Foreground Slice

For the focusness of the 4D light field, the focusness of each superpixel p in a focal slice \(f_i\) is defined as the mean squared distance between its color and the colors of its 8 neighbors in the LAB color space:

$$\begin{aligned} \sigma _{i}^f\left( p\right) = \dfrac{1}{8}\sum _{j=1}^{8}{\left\| fea^{color} - fea_{j}^{color} \right\| }_2^2 \end{aligned}$$
(5)

where \(fea^{color}\) is the color vector of superpixel p in the LAB color space and \(fea_{j}^{color}\) is that of its j-th neighbor. Next, for each slice \(f_i\), the focusness of all pixels is projected onto the x and y axes respectively to obtain two 1D focusness distributions, denoted \(D_x^i\) and \(D_y^i\), which are defined as:

$$\begin{aligned} D_x^i = \dfrac{\sum _{y=1}^{h}{\sigma _{i}^f\left( x,y\right) }}{\sum _{x} \sum _{y}{\sigma _{i}^f\left( x,y\right) }} , D_y^i = \dfrac{\sum _{x=1}^{w}{\sigma _{i}^f\left( x,y\right) }}{\sum _{x} \sum _{y}{\sigma _{i}^f\left( x,y\right) }} \end{aligned}$$
(6)

In order to select the foreground slice, similar to LFS [19], a background likelihood score is computed for each focal slice \(f_i\) using a band suppression filter:

$$\begin{aligned} BLS(f_i) = \rho \cdot \left[ \sum _{x=1}^{w}D_x^i\left( x\right) \cdot u\left( x,w\right) + \sum _{y=1}^{h}D_y^i\left( y\right) \cdot u\left( y,h\right) \right] \end{aligned}$$
(7)

where \(\rho = \exp \left( \dfrac{\lambda \cdot i}{N}\right) \) is a weighting factor and \(u\left( \cdot ,\cdot \right) \) is the U-shaped 1D band suppression filter defined in LFS. The slice with the lowest BLS score is selected as the foreground slice fg, and its focusness map is denoted \(S_F\left( p\right) \).
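
The following sketch illustrates Eqs. (6)-(7): it projects per-slice focusness maps onto the two image axes and selects the slice with the lowest background likelihood score. The simple rectangular U-shaped filter and the value of \(\lambda \) used here are illustrative stand-ins for the band suppression filter defined in LFS.

```python
import numpy as np

def select_foreground_slice(focus_maps, lam=1.0):
    """Pick the foreground slice via Eq. (6)-(7).

    focus_maps : list of N 2-D arrays, the per-slice focusness sigma_i^f(x, y).
    The U-shaped filter below is only an illustrative stand-in for the
    band suppression filter u(.,.) of LFS; lam is likewise a placeholder.
    """
    def u_shaped(length, border=0.2):
        # 1 near the two borders, 0 in the central band (a simple U shape)
        t = np.arange(length) / float(length - 1)
        return ((t < border) | (t > 1.0 - border)).astype(float)

    n_slices = len(focus_maps)
    scores = []
    for i, f in enumerate(focus_maps, start=1):
        h, w = f.shape
        total = f.sum() + 1e-12
        d_x = f.sum(axis=0) / total        # Eq. (6), projection onto x
        d_y = f.sum(axis=1) / total        # Eq. (6), projection onto y
        rho = np.exp(lam * i / n_slices)   # weighting factor in Eq. (7)
        bls = rho * (np.dot(d_x, u_shaped(w)) + np.dot(d_y, u_shaped(h)))
        scores.append(bls)

    return int(np.argmin(scores))          # slice with the lowest BLS score
```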

2.3 Color Contrast Computation in High-Dimensional Color Space

For color-based saliency, following HDCT [11], we compute various color features, including the average pixel color, color contrast and color histograms, in different color spaces such as RGB, CIELab and HSV. Color contrast is of two types: local contrast, defined as the difference in color features between superpixel p and its k nearest neighboring patches, and global contrast, defined as the difference between superpixel p and all other patches. Unlike HDCT, we also take into account Gabor filter responses with 4 scales and 12 orientations as additional texture features; texture features are extremely helpful when objects with similar colors appear in both the foreground and background regions. These features are concatenated into a 123-dimensional superpixel feature vector for the color-based saliency \(S_C\left( p\right) \).
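
A simplified sketch of this feature construction is given below, assuming scikit-image for the color conversions and Gabor filtering. It only illustrates the idea of stacking per-superpixel color and texture statistics and measuring global contrast; it does not reproduce the exact 123-dimensional feature vector described above.

```python
import numpy as np
from skimage import color, filters

def superpixel_features(rgb, labels, n_scales=4, n_orients=12):
    """Simplified per-superpixel color + Gabor texture features.

    rgb    : (H, W, 3) float image in [0, 1]
    labels : (H, W) integer superpixel labels (0..N-1), e.g. from SLIC
    """
    gray = color.rgb2gray(rgb)
    # Per-pixel channel stack: RGB, CIELab, HSV (9 channels)
    channels = [rgb, color.rgb2lab(rgb), color.rgb2hsv(rgb)]
    maps = [c[..., k] for c in channels for k in range(3)]

    # Gabor magnitude responses as texture cues (n_scales x n_orients maps)
    for s in range(n_scales):
        for o in range(n_orients):
            real, imag = filters.gabor(gray,
                                       frequency=0.1 * (s + 1),
                                       theta=np.pi * o / n_orients)
            maps.append(np.hypot(real, imag))

    stack = np.stack(maps, axis=-1)                       # (H, W, D)
    n_sp = labels.max() + 1
    feats = np.array([stack[labels == i].mean(axis=0) for i in range(n_sp)])
    return feats                                           # (N, D)

def global_color_contrast(feats):
    """Global contrast: mean feature distance of each superpixel to all others."""
    diff = feats[:, None, :] - feats[None, :, :]
    return np.linalg.norm(diff, axis=-1).mean(axis=1)
```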

2.4 Depth-Induced Contrast Computation

Depth has proved greatly helpful for salient object detection and complementary to color contrast, but directly appending the depth map to the feature vector is probably a poor choice. Instead, following most contrast-based methods such as RBD [14] and DILF [25], we compute the depth-induced contrast \(S_D\left( p_i\right) \) for superpixel \(p_i\):

$$\begin{aligned} S_D\left( p_i\right)= & {} \sum _{j=1}^{N}W_{pos}\left( p_i,p_j\right) \left\| U_{f}\left( p_i\right) - U_{f}\left( p_j\right) \right\| \end{aligned}$$
(8)
$$\begin{aligned} W_{pos}\left( p_i,p_j\right)= & {} \mathrm {exp}\left( -{\dfrac{{\left\| U_{p}^*\left( p_i\right) - U_{p}^*\left( p_j\right) \right\| }^2}{2\sigma _w^2}}\right) \end{aligned}$$
(9)

where \(U_{f}\left( p_i\right) \) is the average depth value of superpixel \(p_i\), \(U_{p}^*\left( p_i\right) \) is its spatial position, and \(W_{pos}\left( p_i,p_j\right) \) is a spatial weight based on the \(L_2\)-norm distance between superpixels \(p_i\) and \(p_j\).
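
The sketch below implements Eqs. (8)-(9) directly from per-superpixel average depths and centroid positions; the bandwidth \(\sigma _w\) shown is an illustrative value.

```python
import numpy as np

def depth_contrast(mean_depth, centroids, sigma_w=0.25):
    """Depth-induced contrast of Eq. (8)-(9).

    mean_depth : (N,) average (normalized) depth U_f(p_i) of each superpixel
    centroids  : (N, 2) normalized superpixel positions U_p*(p_i)
    sigma_w is an illustrative value for the spatial bandwidth.
    """
    # Spatial weight W_pos(p_i, p_j), Eq. (9)
    pos_diff = centroids[:, None, :] - centroids[None, :, :]
    w_pos = np.exp(-np.sum(pos_diff ** 2, axis=-1) / (2.0 * sigma_w ** 2))

    # Weighted depth differences, Eq. (8)
    depth_diff = np.abs(mean_depth[:, None] - mean_depth[None, :])
    return (w_pos * depth_diff).sum(axis=1)
```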

2.5 Two-Stage Bayesian Fusion Framework for Salient Object Detection

According to our observations, the background prior is complementary not only to color contrast but also to depth-induced contrast and the focusness of the foreground slice. Hence, all three maps, i.e., the color contrast, the depth-induced contrast and the focusness map, are weighted by the background probability bgPb(p) as follows:

$$\begin{aligned} S_D^B\left( p\right)= & {} S_D\left( p\right) \cdot bgPb(p) \end{aligned}$$
(10)
$$\begin{aligned} S_C^B\left( p\right)= & {} S_C\left( p\right) \cdot bgPb(p) \end{aligned}$$
(11)
$$\begin{aligned} S_F^B\left( p\right)= & {} S_F\left( p\right) \cdot bgPb(p) \end{aligned}$$
(12)

The next step is to fuse the saliency maps through a Bayesian framework; in this paper, Bayesian fusion is conducted twice in succession. The Bayes formula has been used in recent studies [27, 28] to compute saliency as a posterior probability:

$$\begin{aligned} Pb\left( F\mid H(z)\right) = \dfrac{S(z) Pb\left( H(z)\mid F\right) }{S(z) Pb\left( H(z)\mid F\right) + \left( 1-S(z) \right) Pb\left( H(z)\mid B\right) } \end{aligned}$$
(13)

where H(z) is a feature vector of pixel z and the prior probability S(z) is a coarse saliency map. The likelihood probabilities are computed as:

$$\begin{aligned} Pb\left( H(z)\mid F\right)= & {} \prod _{r \in \{L,a,b\}}{\dfrac{N_{bF\left( r(z)\right) }}{N_F}} \end{aligned}$$
(14)
$$\begin{aligned} Pb\left( H(z)\mid B\right)= & {} \prod _{r \in \{L,a,b\}}{\dfrac{N_{bB\left( r(z)\right) }}{N_B}} \end{aligned}$$
(15)

where \(N_F\) and \(N_B\) are the number of pixels in the foreground F and background B respectively. \(N_{bF\left( r(z)\right) }\) and \(N_{bB\left( r(z)\right) }\) denote the number of pixels whose color features belong to the foreground bin \(bF\left( r(z)\right) \) and background bin \(bB\left( r(z)\right) \) respectively.

In the first fusion stage, we fuse the color contrast saliency \(S_C^B\) with the focusness map \(S_F^B\): the color contrast saliency \(S_C^B\) is treated as the prior probability and the focusness map \(S_F^B\) is used to compute the likelihood probability \(Pb\left( S_F^B(z)|F_1\right) \); vice versa, the focusness map \(S_F^B\) serves as the prior probability and the color contrast saliency \(S_C^B\) is used to compute the likelihood probability \(Pb\left( S_C^B(z)|F_2\right) \). The two corresponding posterior probabilities are then computed by the Bayes formula and integrated into a fused saliency map:

$$\begin{aligned} S_{CF}^B\left( z \right) = Pb\left( F_1|S_F^B(z)\right) + Pb\left( F_2|S_C^B(z)\right) \end{aligned}$$
(16)

Then, the fusion result of the first stage, denoted \(S_{CF}^B\), is fused in the same way with the depth-induced contrast saliency \(S_D^B\) in the second stage to obtain the final saliency map.
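
The sketch below illustrates the fusion of Eqs. (13)-(16) for a single pair of maps, estimating the likelihoods from 1-D histograms of the observed map; defining the tentative foreground by thresholding the prior map at its mean is an assumption of this sketch rather than the paper's rule. The two-stage result is then obtained by applying the same routine twice, as indicated in the final comment.

```python
import numpy as np

def bayes_posterior(prior, obs, n_bins=16):
    """Posterior Pb(F | obs(z)) of Eq. (13), with likelihoods estimated from
    1-D histograms of `obs` (Eq. 14-15 restricted to a single cue).

    prior, obs : 2-D saliency maps normalized to [0, 1].
    Thresholding the prior at its mean to define the tentative foreground F
    and background B is an assumption of this sketch.
    """
    fg = prior >= prior.mean()
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(obs, bins) - 1, 0, n_bins - 1)

    # Likelihoods: fraction of F (resp. B) pixels falling in each obs bin
    lik_f = np.bincount(idx[fg], minlength=n_bins) / max(fg.sum(), 1)
    lik_b = np.bincount(idx[~fg], minlength=n_bins) / max((~fg).sum(), 1)

    p_f = prior * lik_f[idx]
    p_b = (1.0 - prior) * lik_b[idx]
    return p_f / (p_f + p_b + 1e-12)              # Eq. (13)

def bayes_fuse(s1, s2):
    """Symmetric fusion of Eq. (16): each map serves once as the prior."""
    fused = bayes_posterior(s1, s2) + bayes_posterior(s2, s1)
    return fused / (fused.max() + 1e-12)           # rescale to [0, 1]

# Two-stage fusion (Sect. 2.5): color + focusness first, then depth
# s_final = bayes_fuse(bayes_fuse(s_color_b, s_focus_b), s_depth_b)
```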

3 Experimental Results

3.1 Visual Comparison with Other Methods

To illustrate the effectiveness of the proposed approach, we perform experiments on LFSD [19], the only available light field saliency dataset. LFSD includes 40 outdoor scenes and 60 indoor scenes, each captured with a Lytro camera. For each scene, three users were asked to manually label the salient objects on the all-focus image.

Fig. 3

Visual comparisons of different light field saliency methods. From left to right: we show the all focus images and the saliency results of LFS [19], WSC [20], DILF [25], Ours, and GT

The proposed approach is qualitatively compared with all light field saliency methods known to us, including LFS [19], WSC [20] and DILF [25]. A visual comparison of our method with the others is presented in Fig. 3. Note that the proposed method not only highlights the entire salient object but also clearly suppresses background noise. Our model can robustly detect objects of interest in challenging scenarios, such as similar foreground and background or cluttered background, and achieves the most visually acceptable salient object detection results.

Fig. 4

Performance comparisons of the proposed method versus LFS [19], WSC [20], DILF [25] and five 2D saliency models: RBD [14], GMR [29], HDCT [11], GS [30] and SF [31]. a Precision-recall curve (PRC); b F-measure; c precision, recall and F-measure scores; d mean absolute error (MAE)

3.2 Performance Evaluation Measures

In order to conduct a quantitative performance evaluation, we adopt the widely used precision-recall curve (PRC). For a given saliency map with values in the range [0, 255], we threshold the map at a threshold T in [0, 255] to obtain a binary mask of the salient object. We then vary this threshold from 0 to 255 and compute the precision, recall and F-measure at each threshold value to compare the quality of different saliency maps. The PRCs of the different methods on LFSD are shown in Fig. 4a; the curve of the proposed approach lies above those of the other approaches for most recall rates. Following FT [32], the F-measure is also used for evaluation; it is an overall performance measure given by the weighted harmonic mean of precision and recall, defined as:

$$\begin{aligned} F_\beta = \dfrac{\left( 1+\beta ^2\right) Precision \times Recall}{\beta ^2 \cdot Precision + Recall} \end{aligned}$$
(17)

As shown in Fig. 4b, the F-measure makes it hard to tell whether the proposed method or DILF is better, but the proposed method is clearly superior to the rest. To further validate the superiority of our method, the precision, recall and F-measure scores of all methods in our comparative experiment are shown in Fig. 4c. Our results fully exceed the others, except that our recall is second only to DILF.

As suggested in SF [31], the PRC does not account for the correct detection of non-salient pixels (true negatives). Therefore, the mean absolute error (MAE) between the saliency map and the ground truth is also used for a more balanced comparison; it is defined as:

$$\begin{aligned} MAE = \dfrac{1}{W \times H}\sum _{x=1}^{W}\sum _{y=1}^{H}|S\left( x,y\right) - GT\left( x,y\right) |\end{aligned}$$
(18)

where S is a saliency map, GT represents the ground truth image, and W and H are their width and height, respectively. As shown in Fig. 4d, our MAE is the lowest among all methods. In our experiments, we set the number of superpixels N to 300 and set \(\beta ^2 =0.3\) to emphasize precision in the F-measure.
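
For completeness, the following sketch computes the thresholded precision and recall values, the F-measure of Eq. (17) and the MAE of Eq. (18) for a single saliency map; the binary form of the ground truth is assumed.

```python
import numpy as np

def evaluate(sal, gt, beta2=0.3):
    """PR values over 256 thresholds, F-measure (Eq. 17) and MAE (Eq. 18).

    sal : saliency map with values in [0, 255]; gt : binary ground-truth mask.
    """
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in range(256):                        # vary the threshold T
        mask = sal >= t
        tp = np.logical_and(mask, gt).sum()
        precisions.append(tp / max(mask.sum(), 1))
        recalls.append(tp / max(gt.sum(), 1))

    p, r = np.array(precisions), np.array(recalls)
    f_measure = (1 + beta2) * p * r / np.maximum(beta2 * p + r, 1e-12)

    mae = np.abs(sal / 255.0 - gt.astype(float)).mean()   # Eq. (18)
    return p, r, f_measure, mae
```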

Table 1 Comparison of average running time (seconds per image) of the most recent state-of-the-art saliency detection methods

In Table 1, we compare the average computational time per image of the state-of-the-art algorithms mentioned above, including our method. The running time is measured on a computer with an Intel Dual Core i5-2320 3.0 GHz CPU. Considering that the proposed algorithm is implemented in MATLAB 2015a with unoptimized code, its running time is comparable to that of the other algorithms.

Fig. 5

Some failure cases of our proposed approach, where \(Sal\_CFb\) is the Bayesian fusion result of the background-weighted color-based contrast and the focusness map of the foreground slice, and \(Sal\_Db\) is the saliency map of the background-weighted depth-induced contrast

As illustrated in Fig. 5, we also exhibit two failure cases of the proposed approach. The performance of our approach partially depends on the accuracy of the depth map: if the depth map is severely blurred or amorphous, our model produces incorrect results. Although many outstanding models can be used to estimate the depth map, obtaining an accurate depth map in natural cluttered scenes remains a challenging problem.

4 Conclusion

In this paper, the color-based contrast, the depth-induced contrast and the focusness map of the foreground slice, each weighted by the background probability, are fused by a two-stage Bayesian integration algorithm for salient object detection. We investigate the importance of the depth and focusness cues for salient object detection on light field data. Experimental results show that the color feature is complementary to the focusness and depth cues, that the three features together are of great help for salient object detection, and that the proposed method produces desirable results on the light field dataset LFSD compared with many state-of-the-art saliency approaches. Because the method partially depends on the quality of the depth map, it might fail in certain cases. In the future, we will estimate more accurate depth maps from light fields to improve the accuracy of salient object detection.