
1 Introduction

Saliency detection aims to highlight the most important pixels in an image. It helps to reduce computing costs and has been widely used in various computer vision applications, such as image segmentation [6], image retrieval [36], object detection [8], object recognition [21], image adaptation [23], and video segmentation [26]. Saliency detection methods can be grouped into three categories: bottom-up methods [22, 33, 37], top-down methods [10, 15, 35], and mixed methods [7, 27, 31]. Top-down methods are task-driven and can be used in object detection tasks. The authors in [34] proposed a top-down method that jointly learns a conditional random field and a discriminative dictionary. Top-down methods can address complex and specialized tasks, but they lack versatility. Bottom-up methods are data-driven, relying on color, light, texture, and other basic features. Itti et al. [13] proposed a saliency method using these basic features, which can be used effectively in real-time systems. Mixed methods combine both bottom-up and top-down cues.

In this paper, we focus on bottom-up methods. The proposed method is based on the properties of the Markov model, on which many works have been built, such as [3, 4]. Traditional saliency detection via absorbing Markov chain [14] is based on the Markov model as well, but it only considers boundary nodes. However, in addition to boundary cues, background prior and foreground prior cues play a complementary role in enhancing saliency detection. We consider both the four-boundary information and the foreground prior salient object within an absorbing Markov chain; that is, both boundary absorbing and foreground prior absorbing are used to obtain the background and foreground possibilities. In addition, we further optimize our model by fusing these two possibilities and exploiting multi-scale processing. Figure 1 compares the results of our proposed method with the traditional saliency detection via absorbing Markov chain (MC) method [14], where the superior performance of our method is evident.

Fig. 1. Comparison of the proposed method with the ground truth and the MC method.

2 Principle of Absorbing Markov Chain

In an absorbing Markov chain, the transition matrix P is primitive [9]. By definition, state i is absorbing when \(P(i,i)=1\) and \(P(i,j)=0\) for all \(i \ne j\). A Markov chain is called absorbing if it satisfies two conditions: it has at least one absorbing state, and from every state it is possible to reach an absorbing state in a finite number of steps (not necessarily in one step). In an absorbing Markov chain, a state that is not absorbing is called a transient state.

Suppose an absorbing chain has m absorbing states and n transient states; its transition matrix P can then be written in the canonical form:

$$\begin{aligned} P\rightarrow \left( \begin{array}{cc} Q & R\\ 0 & I\\ \end{array}\right) , \end{aligned}$$
(1)

where Q is an n-by-n matrix giving the transition probabilities between any pair of transient states, R is a nonzero n-by-m matrix giving the transition probabilities from each transient state to each absorbing state, 0 is an m-by-n zero matrix, and I is the m-by-m identity matrix.

For an absorbing chain P, every transient state can reach an absorbing state in one or more steps. The fundamental matrix N, whose entry N(i, j) gives the expected number of times the chain visits transient state j when started from transient state i, is written as:

$$\begin{aligned} N=(I-Q)^{-1}, \end{aligned}$$
(2)

where N is obtained from the invertible matrix \(I-Q\), and \(n_{ij}\) denotes the expected number of visits from transient state i to transient state j. Letting \(c=[1,1,\cdots,1]_{1\times n}^{T}\) be a vector of ones, the absorbed time for each transient state can be expressed as:

$$\begin{aligned} z=N\times c. \end{aligned}$$
(3)
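As a concrete illustration of Eqs. (2) and (3), the following sketch computes the fundamental matrix N and the absorbed times z from a transition matrix already arranged in the canonical form of Eq. (1). This is a minimal numerical sketch with illustrative variable names, not the authors' implementation.

```python
import numpy as np

def absorbed_times(P, n_transient):
    """Expected number of steps before absorption for each transient state.

    P is an (n + m) x (n + m) transition matrix in the canonical form of
    Eq. (1): the first n_transient rows/columns are transient states and
    the remaining ones are absorbing states.
    """
    Q = P[:n_transient, :n_transient]           # transient-to-transient block
    N = np.linalg.inv(np.eye(n_transient) - Q)  # fundamental matrix, Eq. (2)
    z = N @ np.ones(n_transient)                # absorbed times, Eq. (3)
    return z
```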

3 Bidirectional Absorbing Markov Chain Model

To obtain more robust and accurate saliency maps, we propose a method via bidirectional absorbing Markov chain. This section explains the procedure for finding the salient area of an image in two directions. The simple linear iterative clustering (SLIC) algorithm [2] is used to obtain the superpixels. The pipeline is illustrated in Fig. 2 and explained below:

Fig. 2. The processing pipeline of our proposed method.

3.1 Graph Construction

The SLIC algorithm is used to split the image into patches of superpixels. Afterwards, two graphs \(G^1(V^1, E^1)\) and \(G^2(V^2, E^2)\) are constructed, where \(G^1\) represents the graph of the boundary absorbing process and \(G^2\) represents the graph of the foreground prior absorbing process. In each graph, \(V^1\) and \(V^2\) denote the nodes and \(E^1\) and \(E^2\) denote the edges between nodes. For the boundary absorbing process, the superpixels around the four image boundaries are duplicated as virtual nodes. For the foreground prior absorbing process, the superpixels from the regions calculated by the foreground prior are duplicated. Both graphs therefore contain two kinds of nodes: transient nodes (superpixels) and absorbing nodes (duplicated nodes). The nodes in these two graphs satisfy the following three properties: (1) nodes (transient or absorbing) are connected when the corresponding superpixels are adjacent or share the same neighbors; in addition, boundary nodes (superpixels on the image boundary) are fully connected with each other to reduce the geodesic distance between similar superpixels; (2) any pair of absorbing nodes (duplicated from the boundary or foreground nodes) is not connected; (3) each node duplicated from the four boundaries or from the foreground prior nodes is connected to the original node it was duplicated from. In this paper, the weight \(w_{ij}\) of the edges is defined as

$$\begin{aligned} w_{ij}=e^{-\frac{\left\| x_{i}-x_{j} \right\| }{\sigma ^{2}}}, \quad i,j \in V^1 \ \text {or} \ i,j \in V^2, \end{aligned}$$
(4)

where \(\sigma \) is a constant parameter that adjusts the strength of the weights, and the features \(x_i\) are taken in the CIELAB color space. We can then obtain the affinity matrix A as

$$\begin{aligned} a_{ij}= {\left\{ \begin{array}{ll} w_{ij}, &{}\text {if}~ j\in M(i) \quad 1 \le i \le j \\ 1, &{}\text { if}~ i=j \\ 0, &{}\text { otherwise}, \end{array}\right. } \end{aligned}$$
(5)

where M(i) is the set of nodes connected to node i. The diagonal degree matrix is given as \(D = diag(\sum _{j}a_{ij})\), and the transition matrix is calculated as \(P = D^{-1} \times A\).
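The following is a minimal sketch of the graph construction described above, assuming the superpixel features are mean CIELAB colors and that the connectivity pattern (adjacent superpixels, superpixels sharing a neighbor, fully connected boundary superpixels, duplicated absorbing nodes linked to their originals) has been encoded in a boolean matrix elsewhere. The helper names `features` and `connected` are assumptions for illustration, not the authors' code.

```python
import numpy as np

def transition_matrix(features, connected, sigma2=0.1):
    """Build A, D and P = D^{-1} A as in Eqs. (4)-(5).

    features  : (n, 3) mean CIELAB color of each node (transient + absorbing)
    connected : (n, n) boolean matrix, True where nodes i and j are linked
    """
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    W = np.exp(-dist / sigma2)                  # edge weights, Eq. (4)

    A = np.where(connected, W, 0.0)             # affinity matrix, Eq. (5)
    np.fill_diagonal(A, 1.0)                    # a_ii = 1

    D = np.diag(A.sum(axis=1))                  # diagonal degree matrix
    P = np.linalg.inv(D) @ A                    # row-stochastic transition matrix
    return P
```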

3.2 Saliency Detection Model

Following the aforementioned procedures, the input image is transformed into superpixels; two kinds of absorbing nodes are now required for saliency detection. Firstly, we duplicate the boundary nodes and the foreground prior nodes as absorbing nodes, and obtain the absorbed times of the transient nodes as the foreground possibility and the background possibility, respectively. Secondly, we use a cost function to jointly optimize the two possibility results and obtain the saliency values of all transient nodes.

Absorbing Markov Chain via Boundary Nodes. Under normal conditions, the four boundaries of an image rarely contain salient objects. Therefore, boundary nodes are assumed to be background, and the set of four-boundary nodes \(H^1\) is duplicated as the absorbing node set \(D^1\), with \(H^1,D^1 \subset V^1\). The graph \(G^1\) is constructed and the absorbed time z is calculated via Eq. 3. Finally, the foreground possibility of the transient nodes, \(z^f = \bar{z}(i), \ i=1,2,\cdots,n,\) is obtained, where \(\bar{z}\) denotes the normalized absorbed-time vector.
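The boundary-absorbing step can be sketched as follows, reusing `transition_matrix` and `absorbed_times` from the earlier snippets. Appending the duplicated boundary superpixels after the transient nodes, and normalizing the absorbed time by its maximum, are assumptions made for illustration.

```python
import numpy as np

def foreground_possibility(features, connected, boundary_idx, sigma2=0.1):
    """Absorbed time with duplicated boundary nodes as absorbing nodes (z^f)."""
    boundary_idx = np.asarray(boundary_idx)
    n = features.shape[0]                       # transient nodes = superpixels
    m = len(boundary_idx)                       # duplicated boundary nodes

    # Append duplicates of the boundary superpixels as absorbing nodes.
    feats = np.vstack([features, features[boundary_idx]])
    conn = np.zeros((n + m, n + m), dtype=bool)
    conn[:n, :n] = connected
    # Each duplicate is connected only to its original transient node.
    conn[np.arange(n, n + m), boundary_idx] = True
    conn[boundary_idx, np.arange(n, n + m)] = True

    P = transition_matrix(feats, conn, sigma2)  # Eqs. (4)-(5)
    # Force the duplicated nodes to be absorbing: P(i, i) = 1, P(i, j) = 0.
    P[n:, :] = 0.0
    P[np.arange(n, n + m), np.arange(n, n + m)] = 1.0

    z = absorbed_times(P, n)                    # Eq. (3)
    return z / z.max()                          # normalized z^f (assumption)
```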

Absorbing Markov Chain via Foreground Prior Nodes. We use boundary connectivity [38] to obtain the foreground prior \(\mathbf f _i\) without resorting to a top-down method:

$$\begin{aligned} f_i = \sum ^N_{j=1} (1 - \mathrm {exp}\big (-\frac{BC_j^2}{2\sigma _b^2}\big ))d_a(i,j)\mathrm {exp}\big (-\frac{d_s^2(i,j)}{2\sigma _s^2}\big ) \end{aligned}$$
(6)

where \(d_a(i,j)\) and \(d_s(i,j)\) denote the CIELAB color distance and the spatial distance, respectively, between superpixels i and j, and the boundary connectivity (BC) of superpixel i is defined as \(BC_i = \frac{\sum _{j\in \mathcal {H}}w_{ij}}{\sqrt{\sum ^N_{j=1}w_{ij}}}\) (illustrated in Fig. 3), with \(\sigma _b = 1\) and \(\sigma _s = 0.25\). Here \(\mathcal {H}\) denotes the boundary area of the image, \(w_{ij}\) is the similarity between nodes i and j, and N is the number of superpixels. Afterwards, the nodes with above-average prior values, \(\{i \mid f_i > avg(f)\}\), are selected to form the set \(H^2\), which is duplicated as the absorbing node set \(D^2\), with \(H^2,D^2 \subset V^2\). The graph \(G^2\) is constructed and the absorbed time z is calculated using Eq. 3. Finally, the background possibility of the transient nodes, \(z^b = \bar{z}(i), \ i=1,2,\cdots,n,\) is obtained, where \(\bar{z}\) denotes the normalized absorbed-time vector.
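A sketch of the foreground prior of Eq. (6) is given below. The inputs `color_dist`, `spatial_dist`, `W`, and `boundary_mask` are assumed to be precomputed pairwise distances, similarities, and a border indicator; they are illustrative names, not part of the original method's code.

```python
import numpy as np

def foreground_prior(color_dist, spatial_dist, W, boundary_mask,
                     sigma_b=1.0, sigma_s=0.25):
    """Foreground prior f_i from boundary connectivity, Eq. (6).

    color_dist, spatial_dist : (N, N) pairwise CIELAB / spatial distances
    W                        : (N, N) similarities w_ij between superpixels
    boundary_mask            : (N,) True for superpixels on the image border
    """
    # Boundary connectivity: BC_i = sum_{j in H} w_ij / sqrt(sum_j w_ij)
    bc = W[:, boundary_mask].sum(axis=1) / np.sqrt(W.sum(axis=1))

    bg_weight = 1.0 - np.exp(-bc**2 / (2 * sigma_b**2))   # high for background-like j
    spatial = np.exp(-spatial_dist**2 / (2 * sigma_s**2))

    # f_i = sum_j (1 - exp(-BC_j^2 / 2 sigma_b^2)) d_a(i,j) exp(-d_s^2(i,j) / 2 sigma_s^2)
    f = (color_dist * spatial * bg_weight[None, :]).sum(axis=1)

    # Nodes with above-average prior form the set H^2 of Sect. 3.2.
    H2 = np.where(f > f.mean())[0]
    return f, H2
```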

Fig. 3. An illustrative example of boundary connectivity: (a) input image, (b) superpixels of the input image, (c) superpixel similarity within each patch, (d) an illustrative example of boundary connectivity.

3.3 Saliency Optimization

In order to combine the different cues, we adopt the optimization model presented in [38], which fuses the background possibility and the foreground possibility into the final saliency map. The cost function is defined as

$$\begin{aligned} \sum _{i=1}^{N}z^b_{i}s_{i}^{2}+\sum _{i=1}^{N}z^f_{i}(s_{i}-1)^{2}+\sum _{i,j}w_{ij}(s_{i}-s_{j})^{2} \end{aligned}$$
(7)

where the first term encourages a superpixel i with large background probability \(z^b_i\) to take a small value \(s_i\) (close to 0), the second term encourages a superpixel i with large foreground probability \(z^f_i\) to take a large value \(s_i\) (close to 1), and the third term enforces smoothness so that saliency values vary continuously across adjacent superpixels.

In this work, three superpixel scales with \(N = 200, 250, 300\) superpixels are used, and the final saliency map is obtained by summing the maps over the scales: \(\mathbf S = \sum _h{S^h}\), where \(h = 1, 2, 3\) indexes the scale.
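Since the cost in Eq. (7) is quadratic in s, setting its gradient to zero yields a closed-form solution as a linear system. The sketch below solves that system; the exact constant in front of the smoothness term depends on whether the pairwise sum counts each pair once or twice, so the weighting used here is an assumption, as is the final rescaling to [0, 1].

```python
import numpy as np

def optimize_saliency(z_b, z_f, W):
    """Minimize the cost in Eq. (7) for the superpixel saliency values s.

    z_b, z_f : (N,) background / foreground possibilities
    W        : (N, N) symmetric edge weights w_ij (zero for unconnected pairs)

    Setting the gradient of Eq. (7) to zero gives
    (diag(z_b) + diag(z_f) + L) s = z_f,  with L = D - W the graph Laplacian
    (up to a factor of 2 depending on how the pairwise sum is counted).
    """
    L = np.diag(W.sum(axis=1)) - W
    s = np.linalg.solve(np.diag(z_b) + np.diag(z_f) + L, z_f)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)    # rescale to [0, 1]

# Multi-scale fusion: run the pipeline at N = 200, 250, 300 superpixels and
# sum the per-scale maps, S = sum_h S^h (per-scale maps assumed precomputed).
# S = sum(per_scale_saliency_maps)
```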

4 Simulation Results

The proposed method is evaluated on four benchmark datasets: ASD [1], CSSD [30], ECSSD [30], and SED [5]. The ASD dataset is a subset of the MSRA dataset and contains 1000 images with accurate human-labeled ground truth. The CSSD (complex scene saliency detection) dataset contains 200 complex images. The ECSSD dataset, an extension of CSSD, contains 1000 images with accurate human-labeled ground truth. The SED dataset has two parts, SED1 and SED2; images in SED1 contain one object and images in SED2 contain two objects, for a total of 200 images. We compare our model with 17 state-of-the-art saliency detection algorithms: CA [12], FT [1], SEG [20], BM [28], SWD [11], SF [19], GCHC [32], LMLC [29], HS [30], PCA [18], DSR [17], MC [14], MR [33], MS [24], RBD [38], RR [16], and MST [25]. The only tuning parameter in the proposed algorithm is the edge-weight parameter \(\sigma ^2=0.1\), which controls the strength of the weight between a pair of nodes.

The precision-recall (PR) curve and the F-measure are used as performance metrics. Precision is defined as the ratio of correctly detected salient pixels to all pixels in the extracted regions. Recall is defined as the ratio of correctly detected salient pixels to the number of ground-truth salient pixels. A PR curve is obtained by sliding a threshold from 0 to 255 over the computed saliency map and comparing the resulting binary map with the manually labeled ground truth. The F-measure is a weighted combination of precision and recall and can be regarded as an overall performance measure, given as:

$$\begin{aligned} F_{\beta } = \frac{(1+\beta ^{2})Precision \times Recall}{\beta ^{2}Precision + Recall}, \end{aligned}$$
(8)

where we set \(\beta ^{2} = 0.3\) to stress precision more than recall. The PR curves and F-measure curves are shown in Figs. 4, 5, 6 and 7, where the advantage of our proposed method over the 17 state-of-the-art methods is evident. Figure 8 presents visual comparisons on images selected from the four datasets. It can be seen that the proposed method achieves the best saliency results compared to the state-of-the-art methods.
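For completeness, the following is a generic sketch of how the PR curve and F-measure described above can be computed from a saliency map and its binary ground truth; it is not the benchmark code used in the paper.

```python
import numpy as np

def pr_curve_and_fmeasure(saliency, gt, beta2=0.3):
    """Precision/recall over thresholds 0..255 and the corresponding F-measure.

    saliency : 2-D array scaled to [0, 255]
    gt       : 2-D boolean ground-truth mask
    """
    precisions, recalls, fmeasures = [], [], []
    for t in range(256):
        pred = saliency >= t
        tp = np.logical_and(pred, gt).sum()
        precision = tp / max(pred.sum(), 1)    # correct pixels / extracted pixels
        recall = tp / max(gt.sum(), 1)         # correct pixels / ground-truth pixels
        f = ((1 + beta2) * precision * recall /
             max(beta2 * precision + recall, 1e-12))   # Eq. (8)
        precisions.append(precision)
        recalls.append(recall)
        fmeasures.append(f)
    return np.array(precisions), np.array(recalls), np.array(fmeasures)
```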

Fig. 4. PR curves and F-measure curves compared with different methods on the ASD dataset.

Fig. 5. PR curves and F-measure curves compared with different methods on the ECSSD dataset.

Fig. 6. PR curves and F-measure curves compared with different methods on the CSSD dataset.

Fig. 7. PR curves and F-measure curves compared with different methods on the SED dataset.

Fig. 8. Examples of output saliency maps produced by different algorithms on the ASD, CSSD, ECSSD and SED datasets.

5 Conclusion

In this paper, a saliency detection method based on a bidirectional absorbing Markov chain is proposed, considering both boundary information and foreground prior cues. An optimization model is then used to combine the background and foreground possibilities acquired through the bidirectional absorbing Markov chain. The proposed approach outperforms 17 state-of-the-art methods on four benchmark datasets, which demonstrates its effectiveness. In future work, we intend to apply the proposed saliency detection algorithm to problems such as multi-pose lip reading and audio-visual speech recognition.