1 Introduction

Fig. 1 Examples of the differences among SOD tasks in RGB-D images. From top to bottom: input RGB images, depth maps, and saliency ground truths. (a) RGB-D SOD. (b) RGB-D CoSOD. (c) RGB-D wCoSOD

Salient object detection (SOD) simulates the human visual attention mechanism to locate the visually attractive objects or regions in a scene [8], and has been applied to a large number of vision tasks, such as image classification [17], image retrieval [15], image compression [18], and image retargeting [13]. In addition, consistent with human stereo perception, RGB-D salient object detection introduces the depth cue on top of the RGB information to effectively suppress complex backgrounds and better highlight the salient objects, as shown in Fig. 1(a). Co-salient object detection (CoSOD) in multiple images has attracted extensive attention owing to its many successful applications. Inspired by the human inductive ability and collaborative processing mechanism, co-salient object detection in RGB-D images not only attends to the salient objects in each single image, but also determines the recurring ones across multiple related images by modeling the inter-image relationships. As shown in Fig. 1(b), RGB-D CoSOD aims to detect the common salient object that exists in all three RGB-D images, i.e., the green cartoon characters. Co-salient object detection within a single image is also of practical significance; it detects the salient objects with similar attributes in one image and is called within-image co-salient object detection (wCoSOD). As shown in Fig. 1(c), the salient and recurring objects within the image are the athletes in red rather than the athletes in blue. This task has many potential applications in computer vision, such as reducing information redundancy [37], synthesizing realistic animations from still images [35], and detecting multiple instances of an object class [19, 24]. The distinction between wCoSOD and CoSOD for RGB-D images lies in whether the processed data are multiple different images or a single image, so their corresponding modeling methods also differ. CoSOD on multiple images is performed over an image group, so the correspondences between objects in different images need to be considered when modeling, whereas wCoSOD looks for the co-salient objects in a single image, so the corresponding relationships among different objects within the image need to be modeled. Considering the effectiveness of depth information and the significance of the wCoSOD task, in this paper we introduce depth information into the wCoSOD task for the first time, construct a corresponding RGB-D wCoSOD benchmark dataset, and provide an unsupervised baseline model.

As mentioned above, existing studies on wCoSOD focus only on RGB images without the aid of depth information. Research on RGB-D wCoSOD is expected to further improve detection performance while enriching the SOD family. The first obstacle is that there is no publicly available RGB-D wCoSOD dataset, which hinders the development of this direction to a certain extent. We therefore first collect and annotate an RGB-D wCoSOD benchmark dataset, which lays a data foundation for subsequent algorithm research and performance evaluation. A total of 240 RGB-D images are collected, and the corresponding pixel-wise ground truths are manually annotated. In this dataset, 50% of the images contain both common salient objects and non-common salient objects that need to be suppressed, which makes it very challenging. In addition, we propose an unsupervised method that achieves RGB-D wCoSOD by considering cluster constraints and similarity matching. The RGB-D wCoSOD task is decomposed into two parts when constructing the model. By introducing the depth information, more accurate and compact saliency proposals are first determined. Then, a similarity constraint that incorporates the depth information and a cluster-based constraint between different proposals are designed to measure the correspondences and locate the co-salient objects.

The main contributions can be summarized as follows:

  • To the best of our knowledge, this paper focuses on the within-image co-salient object detection in RGB-D images for the first time, and constructs the first publicly available dataset for this task containing 240 RGB-D images.

  • With the help of depth information, we propose an unsupervised baseline model that formulates the correspondence as a consistency matching problem among different proposals under similarity and cluster-based constraints.

  • Experimental results on the collected dataset demonstrate that our method achieves competitive performance compared with the existing RGB-D SOD and RGB wCoSOD algorithms.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 details the proposed method. Section 4 introduces the construction of the RGB-D wCoSOD dataset. Section 5 presents the experimental results and analysis. Section 6 draws the conclusions.

2 Related work

In this section, we briefly review the SOD tasks. According to different application requirements and input data, SOD models can be further divided into RGB SOD [6, 36], RGB-D SOD [10, 23, 29, 31], CoSOD [5, 9, 14, 20], video SOD [22, 34], and so on. Among them, CoSOD aims to detect salient and recurring objects in an image group by considering the inter-image relationships, which is similar to our task in this paper. Based on the different inter-image modeling strategies, CoSOD methods can be roughly categorized into clustering based methods [14], matching based methods [5], depth-induced methods [9, 31], and learning based methods [20].

The difference from CoSOD is that wCoSOD has only one input image, so it is necessary to model the correspondence between different objects within an image, instead of modeling the constraint relationships between objects across different images. In view of the characteristics of wCoSOD, researchers have designed several methods for it. In [37, 38], an unsupervised bottom-up wCoSOD method was proposed that finds common and salient proposal groups through optimization and fusion. Song et al. [32] proposed a multi-scale multiple instance learning (MIL) model for wCoSOD. However, the RGB-D wCoSOD task has so far remained unexplored. Therefore, in this work, we not only collect a dataset suitable for this task, but also propose an unsupervised baseline model.

Fig. 2 The framework of the proposed RGB-D wCoSOD

3 Proposed method

In this paper, we propose an unsupervised wCoSOD model for RGB-D images. As shown in the framework of Fig. 2, our method consists of three parts. First, we determine the saliency attribute and generate the proposal candidates. Specifically, we apply an existing RGB-D SOD model (e.g., A2dele [29]) to obtain the initial saliency map \(S^{init}\), which includes all the salient objects without distinguishing their common attributes. At the same time, we use the Selective Search algorithm [33] to generate proposal candidates from the RGB-D images for common-attribute judgment, and abstract the RGB image I into superpixels \(R=\{r_i\}\), \(i=1,2,...,M\), through the SLIC method [2] to improve computational efficiency and structural representation. Then, the correspondence between proposals is calculated at the superpixel level based on the similarity constraint \(\omega_{RD}\) and the cluster-based constraint \(\omega_C\), yielding the initial co-saliency map \(S^{co-init}\). Finally, the initial saliency map \(S^{init}\) and the initial co-saliency map \(S^{co-init}\) are fused with weights to generate the final co-saliency map \(S^{co}\).
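
To make the preprocessing concrete, the following Python sketch (illustrative only, not the authors' released code) abstracts an RGB-D pair into SLIC superpixels with scikit-image and collects the per-superpixel mean L*a*b* color \(h_i\) and mean depth \(d_i\) used by the constraints in Section 3.2; the function name and the assumption that the depth map is normalized to [0, 1] are our own choices.

```python
import numpy as np
from skimage.color import rgb2lab
from skimage.segmentation import slic

def abstract_superpixels(rgb, depth, n_segments=500):
    """Abstract an RGB-D image into SLIC superpixels and collect the
    per-superpixel features used later: mean L*a*b* color h_i and mean depth d_i."""
    labels = slic(rgb, n_segments=n_segments, start_label=0)  # superpixel map R = {r_i}
    lab = rgb2lab(rgb)
    n_sp = labels.max() + 1
    h = np.zeros((n_sp, 3))   # mean Lab color vector per superpixel
    d = np.zeros(n_sp)        # mean depth per superpixel (depth assumed normalized to [0, 1])
    for i in range(n_sp):
        mask = labels == i
        h[i] = lab[mask].mean(axis=0)
        d[i] = depth[mask].mean()
    return labels, h, d
```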

3.1 Depth assisted proposal generation

Different from CoSOD in an image group, there is only one image in the wCoSOD task, without the concept of multiple images. Therefore, how to determine the common attributes among salient objects from a single image is the key to our task. Inspired by general CoSOD methods, we can model the correspondence between objects at the proposal level. For the object/region in a proposal, if it is salient and appears multiple times in the entire image, then it is a co-salient object that our task is looking for. Thus, we should generate proposal candidates to determine the common attributes between objects.

First, the initial L proposals are extracted from the input RGB image I via the Selective Search algorithm [33], denoted as \(P^{init}=\{P_n^{init}\}\), \(n=1,2,...,L\). Considering that only salient proposals are useful for our final task, we need to ensure that the proposals contain salient regions. Specifically, we set three rules to filter the initial proposals: (1) A proposal should have an appropriate size to avoid covering multiple objects or only a partial area of an object. (2) A proposal should be salient to guarantee the saliency attribute. (3) A proposal should have a larger depth value to eliminate background interference. The process can be formulated as:

$$\begin{aligned} P^{fin}=size\left( P^{init}\right) \cap sal\left( P^{init}\right) \cap dep\left( P^{init}\right) , \end{aligned}$$
(1)

where size() preserves proposals whose size is between 1% and 30% of the image, sal() removes proposals whose average saliency value is less than 0.2, and dep() selects the top 80% of proposals with the largest average depth values. The proposals that satisfy the above conditions are denoted as \(P^{fin}=\{P_n\}\), \(n=1,2,...,N\).
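
A minimal sketch of this filtering step is given below, assuming the initial proposals are axis-aligned boxes \((x, y, w, h)\) from any generator such as Selective Search; interpreting "size" as a fraction of the image area and realizing the three rules as an intersection of index sets are our reading of Eq. (1), not the authors' code.

```python
import numpy as np

def filter_proposals(boxes, sal_init, depth):
    """Filter the initial proposals P^init with the three rules of Eq. (1),
    realized as the intersection of three index sets.
    boxes: list of (x, y, w, h) proposals, e.g. from Selective Search.
    sal_init: initial saliency map S^init in [0, 1]; depth: depth map in [0, 1],
    where larger values are assumed to indicate the foreground."""
    if not boxes:
        return []
    H, W = sal_init.shape
    img_area = float(H * W)
    mean_sal = np.array([sal_init[y:y + h, x:x + w].mean() for x, y, w, h in boxes])
    mean_dep = np.array([depth[y:y + h, x:x + w].mean() for x, y, w, h in boxes])
    areas = np.array([w * h for x, y, w, h in boxes], dtype=float)

    size_ok = (areas >= 0.01 * img_area) & (areas <= 0.30 * img_area)  # rule 1: size 1%-30%
    sal_ok = mean_sal >= 0.2                                           # rule 2: average saliency >= 0.2
    dep_ok = mean_dep >= np.percentile(mean_dep, 20)                   # rule 3: top 80% by average depth

    keep = size_ok & sal_ok & dep_ok
    return [boxes[i] for i in np.flatnonzero(keep)]                    # P^fin = {P_n}, n = 1, ..., N
```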

3.2 Proposal-wise correspondence calculation

As mentioned earlier, capturing the corresponding relationships and determining the common attributes are the key points of a co-salient object detection model. In our proposed method, the correspondence is modeled as the consistency measurement of superpixels in different proposals through a similarity constraint and a cluster-based constraint. In fact, co-salient objects should have similar attributes in terms of color appearance, depth distribution, and category. Therefore, we design the similarity constraint, which considers color and depth information, and the cluster-based constraint to measure the correspondence at the proposal level.

Color information intuitively describes the content of an image, and depth information helps distinguish the foreground from the background. Therefore, when calculating the consistency of superpixels in different proposals, both color and depth information are considered to express the similarity constraint. We define the similarity between superpixels \(r_i^p\) and \(r_j^q\) as:

$$\begin{aligned} \omega _{RD}\left( r_i^p,r_j^q\right) =exp\left( -\frac{\parallel h_i^p,h_j^q\parallel _2+\lambda \mid d_i^p-d_j^q\mid }{\sigma ^2}\right) , \end{aligned}$$
(2)

where \(r_i^p\) represents the i-th superpixel in the p-th proposal, \(h_i^p\) denotes the mean color vector of superpixel \(r_i^p\) in the L*a*b* color space, \(\parallel h_i^p,h_j^q\parallel _2\) is the Euclidean distance between \(h_i^p\) and \(h_j^q\), \(d_i^p\) denotes the mean depth value of superpixel \(r_i^p\), \(\lambda\) is the depth confidence measure [10], and \(\sigma ^2\) is a parameter to control the strength of the similarity, which is fixed to 0.1.
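
The following sketch computes this similarity constraint directly from Eq. (2); \(\lambda\) is treated as an input since its computation follows [10], and the color and depth features are assumed to be normalized so that the distances are commensurate with \(\sigma^2=0.1\).

```python
import numpy as np

def similarity_constraint(h_i, d_i, h_j, d_j, lam, sigma2=0.1):
    """Similarity constraint between superpixels r_i^p and r_j^q (Eq. 2).
    h_i, h_j: mean L*a*b* color vectors; d_i, d_j: mean depth values;
    lam: depth confidence measure lambda from [10]; sigma2: bandwidth, fixed to 0.1."""
    color_dist = np.linalg.norm(np.asarray(h_i) - np.asarray(h_j))  # Euclidean color distance
    depth_dist = abs(d_i - d_j)                                     # absolute depth difference
    return float(np.exp(-(color_dist + lam * depth_dist) / sigma2))
```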

In an image, the salient objects can be further divided into co-salient objects and non-co-salient objects, where the co-salient objects belong to the same category and can be clustered together. Thus, we propose a cluster-based constraint to define the correspondence. Different from previous work [9, 14], we consider the degree of correlation with respect to the dominant cluster of a superpixel. Specifically, we first apply the k-means method [3] to group the image into K clusters. Then, we define the class probability of superpixel \(r_i\) as:

$$\begin{aligned} c_k^i=n_k^i/\sum \limits _{k'=1}^K n_{k'}^i , \end{aligned}$$
(3)

where \(n_k^i\) represents the number of pixels belonging to the k-th class in superpixel \(r_i\). \(c_k^{i}\) represents the probability that superpixel \(r_i\) belongs to the cluster k. Then, the cluster correlation between superpixel \(r_i^p\) and \(r_j^q\) is defined as the class probability of the superpixel \(r_j^q\) in the q-th proposal with consistent clustering attribute:

$$\begin{aligned} \omega _{C}\left( r_i^p,r_j^q\right) =c_{mk}^{j,q}, \end{aligned}$$
(4)

where \(mk=\mathrm{argmax}_{k}\left( c_{k}^{i,p}\right)\) is the clustering index of superpixel \(r_i^p\) in the p-th proposal, i.e., the cluster with the maximum class probability. In summary, \(\omega_{RD}\) reflects the degree of similarity between superpixels, and \(\omega_{C}\) measures the likelihood that two superpixels belong to the same class. Thus, the initial co-saliency of a superpixel is computed as the weighted sum of the initial saliency of the corresponding superpixels in the other proposals, which is formulated as:

$$\begin{aligned} S^{co-init}\left( r_i^p\right) = \frac{1}{N-1}\underset{q\ne p}{\sum \limits _{q=1}^{N}}\frac{1}{N_q}\sum \limits _{j=1}^{N_q}\lambda _1\omega _{RD}\left( r_i^p,r_j^q\right) \cdot \lambda _2\omega _{C}\left( r_i^p,r_j^q\right) \cdot S^{init}\left( r_j^q\right) , \end{aligned}$$
(5)

where \(\lambda _1\) and \(\lambda _2\) are weighting coefficients that can be adjusted according to different scenarios and needs; without loss of generality, they are both set to 1 in the experiments. N is the number of proposals in the image I, and \(N_q\) is the number of superpixels in the q-th proposal \(P_q\).
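
A possible implementation of Eqs. (3)-(5) is sketched below: pixel-level k-means (with scikit-learn) yields the class probabilities \(c_k^i\), and the initial co-saliency of each superpixel is accumulated over all other proposals. The data layout (a list of superpixel-index arrays per proposal, per-superpixel initial saliency values) is an assumption on our part, and a superpixel shared by several proposals simply keeps the last value computed, which is a simplification.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_probabilities(rgb, labels, n_sp, K=6):
    """Cluster the image pixels into K classes with k-means and compute the
    class probability c_k^i of every superpixel (Eq. 3)."""
    cluster_idx = KMeans(n_clusters=K, n_init=10).fit_predict(
        rgb.reshape(-1, 3).astype(np.float64)).reshape(labels.shape)
    c = np.zeros((n_sp, K))
    for i in range(n_sp):
        counts = np.bincount(cluster_idx[labels == i], minlength=K)
        c[i] = counts / counts.sum()              # c_k^i = n_k^i / sum_k' n_k'^i
    return c

def initial_cosaliency(proposal_sps, h, d, c, s_init_sp, lam,
                       sigma2=0.1, lam1=1.0, lam2=1.0):
    """Initial co-saliency of every superpixel (Eq. 5), combining the similarity
    constraint (Eq. 2) and the cluster-based constraint (Eq. 4).
    proposal_sps: list of superpixel-index arrays, one per filtered proposal P_q.
    s_init_sp: initial saliency S^init averaged over each superpixel."""
    N = len(proposal_sps)
    s_co = np.zeros_like(s_init_sp, dtype=float)
    if N < 2:
        return s_co
    for p in range(N):
        for i in proposal_sps[p]:
            mk = int(np.argmax(c[i]))             # dominant cluster of r_i^p
            acc = 0.0
            for q in range(N):
                if q == p:
                    continue
                inner = 0.0
                for j in proposal_sps[q]:
                    w_rd = np.exp(-(np.linalg.norm(h[i] - h[j])
                                    + lam * abs(d[i] - d[j])) / sigma2)  # Eq. (2)
                    w_c = c[j, mk]                # Eq. (4): class probability of r_j^q for cluster mk
                    inner += lam1 * w_rd * lam2 * w_c * s_init_sp[j]
                acc += inner / len(proposal_sps[q])
            s_co[i] = acc / (N - 1)
    return s_co
```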

3.3 Within-image co-saliency map generation

Finally, the initial saliency map \(S^{init}\) and the initial co-saliency map \(S^{co-init}\) are integrated through intersection and union operation to generate the final co-saliency map \(S^{co}\). The intersection better suppresses the interference area, and the union better ensures the consistency of the saliency map. The formula is represented as:

$$\begin{aligned} S^{co}=\frac{1}{2}\gamma _1\cdot \left( S^{init}+S^{co-init} \right) +\gamma _2\cdot \left( S^{init}\cdot S^{co-init}\right) , \end{aligned}$$
(6)

where \(\gamma _1\) and \(\gamma _2\) are the weighting coefficients, which are both set to 0.5 in the experiments without loss of generality.
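
This fusion is a one-liner; the sketch below follows Eq. (6) directly, assuming both maps take values in [0, 1].

```python
def fuse_saliency(s_init, s_co_init, gamma1=0.5, gamma2=0.5):
    """Fuse the initial saliency and co-saliency maps into S^co (Eq. 6):
    the averaged sum (union) keeps the consistency of the map, and the
    product (intersection) suppresses interfering regions."""
    return 0.5 * gamma1 * (s_init + s_co_init) + gamma2 * (s_init * s_co_init)
```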

4 Dataset

Fig. 3 Sample RGB images, depth images, and their corresponding ground-truth images. (a) Images with interfering but salient objects. (b) Images whose wCoSOD ground truths are consistent with the single-image SOD results. (c) Images without any within-image co-salient object

Existing public image datasets used to evaluate RGB-D salient object detection, such as DUT-RGBD [28], NLPR [27], and NJUD [21], are mainly designed for single-image RGB-D SOD. In most cases, each image contains only one salient object, which is annotated as the ground truth. In this paper, our goal is different: to detect the co-salient objects within an RGB-D image, which are absent from most images in these public datasets. Therefore, we collect a new benchmark dataset consisting of 240 RGB-D images for co-salient object detection within an RGB-D image, with the corresponding pixel-wise ground truths. In the dataset, 221 RGB-D images are selected from current RGB-D datasets [7, 12, 21, 25, 26, 27, 39], and the remaining 19 RGB-D images are collected from the RGB within-image co-saliency dataset [37], with their depth maps generated by the depth estimation method [16]. For the within-image co-saliency annotation, we follow the work in [9] to generate the ground truth. Some visual examples are shown in Fig. 3. In this dataset, some images contain interfering but salient objects (Fig. 3(a)), the wCoSOD results of some images are consistent with the single-image SOD results (Fig. 3(b)), and a few images contain no within-image co-salient object at all (Fig. 3(c)). Moreover, our dataset covers both indoor and outdoor scenes, the object types are relatively diverse, and the backgrounds are relatively complex, which makes our dataset very challenging. The dataset will be released after the paper is accepted.

5 Experiments

5.1 Experimental metrics and settings

To evaluate the quality of the model, we assess the performance of our method on the collected dataset with the precision-recall (PR) curve, F-measure [1], S-measure [11], and MAE [8]. Precision is the percentage of correctly assigned salient pixels in the detected saliency map, and recall is the ratio of detected salient pixels to those in the ground truth. The PR curve is drawn from the paired precision and recall values; the closer the PR curve is to (1,1), the better the performance of the algorithm. The F-measure is the weighted harmonic mean of precision and recall, measuring both as a whole:

$$\begin{aligned} F_\beta =\frac{\left( 1+\beta ^2\right) Precision\times Recall}{\beta ^2\times Precision+Recall}, \end{aligned}$$
(7)

where the balance parameter \(\beta ^2\) is set to 0.3 as suggested in [4].
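
The following sketch computes precision, recall, and \(F_\beta\) for a single saliency map; the adaptive threshold (twice the mean saliency) is a common choice in the SOD literature and is assumed here, since the paper does not state the binarization scheme.

```python
import numpy as np

def f_measure(sal_map, gt, beta2=0.3):
    """Precision, recall and F-measure of Eq. (7) for one saliency map.
    sal_map: predicted saliency in [0, 1]; gt: binary ground truth."""
    pred = sal_map >= 2.0 * sal_map.mean()    # adaptive threshold (assumed, not from the paper)
    gt = gt > 0.5
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    f_beta = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
    return precision, recall, f_beta
```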

The mean absolute error (MAE) calculates the average absolute error between the detected saliency map S and the ground truth G.

$$\begin{aligned} MAE=\frac{1}{W\times H}\sum \limits _{x=1}^{W}\sum \limits _{y=1}^{H}\mid S(x,y)-G(x,y)\mid , \end{aligned}$$
(8)

where W and H are the width and height of the image respectively. Obviously, the smaller the MAE value, the better the performance of the algorithm.
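
MAE can be computed directly from Eq. (8), as sketched below for a single image (both maps assumed to share the same resolution and value range).

```python
import numpy as np

def mae(sal_map, gt):
    """Mean absolute error between saliency map S and ground truth G (Eq. 8)."""
    s = sal_map.astype(np.float64)
    g = gt.astype(np.float64)
    return np.abs(s - g).mean()   # average over all W x H pixels
```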

The S-measure value combines the regional perception (\(S_r\)) and object perception (\(S_o\)) to evaluate the structural similarity between the detected saliency map and ground truth.

$$\begin{aligned} S_m=\alpha \cdot S_o+(1-\alpha )\cdot S_r , \end{aligned}$$
(9)

where \(\alpha\) is set to 0.5 as suggested in [11].

In the experiments, we apply the A2dele [29] model to obtain the initial saliency map. The number of superpixels M is set to 500, and the number of clusters K is set to 6. The experiments are conducted on a 1.6 GHz Intel i5 CPU with 8 GB of RAM using MATLAB 2016a, and the average running time is 43 seconds.

5.2 Performance comparison

Fig. 4 Comparison of visual results. (a) RGB image. (b) Depth image. (c) Ground truth. (d) DF. (e) DCMC. (f) CDCP. (g) DMRA. (h) A2dele. (i) CDS. (j) Ours

Table 1 The S-measure, MAE and F-measure of different models

We compare our method with six methods, i.e., DF [30], DCMC [10], CDCP [40], DMRA [28], A2dele [29], and CDS [37]. CDS is an unsupervised RGB wCoSOD model, and DCMC and CDCP are unsupervised RGB-D SOD models. DF, DMRA, and A2dele are deep learning based RGB-D SOD models: DF is trained on the NLPR [27] and NJUD [21] datasets, while DMRA (2019) and A2dele (2020) are trained on the DUT-RGBD [28], NLPR [27], and NJUD [21] datasets. These training datasets are designed for the RGB-D SOD task and consist of paired RGB and depth images. Because there was no RGB-D wCoSOD dataset before this work and the core of our work is a tentative exploration of a new task, the scale of the first dataset we constructed is relatively small, and it is mainly used for the design of an unsupervised RGB-D wCoSOD model. Furthermore, the comparison algorithms in the experiment are not retrained; better performance should be obtainable by retraining the RGB-D SOD models on a large-scale RGB-D wCoSOD dataset. Collecting larger-scale data to promote the development of deep learning based RGB-D wCoSOD models is also a direction of future efforts.

Figure 4 shows the visualized results of different SOD methods. As can be seen, the RGB-D SOD models cannot effectively detect the co-salient regions within an image due to the lack of correspondence modeling. For example, in the last image of Fig. 4, the non-common but salient object (i.e., the white dog in the middle) cannot be effectively suppressed by the A2dele method [29]. The PR curves of different methods on the proposed dataset are shown in Fig. 5; our model achieves higher precision over the whole PR curve. Compared with CDS, our method better highlights the common regions by introducing depth information and the correspondence strategy. Table 1 shows the quantitative comparisons of different methods. The proposed method outperforms the other methods on all measurements. Specifically, for the F-measure score, our method reaches 0.7000, improving over the second best algorithm (i.e., A2dele) by 2.5%. These measurements testify to the superiority of our method.

Fig. 5 PR curves of different methods on the proposed dataset

Fig. 6 PR curves of the ablation study on different mechanisms of constraints

Table 2 Ablation study on different mechanisms of constraints

5.3 Ablation study

The proposed model uses two constraints, namely the similarity constraint and the cluster-based constraint, when calculating the co-saliency value. In order to verify the effectiveness of the proposed model, we conducted ablation experiments. The PR curves of the ablation study on the different constraint mechanisms are shown in Fig. 6, from which it can be seen that the different constraint modules are effective. The quantitative comparison is shown in Table 2, where "-s" represents using the cluster-based constraint only, and "-c" represents using the similarity constraint only. From the perspective of quantitative indicators, the performance degrades when only one constraint is used to measure the correspondence. Specifically, the S-measure score is 0.7692 when the cluster-based constraint is used alone. When both constraints are used, the S-measure score reaches 0.7728, an increase of 0.5%. All these experiments demonstrate the effectiveness of the designed modules.

6 Conclusion

In this paper, we address the problem of within-image co-salient object detection (wCoSOD) in a single RGB-D image for the first time. We collect a corresponding benchmark dataset and propose an unsupervised baseline method. Our method decomposes the wCoSOD task into salient object proposal generation and correspondence modeling that combines the similarity and cluster-based constraints. The experimental results demonstrate the effectiveness of our method. In the future, a larger-scale dataset can be constructed and learning based methods can be designed to further improve the performance.