1 Introduction

Salient object detection (SOD) aims at locating the most attractive regions in an image. As a pre-processing technique, SOD benefits a variety of computer vision tasks including re-identification [1], image understanding [2], object tracking [3] and video object segmentation [4], to name a few.

In the past years, CNN-based (Convolutional Neural Network) methods have achieved excellent performance on SOD tasks due to their powerful ability to extract and represent features. Most of them [5,6,7,8,9] focus on extracting features from RGB images to detect salient objects. However, it is difficult to accurately locate salient objects using single-modal data alone in some challenging and complex scenarios, such as similar appearance between the foreground and background or a cluttered background.

Recently, with the development of depth cameras, depth information can be easily obtained, and various RGB-D SOD methods [10,11,12,13,14,15,16, 48] have emerged. Depth maps provide SOD with location and spatial structural information and help the network accurately locate salient objects against complex backgrounds.

Fig. 1
figure 1

a Existing methods mostly exploit cross-modal complementarity through a two-stream architecture using two backbones. b Our proposed method, which designs a lower-complexity depth branch instead of a backbone network; details are illustrated in Fig. 3

As shown in Fig. 1a, existing RGB-D SOD methods [10,11,12,13,14] mostly explore RGB-D data with a traditional two-stream architecture, in which an extra backbone network is required to process the depth information. This architecture brings additional computation and memory consumption, and the resulting model has a huge number of parameters, which hinders its practical application. To solve this issue, we design a simpler and more efficient depth branch to deal with depth information, as shown in Fig. 1b, instead of using an extra backbone network. Because the depth map itself contains rich location and spatial structure information, there is no need to extract depth information through a complex network. Therefore, we design a simpler depth branch instead of a complex backbone network and treat the depth features extracted by this branch as depth prior maps to be fused with RGB features. In our network, the depth data are divided by the improved depth branch into three pieces at different scales, which are fused with the multi-scale RGB features extracted from the backbone network. The depth map only participates in the high-level stages of the network, which reduces the model parameters and calculation cost. We also propose a guided residual module (GRM) to integrate features from the cross-modal RGB-D data seamlessly in a multi-scale and channel-wise manner.

As mentioned above, depth plays an important role in RGB-D SOD and provides the network with location and spatial structural information. However, due to the immaturity of depth acquisition technology, depth maps are sometimes inaccurate and would contaminate the SOD results. Previous works generally integrate the RGB and depth information in an indiscriminate manner, which may produce negative results when encountering inaccurate or blurred depth maps. Hence, we design a depth correction module (DCM) that introduces an adaptive weight for each depth map to filter out unreliable ones. When the DCM judges a depth map to be of low quality, it assigns a low weight to that depth map to mitigate its negative impact.

In summary, our main contributions are listed as follows:

  1. (1)

    Instead of employing a backbone network, we design a simpler and more efficient depth branch to extract complementary features, which are fused with RGB features for guided refinement.

  2. (2)

    We design the DCM to judge the quality of depth maps and assign an adaptive weight to each depth map, mitigating the negative influence of unreliable depth maps.

  3. (3)

    Experiments show competitive performance against 13 state-of-the-art methods on 7 datasets, with particular advantages in efficiency (102 FPS) and compactness (64.2 MB).

2 Related work

2.1 RGB salient object detection

In recent years, we have witnessed the rapid development of SOD for RGB images. Numerous models have been presented to explore SOD in terms of boundaries, feature fusion, multi-supervision, pooling, etc. Su et al. [5] proposed a boundary-aware network with successive dilation to enhance feature selectivity at boundaries while guaranteeing the features at interiors; Zhao et al. [6] employed a pyramid feature attention network to focus on effective high-level context features and low-level spatial structural features; Zheng et al. [7] used multi-source weak supervision for saliency detection; Liu et al. [8] expanded the role of pooling in CNNs to detect saliency; Zhao et al. [9] focused on the complementarity between salient edge information and salient object information. However, when facing SOD in challenging and complex scenes, such as low contrast and multiple salient objects, single-modal SOD models do not perform well.

2.2 RGB-D salient object detection

The pioneering work for RGB-D SOD was produced by Niu et al. [10], who introduced disparity contrast and domain knowledge into stereoscopic photography to measure stereo saliency. After Niu's work, various handcrafted features originally applied to RGB SOD were extended to RGB-D, such as center-surround difference, contrast and background enclosure.

Fig. 2
figure 2

The schematic illustration of our proposed network. We design an efficient depth branch to improve depth feature learning, the MSRF to capture more multi-scale context, the DCM to reduce the negative impact of low-quality depth maps and the GRM to fuse RGB-D features, which are introduced in Sects. 3.1 to 3.4, respectively

In the past five years, deep learning-based RGB-D methods have achieved outstanding performance due to the powerful ability of CNNs in extracting salient object representations. Many methods adopt a two-stream architecture which uses two backbone networks (e.g., VGG [17], ResNet [18]) to explore the mining and fusion of cross-modal RGB-D information. Zhu et al. [11] employed an independent encoder network to take advantage of the location and spatial structural information in depth maps and assist the RGB-stream network. Chen et al. [42] exploited the cross-modal complement across all levels with a complementarity-aware fusion module based on a two-stream structure. Chen et al. [34] proposed a multi-scale multi-path fusion network with cross-modal interactions to enable sufficient and efficient fusion.

Recently, Piao et al. [12] proposed an adaptive and attentive depth distiller to transfer depth information from the depth stream to the RGB stream. Hence, their network needs no depth maps at test time, which promotes the practical application of RGB-D SOD approaches. Li et al. [14] proposed an attention-steered interweave fusion network to detect salient objects, which progressively integrates cross-modal and cross-level complementarity from RGB-D images via the steering of an attention mechanism. Fu et al. [15] assumed that RGB data and depth information share commonalities and proposed a single backbone network to learn from both RGB and depth inputs. Chen et al. [13] introduced a depth potentiality-aware mechanism to explicitly model the potentiality of the depth map and effectively integrate the cross-modal complementarity of RGB-D data. Zhang et al. [16], inspired by the saliency data labeling process, proposed a probabilistic RGB-D saliency detection network based on conditional variational autoencoders to model human annotation uncertainty and generate multiple saliency maps for each input image by sampling in the latent space.

However, most of the above RGB-D approaches [10,11,12,13,14, 16, 34, 42] focus on the way of integrating cross-modal RGB-D information and employ two pre-trained backbone networks to detect salient objects, which requires an additional backbone network to process the depth data. Different from them, we improve the two-stream network by designing a simpler and more efficient depth branch, which achieves lower computation cost and memory consumption than a traditional two-stream network.

Fig. 3
figure 3

The structure of the proposed depth branch. “Conv” means a \({3 \times 3}\) convolutional layer. Residual module is denoted as “Resmod”

3 The proposed network

3.1 Architecture overview

As shown in Fig. 2, our proposed network is an asymmetric two-stream end-to-end architecture that employs ResNet-50 [18] as the RGB branch. We design a simple and efficient depth branch to extract depth features, which avoids employing an additional backbone network to process depth data and reduces the extra calculation cost and memory consumption.

The structure of the proposed depth branch is illustrated in Fig. 3. It uses one \({3 \times 3}\) convolutional layer and three residual modules to deal with the depth data. Each residual module is composed of two \({3 \times 3}\) convolutional layers. Compared with a backbone network such as VGG-16 [17], our proposed depth branch has fewer convolutional layers and lower complexity. However, this does not mean that the performance is decreased: because of the rich information in the depth map, a simple depth branch is sufficient to extract the complementary features, which is verified in the ablation experiments. By improving the way depth features are learned, we reduce the complexity of the depth branch and use fewer convolutional layers to achieve better performance.
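
To make the branch concrete, the following PyTorch sketch follows the structure in Fig. 3 under stated assumptions: the single-channel depth input, the width of 64 feature channels, the placement of ReLU activations and the absence of explicit downsampling (the three outputs would later be resized to match the RGB stages) are our choices for illustration, not details taken from the released model.

```python
import torch
import torch.nn as nn

class ResMod(nn.Module):
    """Residual module: two 3x3 convolutions with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

class DepthBranch(nn.Module):
    """Lightweight depth branch: one 3x3 conv followed by three residual modules.
    The output of each residual module serves as one of the three depth priors
    fused with the high-level RGB features."""
    def __init__(self, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)  # single-channel depth input (assumed)
        self.resmods = nn.ModuleList([ResMod(channels) for _ in range(3)])

    def forward(self, depth):
        feats = []
        x = self.stem(depth)
        for m in self.resmods:
            x = m(x)
            feats.append(x)   # three depth priors at increasing depth
        return feats
```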

Since the quality of depth maps may vary due to the limitations of depth sensors, we design the DCM to assign an adaptive weight to each depth map by judging its quality before concatenating depth and RGB features. After that, we divide the depth maps into multi-scale maps to be fused with the multi-level RGB features extracted from the backbone network. To gain more high-level semantic information, we adopt a multi-scale receptive field (MSRF) module on top of the backbone network to enlarge the receptive field of the network. On the top-down pathway, we insert multiple GRMs to help integrate depth maps and RGB features seamlessly and acquire predictions in a multi-scale and channel-wise manner. A sketch of the overall data flow is given below; in what follows, we describe the structures of the above-mentioned components and explain their functions in detail.
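
The sketch below summarizes this data flow using the components sketched in the rest of this section. The stage indices, the bilinear resizing of depth priors and the use of concatenation fusion at every fused stage are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRNet(nn.Module):
    """High-level sketch of the data flow; submodules are passed in as placeholders."""
    def __init__(self, rgb_backbone, depth_branch, msrf, grms):
        super().__init__()
        self.rgb_backbone = rgb_backbone   # e.g., ResNet-50 returning multi-level features
        self.depth_branch = depth_branch   # lightweight branch of Sect. 3.1
        self.msrf = msrf                   # Sect. 3.2
        self.grms = nn.ModuleList(grms)    # one GRM (Sect. 3.4) per fused stage, coarse to fine

    def forward(self, rgb, depth, depth_weight):
        rgb_feats = self.rgb_backbone(rgb)                 # list of features, shallow to deep
        priors = self.depth_branch(depth * depth_weight)   # DCM-corrected depth (Sect. 3.3)
        pred = self.msrf(rgb_feats[-1])                    # coarse single-channel prediction
        # Top-down refinement: depth priors only join the three high-level stages.
        for grm, feat, prior in zip(self.grms,
                                    reversed(rgb_feats[-3:]),
                                    reversed(priors)):
            prior = F.interpolate(prior, size=feat.shape[2:],
                                  mode='bilinear', align_corners=False)
            fused = torch.cat([feat, prior], dim=1)        # concatenation fusion (Sect. 4.3.2)
            pred = grm(fused, pred)                        # guided refinement
        return pred
```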

3.2 Multi-scale receptive field module

The last convolutional layer of the backbone has a strong ability to capture semantic information. Therefore, we adopt it for global saliency perception to obtain a coarse prediction. The scale of salient objects varies from large to small, which implies that the model needs to capture information at different contexts in order to detect objects reliably. However, [43, 44] show that the empirical receptive fields of CNNs are much smaller than the theoretical ones, especially for deeper layers, so the receptive field of the whole network is not large enough to capture the multi-scale context of the input images. Several outstanding structures have been proposed to solve this issue, such as PPM (Pyramid Pooling Module [43]), ASPP (Atrous Spatial Pyramid Pooling [45]) and RFB (Receptive Field Block [46]). Different from these parallel concatenation-based methods, we build a hierarchical multi-scale receptive field module to sequentially aggregate the multi-scale contexts. By adding a skip connection between adjacent parallel branches, we not only make pixel sampling denser but also provide a larger receptive field.

Fig. 4
figure 4

The structure of the MSRF module. “C” denotes the concatenation operation. \({k \times k,d}\) represents a convolutional layer with kernel size k and dilation rate d

3.2.1 Architecture details

Specifically, we first reduce the number of channels to 64 to save memory. Then, we add four separate branches to capture multi-scale context cues, inspired by RFB [46]. Each branch consists of two convolutional layers: the first layer is a standard convolution with a \(1\times 1\), \(3\times 3\), \(5\times 5\) or \(7\times 7\) kernel for dense sampling, and the second layer is a dilated \(3\times 3\) convolution with dilation rate {1, 2, 4, 6} for sparse sampling. Different from previous works, in each branch except the first, the input of each layer is added with the output of the corresponding layer of the previous branch. Finally, we concatenate the branch outputs and feed them into a \(3\times 3\) convolutional layer to generate the single-channel coarse prediction. The whole architecture is illustrated in Fig. 4.
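
A minimal PyTorch sketch of the MSRF module is given below. The channel width of 64, the kernel sizes and the dilation rates follow the text; the use of a 1×1 convolution for channel reduction and our reading of where the previous branch's outputs are added are assumptions.

```python
import torch
import torch.nn as nn

class MSRF(nn.Module):
    """Hierarchical multi-scale receptive field module (sketch)."""
    def __init__(self, in_ch, mid_ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, 1)           # channel reduction to 64
        ks = [1, 3, 5, 7]                                   # dense-sampling kernel sizes
        ds = [1, 2, 4, 6]                                   # dilation rates for sparse sampling
        self.dense = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, k, padding=k // 2) for k in ks])
        self.sparse = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, 3, padding=d, dilation=d) for d in ds])
        self.fuse = nn.Conv2d(4 * mid_ch, 1, 3, padding=1)  # single-channel coarse map

    def forward(self, x):
        x = self.reduce(x)
        outs = []
        prev_dense, prev_sparse = None, None
        for dense, sparse in zip(self.dense, self.sparse):
            # Each layer's input is summed with the same layer of the previous branch.
            d_out = dense(x if prev_dense is None else x + prev_dense)
            s_out = sparse(d_out if prev_sparse is None else d_out + prev_sparse)
            outs.append(s_out)
            prev_dense, prev_sparse = d_out, s_out
        return self.fuse(torch.cat(outs, dim=1))
```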

3.3 Depth correction module

The existing approaches [10, 12, 14, 15, 34, 42] generally integrate RGB and depth features in an undifferentiated manner. However, because depth maps are obtained in different ways, their quality is uneven. Low-quality depth maps cannot guide the network to learn salient regions and instead have a negative effect on the predictions. To solve this issue, we design the DCM to assign an adaptive weight to each depth map by judging its quality, which reduces the negative impact of low-quality depth maps on the prediction results. If the quality of a depth map is high, the DCM assigns a high weight to it, and vice versa.

Usually, a high-quality depth map shares many similarities with the ground truth. Based on this, we employ the Structural Similarity (SSIM [19]) function to calculate the similarity between the depth map and the ground truth. SSIM measures the luminance, contrast and structural similarity between two images, and is defined as:

$$\begin{aligned} SSIM(x,y) = \frac{{(2{\mu _x}{\mu _y} + {C_1})(2{\sigma _{xy}} + {C_2})}}{{(\mu _x^2 + \mu _y^2 + {C_1})(\sigma _x^2 + \sigma _y^2 + {C_2})}} \end{aligned}$$
(1)

where \({C_1}\) and \({C_2}\) are two constants that prevent the denominator from being 0, \({\mu _x}\) and \({\mu _y}\) denote the means of the pixel values, \({\sigma _x}\) and \({\sigma _y}\) represent the standard deviations of the pixel values, and \({\sigma _{xy}}\) is the covariance between x and y.

Then, we use the value calculated by the SSIM function as an adaptive weight and apply it to the depth map to obtain an improved depth map. In addition to screening the quality of depth maps, the DCM can also filter out redundant information: when a depth map is of low quality, the weight assigned by the DCM is low, which means less of its information participates in the network computation.
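
The following sketch illustrates how such an adaptive weight can be computed and applied. It assumes a global (non-windowed) SSIM over the whole image with the common default constants, so it should be read as an approximation of the DCM rather than its exact implementation.

```python
import torch

def ssim_weight(depth, ref, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global SSIM between a depth map and a reference map (the saliency ground
    truth during training), used as the DCM's adaptive weight.
    Inputs: tensors in [0, 1] of shape (B, 1, H, W)."""
    mu_d = depth.mean(dim=(2, 3), keepdim=True)
    mu_r = ref.mean(dim=(2, 3), keepdim=True)
    var_d = depth.var(dim=(2, 3), keepdim=True, unbiased=False)
    var_r = ref.var(dim=(2, 3), keepdim=True, unbiased=False)
    cov = ((depth - mu_d) * (ref - mu_r)).mean(dim=(2, 3), keepdim=True)
    ssim = ((2 * mu_d * mu_r + c1) * (2 * cov + c2)) / \
           ((mu_d ** 2 + mu_r ** 2 + c1) * (var_d + var_r + c2))
    return ssim.clamp(0, 1)

def correct_depth(depth, ref):
    """Reweight the depth map: a low-quality map receives a small weight."""
    return ssim_weight(depth, ref) * depth
```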

Several existing methods have also been proposed to handle the issue of uneven-quality depth maps. Chen and Huang [13] concatenate high-level RGB-D features and compute a depth weight between the RGB-D features and the ground truth. Zhang et al. [16] design a sub-network to refine depth maps. Although these methods alleviate the negative effects of low-quality depth maps to some extent, they increase the complexity of the network and the running time of the model. Different from [13, 16], our proposed DCM can be treated as a pre-processing operation because it uses the original RGB-D data to calculate the adaptive depth weight, which brings little extra computing cost to the network.

3.4 Guided residual module

After obtaining the multi-level RGB-D features, we need to fuse them to acquire the saliency prediction. Low-level features contain rich detail information, such as boundary, texture and spatial structure information, while high-level features capture rich semantic information. We employ a U-shape architecture to combine the multi-level RGB-D features. However, one problem with this type of U-shape architecture is that the high-level features are gradually diluted as they are transmitted to the shallow layers. To solve this issue and help the network integrate RGB-D features seamlessly, we design the GRM to mitigate the dilution of confident semantic information with the help of top-down guidance.

As can be seen in Fig. 5, the N-channel feature maps are first split into N non-overlapping groups, each of which consists of a 1-channel feature map. Then, the side-output prediction is used as a guidance map and concatenated with each 1-channel feature map, yielding 2N-channel feature maps in total. Several \({3 \times 3}\) convolutional layers are applied for guided learning to obtain a 1-channel feature map, which is added to the input guidance map to form the new side-output prediction. By concatenating the high-level prediction with each channel of the side-output features, our approach relieves the dilution of high-level semantic information during the feature fusion process of the U-Net architecture.
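
A compact PyTorch sketch of this guided refinement is shown below. The number of convolutional layers used for guided learning (two here) and the bilinear upsampling of the guidance map are assumptions; since a standard convolution is order-invariant over input channels, the guidance copies are simply appended after the feature channels instead of being interleaved channel by channel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRM(nn.Module):
    """Guided residual module (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.guided = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, feat, guidance):
        # Upsample the coarser side-output prediction to the current resolution.
        guidance = F.interpolate(guidance, size=feat.shape[2:],
                                 mode='bilinear', align_corners=False)
        # Pair every 1-channel feature map with the guidance map: N -> 2N channels.
        n = feat.shape[1]
        paired = torch.cat([feat, guidance.expand(-1, n, -1, -1)], dim=1)
        # Guided learning yields a 1-channel residual, added to the guidance
        # to form the refined side-output prediction.
        return self.guided(paired) + guidance
```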

Fig. 5
figure 5

The construction of the GRM. \( {up \times 2 \& concate}\) represents an upsampling operation with a factor of 2 followed by concatenation. \({H \times W \times N}\) denotes a feature map of height H, width W and N channels

3.5 Loss function

We apply the binary cross-entropy (BCE) loss [20] and the intersection over union (IoU) loss [21] to optimize the network, where BCE and IoU constrain the saliency prediction at the pixel level and the image level, respectively.

In SOD, the BCE loss is commonly employed to measure the relation between the predicted saliency map and the ground truth, which is defined as:

$$\begin{aligned} {l_{bce}}= & {} - \sum \limits _{(r,c)} {[G(r,c)\log (S(r,c))} \nonumber \\&+ (1 - G(r,c))\log (1 - S(r,c))] \end{aligned}$$
(2)

where \({G(r,c) \in \{ 0,1\} }\) is the ground truth label of the pixel \((r,c)\) and \(S(r,c)\) is the predicted probability of being a salient object.

The IoU loss was originally proposed to measure the similarity of two sets and was then used as a standard evaluation measure for object detection and segmentation. Recently, it has been used in SOD, and is defined as:

$$\begin{aligned} {l_{iou}} = 1 - \frac{{\sum \limits _{r = 1}^H {\sum \limits _{c = 1}^W {S(r,c)G(r,c)} } }}{{\sum \limits _{r = 1}^H {\sum \limits _{c = 1}^W {[S(r,c) + G(r,c) - S(r,c)G(r,c)]} } }} \end{aligned}$$
(3)

where \(G(r,c)\) and \(S(r,c)\) are defined as in Eq. (2).

In our training process, we combine BCE loss with IoU loss to obtain BCE-IoU loss for optimization, which is defined as:

$$\begin{aligned} {L_{bce + iou}} = {l_{bce}} + {l_{iou}}. \end{aligned}$$
(4)
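
The combined loss of Eqs. (2)-(4) can be implemented compactly; a minimal PyTorch sketch is given below, where `pred` is assumed to hold raw logits and the BCE term is averaged over pixels, a common implementation choice.

```python
import torch
import torch.nn.functional as F

def bce_iou_loss(pred, gt, eps=1e-6):
    """BCE-IoU loss of Eq. (4). `pred`: logits, `gt`: binary ground truth,
    both float tensors of shape (B, 1, H, W)."""
    bce = F.binary_cross_entropy_with_logits(pred, gt)   # pixel-level term, Eq. (2)
    s = torch.sigmoid(pred)
    inter = (s * gt).sum(dim=(2, 3))
    union = (s + gt - s * gt).sum(dim=(2, 3))
    iou = 1.0 - inter / (union + eps)                    # image-level term, Eq. (3)
    return bce + iou.mean()
```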

4 Experiments

4.1 Experimental setup

4.1.1 Datasets

We evaluate the proposed approach on 7 public RGB-D SOD datasets. NJUD [22] consists of 1985 RGB images and corresponding depth maps with various objects and complex scenarios; the depth maps are estimated from stereo images. NLPR [23] contains 1000 RGB-D images with pixel-wise ground truth, where the depth maps are captured by a Microsoft Kinect under different illumination conditions and acquisition scenes. STERE [10] collects 1000 paired RGB-D images, where the depth maps are also estimated from stereo images. LFSD [24] is constructed for light field saliency detection and contains 100 all-focus RGB images, the corresponding depth maps and pixel-wise ground truth; the depth maps are captured by a Lytro light field camera. SSD [25] contains 80 images picked from three stereo movies, where the depth maps are generated by a depth estimation approach. DUT [26] consists of 1200 paired images containing more complex scenarios, such as multiple or transparent objects. SIP [27] is a newly released dataset which contains 929 high-resolution person RGB-D images captured by a Huawei Mate10.

4.1.2 Implementation details

The training dataset is the same as in [26], consisting of 1487 images from NJUD [22], 700 images from NLPR [23] and 800 images from DUT [26]. To prevent overfitting, we augment the training set by flipping, cropping, rotating and lighting changes. In this work, we train two versions which employ VGG-16 [17] and ResNet-50 [18] as the backbone network, respectively. The RGB branch is initialized by VGG-16 or ResNet-50, and the other layers use the default initialization of PyTorch. We implement the proposed network in PyTorch on a PC with an Intel i9 9900K CPU, 32 GB RAM and an NVIDIA GeForce 2080Ti GPU. All the experiments are performed using the Adam [47] optimizer with an initial learning rate of 5e-5, which is divided by 10 after 20 epochs. Our network is trained for 30 epochs in total.
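
For illustration, the optimization schedule described above can be reproduced with a few lines of PyTorch; `model`, `train_loader` and `bce_iou_loss` are placeholders for the network, the augmented training set and the loss of Sect. 3.5.

```python
import torch

# Adam with lr 5e-5, divided by 10 after 20 epochs, 30 epochs in total.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)

for epoch in range(30):
    for rgb, depth, gt in train_loader:   # 352x352 inputs after augmentation
        pred = model(rgb, depth)
        loss = bce_iou_loss(pred, gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```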

4.1.3 Evaluation metrics

We employ the precision–recall (P–R) curve, F-measure [28], S-measure [29, 30], E-measure [27, 31] and mean absolute error (MAE) for quantitative evaluation. By thresholding the saliency map at a series of values and comparing each binary map with the ground truth, pairs of precision–recall values can be computed. The F-measure is a comprehensive measurement, defined as the weighted harmonic mean of precision and recall:

$$\begin{aligned} {F_\beta } = \frac{{\mathrm{{(1 + }}{\beta ^2}\mathrm{{)}} \cdot \mathrm{{Precision}} \cdot \mathrm{{Recall}}}}{{{\beta ^2} \cdot \mathrm{{Precision}} + \mathrm{{Recall}}}} \end{aligned}$$
(5)

where \(\beta ^2\) is set to 0.3 as in [28]. In this paper, we only report the maximum F-measure score.

S-measure evaluates the structure similarity between the saliency map and ground truth, which is defined as:

$$\begin{aligned} {S_\alpha } = \alpha \times {S_o} + (1 - \alpha ) \times {S_r} \end{aligned}$$
(6)

where \({\alpha }\) is set to 0.5 for assigning equal contribution to both region (\({{S_r}}\)) and object (\({{S_o}}\)) similarity.

Fig. 6
figure 6

Comparison of PR curves on six datasets. Best viewed on the screen

E-measure considers the local pixel-wise values and the image-level mean value together, which is consistent with cognitive vision studies. It is defined as:

$$\begin{aligned} {E_\xi } = \frac{1}{{W \times H}}\sum \limits _{x = 1}^W {\sum \limits _{y = 1}^H {f(\frac{{2{\varphi _{GT}} \circ {\varphi _{FM}}}}{{{\varphi _{GT}} \circ {\varphi _{GT}} + {\varphi _{FM}} \circ {\varphi _{FM}}}})} } \end{aligned}$$
(7)

where \({\varphi }\) is the bias matrix, defined as the distance between each pixel value and the image-level mean; \({{\varphi _{GT}}}\) and \({{\varphi _{FM}}}\) are the bias matrices of the ground truth and the binary foreground map, respectively; \({f( \cdot )}\) is a quadratic function, and “\({\circ }\)” denotes the element-wise product.

MAE is adopted to evaluate the average per-pixel difference between the saliency map and the ground truth, which is defined as

$$\begin{aligned} MAE = \frac{1}{{H \cdot W}}\sum \limits _{x = 1}^H {\sum \limits _{y = 1}^W {|S(x,y) - G(x,y)|} } \end{aligned}$$
(8)

where H and W are the height and width of the saliency map S, and G denotes its ground truth. It should be pointed out that a higher F-measure and a lower MAE score denote better performance.
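
For reference, the MAE and the maximum F-measure reported here can be computed as in the following sketch; the sweep of 255 binarization thresholds is a common convention rather than a value specified in the text.

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error of Eq. (8); `sal` and `gt` are arrays in [0, 1]."""
    return np.abs(sal - gt).mean()

def max_f_measure(sal, gt, beta2=0.3, steps=255):
    """Maximum F-measure (Eq. (5)) over a sweep of binarization thresholds."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0, 1, steps):
        pred = sal >= t
        tp = np.logical_and(pred, gt).sum()
        precision = tp / (pred.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```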

4.2 Compared with the state-of-the-arts

We compare our network with 13 other state-of-the-art methods, including DF [32], PCAN [36], CTMF [33], MMCI [34], TANet [35], CPFP [37], DMRA [26], D3Net [27], ICNet [38], S2MA [40], SSF [39], UCNet [16] and JLDCF [15]. For fair comparison, we use their released saliency maps or adopt their released code with default parameters to reproduce their results.

Table 1 Quantitative measures: S-measure (\({S_\alpha }\)), F-measure (\({F_\beta }\)), E-measure (\({E_\xi }\)), MAE (M) of SOTA methods and our proposed approach on seven RGB-D datasets

4.2.1 Quantitative evaluation

We present the quantitative comparison results in Table 1 and Fig. 6. Our method consistently outperforms the other methods on all seven datasets with respect to the S-measure, F-measure, E-measure and MAE scores, except JLDCF on the SIP dataset. It should be noted that JLDCF adopts a backbone network initialized with the pre-trained parameters of DSS [41], which is trained on an RGB SOD dataset with 2500 images. Nevertheless, we still perform better than JLDCF on the other datasets. The PR curves, shown in Fig. 6, also indicate the superior performance of our method. Thus, the above quantitative evaluation demonstrates the effectiveness and superiority of our proposed method in detecting salient objects.

Fig. 7
figure 7

Visual comparisons of GRN with SOTA RGB-D saliency models. “oursV” denotes GRN with the VGG-16 backbone; “oursR” denotes GRN with the ResNet-50 backbone

4.2.2 Qualitative evaluation

To show our results more intuitively, we provide some representative saliency maps of different methods to demonstrate the superiority of our proposed network. As can be seen in Fig. 7, the salient regions are highlighted more accurately by our method, even in some challenging cases, and our results have clearer boundaries than those of the other methods.

4.2.3 Complexity evaluation

Moreover, we compare the FPS (frames per second) and model size with other models for complexity evaluation, as shown in Table 2. Both the running speed and model size of our proposed model are better than those of the other models. Our best performing model, the ResNet-50 [18] version, runs 33% faster than S2MA, and its model size is 30% smaller than that of SSF. The training process of our proposed network takes about 4 h, and it runs at a speed of 90 FPS for 352\(\times \)352 images without any post-processing.

4.3 Ablation study

In this section, we carry out ablation studies to demonstrate the effectiveness of the proposed module components. We use VGG-16 as the backbone of the RGB branch in the following experiments.

4.3.1 Module components

To verify the effectiveness of the modules proposed in this paper, we conduct experiments to evaluate their performance in different combinations. We select the network with DCM, MSRF and GRM removed as the baseline, in which the GRMs are replaced with element-wise summation and a \({3 \times 3}\) convolution operation. The results are shown in Table 3. When using a single module, GRM gives the best performance, indicating that GRM plays a great role in fully integrating RGB-D features. When using two modules, GRM+MSRF achieves the best performance, indicating that MSRF expands the receptive field of the network and thus captures more high-level semantic information. When all three modules are combined, DCM alleviates the negative impact of low-quality depth maps on the network and thus further improves the performance.

4.3.2 Fusion of RGB and depth

As shown in Table 4, the summation and concatenation operations in the RGB-D fusion process achieve almost the same performance. It is worth mentioning that even such simple fusion obtains good performance, which is attributed to the proposed effective depth branch. Considering the test speed of the network, the concatenation operation is better, so we employ the simpler and faster concatenation operation as our fusion strategy.

Table 2 Running speed and model size comparisons with recent models
Table 3 Quantitative evaluation for ablation studies about the effectiveness of DCM, MSRF and GRM

4.3.3 Depth branch

We compare three designs of the depth branch: VGG-16 [17], the proposed depth branch and no depth branch at all. As can be seen from Table 5, the depth branch we designed performs better than VGG-16 while being nearly twice as fast. Compared with the model without a depth branch, ours performs much better while the speed is reduced by only about 10 FPS.

Table 4 Quantitative evaluation for ablation studies about different fusion methods of RGB features and depth maps
Table 5 Quantitative evaluation for ablation studies about different designs of the depth branch

5 Conclusion

In this paper, we proposed a guided residual network for RGB-D SOD. Instead of employing a traditional two-stream architecture that uses two backbone networks, we designed a simpler and more efficient depth branch for extracting depth features. Based on this depth branch, we can quickly and efficiently integrate depth and RGB features. We also proposed the DCM, which introduces an adaptive weight for each depth map to mitigate the negative influence of unreliable depth maps. In terms of feature fusion, multi-scale RGB-D features are fused along the top-down pathway by the proposed GRM. With these components, the proposed network locates salient regions and object boundaries accurately and efficiently. Experiments against 13 state-of-the-art methods on seven datasets demonstrated the superior performance of the proposed network. Our proposed network is simple and effective, with a fast running speed and a compact model size. Thus, we believe our designed depth branch can be applied to other two-stream RGB-D SOD methods to further improve their performance. In future work, we will continue to explore more effective fusion strategies to further improve performance.