1 Introduction

Salient object detection (SOD) aims at locating the most attractive regions in an image. As a pre-processing technique, SOD benefits a variety of computer vision tasks including re-identification [1], image understanding [2], object tracking [3] and video object segmentation [4], to name a few.

In the past years, CNN-based (Convolutional Neural Network) methods have achieved excellent performance on SOD tasks due to their powerful ability to extract and represent features. Most of them [5,6,7,8,9] focus on extracting features from RGB images to detect salient objects. However, it is difficult to accurately locate salient objects using single-modal data alone in some challenging and complex scenarios, such as similar appearance between the foreground and background or a cluttered background.

Recently, with the development of depth cameras, depth information can be easily obtained, and various RGB-D SOD methods [10,11,12,13,14,15,16, 48] have emerged. Depth maps provide SOD with location and spatial structural information and help the network accurately locate salient objects against complex backgrounds.

Fig. 1
figure 1

a Existing methods mostly exploit cross-modal complementarity through a two-stream architecture using two backbones. b Our proposed method, which designs a lower-complexity depth branch instead of a backbone network; details are illustrated in Fig. 3

As shown in Fig. 1a, existing RGB-D SOD methods [10,11,12,13,14] mostly explore RGB-D data with a traditional two-stream architecture, in which an extra backbone network is required to process the depth information. This architecture brings additional computation and memory consumption, and the resulting model has a huge number of parameters, which hinders its practical application. To solve this issue, we design a simpler and more efficient depth branch to deal with depth information, as shown in Fig. 1b, instead of using an extra backbone network. Because the depth map itself contains rich location and spatial structure information, there is no need to extract depth information through a complex network. Therefore, we design a simpler depth branch instead of a complex backbone network and treat the depth features extracted by this branch as depth prior maps to be fused with RGB features. In our network, the depth data are divided by the improved depth branch into three pieces at different scales, which are fused with the multi-scale RGB features extracted from the backbone network. The depth map only participates in the high-level stages of the network, which reduces the model parameters and calculation cost. We also propose a guided residual module (GRM) to integrate features from the cross-modal RGB-D data seamlessly in a multi-scale and channel-wise manner.

As mentioned above, depth plays an important role in RGB-D SOD and provides the network with location and spatial structural information. However, due to the immaturity of depth acquisition technology, depth maps are sometimes inaccurate and would contaminate the SOD results. Previous works generally integrate the RGB and depth information in an indiscriminate manner, which may produce negative results when encountering inaccurate or blurred depth maps. Hence, we design a depth correction module (DCM) that introduces an adaptive weight for each depth map to filter out unreliable ones. When the DCM judges a depth map to be of low quality, it assigns a low weight to that depth map to mitigate its negative impact.

In summary, our main contributions are listed as follows:

  1. (1)

    Instead of employing a backbone network, we design a simpler and more efficient depth branch to extract complementary features, which are fused with RGB features for guided refinement.

  2. (2)

    We design the DCM to judge the quality of depth maps and assign an adaptive weight to each depth map, mitigating the negative influence of unreliable depth maps.

  3. (3)

    Experiments show competitive performance against 13 state-of-the-art methods on 7 datasets, with particular advantages in efficiency (102 FPS) and compactness (64.2 MB).

2 Related work

2.1 RGB salient object detection

In recent years, we have witnessed the rapid development of SOD for RGB images. Numerous models have been presented to explore SOD in terms of boundaries, feature fusion, multi-supervision, pooling, etc. Su et al. [5] proposed a boundary-aware network with successive dilation to enhance feature selectivity at boundaries while guaranteeing the features at interiors; Zhao et al. [6] employed a pyramid feature attention network to focus on effective high-level context features and low-level spatial structural features; Zheng et al. [7] used multi-source weak supervision for saliency detection; Liu et al. [8] expanded the role of pooling in CNNs to detect saliency; Zhao et al. [9] focused on the complementarity between salient edge information and salient object information. However, when facing SOD in challenging and complex scenes, such as low contrast and multiple salient objects, single-modal SOD models do not perform well.

2.2 RGB-D salient object detection

The pioneering work for RGB-D SOD was produced by Niu et al. [10], who introduced disparity contrast and domain knowledge into stereoscopic photography to measure stereo saliency. After Niu's work, various handcrafted features originally applied to RGB SOD were extended to RGB-D, such as center-surround difference, contrast and background enclosure.

Fig. 2
figure 2

The schematic illustration of our proposed network. We design an efficient depth branch to improve depth feature learning, the MSRF to capture more multi-scale context, the DCM to reduce the negative impact of low-quality depth maps and the GRM to fuse RGB-D features, which are introduced in Sects. 3.1 to 3.4, respectively

In the past five years, deep learning-based RGB-D methods have achieved outstanding performance due to the powerful ability of CNNs in extracting salient object representations. Many methods adopt a two-stream architecture which uses two backbone networks (e.g., VGG [17], ResNet [18]) to explore the mining and fusion of cross-modal RGB-D information. Zhu et al. [11] employed an independent encoder network to take advantage of the location and spatial structural information in depth maps and assist the RGB-stream network. Chen et al. [42] exploited the cross-modal complement across all levels with a complementarity-aware fusion module based on a two-stream structure. Chen et al. [34] proposed a multi-scale multi-path fusion network with cross-modal interactions to enable sufficient and efficient fusion.

Recently, Piao et al. [12] proposed an adaptive and attentive depth distiller to transfer depth information from the depth stream to the RGB stream. Hence, their network needs no depth maps at test time, which promotes the practical application of RGB-D SOD approaches. Li et al. [14] proposed an attention-steered interweave fusion network to detect salient objects, which progressively integrates cross-modal and cross-level complementarity from RGB-D images via the steering of an attention mechanism. Fu et al. [15] assumed that RGB data and depth information share commonalities and proposed a single backbone network to learn from both RGB and depth inputs. Chen et al. [13] introduced a depth potentiality-aware mechanism to explicitly model the potentiality of the depth map and effectively integrate the cross-modal complementarity of RGB-D data. Zhang et al. [16], inspired by the saliency data labeling process, proposed a probabilistic RGB-D saliency detection network based on conditional variational autoencoders to model human annotation uncertainty and generate multiple saliency maps for each input image by sampling in the latent space.

However, most of the above RGB-D approaches [10,11,12,13,14, 16, 34, 42] focus on the way of integrating cross-modal RGB-D information and employ two pre-trained backbone networks to detect salient objects, which requires an additional backbone network to process the depth data. Different from them, we improve the two-stream network by designing a simpler and more efficient depth branch, which achieves lower computation cost and memory consumption than a traditional two-stream network.

Fig. 3
figure 3

The structure of the proposed depth branch. “Conv” means a \({3 \times 3}\) convolutional layer. Residual module is denoted as “Resmod”

3 The proposed network

3.1 Architecture overview

As shown in Fig. 2, our proposed network is an asymmetric two-stream end-to-end architecture that employs ResNet-50 [18] as the RGB branch. We design a simple and efficient depth branch to extract depth features, which avoids employing an additional backbone network to process depth data and reduces the extra calculation cost and memory consumption.

The structure of the proposed depth branch is illustrated in Fig. 3. It uses one \({3 \times 3}\) convolutional layer and three residual modules to deal with the depth data. Each residual module is composed of two \({3 \times 3}\) convolutional layers. Compared with a backbone network such as VGG-16 [17], our proposed depth branch has fewer convolutional layers and lower complexity. However, this does not mean that the performance is decreased: because of the rich information in the depth map, a simple depth branch is sufficient to extract the complementary features, which is verified in the ablation experiments. By improving the way depth features are learned, we reduce the complexity of the depth branch and use fewer convolutional layers to achieve better performance.
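
To make the branch concrete, the following PyTorch sketch follows the structure in Fig. 3 under stated assumptions: the single-channel depth input, the width of 64 feature channels, the placement of ReLU activations and the absence of explicit downsampling (the three outputs would later be resized to match the RGB stages) are our choices for illustration, not details taken from the released model.

```python
import torch
import torch.nn as nn

class ResMod(nn.Module):
    """Residual module: two 3x3 convolutions with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

class DepthBranch(nn.Module):
    """Lightweight depth branch: one 3x3 conv followed by three residual modules.
    The output of each residual module serves as one of the three depth priors
    fused with the high-level RGB features."""
    def __init__(self, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)  # single-channel depth input (assumed)
        self.resmods = nn.ModuleList([ResMod(channels) for _ in range(3)])

    def forward(self, depth):
        feats = []
        x = self.stem(depth)
        for m in self.resmods:
            x = m(x)
            feats.append(x)   # three depth priors at increasing depth
        return feats
```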

Since the quality of depth maps may vary due to the limitations of depth sensors, we design the DCM to assign an adaptive weight to each depth map by judging its quality before concatenating depth and RGB features. After that, we divide the depth maps into multi-scale maps to be fused with the multi-level RGB features extracted from the backbone network. To gain more high-level semantic information, we adopt a multi-scale receptive field (MSRF) module on top of the backbone network to enlarge the receptive field of the network. On the top-down pathway, we insert multiple GRMs to help integrate depth maps and RGB features seamlessly and acquire predictions in a multi-scale and channel-wise manner. A sketch of the overall data flow is given below; in what follows, we describe the structures of the above-mentioned components and explain their functions in detail.
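
The sketch below summarizes this data flow using the components sketched in the rest of this section. The stage indices, the bilinear resizing of depth priors and the use of concatenation fusion at every fused stage are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRNet(nn.Module):
    """High-level sketch of the data flow; submodules are passed in as placeholders."""
    def __init__(self, rgb_backbone, depth_branch, msrf, grms):
        super().__init__()
        self.rgb_backbone = rgb_backbone   # e.g., ResNet-50 returning multi-level features
        self.depth_branch = depth_branch   # lightweight branch of Sect. 3.1
        self.msrf = msrf                   # Sect. 3.2
        self.grms = nn.ModuleList(grms)    # one GRM (Sect. 3.4) per fused stage, coarse to fine

    def forward(self, rgb, depth, depth_weight):
        rgb_feats = self.rgb_backbone(rgb)                 # list of features, shallow to deep
        priors = self.depth_branch(depth * depth_weight)   # DCM-corrected depth (Sect. 3.3)
        pred = self.msrf(rgb_feats[-1])                    # coarse single-channel prediction
        # Top-down refinement: depth priors only join the three high-level stages.
        for grm, feat, prior in zip(self.grms,
                                    reversed(rgb_feats[-3:]),
                                    reversed(priors)):
            prior = F.interpolate(prior, size=feat.shape[2:],
                                  mode='bilinear', align_corners=False)
            fused = torch.cat([feat, prior], dim=1)        # concatenation fusion (Sect. 4.3.2)
            pred = grm(fused, pred)                        # guided refinement
        return pred
```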

3.2 Multi-scale receptive field module

The last convolutional layer of the backbone has a strong ability to capture semantic information. Therefore, we adopt it for global saliency perception to obtain a coarse prediction. The scale of salient objects varies from large to small, which implies that the model needs to capture information at different contexts in order to detect objects reliably. However, [43, 44] show that the empirical receptive fields of CNNs are much smaller than the theoretical ones, especially for deeper layers, so the receptive field of the whole network is not large enough to capture the multi-scale context of the input images. Several outstanding structures have been proposed to solve this issue, such as PPM (Pyramid Pooling Module [43]), ASPP (Atrous Spatial Pyramid Pooling [45]) and RFB (Receptive Field Block [46]). Different from these parallel concatenation-based methods, we build a hierarchical multi-scale receptive field module to sequentially aggregate the multi-scale contexts. By adding a skip connection between adjacent parallel branches, we not only make pixel sampling denser but also provide a larger receptive field.

Fig. 4
figure 4

The structure of the MSRF module. “C” denotes the concatenation operation. \({k \times k,d}\) represents a convolutional layer with kernel size k and dilation rate d

3.2.1 Architecture details

Specifically, we first reduce the number of channels to 64 to save memory. Then, we add four separate branches to capture multi-scale context cues, inspired by RFB [46]. Each branch consists of two convolutional layers: the first layer is a standard convolution with a \(1\times 1\), \(3\times 3\), \(5\times 5\) or \(7\times 7\) kernel for dense sampling, and the second layer is a dilated \(3\times 3\) convolution with dilation rate {1, 2, 4, 6} for sparse sampling. Different from previous works, in each branch except the first, the input of each layer is added with the output of the corresponding layer of the previous branch. Finally, we concatenate the branch outputs and feed them into a \(3\times 3\) convolutional layer to generate the single-channel coarse prediction. The whole architecture is illustrated in Fig. 4.
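
A minimal PyTorch sketch of the MSRF module is given below. The channel width of 64, the kernel sizes and the dilation rates follow the text; the use of a 1×1 convolution for channel reduction and our reading of where the previous branch's outputs are added are assumptions.

```python
import torch
import torch.nn as nn

class MSRF(nn.Module):
    """Hierarchical multi-scale receptive field module (sketch)."""
    def __init__(self, in_ch, mid_ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, 1)           # channel reduction to 64
        ks = [1, 3, 5, 7]                                   # dense-sampling kernel sizes
        ds = [1, 2, 4, 6]                                   # dilation rates for sparse sampling
        self.dense = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, k, padding=k // 2) for k in ks])
        self.sparse = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, 3, padding=d, dilation=d) for d in ds])
        self.fuse = nn.Conv2d(4 * mid_ch, 1, 3, padding=1)  # single-channel coarse map

    def forward(self, x):
        x = self.reduce(x)
        outs = []
        prev_dense, prev_sparse = None, None
        for dense, sparse in zip(self.dense, self.sparse):
            # Each layer's input is summed with the same layer of the previous branch.
            d_out = dense(x if prev_dense is None else x + prev_dense)
            s_out = sparse(d_out if prev_sparse is None else d_out + prev_sparse)
            outs.append(s_out)
            prev_dense, prev_sparse = d_out, s_out
        return self.fuse(torch.cat(outs, dim=1))
```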

3.3 Depth correction module

The existing approaches [10, 12, 14, 15, 34, 42] generally integrate RGB and depth features in an undifferentiated manner. However, because depth maps are obtained in different ways, their quality is uneven. Low-quality depth maps cannot guide the network to learn salient regions and instead have a negative effect on the predictions. To solve this issue, we design the DCM to assign an adaptive weight to each depth map by judging its quality, which reduces the negative impact of low-quality depth maps on the prediction results. If the quality of a depth map is high, the DCM assigns a high weight to it, and vice versa.

Usually, a high-quality depth map shares many similarities with the ground truth. Based on this, we employ the Structural Similarity (SSIM [19]) function to calculate the similarity between the depth map and the ground truth. SSIM measures the luminance, contrast and structural similarity between two images, and is defined as:

$$\begin{aligned} SSIM(x,y) = \frac{{(2{\mu _x}{\mu _y} + {C_1})(2{\sigma _{xy}} + {C_2})}}{{(\mu _x^2 + \mu _y^2 + {C_1})(\sigma _x^2 + \sigma _y^2 + {C_2})}} \end{aligned}$$
(1)

where \({C_1}\) and \({C_2}\) are two constants that prevent the denominator from being 0, \({\mu _x}\) and \({\mu _y}\) denote the means of the pixel values, \({\sigma _x}\) and \({\sigma _y}\) represent the standard deviations of the pixel values, and \({\sigma _{xy}}\) is the covariance between x and y.

Then, we use the value calculated by the SSIM function as an adaptive weight and apply it to the depth map to obtain an improved depth map. In addition to screening the quality of depth maps, the DCM can also filter out redundant information: when a depth map is of low quality, the weight assigned by the DCM is low, which means less of its information participates in the network computation.
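
The following sketch illustrates how such an adaptive weight can be computed and applied. It assumes a global (non-windowed) SSIM over the whole image with the common default constants, so it should be read as an approximation of the DCM rather than its exact implementation.

```python
import torch

def ssim_weight(depth, ref, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global SSIM between a depth map and a reference map (the saliency ground
    truth during training), used as the DCM's adaptive weight.
    Inputs: tensors in [0, 1] of shape (B, 1, H, W)."""
    mu_d = depth.mean(dim=(2, 3), keepdim=True)
    mu_r = ref.mean(dim=(2, 3), keepdim=True)
    var_d = depth.var(dim=(2, 3), keepdim=True, unbiased=False)
    var_r = ref.var(dim=(2, 3), keepdim=True, unbiased=False)
    cov = ((depth - mu_d) * (ref - mu_r)).mean(dim=(2, 3), keepdim=True)
    ssim = ((2 * mu_d * mu_r + c1) * (2 * cov + c2)) / \
           ((mu_d ** 2 + mu_r ** 2 + c1) * (var_d + var_r + c2))
    return ssim.clamp(0, 1)

def correct_depth(depth, ref):
    """Reweight the depth map: a low-quality map receives a small weight."""
    return ssim_weight(depth, ref) * depth
```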

Several existing methods have also been proposed to handle the issue of uneven-quality depth maps. Chen and Huang [13] concatenate high-level RGB-D features and compute a depth weight between the RGB-D features and the ground truth. Zhang et al. [16] design a sub-network to refine depth maps. Although these methods alleviate the negative effects of low-quality depth maps to some extent, they increase the complexity of the network and the running time of the model. Different from [13, 16], our proposed DCM can be treated as a pre-processing operation because it uses the original RGB-D data to calculate the adaptive depth weight, which brings little extra computing cost to the network.

3.4 Guided residual module

After obtaining the multi-level RGB-D features, we need to fuse them to acquire the saliency prediction. Low-level features contain rich detail information, such as boundary, texture and spatial structure information, while high-level features capture rich semantic information. We employ a U-shape architecture to combine the multi-level RGB-D features. However, one problem with this type of U-shape architecture is that the high-level features are gradually diluted as they are transmitted to the shallow layers. To solve this issue and help the network integrate RGB-D features seamlessly, we design the GRM to mitigate the dilution of confident semantic information with the help of top-down guidance.

As can be seen in Fig. 5, the N-channel feature maps are first split into N non-overlapping groups, each of which consists of a 1-channel feature map. Then, the side-output prediction is used as a guidance map and concatenated with each 1-channel feature map, yielding 2N-channel feature maps in total. Several \({3 \times 3}\) convolutional layers are applied for guided learning to obtain a 1-channel feature map, which is added to the input guidance map to form the new side-output prediction. By concatenating the high-level prediction with each channel of the side-output features, our approach relieves the dilution of high-level semantic information during the feature fusion process of the U-Net architecture.
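
A compact PyTorch sketch of this guided refinement is shown below. The number of convolutional layers used for guided learning (two here) and the bilinear upsampling of the guidance map are assumptions; since a standard convolution is order-invariant over input channels, the guidance copies are simply appended after the feature channels instead of being interleaved channel by channel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRM(nn.Module):
    """Guided residual module (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.guided = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, feat, guidance):
        # Upsample the coarser side-output prediction to the current resolution.
        guidance = F.interpolate(guidance, size=feat.shape[2:],
                                 mode='bilinear', align_corners=False)
        # Pair every 1-channel feature map with the guidance map: N -> 2N channels.
        n = feat.shape[1]
        paired = torch.cat([feat, guidance.expand(-1, n, -1, -1)], dim=1)
        # Guided learning yields a 1-channel residual, added to the guidance
        # to form the refined side-output prediction.
        return self.guided(paired) + guidance
```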

Fig. 5
figure 5

The construction of the GRM. \( {up \times 2 \& concate}\) represents an upsampling operation with a factor of 2 followed by concatenation. \({H \times W \times N}\) denotes a feature map of height H, width W and N channels

3.5 Loss function

We apply the binary cross-entropy (BCE) loss [20] and the intersection over union (IoU) loss [21] to optimize the network, where BCE and IoU constrain the saliency prediction at the pixel level and the image level, respectively.

In SOD, the BCE loss is commonly employed to measure the relation between the predicted saliency map and the ground truth, which is defined as:

$$\begin{aligned} {l_{bce}}= & {} - \sum \limits _{(r,c)} {[G(r,c)\log (S(r,c))} \nonumber \\&+ (1 - G(r,c))\log (1 - S(r,c))] \end{aligned}$$
(2)

where \({G(r,c) \in \{ 0,1\} }\) is the ground truth label of the pixel \((r,c)\) and \(S(r,c)\) is the predicted probability of being a salient object.

The IoU loss was originally proposed to measure the similarity of two sets and was then used as a standard evaluation measure for object detection and segmentation. Recently, it has been used in SOD, and is defined as:

$$\begin{aligned} {l_{iou}} = 1 - \frac{{\sum \limits _{r = 1}^H {\sum \limits _{c = 1}^W {S(r,c)G(r,c)} } }}{{\sum \limits _{r = 1}^H {\sum \limits _{c = 1}^W {[S(r,c) + G(r,c) - S(r,c)G(r,c)]} } }} \end{aligned}$$
(3)

where \(G(r,c)\) and \(S(r,c)\) are defined as in Eq. (2).

In our training process, we combine BCE loss with IoU loss to obtain BCE-IoU loss for optimization, which is defined as:

$$\begin{aligned} {L_{bce + iou}} = {l_{bce}} + {l_{iou}}. \end{aligned}$$
(4)
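
The combined loss of Eqs. (2)-(4) can be implemented compactly; a minimal PyTorch sketch is given below, where `pred` is assumed to hold raw logits and the BCE term is averaged over pixels, a common implementation choice.

```python
import torch
import torch.nn.functional as F

def bce_iou_loss(pred, gt, eps=1e-6):
    """BCE-IoU loss of Eq. (4). `pred`: logits, `gt`: binary ground truth,
    both float tensors of shape (B, 1, H, W)."""
    bce = F.binary_cross_entropy_with_logits(pred, gt)   # pixel-level term, Eq. (2)
    s = torch.sigmoid(pred)
    inter = (s * gt).sum(dim=(2, 3))
    union = (s + gt - s * gt).sum(dim=(2, 3))
    iou = 1.0 - inter / (union + eps)                    # image-level term, Eq. (3)
    return bce + iou.mean()
```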

4 Experiments

4.1 Experimental setup

4.1.1 Datasets

We evaluate the proposed approach on 7 public RGB-D SOD datasets. NJUD [22] consists of 1985 RGB images and corresponding depth maps with various objects and complex scenarios; the depth maps are estimated from stereo images. NLPR [23] contains 1000 RGB-D images with pixel-wise ground truth, where the depth maps are captured by a Microsoft Kinect under different illumination conditions and acquisition scenes. STERE [10] collects 1000 paired RGB-D images, where the depth maps are also estimated from stereo images. LFSD [24] is constructed for light field saliency detection and contains 100 all-focus RGB images, the corresponding depth maps and pixel-wise ground truth; the depth maps are captured by a Lytro light field camera. SSD [25] contains 80 images picked from three stereo movies, where the depth maps are generated by a depth estimation approach. DUT [26] consists of 1200 paired images containing more complex scenarios, such as multiple or transparent objects. SIP [27] is a newly released dataset which contains 929 high-resolution person RGB-D images captured by a Huawei Mate10.

4.1.2 Implementation details

The training dataset is the same as in [26], consisting of 1487 images from NJUD [22], 700 images from NLPR [23] and 800 images from DUT [26]. To prevent overfitting, we augment the training set by flipping, cropping, rotating and lighting changes. In this work, we train two versions which employ VGG-16 [17] and ResNet-50 [18] as the backbone network, respectively. The RGB branch is initialized by VGG-16 or ResNet-50, and the other layers use the default initialization of PyTorch. We implement the proposed network in PyTorch on a PC with an Intel i9 9900K CPU, 32 GB RAM and an NVIDIA GeForce 2080Ti GPU. All the experiments are performed using the Adam [47] optimizer with an initial learning rate of 5e-5, which is divided by 10 after 20 epochs. Our network is trained for 30 epochs in total.
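
For illustration, the optimization schedule described above can be reproduced with a few lines of PyTorch; `model`, `train_loader` and `bce_iou_loss` are placeholders for the network, the augmented training set and the loss of Sect. 3.5.

```python
import torch

# Adam with lr 5e-5, divided by 10 after 20 epochs, 30 epochs in total.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)

for epoch in range(30):
    for rgb, depth, gt in train_loader:   # 352x352 inputs after augmentation
        pred = model(rgb, depth)
        loss = bce_iou_loss(pred, gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```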

4.1.3 Evaluation metrics

We employ the precision–recall (P–R) curve, F-measure [28], S-measure [29, 30], E-measure [27, 31] and mean absolute error (MAE) for quantitative evaluation. By thresholding the saliency map at a series of values and comparing each binary map with the ground truth, pairs of precision–recall values can be computed. The F-measure is a comprehensive measurement, defined as the weighted harmonic mean of precision and recall:

$$\begin{aligned} {F_\beta } = \frac{{\mathrm{{(1 + }}{\beta ^2}\mathrm{{)}} \cdot \mathrm{{Precision}} \cdot \mathrm{{Recall}}}}{{{\beta ^2} \cdot \mathrm{{Precision}} + \mathrm{{Recall}}}} \end{aligned}$$
(5)

where \(\beta ^2\) is set to 0.3 as in [28]. In this paper, we only report the maximum F-measure score.

S-measure evaluates the structure similarity between the saliency map and ground truth, which is defined as:

$$\begin{aligned} {S_\alpha } = \alpha \times {S_o} + (1 - \alpha ) \times {S_r} \end{aligned}$$
(6)

where \({\alpha }\) is set to 0.5 for assigning equal contribution to both region (\({{S_r}}\)) and object (\({{S_o}}\)) similarity.

Fig. 6
figure 6

Comparison of PR curves on six datasets. Best viewed on the screen

E-measure considers the local pixel-wise values and the image-level mean value together, which is consistent with cognitive vision studies. It is defined as:

$$\begin{aligned} {E_\xi } = \frac{1}{{W \times H}}\sum \limits _{x = 1}^W {\sum \limits _{y = 1}^H {f(\frac{{2{\varphi _{GT}} \circ {\varphi _{FM}}}}{{{\varphi _{GT}} \circ {\varphi _{GT}} + {\varphi _{FM}} \circ {\varphi _{FM}}}})} } \end{aligned}$$
(7)

where \({\varphi }\) is the bias matrix, defined as the distance between each pixel value and the image-level mean; \({{\varphi _{GT}}}\) and \({{\varphi _{FM}}}\) are the bias matrices of the ground truth and the binary foreground map, respectively; \({f( \cdot )}\) is a quadratic function, and “\({\circ }\)” denotes the element-wise product.

MAE is adopted to evaluate the average per-pixel difference between the saliency map and the ground truth, which is defined as

$$\begin{aligned} MAE = \frac{1}{{H \cdot W}}\sum \limits _{x = 1}^H {\sum \limits _{y = 1}^W {|S(x,y) - G(x,y)|} } \end{aligned}$$
(8)

where H and W are the height and width of the saliency map S, and G denotes its ground truth. It should be pointed out that a higher F-measure and a lower MAE score denote better performance.
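
For reference, the MAE and the maximum F-measure reported here can be computed as in the following sketch; the sweep of 255 binarization thresholds is a common convention rather than a value specified in the text.

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error of Eq. (8); `sal` and `gt` are arrays in [0, 1]."""
    return np.abs(sal - gt).mean()

def max_f_measure(sal, gt, beta2=0.3, steps=255):
    """Maximum F-measure (Eq. (5)) over a sweep of binarization thresholds."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0, 1, steps):
        pred = sal >= t
        tp = np.logical_and(pred, gt).sum()
        precision = tp / (pred.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```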

4.2 Compared with the state-of-the-arts

We compare our network with 13 other state-of-the-art methods, including DF [32], PCAN [36], CTMF [33], MMCI [34], TANet [35], CPFP [37], DMRA [26], D3Net [27], ICNet [38], S2MA [40], SSF [39], UCNet [16] and JLDCF [15]. For fair comparison, we use their released saliency maps or adopt their released code with default parameters to reproduce their results.

Table 1 Quantitative measures: S-measure (\({S_\alpha }\)), F-measure (\({F_\beta }\)), E-measure (\({E_\xi }\)), MAE (M) of SOTA methods and our proposed approach on seven RGB-D datasets

4.2.1 Quantitative evaluation

We present the quantitative comparison results in Table 1 and Fig. 6. Our method consistently outperforms the other methods on all seven datasets with respect to the S-measure, F-measure, E-measure and MAE scores, except JLDCF on the SIP dataset. It should be noted that JLDCF adopts a backbone network initialized with the pre-trained parameters of DSS [41], which is trained on an RGB SOD dataset with 2500 images. Nevertheless, we still perform better than JLDCF on the other datasets. The PR curves, shown in Fig. 6, also indicate the superior performance of our method. Thus, the above quantitative evaluation demonstrates the effectiveness and superiority of our proposed method in detecting salient objects.

Fig. 7
figure 7

Visual comparisons of GRN with SOTA RGB-D saliency models. “oursV” denotes GRN with the VGG-16 backbone; “oursR” denotes GRN with the ResNet-50 backbone

4.2.2 Qualitative evaluation

To show our results more intuitively, we provide some representative saliency maps of different methods to demonstrate the superiority of our proposed network. As can be seen in Fig. 7, the salient regions are highlighted more accurately by our method, even in some challenging cases, and our results have clearer boundaries than those of the other methods.

4.2.3 Complexity evaluation

Moreover, we compare the FPS (frames per second) and model size with other models for complexity evaluation, as shown in Table 2. Both the running speed and model size of our proposed model are better than those of the other models. Our best performing model, the ResNet-50 [18] version, runs 33% faster than S2MA, and its model size is 30% smaller than that of SSF. The training process of our proposed network takes about 4 h, and it runs at a speed of 90 FPS for 352\(\times \)352 images without any post-processing.

4.3 Ablation study

In this section, we carry out ablation studies to demonstrate the effectiveness of the proposed module components. We use VGG-16 as the backbone of the RGB branch in the following experiments.

4.3.1 Module components

To verify the effectiveness of the modules proposed in this paper, we conduct experiments to evaluate their performance in different combinations. We select the network with DCM, MSRF and GRM removed as the baseline, in which the GRMs are replaced with element-wise summation and a \({3 \times 3}\) convolution operation. The results are shown in Table 3. When using a single module, GRM gives the best performance, indicating that GRM plays a great role in fully integrating RGB-D features. When using two modules, GRM+MSRF achieves the best performance, indicating that MSRF expands the receptive field of the network and thus captures more high-level semantic information. When all three modules are combined, DCM alleviates the negative impact of low-quality depth maps on the network and thus further improves the performance.

4.3.2 Fusion of RGB and depth

As shown in Table 4, the summation and concatenation operations in the RGB-D fusion process achieve almost the same performance. It is worth mentioning that even such simple fusion obtains good performance, which is attributed to the proposed effective depth branch. Considering the test speed of the network, the concatenation operation is better, so we employ the simpler and faster concatenation operation as our fusion strategy.

Table 2 Running speed and model size comparisons with recent models
Table 3 Quantitative evaluation for ablation studies about the effectiveness of DCM, MSRF and GRM

4.3.3 Depth branch

We compare three designs of the depth branch: VGG-16 [17], the proposed depth branch and no depth branch at all. As can be seen from Table 5, the depth branch we designed performs better than VGG-16 while being nearly twice as fast. Compared with the model without a depth branch, ours performs much better while the speed is reduced by only about 10 FPS.

Table 4 Quantitative evaluation for ablation studies about different fusion methods of RGB features and depth maps
Table 5 Quantitative evaluation for ablation studies about different designs of the depth branch

5 Conclusion

In this paper, we proposed a guided residual network for RGB-D SOD. Instead of employing a traditional two-stream architecture that uses two backbone networks, we designed a simpler and more efficient depth branch for extracting depth features. Based on this depth branch, we can quickly and efficiently integrate depth and RGB features. We also proposed the DCM, which introduces an adaptive weight for each depth map to mitigate the negative influence of unreliable depth maps. In terms of feature fusion, multi-scale RGB-D features are fused along the top-down pathway by the proposed GRM. With these components, the proposed network locates salient regions and object boundaries accurately and efficiently. Experiments against 13 state-of-the-art methods on seven datasets demonstrated the superior performance of the proposed network. Our proposed network is simple and effective, with a fast running speed and a compact model size. Thus, we believe our designed depth branch can be applied to other two-stream RGB-D SOD methods to further improve their performance. In future work, we will continue to explore more effective fusion strategies to further improve performance.