
1 Introduction

Salient object detection, which aims to identify the visually most interesting regions of an image, is a well-researched domain of computer vision. It serves as an essential pre-processing step for various visual tasks such as image retrieval [7, 15, 17, 28], visual tracking [2, 20, 38], object segmentation [12, 39, 40, 42, 43], and object recognition [10, 36, 37], and therefore makes an important contribution towards sustainable development.

A majority of existing works [21, 26] for saliency detection operate on RGB images. While RGB-based saliency detection methods have achieved great success, appearance features in RGB data are less discriminative in some challenging scenes, such as those with multiple or transparent objects, similar foreground and background, complex backgrounds, or low-intensity environments.

The depth cue carries strong discriminative power regarding location and spatial structure, which has been proved beneficial to accurate saliency prediction [35]. Moreover, paired depth data for RGB natural images are widely available with the advent of depth sensors, e.g., Kinect and Lytro Illum. Consequently, using depth information has gained growing interest in saliency detection.

Fig. 1. The comparison of predicted maps between our method and two top-ranking RGB-D-based methods, i.e., DMRA [32] and CPFP [46], on salient object details. The \(1^{st}\) and \(4^{th}\) rows are enlarged views of the red box areas of the middle two rows, showing the superior performance of our method on saliency details

Most RGB-D-based methods utilize symmetric two-stream architectures for extracting RGB and depth features [4, 6, 18, 32]. However, we observe that while RGB data contain rich color, texture, and contour information but only limited location cues, grayscale depth data provide complementary information such as spatial structure and 3D layout. In consequence, a symmetric two-stream network may overlook the inherent differences between RGB and depth data. Asymmetric architectures have been adopted in a few works to extract RGB and depth features, taking the differences between the two modalities into account. Zhu et al. [48] present an architecture composed of a master network for processing RGB values and a sub-network making full use of depth cues, which incorporates depth-based features into the master network via direct concatenation. Zhao et al. [46] incorporate the contrast prior to enhance the depth maps and then integrate them into the RGB stream for saliency detection. However, simple fusion strategies such as direct concatenation or summation are less adaptive for locating salient objects, given the myriad possible positions of salient objects in the real world. Overall, the above methods overlook the fact that the depth cue contributes differently to salient object prediction in different scenes. Furthermore, existing RGB-D methods inevitably suffer from a loss of detail information [16, 41] because they adopt strides and pooling operations in the RGB and depth streams. An intuitive solution is to use skip-connections [22] or short-connections [21] to reconstruct the detail information. Although these strategies have brought satisfactory improvements, they remain limited in predicting complete structures with fine details.

Building on the above observations, we take a further step towards accurate saliency detection with an asymmetric two-stream model. The primary challenge is to effectively extract rich global context information while preserving local saliency details. The second challenge is to effectively exploit the discriminative power of depth features to guide the RGB features in locating salient objects accurately.

To confront these challenges, we propose an asymmetric two-stream architecture as illustrated in Fig. 2. Concretely, our contributions are:

  • We design a flow ladder module (FLM) and a lightweight depth network (DepthNet) with a small model size of 6.7 MB. Instead of adopting skip-connections or short-connections, our FLM effectively extracts local detail information (see Fig. 1) and global context information through a local-global evolutionary fusion flow for accurate saliency detection.

  • We propose a novel depth attention module (DAM) to ensure that the depth features can effectively guide the RGB features by using the discriminative power of depth cues. Its effectiveness has been experimentally verified (see Table 4).

  • Furthermore, we conduct extensive experiments on 7 datasets and demonstrate that our method achieves consistently superior performance over 13 state-of-the-art RGB-D approaches in terms of 4 evaluation metrics. Numerically, our approach reduces the MAE by nearly 33% on the DUT-RGBD dataset. In addition, our method reduces the model size by 33% compared with the smallest existing model (PDNet) and achieves the second-fastest running speed of 46 FPS.

2 Related Work

RGB-D Saliency Detection. Although many RGB-based saliency detection methods have achieved appealing performance [16, 29, 33, 44, 45, 47], they may fail to accurately detect salient areas because appearance features in RGB data are less predictive in complex scenes, such as low-contrast scenes, transparent objects, foregrounds sharing similar content with the background, multiple objects, and cluttered backgrounds. With the advent of consumer-grade depth cameras such as Kinect, light field cameras, and lidars, depth cues with a wealth of geometric and structural information are widely used in salient object detection (SOD).

Existing RGB-D saliency detection methods can be broadly classified into two categories. Traditional methods [8, 9, 11, 24, 31, 35, 49, 50]: Ren et al. [35] propose a two-stage RGB-D saliency detection framework using the validity of global priors. Lang et al. [24] introduce a depth prior into the saliency detection model to improve detection performance. Desingh et al. [11] use non-linear regression to combine an RGB-D saliency detection model with an RGB model to measure saliency values. CNNs-based methods [4,5,6, 18, 32, 34, 46, 48]: to better mine salient information in challenging scenes, CNNs-based methods combine depth information with RGB information for more accurate results. Symmetric two-stream architectures, which extract RGB and depth representations equally, have been studied for a long time [4,5,6, 18, 32]. Han et al. [18] design a symmetric architecture that automatically fuses the deep representations of the depth and RGB views to obtain the final saliency map. Chen et al. [6] utilize two-stream CNNs-based models that introduce cross-modal interactions in multiple layers by direct summation. Recently, several asymmetric architectures have been proposed for processing the different data types [46, 48]. Zhao et al. [46] use enhanced depth information as an auxiliary cue and adopt a pyramid decoding structure to obtain more accurate salient regions.

Because of the inherent differences between RGB and depth information, classic symmetric two-stream architectures and simple fusion strategies may lead to inaccurate predictions. Besides, the strides and pooling operations adopted in existing RGB-D-based methods for downsampling inevitably result in information loss. To address these issues, in this work we design an asymmetric network and adaptively fuse RGB and depth information through a depth attention mechanism for precise saliency detection.

Fig. 2. The overall architecture of our proposed approach. Our asymmetric architecture consists of three parts, i.e., the RGBNet, the DAM, and the lightweight DepthNet. The RGBNet includes a VGG-19 backbone and a flow ladder module. For the depth stream, we also employ the same backbone as the RGBNet. The black arrows represent the information flows

3 The Proposed Method

The overall architecture of our proposed method is shown in Fig. 2. In this section, we first describe the overall architecture in Sect. 3.1, then introduce the DepthNet in Sect. 3.2, the flow ladder module in Sect. 3.3, and finally the proposed depth attention module in Sect. 3.4.

Table 1. Details of our DepthNet architecture: k denotes the kernel size, s the stride, chns the number of input/output channels for each layer, p the padding, and in/out the input and output feature sizes

3.1 The Overall Architecture

Considering that most RGB-D-based methods utilizing symmetric two-stream architectures overlook the inherent differences between RGB and depth data, we propose an asymmetric two-stream architecture, as illustrated in Fig. 2. Our two-stream architecture consists of a lightweight depth stream and an RGB stream with a flow ladder module, namely the DepthNet and the RGBNet, respectively. For the depth stream, we design a lightweight architecture as shown in Table 1. The extracted depth features are then fed into the RGB stream through a depth attention mechanism (DAM, see Fig. 3) to generate complementary features with rich location and spatial-structure information. For the RGB stream, we adopt the commonly used VGG-19 architecture as our baseline. On top of this baseline, we propose a novel flow ladder module (FLM) that preserves detail information while receiving global location information from the representations of other, vertically parallel branches in an evolutionary way, which benefits the localization of salient regions and yields considerable performance gains.
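To make the data flow concrete, the following is a minimal sketch of the asymmetric two-stream idea, not the authors' implementation: the RGB stream, depth stream, attention fusion and decoder are stood in by small placeholder layers, and all channel sizes are assumptions.

```python
# Minimal sketch of an asymmetric two-stream RGB-D pipeline (toy stand-ins only).
import torch
import torch.nn as nn

def conv_block(in_c, out_c):
    return nn.Sequential(nn.Conv2d(in_c, out_c, 3, padding=1), nn.ReLU(inplace=True))

class ToyAsymmetricRGBD(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.rgb_stream = conv_block(3, c)    # stands in for VGG-19 + FLM
        self.depth_stream = conv_block(1, c)  # stands in for the lightweight DepthNet
        self.fuse = nn.Conv2d(c, c, 1)        # stands in for the DAM
        self.head = nn.Conv2d(c, 1, 1)        # stands in for the decoder

    def forward(self, rgb, depth):
        f_rgb = self.rgb_stream(rgb)
        f_dep = self.depth_stream(depth)
        # depth features gate the RGB features (simplified attention-style fusion)
        f_fused = f_rgb * torch.sigmoid(self.fuse(f_dep))
        return torch.sigmoid(self.head(f_fused))

rgb = torch.randn(1, 3, 256, 256)
depth = torch.randn(1, 1, 256, 256)
saliency = ToyAsymmetricRGBD()(rgb, depth)    # -> (1, 1, 256, 256)
```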

3.2 DepthNet

Compared with RGB data, which contain richer color and texture information, depth cues focus on spatial location information. A large number of parameters in a complex depth-extraction network are therefore redundant, so we consider it unnecessary to process depth data with a network as large as the RGBNet. The ablation experiments on symmetric and asymmetric architectures in Sect. 4.3 also confirm this claim. As illustrated in Fig. 2, we adopt a detail-transfer architecture for the depth stream (see Table 1 for the detailed specification) and take the original depth maps as input. Our DepthNet transfers detail information throughout the whole architecture to capture fine spatial details. Considering the differences between RGB and depth data, numerous redundant channels of depth features are unnecessary. Therefore, we prune the number of feature channels to 32 in Conv3 and Conv4 and to 128 in the final Conv, which yields a more lightweight DepthNet with a model size of 6.7 MB.
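A hedged sketch of such a lightweight, detail-transferring depth stream is given below. Since Table 1 is not reproduced here, every kernel and layer choice is an assumption; only the pruned channel widths (32 in Conv3/Conv4, 128 in the final conv) come from the text, and stride-1 convolutions are used so the spatial resolution of the depth cues is preserved.

```python
import torch
import torch.nn as nn

class DepthNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        def conv(in_c, out_c, k=3):
            return nn.Sequential(nn.Conv2d(in_c, out_c, k, stride=1, padding=k // 2),
                                 nn.ReLU(inplace=True))
        self.conv1 = conv(1, 32)     # single-channel depth map in (width assumed)
        self.conv2 = conv(32, 32)    # width assumed
        self.conv3 = conv(32, 32)    # pruned to 32 channels (from the text)
        self.conv4 = conv(32, 32)    # pruned to 32 channels (from the text)
        self.conv5 = conv(32, 128)   # final conv, 128 channels (from the text)

    def forward(self, depth):
        x = self.conv1(depth)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        return self.conv5(x)         # depth features handed to the DAMs

# rough parameter count of the sketch (the real DepthNet is reported as 6.7 MB)
print(sum(p.numel() for p in DepthNetSketch().parameters()))
```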

Fig. 3. Illustration of the depth attention module (DAM). The images above \(F_{out}\) are the corresponding original RGB image and ground truth

3.3 RGBNet

Deeper networks are able to extract richer high-level information such as location and semantic information, but the strides and pooling operations widely used in existing RGB-D-based methods may cause a loss of detail information, such as boundaries and small objects, for saliency detection. A straightforward solution to this issue is to combine low-level features with high-level features via skip-connections [22]. However, low-level features have less discriminative and predictive power in complex scenes, which hinders accurate saliency detection. Hence, we design a novel RGBNet consisting of a VGG backbone (for fair comparison) and a flow ladder module that preserves local detail information by constructing four detail-transfer branches and fuses global location information in an evolutionary way. To fit our task, we truncate the last three fully-connected layers and keep the five convolution blocks as well as all pooling operations of VGG-19. The FLM preserves the resolution of representations at multiple scales and levels, ensuring that both local detail information and global location information contribute to the precision of saliency detection. More details are described as follows.

To alleviate the loss of detail information, we design a flow ladder module (FLM). This module is applied to VGG-19 and integrates four detail-transfer branches via a local-global evolutionary fusion flow. We design the detail-transfer branches to preserve saliency details. As shown in Fig. 2, the first two branches consist of 3 layers, while the number of layers in the \(3^{rd}\) and \(4^{th}\) branches is reduced to 2 and 1, respectively. Specifically, we denote the \(j^{th}\) layer of the \(i^{th}\) branch as \(B_{i}L_{j}\), \(i \in [1,4]\), \(j \in [1,3]\). Each \(B_{i}L_{j}\) is composed of four basic blocks [19], each of which consists of two convolutional layers, as shown at the top of Fig. 2. Our FLM thus consists of 4 evolved detail-transfer branches. Instead of adopting strides and pooling operations, our FLM preserves the resolution of representations with more details in each branch by employing convolutional operations with a stride of 1.
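A sketch of one such \(B_{i}L_{j}\) unit follows: four stacked basic blocks, each with two stride-1, 3×3 convolutions so that the feature resolution is preserved. The residual connection and BatchNorm follow the basic-block design of [19]; the channel width is an assumption.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # residual connection, no downsampling

def BiLj(channels=32):
    # one layer of one detail-transfer branch: four stacked basic blocks
    return nn.Sequential(*[BasicBlock(channels) for _ in range(4)])
```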

We design a novel local-global evolutionary fusion flow for integrating multi-scale local and global features extracted from the detail-transfer branches. Each branch receives rich information from the other, vertically parallel representations through this fusion flow. In this way, rich global context representations are generated while more local saliency details are preserved. Specifically, the representations of the deeper branches are fused into the shallower branches by upsampling and summation, while the representations of the shallower branches are fused into the deeper branches by downsampling and summation, as shown in the FLM of Fig. 2. Through this evolution between branches (shown in Fig. 2), the local detail information and the global context information are effectively combined, which benefits the precision of saliency detection. The whole fusion process is described by the following equations:

$$\begin{aligned} B_{i}L_{j}=\left\{ \begin{array}{ll} trans(Conv2), &{} i=1,\ j=1\\ trans(Conv(i+1)), &{} i=j+1,\ j\in [1,3]\\ \sum \nolimits _{n=1}^{j}f(B_{n}L_{j-1}), &{} i\in [1,j],\ j\in [2,3] \end{array}\right. \end{aligned}$$
(1)
$$\begin{aligned} F_{RGB}^{j}=\sum _{n=1}^{j+1}f(B_{n}L_{j}), \quad j\in [1,2], \end{aligned}$$
(2)
$$\begin{aligned} F_{RGB}^{3}=cat(f(B_{n}L_{3})), \quad n\in [1,4], \end{aligned}$$
(3)

where \(B_{i}\) and \(L_{j}\) denote the \(i^{th}\) branch and \(j^{th}\) layer, respectively. \(f(\cdot )\) denotes \((n-i)\)-times up-sampling when \(n>i\) and \((i-n)\)-times down-sampling when \(n<i\); when \(n\) equals \(i\), \(f(\cdot )\) is the identity. \(Conv(i)\) denotes the output features of the \(i^{th}\) Conv block of VGG-19, and \(trans(\cdot )\) is a convolutional layer that transforms the number of channels. \(cat(\cdot )\) denotes concatenating all features together. The final output of our FLM, namely \(F_{RGB}^{3}\), is a concatenation of multi-scale features extracted from the four branches. In conclusion, features carrying local and global information are transferred to the parallel branches in an evolutionary way. Our proposed FLM not only alleviates the loss of object detail information but also effectively integrates multi-scale and multi-level features for precise saliency prediction.
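The fusion step between branches can be sketched as below, following Eqs. (1)-(3): \(f(\cdot)\) resamples branch-n features to branch-i's resolution (interpreted here as a factor-2 change per branch level, which is an assumption), and the parallel branches are either summed (Eq. 2) or concatenated at the last layer (Eq. 3).

```python
import torch
import torch.nn.functional as F

def f(feat, n, i):
    """Resample features of branch n to the resolution of branch i."""
    if n == i:
        return feat
    scale = 2.0 ** (n - i)          # n > i: upsample, n < i: downsample
    return F.interpolate(feat, scale_factor=scale, mode='bilinear', align_corners=False)

def fuse_sum(branch_feats, i):
    # Eq. (2)-style fusion: sum all parallel branches, resampled to branch i
    return sum(f(feat, n, i) for n, feat in enumerate(branch_feats, start=1))

def fuse_cat(branch_feats, i=1):
    # Eq. (3)-style fusion: concatenate all branches at branch i's resolution
    return torch.cat([f(feat, n, i) for n, feat in enumerate(branch_feats, start=1)], dim=1)

# toy usage: 4 branches at resolutions 64, 32, 16, 8 with 32 channels each (assumed)
feats = [torch.randn(1, 32, 64 // 2 ** k, 64 // 2 ** k) for k in range(4)]
print(fuse_cat(feats).shape)        # -> (1, 128, 64, 64)
```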

Fig. 4. Comparisons of our method with state-of-the-art CNNs-based methods. These methods rank highest in the quantitative evaluation. Our results are clearly more consistent with the ground truths (GT), especially in complex scenes

3.4 Depth Attention Module

Variations in the positions of objects in the real world make linear fusion strategies of RGB and depth data less adaptive to complex scenes. To take full advantage of the depth cues, whose discriminative power lies in location and spatial structure, we design a depth attention module that adaptively fuses the RGB and depth representations, as shown in Fig. 3. First, since the depth features contain abundant spatial and structural information, we utilize a context attention block, which contains a 1 × 1 convolutional layer \(W_{k}\) and a softmax function, to extract the salient location cues more precisely, instead of applying simple fusion such as summation or concatenation. A matrix multiplication then aggregates all location features to generate the attention weight of each channel i (\(i.e., \alpha _{i}\)), capturing pixel-wise spatial dependencies. Moreover, the degree of response to the salient regions varies between the features of different channels. We therefore adopt a channel-wise attention block, which contains two 1 × 1 convolutional layers \(W_{c}\) and a LayerNorm function, to capture the inter-dependencies between channels and obtain a weighted depth feature \(\beta \). We then use an element-wise product to fuse \(\beta \) into the RGB stream, which guides the RGB features at the pixel level to distinguish the foreground from the background thoroughly. Furthermore, the ablation experiments in Sect. 4.3 verify the effectiveness of our DAM compared with simple fusion, and the visual results in Fig. 6(b) show that the salient regions are emphasized through the attention mechanism.

The details of these three blocks can be formulated as the following equations:

$$\begin{aligned} \alpha _{i}=\sum _{j=1}^{N_{p}}\frac{e^{W_{k}F_{d}^{j}}}{\sum _{m=1}^{N_{p}}e^{W_{k}F_{d}^{m}}}F_{d}^{j}, \end{aligned}$$
(4)
$$\begin{aligned} \beta _{i}=\varsigma (W_{c2}ReLU(LN(W_{c1}\alpha _{i})) \odot F_{d}), \end{aligned}$$
(5)
$$\begin{aligned} F_{fusion}=\varsigma (F_{RGB}\odot \beta ), \end{aligned}$$
(6)

where \(\alpha _{i}\) denotes the weight of the \(i^{th}\) channel used to obtain the global context features. \(F_{d}^{j}\) denotes the \(j^{th}\) position in the depth feature \(F_{depth}\), and \(N_{p}\) is the number of positions in the depth feature map (i.e., \(N_{p} = H\cdot W\)). \(W_{k}\), \(W_{c1}\) and \(W_{c2}\) denote 1 × 1 convolutional operations. LN denotes the LayerNorm operation after the convolution \(W_{c1}\), and ReLU is an activation function. \(\varsigma (\cdot )\) and \(\odot \) denote the sigmoid function and element-wise product, respectively. \(\beta _{i}\) indicates the depth pixel-wise attention map of the \(i^{th}\) channel of \(F_{RGB}\). \(F_{RGB}\) and \(F_{fusion}\) represent the input RGB feature and the output feature of the DAM, respectively. \(F_{fusion}\) is thus a DAM output carrying much more effective depth-induced, context-aware attention features. Furthermore, the experiments in Sect. 4.3 show that our DAM is capable of fusing depth features discriminatively and filtering out features that are mistakenly guided by depth cues.
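A sketch of the DAM following Eqs. (4)-(6) is given below: a context attention block pools the depth features over positions with softmax weights (Eq. 4), a channel attention block re-weights them (Eq. 5), and the result gates the RGB features (Eq. 6). The bottleneck ratio of 4 is an assumption.

```python
import torch
import torch.nn as nn

class DAMSketch(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.w_k = nn.Conv2d(channels, 1, kernel_size=1)           # context attention
        self.w_c1 = nn.Conv2d(channels, channels // reduction, 1)  # channel attention
        self.ln = nn.LayerNorm([channels // reduction, 1, 1])
        self.w_c2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, f_rgb, f_d):
        b, c, h, w = f_d.shape
        # Eq. (4): alpha = sum_j softmax(W_k F_d)_j * F_d^j  (global context vector)
        weights = torch.softmax(self.w_k(f_d).view(b, 1, h * w), dim=-1)    # (b, 1, HW)
        alpha = torch.bmm(f_d.view(b, c, h * w), weights.transpose(1, 2))   # (b, c, 1)
        alpha = alpha.view(b, c, 1, 1)
        # Eq. (5): beta = sigmoid(W_c2 ReLU(LN(W_c1 alpha)) * F_d)
        channel = self.w_c2(torch.relu(self.ln(self.w_c1(alpha))))
        beta = torch.sigmoid(channel * f_d)
        # Eq. (6): F_fusion = sigmoid(F_RGB * beta)
        return torch.sigmoid(f_rgb * beta)

dam = DAMSketch(128)
out = dam(torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32))  # -> (2, 128, 32, 32)
```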

As illustrated in Fig. 3, the inputs of our DAM are \(F_{RGB}^{i}\) and \(F_{Depth}^{i}\), \(i=1,2,3\), extracted from our FLM and DepthNet, respectively. Finally, a simple decoder is adopted for supervision. The decoder module contains two bilinear upsampling operations, each followed by 3 convolutional layers. The total loss L can be expressed as:

$$\begin{aligned} L = l_{f}\{Decoder(F_{fusion}^{3});gt\}, \end{aligned}$$
(7)

where \(F_{fusion}^{3}\) represents the output fusion feature of the third DAM and gt denotes the ground-truth map. The cross-entropy loss \( l_{f} \) is computed as:

$$\begin{aligned} l_{f}\{\hat{y};y\}=-\big (y\log \hat{y} + (1-y)\log (1-\hat{y})\big ), \end{aligned}$$
(8)

where y and \(\hat{y}\) denote the saliency ground-truth map and the predicted map, respectively.
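The supervision described above can be sketched as follows: a simple decoder with two bilinear upsamplings, each followed by three convolutions, and the per-pixel cross-entropy of Eq. (8). The channel widths inside the decoder are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderSketch(nn.Module):
    def __init__(self, in_c=128):
        super().__init__()
        def stage(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))
        self.stage1, self.stage2 = stage(in_c, 64), stage(64, 32)
        self.predict = nn.Conv2d(32, 1, 1)

    def forward(self, f_fusion):
        x = self.stage1(F.interpolate(f_fusion, scale_factor=2, mode='bilinear', align_corners=False))
        x = self.stage2(F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False))
        return torch.sigmoid(self.predict(x))

def saliency_loss(pred, gt, eps=1e-7):
    # Eq. (8): binary cross-entropy between the predicted map and the ground truth
    return -(gt * torch.log(pred + eps) + (1 - gt) * torch.log(1 - pred + eps)).mean()

pred = DecoderSketch()(torch.randn(1, 128, 64, 64))                  # -> (1, 1, 256, 256)
loss = saliency_loss(pred, torch.randint(0, 2, (1, 1, 256, 256)).float())
```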

Table 2. Quantitative comparisons of E-measure (\(E_{\gamma }\)), S-measure (\(S_{\lambda }\)), F-measure (\(F_{\beta }\)) and MAE (M) on 7 widely-used RGB-D datasets. The best three results are shown in \(\mathbf{boldface} \), bolditalic, italic fonts respectively. From top to bottom: the latest CNNs-based RGB-D methods and traditional RGB-D methods
Table 3. Continuation of Table 2

4 Experiments

4.1 Dataset

We perform our experiments on 7 public RGB-D datasets for fair comparisons, i.e., NJUD [23], NLPR [31], RGBD135 [8], STEREO [30], LFSD [27], DUT-RGBD [32], and SSD [25]. We split these datasets following [4, 6, 18] to guarantee fair comparisons. We randomly select 800 samples from DUT-RGBD, 1485 samples from NJUD and 700 samples from NLPR for training. The remaining images in these 3 datasets and the other 4 datasets are all used for testing to verify the generalization ability of saliency models. To prevent overfitting, we additionally augment the training set by flipping, cropping and rotating the images.
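The paper states only that flipping, cropping and rotating are used for augmentation; the sketch below is one hypothetical realization with torchvision, in which the flip probability, rotation range and crop scale are assumptions, and the same random parameters are applied to the RGB image, depth map and ground truth.

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def augment_pair(rgb, depth, gt, size=256):
    # horizontal flip with probability 0.5 (assumed)
    if random.random() < 0.5:
        rgb, depth, gt = TF.hflip(rgb), TF.hflip(depth), TF.hflip(gt)
    # small random rotation (range assumed)
    angle = random.uniform(-10, 10)
    rgb, depth, gt = (TF.rotate(x, angle) for x in (rgb, depth, gt))
    # random crop, resized back to the network input size
    i, j, h, w = T.RandomResizedCrop.get_params(rgb, scale=(0.8, 1.0), ratio=(1.0, 1.0))
    rgb, depth, gt = (TF.resized_crop(x, i, j, h, w, [size, size]) for x in (rgb, depth, gt))
    return rgb, depth, gt
```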

Fig. 5. Illustration of the six ablation experiments

Fig. 6. (a) Visualization of the feature maps in the FLM. \(B_{i}L_{j}\) denotes the output features of the corresponding block in Fig. 2. (b) Visualization of the effectiveness of the DAM. The \(4^{th}\) column (DAM b/f) and the \(5^{th}\) column (DAM a/f) show the feature maps before and after adopting the DAM, respectively

4.2 Experimental Setup

Evaluation Metrics. To comprehensively evaluate the various methods, we adopt 4 evaluation metrics: F-measure (\(F_{\beta }\)) [1], mean absolute error (M) [3], S-measure (\(S_{\lambda }\)) [13], and E-measure (\(E_{\gamma }\)) [14]. Specifically, the F-measure evaluates the overall performance. The M represents the average absolute difference between the saliency map and the ground truth. The recently proposed S-measure evaluates structural similarities. The E-measure jointly captures image-level statistics and local pixel-matching information.
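Two of these metrics can be sketched as follows, under common conventions rather than the exact evaluation scripts used here: MAE is the mean absolute difference between the prediction and the ground truth, and the F-measure uses \(\beta^{2}=0.3\) with an adaptive threshold of twice the mean saliency value, following [1].

```python
import numpy as np

def mae(pred, gt):
    # pred and gt are float arrays in [0, 1]
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3, eps=1e-8):
    thresh = min(2 * pred.mean(), 1.0)              # adaptive threshold (twice the mean)
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + eps)
    recall = tp / ((gt > 0.5).sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```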

Implementation Details. Our method is implemented with the PyTorch toolbox and trained on a PC with an RTX 2080Ti GPU and 16 GB memory. The input images are uniformly resized to 256 × 256. The momentum, weight decay, batch size and learning rate of our network are set to 0.9, 0.0005, 2 and 1e-10, respectively. During training, we use the cross-entropy loss described in Sect. 3.4, and the network converges after 60 epochs.
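A hedged sketch of this training setup with the reported hyper-parameters follows; the use of SGD is an assumption (the optimizer is not stated), and a toy one-stream network plus synthetic data stand in for the full model and datasets so the loop is self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy stand-in network; the real model is the asymmetric two-stream architecture of Sect. 3
model = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(inplace=True),
                      nn.Conv2d(16, 1, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=1e-10, momentum=0.9, weight_decay=0.0005)

# two synthetic RGB-D batches (batch size 2, 256 x 256) standing in for the training set
batches = [(torch.randn(2, 3, 256, 256), torch.randn(2, 1, 256, 256),
            torch.randint(0, 2, (2, 1, 256, 256)).float()) for _ in range(2)]

for epoch in range(60):                                  # convergence after 60 epochs
    for rgb, depth, gt in batches:
        pred = model(torch.cat([rgb, depth], dim=1))     # the real model takes rgb and depth separately
        loss = F.binary_cross_entropy(pred, gt)          # cross-entropy loss of Sect. 3.4
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```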

4.3 Ablation Analysis

Effect of FLM. We adopt the commonly used two-stream VGG-19 network fused by direct summation as our baseline (denoted as 'B[s]', shown in Fig. 5(a)). To verify the effectiveness of the FLM, we employ the FLM in both the RGB and depth streams ('B+FLM[s]', shown in Fig. 5(b)). The experimental results of (a) and (b) in Table 4 clearly demonstrate that our FLM obtains impressive performance gains. Moreover, as shown in Fig. 7, after employing the FLM the saliency maps achieve sharper boundaries as well as finer structures. Furthermore, to analyze the working mechanism of the FLM, we visualize the output features of each block in the FLM. As shown in Fig. 6(a), branches 4 and 3 extract the global location information while branches 2 and 1 preserve more local detail information. This benefits from the evolutionary refinement of salient regions with finer details.

Table 4. Ablation analysis on 7 datasets. The [s] and [a] following the modules represent the symmetric and asymmetric architectures, respectively. Obviously, each component of our architecture can achieve considerable accuracy gains. (a), (b), (c), (d), (e), (f) represent the modules indexed by the corresponding letters in Fig. 5
Fig. 7. Visual comparisons of ablation analyses. (a), (b), (c), (d), (e), (f) represent the visual results of the experiments indexed by the corresponding letters in Fig. 5

Effect of DAM. We conduct contrast experiments to verify the effectiveness of our DAM on both symmetric and asymmetric architectures. For the symmetric architecture, we replace the simple summation with the DAM on our baseline (denoted as 'B+DAM[s]', as shown in Fig. 5(c)). From the results of (a) and (c) in Table 4 we can see that the MAE is reduced by 18% on the NLPR dataset after employing the DAM, which intuitively verifies its effect. Meanwhile, the corresponding visual results in Fig. 7 also illustrate that our DAM can fuse depth features discriminatively and filter out features that are mistakenly guided by depth cues. On the other hand, we employ the FLM in the RGB stream and replace the VGG-19 backbone with the DepthNet in the depth stream (denoted as 'B+FLM[a]', as shown in Fig. 5(d)). We then adopt the DAM on 'B+FLM[a]' to verify the effect of the DAM on the asymmetric architecture (denoted as 'B+FLM+DAM[a]', shown in Fig. 5(e)). The comparison results of (d) and (e) in Table 4 demonstrate the effectiveness of the DAM on the asymmetric architecture over all datasets. Additionally, we visualize the feature maps of our two-stream asymmetric architecture before and after adopting the DAM. As shown in Fig. 6(b), the salient regions are emphasized after adopting the DAM, which significantly improves detection accuracy.

Effect of Asymmetric Architecture. To illustrate the effectiveness of adopting an asymmetric architecture, we compare the results of (b) and (d) in Fig. 5. Furthermore, for a fair comparison, we adopt our FLM and DAM on the two-stream symmetric network (denoted as 'B+FLM+DAM[s]', as shown in Fig. 5(f)). As we can see from Table 4 (Asymmetric), the asymmetric architecture achieves performance comparable to the symmetric architecture, but with a much smaller size. Specifically, the asymmetric architecture reduces the model size by 47% (128.9 MB vs. 244.4 MB). Based on this observation, we consider it unnecessary to utilize a network as large as the RGBNet for extracting depth features, and we replace it with a more lightweight network.

4.4 Comparison with State-of-the-Art

Considering that most existing approaches are based on the VGG network, we adopt VGG as our backbone for fair comparisons. We compare our model with 13 RGB-D-based salient object detection models, including 8 CNNs-based methods: CPFP [46], DMRA [32], MMCI [6], TANet [5], PDNet [48], PCA [4], CTMF [18], DF [34], and 5 traditional methods: MB [49], CDCP [50], DCMC [9], NLPR [31], DES [8]. For fair comparisons, the results of the competing methods are generated by authorized codes or directly provided by the authors.

Quantitative Evaluation. Tables 2 and 3 show the validation results in terms of 4 evaluation metrics on 7 datasets. As we can see, our model significantly outperforms all other methods. Notably, our approach outperforms all other methods by a dramatic margin on the DUT-RGBD, NJUD and RGBD135 datasets, which are considered more challenging due to their large number of complex scenes such as similar foreground and background, low contrast and transparent objects. This further indicates that our model generalizes to various challenging scenes.

Qualitative Evaluation. We also visually compare our method with the most representative methods, as shown in Fig. 4. From these results, we can observe that our saliency maps are closer to the ground truths. For instance, other methods have trouble distinguishing salient objects in complex environments such as cluttered backgrounds (see the \(1^{st}\) row), while ours precisely identifies the whole object with exquisite details. Our model also locates and detects the entire salient object with sharp details more accurately than others in more challenging scenes such as low contrast (see the \(2^{nd}\) - \(3^{rd}\) rows), transparent objects (see the \(8^{th}\) row), and multiple or small objects (see the \(5^{th}\) - \(7^{th}\) rows). These results further verify the effectiveness and robustness of our proposed model.

Table 5. Complexity comparisons on two datasets. The best three results are shown in boldface, bolditalic, italic fonts respectively

Complexity Evaluation. We compare the model size and execution time of our method with those of 7 representative models, as shown in Table 5. Our method achieves the smallest model size and the second-highest FPS. To be specific, the model size of our architecture is only 128.9 MB, which is 2/3 of the previous minimum model size (PDNet). Compared with the best performing method, DMRA, our architecture reduces the model size by 46% and boosts the FPS by 109%. Besides, we achieve a high running speed of 46 frames per second (FPS) compared with the representative approaches.
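As a point of reference, model size and FPS can be measured as sketched below, under the convention of counting parameters as 32-bit floats and timing single-batch forward passes; this is not the benchmarking script used for Table 5.

```python
import time
import torch
import torch.nn as nn

def model_size_mb(model: nn.Module) -> float:
    # parameter count at 32-bit floats, reported in MB
    return sum(p.numel() for p in model.parameters()) * 4 / 1024 ** 2

@torch.no_grad()
def frames_per_second(model: nn.Module, rgb: torch.Tensor, depth: torch.Tensor, runs: int = 50) -> float:
    # assumes a two-input model(rgb, depth); averages the wall-clock time over several runs
    model.eval()
    start = time.time()
    for _ in range(runs):
        model(rgb, depth)
    return runs / (time.time() - start)
```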

5 Conclusion

In this paper, we propose an asymmetric two-stream architecture that takes into account the inherent differences between RGB and depth data for saliency detection. For the RGB stream, we introduce a flow ladder module (FLM) that effectively extracts rich global context information while preserving local saliency details. We also design a lightweight DepthNet for the depth stream with a small model size of 6.7 MB. Besides, we propose a depth attention module (DAM) that ensures the depth cues discriminatively guide the RGB features to precisely locate salient objects. Our approach significantly advances the state-of-the-art methods on widely used datasets and is capable of precisely capturing salient regions in challenging scenes.