
1 Introduction

Salient object detection, which aims to identify the visually most interesting regions of an image, is a well-researched domain of computer vision. It serves as an essential pre-processing step for various visual tasks such as image retrieval [7, 15, 17, 28], visual tracking [2, 20, 38], object segmentation [12, 39, 40, 42, 43], and object recognition [10, 36, 37], and therefore makes an important contribution towards sustainable development.

A majority of existing works [21, 26] for saliency detection operate on RGB images. While RGB-based saliency detection methods have achieved great success, appearance features in RGB data are less discriminative in some challenging scenes, such as those with multiple or transparent objects, similar foreground and background, complex backgrounds, or low-intensity environments.

The depth cue carries strong discriminative power regarding location and spatial structure, which has been proved beneficial to accurate saliency prediction [35]. Moreover, paired depth data for RGB natural images are widely available with the advent of depth sensors, e.g., Kinect and Lytro Illum. Consequently, using depth information has gained growing interest in saliency detection.

Fig. 1. The comparison of predicted maps between our method and two top-ranking RGB-D-based methods, i.e., DMRA [32] and CPFP [46], on salient object details. The \(1^{st}\) and \(4^{th}\) rows are enlarged views of the red box areas of the middle two rows, showing the superior performance of our method on saliency details

Most RGB-D-based methods utilize symmetric two-stream architectures for extracting RGB and depth features [4, 6, 18, 32]. However, we observe that while RGB data contain rich color, texture, and contour information but only limited location cues, grayscale depth data provide complementary information such as spatial structure and 3D layout. In consequence, a symmetric two-stream network may overlook the inherent differences between RGB and depth data. Asymmetric architectures have been adopted in a few works to extract RGB and depth features, taking the differences between the two modalities into account. Zhu et al. [48] present an architecture composed of a master network for processing RGB values and a sub-network making full use of depth cues, which incorporates depth-based features into the master network via direct concatenation. Zhao et al. [46] incorporate the contrast prior to enhance the depth maps and then integrate them into the RGB stream for saliency detection. However, simple fusion strategies such as direct concatenation or summation are less adaptive for locating salient objects, given the myriad possible positions of salient objects in the real world. Overall, the above methods overlook the fact that the depth cue contributes differently to salient object prediction in different scenes. Furthermore, existing RGB-D methods inevitably suffer from a loss of detail information [16, 41] because they adopt strides and pooling operations in the RGB and depth streams. An intuitive solution is to use skip-connections [22] or short-connections [21] to reconstruct the detail information. Although these strategies have brought satisfactory improvements, they remain limited in predicting complete structures with fine details.

Building on the above observations, we take a further step towards accurate saliency detection with an asymmetric two-stream model. The primary challenge is to effectively extract rich global context information while preserving local saliency details. The second challenge is to effectively exploit the discriminative power of depth features to guide the RGB features in locating salient objects accurately.

To confront these challenges, we propose an asymmetric two-stream architecture as illustrated in Fig. 2. Concretely, our contributions are:

  • We design a flow ladder module (FLM) and a lightweight depth network (DepthNet) with a small model size of 6.7 MB. Instead of adopting skip-connections or short-connections, our FLM effectively extracts local detail information (see Fig. 1) and global context information through a local-global evolutionary fusion flow for accurate saliency detection.

  • We propose a novel depth attention module (DAM) to ensure that the depth features can effectively guide the RGB features by using the discriminative power of depth cues. Its effectiveness has been experimentally verified (see Table 4).

  • Furthermore, we conduct extensive experiments on 7 datasets and demonstrate that our method achieves consistently superior performance over 13 state-of-the-art RGB-D approaches in terms of 4 evaluation metrics. Numerically, our approach reduces the MAE by nearly 33% on the DUT-RGBD dataset. In addition, our method reduces the model size by 33% compared with the smallest existing model (PDNet) and achieves the second-fastest running speed of 46 FPS.

2 Related Work

RGB-D Saliency Detection. Although many RGB-based saliency detection methods have achieved appealing performance [16, 29, 33, 44, 45, 47], they may fail to accurately detect salient areas because appearance features in RGB data are less predictive in complex scenes, such as low-contrast scenes, transparent objects, foregrounds sharing similar content with the background, multiple objects, and cluttered backgrounds. With the advent of consumer-grade depth cameras such as Kinect, light field cameras, and lidars, depth cues with a wealth of geometric and structural information are widely used in salient object detection (SOD).

Existing RGB-D saliency detection methods can be broadly classified into two categories. Traditional methods [8, 9, 11, 24, 31, 35, 49, 50]: Ren et al. [35] propose a two-stage RGB-D saliency detection framework using the validity of global priors. Lang et al. [24] introduce a depth prior into the saliency detection model to improve detection performance. Desingh et al. [11] use non-linear regression to combine an RGB-D saliency detection model with an RGB model to measure saliency values. CNNs-based methods [4,5,6, 18, 32, 34, 46, 48]: to better mine salient information in challenging scenes, CNNs-based methods combine depth information with RGB information for more accurate results. Symmetric two-stream architectures, which extract RGB and depth representations equally, have been studied for a long time [4,5,6, 18, 32]. Han et al. [18] design a symmetric architecture that automatically fuses the deep representations of the depth and RGB views to obtain the final saliency map. Chen et al. [6] utilize two-stream CNNs-based models that introduce cross-modal interactions in multiple layers by direct summation. Recently, several asymmetric architectures have been proposed for processing the different data types [46, 48]. Zhao et al. [46] use enhanced depth information as an auxiliary cue and adopt a pyramid decoding structure to obtain more accurate salient regions.

Because of the inherent differences between RGB and depth information, classic symmetric two-stream architectures and simple fusion strategies may lead to inaccurate predictions. Besides, the strides and pooling operations adopted in existing RGB-D-based methods for downsampling inevitably result in information loss. To address these issues, in this work we design an asymmetric network and adaptively fuse RGB and depth information through a depth attention mechanism for precise saliency detection.

Fig. 2. The overall architecture of our proposed approach. Our asymmetric architecture consists of three parts, i.e., the RGBNet, the DAM, and the lightweight DepthNet. The RGBNet includes a VGG-19 backbone and a flow ladder module. For the depth stream, we also employ the same backbone as the RGBNet. The black arrows represent the information flows

3 The Proposed Method

The overall architecture of our proposed method is shown in Fig. 2. In this section, we first describe the overall architecture in Sect. 3.1, then introduce the DepthNet in Sect. 3.2, the flow ladder module in Sect. 3.3, and finally the proposed depth attention module in Sect. 3.4.

Table 1. Details of our DepthNet architecture: k denotes the kernel size, s the stride, chns the number of input/output channels for each layer, p the padding, and in/out the input and output feature sizes

3.1 The Overall Architecture

Considering that most RGB-D-based methods utilizing symmetric two-stream architectures overlook the inherent differences between RGB and depth data, we propose an asymmetric two-stream architecture, as illustrated in Fig. 2. Our two-stream architecture consists of a lightweight depth stream and an RGB stream with a flow ladder module, namely the DepthNet and the RGBNet, respectively. For the depth stream, we design a lightweight architecture as shown in Table 1. The extracted depth features are then fed into the RGB stream through a depth attention mechanism (DAM, see Fig. 3) to generate complementary features with rich location and spatial-structure information. For the RGB stream, we adopt the commonly used VGG-19 architecture as our baseline. On top of this baseline, we propose a novel flow ladder module (FLM) that preserves detail information while receiving global location information from the representations of other, vertically parallel branches in an evolutionary way, which benefits the localization of salient regions and yields considerable performance gains.
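To make the data flow concrete, the following is a minimal sketch of the asymmetric two-stream idea, not the authors' implementation: the RGB stream, depth stream, attention fusion and decoder are stood in by small placeholder layers, and all channel sizes are assumptions.

```python
# Minimal sketch of an asymmetric two-stream RGB-D pipeline (toy stand-ins only).
import torch
import torch.nn as nn

def conv_block(in_c, out_c):
    return nn.Sequential(nn.Conv2d(in_c, out_c, 3, padding=1), nn.ReLU(inplace=True))

class ToyAsymmetricRGBD(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.rgb_stream = conv_block(3, c)    # stands in for VGG-19 + FLM
        self.depth_stream = conv_block(1, c)  # stands in for the lightweight DepthNet
        self.fuse = nn.Conv2d(c, c, 1)        # stands in for the DAM
        self.head = nn.Conv2d(c, 1, 1)        # stands in for the decoder

    def forward(self, rgb, depth):
        f_rgb = self.rgb_stream(rgb)
        f_dep = self.depth_stream(depth)
        # depth features gate the RGB features (simplified attention-style fusion)
        f_fused = f_rgb * torch.sigmoid(self.fuse(f_dep))
        return torch.sigmoid(self.head(f_fused))

rgb = torch.randn(1, 3, 256, 256)
depth = torch.randn(1, 1, 256, 256)
saliency = ToyAsymmetricRGBD()(rgb, depth)    # -> (1, 1, 256, 256)
```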

3.2 DepthNet

Compared with RGB data, which contain richer color and texture information, depth cues focus on spatial location information. A large number of parameters in a complex depth-extraction network are therefore redundant, so we consider it unnecessary to process depth data with a network as large as the RGBNet. The ablation experiments on symmetric and asymmetric architectures in Sect. 4.3 also confirm this claim. As illustrated in Fig. 2, we adopt a detail-transfer architecture for the depth stream (see Table 1 for the detailed specification) and take the original depth maps as input. Our DepthNet transfers detail information throughout the whole architecture to capture fine spatial details. Considering the differences between RGB and depth data, numerous redundant channels of depth features are unnecessary. Therefore, we prune the number of feature channels to 32 in Conv3 and Conv4 and to 128 in the final Conv, which yields a more lightweight DepthNet with a model size of 6.7 MB.
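A hedged sketch of such a lightweight, detail-transferring depth stream is given below. Since Table 1 is not reproduced here, every kernel and layer choice is an assumption; only the pruned channel widths (32 in Conv3/Conv4, 128 in the final conv) come from the text, and stride-1 convolutions are used so the spatial resolution of the depth cues is preserved.

```python
import torch
import torch.nn as nn

class DepthNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        def conv(in_c, out_c, k=3):
            return nn.Sequential(nn.Conv2d(in_c, out_c, k, stride=1, padding=k // 2),
                                 nn.ReLU(inplace=True))
        self.conv1 = conv(1, 32)     # single-channel depth map in (width assumed)
        self.conv2 = conv(32, 32)    # width assumed
        self.conv3 = conv(32, 32)    # pruned to 32 channels (from the text)
        self.conv4 = conv(32, 32)    # pruned to 32 channels (from the text)
        self.conv5 = conv(32, 128)   # final conv, 128 channels (from the text)

    def forward(self, depth):
        x = self.conv1(depth)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        return self.conv5(x)         # depth features handed to the DAMs

# rough parameter count of the sketch (the real DepthNet is reported as 6.7 MB)
print(sum(p.numel() for p in DepthNetSketch().parameters()))
```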

Fig. 3. Illustration of the depth attention module (DAM). The images above \(F_{out}\) are the corresponding original RGB image and ground truth

3.3 RGBNet

Deeper networks are able to extract richer high-level information such as location and semantic information, but the strides and pooling operations widely used in existing RGB-D-based methods may cause a loss of detail information, such as boundaries and small objects, for saliency detection. A straightforward solution to this issue is to combine low-level features with high-level features via skip-connections [22]. However, low-level features have less discriminative and predictive power in complex scenes, which hinders accurate saliency detection. Hence, we design a novel RGBNet consisting of a VGG backbone (for fair comparison) and a flow ladder module that preserves local detail information by constructing four detail-transfer branches and fuses global location information in an evolutionary way. To fit our task, we truncate the last three fully-connected layers and keep the five convolution blocks as well as all pooling operations of VGG-19. The FLM preserves the resolution of representations at multiple scales and levels, ensuring that both local detail information and global location information contribute to the precision of saliency detection. More details are described as follows.

To alleviate the loss of detail information, we design a flow ladder module (FLM). This module is applied to VGG-19 and integrates four detail-transfer branches via a local-global evolutionary fusion flow. We design the detail-transfer branches to preserve saliency details. As shown in Fig. 2, the first two branches consist of 3 layers, while the number of layers in the \(3^{rd}\) and \(4^{th}\) branches is reduced to 2 and 1, respectively. Specifically, we denote the \(j^{th}\) layer of the \(i^{th}\) branch as \(B_{i}L_{j}\), \(i \in [1,4]\), \(j \in [1,3]\). Each \(B_{i}L_{j}\) is composed of four basic blocks [19], each of which consists of two convolutional layers, as shown at the top of Fig. 2. Our FLM thus consists of 4 evolved detail-transfer branches. Instead of adopting strides and pooling operations, our FLM preserves the resolution of representations with more details in each branch by employing convolutional operations with a stride of 1.
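A sketch of one such \(B_{i}L_{j}\) unit follows: four stacked basic blocks, each with two stride-1, 3×3 convolutions so that the feature resolution is preserved. The residual connection and BatchNorm follow the basic-block design of [19]; the channel width is an assumption.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # residual connection, no downsampling

def BiLj(channels=32):
    # one layer of one detail-transfer branch: four stacked basic blocks
    return nn.Sequential(*[BasicBlock(channels) for _ in range(4)])
```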

We design a novel local-global evolutionary fusion flow for integrating multi-scale local and global features extracted from the detail-transfer branches. Each branch receives rich information from the other, vertically parallel representations through this fusion flow. In this way, rich global context representations are generated while more local saliency details are preserved. Specifically, the representations of the deeper branches are fused into the shallower branches by upsampling and summation, while the representations of the shallower branches are fused into the deeper branches by downsampling and summation, as shown in the FLM of Fig. 2. Through this evolution between branches (shown in Fig. 2), the local detail information and the global context information are effectively combined, which benefits the precision of saliency detection. The whole fusion process is described by the following equations:

$$\begin{aligned} B_{i}L_{j}=\left\{ \begin{array}{ll} trans(Conv2), &{} i=1,\ j=1\\ trans(Conv(i+1)), &{} i=j+1,\ j\in [1,3]\\ \sum \nolimits _{n=1}^{j}f(B_{n}L_{j-1}), &{} i\in [1,j],\ j\in [2,3] \end{array}\right. \end{aligned}$$
(1)
$$\begin{aligned} F_{RGB}^{j}=\sum _{n=1}^{j+1}f(B_{n}L_{j}), \quad j\in [1,2], \end{aligned}$$
(2)
$$\begin{aligned} F_{RGB}^{3}=cat(f(B_{n}L_{3})), \quad n\in [1,4], \end{aligned}$$
(3)

where \(B_{i}\) and \(L_{j}\) denote the \(i^{th}\) branch and \(j^{th}\) layer, respectively. \(f(\cdot )\) denotes \((n-i)\)-times up-sampling when \(n>i\) and \((i-n)\)-times down-sampling when \(n<i\); when \(n\) equals \(i\), \(f(\cdot )\) is the identity. \(Conv(i)\) denotes the output features of the \(i^{th}\) Conv block of VGG-19, and \(trans(\cdot )\) is a convolutional layer that transforms the number of channels. \(cat(\cdot )\) denotes concatenating all features together. The final output of our FLM, namely \(F_{RGB}^{3}\), is a concatenation of multi-scale features extracted from the four branches. In conclusion, features carrying local and global information are transferred to the parallel branches in an evolutionary way. Our proposed FLM not only alleviates the loss of object detail information but also effectively integrates multi-scale and multi-level features for precise saliency prediction.
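The fusion step between branches can be sketched as below, following Eqs. (1)-(3): \(f(\cdot)\) resamples branch-n features to branch-i's resolution (interpreted here as a factor-2 change per branch level, which is an assumption), and the parallel branches are either summed (Eq. 2) or concatenated at the last layer (Eq. 3).

```python
import torch
import torch.nn.functional as F

def f(feat, n, i):
    """Resample features of branch n to the resolution of branch i."""
    if n == i:
        return feat
    scale = 2.0 ** (n - i)          # n > i: upsample, n < i: downsample
    return F.interpolate(feat, scale_factor=scale, mode='bilinear', align_corners=False)

def fuse_sum(branch_feats, i):
    # Eq. (2)-style fusion: sum all parallel branches, resampled to branch i
    return sum(f(feat, n, i) for n, feat in enumerate(branch_feats, start=1))

def fuse_cat(branch_feats, i=1):
    # Eq. (3)-style fusion: concatenate all branches at branch i's resolution
    return torch.cat([f(feat, n, i) for n, feat in enumerate(branch_feats, start=1)], dim=1)

# toy usage: 4 branches at resolutions 64, 32, 16, 8 with 32 channels each (assumed)
feats = [torch.randn(1, 32, 64 // 2 ** k, 64 // 2 ** k) for k in range(4)]
print(fuse_cat(feats).shape)        # -> (1, 128, 64, 64)
```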

Fig. 4. Comparisons of our method with state-of-the-art CNNs-based methods. These methods rank highest in the quantitative evaluation. Our results are clearly more consistent with the ground truths (GT), especially in complex scenes

3.4 Depth Attention Module

Variations in the positions of objects in the real world make linear fusion strategies of RGB and depth data less adaptive to complex scenes. To take full advantage of the depth cues, whose discriminative power lies in location and spatial structure, we design a depth attention module that adaptively fuses the RGB and depth representations, as shown in Fig. 3. First, since the depth features contain abundant spatial and structural information, we utilize a context attention block, which contains a 1 × 1 convolutional layer \(W_{k}\) and a softmax function, to extract the salient location cues more precisely, instead of applying simple fusion such as summation or concatenation. A matrix multiplication then aggregates all location features to generate the attention weight of each channel i (\(i.e., \alpha _{i}\)), capturing pixel-wise spatial dependencies. Moreover, the degree of response to the salient regions varies between the features of different channels. We therefore adopt a channel-wise attention block, which contains two 1 × 1 convolutional layers \(W_{c}\) and a LayerNorm function, to capture the inter-dependencies between channels and obtain a weighted depth feature \(\beta \). We then use an element-wise product to fuse \(\beta \) into the RGB stream, which guides the RGB features at the pixel level to distinguish the foreground from the background thoroughly. Furthermore, the ablation experiments in Sect. 4.3 verify the effectiveness of our DAM compared with simple fusion, and the visual results in Fig. 6(b) show that the salient regions are emphasized through the attention mechanism.

The details of these three blocks can be formulated as the following equations:

$$\begin{aligned} \alpha _{i}=\sum _{j=1}^{N_{p}}\frac{e^{W_{k}F_{d}^{j}}}{\sum _{m=1}^{N_{p}}e^{W_{k}F_{d}^{m}}}F_{d}^{j}, \end{aligned}$$
(4)
$$\begin{aligned} \beta _{i}=\varsigma (W_{c2}ReLU(LN(W_{c1}\alpha _{i})) \odot F_{d}), \end{aligned}$$
(5)
$$\begin{aligned} F_{fusion}=\varsigma (F_{RGB}\odot \beta ), \end{aligned}$$
(6)

where \(\alpha _{i}\) denotes the weight of the \(i^{th}\) channel used to obtain the global context features. \(F_{d}^{j}\) denotes the \(j^{th}\) position in the depth feature \(F_{depth}\), and \(N_{p}\) is the number of positions in the depth feature map (i.e., \(N_{p} = H\cdot W\)). \(W_{k}\), \(W_{c1}\) and \(W_{c2}\) denote 1 × 1 convolutional operations. LN denotes the LayerNorm operation after the convolution \(W_{c1}\), and ReLU is an activation function. \(\varsigma (\cdot )\) and \(\odot \) denote the sigmoid function and element-wise product, respectively. \(\beta _{i}\) indicates the depth pixel-wise attention map of the \(i^{th}\) channel of \(F_{RGB}\). \(F_{RGB}\) and \(F_{fusion}\) represent the input RGB feature and the output feature of the DAM, respectively. \(F_{fusion}\) is thus a DAM output carrying much more effective depth-induced, context-aware attention features. Furthermore, the experiments in Sect. 4.3 show that our DAM is capable of fusing depth features discriminatively and filtering out features that are mistakenly guided by depth cues.
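A sketch of the DAM following Eqs. (4)-(6) is given below: a context attention block pools the depth features over positions with softmax weights (Eq. 4), a channel attention block re-weights them (Eq. 5), and the result gates the RGB features (Eq. 6). The bottleneck ratio of 4 is an assumption.

```python
import torch
import torch.nn as nn

class DAMSketch(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.w_k = nn.Conv2d(channels, 1, kernel_size=1)           # context attention
        self.w_c1 = nn.Conv2d(channels, channels // reduction, 1)  # channel attention
        self.ln = nn.LayerNorm([channels // reduction, 1, 1])
        self.w_c2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, f_rgb, f_d):
        b, c, h, w = f_d.shape
        # Eq. (4): alpha = sum_j softmax(W_k F_d)_j * F_d^j  (global context vector)
        weights = torch.softmax(self.w_k(f_d).view(b, 1, h * w), dim=-1)    # (b, 1, HW)
        alpha = torch.bmm(f_d.view(b, c, h * w), weights.transpose(1, 2))   # (b, c, 1)
        alpha = alpha.view(b, c, 1, 1)
        # Eq. (5): beta = sigmoid(W_c2 ReLU(LN(W_c1 alpha)) * F_d)
        channel = self.w_c2(torch.relu(self.ln(self.w_c1(alpha))))
        beta = torch.sigmoid(channel * f_d)
        # Eq. (6): F_fusion = sigmoid(F_RGB * beta)
        return torch.sigmoid(f_rgb * beta)

dam = DAMSketch(128)
out = dam(torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32))  # -> (2, 128, 32, 32)
```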

As illustrated in Fig. 3, the inputs of our DAM are \(F_{RGB}^{i}\) and \(F_{Depth}^{i}\), \(i=1,2,3\), extracted from our FLM and DepthNet, respectively. Finally, a simple decoder is adopted for supervision. The decoder module contains two bilinear upsampling operations, each followed by 3 convolutional layers. The total loss L can be expressed as:

$$\begin{aligned} L = l_{f}\{Decoder(F_{fusion}^{3});gt\}, \end{aligned}$$
(7)

where \(F_{fusion}^{3}\) represents the output fusion feature of the third DAM and gt denotes the ground-truth map. The cross-entropy loss \( l_{f} \) is computed as:

$$\begin{aligned} l_{f}\{\hat{y};y\}=-\big (y\log \hat{y} + (1-y)\log (1-\hat{y})\big ), \end{aligned}$$
(8)

where y and \(\hat{y}\) denote the saliency ground-truth map and the predicted map, respectively.
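The supervision described above can be sketched as follows: a simple decoder with two bilinear upsamplings, each followed by three convolutions, and the per-pixel cross-entropy of Eq. (8). The channel widths inside the decoder are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderSketch(nn.Module):
    def __init__(self, in_c=128):
        super().__init__()
        def stage(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))
        self.stage1, self.stage2 = stage(in_c, 64), stage(64, 32)
        self.predict = nn.Conv2d(32, 1, 1)

    def forward(self, f_fusion):
        x = self.stage1(F.interpolate(f_fusion, scale_factor=2, mode='bilinear', align_corners=False))
        x = self.stage2(F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False))
        return torch.sigmoid(self.predict(x))

def saliency_loss(pred, gt, eps=1e-7):
    # Eq. (8): binary cross-entropy between the predicted map and the ground truth
    return -(gt * torch.log(pred + eps) + (1 - gt) * torch.log(1 - pred + eps)).mean()

pred = DecoderSketch()(torch.randn(1, 128, 64, 64))                  # -> (1, 1, 256, 256)
loss = saliency_loss(pred, torch.randint(0, 2, (1, 1, 256, 256)).float())
```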

Table 2. Quantitative comparisons of E-measure (\(E_{\gamma }\)), S-measure (\(S_{\lambda }\)), F-measure (\(F_{\beta }\)) and MAE (M) on 7 widely-used RGB-D datasets. The best three results are shown in \(\mathbf{boldface} \), bolditalic, italic fonts respectively. From top to bottom: the latest CNNs-based RGB-D methods and traditional RGB-D methods
Table 3. Continuation of Table 2

4 Experiments

4.1 Dataset

We perform our experiments on 7 public RGB-D datasets for fair comparisons, i.e., NJUD [23], NLPR [31], RGBD135 [8], STEREO [30], LFSD [27], DUT-RGBD [32], and SSD [25]. We split these datasets following [4, 6, 18] to guarantee fair comparisons. We randomly select 800 samples from DUT-RGBD, 1485 samples from NJUD and 700 samples from NLPR for training. The remaining images in these 3 datasets and the other 4 datasets are all used for testing to verify the generalization ability of saliency models. To prevent overfitting, we additionally augment the training set by flipping, cropping and rotating the images.
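The paper states only that flipping, cropping and rotating are used for augmentation; the sketch below is one hypothetical realization with torchvision, in which the flip probability, rotation range and crop scale are assumptions, and the same random parameters are applied to the RGB image, depth map and ground truth.

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def augment_pair(rgb, depth, gt, size=256):
    # horizontal flip with probability 0.5 (assumed)
    if random.random() < 0.5:
        rgb, depth, gt = TF.hflip(rgb), TF.hflip(depth), TF.hflip(gt)
    # small random rotation (range assumed)
    angle = random.uniform(-10, 10)
    rgb, depth, gt = (TF.rotate(x, angle) for x in (rgb, depth, gt))
    # random crop, resized back to the network input size
    i, j, h, w = T.RandomResizedCrop.get_params(rgb, scale=(0.8, 1.0), ratio=(1.0, 1.0))
    rgb, depth, gt = (TF.resized_crop(x, i, j, h, w, [size, size]) for x in (rgb, depth, gt))
    return rgb, depth, gt
```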

Fig. 5. Illustration of the six ablation experiments

Fig. 6. (a) Visualization of the feature maps in the FLM. \(B_{i}L_{j}\) denotes the output features of the corresponding block in Fig. 2. (b) Visualization of the effectiveness of the DAM. The \(4^{th}\) column (DAM b/f) and the \(5^{th}\) column (DAM a/f) show the feature maps before and after adopting the DAM, respectively

4.2 Experimental Setup

Evaluation Metrics. To comprehensively evaluate the various methods, we adopt 4 evaluation metrics: F-measure (\(F_{\beta }\)) [1], mean absolute error (M) [3], S-measure (\(S_{\lambda }\)) [13], and E-measure (\(E_{\gamma }\)) [14]. Specifically, the F-measure evaluates the overall performance. The M represents the average absolute difference between the saliency map and the ground truth. The recently proposed S-measure evaluates structural similarities. The E-measure jointly captures image-level statistics and local pixel-matching information.
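Two of these metrics can be sketched as follows, under common conventions rather than the exact evaluation scripts used here: MAE is the mean absolute difference between the prediction and the ground truth, and the F-measure uses \(\beta^{2}=0.3\) with an adaptive threshold of twice the mean saliency value, following [1].

```python
import numpy as np

def mae(pred, gt):
    # pred and gt are float arrays in [0, 1]
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3, eps=1e-8):
    thresh = min(2 * pred.mean(), 1.0)              # adaptive threshold (twice the mean)
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + eps)
    recall = tp / ((gt > 0.5).sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```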

Implementation Details. Our method is implemented with the PyTorch toolbox and trained on a PC with an RTX 2080Ti GPU and 16 GB memory. The input images are uniformly resized to 256 × 256. The momentum, weight decay, batch size and learning rate of our network are set to 0.9, 0.0005, 2 and 1e-10, respectively. During training, we use the cross-entropy loss described in Sect. 3.4, and the network converges after 60 epochs.
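A hedged sketch of this training setup with the reported hyper-parameters follows; the use of SGD is an assumption (the optimizer is not stated), and a toy one-stream network plus synthetic data stand in for the full model and datasets so the loop is self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy stand-in network; the real model is the asymmetric two-stream architecture of Sect. 3
model = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(inplace=True),
                      nn.Conv2d(16, 1, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=1e-10, momentum=0.9, weight_decay=0.0005)

# two synthetic RGB-D batches (batch size 2, 256 x 256) standing in for the training set
batches = [(torch.randn(2, 3, 256, 256), torch.randn(2, 1, 256, 256),
            torch.randint(0, 2, (2, 1, 256, 256)).float()) for _ in range(2)]

for epoch in range(60):                                  # convergence after 60 epochs
    for rgb, depth, gt in batches:
        pred = model(torch.cat([rgb, depth], dim=1))     # the real model takes rgb and depth separately
        loss = F.binary_cross_entropy(pred, gt)          # cross-entropy loss of Sect. 3.4
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```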

4.3 Ablation Analysis

Effect of FLM. We adopt the commonly used two-stream VGG-19 network fused by direct summation as our baseline (denoted as 'B[s]', shown in Fig. 5(a)). To verify the effectiveness of the FLM, we employ the FLM in both the RGB and depth streams ('B+FLM[s]', shown in Fig. 5(b)). The experimental results of (a) and (b) in Table 4 clearly demonstrate that our FLM obtains impressive performance gains. Moreover, as shown in Fig. 7, after employing the FLM the saliency maps achieve sharper boundaries as well as finer structures. Furthermore, to analyze the working mechanism of the FLM, we visualize the output features of each block in the FLM. As shown in Fig. 6(a), branches 4 and 3 extract the global location information while branches 2 and 1 preserve more local detail information. This benefits from the evolutionary refinement of salient regions with finer details.

Table 4. Ablation analysis on 7 datasets. The [s] and [a] following the modules represent the symmetric and asymmetric architectures, respectively. Obviously, each component of our architecture can achieve considerable accuracy gains. (a), (b), (c), (d), (e), (f) represent the modules indexed by the corresponding letters in Fig. 5
Fig. 7. Visual comparisons of ablation analyses. (a), (b), (c), (d), (e), (f) represent the visual results of the experiments indexed by the corresponding letters in Fig. 5

Effect of DAM. We conduct contrast experiments to verify the effectiveness of our DAM on both symmetric and asymmetric architectures. For the symmetric architecture, we replace the simple summation with the DAM on our baseline (denoted as 'B+DAM[s]', as shown in Fig. 5(c)). From the results of (a) and (c) in Table 4 we can see that the MAE is reduced by 18% on the NLPR dataset after employing the DAM, which intuitively verifies its effect. Meanwhile, the corresponding visual results in Fig. 7 also illustrate that our DAM can fuse depth features discriminatively and filter out features that are mistakenly guided by depth cues. On the other hand, we employ the FLM in the RGB stream and replace the VGG-19 backbone with the DepthNet in the depth stream (denoted as 'B+FLM[a]', as shown in Fig. 5(d)). We then adopt the DAM on 'B+FLM[a]' to verify the effect of the DAM on the asymmetric architecture (denoted as 'B+FLM+DAM[a]', shown in Fig. 5(e)). The comparison results of (d) and (e) in Table 4 demonstrate the effectiveness of the DAM on the asymmetric architecture over all datasets. Additionally, we visualize the feature maps of our two-stream asymmetric architecture before and after adopting the DAM. As shown in Fig. 6(b), the salient regions are emphasized after adopting the DAM, which significantly improves detection accuracy.

Effect of Asymmetric Architecture. To illustrate the effectiveness of adopting an asymmetric architecture, we compare the results of (b) and (d) in Fig. 5. Furthermore, for a fair comparison, we adopt our FLM and DAM on the two-stream symmetric network (denoted as 'B+FLM+DAM[s]', as shown in Fig. 5(f)). As we can see from Table 4 (Asymmetric), the asymmetric architecture achieves performance comparable to the symmetric architecture, but with a much smaller size. Specifically, the asymmetric architecture reduces the model size by 47% (128.9 MB vs. 244.4 MB). Based on this observation, we consider it unnecessary to utilize a network as large as the RGBNet for extracting depth features, and we replace it with a more lightweight network.

4.4 Comparison with State-of-the-Art

Considering that most existing approaches are based on the VGG network, we adopt VGG as our backbone for fair comparisons. We compare our model with 13 RGB-D-based salient object detection models, including 8 CNNs-based methods: CPFP [46], DMRA [32], MMCI [6], TANet [5], PDNet [48], PCA [4], CTMF [18], DF [34], and 5 traditional methods: MB [49], CDCP [50], DCMC [9], NLPR [31], DES [8]. For fair comparisons, the results of the competing methods are generated by authorized codes or directly provided by the authors.

Quantitative Evaluation. Tables 2 and 3 show the validation results in terms of 4 evaluation metrics on 7 datasets. As we can see, our model significantly outperforms all other methods. Notably, our approach outperforms all other methods by a dramatic margin on the DUT-RGBD, NJUD and RGBD135 datasets, which are considered more challenging due to their large number of complex scenes such as similar foreground and background, low contrast and transparent objects. This further indicates that our model generalizes to various challenging scenes.

Qualitative Evaluation. We also visually compare our method with the most representative methods, as shown in Fig. 4. From these results, we can observe that our saliency maps are closer to the ground truths. For instance, other methods have trouble distinguishing salient objects in complex environments such as cluttered backgrounds (see the \(1^{st}\) row), while ours precisely identifies the whole object with exquisite details. Our model also locates and detects the entire salient object with sharp details more accurately than others in more challenging scenes such as low contrast (see the \(2^{nd}\) - \(3^{rd}\) rows), transparent objects (see the \(8^{th}\) row), and multiple or small objects (see the \(5^{th}\) - \(7^{th}\) rows). These results further verify the effectiveness and robustness of our proposed model.

Table 5. Complexity comparisons on two datasets. The best three results are shown in boldface, bolditalic, italic fonts respectively

Complexity Evaluation. We compare the model size and execution time of our method with those of 7 representative models, as shown in Table 5. Our method achieves the smallest model size and the second-highest FPS. To be specific, the model size of our architecture is only 128.9 MB, which is 2/3 of the previous minimum model size (PDNet). Compared with the best performing method, DMRA, our architecture reduces the model size by 46% and boosts the FPS by 109%. Besides, we achieve a high running speed of 46 frames per second (FPS) compared with the representative approaches.
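As a point of reference, model size and FPS can be measured as sketched below, under the convention of counting parameters as 32-bit floats and timing single-batch forward passes; this is not the benchmarking script used for Table 5.

```python
import time
import torch
import torch.nn as nn

def model_size_mb(model: nn.Module) -> float:
    # parameter count at 32-bit floats, reported in MB
    return sum(p.numel() for p in model.parameters()) * 4 / 1024 ** 2

@torch.no_grad()
def frames_per_second(model: nn.Module, rgb: torch.Tensor, depth: torch.Tensor, runs: int = 50) -> float:
    # assumes a two-input model(rgb, depth); averages the wall-clock time over several runs
    model.eval()
    start = time.time()
    for _ in range(runs):
        model(rgb, depth)
    return runs / (time.time() - start)
```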

5 Conclusion

In this paper, we propose an asymmetric two-stream architecture that takes into account the inherent differences between RGB and depth data for saliency detection. For the RGB stream, we introduce a flow ladder module (FLM) that effectively extracts rich global context information while preserving local saliency details. We also design a lightweight DepthNet for the depth stream with a small model size of 6.7 MB. Besides, we propose a depth attention module (DAM) that ensures the depth cues discriminatively guide the RGB features to precisely locate salient objects. Our approach significantly advances the state-of-the-art methods on widely used datasets and is capable of precisely capturing salient regions in challenging scenes.