1 Introduction

Feature matching is a fundamental problem in computer vision tasks such as object recognition [1], object detection, 3D reconstruction and image retrieval. The performance of feature matching depends heavily on the descriptors used. Feature descriptors are generally required to be invariant to variations in image scale, rotation and illumination. Designing descriptors with good performance is therefore very important.

In the 2D domain, Lowe [2] proposed the Scale Invariant Feature Transform (SIFT) method using scale space and gradient orientation histograms. Bay et al. [3] proposed the Speeded-Up Robust Features (SURF) method to speed up the keypoint detection step of SIFT. In the 3D domain, Tombari et al. [4] proposed the Signature of Histograms of Orientations (SHOT) descriptor using geometric histograms. Later, Tombari et al. [5] proposed the Color-SHOT (CSHOT) descriptor by combining shape and color information. Guo et al. [6] proposed Rotational Projection Statistics (RoPS) based on observations in cognition and multi-view information representation. However, these float descriptors usually have high dimensionality and long matching times, and they require a large amount of memory for storage. This issue becomes even more challenging when the number of extracted keypoints is large. Consequently, current float feature descriptors are unsuitable for time-critical applications (e.g., mobile computing).

To achieve low memory cost and fast feature generation and matching, binary descriptors have been proposed. In the 2D domain, Calonder et al. [7] simply compared the gray values of specified point pairs to generate the Binary Robust Independent Elementary Features (BRIEF) descriptor. In the 3D domain, Prakhya et al. [8] applied a set of binary mapping rules to propose the Binary SHOT (B-SHOT) descriptor. These algorithms generate feature descriptors using RGB information or point cloud information only.

In recent years, with the availability of RGB-D images, a few RGB-D image descriptors have been developed. Beksi et al. [9] integrated different features of the RGB image and the depth image to obtain a covariance descriptor. Feng et al. [10] proposed a Local Ordinal Intensity and Normal Descriptor (LOIND) by constructing a histogram. These descriptors are represented by float vectors, so they require more memory and longer matching times than binary descriptors. Recently, Nascimento et al. [11] proposed a Binary Robust Appearance and Normal Descriptor (BRAND) by detecting variations in the normal or in the intensity of each point pair around a keypoint neighborhood.

In this paper, we propose a new efficient binary descriptor, named BAG, for RGB-D images. Experimental results on the 7-Scenes [12] and TUM RGB-D benchmark [13] datasets show that the proposed BAG binary descriptor achieves better performance than other descriptors in terms of recall/1-precision, matching time and memory consumption.

2 BAG Binary Descriptor

2.1 Local Binary Pattern

Given a keypoint \( k \) and a circular patch \( p \) with radius \( R \) around \( k \), a specific sampling method is designed to obtain \( N \) point pairs from patch \( p \). The local binary pattern [14] can then be represented as follows:

$$ LBP_{R,N}(p) = \sum\limits_{i = 0}^{N - 1} \tau(p{:}\,n_{i}, m_{i})\, 2^{i} $$
(1)
$$ \tau(p{:}\,n_{i}, m_{i}) = \begin{cases} 1, & p(n_{i}) < p(m_{i}) \\ 0, & \text{otherwise} \end{cases} $$
(2)

where \( (n_{i}, m_{i}) \) denotes a point pair from patch \( p \), and \( p(n_{i}) \) and \( p(m_{i}) \) are the gray values at points \( n_{i} \) and \( m_{i} \), respectively.
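To make Eqs. (1)-(2) concrete, here is a minimal Python sketch of the LBP computation; the `patch` array and the pre-sampled `pairs` list are hypothetical inputs, not names from the paper.

```python
import numpy as np

def lbp(patch, pairs):
    """Local binary pattern over N sampled point pairs (Eqs. 1-2).

    patch : 2D array of gray values; pairs : list of ((r1, c1), (r2, c2))
    pixel coordinates sampled from the patch. Both are assumed inputs.
    """
    code = 0
    for i, (n, m) in enumerate(pairs):
        # tau = 1 if the gray value at n is smaller than at m (Eq. 2)
        if patch[n] < patch[m]:
            code |= 1 << i  # weight the i-th test by 2^i (Eq. 1)
    return code
```

Each comparison contributes one bit, so \( N \) point pairs yield an \( N \)-bit code.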

The local binary pattern is robust to illumination variations and fast to compute. Since an RGB-D image provides both color and depth information, we use 3 bits rather than 1 bit for each point pair to improve the discriminative power of the feature descriptor.

2.2 Scale and Rotation Invariance

To improve the discriminativeness and robustness of a feature descriptor in a 2D image, scale invariance should be considered. It is generally achieved by constructing a scale space and then searching for extrema in that space. According to the principle of image formation, the scale of a point in a 2D image is approximately inversely proportional to its depth. We can therefore use depth information to estimate the scale \( s \) of each keypoint in an RGB-D image [11], that is:

$$ s = \max\left( 0.2,\; \frac{3.8 - 0.4\max(2, d)}{3} \right) $$
(3)

where \( d \) is the depth value of the keypoint. Since the depth sensor has a limited working range, some of the measured depth values are inaccurate. Therefore, depth values smaller than 2 m are replaced by 2 m via the \( \max(2, d) \) term in Eq. (3).
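A one-line implementation of Eq. (3) under these conventions (depth `d` in metres; the function name is our own) might look like:

```python
def keypoint_scale(d):
    """Estimate the keypoint scale s from its depth d in metres (Eq. 3).

    The max(2, d) term replaces depths below 2 m, and the outer max
    floors the resulting scale at 0.2.
    """
    return max(0.2, (3.8 - 0.4 * max(2.0, d)) / 3.0)
```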

To achieve rotation invariance, we use the same canonical orientation estimation method as SURF [3].

Then, the patch \( p \) with radius \( R \) around keypoint \( k \) is scaled and rotated:

$$ p = \{ (T_{\theta ,s}(x_{i}), T_{\theta ,s}(y_{i})) \mid (x_{i}, y_{i}) \in A \} $$
(4)
$$ T_{\theta ,s} = \begin{pmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{pmatrix} \begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix} $$
(5)

where \( A \) is the set of \( N \) point pairs sampled from patch \( p \), and \( (x_{i}, y_{i}) \) is a point pair selected from \( A \). After this transformation, the proposed descriptor is invariant to scale and rotation. Uniform distribution sampling is used in this paper; an illustration is shown in Fig. 1, and a code sketch of the transformation follows it.

Fig. 1. Uniform distribution
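As a concrete illustration of Eqs. (4)-(5), the following Python sketch rotates and scales a sampling pattern; the `points` array of keypoint-relative offsets is an assumed representation.

```python
import numpy as np

def transform_pattern(points, theta, s):
    """Apply T_{theta,s} (Eq. 5) to every sampled offset (Eq. 4).

    points : (N, 2) array of (x, y) offsets relative to the keypoint;
    theta  : canonical orientation in radians; s : scale from Eq. (3).
    """
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    T = rotation @ (s * np.eye(2))  # T = R(theta) * S(s), Eq. (5)
    return points @ T.T             # transform all offsets at once
```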

2.3 Fusion of Color and Depth Information

Once the \( N \) point pairs have been determined for patch \( p \), the BAG representation of each point pair is generated using the following three steps.

(1) The average gray value \( ave \) of patch \( p \) in the RGB image is first calculated; the gray value of each point in the point pair is then compared with this average to produce a 1-bit representation \( \tau_{a} \). That is:

$$ \tau_{a}(x_{i}, y_{i}) = \begin{cases} 1, & \text{if } (p(x_{i}) > ave \;\text{and}\; p(y_{i}) > ave) \;\text{or}\; (p(x_{i}) < ave \;\text{and}\; p(y_{i}) < ave) \\ 0, & \text{otherwise} \end{cases} $$
(6)

where \( p(\cdot) \) denotes the gray value at a point.

(2) The gray values of the two points in the point pair are compared directly, resulting in a 1-bit representation \( \tau_{b} \):

$$ \tau_{b}(x_{i}, y_{i}) = \begin{cases} 1, & \text{if } p(x_{i}) > p(y_{i}) \\ 0, & \text{otherwise} \end{cases} $$
(7)
(3) The depth image is transformed into a point cloud; geometric information is then extracted from the point cloud and compared within each point pair, resulting in a 1-bit representation \( \tau_{c} \). The relevant geometric information is shown in Fig. 2.

Fig. 2. Geometry information

Specifically, two features are constructed from the surface geometry. The first is the dot product between the surface normals of the two points. The second is the convexity of the surface, defined as follows:

$$ k(x_{i}, y_{i}) = \langle p_{s}(x_{i}) - p_{s}(y_{i}),\; p_{n}(x_{i}) - p_{n}(y_{i}) \rangle $$
(8)

where \( \langle \cdot, \cdot \rangle \) denotes the dot product of two vectors, and \( p_{s}(x_{i}), p_{s}(y_{i}) \) and \( p_{n}(x_{i}), p_{n}(y_{i}) \) are the 3D positions and the surface normals of points \( x_{i} \) and \( y_{i} \), respectively.

Finally, the 1-bit representation \( \tau_{c} \) for geometry information [11] can be expressed as:

$$ \tau_{c}(x_{i}, y_{i}) = \left( \langle p_{n}(x_{i}), p_{n}(y_{i}) \rangle < \cos \rho \right) \wedge \left( k(x_{i}, y_{i}) < 0 \right) $$
(9)

where \( \rho \) is the normal angle threshold.
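A minimal sketch of the geometric test in Eqs. (8)-(9), assuming unit normals and reading the connective in Eq. (9) as a logical AND; all argument names are ours:

```python
import numpy as np

def tau_c(ps_x, ps_y, pn_x, pn_y, rho=np.deg2rad(45.0)):
    """Geometric bit for one point pair (Eqs. 8-9).

    ps_* : 3D positions; pn_* : unit surface normals; rho : normal
    angle threshold (45 degrees, the value chosen in Sect. 3.1).
    """
    convexity = np.dot(ps_x - ps_y, pn_x - pn_y)   # k(x_i, y_i), Eq. (8)
    angle_test = np.dot(pn_x, pn_y) < np.cos(rho)  # normals differ strongly
    return int(angle_test and convexity < 0)       # combine as in Eq. (9)
```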

Therefore, we have 3 bits for the representation of each point pair.

The BAG feature descriptor is then expressed as:

$$ b(k) = \sum\limits_{i = 0}^{N - 1} \left( 2^{3i}\, \tau_{a}(x_{i}, y_{i}) + 2^{3i + 1}\, \tau_{b}(x_{i}, y_{i}) + 2^{3i + 2}\, \tau_{c}(x_{i}, y_{i}) \right) $$
(10)
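Putting the three tests together, the following sketch packs the bits of Eq. (10) into one integer; `tau_a`, `tau_b` and `tau_c` are assumed to be callables implementing Eqs. (6), (7) and (9).

```python
def bag_descriptor(pairs, tau_a, tau_b, tau_c):
    """Pack the three 1-bit tests of every point pair into b(k) (Eq. 10)."""
    b = 0
    for i, (x, y) in enumerate(pairs):
        b |= tau_a(x, y) << (3 * i)      # appearance vs. patch mean, Eq. (6)
        b |= tau_b(x, y) << (3 * i + 1)  # direct gray-value comparison, Eq. (7)
        b |= tau_c(x, y) << (3 * i + 2)  # geometric test, Eq. (9)
    return b
```

Note that the 48-byte descriptor size chosen in Sect. 3.1 corresponds to 384 bits, i.e., \( N = 128 \) point pairs at 3 bits each.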

2.4 Gaussian Filter

To improve the robustness of the BAG descriptor to noise, the patch \( p \) is smoothed with a Gaussian kernel before the binary features are generated (as described in Sect. 2.3). The Gaussian kernel has a standard deviation \( \sigma \) of 2 and a window of \( 9 \times 9 \) pixels. The 2D Gaussian kernel function with standard deviation \( \sigma \) is defined as:

$$ g_{\sigma}(x, y) = \frac{1}{2\pi \sigma^{2}} \exp\left( -\frac{x^{2} + y^{2}}{2\sigma^{2}} \right) $$
(11)

The Gaussian filter value \( F(q) \) at point \( q \) is defined as:

$$ F(q) = \frac{\sum\limits_{(l,k) \in \Omega} g_{\sigma}(l, k)\, I_{(l,k)}}{\sum\limits_{(l,k) \in \Omega} g_{\sigma}(l, k)} $$
(12)

where \( (l, k) \) is a position relative to \( q \), \( I_{(l,k)} \) is the gray value at that position, and \( \Omega \) is the filter window.
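A direct (unoptimized) Python sketch of Eqs. (11)-(12) with the stated \( \sigma = 2 \) and \( 9 \times 9 \) window; edge padding is our own assumption, as the paper does not specify border handling.

```python
import numpy as np

def gaussian_smooth(img, sigma=2.0, size=9):
    """Smooth a gray image with a normalized Gaussian window (Eqs. 11-12)."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    # Eq. (11) up to the 1/(2*pi*sigma^2) constant, which cancels in Eq. (12)
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    pad = np.pad(img.astype(float), r, mode='edge')
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            window = pad[i:i + size, j:j + size]
            out[i, j] = np.sum(g * window) / g.sum()  # normalization, Eq. (12)
    return out
```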

3 Experiments

3.1 Experimental Setup

Experiments were conducted on the public 7-Scenes and TUM RGB-D benchmark datasets. The integral image method [10, 11] was used to calculate the normals of a depth image. The Nearest Neighbor Distance Ratio (NNDR) criterion was used for descriptor matching, and the widely used recall/1-precision curve [15] was used to measure descriptor performance. To ensure a fair comparison, we used the same keypoint detector (i.e., STAR [16]) in all experiments and varied only the keypoint description method.

To find the optimal parameters for the BAG descriptor, we tested it with different sizes and different normal angle thresholds. We finally set the size of BAG to 48 bytes and the normal angle threshold to 45°.

3.2 Comparative Experiments

To evaluate the performance of the proposed BAG descriptor, we compared it with several existing methods, including SIFT [2], SURF [3], SHOT [4] and CSHOT [5], in terms of recall/1-precision, generation time, matching time and memory consumption. Note that the NNDR matching criterion was used in all experiments. For feature matching, the Euclidean distance was used for all float descriptors and the Hamming distance for our proposed BAG binary descriptor.
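For reference, a minimal sketch of NNDR matching with the Hamming distance on packed binary descriptors; the 0.8 ratio threshold is an illustrative assumption, not a value reported in the paper.

```python
import numpy as np

def nndr_match_hamming(desc_a, desc_b, ratio=0.8):
    """Match two sets of binary descriptors with NNDR and Hamming distance.

    desc_a, desc_b : (Na, B) and (Nb, B) uint8 arrays of packed descriptor
    bytes (B = 48 for BAG). Returns accepted (index_a, index_b) pairs.
    """
    matches = []
    for i, d in enumerate(desc_a):
        # Hamming distance = popcount of the XOR of the packed bytes
        dists = np.unpackbits(desc_b ^ d, axis=1).sum(axis=1)
        nearest, second = np.argsort(dists)[:2]
        if dists[nearest] < ratio * dists[second]:  # NNDR test
            matches.append((i, nearest))
    return matches
```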

The recall/1-precision curves achieved on different sequences of the 7-Scenes dataset are shown in Fig. 3. The results show that the proposed BAG consistently outperformed SURF. Compared to SIFT, results on the first three sequences show that BAG significantly outperformed SIFT at low recall and was then slightly inferior to SIFT. For the remaining three sequences, BAG achieved almost the same performance at low recall and was then slightly inferior to SIFT. This is because the proposed BAG combines appearance (RGB) and geometric (depth) information, which makes its performance comparable to or better than that of the SIFT descriptor. Meanwhile, since BAG is represented by binary strings, its descriptiveness for a local surface is still slightly lower than that of a float descriptor, so the number of false positive BAG matches increases at high recall. Compared to SHOT and CSHOT, the proposed BAG clearly outperforms both on all scenes at high recall. Note that CSHOT outperforms SHOT throughout.

Fig. 3. Recall/1-precision curves achieved on the 7-Scenes dataset

The recall/1-precision curves achieved on the publicly available TUM RGB-D benchmark are shown in Fig. 4. The results show that the comparative performance of the proposed BAG is consistent with the results on the 7-Scenes dataset, except that CSHOT achieves almost the same performance as SHOT on the TUM RGB-D benchmark. This observation indicates that CSHOT does not benefit from the color information of the images in this benchmark.

Fig. 4. Recall/1-precision curves achieved on the TUM RGB-D benchmark

The feature generation and matching times for the different descriptors are shown in Table 1. In descending order of generation time, the descriptors rank as CSHOT, SHOT, SIFT, BAG and SURF. The generation time of the proposed BAG is thus shorter than that of all other descriptors except SURF; the gap to SURF arises because BAG has to correct the scale and rotation of each point pair to achieve scale and rotation invariance. With this invariance, however, BAG consistently outperforms SURF in terms of recall/1-precision. In descending order of matching time, the descriptors rank as CSHOT, SHOT, SIFT, SURF and BAG. The reason is that BAG uses the Hamming distance while the other descriptors use the Euclidean distance for feature matching; with the Euclidean distance, the matching time increases with the feature dimensionality. Note that the descriptor dimensionalities of CSHOT, SHOT, SIFT and SURF are 1344, 352, 128 and 64, respectively.

Table 1. Feature generation and matching time for different descriptors

The memory consumption of the different descriptors is shown in Table 2. In descending order, the ranking is CSHOT, SHOT, SIFT, SURF and BAG; BAG has a significantly smaller memory cost than the others. This is because the proposed BAG is represented by binary bits while the other descriptors are represented by float values. Since each float value occupies 32 bits, a binary descriptor is far more compact than a float descriptor of the same dimensionality.
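As a concrete check of this gap: a 128-dimensional SIFT descriptor stored as 32-bit floats occupies 128 × 4 = 512 bytes, more than ten times the 48 bytes used by BAG, and the 1344-dimensional CSHOT descriptor occupies 1344 × 4 = 5376 bytes, over a hundred times more.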

Table 2. Memory consumption comparison among different descriptors

4 Conclusion

This paper has proposed a new efficient binary descriptor for RGB-D images. The proposed descriptor integrates both RGB and depth information and is highly discriminative. Experiments have been conducted to compare our BAG descriptor with the SIFT, SURF, SHOT and CSHOT descriptors. The results show that the proposed BAG produces better keypoint matching performance while achieving the fastest matching speed and the smallest memory consumption.