1 Introduction

Feature matching is a fundamental problem in computer vision tasks such as object recognition [1], object detection, 3D reconstruction and image retrieval. The performance of feature matching depends heavily on the descriptors used. Feature descriptors are generally required to be invariant to variations in image scale, rotation and illumination. Designing descriptors with good performance is therefore very important.

In the 2D domain, Lowe [2] proposed the Scale Invariant Feature Transform (SIFT) method using scale space and gradient orientation histograms. Bay et al. [3] proposed the Speeded-Up Robust Features (SURF) method to speed up the keypoint detection step of SIFT. In the 3D domain, Tombari et al. [4] proposed the Signature of Histograms of Orientations (SHOT) descriptor using geometric histograms. Later, Tombari et al. [5] proposed the Color-SHOT (CSHOT) descriptor by combining shape and color information. Guo et al. [6] proposed Rotational Projection Statistics (RoPS) based on observations in cognition and multi-view information representation. However, these float descriptors usually have high dimensionality and long matching times, and they require a large amount of memory for storage. This issue becomes even more challenging when the number of extracted keypoints is large. Consequently, current float feature descriptors are unsuitable for time-critical applications (e.g., mobile computing).

To achieve low memory cost and fast feature generation and matching, binary descriptors have been proposed. In the 2D domain, Calonder et al. [7] simply compared the gray values of specified point pairs to generate the Binary Robust Independent Elementary Features (BRIEF) descriptor. In the 3D domain, Prakhya et al. [8] applied a set of binary mapping rules to propose the Binary SHOT (B-SHOT) descriptor. These algorithms generate feature descriptors using RGB information or point cloud information only.

In recent years, with the availability of RGB-D images, a few RGB-D image descriptors have been developed. Beksi et al. [9] integrated different features of the RGB image and the depth image to obtain a covariance descriptor. Feng et al. [10] proposed a Local Ordinal Intensity and Normal Descriptor (LOIND) by constructing a histogram. These descriptors are represented by float vectors, so they require more memory and longer matching times than binary descriptors. Recently, Nascimento et al. [11] proposed a Binary Robust Appearance and Normal Descriptor (BRAND) by detecting variations in the normal or in the intensity of each point pair around a keypoint neighborhood.

In this paper, we propose a new efficient binary descriptor, named BAG, for RGB-D images. Experimental results on the 7-Scenes [12] and TUM RGB-D benchmark [13] datasets show that the proposed BAG binary descriptor achieves better performance than other descriptors in terms of recall/1-precision, matching time and memory consumption.

2 BAG Binary Descriptor

2.1 Local Binary Pattern

Given a keypoint \( k \) and a circular patch \( p \) with radius \( R \) around \( k \), a specific sampling method is designed to obtain \( N \) point pairs from patch \( p \). The local binary pattern [14] can then be represented as follows:

$$ LBP_{R,N}(p) = \sum\limits_{i = 0}^{N - 1} \tau(p{:}\,n_{i}, m_{i})\, 2^{i} $$
(1)
$$ \tau(p{:}\,n_{i}, m_{i}) = \begin{cases} 1, & p(n_{i}) < p(m_{i}) \\ 0, & \text{otherwise} \end{cases} $$
(2)

where \( (n_{i}, m_{i}) \) denotes a point pair from patch \( p \), and \( p(n_{i}) \) and \( p(m_{i}) \) are the gray values at points \( n_{i} \) and \( m_{i} \), respectively.
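To make Eqs. (1)-(2) concrete, here is a minimal Python sketch of the LBP computation; the `patch` array and the pre-sampled `pairs` list are hypothetical inputs, not names from the paper.

```python
import numpy as np

def lbp(patch, pairs):
    """Local binary pattern over N sampled point pairs (Eqs. 1-2).

    patch : 2D array of gray values; pairs : list of ((r1, c1), (r2, c2))
    pixel coordinates sampled from the patch. Both are assumed inputs.
    """
    code = 0
    for i, (n, m) in enumerate(pairs):
        # tau = 1 if the gray value at n is smaller than at m (Eq. 2)
        if patch[n] < patch[m]:
            code |= 1 << i  # weight the i-th test by 2^i (Eq. 1)
    return code
```

Each comparison contributes one bit, so \( N \) point pairs yield an \( N \)-bit code.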

The local binary pattern is robust to illumination variations and fast to compute. Since an RGB-D image provides both color and depth information, we use 3 bits rather than 1 bit for each point pair to improve the discriminative power of the feature descriptor.

2.2 Scale and Rotation Invariance

To improve the discriminativeness and robustness of a feature descriptor in a 2D image, scale invariance should be considered. It is generally achieved by constructing a scale space and then searching for extrema in that space. According to the principle of image formation, the scale of a point in a 2D image is approximately inversely proportional to its depth. We can therefore use depth information to estimate the scale \( s \) of each keypoint in an RGB-D image [11], that is:

$$ s = \max\left( 0.2,\; \frac{3.8 - 0.4\max(2, d)}{3} \right) $$
(3)

where \( d \) is the depth value of the keypoint. Since the depth sensor has a limited working range, some of the measured depth values are inaccurate. Therefore, depth values smaller than 2 m are replaced by 2 m via the \( \max(2, d) \) term in Eq. (3).
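A one-line implementation of Eq. (3) under these conventions (depth `d` in metres; the function name is our own) might look like:

```python
def keypoint_scale(d):
    """Estimate the keypoint scale s from its depth d in metres (Eq. 3).

    The max(2, d) term replaces depths below 2 m, and the outer max
    floors the resulting scale at 0.2.
    """
    return max(0.2, (3.8 - 0.4 * max(2.0, d)) / 3.0)
```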

To achieve rotation invariance, we use the same canonical orientation estimation method as SURF [3].

Then, the patch \( p \) with radius \( R \) around keypoint \( k \) is scaled and rotated:

$$ p = \{ (T_{\theta ,s}(x_{i}), T_{\theta ,s}(y_{i})) \mid (x_{i}, y_{i}) \in A \} $$
(4)
$$ T_{\theta ,s} = \begin{pmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{pmatrix} \begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix} $$
(5)

where \( A \) is the set of \( N \) point pairs sampled from patch \( p \), and \( (x_{i}, y_{i}) \) is a point pair selected from \( A \). After this transformation, the proposed descriptor is invariant to scale and rotation. Uniform distribution sampling is used in this paper; an illustration is shown in Fig. 1, and a code sketch of the transformation follows it.

Fig. 1. Uniform distribution
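As a concrete illustration of Eqs. (4)-(5), the following Python sketch rotates and scales a sampling pattern; the `points` array of keypoint-relative offsets is an assumed representation.

```python
import numpy as np

def transform_pattern(points, theta, s):
    """Apply T_{theta,s} (Eq. 5) to every sampled offset (Eq. 4).

    points : (N, 2) array of (x, y) offsets relative to the keypoint;
    theta  : canonical orientation in radians; s : scale from Eq. (3).
    """
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    T = rotation @ (s * np.eye(2))  # T = R(theta) * S(s), Eq. (5)
    return points @ T.T             # transform all offsets at once
```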

2.3 Fusion of Color and Depth Information

Once the \( N \) point pairs have been determined for patch \( p \), the BAG representation of each point pair is generated using the following three steps.

(1) The average gray value \( ave \) of patch \( p \) in the RGB image is first calculated; the gray value of each point in the point pair is then compared with this average to produce a 1-bit representation \( \tau_{a} \). That is:

$$ \tau_{a}(x_{i}, y_{i}) = \begin{cases} 1, & \text{if } (p(x_{i}) > ave \;\text{and}\; p(y_{i}) > ave) \;\text{or}\; (p(x_{i}) < ave \;\text{and}\; p(y_{i}) < ave) \\ 0, & \text{otherwise} \end{cases} $$
(6)

where \( p(\cdot) \) denotes the gray value at a point.

(2) The gray values of the two points in the point pair are compared directly, resulting in a 1-bit representation \( \tau_{b} \):

$$ \tau_{b}(x_{i}, y_{i}) = \begin{cases} 1, & \text{if } p(x_{i}) > p(y_{i}) \\ 0, & \text{otherwise} \end{cases} $$
(7)
(3) The depth image is transformed into a point cloud; geometric information is then extracted from the point cloud and compared within each point pair, resulting in a 1-bit representation \( \tau_{c} \). The relevant geometric information is shown in Fig. 2.

Fig. 2. Geometry information

Specifically, two features are constructed from the surface geometry. The first is the dot product between the surface normals of the two points. The second is the convexity of the surface, defined as follows:

$$ k(x_{i}, y_{i}) = \langle p_{s}(x_{i}) - p_{s}(y_{i}),\; p_{n}(x_{i}) - p_{n}(y_{i}) \rangle $$
(8)

where \( \langle \cdot, \cdot \rangle \) denotes the dot product of two vectors, and \( p_{s}(x_{i}), p_{s}(y_{i}) \) and \( p_{n}(x_{i}), p_{n}(y_{i}) \) are the 3D positions and the surface normals of points \( x_{i} \) and \( y_{i} \), respectively.

Finally, the 1-bit representation \( \tau_{c} \) for geometry information [11] can be expressed as:

$$ \tau_{c}(x_{i}, y_{i}) = \left( \langle p_{n}(x_{i}), p_{n}(y_{i}) \rangle < \cos \rho \right) \wedge \left( k(x_{i}, y_{i}) < 0 \right) $$
(9)

where \( \rho \) is the normal angle threshold.
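A minimal sketch of the geometric test in Eqs. (8)-(9), assuming unit normals and reading the connective in Eq. (9) as a logical AND; all argument names are ours:

```python
import numpy as np

def tau_c(ps_x, ps_y, pn_x, pn_y, rho=np.deg2rad(45.0)):
    """Geometric bit for one point pair (Eqs. 8-9).

    ps_* : 3D positions; pn_* : unit surface normals; rho : normal
    angle threshold (45 degrees, the value chosen in Sect. 3.1).
    """
    convexity = np.dot(ps_x - ps_y, pn_x - pn_y)   # k(x_i, y_i), Eq. (8)
    angle_test = np.dot(pn_x, pn_y) < np.cos(rho)  # normals differ strongly
    return int(angle_test and convexity < 0)       # combine as in Eq. (9)
```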

Therefore, we have 3 bits for the representation of each point pair.

The BAG feature descriptor is then expressed as:

$$ b(k) = \sum\limits_{i = 0}^{N - 1} \left( 2^{3i}\, \tau_{a}(x_{i}, y_{i}) + 2^{3i + 1}\, \tau_{b}(x_{i}, y_{i}) + 2^{3i + 2}\, \tau_{c}(x_{i}, y_{i}) \right) $$
(10)
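Putting the three tests together, the following sketch packs the bits of Eq. (10) into one integer; `tau_a`, `tau_b` and `tau_c` are assumed to be callables implementing Eqs. (6), (7) and (9).

```python
def bag_descriptor(pairs, tau_a, tau_b, tau_c):
    """Pack the three 1-bit tests of every point pair into b(k) (Eq. 10)."""
    b = 0
    for i, (x, y) in enumerate(pairs):
        b |= tau_a(x, y) << (3 * i)      # appearance vs. patch mean, Eq. (6)
        b |= tau_b(x, y) << (3 * i + 1)  # direct gray-value comparison, Eq. (7)
        b |= tau_c(x, y) << (3 * i + 2)  # geometric test, Eq. (9)
    return b
```

Note that the 48-byte descriptor size chosen in Sect. 3.1 corresponds to 384 bits, i.e., \( N = 128 \) point pairs at 3 bits each.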

2.4 Gaussian Filter

To improve the robustness of the BAG descriptor to noise, the patch \( p \) is smoothed with a Gaussian kernel before the binary features are generated (as described in Sect. 2.3). The Gaussian kernel has a standard deviation \( \sigma \) of 2 and a window of \( 9 \times 9 \) pixels. The 2D Gaussian kernel function with standard deviation \( \sigma \) is defined as:

$$ g_{\sigma}(x, y) = \frac{1}{2\pi \sigma^{2}} \exp\left( -\frac{x^{2} + y^{2}}{2\sigma^{2}} \right) $$
(11)

The Gaussian filter value \( F(q) \) at point \( q \) is defined as:

$$ F(q) = \frac{\sum\limits_{(l,k) \in \Omega} g_{\sigma}(l, k)\, I_{(l,k)}}{\sum\limits_{(l,k) \in \Omega} g_{\sigma}(l, k)} $$
(12)

where \( (l, k) \) is a position relative to \( q \), \( I_{(l,k)} \) is the gray value at that position, and \( \Omega \) is the filter window.
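A direct (unoptimized) Python sketch of Eqs. (11)-(12) with the stated \( \sigma = 2 \) and \( 9 \times 9 \) window; edge padding is our own assumption, as the paper does not specify border handling.

```python
import numpy as np

def gaussian_smooth(img, sigma=2.0, size=9):
    """Smooth a gray image with a normalized Gaussian window (Eqs. 11-12)."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    # Eq. (11) up to the 1/(2*pi*sigma^2) constant, which cancels in Eq. (12)
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    pad = np.pad(img.astype(float), r, mode='edge')
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            window = pad[i:i + size, j:j + size]
            out[i, j] = np.sum(g * window) / g.sum()  # normalization, Eq. (12)
    return out
```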

3 Experiments

3.1 Experimental Setup

Experiments were conducted on the public 7-Scenes and TUM RGB-D benchmark datasets. The integral image method [10, 11] was used to calculate the normals of a depth image. The Nearest Neighbor Distance Ratio (NNDR) criterion was used for descriptor matching, and the widely used recall/1-precision curve [15] was used to measure descriptor performance. To ensure a fair comparison, we used the same keypoint detector (i.e., STAR [16]) in all experiments and varied only the keypoint description method.

To find the optimal parameters for the BAG descriptor, we tested it with different sizes and different normal angle thresholds. We finally set the size of BAG to 48 bytes and the normal angle threshold to 45°.

3.2 Comparative Experiments

To evaluate the performance of the proposed BAG descriptor, we compared it with several existing methods, including SIFT [2], SURF [3], SHOT [4] and CSHOT [5], in terms of recall/1-precision, generation time, matching time and memory consumption. Note that the NNDR matching criterion was used in all experiments. For feature matching, the Euclidean distance was used for all float descriptors and the Hamming distance for our proposed BAG binary descriptor.
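For reference, a minimal sketch of NNDR matching with the Hamming distance on packed binary descriptors; the 0.8 ratio threshold is an illustrative assumption, not a value reported in the paper.

```python
import numpy as np

def nndr_match_hamming(desc_a, desc_b, ratio=0.8):
    """Match two sets of binary descriptors with NNDR and Hamming distance.

    desc_a, desc_b : (Na, B) and (Nb, B) uint8 arrays of packed descriptor
    bytes (B = 48 for BAG). Returns accepted (index_a, index_b) pairs.
    """
    matches = []
    for i, d in enumerate(desc_a):
        # Hamming distance = popcount of the XOR of the packed bytes
        dists = np.unpackbits(desc_b ^ d, axis=1).sum(axis=1)
        nearest, second = np.argsort(dists)[:2]
        if dists[nearest] < ratio * dists[second]:  # NNDR test
            matches.append((i, nearest))
    return matches
```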

The recall/1-precision curves achieved on different sequences of the 7-Scenes dataset are shown in Fig. 3. The results show that the proposed BAG consistently outperformed SURF. Compared to SIFT, results on the first three sequences show that BAG significantly outperformed SIFT at low recall and was then slightly inferior to SIFT. For the remaining three sequences, BAG achieved almost the same performance at low recall and was then slightly inferior to SIFT. This is because the proposed BAG combines appearance (RGB) and geometric (depth) information, which makes its performance comparable to or better than that of the SIFT descriptor. Meanwhile, since BAG is represented by binary strings, its descriptiveness for a local surface is still slightly lower than that of a float descriptor, so the number of false positive BAG matches increases at high recall. Compared to SHOT and CSHOT, the proposed BAG clearly outperforms both on all scenes at high recall. Note that CSHOT outperforms SHOT throughout.

Fig. 3. Recall/1-precision curves achieved on the 7-Scenes dataset

The recall/1-precision curves achieved on the publicly available TUM RGB-D benchmark are shown in Fig. 4. The results show that the comparative performance of the proposed BAG is consistent with the results on the 7-Scenes dataset, except that CSHOT achieves almost the same performance as SHOT on the TUM RGB-D benchmark. This observation indicates that CSHOT does not benefit from the color information of the images in this benchmark.

Fig. 4. Recall/1-precision curves achieved on the TUM RGB-D benchmark

The feature generation and matching times for the different descriptors are shown in Table 1. In descending order of generation time, the descriptors rank as CSHOT, SHOT, SIFT, BAG and SURF. The generation time of the proposed BAG is thus shorter than that of all other descriptors except SURF; the gap to SURF arises because BAG has to correct the scale and rotation of each point pair to achieve scale and rotation invariance. With this invariance, however, BAG consistently outperforms SURF in terms of recall/1-precision. In descending order of matching time, the descriptors rank as CSHOT, SHOT, SIFT, SURF and BAG. The reason is that BAG uses the Hamming distance while the other descriptors use the Euclidean distance for feature matching; with the Euclidean distance, the matching time increases with the feature dimensionality. Note that the descriptor dimensionalities of CSHOT, SHOT, SIFT and SURF are 1344, 352, 128 and 64, respectively.

Table 1. Feature generation and matching time for different descriptors

The memory consumption of the different descriptors is shown in Table 2. In descending order, the ranking is CSHOT, SHOT, SIFT, SURF and BAG; BAG has a significantly smaller memory cost than the others. This is because the proposed BAG is represented by binary bits while the other descriptors are represented by float values. Since each float value occupies 32 bits, a binary descriptor is far more compact than a float descriptor of the same dimensionality.
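As a concrete check of this gap: a 128-dimensional SIFT descriptor stored as 32-bit floats occupies 128 × 4 = 512 bytes, more than ten times the 48 bytes used by BAG, and the 1344-dimensional CSHOT descriptor occupies 1344 × 4 = 5376 bytes, over a hundred times more.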

Table 2. Memory consumption comparison among different descriptors

4 Conclusion

This paper has proposed a new efficient binary descriptor for RGB-D images. The proposed descriptor integrates both RGB and depth information and is highly discriminative. Experiments have been conducted to compare our BAG descriptor with the SIFT, SURF, SHOT and CSHOT descriptors. The results show that the proposed BAG produces better keypoint matching performance while achieving the fastest matching speed and the smallest memory consumption.