1 Introduction

Binary descriptors (such as [1, 5, 10, 11, 18, 21, 22]) are extensively used in real-time applications such as object detection and augmented reality. They are fast to compute, and the distance between two descriptors can be calculated in a few machine instructions. Given a query descriptor, the latter property greatly speeds up the search for its nearest neighbor within a set of reference descriptors.

Unfortunately, binary descriptors are particularly sensitive to larger changes in viewpoint and scale [20], mainly due to their discrete nature. For large binary descriptor sets, the nearest neighbor may not even be unique. However, it is relatively easy to find the top K near neighbors by ranking the matches according to their descriptor distances. This list is more likely to contain the correct match than the nearest neighbor alone. In practice, this extra set of match hypotheses is rarely exploited.
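
To make the near neighbor search concrete, the following is a minimal sketch of Hamming-distance matching in Python. It assumes descriptors stored as uint8 arrays, as OpenCV produces them; `np.unpackbits` stands in for the single POPCNT instruction per machine word that optimized implementations use, and the helper names are ours.

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two binary descriptors stored as uint8 arrays."""
    # XOR exposes the differing bits; unpackbits + sum counts them.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def k_near_neighbors(query: np.ndarray, references: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k reference descriptors closest to the query."""
    dists = np.array([hamming_distance(query, r) for r in references])
    return np.argsort(dists)[:k]  # ranked by ascending Hamming distance
```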

In this paper, we propose a two-step approach to keypoint matching with binary descriptors. In the first step, for each query keypoint, we identify the list of the top K near neighbors according to the Hamming distance. In the second step, we rank these near neighbors according to a probabilistic and keypoint specific match quality score. This score exploits precomputed data extracted from the reference image during an off-line training phase. We show that a keypoint specific measure is more effective in ranking the match hypotheses than the Hamming descriptor distance as illustrated in Fig. 2.

To motivate our approach, we illustrate the existence of correct matches beyond the nearest neighbor. Figure 1 shows the number of correct matches between BRIEF descriptors that fall into the first ten near neighbors. Although most of the BRIEF matches obtained by nearest neighbor matching are wrong, the correct ones are not far down the near neighbor list. Matches beyond the nearest neighbor are reachable only if we supplement the Hamming distance with a secondary, more distinctive match score.

Fig. 1

Near neighbor rank distribution of the correct matches between BRIEF descriptors of the first and the third images of the Graffiti data set. The nearest neighbor obtained by ranking matches according to Hamming distance captures only 159 of the 848 possible correspondences. Meanwhile, the first ten near neighbors include 517 correct matches, a potential improvement by a factor of more than three. In practice, as demonstrated by the results of Fig. 4, 385 of these can be recovered using the approach that we propose, which yields an actual improvement by a factor of more than two

Fig. 2

Matches between the first and the fourth images of the Graffiti and Boat sequences. a Considering only the nearest neighbor with the Hamming distance. b Searching within the first ten near neighbors using the approach presented in Sect. 3

In designing this secondary match score, we make the following observations. Most binary descriptors, even the ones that optimize their representation, compute the same features for every keypoint. As recently shown by [3], while computationally appealing, globally optimizing the features for all keypoints is not as effective as picking unique features for each individual keypoint. A few existing approaches learn and use keypoint specific representations [3, 8, 9, 19]. However, these require a separate distance computation for each reference keypoint and cannot be directly used with approximate nearest neighbor (ANN) algorithms such as Locality Sensitive Hashing (LSH) [2].

In contrast to these, at the second matching stage, we have only K possible match hypotheses (corresponding to the K near neighbors). We employ a keypoint specific representation to rank these and recover correct matches that did not make it to the top initially. For each reference keypoint \(k_i\) in the K near neighbor list, our score computation relies on a precomputed statistical model of the variations of the descriptor bits for keypoint \(k_i\). Since the list of K near neighbors can be computed efficiently with ANN methods, our approach also scales well to larger descriptor sets.

Our main contributions can be summarized as follows:

  • We propose a method that is able to find keypoint matches within the list of K near neighbors at negligible additional computational cost at run-time.

  • We demonstrate that, despite learning a separate representation for each individual keypoint similar to [3, 8, 9, 19], our approach does not require a brute-force search in a large descriptor set when coupled with LSH [2] to compute the list of K near neighbors.

  • We show that our approach is largely descriptor independent and extends the matching range of several binary descriptors: BRIEF [5], ORB [21], BRISK [10], FREAK [1], and LATCH [11], the last of which has recently been shown to outperform state-of-the-art binary descriptors.

2 Related work

As discussed in the introduction, keypoint specific matching has been shown to perform better than matching with a generic descriptor [3]. One way to achieve this is to compute a descriptor specifically adapted to a given patch. The approach of [8] selects a patch specific set of binary features that are robust to changes in intensity levels as well as to small translations and rotations. More recently, [3] proposed learning a combined set of generic features and a locally optimized binary mask that picks the best features depending on the image patch. While both approaches greatly improve matching performance, they require a brute-force search over the reference collection and are therefore not suitable for real-time operation on large data sets.

Similarly, the keypoint classification approaches of [9, 19] train keypoint specific classifiers and compute a probability distribution over all reference keypoints at run-time. Moreover, they require large amounts of memory per keypoint to store the learned probabilities.

Despite using a probabilistic model similar to that of [9, 19], our approach evaluates these probabilities only for the first ten match candidates. When matching a single keypoint against a database of a thousand reference keypoints, the required number of probability calculations is two orders of magnitude smaller than that required by Random Ferns.

The combination of a generic initial query followed by candidate specific filtering is quite common in the image retrieval literature [7, 12]. For image retrieval, the filtering in the second stage depends on the geometric consistency between the query image and the candidate results. Our approach follows a very similar pipeline, with the geometry check replaced by a keypoint specific observation probability.

3 A two-step approach to match keypoints

The usual approach to matching keypoints is to locate the nearest neighbor in descriptor space. Instead of this one-shot approach, we first rank the possible matches by their descriptor distance and keep the top K near neighbors of the query in descriptor space as match candidates. We then evaluate each candidate based on detailed statistics of the texture around that specific candidate. Figure 3 gives an overview of the proposed approach.

Fig. 3

a Given a query descriptor q, if the descriptor uses the same features for each patch then it is possible to use an ANN approach to limit distance computation only to a subset of the reference descriptors (dashed circle). b Some approaches learn and employ a patch specific representation that is more distinctive and robust. c We propose a two-step approach that combines the advantages of both by limiting the patch specific scoring to the list of K near neighbors

We learn the statistical model for each keypoint in an off-line training stage by simulating affine deformations and observing the change in the descriptor bits. In the following, we describe the way the descriptor statistics are collected during training and how to score the multiple match hypotheses based on this data.

3.1 Modeling descriptor variations

The binary descriptor bits are not fully invariant to changes in perspective and lighting. Which bits are more prone to variation depends on the texture around each keypoint. If there is a strong gradient near one of the pixels that are part of the computation of the bit’s value, then that bit is more likely to flip. Due to the complex nature of these interactions, bit variations are better captured by a probabilistic conditional model such as

$$\begin{aligned} P(D \; | \; C = k_i) = P(d_1, d_2, \ldots , d_S \; | \; C = k_i), \end{aligned}$$
(1)

where D and C are two random variables corresponding to the descriptor value and the keypoint identity. \(C = k_i\) means that the distribution is computed for keypoint i, \(d_j\) is a binary random variable representing the \(j{\text {th}}\) descriptor bit, and S is the descriptor size in bits.

This representation requires \(2^S\) parameters per keypoint and since S is usually larger than or equal to 64, directly modeling the joint probability of the descriptor bits is not feasible. Following the representation proposed by [19], we split the descriptor into N groups of M bits such that \(S=M\times N\) and assume independence between these:

$$\begin{aligned} P(D \; | \; C = k_i) = \prod _{j=1}^{N} P(D_j \; | \; C = k_i), \end{aligned}$$
(2)

where \(D_j\) represents the values of the bits in group j. This representation has \(N\times 2^M\) parameters. We use several settings with \(M=4,6,\;\text {and}\;8\); for 256-dimensional descriptors, these yield \(N\) between 32 and 64. Larger M values yield a very large number of parameters, and around \(M=12\) we experimentally found that the matching performance degrades. Moreover, a large M means greater memory consumption. In the next section, we provide experimental results for M between 4 and 8. N is determined by fixing M and taking \(N = \frac{\text {descriptor size}}{M}\).
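
As a minimal sketch of this factorization, the mapping from a descriptor to its N group values can be written as follows, assuming consecutive bits form a group (the scheme we adopt below); the helper name is ours.

```python
import numpy as np

def descriptor_to_group_values(descriptor: np.ndarray, m: int) -> np.ndarray:
    """Split a binary descriptor (uint8 array) into N = S/M groups of M
    consecutive bits and return each group's integer value in [0, 2**m)."""
    bits = np.unpackbits(descriptor)         # the S individual descriptor bits
    groups = bits.reshape(-1, m)             # N rows of M consecutive bits
    weights = 1 << np.arange(m - 1, -1, -1)  # read each row as a binary number
    return groups @ weights                  # N group values
```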

The descriptor bits can be assigned to groups in many ways. The simplest possibility is to assign the consecutive descriptor bits to the same group. For descriptors like BRIEF, this is a natural choice as there is no reason to favor one grouping scheme over another. For descriptors like BRISK with more structure, an optimization over possible groupings might be beneficial. In practice, we found that the consecutive grouping works well for all descriptors and we use it in the rest of the experiments.

For each keypoint, we generate training samples using the affine deformation model proposed by [16]. This is not too restrictive since most keypoint detectors assume a locally affine model. In all the experiments, we use the same set of training parameters: the scale varies between \(\frac{1}{\sqrt{2}}\) and \(\sqrt{2}\), the in-plane rotation between \(-30^\circ \) and \(+30^\circ \), the tilt amount (\(\theta \) in [16]) between \(0^\circ \) and \(60^\circ \), and the tilt angle (\(\phi \) in [16]) between \(0^\circ \) and \(180^\circ \).
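
A hedged sketch of this sampling step is given below. The parameter ranges follow the text, but the exact warp composition of [16] may differ in detail; here the tilt is approximated as an anisotropic compression by \(\cos \theta \) along direction \(\phi \), and all names are ours.

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)

def rot(deg: float) -> np.ndarray:
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

def random_affine(center) -> np.ndarray:
    """Random 2x3 affine warp drawn from the training parameter ranges."""
    s     = rng.uniform(1 / np.sqrt(2), np.sqrt(2))  # scale
    psi   = rng.uniform(-30, 30)                     # in-plane rotation (deg)
    theta = rng.uniform(0, 60)                       # tilt amount (deg)
    phi   = rng.uniform(0, 180)                      # tilt angle (deg)
    # Tilt approximated as a compression by cos(theta) along direction phi.
    tilt = rot(phi).T @ np.diag([np.cos(np.deg2rad(theta)), 1.0]) @ rot(phi)
    A = s * rot(psi) @ tilt
    t = np.asarray(center, dtype=float) - A @ np.asarray(center, dtype=float)
    return np.hstack([A, t[:, None]])                # keeps the keypoint fixed

# warped = cv2.warpAffine(reference_image, random_affine((cx, cy)),
#                         reference_image.shape[1::-1])
```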

We generate roughly 200,000 images, yielding an equivalent number of training descriptors for each keypoint. For each bit group, the probabilities of Eq. 2 are inferred by counting the number of times training descriptors assume a particular value. For example, for bit groups of size 4, we have 16 possible outcomes. We count the number of times each outcome is observed within the training set and normalize by the total number of training samples. To counter the adverse effect of zeros in the estimated probabilities, we start counting from one as suggested by [19], which is equivalent to assuming \(2^M\) pseudo-samples for each keypoint taking on every possible descriptor value within a bit group. This is usually referred to as Laplace smoothing [13] and corresponds to assuming a Dirichlet prior when a Bayesian estimate is made for the probabilities of Eq. 2 [4].
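
A sketch of this counting procedure, reusing `descriptor_to_group_values` from the sketch above; the table layout (one row of \(2^M\) smoothed probabilities per bit group) is our choice, not prescribed by the text.

```python
import numpy as np

def train_keypoint_model(training_descriptors: np.ndarray, m: int) -> np.ndarray:
    """Estimate P(D_j | C = k_i) for one keypoint from its training descriptors.

    Returns an (N, 2**m) table of Laplace-smoothed probabilities."""
    n_groups = training_descriptors.shape[1] * 8 // m
    counts = np.ones((n_groups, 2 ** m))              # start counting from one
    for desc in training_descriptors:
        values = descriptor_to_group_values(desc, m)  # from the earlier sketch
        counts[np.arange(n_groups), values] += 1
    return counts / counts.sum(axis=1, keepdims=True) # normalize each group
```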

3.2 Keypoint specific scoring of match hypotheses

For each keypoint, we first obtain the list of K near neighbors based on the Hamming distance between the descriptors. This list is then sorted according to a score specific to the particular candidate keypoint each near neighbor represents.

We have experimented with various functions to score each hypothesis by combining the descriptor statistics and the original Hamming distance in several different ways. The best results have been obtained when the score is the negative Hamming distance between the query and reference descriptor plus the logarithm of the probability of observing the query descriptor according to Eq. 2:

$$\begin{aligned} \text {score}(Q, k_i) = - \left| Q - D^i \right| _H + \log \prod _{j=1}^{N} P \left( Q_j \; | \; C = k_i \right) , \end{aligned}$$
(3)

where Q is the query descriptor, \(D^i\) is the reference descriptor i, and \(|X-Y|_H\) is the Hamming distance between X and Y.

Since the second term involves \(k_i\), evaluating it directly would require a number of operations linear in the size of the reference data set. Our main insight is that computing it only for the first K near neighbors is sufficient to recover many more matches than using the nearest neighbor alone.
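
A minimal sketch of Eq. 3 and the second-step re-ranking, reusing the helpers from the earlier sketches; `ref_models` is assumed to hold one probability table per reference keypoint, as produced by the training sketch.

```python
import numpy as np

def score(query, ref_desc, ref_model, m: int) -> float:
    """Eq. 3: negative Hamming distance plus the log-probability of
    observing the query descriptor under the reference keypoint's model."""
    h = int(np.unpackbits(np.bitwise_xor(query, ref_desc)).sum())
    values = descriptor_to_group_values(query, m)
    log_p = np.log(ref_model[np.arange(len(values)), values]).sum()
    return -h + log_p

def rerank(query, neighbor_ids, ref_descs, ref_models, m: int):
    """Second step: pick the best of the K near neighbors by Eq. 3."""
    scores = [score(query, ref_descs[i], ref_models[i], m) for i in neighbor_ids]
    return neighbor_ids[int(np.argmax(scores))]
```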

To demonstrate this and to measure the effect of K on recognition performance, we perform a preliminary experiment on the Graffiti and Wall sequences [14, 15]. We detect approximately 1000 keypoints on the reference image and transform their coordinates to the rest of the images using the ground truth homography. We compute BRIEF descriptors around the reference and the test keypoints. For each test descriptor, we compute the K near neighbors using the Hamming distance and then pick the best match according to Eq. 3. We measure and report the recognition rate, the percentage of test keypoints matched to the correct reference keypoints, for various K values with the bit group size M set to 8. The results are given in Fig. 4. Regardless of descriptor type, enlarging the near neighbor list yields improved recognition rates. The improvement is less pronounced after \(K=10\).
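
A hedged sketch of this evaluation protocol, combining the earlier helpers; the `extractor` argument stands for any OpenCV descriptor extractor (e.g. `cv2.ORB_create()`), and a match counts as correct when the re-ranked best neighbor is the keypoint's own reference identity.

```python
import numpy as np
import cv2

def recognition_rate(ref_pts, H, ref_descs, ref_models, test_img,
                     extractor, k: int = 10, m: int = 8) -> float:
    """Fraction of ground-truth correspondences recovered by the two steps."""
    pts = cv2.perspectiveTransform(
        ref_pts.reshape(-1, 1, 2).astype(np.float32), H).reshape(-1, 2)
    # class_id remembers each keypoint's true reference identity.
    kps = [cv2.KeyPoint(float(x), float(y), 31, -1, 0, 0, i)
           for i, (x, y) in enumerate(pts)]
    kps, test_descs = extractor.compute(test_img, kps)  # may drop border points
    correct = 0
    for kp, q in zip(kps, test_descs):
        nn = k_near_neighbors(q, ref_descs, k)          # step 1: Hamming ranking
        best = rerank(q, nn, ref_descs, ref_models, m)  # step 2: Eq. 3
        correct += int(best == kp.class_id)
    return correct / len(kps)
```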

Fig. 4

Recognition rates for BRIEF increase as we consider more near neighbors (NN). This is especially true for Graffiti, where the baseline performance (\(K=1\)) is low and the test images exhibit both scale and rotation changes. The difference between considering the first ten NNs and all NNs is relatively small. At \(K=10\), the number of correct matches for Graffiti-3 increases to 385. This is still less than the maximum potential of 517 (see Fig. 1), but substantially better than 159, the nearest neighbor baseline

4 Experiments

To test the ability of our approach to recover correct matches from the near neighbor list, we have performed three sets of experiments. The first one follows the experimental setup of Fig. 4, and we report the recognition rate over ground truth correspondences. The second one measures the inlier ratio as a function of the number of keypoints matched in the test images. The final experiment measures the recognition rate when there is a larger number of reference images and an approximate nearest neighbor algorithm is used to compute the near neighbor list.

4.1 Recognition rate over ground truth correspondences

We repeat the experiments of Fig. 4 for four data sets and five descriptor types using \(K=10\) and three values for the bit group size (\(M \in \{ 4, 6, 8 \}\)). The results are shown in Fig. 5. Our approach increases the recognition rate particularly when the data set includes larger changes in perspective. For the Bikes sequence, with only changes in blur level, the improvement is less pronounced. This is expected since the observation probabilities represent behavior under perspective changes. We have tried including blur in the training data, but this did not yield a noticeable improvement, possibly because blur causes a loss of discriminative power.

Fig. 5

Recognition rate improves when we exploit the information in the first ten near neighbors as opposed to using just the nearest neighbor. Note that the improvement is most significant when the recognition rate is low (3–6) and the number of matches may not be enough for correct registration of the test image. The performance improves as M is increased from 4 to 8; however, increasing M beyond 10 degrades performance due to the exponentially increasing number of parameters in the probabilistic model, which would require even more training data. Hence, we limit our experiments to M between 4 and 8

BRIEF lacks orientation estimation; as a result, beyond test image 2 of the Boat sequence, even the top ten near neighbor list does not include enough correct matches and our approach does not improve the recognition rate. For descriptors with orientation estimation, such as ORB, the results improve by a large amount for test image 3. The most significant improvement is achieved for the Graffiti data set, which contains scale and perspective changes.

4.2 Improving the inlier ratio characteristics

The previous experiments measure the classification performance of keypoint matching over ground truth correspondences. As a more realistic test, we detect keypoints in the test images and match each one to the reference keypoints to yield a list of potential correspondences. We rank these by their negative descriptor distance and measure the ratio of correct matches to the total number of matches. This ratio equals the inlier ratio during the iterations of a robust estimator such as PROSAC [6], and it is a measure of the precision of the keypoint matching approach. The inlier ratios obtained this way directly influence the minimum number of PROSAC iterations required to correctly compute the pose of an object by keypoint matching.
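
As a worked instance of this dependence, the standard stopping criterion \(n \ge \log (1-p) / \log (1-w^s)\) links the inlier ratio w to the number n of sampling iterations needed for confidence p with a sample of size s (\(s=4\) for a homography):

```python
import math

def min_iterations(w: float, s: int = 4, p: float = 0.95) -> int:
    """Iterations needed to draw an all-inlier sample with confidence p."""
    return math.ceil(math.log(1 - p) / math.log(1 - w ** s))

print(min_iterations(0.44))  # low baseline inlier ratio: dozens of iterations
print(min_iterations(0.86))  # re-ranked inlier ratio: a handful suffice
```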

To show the benefits of keypoint specific scoring, we perform three sets of measurements. First, as a baseline, we rank the nearest neighbor matches using the Hamming distance. Second, we rank only the nearest neighbor matches according to the scores of Eq. 3. Finally, we both pick the best match in the K near neighbor (KNN) list and rank these correspondences according to Eq. 3.

Table 1 lists the results obtained. The re-weighted nearest neighbor (RNN) values show the improvement brought by using the scores of Eq. 3 alone. This leads to higher initial inlier ratios even though the final ratio at 500 matches stays nearly the same. The KNN curves show the additional improvement brought by searching beyond the nearest neighbor. In almost all cases, considering the KNNs leads to a higher inlier ratio over the best 500 keypoint matches.

Table 1 Inlier ratio values when matching the third test image to the reference image of each sequence

For Graffiti, the weaker BRIEF performance compared to BRISK and FREAK is more than offset by exploiting the K near neighbors. For the other data sets, the BRIEF performance is either too low (Boat) or too high (Bikes and Wall) to lead to a real difference in performance.

For Boat, the baseline inlier ratios among the best 250 BRISK and FREAK correspondences are 44 and 45 %, respectively. With ten near neighbors, the inlier ratios increase to 86 and 83 %, nearly doubling in each case. Such rates mean almost immediate convergence for PROSAC, which then requires only 5 iterations to guarantee sampling of four inliers with 95 % probability, more than 20 times faster than the baseline.

4.3 Keypoint matching with locality sensitive hashing

For some vision tasks, the reference collection contains more than a single image. The number of reference descriptors then grows accordingly, and matching keypoints with a brute-force descriptor search becomes impractical. To evaluate the performance of our approach in such a case, we concatenate the descriptors from all reference images of Graffiti, Boat, Bikes, and Wall, plus four other images. The total number of reference descriptors is around 8000. As in Sect. 4.2, we compute the inlier ratio curves, but this time we compute the list of ten near neighbors with LSH. We use the FLANN [17] implementation of LSH available in OpenCV with 12 tables, a key length of 20, and a multi-probe level of 2. This yields an LSH precision of 90 % (the nearest neighbor found by LSH coincides with the one found by brute-force search nine times out of ten).
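
A minimal sketch of this LSH configuration through OpenCV's FLANN bindings; `FLANN_INDEX_LSH` corresponds to the value 6 in the Python interface, and the `checks` value in the search parameters is our choice.

```python
import cv2

index_params = dict(algorithm=6,         # FLANN_INDEX_LSH
                    table_number=12,     # 12 hash tables
                    key_size=20,         # 20-bit hash keys
                    multi_probe_level=2)
matcher = cv2.FlannBasedMatcher(index_params, dict(checks=32))

# ref_descs, query_descs: uint8 binary descriptors (e.g. from cv2.ORB_create())
# matches = matcher.knnMatch(query_descs, ref_descs, k=10)
# Each list of ten DMatch candidates is then re-ranked with Eq. 3.
```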

As Table 2 shows, the resulting inlier ratios are lower than those in Table 1 since the number of reference descriptors is eight times greater. However, our approach recovers matches that would have been lost if only the nearest neighbor had been used. The improvement (usually a factor of two) is more dramatic for Boat and Graffiti.

Table 2 With eight reference images, inlier ratio values when matching the third test image of each sequence, with LSH used to compute the near neighbor list

4.4 Computation time

For 1000 reference and 928 query keypoints, using a brute-force approach, that is, when the descriptor distance to each reference keypoint is calculated, the total time for keypoint matching using only the nearest neighbor and the Hamming distance is 4.0 milliseconds (using the POPCNT instruction on a 64-bit laptop CPU). At \(K=10\), our approach takes 4.2 milliseconds irrespective of the value of M. The overhead of computing the scores of Eq. 3 for the top ten near neighbors is thus around 5 % for 1000 reference keypoints. The absolute overhead is negligible even for real-time applications and, in the worst case, scales linearly with K; it is thus halved at \(K=5\) and doubled at \(K=20\).

5 Conclusion

We propose an approach that can be used in conjunction with binary descriptors to search for keypoint matches beyond the nearest neighbor. It strikes a balance between keypoint specific representations and generic descriptors: by using a two-step approach, we combine the advantages of both and improve the robustness of binary descriptors for real-time object detection applications.