1 Introduction

Binary descriptors (such as [1, 5, 10, 11, 18, 21, 22]) are extensively used in real-time applications such as object detection and augmented reality. They are fast to compute, and the distance between two descriptors can be calculated in a few machine instructions. Given a query descriptor, the latter property greatly speeds up the search for its nearest neighbor within a set of reference descriptors.

Unfortunately, binary descriptors are particularly sensitive to larger changes in viewpoint and scale [20], mainly due to their discrete nature. For large binary descriptor sets, the nearest neighbor may not even be unique. However, it is relatively easy to find the top K near neighbors by ranking the matches according to their descriptor distances. This list is more likely to contain the correct match than the nearest neighbor alone. In practice, this extra set of match hypotheses is rarely exploited.
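
To make the near neighbor search concrete, the following is a minimal sketch of Hamming-distance matching in Python. It assumes descriptors stored as uint8 arrays, as OpenCV produces them; `np.unpackbits` stands in for the single POPCNT instruction per machine word that optimized implementations use, and the helper names are ours.

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two binary descriptors stored as uint8 arrays."""
    # XOR exposes the differing bits; unpackbits + sum counts them.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def k_near_neighbors(query: np.ndarray, references: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k reference descriptors closest to the query."""
    dists = np.array([hamming_distance(query, r) for r in references])
    return np.argsort(dists)[:k]  # ranked by ascending Hamming distance
```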

In this paper, we propose a two-step approach to keypoint matching with binary descriptors. In the first step, for each query keypoint, we identify the list of the top K near neighbors according to the Hamming distance. In the second step, we rank these near neighbors according to a probabilistic and keypoint specific match quality score. This score exploits precomputed data extracted from the reference image during an off-line training phase. We show that a keypoint specific measure is more effective in ranking the match hypotheses than the Hamming descriptor distance as illustrated in Fig. 2.

To motivate our approach, we illustrate the existence of correct matches beyond the nearest neighbor. Figure 1 shows the number of correct matches between BRIEF descriptors that fall into the first ten near neighbors. Although most of the BRIEF matches obtained by nearest neighbor matching are wrong, the correct ones are not far down the near neighbor list. Matches beyond the nearest neighbor are reachable only if we supplement the Hamming distance with a secondary, more distinctive match score.

Fig. 1

Near neighbor rank distribution of the correct matches between BRIEF descriptors of the first and the third images of the Graffiti data set. The nearest neighbor obtained by ranking matches according to Hamming distance captures only 159 of the 848 possible correspondences. Meanwhile, the first ten near neighbors include 517 correct matches, a potential improvement by a factor of more than three. In practice, as demonstrated by the results of Fig. 4, 385 of these can be recovered using the approach that we propose, which yields an actual improvement by a factor of more than two

Fig. 2

Matches between the first and the fourth images of the Graffiti and Boat sequences. a Considering only the nearest neighbor with the Hamming distance. b Searching within the first ten near neighbors using the approach presented in Sect. 3

In designing this secondary match score, we make the following observations. Most binary descriptors, even the ones that optimize their representation, compute the same features for every keypoint. As recently shown by [3], while computationally appealing, globally optimizing the features for all keypoints is not as effective as picking unique features for each individual keypoint. A few existing approaches learn and use keypoint specific representations [3, 8, 9, 19]. However, these require a separate distance computation for each reference keypoint and cannot be directly used with approximate nearest neighbor (ANN) algorithms such as Locality Sensitive Hashing (LSH) [2].

In contrast to these, at the second matching stage, we have only K possible match hypotheses (corresponding to the K near neighbors). We employ a keypoint specific representation to rank these and recover correct matches that did not make it to the top initially. For each reference keypoint \(k_i\) in the K near neighbor list, our score computation relies on a precomputed statistical model of the variations of the descriptor bits for keypoint \(k_i\). Since the list of K near neighbors can be computed efficiently with ANN methods, our approach also scales well to larger descriptor sets.

Our main contributions can be summarized as follows:

  • We propose a method that is able to find keypoint matches within the list of K near neighbors at negligible additional computational cost at run-time.

  • We demonstrate that, despite learning a separate representation for each individual keypoint similar to [3, 8, 9, 19], our approach does not require a brute-force search in a large descriptor set when coupled with LSH [2] to compute the list of K near neighbors.

  • We show that our approach is largely descriptor independent and extends the matching range of several binary descriptors: BRIEF [5], ORB [21], BRISK [10], FREAK [1], and LATCH [11], the last of which has recently been shown to outperform state-of-the-art binary descriptors.

2 Related work

As discussed in the introduction, keypoint specific matching has been shown to perform better than matching with a generic descriptor [3]. One way to achieve this is to compute a descriptor specifically adapted to a given patch. The approach of [8] selects a patch specific set of binary features that are robust to changes in intensity levels as well as to small translations and rotations. More recently, [3] proposed learning a combined set of generic features and a locally optimized binary mask that picks the best features depending on the image patch. While both approaches greatly improve matching performance, they require a brute-force search over the reference collection and are therefore not suitable for real-time operation on large data sets.

Similarly, the keypoint classification approaches of [9, 19] train keypoint specific classifiers and compute a probability distribution over all reference keypoints at run-time. Moreover, they require large amounts of memory per keypoint to store the learned probabilities.

Despite using a probabilistic model similar to that of [9, 19], our approach evaluates these probabilities only for the first ten match candidates. When matching a single keypoint against a database of a thousand reference keypoints, the required number of probability calculations is two orders of magnitude smaller than that required by Random Ferns.

The combination of a generic initial query followed by candidate specific filtering is quite common in the image retrieval literature [7, 12]. For image retrieval, the filtering in the second stage depends on the geometric consistency between the query image and the candidate results. Our approach follows a very similar pipeline, with the geometry check replaced by a keypoint specific observation probability.

3 A two-step approach to match keypoints

The usual approach to matching keypoints is to locate the nearest neighbor in descriptor space. Instead of this one-shot approach, we first rank the possible matches by their descriptor distance and keep the top K near neighbors of the query in descriptor space as match candidates. We then evaluate each candidate based on detailed statistics of the texture around that specific candidate. Figure 3 gives an overview of the proposed approach.

Fig. 3

a Given a query descriptor q, if the descriptor uses the same features for each patch then it is possible to use an ANN approach to limit distance computation only to a subset of the reference descriptors (dashed circle). b Some approaches learn and employ a patch specific representation that is more distinctive and robust. c We propose a two-step approach that combines the advantages of both by limiting the patch specific scoring to the list of K near neighbors

We learn the statistical model for each keypoint in an off-line training stage by simulating affine deformations and observing the change in the descriptor bits. In the following, we describe the way the descriptor statistics are collected during training and how to score the multiple match hypotheses based on this data.

3.1 Modeling descriptor variations

The binary descriptor bits are not fully invariant to changes in perspective and lighting. Which bits are more prone to variation depends on the texture around each keypoint. If there is a strong gradient near one of the pixels that are part of the computation of the bit’s value, then that bit is more likely to flip. Due to the complex nature of these interactions, bit variations are better captured by a probabilistic conditional model such as

$$\begin{aligned} P(D \; | \; C = k_i) = P(d_1, d_2, \ldots , d_S \; | \; C = k_i), \end{aligned}$$
(1)

where D and C are two random variables corresponding to the descriptor value and the keypoint identity. \(C = k_i\) means that the distribution is computed for keypoint i, \(d_j\) is a binary random variable representing the \(j{\text {th}}\) descriptor bit, and S is the descriptor size in bits.

This representation requires \(2^S\) parameters per keypoint and since S is usually larger than or equal to 64, directly modeling the joint probability of the descriptor bits is not feasible. Following the representation proposed by [19], we split the descriptor into N groups of M bits such that \(S=M\times N\) and assume independence between these:

$$\begin{aligned} P(D \; | \; C = k_i) = \prod _{j=1}^{N} P(D_j \; | \; C = k_i), \end{aligned}$$
(2)

where \(D_j\) represents the values of the bits in group j. This representation has \(N\times 2^M\) parameters. We use several settings with \(M=4,6,\;\text {and}\;8\); for 256-dimensional descriptors, these yield \(N\) between 32 and 64. Larger M values yield a very large number of parameters, and around \(M=12\) we experimentally found that the matching performance degrades. Moreover, a large M means greater memory consumption. In the next section, we provide experimental results for M between 4 and 8. N is determined by fixing M and taking \(N = \frac{\text {descriptor size}}{M}\).
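
As a minimal sketch of this factorization, the mapping from a descriptor to its N group values can be written as follows, assuming consecutive bits form a group (the scheme we adopt below); the helper name is ours.

```python
import numpy as np

def descriptor_to_group_values(descriptor: np.ndarray, m: int) -> np.ndarray:
    """Split a binary descriptor (uint8 array) into N = S/M groups of M
    consecutive bits and return each group's integer value in [0, 2**m)."""
    bits = np.unpackbits(descriptor)         # the S individual descriptor bits
    groups = bits.reshape(-1, m)             # N rows of M consecutive bits
    weights = 1 << np.arange(m - 1, -1, -1)  # read each row as a binary number
    return groups @ weights                  # N group values
```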

The descriptor bits can be assigned to groups in many ways. The simplest possibility is to assign the consecutive descriptor bits to the same group. For descriptors like BRIEF, this is a natural choice as there is no reason to favor one grouping scheme over another. For descriptors like BRISK with more structure, an optimization over possible groupings might be beneficial. In practice, we found that the consecutive grouping works well for all descriptors and we use it in the rest of the experiments.

For each keypoint, we generate training samples using the affine deformation model proposed by [16]. This is not too restrictive since most keypoint detectors assume a locally affine model. In all the experiments, we use the same set of training parameters: the scale varies between \(\frac{1}{\sqrt{2}}\) and \(\sqrt{2}\), the in-plane rotation between \(-30^\circ \) and \(+30^\circ \), the tilt amount (\(\theta \) in [16]) between \(0^\circ \) and \(60^\circ \), and the tilt angle (\(\phi \) in [16]) between \(0^\circ \) and \(180^\circ \).
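
A hedged sketch of this sampling step is given below. The parameter ranges follow the text, but the exact warp composition of [16] may differ in detail; here the tilt is approximated as an anisotropic compression by \(\cos \theta \) along direction \(\phi \), and all names are ours.

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)

def rot(deg: float) -> np.ndarray:
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

def random_affine(center) -> np.ndarray:
    """Random 2x3 affine warp drawn from the training parameter ranges."""
    s     = rng.uniform(1 / np.sqrt(2), np.sqrt(2))  # scale
    psi   = rng.uniform(-30, 30)                     # in-plane rotation (deg)
    theta = rng.uniform(0, 60)                       # tilt amount (deg)
    phi   = rng.uniform(0, 180)                      # tilt angle (deg)
    # Tilt approximated as a compression by cos(theta) along direction phi.
    tilt = rot(phi).T @ np.diag([np.cos(np.deg2rad(theta)), 1.0]) @ rot(phi)
    A = s * rot(psi) @ tilt
    t = np.asarray(center, dtype=float) - A @ np.asarray(center, dtype=float)
    return np.hstack([A, t[:, None]])                # keeps the keypoint fixed

# warped = cv2.warpAffine(reference_image, random_affine((cx, cy)),
#                         reference_image.shape[1::-1])
```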

We generate roughly 200,000 images, yielding an equivalent number of training descriptors for each keypoint. For each bit group, the probabilities of Eq. 2 are inferred by counting the number of times training descriptors assume a particular value. For example, for bit groups of size 4, we have 16 possible outcomes. We count the number of times each outcome is observed within the training set and normalize by the total number of training samples. To counter the adverse effect of zeros in the estimated probabilities, we start counting from one as suggested by [19], which is equivalent to assuming \(2^M\) pseudo-samples for each keypoint taking on every possible descriptor value within a bit group. This is usually referred to as Laplace smoothing [13] and corresponds to assuming a Dirichlet prior when a Bayesian estimate is made for the probabilities of Eq. 2 [4].
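
A sketch of this counting procedure, reusing `descriptor_to_group_values` from the sketch above; the table layout (one row of \(2^M\) smoothed probabilities per bit group) is our choice, not prescribed by the text.

```python
import numpy as np

def train_keypoint_model(training_descriptors: np.ndarray, m: int) -> np.ndarray:
    """Estimate P(D_j | C = k_i) for one keypoint from its training descriptors.

    Returns an (N, 2**m) table of Laplace-smoothed probabilities."""
    n_groups = training_descriptors.shape[1] * 8 // m
    counts = np.ones((n_groups, 2 ** m))              # start counting from one
    for desc in training_descriptors:
        values = descriptor_to_group_values(desc, m)  # from the earlier sketch
        counts[np.arange(n_groups), values] += 1
    return counts / counts.sum(axis=1, keepdims=True) # normalize each group
```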

3.2 Keypoint specific scoring of match hypotheses

For each keypoint, we first obtain the list of K near neighbors based on the Hamming distance between the descriptors. This list is then sorted according to a score specific to the particular candidate keypoint each near neighbor represents.

We have experimented with various functions to score each hypothesis by combining the descriptor statistics and the original Hamming distance in several different ways. The best results have been obtained when the score is the negative Hamming distance between the query and reference descriptor plus the logarithm of the probability of observing the query descriptor according to Eq. 2:

$$\begin{aligned} \text {score}(Q, k_i) = - \left| Q - D^i \right| _H + \log \prod _{j=1}^{N} P \left( Q_j \; | \; C = k_i \right) , \end{aligned}$$
(3)

where Q is the query descriptor, \(D^i\) is the reference descriptor i, and \(|X-Y|_H\) is the Hamming distance between X and Y.

Since the second term involves \(k_i\), evaluating it directly would require a number of operations linear in the size of the reference data set. Our main insight is that computing it only for the first K near neighbors is sufficient to recover many more matches than using the nearest neighbor alone.
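
A minimal sketch of Eq. 3 and the second-step re-ranking, reusing the helpers from the earlier sketches; `ref_models` is assumed to hold one probability table per reference keypoint, as produced by the training sketch.

```python
import numpy as np

def score(query, ref_desc, ref_model, m: int) -> float:
    """Eq. 3: negative Hamming distance plus the log-probability of
    observing the query descriptor under the reference keypoint's model."""
    h = int(np.unpackbits(np.bitwise_xor(query, ref_desc)).sum())
    values = descriptor_to_group_values(query, m)
    log_p = np.log(ref_model[np.arange(len(values)), values]).sum()
    return -h + log_p

def rerank(query, neighbor_ids, ref_descs, ref_models, m: int):
    """Second step: pick the best of the K near neighbors by Eq. 3."""
    scores = [score(query, ref_descs[i], ref_models[i], m) for i in neighbor_ids]
    return neighbor_ids[int(np.argmax(scores))]
```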

To demonstrate this and to measure the effect of K on recognition performance, we perform a preliminary experiment on the Graffiti and Wall sequences [14, 15]. We detect approximately 1000 keypoints on the reference image and transform their coordinates to the rest of the images using the ground truth homography. We compute BRIEF descriptors around the reference and the test keypoints. For each test descriptor, we compute the K near neighbors using the Hamming distance and then pick the best match according to Eq. 3. We measure and report the recognition rate, the percentage of test keypoints matched to the correct reference keypoints, for various K values with the bit group size M set to 8. The results are given in Fig. 4. Regardless of descriptor type, enlarging the near neighbor list yields improved recognition rates. The improvement is less pronounced after \(K=10\).
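
A hedged sketch of this evaluation protocol, combining the earlier helpers; the `extractor` argument stands for any OpenCV descriptor extractor (e.g. `cv2.ORB_create()`), and a match counts as correct when the re-ranked best neighbor is the keypoint's own reference identity.

```python
import numpy as np
import cv2

def recognition_rate(ref_pts, H, ref_descs, ref_models, test_img,
                     extractor, k: int = 10, m: int = 8) -> float:
    """Fraction of ground-truth correspondences recovered by the two steps."""
    pts = cv2.perspectiveTransform(
        ref_pts.reshape(-1, 1, 2).astype(np.float32), H).reshape(-1, 2)
    # class_id remembers each keypoint's true reference identity.
    kps = [cv2.KeyPoint(float(x), float(y), 31, -1, 0, 0, i)
           for i, (x, y) in enumerate(pts)]
    kps, test_descs = extractor.compute(test_img, kps)  # may drop border points
    correct = 0
    for kp, q in zip(kps, test_descs):
        nn = k_near_neighbors(q, ref_descs, k)          # step 1: Hamming ranking
        best = rerank(q, nn, ref_descs, ref_models, m)  # step 2: Eq. 3
        correct += int(best == kp.class_id)
    return correct / len(kps)
```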

Fig. 4

Recognition rates for BRIEF increase as we consider more near neighbors (NN). This is especially true for Graffiti, where the baseline performance (\(K=1\)) is low and the test images exhibit both scale and rotation changes. The difference between considering the first ten NNs and all NNs is relatively small. At \(K=10\), the number of correct matches for Graffiti-3 increases to 385. This is still less than the maximum potential of 517 (see Fig. 1), but substantially better than 159, the nearest neighbor baseline

4 Experiments

To test the ability of our approach to recover correct matches from the near neighbor list, we have performed three sets of experiments. The first one follows the experimental setup of Fig. 4, and we report the recognition rate over ground truth correspondences. The second one measures the inlier ratio as a function of the number of keypoints matched in the test images. The final experiment measures the recognition rate when there is a larger number of reference images and an approximate nearest neighbor algorithm is used to compute the near neighbor list.

4.1 Recognition rate over ground truth correspondences

We repeat the experiments of Fig. 4 for four data sets and five descriptor types using \(K=10\) and three values for the bit group size (\(M \in \{ 4, 6, 8 \}\)). The results are shown in Fig. 5. Our approach increases the recognition rate particularly when the data set includes larger changes in perspective. For the Bikes sequence, with only changes in blur level, the improvement is less pronounced. This is expected since the observation probabilities represent behavior under perspective changes. We have tried including blur in the training data, but this did not yield a noticeable improvement, possibly because blur causes a loss of discriminative power.

Fig. 5

Recognition rate improves when we exploit the information in the first ten near neighbors as opposed to using just the nearest neighbor. Note that the improvement is most significant when the recognition rate is low (3–6) and the number of matches may not be enough for correct registration of the test image. The performance improves as M is increased from 4 to 8; however, increasing M beyond 10 degrades performance due to the exponentially increasing number of parameters in the probabilistic model, which would require even more training data. Hence, we limit our experiments to M between 4 and 8

BRIEF lacks orientation estimation; as a result, beyond test image 2 of the Boat sequence, even the top ten near neighbor list does not include enough correct matches and our approach does not improve the recognition rate. For descriptors with orientation estimation, such as ORB, the results improve by a large amount for test image 3. The most significant improvement is achieved for the Graffiti data set, which contains scale and perspective changes.

4.2 Improving the inlier ratio characteristics

The previous experiments measure the classification performance of keypoint matching over ground truth correspondences. As a more realistic test, we detect keypoints in the test images and match each one to the reference keypoints to yield a list of potential correspondences. We rank these by their negative descriptor distance and measure the ratio of correct matches to the total number of matches. This ratio equals the inlier ratio during the iterations of a robust estimator such as PROSAC [6], and it is a measure of the precision of the keypoint matching approach. The inlier ratios obtained this way directly influence the minimum number of PROSAC iterations required to correctly compute the pose of an object by keypoint matching.
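
As a worked instance of this dependence, the standard stopping criterion \(n \ge \log (1-p) / \log (1-w^s)\) links the inlier ratio w to the number n of sampling iterations needed for confidence p with a sample of size s (\(s=4\) for a homography):

```python
import math

def min_iterations(w: float, s: int = 4, p: float = 0.95) -> int:
    """Iterations needed to draw an all-inlier sample with confidence p."""
    return math.ceil(math.log(1 - p) / math.log(1 - w ** s))

print(min_iterations(0.44))  # low baseline inlier ratio: dozens of iterations
print(min_iterations(0.86))  # re-ranked inlier ratio: a handful suffice
```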

To show the benefits of keypoint specific scoring, we perform three sets of measurements. First, as a baseline, we rank the nearest neighbor matches using the Hamming distance. Second, we rank only the nearest neighbor matches according to the scores of Eq. 3. Finally, we both pick the best match in the K near neighbor (KNN) list and rank these correspondences according to Eq. 3.

Table 1 lists the results obtained. The re-weighted nearest neighbor (RNN) values show the improvement brought by using the scores of Eq. 3 alone. This leads to higher initial inlier ratios even though the final ratio at 500 matches stays nearly the same. The KNN curves show the additional improvement brought by searching beyond the nearest neighbor. In almost all cases, considering the KNNs leads to a higher inlier ratio over the best 500 keypoint matches.

Table 1 Inlier ratio values when matching the third test image to the reference image of each sequence

For Graffiti, the weaker BRIEF performance compared to BRISK and FREAK is more than offset by exploiting the K near neighbors. For the other data sets, the BRIEF performance is either too low (Boat) or too high (Bikes and Wall) to lead to a real difference in performance.

For Boat, the baseline inlier ratios among the best 250 BRISK and FREAK correspondences are 44 and 45 %, respectively. With ten near neighbors, the inlier ratios increase to 86 and 83 %, nearly doubling in each case. Such rates mean almost immediate convergence for PROSAC, which then requires only 5 iterations to guarantee sampling of four inliers with 95 % probability, more than 20 times faster than the baseline.

4.3 Keypoint matching with locality sensitive hashing

For some vision tasks, the reference collection contains more than a single image. The number of reference descriptors then grows accordingly, and matching keypoints with a brute-force descriptor search becomes impractical. To evaluate the performance of our approach in such a case, we concatenate the descriptors from all reference images of Graffiti, Boat, Bikes, and Wall, plus four other images. The total number of reference descriptors is around 8000. As in Sect. 4.2, we compute the inlier ratio curves, but this time we compute the list of ten near neighbors with LSH. We use the FLANN [17] implementation of LSH available in OpenCV with 12 tables, a key length of 20, and a multi-probe level of 2. This yields an LSH precision of 90 % (the nearest neighbor found by LSH coincides with the one found by brute-force search nine times out of ten).
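
A minimal sketch of this LSH configuration through OpenCV's FLANN bindings; `FLANN_INDEX_LSH` corresponds to the value 6 in the Python interface, and the `checks` value in the search parameters is our choice.

```python
import cv2

index_params = dict(algorithm=6,         # FLANN_INDEX_LSH
                    table_number=12,     # 12 hash tables
                    key_size=20,         # 20-bit hash keys
                    multi_probe_level=2)
matcher = cv2.FlannBasedMatcher(index_params, dict(checks=32))

# ref_descs, query_descs: uint8 binary descriptors (e.g. from cv2.ORB_create())
# matches = matcher.knnMatch(query_descs, ref_descs, k=10)
# Each list of ten DMatch candidates is then re-ranked with Eq. 3.
```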

As Table 2 shows, the resulting inlier ratios are lower than those in Table 1 since the number of reference descriptors is eight times greater. However, our approach recovers matches that would have been lost if only the nearest neighbor had been used. The improvement (usually a factor of two) is more dramatic for Boat and Graffiti.

Table 2 With eight reference images, inlier ratio values when matching the third test image of each sequence, with LSH used to compute the near neighbor list

4.4 Computation time

For 1000 reference and 928 query keypoints, using a brute-force approach, that is, when the descriptor distance to each reference keypoint is calculated, the total time for keypoint matching using only the nearest neighbor and the Hamming distance is 4.0 milliseconds (using the POPCNT instruction on a 64-bit laptop CPU). At \(K=10\), our approach takes 4.2 milliseconds irrespective of the value of M. The overhead of computing the scores of Eq. 3 for the top ten near neighbors is thus around 5 % for 1000 reference keypoints. The absolute overhead is negligible even for real-time applications and, in the worst case, scales linearly with K; it is thus halved at \(K=5\) and doubled at \(K=20\).

5 Conclusion

We propose an approach that can be used in conjunction with binary descriptors to search for keypoint matches beyond the nearest neighbor. It strikes a balance between keypoint specific representations and generic descriptors: by using a two-step approach, we combine the advantages of both and improve the robustness of binary descriptors for real-time object detection applications.