1 Introduction

Sparse feature detection and description is at the root of many computer vision tasks: Structure-from-Motion (SfM), Simultaneous Localization and Mapping (SLAM), image retrieval, tracking, etc. Sparse features offer a compact representation in terms of memory storage, allow for efficient image matching, and are thus well suited for large-scale applications [14, 35, 36]. These features should however be able to cope with real-world conditions such as day-night changes [44], seasonal variations [34], and matching across large baselines [40].

To enable matching in extreme scenarios, successive generations of feature detectors and descriptors have become increasingly invariant [23]. The Harris corner detector [12] was already invariant to rotation, but not to scale. The SIFT detector and descriptor [20] was one of the first to achieve invariance with respect to scale, rotation and uniform light changes. More recently, learned descriptors have been able to encode invariance without handcrafting it. On the one hand, patch-based descriptors can become invariant to geometric transforms by estimating the shape of the patch [10, 25, 29, 43]. On the other hand, recent dense descriptors leverage the power of large convolutional neural networks (CNNs) to become more general and invariant. Most of them are trained on images with many variations, obtained either through data augmentation [8], from large databases of challenging images [9, 42], or with style transfer [31]. Invariance can also be encoded directly in the network itself [19]. The general trend in descriptor learning is thus to capture as much invariance as possible.

Fig. 1. Importance of invariance among descriptors. SIFT descriptors (left) perform well on rotated images (top), but are outperformed by Upright SIFT descriptors (middle) when no rotation is present (bottom). We propose a method (right) that automatically selects the proper invariance at matching time.

While feature detectors should generally be invariant in order to be repeatable under different scenarios [44], the same is not necessarily true for descriptors [41]. There is a direct trade-off for descriptors between generalization and discriminative power: more invariance allows better generalization, but produces less informative descriptors. Figure 1 shows that the rotation-variant descriptor Upright SIFT performs better than its invariant counterpart SIFT when only small rotations are present in the data. We argue that the best level of invariance depends on the situation. This calls into question the recent trend of jointly learning detector and descriptor: the two may have to be dissociated if the descriptor should not be as invariant as the detector.

In this work, we focus on learning descriptors only and propose to select the right invariance at runtime, given the context. Instead of learning a single generic descriptor, we compute several descriptors with different levels of invariance. We then propose a method to automatically select the most suitable invariance during matching. We achieve this by leveraging the local descriptors to learn meta descriptors that encode global information about the variations present in the image. At matching time, the local descriptor distances are weighted by the similarity of these meta descriptors to produce a single descriptor distance. Matches based on this distance can then be filtered with standard heuristics such as the ratio test or mutual nearest neighbor.

Overall, our method, named Local Invariance Selection at Runtime for Descriptors (LISRD, pronounced as lizard), brings flexibility and interpretability into feature description. When some image variations are known to be limited for a given application, one may directly use the most discriminative of our learned local descriptors. However, it is usually hard to make such assumptions about the inter-image variations, and LISRD can instead automatically select the best invariance independently for each local region. Hence, we are able to distinguish between different levels of variations within the same image (e.g. when half of the image is in shadow and the other half is not), and we show that this can improve matching compared to using a single descriptor. The meta descriptor formulation is also not restricted to our proposed learned local descriptors, but can easily be generalized to most keypoint detectors and descriptors, as shown in Fig. 1 where it is applied to SIFT and Upright SIFT. Furthermore, the meta description adds only a small runtime and memory overhead to current keypoint detection and description pipelines, which makes it suitable for real-time applications. In summary, this work makes the following contributions:

  • We show how to learn several local descriptors with different variance properties through a single network, in a spirit similar to multi-task learning.

  • We propose a lightweight meta descriptor approach to automatically select the best invariance of the local descriptors given the context.

  • Our concept of meta descriptor and general approach of invariance selection can be easily transferred to most feature point detectors and descriptors, which we demonstrate for learned as well as traditional handcrafted descriptors.

2 Related Work

Learned Local Feature Descriptors. Recent progress in deep learning has enabled learned local descriptors to outperform the classical baselines by a large margin [8, 9, 21, 31]. Following the classical approach, early works run a CNN on a small image region around the point of interest to get a patch descriptor [24, 29, 38]. The patch is not restricted to square areas, but can encode spatial transforms such as affine [25] and polar [10] ones. The network is often optimized with a triplet loss using heuristics to extract positive and negative patches [3, 11, 22, 39], or by directly maximizing the average precision (AP) [13]. Working on sparse features also makes it possible to leverage both the visual context of the image and the spatial relationships between the keypoint locations [21]. More recently, descriptors extracted densely by CNN architectures from full images have shown both fast inference time and high performance on matching and retrieval tasks, and can jointly predict a keypoint heatmap. Some works detect keypoints and describe them in parallel, such as SuperPoint [8] and R2D2 [31], the latter additionally predicting a reliability map that keeps track of the most informative locations in the image. Another approach is to use the network features as dense descriptors and to subsequently detect keypoints based on them [9, 28, 42]. DELF [28] selects keypoints using a learned attention, D2-Net [9] retrieves the maximum responses of the descriptor feature map across all channels, while UR2KID [42] clusters the channels into different groups and extracts keypoints based on their L2 responses. Even though jointly estimating keypoints and descriptors allows faster prediction and yields descriptors that are more correlated with the keypoints, the consequence is that detector and descriptor share the same invariance. Therefore, we choose to focus exclusively on descriptor learning in this work.

Invariance in Feature Descriptors. Selecting invariance online for binary descriptors is the core idea of BOLD [2], where a subset of binary tests is chosen at runtime for each image patch to maximize the invariance to small affine transformations. In contrast, the general trend of most recent learned methods is to obtain descriptors as invariant as possible to any image variation. LIFT [43] mimics SIFT to achieve rotation invariance by estimating the keypoints, their orientation, and finally their descriptor. Invariance to specific geometric changes can be achieved through group convolutions [7] by clustering the different geometric transformations into specific groups [19]. However, the usual strategy is to incorporate as much diversity in the training data as possible. Illumination invariance can for example be obtained by training on images with multiple lighting conditions [15]. Photometric and homographic data augmentations also increase robustness to illumination and viewpoint changes [8]. Similarly, R2D2 [31] improves robustness to day-night changes by synthesizing night images with style transfer and to viewpoint changes by leveraging flow between close-by images [30]. Methods like D2-Net [9] and UR2KID [42] leverage a large database of images with multiple conditions and non-planar viewpoint changes thanks to SfM data [17]. In this work, we adopt a mixture of the previously mentioned strategies, namely the same synthesized night images as in [31], homographic augmentation, and training on datasets with multiple illumination changes [27].

Multi-task Learning in Description and Matching Tasks. Using a single network to achieve multiple related tasks in feature description and matching is not new. Jointly learning the detector and descriptor [8, 9, 31] is already a form of multi-task learning that makes the descriptors more discriminative at the predicted keypoint locations. HF-Net [32] unifies the detection of feature points, local descriptors, and global descriptors for image retrieval using multi-task distillation with a teacher network. Methods such as SuperGlue [33] and ContextDesc [21] leverage both visual and geometric context in their descriptors to obtain more consistent matching between images. UR2KID [42] bypasses the need for keypoint supervision during training and directly optimizes the descriptors jointly for local matching and image retrieval. In our approach, multiple descriptors are also learned in parallel, but instead of differing in their scope, they differ in their level of invariance. Furthermore, unlike previous hierarchical global-to-local approaches, our method relies on local descriptors first and leverages global information only to refine the local matching.

Fig. 2. Overview of our network architecture. Our network computes four local dense descriptors with diverse invariances and aggregates them through a NetVLAD layer [1] to obtain a regional description of the variations of the image.

3 Learning the Best Invariance for Local Descriptors

Our approach to select the most relevant variance for local feature descriptors consists of two steps. First, we design a network to learn several dense descriptors, each with a different type of invariance (see Sect. 3.1). Second, we propose a strategy in Sect. 3.2 to determine the best invariance to use when matching the local descriptors. Figure 2 provides an overview of the full architecture.

3.1 Disentangling Invariance for Local Descriptors

Many properties of an image influence descriptors, but disentangling all of them would be intractable. We focus here on two factors known to have a large impact on descriptor performance: rotation and illumination. Our framework can however be generalized to other kinds of variations, for instance scaling. Since each of the two factors can either be variant or invariant, there are four possible combinations of variance with respect to illumination and rotation. We show in the following that the variant versions of the descriptors are more discriminative since they are more specialized, while the invariant ones trade discriminative power for better generalization.

Network Architecture. Our network is inspired by SuperPoint [8], with slight modifications. It takes RGB images as input, computes semi-dense features with a shared backbone of convolutions and is then divided into 4 heads predicting a semi-dense descriptor each, one per combination of variance, as shown in Fig. 2. Since most computations are redundant between the 4 local descriptors, the shared backbone reduces the number of weights in the network and offers an inference time competitive with the current learned descriptors.

Dataset Preparation. The training dataset is composed of triplets of images. The first one, the anchor image \(I^A\), is taken from a large database of real images. The variant image \(I^V\), used to train variant descriptors, is a version of the anchor warped by a homography without rotation and with unchanged illumination. Finally, the invariant image \(I^I\), used for invariant descriptors, is also related to the anchor by a homography, but its orientation and illumination can differ from those of the anchor.

Training Losses. The local descriptors are trained using variants of the margin triplet ranking loss [5, 24], depending on whether the descriptor should be invariant or not to the variations present in \(I^I\). The dense descriptors are first sampled at selected keypoints of the images and L2-normalized, and the losses are computed on the resulting set of feature descriptors. Since we focus on descriptors only, we use SIFT keypoints during training to propagate the gradient only in informative areas of the image. Nonetheless, any kind of keypoint can be used at inference time, as demonstrated in Sect. 4.5.

Formally, given two images \(I^a\) and \(I^b\) related by a homography \(\mathcal {H}\) and n keypoints \(\mathbf {x}^a_{1..n}\) in image \(I^a\), we warp each point to image \(I^b\) using the homography: \(\mathbf {x}^b_{1..n} = \mathcal {H}(\mathbf {x}^a_{1..n})\). This yields a set of n correspondences between the two images, where we can extract the descriptors from each dense descriptor map: \(\mathbf {d}^a_{1..n}\) and \(\mathbf {d}^b_{1..n}\). Let us define a generic triplet loss \(L_T(I^a, I^b, \mathrm {dist})\) between \(I^a\) and \(I^b\), given a descriptor distance \(\mathrm {dist}(\mathbf {x}^a, \mathbf {x}^b)\). The triplet loss first enforces a correct correspondence \((\mathbf {x}_i^a, \mathbf {x}_i^b)\) to be close in descriptor space through a positive distance

$$\begin{aligned} p_i = \mathrm {dist}(\mathbf {x}_i^a, \mathbf {x}_i^b). \end{aligned}$$
(1)

Additionally, the triplet loss increases the negative distance \(n_i\) between \(\mathbf {x}_i^a\) and the closest point in \(I^b\) which is at least at a distance T from the correct match \(\mathbf {x}_i^b\). This distance is computed symmetrically across the two images and the minimum is kept:

$$\begin{aligned} n_i = \min (\mathrm {dist}(\mathbf {x}_i^a, \mathbf {x}_{n_b(i)}^b), \mathrm {dist}(\mathbf {x}_i^b, \mathbf {x}_{n_a(i)}^a)), \end{aligned}$$
(2)

with \(n_b(i) = \mathop {\mathrm {arg\,min}}_{j \,:\, ||\mathbf {x}_j^b - \mathbf {x}_i^b||_2 \ge T} \mathrm {dist}(\mathbf {x}_i^a, \mathbf {x}_j^b)\), and similarly for \(n_a(i)\). Given a margin M, the triplet margin loss is then defined as

$$\begin{aligned} L_T(I^a, I^b, \mathrm {dist}) = \frac{1}{n} \sum _{i=1}^n \max (M + (p_i)^2 - (n_i)^2, 0). \end{aligned}$$
(3)
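
To make Eqs. (1)-(3) concrete, the following is a minimal PyTorch sketch of this generic triplet loss; the function name and the assumption that descriptors and keypoints come as dense tensors are ours, not part of the original implementation.

```python
import torch

def triplet_margin_loss(desc_a, desc_b, kpts_a, kpts_b, M=1.0, T=8.0):
    """Generic triplet loss L_T of Eq. (3) for n correspondences.

    desc_a, desc_b: (n, d) L2-normalized descriptors of matching keypoints.
    kpts_a, kpts_b: (n, 2) keypoint locations in I^a and I^b.
    """
    dist = torch.cdist(desc_a, desc_b)          # (n, n) descriptor distances
    p = torch.diagonal(dist)                    # positive distances p_i (Eq. 1)
    # Negatives must lie at least T pixels away from the correct match (Eq. 2).
    too_close_b = torch.cdist(kpts_b, kpts_b) < T
    too_close_a = torch.cdist(kpts_a, kpts_a) < T
    n_b = dist.masked_fill(too_close_b, float('inf')).min(dim=1).values
    n_a = dist.masked_fill(too_close_a, float('inf')).min(dim=0).values
    n = torch.minimum(n_b, n_a)                 # hardest negative across both images
    return torch.clamp(M + p ** 2 - n ** 2, min=0).mean()  # Eq. (3)
```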

In our case, the loss \(L_I\) for invariant descriptors is an instance of this generic triplet loss between the anchor image \(I^A\) and the invariant image \(I^I\), for the L2 descriptor distance:

$$\begin{aligned} L_I = L_T(I^A, I^I, ||\mathbf {d}^A - \mathbf {d}^I||_2). \end{aligned}$$
(4)

The loss \(L_V\) for variant descriptors is based on the full triplet of images \(I^A\), \(I^I\) and \(I^V\). It enforces variant descriptors to differ between the anchor and the invariant image, while preserving similarity between the anchor and the variant image. Its positive distance is the descriptor distance of positive matches between \(I^A\) and \(I^V\), and the negative distance is defined similarly between \(I^A\) and \(I^I\):

$$\begin{aligned} L_V = \frac{1}{n} \sum _{i=1}^n \max (f M + ||\mathbf {d}^A_i - \mathbf {d}^V_i||_2^2 - ||\mathbf {d}^A_i - \mathbf {d}^I_i||_2^2, 0), \end{aligned}$$
(5)

where f is a factor controlling how different the anchor and the invariant images are. For rotation changes, \(f = \min (1, \frac{\theta _I}{\theta _{max}})\), where \(\theta _I\) is the absolute angle of rotation between the anchor and the invariant image and \(\theta _{max}\) is a hyper-parameter representing the threshold beyond which the two images are considered different. This threshold ensures that only large rotations are penalized by the loss. Since it is hard to quantify the difference in illumination between two real images, we set \(f = 1\) whenever the illumination differs between the anchor and the invariant image.

When a descriptor d in the set \(\mathcal {D}\) of descriptors is supposed to be invariant to all changes (illumination and/or rotation) between \(I^A\) and \(I^I\), we use \(L_I\). Otherwise, \(L_V\) is used. We define \(L_{I/V}(d)\) as the selected loss and the total loss for local descriptors as

$$\begin{aligned} L_l = \frac{1}{|\mathcal {D}|} \sum _{d\in \mathcal {D}} L_{I/V}(d). \end{aligned}$$
(6)
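
To illustrate how these losses fit together, here is a minimal PyTorch sketch of \(L_V\) (Eq. 5), the rotation factor f, and the total local loss \(L_l\) (Eq. 6); the function names are ours, and the inputs are assumed to be the (n, 128) descriptors sampled at the keypoints.

```python
import math
import torch

def variant_loss(d_A, d_V, d_I, f=1.0, M=1.0):
    """L_V of Eq. (5): stay close to the variant image, away from the invariant one."""
    pos = ((d_A - d_V) ** 2).sum(dim=1)   # squared distance to the variant image
    neg = ((d_A - d_I) ** 2).sum(dim=1)   # squared distance to the invariant image
    return torch.clamp(f * M + pos - neg, min=0).mean()

def rotation_factor(theta_I, theta_max=math.pi / 4):
    """Factor f weighting L_V for rotation changes (f = 1 for illumination changes)."""
    return min(1.0, abs(theta_I) / theta_max)

def local_loss(losses_per_descriptor):
    """Total local loss L_l of Eq. (6): average of L_I or L_V over the 4 heads."""
    return sum(losses_per_descriptor) / len(losses_per_descriptor)
```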
Fig. 3. The LISRD descriptor distance between two points is the sum of the four local descriptor distances, weighted by the similarity of the meta descriptors.

3.2 Online Selection of the Best Invariance

Given the local descriptors of the previous section, this section explores how to pick the most relevant invariance when matching images. Since it would be costly to recompute and compare the image variations for every pair of images to be matched, we propose to rely solely on the information contained in the descriptors to perform the selection. A naive approach would be to separately compute the similarity of the different local descriptors and to pick the most similar ones. However, the invariance selection benefits from more context than a single local descriptor and should be consistent across neighboring descriptors. Therefore, we propose to extract regional descriptors from the local ones and to use them to guide the invariance selection.

The local descriptors are thus aggregated over neighboring areas through a NetVLAD layer [1] to obtain a meta descriptor sharing the same kind of invariance as the underlying local descriptors, but with more context than a single local descriptor. Having similar meta descriptors thus means sharing the same level of variations. The neighboring areas are created by tiling the image into a \(c \times c\) grid and computing a meta descriptor for each tile. Hence, we get four meta descriptors per tile, which are then L2-normalized.
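
As an illustration of this aggregation, a simplified NetVLAD-style layer [1] over the local descriptors of one tile could look as follows; this is a sketch with our own parameterization (e.g. a linear soft-assignment layer), not the exact layer of the implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TileMetaDescriptor(nn.Module):
    """Aggregates the local descriptors of one tile into a 1024-d meta descriptor."""
    def __init__(self, num_clusters=8, dim=128):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)

    def forward(self, local_desc):                 # (N, dim) descriptors in the tile
        soft_assign = F.softmax(self.assign(local_desc), dim=1)            # (N, K)
        residuals = local_desc.unsqueeze(1) - self.centroids.unsqueeze(0)  # (N, K, dim)
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=0)          # (K, dim)
        vlad = F.normalize(vlad, dim=1)            # intra-cluster normalization
        return F.normalize(vlad.flatten(), dim=0)  # final L2-normalized (K*dim,) vector
```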

When matching the local descriptors of a tile, the four similarities between the meta descriptors are computed with a scalar product and we can rank the four local descriptors according to these similarities. Instead of making a hard choice by taking only the closest local descriptor, we use a soft assignment. A softmax operation is applied to the four similarities, to get four weights summing to one. These weights are then used to compute the distance between the local descriptors as shown in Fig. 3. More precisely, suppose that we want to compute the distance in descriptor space between point \(\mathbf {x}^a\) in image \(I^a\) and point \(\mathbf {x}^b\) in image \(I^b\). Point \(\mathbf {x}^a\) is associated with 4 local descriptors \(\mathbf {d}^a_{1..4}\) and 4 meta descriptors \(\mathbf {m}^a_{1..4}\) corresponding to the region where \(\mathbf {x}^a\) lies, and similarly for \(\mathbf {x}^b\). Then the final descriptor distance between \(\mathbf {x}^a\) and \(\mathbf {x}^b\) is

$$\begin{aligned} \mathrm {dist}(\mathbf {x}^a, \mathbf {x}^b) = \sum _{i=1}^4 \frac{\exp \left( (\mathbf {m}^a_i)^\intercal \mathbf {m}^b_i \right)}{\sum _{j=1}^4 \exp \left( (\mathbf {m}^a_j)^\intercal \mathbf {m}^b_j \right)}\, ||\mathbf {d}^a_i - \mathbf {d}^b_i||_2. \end{aligned}$$
(7)

Thus, the similarity of the meta descriptors acts as a weighting of the local descriptor distances and puts a stronger emphasis on a specific variance when the corresponding meta descriptors have a high similarity. Matching is then performed with this descriptor distance and can easily be refined with the ratio test [20] or mutual nearest neighbor filtering.
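
A minimal sketch of the resulting distance of Eq. (7), assuming the four local and meta descriptors of each point are stacked into tensors (the function and argument names are ours):

```python
import torch

def lisrd_distance(d_a, d_b, m_a, m_b):
    """Descriptor distance of Eq. (7) between one point in I^a and one in I^b.

    d_a, d_b: (4, 128) L2-normalized local descriptors of the two points.
    m_a, m_b: (4, 1024) L2-normalized meta descriptors of their tiles.
    """
    sim = (m_a * m_b).sum(dim=1)           # 4 meta-descriptor similarities
    weights = torch.softmax(sim, dim=0)    # soft invariance selection
    local_dist = (d_a - d_b).norm(dim=1)   # 4 local L2 distances
    return (weights * local_dist).sum()
```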

Training Loss. The 4 NetVLAD layers are trained with a weak supervision based on another instance of the triplet loss \(L_T\) between \(I^A\) and \(I^I\) with the distance defined above:

$$\begin{aligned} L_m = L_T(I^A, I^I, \mathrm {dist}) \end{aligned}$$
(8)

Thanks to this weak supervision, there is no need to explicitly supervise the meta descriptors, which would require knowing the amount of rotation and illumination change for every tile in the image. The total loss of the network is finally a combination of the local and meta descriptor losses, weighted by a factor \(\lambda \):

$$\begin{aligned} L = L_l + \lambda L_m. \end{aligned}$$
(9)

3.3 Training Details

Datasets. To train descriptors with different levels of variance in terms of rotation and illumination, datasets presenting all possible combinations of changes are needed. Control over the amount of change is also required in order to know which loss between \(L_I\) and \(L_V\) should be used for each descriptor. We use four datasets in total to accomplish this. Illumination variations are obtained through the Multi-Illumination dataset in the wild [27] and the style-transferred night images generated from the Aachen day images [31]. Both offer pairs of images with fixed viewpoint and different illuminations. Images with fixed illumination come from the MS COCO dataset [18] and the day images with optical flow from the Aachen dataset [31]. For all datasets except the latter, the images are augmented with random homographies containing translation, scaling, rotation and perspective distortion, similarly to [8]. For the day images of Aachen, the flow is used to create the correspondences and we consider that these images contain only small rotations and no major illumination changes. Overall, there is an equal distribution of images with and without illumination changes, and of rotated and non-rotated images.

Implementation Details. We describe here the details of our architecture. The backbone network, inspired by VGG16 [37], is composed of successive \(3 \times 3\) convolutional layers with channel sizes 64-64-64-64-128-128-256-256. Each conv layer is followed by a ReLU activation and batch normalization. Every two layers, a \(2 \times 2\) average pooling with stride 2 is applied to reduce the spatial resolution by 2. For an image of size \(H \times W \times 3\), the output feature map has a size of \(H/8 \times W/8 \times 256\). The local descriptor heads are all composed of the following operations: \(3 \times 3\) conv of channel size 256 - ReLU - Batch Norm - \(1 \times 1\) conv of channel size 128. The final dimension of each local descriptor map is thus \(H/8 \times W/8 \times 128\), and the concatenation of the four descriptors is 512-dimensional. The semi-dense descriptors can then be bilinearly interpolated to the locations of any keypoint. Note that in order to achieve better robustness to scale changes, one can also detect the keypoints and describe them at multiple image resolutions and aggregate the results in the original image resolution, similarly to [9] and [31]. The NetVLAD layers consist of 8 clusters of 128-dimensional descriptors, hence a meta descriptor size of 1024. We use \(c \times c = 3 \times 3\) tiles per image.
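
For reference, the backbone and descriptor heads described above could be sketched in PyTorch as follows. This is a simplified reconstruction from the text rather than the exact implementation; in particular, we assume the pooling after the last pair of convolutions is omitted so that the output resolution stays at \(H/8 \times W/8\).

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.ReLU(inplace=True), nn.BatchNorm2d(c_out))

class LISRDNet(nn.Module):
    """Shared VGG-like backbone with four 128-d semi-dense descriptor heads."""
    def __init__(self):
        super().__init__()
        channels = [3, 64, 64, 64, 64, 128, 128, 256, 256]
        layers = []
        for i in range(8):
            layers.append(conv_block(channels[i], channels[i + 1]))
            if i % 2 == 1 and i < 7:          # 2x2 average pooling every two conv layers
                layers.append(nn.AvgPool2d(2, stride=2))
        self.backbone = nn.Sequential(*layers)  # (B, 3, H, W) -> (B, 256, H/8, W/8)
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
                          nn.BatchNorm2d(256), nn.Conv2d(256, 128, 1))
            for _ in range(4)])                 # one head per variance combination

    def forward(self, x):
        feat = self.backbone(x)
        return [head(feat) for head in self.heads]  # four (B, 128, H/8, W/8) maps
```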

The network is trained on RGB images resized to \(240 \times 320\) with the following hyper-parameters: distance threshold \(T = 8\), \(\theta _{max} = \frac{\pi }{4}\), margin \(M = 1\), loss factor \(\lambda = 1\). It comprises roughly 3.7M parameters, which are optimized with the Adam solver [16] (\(\mathrm {learning\ rate} = 0.001\) and \(\beta = (0.9, 0.999)\)). In practice, the local descriptors are pre-trained first and then fine-tuned by an end-to-end training with the meta descriptors. At test time, a single forward pass on a GeForce RTX 2080 Ti with \(480 \times 640\) images takes 6ms on average.

4 Experimental Results

We present here experiments validating the relevance of our method. Section 4.2 highlights the importance of learning different invariances, validates the proposed approach with an ablation study, and shows that LISRD can be extended to other descriptors such as SIFT and Upright SIFT. LISRD is then compared to the state of the art on a benchmark homography dataset (Sect. 4.3), on a challenging dataset with diverse conditions where the presence or lack of invariance is essential (Sect. 4.4) and on a visual localization task in the real world (Sect. 4.5).

4.1 Metrics

Since we want to compare the performance of the descriptors only, all the following metrics are computed on SIFT keypoints unless stated otherwise. The metrics are computed on pairs of images resized to \(480 \times 640\) and related by a known homography. Resizing is performed by upscaling/downscaling the images so that each edge is greater than or equal to 480 and 640 respectively, and a central crop is then applied to obtain the target resolution. We keep a maximum of 1000 points among the keypoints shared between the two views, and matches are obtained after mutual nearest neighbor filtering.

Homography Estimation. We follow the procedure of [8] to compute a homography estimation score. Given a pair of images, RANSAC is used to fit a homography between the sets of matched keypoints. The score is obtained by warping the four corners of the first image with the predicted homography to get \(\hat{c}_{1\ldots 4}\) and comparing their distance to the same corners \(c_{1\ldots 4}\) warped by the ground truth homography. The homography is considered correct when the average distance is below a threshold \(\epsilon \), which is set to 3 pixels in all experiments: \(\mathrm {HEstimation} = \frac{1}{4} \sum _{i=1}^4 ||\hat{c}_i - c_i||_2 \le \epsilon \).
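
A sketch of this score using OpenCV, following the procedure described above (the function name and argument layout are ours):

```python
import numpy as np
import cv2

def homography_correct(matched_kpts1, matched_kpts2, H_gt, img_shape, eps=3.0):
    """Returns True if the RANSAC homography is within eps pixels of the ground truth.

    matched_kpts1, matched_kpts2: (n, 2) float32 arrays of matched keypoints.
    H_gt: 3x3 ground truth homography. img_shape: (height, width) of the first image.
    """
    H_pred, _ = cv2.findHomography(matched_kpts1, matched_kpts2, cv2.RANSAC)
    if H_pred is None:
        return False
    h, w = img_shape
    corners = np.float32([[0, 0], [w, 0], [0, h], [w, h]]).reshape(-1, 1, 2)
    warped_pred = cv2.perspectiveTransform(corners, H_pred)   # corners under H_pred
    warped_gt = cv2.perspectiveTransform(corners, H_gt)       # corners under H_gt
    mean_err = np.linalg.norm(warped_pred - warped_gt, axis=2).mean()
    return mean_err <= eps
```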

Precision. Precision (also known as mean matching accuracy) is the percentage of correct matches over all the predicted matches [9, 31]. We use by default a threshold of 3 pixels to consider a match to be correct.

Recall. Recall is the ratio of correctly predicted matches over the total number of ground truth matches, where a ground truth correspondence is the closest point within an error threshold of 3 pixels. A predicted match to the second closest point, even if within the error threshold, is considered incorrect.
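
For clarity, the precision and recall of a set of mutual nearest neighbor matches could be computed as in the following sketch, under our own conventions (the keypoints of the first image are assumed to be already warped into the second image by the ground truth homography):

```python
import torch

def precision_recall(desc1, desc2, warped_kpts1, kpts2, thresh=3.0):
    """Precision and recall of mutual nearest neighbor matches (see Sect. 4.1)."""
    d = torch.cdist(desc1, desc2)
    nn12, nn21 = d.argmin(dim=1), d.argmin(dim=0)
    mutual = nn21[nn12] == torch.arange(len(desc1))          # predicted matches
    pix_err = (warped_kpts1 - kpts2[nn12]).norm(dim=1)
    correct = mutual & (pix_err <= thresh)                   # correct for precision
    # Ground truth match = closest keypoint of image 2 within the threshold.
    gt_nn = torch.cdist(warped_kpts1, kpts2).argmin(dim=1)
    has_gt = (warped_kpts1 - kpts2[gt_nn]).norm(dim=1) <= thresh
    strictly_correct = correct & (nn12 == gt_nn)             # correct for recall
    precision = correct.sum().float() / max(mutual.sum().item(), 1)
    recall = strictly_correct.sum().float() / max(has_gt.sum().item(), 1)
    return precision.item(), recall.item()
```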

Fig. 4. Precision on HPatches of the 4 local descriptors. Variant ones are better when invariance is not needed (e.g. rotation for the illumination dataset).

4.2 Method Validation

Impact of the Different Invariances. One can check the validity of our approach by comparing the 4 local descriptors. We use the HPatches dataset [4], which is standard in descriptor evaluation. It is composed of 116 sequences of 5 pairs of images, with either viewpoint changes (given by a known homography) or illumination changes with fixed viewpoint. Figure 4 shows the comparison between the 4 descriptors in terms of precision. On viewpoint changes, the illumination variant descriptors are superior as the lighting is fixed in these images and they are thus more discriminative. Since HPatches contains few rotations, there is no significant difference in terms of rotation invariance and being rotation variant brings a small advantage on average. The precision on illumination changes shows that the best performing descriptors are the illumination invariant ones and that being rotation variant helps since the viewpoint is fixed. Thus there is no descriptor outperforming the others in all cases, and our hypothesis that variant descriptors are more discriminative than invariant ones is validated.

Ablation Study. To confirm the benefit of our online selection of invariance and choice of parameters, we compare LISRD on homography estimation on the HPatches dataset with other selection methods of the local descriptors as well as with variants of our approach (Table 1). Best of the 4 computes the metrics for the 4 local descriptors separately and picks the best score. Greedy computes the pairwise distances of all points for each local descriptor and greedily chooses the local descriptor with smallest distance for each pair of points. Hard assignment selects the local descriptor that maximizes the meta descriptor similarity, instead of choosing a soft assignment as in our proposed method. No tiling and \( 5 \times 5 \) tiles are variants of our method with no tiling or with \( 5 \times 5 \) tiles for the meta descriptors. Finally, Single desc is a descriptor trained with exactly the same architecture as ours, but with the 4 local descriptors concatenated and trained with invariance in both illumination and rotation.

On the full HPatches dataset, Best of the 4 corresponds to the descriptor invariant to both illumination and rotation, as both changes are present. However, our selection method can still leverage the other descriptors, for example an illumination-variant descriptor for the viewpoint part. The gap between LISRD and both Greedy and Hard assignment highlights the added value of the meta descriptors and shows that a soft assignment can better leverage the 4 descriptors at the same time. Finally, the comparison with Single desc confirms our hypothesis that disentangling the types of invariance is beneficial compared to learning a single invariant descriptor with the same number of weights.

Table 1. Ablation study on the HPatches dataset.
Fig. 5. Variants of SIFT vs. our method fusing them (LISRD-SIFT). Precision is computed on HPatches viewpoint.

Generalization to Other Descriptors. LISRD can easily be generalized to other kinds of descriptors, and not only to our proposed learned local descriptors. We demonstrate this by applying our approach to the duo of local descriptors SIFT and Upright SIFT (SIFT without rotation invariance), as presented in Fig. 1. Instead of four local descriptors, there are only two, one invariant to rotation and one variant, and similarly for the meta descriptors. Our method is evaluated against SIFT and Upright SIFT on the viewpoint part of HPatches. Indeed, this dataset contains sequences with no rotation, where Upright SIFT performs better, and others with strong rotations, where SIFT takes over. Figure 5 shows that our method can effectively leverage both SIFT and Upright SIFT and outperforms both.

4.3 Descriptor Evaluation on HPatches

This section compares the performance of LISRD against state-of-the-art local descriptors on the benchmark dataset HPatches. Since our approach requires global context from full images, we cannot run it on the patch-level dataset; we use the full sequences of images instead, similarly to [8, 9, 31]. We consider the following baselines: Root SIFT with the default Kornia implementation; HardNet [24] (trained on the PS-dataset [26]); SOSNet [39] (trained on the Liberty dataset of UBC Phototour [6]); SuperPoint (SP) [8], D2-Net [9], R2D2 [31] and GIFT [19] with the authors' implementations. Since we want to evaluate the descriptors only, SIFT keypoints are detected in the images and, for each method, we extract the local descriptors at these locations. For Root SIFT, HardNet and SOSNet, we sample \(32 \times 32\) patches at each SIFT keypoint and rotate them according to the SIFT orientation. As we want to evaluate the impact of rotation and illumination invariance only, we use single-scale implementations for all methods. Our method could however be made scale-invariant using multi-scale approaches similar to [9, 31].

Table 2. Comparison to the state of the art on HPatches. Homography estimation, precision and recall are computed for error thresholds of 3 pixels. The best score is in bold and the second best one is underlined.

The results are summarized in Table 2. Overall, LISRD ranks among the two best methods in precision and recall. The possibility to leverage rotation-variant descriptors on the fixed pairs of the illumination part, and to select the right level of lighting invariance given the amount of illumination change, probably explains our superior performance on the illumination part. Note in particular the comparison with SuperPoint, whose architecture and training procedure are very similar to ours: LISRD displays better results in all metrics, showing the gain of our approach. The weaker results in homography estimation can be explained by a limitation of our method. Since our meta descriptors have a very coarse spatial resolution (\(3 \times 3\) grid), if one of them fails to pick the right invariance, this impacts all the matches of its region. The correct matches predicted by LISRD can in that case become concentrated in a specific part of the image, which makes the homography estimation with RANSAC less accurate. This issue could be avoided with a finer tiling of the meta descriptors, but at the price of reduced global context.

4.4 Evaluation in Challenging and Cross-Modal Situations

The HPatches dataset offers a fair benchmark, but is limited to few rotations and medium illumination changes. Our approach is designed to be used in a variety of scenarios and under changing conditions, so that all our local descriptors can be leveraged. In order to evaluate our method on such a versatile task, we designed a new benchmark based on the day-night image matching (DNIM) dataset [44]. This dataset is composed of sequences of images from fixed cameras taking pictures at regular time intervals across day and night, with a total of 1722 images. For each sequence, the image with timestamp closest to noon is taken as the day reference and the image closest to midnight as the night reference. We create two benchmarks, where the images of each sequence are paired with either the day reference or the night one. We then synthetically warp the pairs with the same homography sampling scheme as in [8], with an equal distribution of homographies with and without rotation. We plan to release the homographies used in this benchmark to let other researchers compare with their own methods. Examples of images and matches for this dataset can be found in the supplementary material.

Table 3 and Fig. 6 show the evaluation against the state-of-the-art descriptors, using SuperPoint keypoints. LISRD adapts its invariance to illumination and rotation to select the most relevant descriptor, and it outperforms the other methods by a large margin in terms of both precision and recall.

Table 3. Evaluation on a use case where invariance selection matters. Homography estimation, precision and recall are computed with SuperPoint keypoints on a dataset with day-night changes and various levels of rotation. Selecting the relevant variant or invariant descriptors boosts the precision and recall of our method compared to the previous state-of-the-art methods.
Fig. 6. Precision curves on the DNIM dataset [44] augmented with rotations. LISRD leverages its variant and more discriminative descriptors whenever possible and is thus more accurate than the state-of-the-art descriptors for all pixel error thresholds.

4.5 Application to Localization in Challenging Conditions

A typical application of image matching under adverse conditions such as strong illumination changes and wide baselines is the visual localization task. We evaluate our method on the local feature challenge of CVPR 2019 based on the Aachen Day-Night dataset [34]. The goal is to localize 98 night-time query images as accurately as possible, given 20 day-time images with known camera poses for each query. As the keypoint quality is essential in this task, we compare our method with other descriptors for various types of keypoints: SIFT, SuperPoint and D2-Net multi-scale (MS). The numbers for the baseline methods are taken from the benchmark on the official website. The results in Table 4 show that our method is not limited to SIFT keypoints and can effectively improve the performance of local descriptors in challenging conditions. Note in particular the improvement over SuperPoint, which shares a similar architecture with ours.

Table 4. Visual localization performance on the Aachen Day-Night dataset [34]. We report the percentage of correctly localized queries for various distance and orientation error thresholds for SIFT, SuperPoint and D2-Net multi-scale (MS). Our method shows a good generalization when evaluated on different keypoints (KP) and can improve the original descriptor performance.

5 Conclusion

We presented a novel approach to learn local feature descriptors that adapt to multiple variations in images while remaining discriminative. We unified the learning of several local descriptors with multiple levels of invariance and of meta descriptors that leverage regional context to guide the matching of the local descriptors.

While currently restricted to illumination and rotation invariance, our framework can be generalized to more variations, though at the cost of an exponentially growing number of descriptors. A future direction of work would be to reduce the redundancy between descriptors by enforcing a stronger disentanglement of the factors of variation. Since our approach is able to enforce different levels of invariance, one could also add another head to our network to predict invariant keypoints while keeping discriminative descriptors, thus addressing the current issue in the joint learning of invariant detectors and descriptors.

Overall, this work is a first step towards disentangled descriptors. Separating the types of invariance paves the way to a full disentanglement of the factors of variation of images and could lead to flexible and interpretable local descriptors.