
1 Introduction

Establishing sparse feature correspondences between images is a fundamental problem in many computer vision tasks, such as 3D information inference [1, 13, 14, 24], Structure-from-Motion [22, 36, 40], robot sensing [25] and image retrieval [3, 15]. Given two groups of keypoints, the main steps of image matching are: i) computing a high dimensional feature descriptor for each keypoint and ii) establishing correspondences between them, for example by finding the nearest neighbor in the feature space. In this pipeline, the feature descriptor is a key factor in improving the final matching result.

In the past two decades, researchers in the community have proposed many excellent handcrafted descriptors [2, 5, 7, 26], as well as modern learned descriptors [10, 16, 29, 32, 43, 44]. Despite their great success, these methods have their own limitations. Firstly, they still suffer from mismatches in challenging situations such as wide baselines and small scene overlap. Secondly, as observed by previous studies [18, 19], the performance of different descriptors may vary a lot on the same image. A keypoint might be correctly matched by one descriptor but mismatched by another. This difference implies that a single descriptor hardly applies to all scenarios; rather, different descriptors are complementary and can cooperate. Finally, while feature similarity receives much attention, the useful spatial structure information [20, 42, 45] is overlooked in these methods, which makes the matching result sensitive to local ambiguity. How to integrate multiple features as well as spatial structure constraints remains an open problem.

In this paper, an image matching method based on multi-feature embedding is proposed. Different from existing methods that use a single feature, it first extracts multiple feature descriptors at each keypoint. Then, a new representation that encodes both multi-feature similarity and keypoint structure is computed via subspace embedding, a widely used methodology [23]. The subspace has two properties. On the one hand, if the inter-image similarity between two points measured by multiple descriptors is high, they will be close to each other in the embedded subspace. On the other hand, if two points on the same image are spatially close to each other, the distance between them in the new subspace will also be small. As a result, the structure of each point set is preserved and similar points from different point sets are pulled closer. This task is formulated as a Laplacian Embedding problem, which can be solved via eigendecomposition. Vectors in the computed subspace are treated as the new descriptors for the keypoints. In this way, both multi-feature and spatial structure information are utilized by the proposed method.

To summarize, the proposed method distinguishes itself from existing methods in the following aspects. (1) It generates a novel descriptor for each keypoint by computing a subspace, which is equivalent to solving a Laplacian Embedding problem. (2) The method is a general framework which fuses multiple off-the-shelf descriptors instead of using only one of them. In this way, the embedded descriptor can adapt to more challenging scenarios. (3) The subspace also preserves the spatial structure of the keypoints, which makes the algorithm robust to local appearance ambiguity.

2 Related Work

2.1 Feature Description Methods

As the most fundamental part of image matching, the performance of the feature descriptor is very important. The most famous handcrafted descriptor is SIFT [26], which is obtained from histograms of local image gradient orientations around keypoints, and it is still widely used today. Since then, many different handcrafted feature descriptors have been designed to adapt to different requirements, such as faster speed [5, 33], smaller memory footprint [7, 34], and greater robustness [2].

In recent years, feature descriptors based on neural networks have developed rapidly and generally achieve better matching results than handcrafted descriptors. Some methods [27, 37, 43, 52] take image patches as input and directly compute feature vector representations of these patches. HardNet [29] builds on the L2-Net [43] architecture and proposes a triplet network: by introducing a margin and encouraging the negative pair distance to be greater than the positive pair distance plus the margin, it forces the network to focus on the negative samples that are hardest to distinguish. SOSNet [44] achieves better results by combining a first-order similarity loss (similar to the triplet loss) with a second-order regularization term between positive matching pairs. Interestingly, one method [53] replaces the hard margin of HardNet with a soft margin, arguing that the traditional hard margin is not flexible enough and proposing a dynamic soft margin to overcome this problem.

Another kind of end-to-end method takes whole images as input and computes dense features to obtain more reliable matching results. Aiming at the many multi-view geometry problems in computer vision, SuperPoint [10] proposes a self-supervised training framework for keypoint detection and description, which outputs highly abstract features of the input image. Subsequent end-to-end methods also compute dense feature representations. D2-Net [11] proposes a “detect-and-describe” approach, which uses a single CNN for joint feature detection and description, so each image yields a single 3D feature tensor. The goal of R2D2 [32] is to learn repeatable and reliable keypoints and powerful descriptors; its outputs are dense descriptors, a reliability map and a repeatability map.

Existing deep learning methods all need ground truth correspondences for training, and acquiring such correspondences is costly in some cases. Therefore, CAPS [47] proposes a method that directly uses the relative camera pose between image pairs as supervision, thus greatly reducing the training cost. However, dense features tend to occupy more memory and their computation is time-consuming.

2.2 Feature Matching Methods

Some researchers try to improve the results of image matching from another perspective. The most basic feature matching relationship is usually obtained by finding mutual nearest neighbor features in feature space. SIFT [26] proposes a ratio test on top of mutual nearest neighbor searching and greatly improves the matching accuracy. Some methods [12, 28, 38, 39, 41, 42] use a Gaussian mixture model for image matching, where each keypoint in the first image is treated as a Gaussian component, and the probability of each keypoint in the second image being assigned to each component is modeled. Other methods [48] treat matching as a classification problem, in which the keypoints in one image are regarded as cluster centers while the keypoints in the other image are the points to be assigned. Some multi-image matching methods [17, 56] can promote the matching accuracy of image pairs to some extent by establishing cycle-consistency constraints across multiple images.

The matching correspondence can also be recovered from the feature similarity matrix, which is very common in graph matching [50, 55] and multi-graph matching [8, 31, 46, 49]. A spectral method [21] proposes to find the correspondences from the feature similarity matrix, and this spectral technique is also used in many subsequent graph matching methods. Besides feature similarity, some methods [20, 35, 42, 45] also consider the spatial structure of keypoints within the same image and obtain better matching results by combining feature and spatial information, but they only take a single feature into account. Recently, SuperGlue [35] uses a neural network to find correspondences, fully exploiting the relationships among both cross-image and self-image keypoints; this idea is also reflected in this paper.

The above image matching methods cannot solve the inherent problem of features: a good correspondence ultimately depends on a good feature descriptor. Since no single feature can be guaranteed to work well in all scenes, fusing multiple existing features, as done in this paper, is an attractive alternative.

2.3 Feature Fusion Methods

There are also some matching methods based on multi-feature fusion. Hu et al. [19] proposed selecting the best feature for each keypoint in homography space for matching, but each keypoint still essentially uses a single descriptor. Yu et al. proposed a multi-feature fusion matching method [51], but the features they fuse are geometric, gray-level, color and texture features. LISRD [30] proposes a method to separate invariance properties from local descriptors. Its framework learns multiple local descriptors, which may suggest that it is a multi-feature fusion method; in fact, LISRD does not fuse features.

The goal of this paper is to design a multi-feature fusion method in which each feature has its own contribution, and different features contribute differently for different keypoints. In this way, different features complement each other effectively, and the image matching accuracy can be improved.

3 The Proposed Method

Given two images \(I_{1}\) and \(I_{2}\), we detect two groups of keypoints \(X_{1}\in R^{m\times 2}\) and \(Y_{2}\in R^{n\times 2}\) on each image. For each keypoint, K kinds of descriptors are extracted, which are denoted as \(P_{1}^{k}\in R^{m\times d_{k}}\) and \(Q_{2}^{k}\in R^{n\times d_{k}}\), where \(k=1,\ldots,K\) indexes the descriptors and \(d_{k}\) is the dimension of the k-th descriptor.

Different from existing methods which use a single descriptor, we want to fuse multiple features and impose a structural constraint at the same time. To this end, we compute a new representation \(E_{1}=\left\{ e_{1}^{1},e_{2}^{1},...,e_{m}^{1} \right\} ^{T}\in R^{m\times c}\) and \(E_{2}=\left\{ e_{1}^{2},e_{2}^{2},...,e_{n}^{2} \right\} ^{T}\in R^{n\times c}\) of the original keypoints by projecting all the keypoint information into a common subspace. The superscript 1 or 2 indicates the first or the second image, and c is the dimension of the subspace feature. \(E_{1}\) and \(E_{2}\) can be computed by minimizing the following objective function [45]:

$$\begin{aligned} \min \sum _{l=1,2}\sum _{i,j}\left\| e_{i}^{l}-e_{j}^{l} \right\| ^{2}S_{l,ij}+ \sum _{i,j}\left\| e_{i}^{1}-e_{j}^{2} \right\| ^{2}U_{ij}. \end{aligned}$$
(1)

The first term in Eq. (1) encodes intra-image spatial information, where \(S_{l,ij}\) represents the spatial similarity between keypoints i and j in image l. \(S_{1,ij}\) and \(S_{2,ij}\) can be computed by the following kernel function \(K_{s}\left( \cdot ,\cdot \right) \):

$$\begin{aligned} S_{1,ij}&=K_{s}\left( x_{i},x_{j} \right) =e^{-\frac{\left\| x_{i}-x_{j} \right\| ^{2}}{2\sigma ^{2}}},\ x_{i},x_{j}\in X_{1},\end{aligned}$$
(2a)
$$\begin{aligned} S_{2,ij}&=K_{s}\left( y_{i},y_{j} \right) =e^{-\frac{\left\| y_{i}-y_{j} \right\| ^{2}}{2\sigma ^{2}}},\ y_{i},y_{j}\in Y_{2}. \end{aligned}$$
(2b)

According to Eq. (2), if two points on the same image are spatially close to each other, the corresponding entry of \(S_{l}\) will be large. To minimize Eq. (1), their distance in the subspace should therefore be small.
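To make Eq. (2) concrete, a minimal NumPy sketch of the spatial kernel is given below; the function name, the default value of \(\sigma \) and the vectorized form are illustrative choices rather than a prescribed implementation.

```python
import numpy as np

def spatial_similarity(pts, sigma=1.0):
    """Gaussian spatial similarity of Eq. (2) for one image.

    pts: (n, 2) array of keypoint coordinates.
    Returns an (n, n) matrix S with S[i, j] = exp(-||p_i - p_j||^2 / (2 sigma^2)).
    """
    diff = pts[:, None, :] - pts[None, :, :]   # (n, n, 2) pairwise differences
    dist2 = np.sum(diff ** 2, axis=-1)         # squared Euclidean distances
    return np.exp(-dist2 / (2.0 * sigma ** 2))
```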

The second term in Eq. (1) encodes inter-image feature information, in which \(U_{ij}\) is the feature similarity between \(x_i\) and \(y_j\) defined by multiple descriptors. \(U_{ij}\) is computed as follows:

$$\begin{aligned} U_{ij}=\frac{1}{K}\sum _{k=1}^{K}U^k_{ij}, \end{aligned}$$
(3)

where

$$\begin{aligned} U^k_{ij}=K_{u}\left( p_{i}^{k},q_{j}^{k} \right) =e^{-\frac{\left\| p_{i}^{k}-q_{j}^{k} \right\| ^{2}}{2\beta ^{2}}},\ p_{i}^{k}\in P_{1}^{k}\ \mathbf{and} \ q_{j}^{k} \in Q_{2}^{k} \end{aligned}$$
(4)

is a kernel function representing the feature similarity between \(x_i\) and \(y_j\) under the k-th descriptor. As we can see from Eq. (3) and Eq. (4), the feature information in Eq. (1) is jointly defined by multiple descriptors. If two points from different images are similar to each other, the corresponding similarity \(U_{ij}\) will be large. To minimize Eq. (1), their distance in the subspace should be small as well. As a result, the subspace defined by Eq. (1) has the following properties: similar points from different images, as measured by multiple descriptors, are pulled closer, and the relative structure of points from the same image is preserved.
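Analogously, Eqs. (3) and (4) can be sketched as the average of the K per-descriptor Gaussian kernels; again the function name and the default value of \(\beta \) are illustrative assumptions.

```python
import numpy as np

def feature_similarity(P_list, Q_list, beta=1.0):
    """Multi-descriptor feature similarity of Eqs. (3)-(4).

    P_list, Q_list: lists of K arrays with shapes (m, d_k) and (n, d_k).
    Returns the (m, n) matrix U averaging the K per-descriptor kernels.
    """
    U = 0.0
    for P, Q in zip(P_list, Q_list):
        # squared Euclidean distances between all descriptor pairs
        dist2 = np.sum(P**2, 1)[:, None] + np.sum(Q**2, 1)[None, :] - 2.0 * P @ Q.T
        U = U + np.exp(-np.maximum(dist2, 0.0) / (2.0 * beta ** 2))
    return U / len(P_list)
```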

The feature information and spatial information can be expressed in a compact matrix form, which is shown in Eq. (5).

$$\begin{aligned} A=\begin{bmatrix} S_{1} &{} U\\ U^{T} &{} S_{2} \end{bmatrix}. \end{aligned}$$
(5)

Here A is a \(2 \times 2\) block matrix. Its diagonal blocks \(S_1 \in R^{m \times m}\) and \(S_2 \in R^{n \times n}\) are the spatial information matrices computed from Eq. (2). Its off-diagonal block \(U \in R^{m \times n}\) is the feature information matrix computed from Eq. (3). Denoting \(E=\left[ E_{1}^{T},E_{2}^{T} \right]^{T}\in R^{(m+n)\times c} \) and applying some simple derivation, Eq. (1) can be rewritten in the following form:

$$\begin{aligned} \min \ tr(E^{T}LE), \end{aligned}$$
(6)

where \(L=D-A\) is the Laplacian matrix of A and D is a diagonal matrix whose non-zero elements are \(D_{ii}=\sum _{j}A_{ij}\). This is the Laplacian Embedding problem [6]. To avoid a trivial solution, a scale constraint is imposed and the optimal embedding features E are obtained by solving

$$\begin{aligned} \min _{E^{T}DE=I}tr(E^{T}LE), \end{aligned}$$
(7)

Equation (7) is a generalized eigenvalue problem, whose solution is given by the eigenvectors corresponding to the c smallest non-zero eigenvalues.

After computing E from Eq. (7), we have a new c-dimensional representation for each keypoint in \(X_1\) and \(Y_2\). This new descriptor not only fuses multi-feature information, but also encodes spatial structure constraint. We then match the keypoints by searching for mutual nearest neighbors in the subspace.
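Putting the pieces together, the pipeline of Eqs. (5)-(7) followed by mutual nearest neighbor matching can be sketched as below. The sketch reuses the two helper functions above, solves the generalized eigenproblem with SciPy and skips the trivial constant eigenvector; the function name and default parameter values are illustrative assumptions (only \(c=55\) follows the setting used in the experiments).

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def embed_and_match(X1, Y2, P_list, Q_list, c=55, sigma=1.0, beta=1.0):
    """Joint Laplacian embedding (Eq. 7) followed by mutual nearest neighbor matching."""
    m = len(X1)
    S1 = spatial_similarity(X1, sigma)
    S2 = spatial_similarity(Y2, sigma)
    U = feature_similarity(P_list, Q_list, beta)
    A = np.block([[S1, U], [U.T, S2]])     # joint affinity matrix, Eq. (5)
    D = np.diag(A.sum(axis=1))
    L = D - A                              # graph Laplacian
    # generalized eigenproblem L v = lambda D v; eigenvalues returned in ascending order
    _, vecs = eigh(L, D)
    E = vecs[:, 1:c + 1]                   # drop the trivial constant eigenvector
    E1, E2 = E[:m], E[m:]
    # mutual nearest neighbors in the embedded subspace
    dist = cdist(E1, E2)
    nn12, nn21 = dist.argmin(axis=1), dist.argmin(axis=0)
    return [(i, j) for i, j in enumerate(nn12) if nn21[j] == i]
```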

4 Experiments

4.1 Evaluation Metrics

The experiments are performed on a machine equipped with a Xeon E5-2620 2.1 GHz CPU, 64 GB RAM and one GTX 1080Ti GPU. Following SuperPoint [10], D2-Net [11], UCN [9] and CAPS [47], the proposed method is evaluated in terms of Mean Matching Accuracy (MMA) and several downstream tasks such as homography estimation accuracy and relative pose estimation accuracy.

Mean Matching Accuracy (MMA). For a certain keypoint, if the distance between its estimated matching position and the ground truth matching position is smaller than a threshold, the match is deemed correct. The Mean Matching Accuracy (MMA) is the ratio of correct correspondences over the whole dataset. Higher MMA is preferable.
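Under the HPatches setting, where the ground truth is a homography, the per-pair accuracy can be sketched as follows; the helper name and the default threshold are assumptions and not prescribed here.

```python
import numpy as np
import cv2

def matching_accuracy(kpts1, kpts2, matches, H_gt, thresh=3.0):
    """Fraction of matches whose error w.r.t. the ground truth homography is below thresh (pixels)."""
    if not matches:
        return 0.0
    src = np.float32([kpts1[i] for i, _ in matches]).reshape(-1, 1, 2)
    dst = np.float32([kpts2[j] for _, j in matches])
    proj = cv2.perspectiveTransform(src, H_gt).reshape(-1, 2)  # ground truth positions
    err = np.linalg.norm(proj - dst, axis=1)
    return float((err < thresh).mean())
```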

Homography Estimation Accuracy. A homography is a \(3\times 3\) matrix which plays an important role in a variety of areas such as panorama generation and planar surface detection. It can be estimated from correspondences between two views. Specifically, we use the OpenCV function to estimate the homography matrix and compare it with the ground truth. Following SuperPoint [10], the four-corner accuracy is used to check whether the estimated homography is correct. That is, the four corners of an image are warped by the estimated homography and the ground truth homography, respectively. If the average distance between the two sets of warped corners is less than a threshold \(\varepsilon \), the estimated homography is considered correct.
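A sketch of this four-corner check is given below, using cv2.findHomography with RANSAC as one possible choice of OpenCV estimator; the specific call, its parameters and the function name are illustrative assumptions.

```python
import numpy as np
import cv2

def homography_correct(pts1, pts2, H_gt, img_shape, eps=3.0):
    """Estimate a homography from matched points and apply the four-corner check."""
    H_est, _ = cv2.findHomography(np.float32(pts1), np.float32(pts2), cv2.RANSAC)
    if H_est is None:
        return False
    h, w = img_shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    warp_est = cv2.perspectiveTransform(corners, H_est)
    warp_gt = cv2.perspectiveTransform(corners, H_gt)
    err = np.linalg.norm(warp_est - warp_gt, axis=2).mean()  # average corner distance
    return err < eps
```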

Relative Pose Estimation Accuracy. Another application of image feature matching is 3D reconstruction, which requires estimating the relative pose between two cameras. The pose parameters, i.e. the rotation matrix \(R\in R^{3\times 3}\) and the translation vector \(t \in R^{3\times 1}\), can also be computed from correspondences. For rotation, we compute the angle error between the estimate and the ground truth. For translation, we only compute the directional error with respect to the ground truth because its magnitude is determined only up to an unknown scale factor. The estimate is deemed correct if the error is below a threshold.
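The two angular errors can be computed with a standard formulation such as the sketch below; the exact implementation is not prescribed here, and the function name is illustrative.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Angular errors (in degrees) of the rotation and of the translation direction."""
    cos_r = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    rot_err = np.degrees(np.arccos(np.clip(cos_r, -1.0, 1.0)))
    cos_t = np.dot(t_est.ravel(), t_gt.ravel()) / (
        np.linalg.norm(t_est) * np.linalg.norm(t_gt) + 1e-12)
    trans_err = np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))
    return rot_err, trans_err
```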

4.2 Datasets

Similar to CAPS [47], the experiments are carried out on two datasets: HPatches [4] and COLMAP [54].

HPatches is used to evaluate MMA and homography estimation accuracy. It consists of 116 scenes, among which 57 scenes involve illumination changes and the other 59 involve viewpoint changes. Each scene contains 6 images, from which 5 pairs are formed by matching the first image to the others, leading to a total of 580 image pairs. For every image pair, a homography is provided as the ground truth. SuperPoint [10] is applied to detect at most 1000 keypoints on each image, except for the i_dc scene, because SuperPoint is not able to handle its resolution.

Table 1. The MMA on the HPatches dataset. The pixel threshold ranges from 1 to 10. Best results are in bold.
Table 2. Average homography estimation accuracy on HPatches under different thresholds \(\varepsilon \). Best results are in bold.

COLMAP is used for the evaluation of relative pose estimation accuracy. It contains four scenes: gerrard, graham, person and south, with 100, 560, 330 and 128 images respectively. These images, captured by different users and collected from the Internet, present great challenges such as viewpoint changes, scaling and occlusion. The camera parameters estimated by a standard SfM pipeline are provided as ground truth. Similar to [47], we divide all the image pairs in this dataset into three groups according to the viewing angle difference: easy \(\left[ 0,15^{\circ } \right] \), moderate \(\left[ 15^{\circ } ,30^{\circ }\right] \) and hard \(\left[ 30^{\circ } ,60^{\circ }\right] \). In each group, we randomly select 200 image pairs, resulting in a total of 600 image pairs for testing. SuperPoint [10] is also applied to detect at most 1000 keypoints on each image.

The proposed method is compared with several state-of-the-art descriptors including SIFT [26], RootSIFT [2], HardNet [29], SOSNet [44], SoftMargin [53] and SuperPoint [10]. The first two are famous handcrafted descriptors while the last four are outstanding deep learned descriptors. Our method is also compared with the OS [45] matching algorithm, which is closely related to our method but considers only a single descriptor. To evaluate the performance of each descriptor itself, we do not apply the ratio test, and all matches are established by simply finding mutual nearest neighbors.

Table 3. Average relative pose (rotation/translation) estimation accuracy on the COLMAP dataset. The angle error threshold is set to \(5^{\circ }\). Best results are in bold.
Fig. 1. The mean matching accuracy (MMA) for different thresholds on HPatches. From left to right: results on the whole dataset, the illumination subset and the viewpoint subset.

4.3 Ablation Studies

Existing descriptors are either handcrafted or deep learned. Here we test 4 different combinations of them and analyze the results. 2-Hand uses two handcrafted descriptors SIFT and RootSIFT. 2-Depth uses two of the outstanding deep descriptors, HardNet and SOSNet. 4-Descs uses a mixture of both handcrafted and deep learned descriptors. Two of them are from 2-Hand and the others from 2-Depth. 4-Depth uses four deep learned descriptors, including HardNet, SOSNet, SoftMargin and SuperPoint.

The results on MMA, homography estimation accuracy and relative pose estimation accuracy are shown in Table 1, Table 2 and Table 3, respectively. As we can see from the data, when using the same number of descriptors (for example 2-Hand and 2-Depth), deep learned descriptors outperform traditional handcrafted ones. The data also reveal that using more descriptors improves the results (compare 2-Depth and 4-Depth). However, we also find that 4-Descs performs worse than 4-Depth and 2-Depth. This indicates that not all descriptors contribute to the results; descriptors that are not good enough might even make the results worse. Based on the above observations, we use 4-Depth in the following experiments.

We also test the role of spatial structure information. Following [45], we replace the diagonal blocks of A in Eq. (5) with identity matrices for the 4-Depth method. In this case, spatial structure information is removed and only feature information is considered. The results, denoted as F-Only, are also shown in Table 1, Table 2 and Table 3. As we can see, F-Only is significantly lower than 4-Depth, showing that integrating spatial structure information is beneficial.

4.4 Mean Matching Accuracy Evaluation

Figure 1 shows the result of MMA under different thresholds (from 1 to 10). We plot the statistics on the whole dataset (Overall), as well as two subsets (Illumination and Viewpoint). HardNet (OS) and SOSNet (OS) represent the matching results of [45] when using HardNet and SOSNet, respectively.

The proposed method achieves the best performance on the whole dataset and the viewpoint subset. It also returns the best results on the illumination subset when the threshold is less than 6. HardNet and SOSNet are the top two compared methods. The two handcrafted descriptors, SIFT and RootSIFT, fall behind other learned descriptors on the viewpoint subset but receive good results on the illumination subset. Figure 2 gives some visualization results of the correspondences on some example image pairs. It shows that our method returns more correct and fewer incorrect matches.

Fig. 2. Visualization of the correspondences on three typical image pairs. Green and red lines indicate correct and incorrect matches, respectively. (Color figure online)

4.5 Results on Downstream Tasks

Table 4 shows the average homography estimation accuracy on HPatches for different methods under three thresholds. The proposed 4-Depth method achieves the best result for \(\varepsilon =3\) and \(\varepsilon =5\), and ranks second for \(\varepsilon =1\). HardNet (OS) and SOSNet (OS) outperform HardNet and SOSNet, respectively, by involving the spatial constraint. There is a remarkable improvement of our method over [45], showing that using multiple descriptors is beneficial.

Table 5 shows the average relative pose estimation accuracy on the COLMAP dataset for different methods. The angle error threshold is set to \(5^{\circ }\). For all methods, the score drops from easy to hard. Our 4-Depth method achieves the best results except for translation on the hard subset. HardNet (OS) and SOSNet (OS) defeat HardNet and SOSNet, and rank the top two among the remaining compared methods.

To test other simple feature fusing strategies, we use the intersection, union and voting of the four deep features in Table 4 and Table 5, denoted as In., Un. and Vo., respectively. For Vo., a correspondence is required to be found by at least three of the four descriptors. As we can see, Un. contains too many false matches, so its results are generally not as good as ours. Vo. shows a much higher score in Table 5, but it is worth noting that this increase in accuracy comes at the cost of sacrificing many correct matches. To show this, we report the number of Average Matching Points (A.M.P) and the Forecast Loss Rate (F.L.R) in both tables. Vo. sacrifices nearly \(50\%\) and \(82\%\) of the matches in Table 4 and Table 5, while the corresponding statistics for In. are \(71\%\) and \(93\%\). Losing too many correspondences may lead to failure when estimating the geometric models due to insufficient data. The Forecast Loss Rate of Vo. and In. ranges from \(1\%\) up to \(33\%\). As a result, although Vo. and In. can achieve higher accuracy in easy situations, they are infeasible in harder situations due to the high failure rate.
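For clarity, the voting strategy Vo. can be expressed as the short sketch below, which assumes one mutual nearest neighbor match list per descriptor; the function name and the input format are illustrative.

```python
from collections import Counter

def vote_matches(match_lists, min_votes=3):
    """Keep correspondences found by at least min_votes of the per-descriptor matchers."""
    counts = Counter(pair for matches in match_lists for pair in matches)
    return [pair for pair, votes in counts.items() if votes >= min_votes]
```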

Table 4. Average homography estimation accuracy on HPatches under different thresholds \(\varepsilon \). A.M.P is the number of Average Matching Points and F.L.R is the Forecast Loss Rate. The best and second best results are in bold and blue.
Table 5. Average relative pose (rotation/translation) estimation accuracy. The angle error threshold is set to \(5^{\circ }\). A.M.P is the number of Average Matching Points. F.e, F.m and F.h are the Forecast Loss Rates for the easy, moderate and hard subsets. The best and second best results are in bold and blue.
Fig. 3. The average matching accuracy and running time for different embedded feature dimensions c. As a trade-off, we set the embedding feature dimension to \(c=55\) in all the experiments.

4.6 Parameters and Efficiency

In our method, the dimension c of the subspace is an important parameter. To investigate its influence, an experiment is carried out on the v_grace scene of HPatches, in which c increases from 5 to 400 with a step size of 5. The average matching accuracy and running time are shown in Fig. 3. The results show that the matching accuracy of our method increases as the embedded dimension becomes higher, but at the cost of more running time. In particular, the running time keeps growing while the average matching accuracy remains stable once the feature dimension c exceeds 60. As a trade-off, we set the embedding feature dimension to \(c=55\) in all the experiments.

5 Conclusions

This paper proposes a novel image matching method based on multi-feature fusion and subspace embedding. The basic idea is to compute a subspace, in which intra-image structures of the keypoints are preserved and inter-image multi-feature similarities are encoded. This goal is achieved by solving a Laplacian Embedding problem. The proposed method is tested on a variety of scenes. Both the mean matching accuracy and performance on downstream tasks such as homography estimation and relative pose estimation are evaluated. Results show that the proposed method achieves the best performance when combining four deep descriptors: HardNet, SOSNet, SoftMargin and SuperPoint.