1 Introduction

Image registration is a vital prerequisite in many applications for remote image analysis [1,2,3,4,5,6,7,8,9]. The primary goal of image registration is to spatially align two or more different images of the same scene (i.e., the reference and sensed images) taken at different times, from different viewpoints, or by different sensors [10].

Over the years, a number of methods have been proposed to deal with the problem of automated remote image registration [11, 12]. Generally, these methods can be divided into two categories: methods based on pixel intensities and methods based on image features [10, 12]. Methods based on pixel intensities use original pixel intensities to find the matching information in the overlap area of two images, while methods based on image features seek the correspondence between two sets of features based on descriptors and/or using certain spatial constraints. Methods based on pixel intensities are suitable when the transformations between images are not so complex. These methods are sensitive to noise, illumination variance, and difference of sensor types. Besides, they also suffer from heavy computational complexities. On the contrary, methods based on image features are relatively more robust, as features are not so easy to be influenced by the changes mentioned above, which allow them to handle more complex image distortions. The features used for image registration are typical point features obtained by feature detectors such as scale-invariant feature transform (SIFT) [13], as they are more general and easier to be extracted. In this paper, we focus on point-based feature registration techniques.

The point-based matching problem is typically solved by using a two-step strategy [14]. In the first stage, a set of putative matches is constructed based on a similarity constraint requiring that two feature points can be matched if they have similar descriptors. In this process, the scale of possible matches is largely reduced; however, the putative matches established by only local feature descriptors are unreliable due to viewpoint changes, repeated patterns, as well as occlusions. Therefore, in the second step, the geometric constraint is further adopted to remove the false matches and the matching task boils down to determining the correctness of each putative match.

Although quite a lot of point-based registration methods have been investigated in the past decades, it is still a challenging task to design an efficient strategy when dealing with real-world remote sensing image registration problems. First, the image transformation model is often unknown in advance and varies with respect to different types of remote sensing scenes, making it difficult to design a general algorithm. Second, a large number of false matches will exist due to low overlaps, repeated patterns, or local distortions of remote sensing images, requiring high robustness of the matching model. Third, high computational complexity is often incurred, especially in case of non-rigid transformation, limiting its applicability in handling real-time problems or matching high-resolution remote sensing images.

In view of the above problems, we propose an efficient method for robust feature matching of remote sensing images based on Gaussian field criterion. The mismatch removal is formulated as a robust regression problem, and we customize the Gaussian field criterion to solve it to achieve a robust estimation. We model the image transformation with both linear and nonlinear functions to handle various matching problems, and a regularization term is imposed in the nonlinear case to ensure well-posedness. In addition, we apply a sparse approximation to the transformation to significantly reduce the computational complexity.

Our contribution in this paper includes three aspects. Firstly, we introduce the Gaussian field criterion and analyze its robustness both in theory and in experiment, which provides support for why it can resist outliers. Secondly, we customize the Gaussian field criterion to address the feature matching problem for remote sensing image registration, which can efficiently remove false matches from a set of putative matches with even up to \(80\%\) outliers. Thirdly, we adopt both linear and nonlinear transformation model together with a sparse approximation, which can deal with more general matching problem in low computational complexity, especially in the nonlinear situation.

The rest of the paper is organized as follows: Section 2 overviews the previous work related to image registration, especially in the context of remote sensing image registration. In Sect. 3, we propose the Gaussian field criterion and customize it to solving the linear and nonlinear feature matching problem. Subsequently, we present the fast implementation of the proposed method and discuss the computational complexity. In Sect. 4, we evaluate the performance of our method on various natural as well as remote sensing image pairs with comparisons to several state-of-the-art approaches, followed by some concluding remarks in Sect. 5.

2 Related work

Image registration benefits many research fields, including computer vision [15,16,17], pattern recognition [18,19,20,21], medical imaging [22, 23], and remote sensing [24,25,26,27,28]. Here we briefly overview two major categories of registration methods (i.e., methods based on intensities and methods based on features) that are most relevant to our approach, particularly in the context of remote sensing registration.

2.1 Methods based on pixel intensities

Methods based on pixel intensities typically use windows of certain size for the correspondence estimation without detecting salient objects in images, such as features. The most representative methods that fall in this category are correlation-based methods [29], Fourier methods [30], and mutual information methods [31].

Classical correlation-based methods, such as cross-correlation [29], compute the similarities of windows in two images. They deal directly with pixel intensities, without any structural analysis. There have been some of the correlation-based methods using maxima of wavelet coefficients in remote sensing image registration as in [32]. Correlation-based methods are often used in real-time applications because of their easy implementation in hardware. However, these methods have some obvious disadvantages when the images in windows suffer from complex transformations or contain relatively smooth areas without prominent details.

Fourier methods exploit the Fourier representation of images in the frequency domain. One representative is the phase correlation method based on the Fourier shift theorem, which was originally proposed to deal with translated images and later extended to account for rotation and scaling [30]. The applications to remote sensing are described in [33], which has shown that the images can be aligned using a Fourier–Mellin transform together with the phase correlation. Compared to the correlation-based methods, the Fourier methods have the advantages of computational efficiency and have shown strong robustness against varying imaging conditions and frequency-dependent noise.

Mutual information, which measures the statistical dependency between two sets of data, comes from the information theory. Mutual information methods take such information as the metric to maximize the dependency between the given images. These methods are particularly suitable for multimodal registration; for example, in remote sensing applications the images are often captured by different modalities of sensors to exploit complementary information [34]. Rather than operating directly on the image intensities, a mutual information method that operates on the extracted features such as points of the area borders has also been developed [31]. In addition, the mutual information methods are also frequently adopted in medical imaging to compare the anatomical and functional images of the patient’s body, leading to a diagnosis accordingly [31].

2.2 Methods based on image features

The second category of image registration methods relies on features, namely visually salient structures in images. It is usually assumed in these methods that the features extracted can be represented as a set of spatial points, called control points in the literature [12].

A popular strategy for solving the matching problem involves two steps [14]: (i) computing a set of putative correspondences and (ii) then removing the outliers via geometrical constraints. Putative correspondence instances are obtained in the first step by pruning the set of all possible point correspondences. This scenario is achieved by computing feature descriptors at the points and eliminating the matches between points whose descriptors are excessively dissimilar. Lowe [13] proposed the SIFT descriptor with a distance ratio method that compares the ratio between the nearest and next-nearest neighbors against a predefined threshold to filter out unstable matches. Guo and Cao [35] proposed a triangle constraint, which can produce better putative correspondences in terms of quantity and accuracy compared with the distance ratio in [13]. Pele and Michael [36] applied the earth mover’s distance to replace the Euclidean distance in [13] to measure the similarity between descriptors and improve the matching accuracy. Li et al. [37] further refined the descriptor of SIFT to overcome the difference in the gradient intensity and orientation between remote image pairs. In addition, Hu et al. [38] proposed the local selection of a suitable descriptor for each feature point instead of employing a global descriptor during putative correspondence construction. A cascade scheme has been suggested to prevent the loss of true matches, which can significantly enhance the correspondence number [39, 40]. Although there have been various sophisticated approaches for putative match construction, the use of only local appearance features will inevitably result in a lot of false matches. In the second step, robust estimators based on geometrical constraints are used to detect and remove the outliers.

To efficiently remove mismatches, the traditional resampling methods such as random sample consensus (RANSAC) [41] and its variants [42, 43] have achieved great success in many situations. However, they rely on a predefined parametric model, which become less efficient when the underlying image transformation is non-rigid. In addition, they also tend to severely degrade if the outlier proportion becomes large. To address these issues, several nonparametric interpolation methods have recently been introduced, such as identifying correspondence function (ICF) [44], graph shift (GS) [45], vector field consensus (VFC) [14] and locally linear transforming (LLT) [10]. These methods usually interpolate the underlying image transformation with kernel functions using unsupervised learning and are able to handle complex non-rigid deformation. Specifically, by using support vector regression, the ICF method learns a correspondence function pair mapping feature points across two images and then rejects the outliers by checking whether they are consistent with the estimated correspondence functions. The GS method constructs an affinity graph for the correspondences based on the spectral matching method, and the maximal clique of the graph is viewed as spatially coherent correspondences. The VFC method learns a smooth field to fit the potential inliers, while the LLT method recovers a smoothness transformation according to a local linear constraint, and both can resist a great quantity of outliers. Specifically, in the remote sensing community, Li et al. [25] proposed a robust method based on support-line voting and affine-invariant ratios, which has shown large improvements compared with RANSAC. Yao et al. [46] introduced a Harris corner matching method based on the quasi-homography transform (QHT) and self-adaptive window, where the QHT can reduce the search range effectively for match candidates. In addition, Yu et al. [47] presented a method based on LLT and a manifold regularization (MR) [48] for feature matching of unmanned aerial vehicle (UAV) images.

The feature matching can also formulated in terms of a correspondence matrix between feature points together with a parametric, or nonparametric, geometric constraint. Specifically, Boughorbel et al. [49] introduced the Gaussian fields into 3D rigid shape registration, which was later generalized to the non-rigid setting in [50]. Based on this concept [51], further promoted the matching performance by constructing a context-aware representation for assignment initialization. During the manuscript preparing, the authors in [51] introduced a mismatch removal method named Gaussian field consensus [52] for nonparametric feature matching based on a locally linear constraint similar to LLT [10]. In this paper, we formulate the parametric and nonparametric models in a general framework based on the Gaussian field criterion for remote sensing image matching. It is robust and general and is able to handle both linear and nonlinear distortions in low computational complexity.

3 Method

This section describes our method for robust feature matching. To this end, we first create a set of putative matches by considering all possible matches and pruning matches whose feature descriptors are sufficiently different. Fortunately, this can be achieved by using existing well-designed local feature descriptors such as SIFT [13]. Next, we will introduce a robust estimator and customize it to further remove false matches from the putative set based on a global geometric constraint.

3.1 Gaussian field criterion

Many problems in applied mathematics, such as the heat equation, vortex methods, tomography and nonparametric statistics, involve the Gauss transform [53]

$$\begin{aligned} G_\sigma \mathbf {f}(\mathbf {x})=\int _{\mathcal {H}}\mathrm{e}^{-\frac{\Vert \mathbf {x}-\mathbf {y}\Vert ^2}{\sigma ^2}}\mathbf {f}(\mathbf {y}){\text {d}}\mathbf {y}, \quad (\sigma >0) \end{aligned}$$
(1)

of a function \(\mathbf {f}\) defined on \({\mathcal {H}}\subset \mathbb {R}^D\) with \(\sigma \) being a range parameter. In practical problem, the \(G_\sigma \mathbf {f}\) is usually discretized for numerical computation. Specifically, given the values of \(\mathbf {f}\) at a set of points \(\{\mathbf {s}_j\}_{j=1}^N\subset \mathbb {R}^D\), the integral (1) could be approximated by the following discrete Gauss transform

$$\begin{aligned} G(\mathbf {x})=\sum _{j=1}^Nq_j\mathrm{e}^{-\frac{\Vert \mathbf {x}-\mathbf {s}_j\Vert ^2}{\sigma ^2}}, \end{aligned}$$
(2)

where \(q_j\) depending on \(\mathbf {f}(\mathbf {s}_j)\) is a coefficient. Equation (2) is also called Gaussian field due to sources of strengths \(\{q_j\}_{j=1}^N\) at the point set \(\{\mathbf {s}_j\}_{j=1}^N\), evaluated at the “target” \(\mathbf {x}\).

The Gaussian field is a sum of Gaussian distances which is robust to noise and outliers. (We will discuss this property later.) It has been applied to the shape registration problem and demonstrated promising results [49, 50]. In the following, we will customize it to solve the feature matching problem and show its robustness both in theory and in experiment.

3.2 Problem formulation

Suppose we have obtained a set of putative feature matches \(S = \{(\mathbf {x}_i, \mathbf {y}_i)\}_{i=1}^N\) with \(\mathbf {x}_i\) and \(\mathbf {y}_i\) being spatial positions of feature points extracted from two images. The putative set may be contaminated by some unknown false matches, and our goal is to distinguish those outliers to establish accurate matches. Typically, the underlying inliers obey a global geometric constraint; for example, there exists a spatial transformation \(\mathbf {f}\) satisfying that \(\forall i\in \mathbb {N}_N, \mathbf {y}_i = \mathbf {f}(\mathbf {x}_i)\) if \((\mathbf {x}_i, \mathbf {y}_i)\) is an inlier. Thus, the matching problem boils down to seeking the solution of \(\mathbf {f}\) and a robust estimation is desirable due to the existence of outliers.

Fig. 1
figure 1

The penalty curves of \(L_2\) and Gaussian distances

Intuitively, considering the noiseless case, the optimal solution \(\mathbf {f}\) is the one leading to the maximum point-to-point overlap between \(\{\mathbf {f}(\mathbf {x}_i)\}_{i=1}^N\) and \(\{\mathbf {y}_i\}_{i=1}^N\). This can be formulated by using a Boolean operator that “counts” overlapping point number under the transformation \(\mathbf {f}\):

$$\begin{aligned} {\mathcal {E}}(\mathbf {f}) = \sum _{i=1}^N\delta (d(\mathbf {f}(\mathbf {x}_i), \mathbf {y}_i)),\,\,\delta (t)=\left\{ \begin{array}{ll} 1, &{} t=0\\ 0, &{} t\ne 0 \end{array}\right. , \end{aligned}$$
(3)

where \(d(\mathbf {x},\mathbf {y})\) is a distance metric such as Euclidean distance. Equation (3) is a combinational optimization problem which is difficult to obtain the optimal solution. Rather than using the discontinuous indicator function \(\delta \), we consider the Gaussian field in Eq. (2) and obtain the Gaussian field criterion:

$$\begin{aligned} G(\mathbf {f}) = -\sum _{i=1}^N\mathrm{e}^{-\frac{d^2(\mathbf {f}(\mathbf {x}_i), \mathbf {y}_i)}{\sigma ^2}}. \end{aligned}$$
(4)

Now we consider the robustness of the Gaussian field criterion (3). On the one hand, the point position noise can be dealt with by tuning the range parameter \(\sigma \) to the noise variance. On the other hand, the Gaussian distance is inherently robust to outliers, which can be seen from penalty curves of \(L_2\) and Gaussian distance in Fig. 1. Specifically, the \(L_2\) loss, i.e., \(L_2(\mathbf {f}) = \sum _{i=1}^Nd^2(\mathbf {f}(\mathbf {x}_i), \mathbf {y}_i)\), has quadratic penalty curve and resists assigning large distance \(d^2(\mathbf {f}(\mathbf {x}_i), \mathbf {y}_i)\) to any matches and hence tends to be biased by outliers. In contrast, the Gaussian distance could be approximately seen as a truncated \(L_2\) loss, which can bear large distance without paying a too high penalty. In addition, the Gaussian distance is computationally efficient, since it can be simplified into its corresponding logarithmic form.

We further give some intuition on the robustness through the experiment in Fig. 2. The two point sets are related by a rotation transformation, and we are required to seek the optimal rotation angle \(\theta \). From the results, we see that the \(L_2\) loss works well if the data samples do not involve outliers; however, its global optimum becomes steadily worse as the amount of outliers increases. In contrast, the Gaussian field criterion can always produce a global optimum near the ground truth \(\theta =\pi /4\) even half of the data are outliers.Footnote 1

Fig. 2
figure 2

Robustness comparison between the \(L_2\) and Gaussian distances. Left: data samples containing 100 inliers (denoted in blue) with Gaussian noise and 100 random outliers (denoted in red), where the “\(\circ \)” and “\(+\)”, respectively, denote the point sets \(\{\mathbf {x}_i\}_{i=1}^N\) and \(\{\mathbf {y}_i\}_{i=1}^N\), and the spatial transformation is a rotation (i.e., \(\mathbf {y}_i = R(\theta )\mathbf {x}_i\)) with \(\theta =\pi /4\) as the ground truth. Middle and right: the curves of \(L_2\) and Gaussian distances with respect to the rotation angle \(\theta \). Initially, the test data contain only 100 inliers, and then we vary it by always adding 20 new outliers until 100 outliers

To solve the Gaussian field criterion in Eq. (4), we need to model the transformation \(\mathbf {f}\). As can be seen from Eq. (4), the Gaussian field criterion is quite general, which does not rely on any specific transformation model. Next, we model the transformation with homography and non-rigid model for linear and nonlinear matching, respectively.

3.3 Matching with linear transformation model

For image pairs related by a linear transformation model, we consider the general projection transformation which is characterized by a \(3 \times 3\) homography matrix \(\mathbf {H}\). Specifically, we use the homogeneous coordinate for each feature point, e.g., \(\mathbf {x}_i = (\mathbf {x}_i^u, \mathbf {x}_i^v, 1)^{\mathrm{T}}\) and \(\mathbf {y}_i = (\mathbf {y}_i^u, \mathbf {y}_i^v, 1)^{\mathrm{T}}\). Thus, the transformation has the following form:

$$\begin{aligned} \mathbf {y}_i = \mathbf {H}\mathbf {x}_i. \end{aligned}$$
(5)

Due to the use of homogeneous coordinate, the 3D vectors \(\mathbf {y}_i\) and \(\mathbf {H}\mathbf {x}_i\) are not equal, which have the same direction but may differ in magnitude by a nonzero scale factor. Thus, the equation can be expressed in terms of vector cross-product, i.e., \(\mathbf {y}_i\times \mathbf {H}\mathbf {x}_i = {\mathbf {0}}\). By denoting \(\mathbf {h}^j\) the jth row of \(\mathbf {H}\), we obtain the following equation:

$$\begin{aligned} \mathbf {y}_i\times \mathbf {H}\mathbf {x}_i = \mathbf {y}_i\times \left( \begin{array}{c} \mathbf {h}^1\\ \mathbf {h}^2\\ \mathbf {h}^3 \end{array} \right) \mathbf {x}_i = \left( \begin{array}{c} \mathbf {y}_i^v\mathbf {h}^3\mathbf {x}_i-\mathbf {h}^2\mathbf {x}_i\\ \mathbf {h}^1\mathbf {x}_i-\mathbf {y}_i^u\mathbf {h}^3\mathbf {x}_i\\ \mathbf {y}_i^u\mathbf {h}^2\mathbf {x}_i-\mathbf {y}_i^v\mathbf {h}^1\mathbf {x}_i \end{array} \right) . \end{aligned}$$
(6)

Therefore, the equation \(\mathbf {y}_i\times \mathbf {H}\mathbf {x}_i = {\mathbf {0}}\) has the follow form:

$$\begin{aligned} \left( \begin{array}{c} \mathbf {y}_i^v\mathbf {h}^3\mathbf {x}_i-\mathbf {h}^2\mathbf {x}_i\\ \mathbf {h}^1\mathbf {x}_i-\mathbf {y}_i^u\mathbf {h}^3\mathbf {x}_i\\ \mathbf {y}_i^u\mathbf {h}^2\mathbf {x}_i-\mathbf {y}_i^v\mathbf {h}^1\mathbf {x}_i \end{array} \right) =\left[ \begin{array}{ccc} {\mathbf {0}}^{\mathrm{T}}&{}-\mathbf {x}_i^{\mathrm{T}}&{}\mathbf {y}_i^v\mathbf {x}_i^{\mathrm{T}}\\ \mathbf {x}_i^{\mathrm{T}}&{}{\mathbf {0}}^{\mathrm{T}}&{}-\mathbf {y}_i^u\mathbf {x}_i^{\mathrm{T}}\\ -\mathbf {y}_i^v\mathbf {x}_i^{\mathrm{T}}&{}\mathbf {y}_i^u\mathbf {x}_i^{\mathrm{T}}&{}{\mathbf {0}}^{\mathrm{T}} \end{array} \right] \mathbf {h}= {\mathbf {0}}, \end{aligned}$$
(7)

where \(\mathbf {h}=(\mathbf {h}^1, \mathbf {h}^2, \mathbf {h}^3)^{\mathrm{T}}\) is a nine-element vector made up of the entries of \(\mathbf {H}\) in row-major order.

Note that Eq. (7) contains three equations, only two of which are linearly independent. Thus, a point correspondence produces only two equations; here, we utilize the former two and obtain the following equation:

$$\begin{aligned} \mathbf {W}_i\mathbf {h}= {\mathbf {0}}, \end{aligned}$$
(8)

where the coefficient matrix \(\mathbf {W}_i\) is defined as:

$$\begin{aligned} \mathbf {W}_i=\left[ \begin{array}{ccccccccc} \mathbf {x}_i^u &{} \mathbf {x}_i^v &{} 1 &{} 0 &{} 0 &{} 0 &{} -\mathbf {y}_i^u\mathbf {x}_i^u &{} -\mathbf {y}_i^u\mathbf {x}_i^v &{} -\mathbf {y}_i^u\\ 0 &{} 0 &{} 0 &{} \mathbf {x}_i^u &{} \mathbf {x}_i^v &{} 1 &{} -\mathbf {y}_i^v\mathbf {x}_i^u &{} -\mathbf {y}_i^v\mathbf {x}_i^v &{} -\mathbf {y}_i^v \end{array} \right] . \end{aligned}$$
(9)

Therefore, the Gaussian field criterion in Eq. (4) for linear matching becomes:

$$\begin{aligned} G(\mathbf {h}) = -\sum _{i=1}^N\mathrm{e}^{-\frac{\Vert \mathbf {W}_i\mathbf {h}\Vert ^2}{\sigma ^2}}. \end{aligned}$$
(10)

Clearly, the optimal solution of the above objective function is \(\mathbf {h}={\mathbf {0}}\), and it is of no interest to us. Actually, the homography \(\mathbf {H}\) is a homogeneous matrix, consisting of nine elements, but it has only eight degrees of freedom. Without loss of generality, we fix the last element of \(\mathbf {h}\) to 1. By defining \({\widetilde{\mathbf {h}}}\) as an eight-element vector (i.e., the first eight elements of the original \(\mathbf {h}\)) and \({\widetilde{\mathbf {W}}}_i\) as follows:

$$\begin{aligned} {\widetilde{\mathbf {W}}}_i=\left[ \begin{array}{cccccccc} \mathbf {x}_i^u &{} \mathbf {x}_i^v &{} 1 &{} 0 &{} 0 &{} 0 &{} -\mathbf {y}_i^u\mathbf {x}_i^u &{} -\mathbf {y}_i^u\mathbf {x}_i^v\\ 0 &{} 0 &{} 0 &{} \mathbf {x}_i^u &{} \mathbf {x}_i^v &{} 1 &{} -\mathbf {y}_i^v\mathbf {x}_i^u &{} -\mathbf {y}_i^v\mathbf {x}_i^v \end{array} \right] , \end{aligned}$$
(11)

we have the following equation:

$$\begin{aligned} {\widetilde{\mathbf {W}}}_i{\widetilde{\mathbf {h}}} = \mathbf {y}_i. \end{aligned}$$
(12)

Thus, the Gaussian field criterion becomes:

$$\begin{aligned} G({\widetilde{\mathbf {h}}}) = -\sum _{i=1}^N\mathrm{e}^{-\frac{\Vert \mathbf {y}_i-{\widetilde{\mathbf {W}}}_i{\widetilde{\mathbf {h}}}\Vert ^2}{\sigma ^2}}. \end{aligned}$$
(13)

The above objective function is always continuously differentiable with respect to the homography parameter \({\widetilde{\mathbf {h}}}\), and its derivative has the following form:

$$\begin{aligned} \frac{\partial G({\widetilde{\mathbf {h}}})}{\partial {\widetilde{\mathbf {h}}}} = \sum _{i=1}^N\frac{2({\widetilde{\mathbf {W}}}_i^{\mathrm{T}}{\widetilde{\mathbf {W}}}_i{\widetilde{\mathbf {h}}}-{\widetilde{\mathbf {W}}}_i^{\mathrm{T}}\mathbf {y}_i)}{\sigma ^2} \mathrm{e}^{-\frac{\Vert \mathbf {y}_i-{\widetilde{\mathbf {W}}}_i{\widetilde{\mathbf {h}}}\Vert ^2}{\sigma ^2}}. \end{aligned}$$
(14)

By using the derivative in Eq. (14), we adopt gradient-based numerical optimization techniques such as the quasi-Newton method to solve the homography parameter \({\widetilde{\mathbf {h}}}\). In addition, we apply the deterministic annealing technique [54] on the range parameter \(\sigma \) to improve the convergence, where \(\sigma \) is initialized with a large value and gradually reduced by \(\sigma \rightarrow \gamma \sigma \) with \(\gamma \) being the annealing rate.

The correctness of a putative match could be determined by checking whether it is consistent with \(\mathbf {H}\). We summarize the proposed Gaussian field criterion (GFC) method for linear matching in Algorithm 1.

figure f

3.4 Matching with nonlinear transformation model

We next consider the matching problem undergoing nonlinear transformation. We utilize a general nonparametric model to handle complex (e.g., non-rigid) transformations. Specifically, the transformation \(\mathbf {f}\) is modeled by requiring it to lie within a functional space \({\mathcal {H}}\), namely reproducing kernel Hilbert space (RKHS) [55]. We define \({\mathcal {H}}\) by a diagonal decomposable kernel \(\varGamma \): \(\varGamma (\mathbf {x}_i, \mathbf {x}_j) = \mathrm{e}^{-\beta \Vert \mathbf {x}_i-\mathbf {x}_j\Vert ^2}\cdot \mathbf {I}\); then, \(\mathbf {f}\) has the following form:

$$\begin{aligned} \mathbf {f}(\mathbf {x}) = \sum _{i=1}^N\varGamma (\mathbf {x}, \mathbf {x}_i)\mathbf {c}_i \end{aligned}$$
(15)

with \(\{\mathbf {c}_i\}_{i=1}^N\) being the coefficients.

The non-rigid transformation will lead to ill-posedness in Eq. (4), as the solution of \(\mathbf {f}\) is not unique. To ensure well-posedness, we introduce a regularization term to control the complexity of \(\mathbf {f}\) and the Gaussian field criterion for nonlinear matching then becomes:

$$\begin{aligned} G(\mathbf {f}) = -\sum _{i=1}^N\mathrm{e}^{-\frac{d^2(\mathbf {f}(\mathbf {x}_i), \mathbf {y}_i)}{\sigma ^2}}+\lambda \Vert \mathbf {f}\Vert _{\mathcal {H}}^2, \end{aligned}$$
(16)

where \(\lambda >0\) controls the trade-off between the first data fidelity term and the second regularization term and \(\Vert \cdot \Vert _{\mathcal {H}}\) is a functional norm which can be defined by an inner product, e.g., \(\Vert \mathbf {f}\Vert _{\mathcal {H}}^2 = \langle \mathbf {f}, \mathbf {f}\rangle _{\mathcal {H}}\).

To optimize the objective function in Eq. (16), we rewrite it in the matrix form:

$$\begin{aligned} {\textstyle {G(\mathbf {C}) = -\sum _{i=1}^N\mathrm{e}^{-\frac{\Vert \mathbf {y}_i^{\mathrm{T}}-\varvec{\Gamma }_{i,\cdot }\mathbf {C}\Vert ^2}{\sigma ^2}}+\lambda \cdot \text {tr}(\mathbf {C}^{\mathrm{T}}\varvec{\Gamma }\mathbf {C}),}} \end{aligned}$$
(17)

where \(\varvec{\Gamma }\in \mathbb {R}^{N\times N}\) with \(\varvec{\Gamma }_{ij} = \mathrm{e}^{-\beta \Vert \mathbf {x}_i-\mathbf {x}_j\Vert ^2}\) is the so-called Gram matrix, \(\mathbf {C}= (\mathbf {c}_1, \ldots , \mathbf {c}_N)^{\mathrm{T}}\), and tr\((\cdot )\) is the trace.

The objective function in Eq. (17) is always continuously differentiable with respect to \(\mathbf {C}\), and its derivative has the following form:

$$\begin{aligned} \frac{\partial G(\mathbf {C})}{\partial \mathbf {C}}=&-\sum _{i=1}^N\frac{2\varvec{\Gamma }_{i,\cdot }^{\mathrm{T}}(\varvec{\Gamma }_{i,\cdot }\mathbf {C}-\mathbf {y}_i^{\mathrm{T}})}{\sigma ^2}\mathrm{e}^{-\frac{\Vert \mathbf {y}_i^{\mathrm{T}}-\varvec{\Gamma }_{i,\cdot }\mathbf {C}\Vert ^2}{\sigma ^2}}\nonumber \\&+2\lambda \varvec{\Gamma }\mathbf {C}. \end{aligned}$$
(18)

Similar to the linear matching algorithm, by using the derivative in Eq. (18) and some numerical optimization techniques together with the deterministic annealing technique, we can solve the transformation parameter \(\mathbf {C}\).

Fast Implementation. The Gram matrix \(\varvec{\Gamma }\) is of size \(N\times N\), and hence, it will typically cost \(O(N^3)\) to solve \(\mathbf {C}\). This will pose significant computational burden when dealing with real-time or large-scale problems. To address this issue, we apply a sparse approximation to the transformation \(\mathbf {f}\) based on the subset of regressors method [56, 57]. Specifically, we randomly pick a subset of M points \(\{\tilde{\mathbf {x}}_i\}_{i=1}^M\) and only let them have nonzero coefficients in Eq. (15); thus, \(\mathbf {f}\) becomes:

$$\begin{aligned} \mathbf {f}^s(\mathbf {x}) = \sum _{i=1}^M\varGamma (\mathbf {x}, {\tilde{\mathbf {x}}}_i)\mathbf {c}_i^s. \end{aligned}$$
(19)

It has been shown in [56] that selecting subset randomly performs no worse than those more time-consuming and sophisticated methods. Accordingly, the derivative becomes:

$$\begin{aligned} \frac{\partial G(\mathbf {C}^s)}{\partial \mathbf {C}^s} =&-\sum _{i=1}^N\frac{2\mathbf {U}_{i,\cdot }^{\mathrm{T}}(\mathbf {U}_{i,\cdot }\mathbf {C}^s-\mathbf {y}_i^{\mathrm{T}})}{\sigma ^2}\mathrm{e}^{-\frac{\Vert \mathbf {y}_i^{\mathrm{T}}-\mathbf {U}_{i,\cdot }\mathbf {C}^s\Vert ^2}{\sigma ^2}}\nonumber \\&+2\lambda \varvec{\Gamma }^s\mathbf {C}^s, \end{aligned}$$
(20)

where the definitions of \(\varvec{\Gamma }^s\in \mathbb {R}^{M\times M}\) and \(\mathbf {C}^s\in \mathbb {R}^{M\times 2}\) are similar to \(\varvec{\Gamma }\) and \(\mathbf {C}\), \(\mathbf {U}\in \mathbb {R}^{N\times M}\) with \(\mathbf {U}_{ij} = \mathrm{e}^{-\beta \Vert \mathbf {x}_i-{\tilde{\mathbf {x}}}_j\Vert ^2}\).

The correctness of a putative match could be determined by checking whether it is consistent with \(\mathbf {f}\). We summarize the proposed Gaussian field criterion (GFC) method for nonlinear matching in Algorithm 2.

figure g

3.5 Computational complexity

The major computational burden is to compute the derivative in Eq. (14) or (18) and solve the transformation parameter using a numerical technique such as the quasi-Newton method. For linear matching, they cost both about O(N) and hence, the total time complexity is about O(N). The space complexity is about O(N) due to the requirement of storing the N coefficient matrix \(\mathbf {W}_i\).

For the nonlinear matching in Algorithm 2, without using the sparse approximation, the calculation of the Gram matrix \(\varvec{\Gamma }\) in Line 1 costs \(O(N^2)\). The computation of derivative in Eq. (18) and optimization of transformation with quasi-Newton method cost, respectively, about \(O(N^2)\) and \(O(N^3)\), and hence, the total time complexity is about \(O(N^3)\). The space complexity is about \(O(N^2)\) due to the requirement of storing the Gram matrix \(\varvec{\Gamma }\). By using the sparse approximation, the calculations of matrices \(\varvec{\Gamma }^s\) and \(\mathbf {U}\) in Line 1 cost, respectively, \(O(M^2)\) and O(MN), while the time complexities of computing the derivative and solving \(\mathbf {C}\), respectively, reduce to about O(MN) and \(O(M^3)\). Therefore, the total time complexity reduces to about \(O(MN+M^3)\). As M is a constant and \(M\ll N\), we can simply write the time complexity as O(N). Clearly, the space complexity is also reduced to O(N). In practice, the sparse approximation is always encouraged to achieve fast implementations, especially in real-time matching problems.

In conclusion, our Gaussian field criterions for linear and nonlinear matching both have linear time complexity and linear space complexity, which is significant for handling real-time problems or matching high-resolution images with plenty of feature points.

3.6 Implementation details

The performance of feature matching typically depends on the coordinate system which is used to express the feature points; here, we use data normalization as that in [50] to control this. More specifically, we apply a linear scaling to each point set, so that it has zero mean and unit variance.

For the linear matching problem, there is only one parameter in our GFC such as \(\gamma \), the deterministic annealing rate used for improving the convergence of transformation estimation. We fix it to 0.93 throughout the experiments. For the nonlinear matching problem, three additional parameters are involved such as \(\beta \), \(\lambda \), and M. Parameters \(\beta \) and \(\lambda \) both control the smoothness of the transformation \(\mathbf {f}\), where the former determines the width of interaction range between feature points through the transformation and the latter constrains the complexity of the transformation. Parameter M determines the number of basis used for sparse approximation to construct the transformation. We fix them throughout the experiments as follows: \(\beta = 0.01\), \(\lambda = 0.1\), and \(M = 30\).

4 Experimental results

In this section, we test our GFC for feature matching on various real image data including both natural images and remote sensing images and compare it with five other state-of-the-art feature matching methods including RANSAC [41], ICF [44], GS [45], VFC [14], LPM [58], and MR [47]. All the experiments are conducted on a laptop with 2.4 GHz Intel Core CPU, 8 GB memory, and MATLAB code.

4.1 Datasets and settings

To perform a comprehensive evaluation of our method, we conduct experiments on both natural and remote sensing image datasets. In the following, we first describe the test data and then introduce the evaluation criteria.

  • Natural image dataset The publicly available VGG dataset [59] is adopted for linear matching. This dataset contains eight groups of images, and each group consists of five image pairs, leading to 40 image pairs in total. The image pairs suffer from different geometric and photometric transformations including viewpoint change, scale change, rotation, light change, image blur, and JPEG compression. In addition, these image pairs are related always by homography due to that they are either of planar scenes or acquired in a fixed camera position. The ground truth homographies are provided in the dataset. Some example image pairs are shown in Fig. 3. To make the dataset more challenging, we construct two putative sets from each image pair by setting different distance ratio thresholds to measure the similarity between SIFT descriptors. For nonlinear matching, we collect several typical image pairs with relatively complex deformations, including wide baseline pairs and image pairs involving non-rigid deformations.

  • Remote sensing image dataset For linear matching, we collect several color–infrared aerial photograph image pairs and SPOT image pairs, which involve translation, rotation, scaling, and so on. For nonlinear matching, we collect a set of 50 image pairs including SPOT images, SAR images, panchromatic aerial photographs, as well as UAV images. These image pairs usually involve ground relief variations and imaging viewpoint changes, where the image transformations do not exactly satisfy a linear model such as homography. Therefore, a more complex nonlinear model is necessary to generate accurate matching results.

Fig. 3
figure 3

Example image pairs in the VGG dataset [59]

We use the open-source VLFeat toolbox [60] to determine the putative SIFT matches. The putative matches are then established based on the similarity between SIFT descriptors. The matching performance is characterized by precision and recall, where the precision is defined as the ratio between the identified correct match number and the whole preserved match number, and the recall is defined as the ratio between the identified correct match number and the whole correct match number.

To establish the ground truth feature matches in each image pair, for linear matching, we first acquire the ground truth homography and then an overlap error is used to determine the match correctness as that in [14]. The ground truth homography in the remote sensing image dataset is calculated from manually selected correspondences together with least squares. For nonlinear matching, we manually check the match correctness of each putative match for all image pairs.

Fig. 4
figure 4

Statistics of initial inlier ratio (left), precision–recall pair (middle), and run-time (right) of different methods on the VGG dataset [59]

Fig. 5
figure 5

Qualitative matching results of our GFC with linear model on NotreDame, Church, DogCat, Peacock, and T-shirt, respectively. The inlier ratios of 5 image pairs are 75.27, 54.76, 81.74, 83.63, and \(49.64\%\). The head and tail of each arrow in the motion field correspond to the positions of feature points in two images (blue = true positive, black = true negative, green = false negative, and red = false positive). For visibility, in the image pairs, at most 50 randomly selected matches are presented, and the true negatives are not shown. Best viewed in color

4.2 Results on natural image pairs

We first evaluate the matching performance of our GFC with linear model on the VGG dataset [59]. The precision–recall (p–r) statistics of different methods are reported in the middle of Fig. 4. Each scatter point corresponds to a p–r pair of a certain method on a certain image pair. The average match number in the putative sets is about 773.9. From the results, we see that all the six methods work quite well on this dataset. In general, our GFC has a better trade-off compared to the other methods and the p–r pairs of our method almost concentrate on the top right corner. More specifically, the average p–r pair achieves about \((97.81, 97.57\%)\). We also report the run-time statistics of different methods, as shown in the right figure of Fig. 4. We see that LPM is the most efficient one and our GFC ranks middle with an average run-time of about 84 milliseconds.

Table 1 Quantitative comparison of different methods on the image pairs in Fig. 5

We next provide some intuitionistic matching results of our GFC with nonlinear model on several typical image pairs with relatively complex deformations, as shown in Fig. 5. The first two pairs are wide baseline images, and the rest three pairs involve non-rigid deformations. For each group of results, the left image pair schematically shows the matching result, while the right field presents the matching correctness of each putative match. From the results, we again see that our GFC is robust and accurate and very few putative matches are misjudged.

In addition, we also provide quantitative comparison of different methods on these image pairs, as shown in Table 1. As RANSAC relies on a parametric model which is not suitable for non-rigid matching, we do not report its results in this table. From the results, we see that ICF and GS usually cannot obtain good precision and recall simultaneously, while LPM cannot work well in case of low initial inlier percentage. VFC and MR in general can produce satisfying results. However, our GFC clearly has the best precision–recall trade-off on all the five image pairs.

Fig. 6
figure 6

Qualitative matching results of our GFC with linear model on six typical remote sensing image pairs involving linear transformation. The inlier ratios of 6 image pairs are 90.15, 84.14, 90.12, 87.97, 85.31, and \(89.98\%\). The head and tail of each arrow in the motion field correspond to the positions of feature points in two images (blue = true positive, black = true negative, green = false negative and red = false positive). For visibility, in the image pairs, at most 50 randomly selected matches are presented and the true negatives are not shown. Best viewed in color

Table 2 Quantitative comparison of different methods on the image pairs in Fig. 6

4.3 Results on remote sensing image pairs

Now we test our GFC on the remote sensing image datasets. First, we test the linear model on several typical image pairs involving only linear transformation, as shown in Fig. 6. From the results, we see that our GFC is able to always produce satisfying matching performance and almost all the inliers and outliers are accurately identified. We also provide the results of the other six comparison methods in Table 2. Again, all the seven methods’ results are quite satisfying, which is mainly due to the simplicity of linear matching and the large initial inlier ratios in the image pairs.

Fig. 7
figure 7

Statistics of initial inlier ratio (left), precision–recall pair (middle), and run-time (right) of different methods on the remote sensing dataset

We further report the matching performance of our GFC with nonlinear model on a remote sensing image dataset with 50 image pairs. The statistic results are shown in Fig. 7. To make the dataset more challenging, we use the nearest neighbor strategy to construct the putative set in each image pair. For example, for each feature point in one image, we search its nearest neighbor in the other image in descriptor space under Euclidean distance, and then these two points are considered as a putative match. This procedure results in very low initial inlier ratios in the putative sets, as shown in the left figure of Fig. 7. The average initial inlier ratio is merely \(35.99\%\). Here we do not provide the results of LPM either, as it cannot obtain satisfying results in case of such low inlier ratio. The performance comparison of the rest five methods is shown in the middle figure of Fig. 7. Clearly, our GFC has the best results, in terms of both precision and recall. That is to say, our GFC is much more robust in case of large outliers. We also report the run-time statistics of different methods in the right figure of Fig. 7. Our GFC again ranks middle with an average run-time of about 0.678 s. Moreover, we also have tested the performance of our GFC without sparse approximation on this dataset and obtained an average p–r pair of about \((95.05, 96.82\%)\). However, the average run-time exceeds 2000 s. This demonstrates that the sparse approximation can significantly reduce the computational complexity without sacrificing the matching accuracy.

Fig. 8
figure 8

Average precisions and recalls on the remote sensing dataset with respect to different values of \(\beta \) (left), \(\lambda \) (middle), and M (right)

Finally, we test the influence of parameter settings of our GFC on the remote sensing image dataset. There are mainly three parameters such as \(\beta \), \(\lambda \), and M. We vary them and compute the average p–r pairs on the whole dataset. The statistic results are shown in Fig. 8. In each subfigure, we vary one parameter and fix the other parameter to their “optimal values.”Footnote 2 From the results, we see that the average precision and recall both achieve their largest values at \(\beta = 0.01\) and \(\lambda = 0.1\). For parameter M, the average precision and recall do not change too much when \(M\ge 15\). Nevertheless, \(M=15\) is more preferable for computational efficiency.

5 Conclusion

Within this paper, we introduce a robust estimator named Gaussian field criterion (GFC) for image feature matching. A key characteristic of our GFC is that it inherently has robustness to outliers and hence can efficiently remove false matches from a set of putative matches. The image transformation is modeled by both linear and nonlinear functions, and a sparse approximation is applied to the nonlinear case to significantly reduce the computational complexity. The qualitative and quantitative experiments on publicly available datasets, including both natural and remote sensing image pairs, demonstrate that our GFC is quite efficient for mismatch removal. More specifically, it is able to produce superior matching performance over several state-of-the-arts, especially when the image scenes suffer from complex non-rigid deformations or lots of outliers. This will be of great benefit to registering remote sensing images with low overlaps, repeated patterns, local distortions or low quality.