1 Introduction

In recent years, there has been growing interest in 3D reconstruction [1, 2], image matching [3, 4], and image registration [5]. Efficient descriptor learning has emerged as a crucial research area within these computer vision applications. The dominant approach to generating local feature descriptors is to encode image patches into representative vectors.

Previous studies primarily focused on handcrafted descriptors, which require sophisticated engineering knowledge and mathematical derivation [6]. Traditional image-matching pipelines based on handcrafted descriptors such as SIFT [8] and SURF [9] have been successful in various applications. However, handcrafted descriptors still face challenges, such as the omission of important image details. Recently, learning-based descriptors [7, 10, 11] have demonstrated better scalability, robustness, and discriminability, achieving higher matching performance than handcrafted descriptors. These studies show that deep learning can greatly improve the effectiveness of descriptors. Most of these methods use a single-scale network to learn local features. However, a single-scale network cannot learn scale-specific features and therefore fails to integrate richer information. Center-surround multi-scale feature extraction addresses this by cropping sub-patches of different sizes from the central region of each input patch and extracting features from each sub-patch to form multi-scale features. Deep learning-based methods that adopt this strategy, such as CS L2Net [10] and Patch-NetVLAD [12], introduce richer information into the learned features. However, the simple concatenation and fixed-weight fusion commonly used in these methods cannot be optimized during backpropagation.

Descriptor networks trained with the triplet loss have proved effective under challenging conditions. The triplet loss requires the distance between descriptors of negative samples to exceed that between descriptors of positive samples. Numerous studies have refined this loss function, for example by adding regularization terms [13,14,15,16,17]. Despite their effectiveness, prior works often overlook the characteristics of the descriptor distance measure: the relative distance between matching pairs can also be constrained in the regularization term. Additionally, while previous works commonly use a similarity matrix to simplify calculations, they neglect that the symmetry of this matrix can itself serve as a training constraint.

To tackle these challenges, we propose AFSRNet, a center-surround multi-scale convolutional neural network designed for extracting local descriptors. AFSRNet takes three branches as its input, each dedicated to a distinct center-surround scale of the input patch. These branches generate features of the same size using dilated convolution. In addition to the network architecture, we address descriptor learning challenges by introducing a regularization term based on the symmetry of the descriptor similarity matrix. Unlike previous works, our approach considers not only the distance between descriptors of negative and positive samples, but also the relative distance between matching pairs in the regularization term. Moreover, we perform the regularization by enforcing symmetry of the similarity matrix, which further reduces the computation required for descriptor learning. In summary, the main contributions of our work are the following:

  1) We introduce a novel neural network based on center-surround Adaptive Multi-Scale Feature Fusion (AMSF) for learning local descriptors, which can be optimized during backpropagation.

  2) We propose a new regularization term called Symmetric Regularization (SR), which constrains the similarity matrix of the descriptors to improve the robustness of the learned descriptors.

  3) By combining SR with triplet loss, our descriptor-learning network can be trained to achieve state-of-the-art results for local descriptor learning.

2 Related work

Research on local descriptors has gradually moved from handcrafted designs to learning-based ones. Since this paper addresses descriptor learning, we briefly review descriptor learning methods, ranging from traditional methods to recently proposed learning-based methods, as well as various applications of multi-scale feature fusion.

2.1 Descriptor learning networks

Handcrafted descriptors for local patches have primarily focused on mathematical derivations, such as gradient filters and intensity comparisons. SIFT [8], widely considered the most commonly used real-valued descriptor, computes smoothed histograms from the gradient field of the image patch. SURF [9] utilizes a box filter to extract image gradient information and employs a multi-scale filter in the scale space, replacing the downsampling operation in SIFT to improve computational efficiency. This modification makes many practical applications feasible. While SURF has demonstrated the significance of multi-scale features in traditional descriptor design, such features are rarely incorporated into deep learning-based methods.

HardNet [11] employs a simple but effective strategy known as hard negative mining, which highlights the significance of proper sampling. This sampling strategy selects the most indistinguishable image descriptors to train the network, resulting in improved robustness in patch matching. Consequently, many models, including our own, have adopted this sampling strategy since its introduction. Furthermore, Liang et al. [18] proposed a multi-level aggregation technique that facilitates descriptor learning across the entire network. Each level of their network extracts a feature vector after feature fusion, and the final descriptor concatenates these outputs. CRBNet [19] is built upon L2Net and starts by using strided convolutional layers for down-sampling. It introduces a residual learning framework for enhancement. The architecture has four stages with shortcut connections, and strided convolutional layers are placed in the last two stages to effectively preserve the input patch information. Zhang et al. proposed DarkFeat [20], which addresses the challenges of descriptors in low-light visual perception at night. Local feature descriptors are crucial for such applications, but they suffer performance degradation in extreme low-light scenarios due to the low signal-to-noise ratio of the images. To overcome this, DarkFeat uses a deep learning model designed for end-to-end detection and description of local features directly from RAW-format images captured in extreme low-light conditions.

The majority of the aforementioned research on descriptor matching primarily focuses on single-scale approaches. In contrast, we propose a center-surround multi-scale fusion network to enhance matching accuracy and enable the network to focus on different parts of the image patch. In the following section, we provide a detailed description of our network architecture and its components.

2.2 Multi-scale feature fusion

Multiscale feature fusion has greatly contributed to target detection and semantic segmentation tasks by combining semantic and spatial information from different feature scales to improve overall performance. Multiscale input methods include building multiscale pyramids, using multiscale intermediate feature maps, and parallel inputting of multiscale image information.

In serial multi-scale architectures, FPN [21] integrates multi-scale features with a modified top-down path using center-cropped patches, creating a pyramid-like structure. EFPN [22] enhances object detection with a Pyramid Network featuring an extra high-res level for small object detection. MSPFN [23] achieves rain streak removal with recurrent calculation and a multi-scale pyramid. MFANet [24] improves detection accuracy using channel attention. In parallel multi-scale networks, SPP [25] extracts features with multiple pooling layers. ASPP [26] captures multi-scale information with parallel atrous convolutions. Trident-net [27] adjusts the receptive field with parallel multi-branch architectures, enhancing feature extraction efficiency.

Building upon insights gained from prior methods, CS L2Net [10] introduces a novel architectural paradigm in the realm of feature extraction. Departing from conventional approaches, CS L2Net embraces a concatenation tower structure that leverages the strength of two distinct L2Net models, each characterized by parallel towers. Patch-NetVLAD [12] deviates from conventional concatenation by employing fixed weights for the fusion of descriptors across center-surround scales. The process involves cutting patches, generating descriptors, and applying predetermined weights for fusion.

However, an important observation is that existing multi-scale fusion methods either lack learnability or rely on direct concatenation. This limitation restricts the optimization of the fusion process during backpropagation. To overcome this challenge, our proposed approach introduces an adaptive center-surround multi-scale fusion method. This strategy addresses the non-learnable nature of existing techniques, allowing dynamic optimization of descriptor fusion across center-surround scales through adaptive weight adjustment during backpropagation.

To effectively utilize multi-scale features, we also design a parallel multi-branch center-surround network. Each branch focuses on learning features at a particular scale, and an efficient fusion technique is employed instead of simple concatenation. This approach allows us to fully exploit the benefits of multi-scale information. The proposed feature fusion is learned adaptively and achieves better performance.

2.3 Descriptor distance constraint

Modifying the loss function and regularization terms has proven to be an effective approach for constraining the distance of descriptors. Triplet loss [28] enables the learning of more suitable network weights by comparing Euclidean distances between samples. While Euclidean distances provide an absolute measurement, relative distance measurement is more appropriate for vector embedding learning. To incorporate relative measurement into descriptor learning, RALNet [13] utilizes the angular distance between feature vectors instead of the L2 distance to measure their similarity. In our approach, the regularization of the loss function is also designed based on angular distance. In addition to modifying how descriptor distances are measured, researchers have explored modifications to other aspects of the loss function. For instance, CDF [17] enhances the triplet loss by utilizing a dynamic margin based on the cumulative distribution instead of a fixed margin. HSD [15] projects the entire network onto hyperspherical space by altering the normalization, demonstrating that hyperspherical learning is more suitable for descriptors. RDLNet [16] focuses on learning hard samples and compact descriptors through triplet networks, directing the network’s attention to challenging examples and promoting the generation of concise descriptors.

Constraining descriptor distances solely on the basis of matched and unmatched pairs may not be sufficiently robust. In addition to first-order optimization techniques, SOSNet [14] demonstrates that second-order constraints further enhance the quality of descriptors. However, SOSNet overlooks the fact that the relative distance between matching pairs can also be constrained within the regularization term. In our approach, we introduce this constraint into the second-order constraint and implement it using the similarity matrix. This regularization term takes the properties of matching pairs into account, improving the overall performance and robustness of the descriptors.

Fig. 1 The main architecture of AFSRNet consists of three parts: multi-scale input, feature extraction, and adaptive multi-scale fusion. The architecture includes three branches, named Full Patch Branch, Center-Peripheral Branch, and Central Branch from top to bottom

Fig. 2 The feature extraction architecture is adapted from HyNet [29]. Each convolutional layer is followed by Filter Response Normalization (FRN) and a Thresholded Linear Unit (TLU)

3 Method

3.1 Network architecture

Figure 1 shows the main architecture of AFSRNet, which mainly contains three parts: multi-scale input, feature extraction, and feature fusion. Each part is described in detail in the following.

Multi-scale input

Previous work has demonstrated that paying more attention to the central region of input patches can enhance the accuracy of descriptor matching. To this end, we propose a three-stream structure in which all three scales contain the central region. As shown in Fig. 1, the three multi-scale branches are referred to as the full-patch branch, center-peripheral branch, and central branch, with input sizes of 64\(\times \)64, 48\(\times \)48, and 32\(\times \)32, respectively. The latter two inputs are obtained by center-surround cropping. The smallest one, closest to the center, is the central part and corresponds to the central branch. The second smallest is the center-peripheral part, which lies between the full patch and the central part, contains the central part, and corresponds to the center-peripheral branch. The three inputs are processed using dilated convolution with different strides in the three branches to generate features of dimension 32\(\times \)32\(\times \)32. Such center-surround multi-scale input is critical for enhancing the performance of descriptor matching.
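The center-surround cropping described above can be illustrated with a minimal NumPy sketch. The function names `center_crop` and `center_surround_inputs` are hypothetical and are not taken from the authors' code; only the three input sizes (64, 48, 32) come from the text. The dilated convolutions that follow the cropping are not modeled here.

```python
import numpy as np


def center_crop(patch, size):
    """Return the size x size window centred in a square 2-D patch."""
    height, width = patch.shape
    top = (height - size) // 2
    left = (width - size) // 2
    return patch[top:top + size, left:left + size]


def center_surround_inputs(patch, sizes=(64, 48, 32)):
    """Build the full-patch, center-peripheral, and central inputs.

    The sizes follow the text; the helper itself is an illustrative
    assumption, not the authors' implementation.
    """
    return [center_crop(patch, size) for size in sizes]


# A synthetic 64x64 patch whose pixel values encode their own position.
patch = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
full, peripheral, central = center_surround_inputs(patch)
print(full.shape, peripheral.shape, central.shape)
```

Because every crop is centred, the central input is also the centre of the center-peripheral input, which is the nesting property the three branches rely on.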

Feature Extraction

After adjusting the size of the feature maps from the cropped patches, we feed them into the feature extraction module. The architecture of the feature extraction module of AFSRNet is shown in Fig. 2. Instead of pooling layers, the spatial size is reduced using strided convolutions, since pooling layers tend to negatively impact the performance of the descriptor [11]. The output of this module is a feature map of size 8\(\times \)8\(\times \)128.

Feature Fusion

In our proposed method, we utilize adjacent sub-networks in parallel for feature fusion. This allows the descriptor to focus on feature information from different center-surround receptive fields. To achieve adaptive feature map fusion, we implement a fusion method based on normalized weights, as follows:

$$\begin{aligned} F_1=Conv(\frac{\omega _{11}\cdot P_1+\omega _{12}\cdot P_2}{\omega _{11}+\omega _{12}+\varepsilon }), \end{aligned}$$
(1)
$$\begin{aligned} F_2=Conv(\frac{\omega _{21}\cdot P_1+\omega _{22}\cdot P_2+\omega _{23}\cdot P_3}{\omega _{21}+\omega _{22}+\omega _{23}+\varepsilon }), \end{aligned}$$
(2)
$$\begin{aligned} F_3=Conv(\frac{\omega _{32}\cdot P_2+\omega _{33}\cdot P_3}{\omega _{32}+\omega _{33}+\varepsilon }), \end{aligned}$$
(3)
$$\begin{aligned} F_{final}=\frac{\omega _{f_1}\cdot F_1+\omega _{f_2}\cdot F_2+\omega _{f_3}\cdot F_3}{\omega _{f_1}+\omega _{f_2}+\omega _{f_3}+\varepsilon }. \end{aligned}$$
(4)

The weight \(\omega _{ij}\) connecting the \(j^\textrm{th}\) input to the \(i^\textrm{th}\) output is learnable. To ensure that \(\omega _{ij}\ge 0\), we apply a ReLU to each \(\omega _{ij}\). In addition, \(\varepsilon =0.0001\) is a small value that avoids numerical instability.
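As a rough illustration of the normalized weighted fusion in (1)-(4), the following NumPy sketch computes the fused maps for fixed example weights. It rests on several assumptions: the `Conv` operator is replaced by the identity, the weights are plain numbers rather than trainable parameters, and the feature-map shape and the helper name `fuse` are chosen only for the example.

```python
import numpy as np

EPS = 1e-4  # epsilon from the text, guards against division by zero


def fuse(features, raw_weights, eps=EPS):
    """Normalised weighted sum of equally shaped feature maps.

    ReLU keeps every weight non-negative, as required in the text.
    The Conv(.) of Eqs. (1)-(3) is omitted in this sketch.
    """
    weights = np.maximum(np.asarray(raw_weights, dtype=np.float64), 0.0)
    weighted_sum = sum(w * f for w, f in zip(weights, features))
    return weighted_sum / (weights.sum() + eps)


rng = np.random.default_rng(0)
# Illustrative branch outputs; the shape is an assumption for the demo.
P1, P2, P3 = (rng.standard_normal((8, 8, 128)) for _ in range(3))

F1 = fuse([P1, P2], [0.7, 0.3])           # Eq. (1) without Conv
F2 = fuse([P1, P2, P3], [0.2, 0.5, 0.3])  # Eq. (2) without Conv
F3 = fuse([P2, P3], [0.4, -0.1])          # Eq. (3); -0.1 is clipped to 0
F_final = fuse([F1, F2, F3], [1.0, 1.0, 1.0])  # Eq. (4)
print(F_final.shape)
```

In a trainable version the raw weights would be parameters updated by backpropagation, which is what distinguishes this fusion from fixed-weight fusion or plain concatenation.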

3.2 Loss function and regularization

Triplet loss

Triplet loss is a commonly used loss function for training local descriptors. It enforces smaller distances between positive matches and larger distances between negative matches. The common expression for triplet loss is as follows:

$$\begin{aligned} L_{triplet}=\frac{1}{N}\sum _{i}^N max(s(a_{i},p_{i})-s(a_{i},n_{i})+m,0), \end{aligned}$$
(5)

where a, p, and n represent the anchor, positive, and negative of a triplet, m represents the margin, and the function s(x, y) represents the similarity score between two L2-normalized features x and y. Sampling is crucial for both performance and computational efficiency; thus, we adopt the same sampling strategy as HardNet [11]. RALNet [13] has demonstrated that cosine similarity is superior to Euclidean distance for comparing descriptors. Therefore, we define the similarity function s(x, y), hereafter denoted as \(s_{xy}\) for simplicity, as follows:

$$\begin{aligned} s(x,y)=1-x\cdot y^{T}=1-||x||\,||y||\cos \theta _{{xy}}, \end{aligned}$$
(6)

where \(\theta _{xy}\) represents the angle between x and y.
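The triplet loss in (5) with the score of (6) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: negatives are supplied explicitly instead of being mined with the HardNet strategy, the margin value 0.5 is hypothetical (the text does not state the margin used), and the function names are chosen for the example.

```python
import numpy as np


def score(x, y):
    """Row-wise s(x, y) = 1 - x . y for L2-normalised rows, as in Eq. (6).

    The score is 0 for identical unit vectors and 1 for orthogonal ones.
    """
    return 1.0 - np.sum(x * y, axis=1)


def triplet_loss(anchors, positives, negatives, margin=0.5):
    """Mean hinge loss of Eq. (5); the margin value is an assumption."""
    hinge = score(anchors, positives) - score(anchors, negatives) + margin
    return float(np.mean(np.maximum(hinge, 0.0)))


# Two unit-length anchors in 2-D.
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
matching = anchors.copy()                      # perfect positives
orthogonal = np.array([[0.0, 1.0], [1.0, 0.0]])  # unrelated descriptors

easy = triplet_loss(anchors, matching, orthogonal)  # margin satisfied
hard = triplet_loss(anchors, orthogonal, matching)  # roles swapped
print(easy, hard)
```

The first call yields zero loss because every negative is already further away than its positive by more than the margin; swapping the roles produces a positive loss.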

Regularization

During training, a batch consists of N pairs of matched patches. The descriptor matrices A and P are each of size \(N \times 384\), while the similarity matrix D is of size \(N \times N\). The similarity matrix D is given by:

$$\begin{aligned} D = 1-A\cdot P^{T} =\left[ \begin{array}{cccc} s_{a_{1}p_{1}} & s_{a_{1}p_{2}} & \cdots & s_{a_{1}p_{N}} \\ s_{a_{2}p_{1}} & s_{a_{2}p_{2}} & \cdots & s_{a_{2}p_{N}} \\ \vdots & \vdots & \ddots & \vdots \\ s_{a_{N}p_{1}} & s_{a_{N}p_{2}} & \cdots & s_{a_{N}p_{N}}\end{array}\right] . \end{aligned}$$
(7)

The foundational principle of second-order similarity (SOS) is that vertices with similar neighbors are likely to be similar themselves. SOSNet [14] employs second-order similarity regularization (SOSR), which minimizes the difference between the Euclidean distance of \(a_i\), \(a_j\) and that of \(p_i\), \(p_j\) to enhance descriptor similarity. However, the efficacy of using Euclidean distance as a second-order similarity measure may be limited [13]. Additionally, SOSNet overlooks the potential improvement that could result from integrating first-order similarity properties into second-order similarity constraints. While constructing similarity matrices for training is common in descriptor methods, these matrices are rarely used in regularization terms.

In response to these limitations, we introduce a novel regularization term, symmetric regularization (SR). The motivation behind SR lies in addressing the inadequacies of relying solely on Euclidean distance and exploring the untapped potential of incorporating first-order and second-order similarity properties. To achieve this, SR strategically leverages distance matrices, which comprehensively capture the pairwise relationships between descriptors.

Distance matrices provide a detailed representation of the similarity landscape by encapsulating the distances between all descriptor pairs. This nuanced understanding of pairwise relationships enables a more refined and context-aware descriptor learning process. Moreover, SR places emphasis on the symmetry of similarity relationships.

By incorporating distance matrices and symmetry considerations into the regularization process, SR aims to refine descriptor training. The regularization term encourages bidirectional consistency in similarity relationships, enhancing the overall robustness and accuracy of the descriptor model. This strategic incorporation of distance matrices and symmetry considerations represents a departure from conventional approaches, ensuring a more detailed exploration of the underlying structures within descriptor spaces. The introduction of SR contributes to a more nuanced understanding of the intrinsic relationships between descriptors, ultimately leading to improved descriptor learning outcomes.

Proposition 1

Let \((a_i, p_i)\) and \((a_j, p_j)\) be pairs of matched descriptors in a batch. If they are first-order and second-order similar, that is

$$\begin{aligned} First\,\,Order\,\,Similarity\,(FOS):\,\,{s_{{a_i}{p_i}}}={s_{{a_j}{p_j}}}=0, \end{aligned}$$
(8)
$$\begin{aligned} Second\,\,Order\,\,Similarity\,(SOS):\,\,s_{{a_i}{a_j}}=s_{{p_i}{p_j}}, \end{aligned}$$
(9)

then the similarity matrix D is symmetric.

Proof

Based on (8), we can conclude that

$$\begin{aligned} \left\{ \begin{aligned} a_i\cdot {p_i}^{T}&= ||a_i||\,||p_i||\cos \theta _{{a_i}{p_i}}= 1,\\ a_j\cdot {p_j}^{T}&= ||a_j||\,||p_j||\cos \theta _{{a_j}{p_j}}= 1 \end{aligned} \right. \end{aligned}$$
(10)

The descriptors are all L2 normalized, so their norms are equal to 1, and thus it can be concluded from (6) and (10) that

$$\begin{aligned} \theta _{{a_i}{p_i}}=\theta _{{a_j}{p_j}}= 0. \end{aligned}$$
(11)

Based on (9) and (6), we can conclude that

$$\begin{aligned} \theta _{{a_i}{a_j}}=\theta _{{p_i}{p_j}}, \end{aligned}$$
(12)

and we have that

$$\begin{aligned} \left\{ \begin{aligned} \theta _{{a_i}{p_j}}&=\theta _{{a_i}{a_j}}{\pm }\theta _{{a_j}{p_j}},\\ \theta _{{a_j}{p_i}}&=\theta _{{p_i}{p_j}}{\pm }\theta _{{a_i}{p_i}}. \end{aligned} \right. \end{aligned}$$
(13)

Then, according to (6), (11), (12) and (13), we derive that

$$\begin{aligned} \left\{ \begin{aligned} \theta _{{a_i}{p_j}}&=\theta _{{a_j}{p _i}},\\ s_{{a_i}{p_j}}&=s_{{a_j}{p_i}} \end{aligned} \right. \end{aligned}$$
(14)

Referring to (7), \(s_{{a_i}{p_j}}\) and \(s_{{a_j}{p_i}}\) are symmetric elements of the similarity matrix D. Since \({s_{{a_i}{p_j}}}={s_{{a_j}{p_i}}}\) holds for every pair i, j, we have D=\(D^T\), so the similarity matrix D is a symmetric matrix.

Conversely, if D is a symmetric matrix and \(a_i\), \(p_i\) are first-order matched, we have (11) and (14), and we conclude that:

$$\begin{aligned} \left\{ \begin{aligned} \theta _{{a_i}{a_j}}&=\theta _{{a_i}{p_j}}{\pm }\theta _{{a_j}{p_j}}=\theta _{{a_i}{p_j}},\\ \theta _{{p_i}{p_j}}&=\theta _{{a_j}{p_i}}{\pm }\theta _{{a_j}{p_j}}=\theta _{{a_j}{p_i}}. \end{aligned} \right. \end{aligned}$$
(15)

Then, referring to (6), we conclude that \(s_{{a_i}{a_j}}=s_{{p_i}{p_j}}\). So if D is a symmetric matrix and \(a_i\), \(p_i\) are first-order matched, we can conclude that \(a_i\), \(p_i\) are second-order similar.

Finally, we have that if and only if \(a_i\), \(p_i\) are pairs of matched descriptors and they are second order similar, the similarity matrix of them is symmetric. \(\square \)

This suggests that enhancing the symmetry of the descriptor similarity matrix can lead to second-order similarity among the descriptors within the same batch. We formulate the symmetric regularization term as follows:

$$\begin{aligned} SR=\frac{\Vert \textbf{D}-\textbf{D}^{\top }\Vert _{\textbf{F}}}{\Vert \textbf{D}\Vert _{\textbf{F}}}, \end{aligned}$$
(16)

where \(\Vert \textbf{D}\Vert _{\textbf{F}}\) denotes the Frobenius norm of D. Essentially, (16) quantifies the proportion of asymmetry in the similarity matrix D, with a decreasing value indicating reduced asymmetry in the matrix.
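A small NumPy sketch can make the behavior of (16) concrete. It builds D as in (7) and evaluates SR for two cases: perfectly matched descriptors (P equal to A), where D is symmetric and SR vanishes, and perturbed positives, where SR becomes positive. The batch size, the descriptor dimension of 16, the noise level, and the function names are assumptions made only for this demonstration.

```python
import numpy as np


def l2_normalize(x):
    """Scale every row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def similarity_matrix(A, P):
    """D = 1 - A P^T for row-wise L2-normalised A and P, as in Eq. (7)."""
    return 1.0 - A @ P.T


def symmetric_regularization(D):
    """SR = ||D - D^T||_F / ||D||_F, as in Eq. (16).

    No epsilon is added, matching the formula; D must not be all zeros.
    """
    return float(np.linalg.norm(D - D.T, ord="fro") / np.linalg.norm(D, ord="fro"))


rng = np.random.default_rng(0)
A = l2_normalize(rng.standard_normal((6, 16)))  # toy batch: N=6, dim=16

# Case 1: perfect first-order matches (p_i = a_i) give a symmetric D.
sr_matched = symmetric_regularization(similarity_matrix(A, A))

# Case 2: noisy positives break the symmetry.
P_noisy = l2_normalize(A + 0.3 * rng.standard_normal(A.shape))
sr_noisy = symmetric_regularization(similarity_matrix(A, P_noisy))

# Hand-checkable 2x2 case: sqrt(8) / sqrt(10) = sqrt(0.8).
sr_toy = symmetric_regularization(np.array([[0.0, 1.0], [3.0, 0.0]]))
print(sr_matched, sr_noisy, sr_toy)
```

The computation uses only the single matrix D, so the term can be evaluated directly on the batch similarity matrix that is already needed for the triplet loss.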

In contrast to SOSNet, which employs a KNN algorithm and constructs three matrices for computation with an algorithmic complexity of O(N), our proposed symmetric regularization only requires the construction of a single matrix and has an algorithmic complexity of O(1). The total loss function is expressed as:

$$\begin{aligned} L=\frac{1}{N}\sum _{i}^N max(s(a_{i},p_{i})-s(a_{i},n_{i})+m,0)+\lambda SR. \end{aligned}$$
(17)

4 Experiments

In this section, we compare our proposed descriptor learning method with several methods on two benchmark datasets: the Brown dataset [30] and the HPatches dataset [31].

4.1 Implementation details

To prevent overfitting, a dropout rate of 0.2 is applied. The PyTorch library is utilized to train our local descriptor network. SGD is chosen as the optimizer with an initial learning rate of 0.1, momentum of 0.9, and weight decay of 0.0001. The value of \(\lambda \) in the loss function is set to 0.15.

Table 1 Patch-verification performance on the UBC phototour dataset. False positive rates at 95\(\%\) recall are reported
Fig. 3 Results of the patch-verification, patch-matching, and patch-retrieval tasks on HPatches [31]. Our proposed method outperforms all other methods in all three tasks

4.2 Experimental results and analysis

4.2.1 Brown phototour

For network training, we use the Brown dataset [30], which is composed of local patches extracted from different scenes. The Brown dataset consists of three subsets: Yosemite, Notredame, and Liberty. Usually, one subset is used as the training set while the other two are used for testing. Each patch in the dataset is labeled with a 3D point index, and patches with the same 3D point index are matching ones. For each 3D point, there are at least 2 matching patches. There are approximately 500K 3D points and 1.5M patches in the Brown dataset. The original size of each patch is 64\(\times \)64. We extract 1000K triplets of patches from the training set with a batch size of 384 for training. We follow the standard evaluation protocol by using the 100K pairs provided by the authors and report the false positive rate at 95\(\%\) recall.
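For readers unfamiliar with the metric, the false positive rate at 95% recall (FPR95) can be computed as in the following NumPy sketch. It assumes each test pair comes with a binary match label and a descriptor distance, where a smaller distance means a more confident match; the function name and the toy data are illustrative and are not part of the benchmark code.

```python
import numpy as np


def fpr_at_recall(labels, distances, recall=0.95):
    """False positive rate at the distance threshold reaching `recall`.

    labels: 1 for matching pairs, 0 for non-matching pairs.
    distances: descriptor distances (smaller = more similar).
    """
    labels = np.asarray(labels).astype(bool)
    distances = np.asarray(distances, dtype=np.float64)

    order = np.argsort(distances, kind="stable")
    sorted_labels = labels[order]
    num_pos = int(sorted_labels.sum())
    num_neg = sorted_labels.size - num_pos

    # Number of true positives accepted when keeping the k+1 closest pairs.
    true_pos = np.cumsum(sorted_labels)
    needed = int(np.ceil(recall * num_pos - 1e-9))
    k = int(np.searchsorted(true_pos, needed))  # first index with enough TPs
    false_pos = (k + 1) - int(true_pos[k])
    return false_pos / num_neg


# Toy data: 20 matching pairs and 10 non-matching pairs.
pos_dist = np.arange(1.0, 21.0)                    # 1, 2, ..., 20
neg_dist = np.array([0.5, 5.5] + [25.0 + i for i in range(8)])
labels = np.concatenate([np.ones(20), np.zeros(10)])
distances = np.concatenate([pos_dist, neg_dist])

fpr95 = fpr_at_recall(labels, distances)
print(fpr95)
```

In the toy data, 19 of the 20 matching pairs must be accepted to reach 95% recall, which sets the threshold at distance 19; two of the ten non-matching pairs fall below it.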

Data-driven neural networks clearly improve on traditional methods. Deep learning-based approaches are a major step forward, especially through modifications to the network structure and regularization terms that improve descriptor learning performance. As shown in Table 1, experimental results on the Brown dataset highlight the superiority of CNN-based approaches, such as L2Net [10], HardNet [11], RAL-Net [13], and SOSNet [14], over SIFT.

Thanks to our adaptive fusion and the computationally simple yet effective SR, our approach outperforms previous methods. Notably, our method employs a novel three-branch center-surround multi-scale design and outperforms similar methods such as CS L2Net [10].

Based on the experimental results, we make the following observations. As shown in Table 1, our approach outperforms other descriptor-learning methods on the Brown dataset. It also consistently surpasses unsupervised approaches such as TBLD [32], BLCD [33], and HybridDesc [34], the last of which is acknowledged as the leading unsupervised feature descriptor learning approach.

4.2.2 HPatches

HPatches [31], a local descriptor evaluation benchmark, provides a large dataset and evaluation criteria for modern descriptors. The HPatches dataset consists of over 1.5 million patches extracted from 116 scenes with viewpoint and illumination changes. Unlike the Brown dataset, it contains more diverse and noisier changes. According to the level of geometric noise, the extracted patches are divided into three groups: easy, hard, and tough. HPatches defines three evaluation tasks: patch verification, image matching, and patch retrieval. In the evaluation on the HPatches dataset, all learning methods shown in Fig. 3 are trained on Liberty, a subset of the Brown dataset.

Training every method in Fig. 3 on the same Liberty subset of the Brown dataset provides a consistent basis for comparison. As Fig. 3 shows, our approach improves on existing methods in all three tasks: patch verification, image matching, and patch retrieval. This evaluation on HPatches supports the robustness of our approach relative to the other methods and its relevance to real-world applications.

Table 2 Comparison of AMSF, concatenation, and single-scale descriptors. AMSF stands for adaptive multi-scale fusion

4.3 Ablation study

To validate the efficacy of our proposed method, we performed ablation experiments on various components, including the AMSF module, the SR, the number of branches, and the value of \(\lambda \). These experiments were conducted on the Brown dataset.

Fig. 4 Model analysis. AFSRNet is trained and tested on the Brown dataset. (a) Effect of AMSF. The area under the ROC curve (AUC) at different training epochs (trained on Liberty and tested on Notredame) serves as the indicator. (b) Effect of \(\lambda \). The curve shows the relation between FPR95 and \(\lambda \)

Table 3 Effect of SR, where “w/” and “w/o” mean with and without, respectively
Table 4 Comparison of different numbers of branches
Fig. 5 Visualization of matching results on HPatches [31]. The correct matching pairs are indicated by blue lines

1) Impact of AMSF: We evaluate the effect of AMSF on descriptor learning by comparing it with single-scale learning methods, such as L2Net [10], and simple concatenation methods, such as CS L2Net [10], on the Brown dataset. AFSRNet comprises three sub-networks, so the dimension of the output produced by simply concatenating the descriptors from the three sub-networks is 3\(\times \)128, i.e., 384. Therefore, for a fair comparison, all compared methods are trained such that the dimension of their descriptors is also 384. Table 2 shows that our proposed adaptive multi-scale fusion module is superior to concatenation.

In addition to these results, we examined the area under the ROC curve (AUC) throughout the training process. The AUC curve in Fig. 4-(a) tracks how performance evolves over the training epochs. Concatenation surpasses the single-scale model at about the 15th epoch, which indicates that the additional scale information gives the descriptor better matching performance. AMSF surpasses concatenation at about the 10th epoch, which shows that not only the scale information itself but also the way it is fused into a single descriptor matters. AMSF fuses the multi-scale information more effectively, and this advantage is visible throughout the training process.

This in-depth investigation not only provides a more comprehensive understanding of the superiority of the AMSF module but also emphasizes that this superiority is sustained throughout the entire training process. The AUC curve showcases the continuous and progressive excellence of our model, going beyond occasional instances of favorable experimental results. This extended analysis not only bolsters the evidence supporting the effectiveness of the AMSF module but also enhances the overall persuasiveness and credibility of our work.

2) Impact of SR: We also conducted experiments to evaluate the effect of symmetric regularization (SR) on the performance of our network. Networks trained with SR outperform networks trained without it, as evidenced by the decrease in FPR95. Specifically, Table 3 shows that, for the same descriptor distance, using SR results in a 30.1\(\%\) reduction in FPR95.

3) Impact of the number of branches: To understand how the number of branches affects the final matching results, we conducted experiments comparing two-branch and four-branch models with our proposed three-branch model. The input patch sizes for the two-branch network are 64\(\times \)64 and 32\(\times \)32, while for the four-branch network they are 64\(\times \)64, 56\(\times \)56, 48\(\times \)48, and 32\(\times \)32. The network configuration is otherwise the same, utilizing the AMSF module and SR. As shown in Table 4, our three-branch model performs best. Despite its higher computational cost, the four-branch network does not outperform our approach in matching effectiveness. This indicates that a network’s efficacy is not determined solely by the number of branches or scales. Although the two-branch network performs worse than the three- and four-branch ones, it outperforms other two-branch networks such as CS L2Net [10], showcasing the effectiveness of our adaptive fusion module.

4) Impact of \(\lambda \): We also explored the effect of the hyperparameter \(\lambda \) on the performance of our method by varying its value from 0.05 to 0.4. The training was conducted on the Liberty subset, and the evaluation was performed on the remaining two subsets. As shown in Fig. 4-(b), we observed that the best performance of the learned descriptors was achieved when \(\lambda \) was set to 0.15. Therefore, we selected \(\lambda \)=0.15 as the weight value to appropriately balance the first-order and second-order loss functions in our approach.

4.4 Visualization result

In order to thoroughly assess the efficacy of our proposed method, we present a comprehensive performance validation based on visual results obtained from the HPatches dataset [31]. The initial phase of our experimentation involved the utilization of the SIFT algorithm [8] for both keypoint detection and descriptor extraction. Three distinct methods, namely HardNet [11], SOSNet [14], and our novel AFSRNet, were employed for the extraction of descriptors.

Following the descriptor extraction process, we applied a nearest neighbor distance ratio matching strategy with a specified threshold of 0.75, as illustrated in Fig. 5. This matching strategy, depicted in the figure, serves to establish correspondences between descriptors and plays a crucial role in evaluating the performance of our method.
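The nearest neighbor distance ratio test with the 0.75 threshold can be sketched as follows. The sketch assumes Euclidean distance between descriptors and brute-force search, neither of which is specified in the text; the function name and the toy descriptors are illustrative only.

```python
import numpy as np


def ratio_test_matches(desc1, desc2, ratio=0.75):
    """Match rows of desc1 to rows of desc2 with the NN distance ratio test.

    A match (i, j) is kept only if the nearest neighbour j of descriptor i
    is closer than `ratio` times the distance to the second nearest one.
    desc2 must contain at least two descriptors.
    """
    # Pairwise Euclidean distances, shape (len(desc1), len(desc2)).
    dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    rows = np.arange(dists.shape[0])
    nearest, second = order[:, 0], order[:, 1]
    keep = dists[rows, nearest] < ratio * dists[rows, second]
    return [(int(i), int(nearest[i])) for i in rows[keep]]


# Toy 2-D descriptors: two distinctive queries and one ambiguous query.
queries = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
references = np.array([[1.0, 0.0], [0.0, 1.0]])

matches = ratio_test_matches(queries, references)
print(matches)
```

The third query is equidistant from both references, so the ratio test rejects it; this is the mechanism by which ambiguous correspondences are filtered out before visualization.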

Fig. 5 visually highlights the performance of our proposed method in terms of descriptor pair matching accuracy. AFSRNet consistently outperforms both HardNet and SOSNet, producing more accurately matched descriptor pairs. This visual evaluation underscores the robustness of our approach, showing that it generates more precise and reliable descriptor matches than the other two established methods.

5 Conclusion

In this paper, we introduce a novel feature extraction model based on adaptive multi-scale feature fusion (AMSF) and a new regularization term, called symmetric regularization (SR). Our proposed method achieves state-of-the-art performance on the Brown dataset, and also exhibits strong generalization ability on the HPatches dataset. Furthermore, we conducted a comprehensive ablation study to reveal the contribution of each proposed component to the final performance. Our results show that our proposed AMSF module and SR play critical roles in enhancing the performance of descriptor learning.