1 Introduction

Finding pixel-level correspondences across semantically similar images facilitates a variety of computer vision applications, including non-parametric scene parsing  [22, 30, 52], image manipulation  [10, 26, 51], and visual localization  [41, 47], to name a few.

Classical approaches for dense correspondence assume visually similar images captured under constrained settings, such as the 1D epipolar line in stereo matching  [43, 50] and small 2D motion in optical flow estimation  [1, 9]. In contrast, semantic correspondence imposes no such constraints on the input image pairs except that the two images depict the same object or scene category, posing additional challenges due to large intra-class appearance and geometric variations. Recent state-of-the-art methods  [17, 19, 20, 23, 26, 28, 39,40,41, 44] have attempted to address these challenges by carefully designing convolutional neural networks (CNNs) that mimic the classical matching pipeline  [36]: feature extraction, similarity score computation, and correspondence estimation.

Fig. 1. Visualization of our intuition: (a) image pair, (b) selected confident matches, and (c) warped image using the correspondences from our method. The proposed method, guided semantic flow, establishes reliable dense semantic correspondences by leveraging the guidance from confident matches to reduce matching ambiguities.

Since no viewpoint constraint is imposed on the source and target images, the search space for each pixel in the source image must span all pixels of the target image. However, searching over the full set of pairwise matching candidates inevitably increases the uncertainty in the matching pipeline, especially in the presence of non-rigid deformations and repetitive patterns.

One possible approach to this issue is to design additional modules that vote for plausible transformation candidates from the full set of pairwise matches  [17, 39,40,41, 44]. Following the pioneering work of  [39], several methods  [40, 44] attempted to directly regress an image-level global transformation (e.g. affine or thin plate spline) between images. However, all matching scores are treated equally regardless of how confident they are, so these approaches are inherently vulnerable to the inaccurate matching scores that are often produced under severe intra-class variations. Dispensing with global geometry, some methods  [17, 41] recently proposed to identify locally consistent matches by analyzing neighborhood consensus patterns. They down-weight ambiguous matches by assessing the confidence of matching scores, but only with a hand-crafted criterion (e.g. mutual consistency) that often produces high confidence scores even for unconfident pixels.

Alternatively, similar to stereo matching and optical flow estimation  [9, 50], one can simply discard ambiguous matches by constraining the search space to a predefined local region centered at the querying pixel  [20, 26], but these approaches disregard the possibility of non-local matches that often appear across semantically similar images. To address this issue, a dilation technique  [49] was utilized in  [23], but it increases the number of ambiguous matches at the same time. Some methods alleviate this by limiting the search space based on heuristic matching cues, e.g. computing the discrete argmax  [28] or starting with an image-level global transformation  [19] estimated from the full set of pairwise similarity scores. However, such heuristics are often violated under large intra-class variations, where the feature representations are too inconsistent to measure accurate matching similarity, or under non-rigid geometric deformations that cannot be modeled with a global transformation model.

In this paper, we propose a novel approach, dubbed guided semantic flow, that reliably infers dense semantic correspondence fields under large intra-class variations, as illustrated in Fig. 1. Our key idea is based on two observations: sparse yet reliable matches can effectively capture non-rigid geometric variations, and these confident matches can guide adjacent pixels toward similar solution spaces, significantly reducing matching ambiguities. Our method realizes this idea through three modules for pruning, propagation, and matching. We first select confident matches from a complete set of pairwise matching candidates through deep networks, and then propagate their reliable information to invalid neighborhoods through a new differentiable upsampling layer inspired by the moving least squares (MLS) approach  [42]. Lastly, dense correspondence fields are reliably inferred from the refined correlation volume by constraining the search space with a Gaussian parametric model centered at the interpolated displacement vector. Experimental results on various benchmarks demonstrate the effectiveness of the proposed model over the latest methods for dense semantic correspondence.

2 Related Works

Stereo Matching and Optical Flow Estimation. There have been numerous efforts to reduce the matching ambiguity in classical dense correspondence problems, i.e. stereo matching and optical flow estimation.

Based on the seminal work of PatchMatch  [2], the randomized search scheme has been utilized and extended in numerous works thanks to its effectiveness in pruning the search space  [7, 15, 16]. Another popular idea is to leverage the spatial pyramid of an image, naturally imposing a hierarchical smoothness constraint in a coarse-to-fine manner  [5, 38, 45]. Also, in order to enhance matching scores, recent approaches for depth estimation  [35, 37] additionally exploit sparse yet reliable measurements retrieved from an external source (e.g. LiDAR). However, since these approaches are tailored to specific problem constraints such as epipolar geometry and relatively small motion, they are not directly applicable to the semantic correspondence task, where two images may have large variations in appearance and geometry.

Semantic Correspondence. Most conventional methods for semantic correspondence that use hand-crafted features and regularization terms  [22, 30, 32] have provided limited performance due to their low discriminative power. Recent state-of-the-art approaches have used deep CNNs to extract features  [11, 25, 27] and/or spatially regularize correspondence fields in an end-to-end manner  [19, 23, 39, 44].

To deal with large geometric deformations, several approaches  [17, 39,40,41, 44] first computed similarity scores with respect to all possible pairwise matching candidates and then predicted the semantic correspondence through deep networks. As a pioneering work, Rocco et al.  [39, 40] estimated a global geometric model, such as an affine or thin plate spline (TPS) transformation, through a CNN architecture mimicking the traditional matching pipeline. Seo et al.  [44] proposed an offset-aware correlation kernel to pay more attention to reliable similarity scores. Without relying on a global geometric model, Rocco et al.  [41] proposed to identify sets of spatially consistent matches by analyzing neighborhood consensus patterns. Huang et al.  [17] extended this architecture by leveraging context-aware semantic representations to further resolve local ambiguities.

Rather than considering all possible matching candidates, some methods  [19, 20, 23, 26, 28] constrain matching candidates to pre-defined local regions, as in stereo matching and optical flow approaches  [9, 50]. In  [20, 23, 26], locally-varying affine transformation fields are iteratively estimated within a locally constrained cost volume. More recently, Lee et al.  [28] proposed to leverage a kernel soft argmax function to deal with multi-modal distributions within a correlation volume.

The most relevant method to ours is  [19], which utilizes intermediate results from the previous level to constrain the search space of the current level in a coarse-to-fine manner. However, it starts with a global affine transformation estimate that often fails to capture reliable matches under large geometric variations with non-rigid transformations.

3 Problem Statement

Let us denote semantically similar source and target images as \(I^s\) and \(I^t\), respectively. The objective is to establish a two-dimensional correspondence field \(\tau _{i}=[u_i,v_i]^T\) between the two images, defined for each pixel \(i=[i_{\mathbf {x}},i_{\mathbf {y}}]^T\) in \(I^s\).

Analogously to the classical matching pipeline  [36], this objective involves first extracting dense feature maps from \(I^s\) and \(I^t\), denoted by \(F^s,F^t\in \mathbb {R}^{h \times w \times d}\), where (h, w) denotes the spatial resolution and d the dimensionality of the features. Then, given the two dense feature maps, a correlation volume C is computed by encoding the cosine similarity:

$$\begin{aligned} C_{ij}(F^s, F^t)={\langle F^s_{i}, F^t_{j}\rangle }/{{\Vert F^s_{i}\Vert }_2 {\Vert F^t_{j}\Vert }_2} \end{aligned}$$
(1)

where i and j indicate the individual feature position in the source and target images, respectively.
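For concreteness, (1) can be written in a few lines of NumPy. This is a minimal sketch, assuming the feature maps are given as (h, w, d) arrays and pixels are indexed linearly in row-major order; it is not the authors' implementation.

```python
import numpy as np

def correlation_volume(Fs, Ft, eps=1e-8):
    """Cosine-similarity correlation volume of Eq. (1).

    Fs, Ft: (h, w, d) source/target feature maps.
    Returns C of shape (h*w, h*w), where C[i, j] scores the match
    between source position i and target position j.
    """
    h, w, d = Fs.shape
    Fs = Fs.reshape(h * w, d)
    Ft = Ft.reshape(h * w, d)
    # L2-normalize each feature vector, then take all pairwise inner products.
    Fs = Fs / (np.linalg.norm(Fs, axis=1, keepdims=True) + eps)
    Ft = Ft / (np.linalg.norm(Ft, axis=1, keepdims=True) + eps)
    return Fs @ Ft.T
```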

Fig. 2. (a) Given an image pair and a reference pixel i, we visualize its corresponding match (\(j=\mathrm {argmax}_l(C_{il})\)) and correlation score map (\(C_{il}\)), computed with: (b) matching candidates \(\mathcal {J}_i^f\)  [17, 39,40,41, 44], (c) matching candidates \(\mathcal {J}_i^p\)  [19, 20, 23, 26, 28], and (d) the proposed method. (e) Our key observation is that sparse yet reliable matches can guide the adjacent pixels to have similar solution spaces, reducing matching ambiguities significantly.

In this stage, several methods  [17, 39,40,41, 44] construct a full correlation volume \(C^f\) considering a set of all possible matching candidates \(\mathcal {J}_i^f\), such that

$$\begin{aligned} \mathcal {J}_i^f = \{j|j_\mathbf {x} \in [1,...,w],j_\mathbf {y} \in [1,...,h]\}. \end{aligned}$$
(2)

Note that \(\mathcal {J}_i^f\) is independent of the pixel i and identical for all pixels i. However, as exemplified in Fig. 2(b), the similarity scores in \(C^f\) are not guaranteed to be accurate due to inconsistent feature representations under large semantic variations. To address this, several approaches  [39, 40, 44] design an additional module that votes for transformation candidates by regressing a single image-level transformation, but they treat the matching scores of all pixels evenly regardless of their confidence. While some methods  [17, 41] alleviate this by filtering the correlation volume with a mutual consistency constraint, they assess the confidences with a simple criterion such as maximum normalization, which may lack the robustness attainable with deep CNNs.

Meanwhile, as shown in Fig. 2(c), some approaches  [19, 20, 23, 26, 28] construct a partial correlation volume \(C^p\) by constraining the search space of each reference pixel i to a restricted local region \(\mathcal {N}_{k}\) centered at the pixel k in the target image. Formally, denoting the pixel k that depends on pixel i as k(i), the constrained matching candidates \(\mathcal {J}_i^p\) can be defined as

$$\begin{aligned} \mathcal {J}_i^p = \{j|j\in \mathcal {N}_{k(i)}\}. \end{aligned}$$
(3)

The center of the local region, k(i), is determined in various ways: as the reference pixel i itself (\(k(i)=i\))  [20, 23, 26], by applying the discrete argmax function to the fully constructed correlation volume  [28] (\(k(i)=\text {argmax}_j(C^f_{ij})\)), or by estimating an image-level coarse transformation \(\tau ^g(C^f)\)  [19] (\(k(i)=i+\tau _i^g(C^f)\)). However, as exemplified in Fig. 2(c), these approaches often fail to constrain the search space correctly under large intra-class variations, where the feature representations between the two input images are too inconsistent to measure accurate matching scores, or under complex geometric deformations that cannot be modeled with a global affine transformation model.
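The difference between the two candidate sets is easy to state in code. The sketch below uses hypothetical helper names and (x, y) coordinate tuples; it enumerates \(\mathcal {J}_i^f\) of (2) and \(\mathcal {J}_i^p\) of (3) for an arbitrary window center k(i).

```python
def full_candidates(h, w):
    """J_i^f of Eq. (2): every target position; identical for every i."""
    return [(x, y) for y in range(h) for x in range(w)]

def local_candidates(k, h, w, r):
    """J_i^p of Eq. (3): an r-radius window N_k around k = (kx, ky).

    k may be the reference pixel itself (k(i) = i), the discrete argmax
    of C^f over j, or i shifted by a global transformation tau^g, as in
    the methods cited above.
    """
    kx, ky = k
    return [(x, y)
            for y in range(max(ky - r, 0), min(ky + r + 1, h))
            for x in range(max(kx - r, 0), min(kx + r + 1, w))]
```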

4 Guided Semantic Flow

The proposed method leverages guidance cues from confident matches to generate reliable likelihood matching hypotheses, as illustrated in Fig. 2(d). Unlike existing methods that alleviate matching ambiguities with inaccurately assessed matching confidences  [17, 41] or with heuristically constrained search spaces  [19, 20, 23, 26, 28], we address this issue with a learning-based selection of confident matches and their propagation, significantly reducing matching ambiguities while maintaining robustness to large geometric variations.

Fig. 3. (a) Our overall framework consists of pruning, propagation, and matching modules. (b) The pruning module takes a full correlation volume C as input and predicts pairwise confidence scores \(Q'\) from it, retaining confident matches and rejecting ambiguous ones with the parameters \(\mathbf {W}_P\). The propagation module converts this volume \(Q'\) into a dense guidance map \(G'\) in a fully differentiable manner. The matching module refines the initial correlation volume C with the guidance map \(G'\) and then estimates a dense correspondence field \(\tau \) with the parameters \(\mathbf {W}_M\).

4.1 Network Architecture

The proposed method consists of three modules, as illustrated in Fig. 3: a pruning module that estimates the confidence probability volume \(Q'\), a propagation module that converts the confidence probability volume into a guidance displacement map \(G'\), and a matching module that refines the initial correlation volume and estimates dense correspondence fields \(\tau \) from it.

To extract convolutional feature maps of the source and target images, the input images are passed through shared feature extraction networks with parameters \(\mathbf {W}_F\), such that \(F=\mathcal {F}(I;\mathbf {W}_F)\), where \(\mathcal {F}\) denotes a feed-forward operation. The initial correlation volume \(C^f\) is then constructed over all possible pairwise matching candidates, following (1) and (2), to account for large intra-class geometric deformations.

Pruning Module. To establish an initial set of confidence probabilities over all pairwise matches, we adopt a differentiable mutual consistency criterion  [17, 41], such that

$$\begin{aligned} {Q_{ij}} = \frac{{{{({C_{ij}})}^2}}}{{{{\max }_i}{C_{ij}} \cdot {{\max }_j}{C_{ij}}}} \end{aligned}$$
(4)

where \({Q_{ij}}\) equals one if and only if the match between i and j satisfies the mutual consistency constraint, and is smaller than one otherwise. Recent works  [17, 41] utilized this confidence volume Q to filter their similarity scores C (e.g. \(Q \cdot C\)), but the confidence of each pixel is assessed only with the hand-crafted criterion of (4), thus often producing a high confidence score even for an unconfident pixel, as exemplified in Fig. 4(a).

In this work, we propose to refine the initial confidence volume with pruning networks that consist of an encoder-decoder architecture followed by a sigmoid function, yielding values in (0, 1) to suppress false positives, as exemplified in Fig. 4(b). Formally, the refined confidence probability volume \(Q'\) is obtained by

$$\begin{aligned} {Q'_{ij}}=T(Q_{ij} \cdot [\mathcal {F}(Q;\mathbf {W}_P)]_{ij},\rho ) \end{aligned}$$
(5)

where \(\mathbf {W}_P\) denotes the parameters of the pruning networks and \(T(\cdot ,\rho )\) is a truncation function that discards probabilities lower than a threshold \(\rho \) to retain only confident matches, such that \(T(X,\rho )=X\) if \(X>\rho \) and \(T(X,\rho )=0\) otherwise.
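Under the same NumPy conventions as before, (4) and (5) amount to the following sketch, where the encoder-decoder \(\mathcal {F}(Q;\mathbf {W}_P)\) is abstracted into a hypothetical callable pruning_net:

```python
import numpy as np

def mutual_consistency(C, eps=1e-8):
    """Initial confidence volume Q of Eq. (4).

    Q[i, j] reaches 1 only when C[i, j] is simultaneously the maximum
    over i (its column) and over j (its row), i.e. a mutually
    consistent match.
    """
    max_over_i = C.max(axis=0, keepdims=True)  # best source score per target j
    max_over_j = C.max(axis=1, keepdims=True)  # best target score per source i
    return C ** 2 / (max_over_i * max_over_j + eps)

def refined_confidence(C, pruning_net, rho=0.9):
    """Q' of Eq. (5): rescale Q by the pruning-network output and
    truncate everything at or below the threshold rho."""
    Q = mutual_consistency(C)
    Qp = Q * pruning_net(Q)        # pruning_net: hypothetical F(Q; W_P)
    return np.where(Qp > rho, Qp, 0.0)
```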

It should be noted that several works have also attempted to find reliable correspondences from the full pairwise similarity scores by thresholding  [40], by correspondence consistency  [19], or by learning with a probabilistic model  [20]. However, these constraints are used only in the loss functions as supervision for training their deep networks, and are not explicitly used to refine the correlation volume.

Propagation Module. Taking the refined confidence volume \(Q'\) as input, our propagation module first extracts the displacement vectors of the confident matches, which can guide nearby ambiguous pixels toward a similar solution space. Specifically, given the set of collected confident pixels \(\mathcal {S} = \{ i|\sum \nolimits _j {{{Q}'_{ij}}} \ne 0\}\), our propagation module converts the confidence volume \(Q'\) into a 2-dimensional displacement map G through a soft argmax layer  [21], such that

$$\begin{aligned} {G_i = \left\{ {\begin{array}{*{20}{l}} {\sum \nolimits _j {{{j \cdot \exp ({{ Q}'_{ij}})}}/{{\sum \nolimits _l {\exp ({{Q}'_{il}})} }}}-i ,}&{}{\mathrm{{if}} \quad i \in \mathcal {S}}\\ \mathrm{{invalid},}&{}{\mathrm{{otherwise.}}} \end{array}} \right. } \end{aligned}$$
(6)
Fig. 4. The effectiveness of the pruning networks: (a) matches that satisfy the mutual consistency criterion (i.e. \(Q_{ij}=1\)), and (b) matches from the refined confidence volume \(Q'\) (i.e. \(Q'_{ij}>\rho \)). Our pruning networks effectively suppress the false positive confident matches that often occur in ambiguous regions.

The displacement map G can then be used to constrain the plausible search range among all possible matching candidates, but this guidance is valid only for confident pixels (\(i \in \mathcal {S}\)). To guide the search space of the invalid pixels (\(i \notin \mathcal {S}\)) with the help of confident pixels, one could attempt to interpolate the sparse displacement map G using the existing bilinear upsampler of  [18]. However, this cannot be directly realized since the confident matches in \(\mathcal {S}\) are sparsely and irregularly distributed over the spatial dimensions. In this work, we introduce a new differentiable upsampling layer that interpolates the sparse displacement map G into a dense guidance map \(G'\). Concretely, inspired by the moving least squares approach  [42], the displacement vector \(G'_i\) at a pixel i is computed with a spatially-varying weight function w as

$$\begin{aligned} {G'_i} = \sum \nolimits _{s \in \mathcal {S}} {G_{s} \cdot w(s - i)}/\sum \nolimits _{s \in \mathcal {S}} {w(s - i)} \end{aligned}$$
(7)

where \(w(z) = \exp (-||z||^2 / {2{c_P}^2})\) is a Gaussian weight with coefficient \(c_P\). The differentiability of this operator with respect to G can be derived similarly to  [18].
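A minimal NumPy sketch of the propagation module, combining the soft argmax of (6) with the MLS interpolation of (7). It assumes row-major linear pixel indices, a non-empty confident set \(\mathcal {S}\), and dense matrices throughout; every step below is differentiable, as the layer requires.

```python
import numpy as np

def propagate(Qp, h, w, c_P=7.0):
    """Eqs. (6)-(7): sparse guidance displacements and their MLS interpolation.

    Qp: (h*w, h*w) refined confidence volume Q'.
    Returns G' of shape (h*w, 2), a displacement (u, v) per source pixel.
    """
    ys, xs = np.divmod(np.arange(h * w), w)
    coords = np.stack([xs, ys], axis=1).astype(float)      # pixel positions

    valid = Qp.sum(axis=1) != 0                            # the confident set S
    # Eq. (6): soft-argmax expectation over target positions j, minus i.
    soft = np.exp(Qp[valid])
    soft /= soft.sum(axis=1, keepdims=True)
    G = soft @ coords - coords[valid]                      # sparse displacements

    # Eq. (7): moving least squares with Gaussian weights w(s - i).
    diff = coords[:, None, :] - coords[None, valid, :]     # (h*w, |S|, 2)
    wgt = np.exp(-(diff ** 2).sum(-1) / (2 * c_P ** 2))
    return (wgt @ G) / wgt.sum(axis=1, keepdims=True)
```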

Matching Module. With the help of the densely interpolated guidance displacements \(G'\), we refine the initial correlation volume C, retaining only the similarity scores of highly probable matches. Specifically, we compute the refined correlation volume \(C'\) by modulating the original volume C with a Gaussian parametric model centered at the guidance displacement vector \(G'\):

$$\begin{aligned} {C'_{ij}} = \exp ( -{{(j - G'_i)}^2}/{2{{c_M}^2}}) \cdot C_{ij} \end{aligned}$$
(8)

where \(c_M\) adjusts the spread of the Gaussian model. Unlike the existing methods  [19, 20, 23, 26, 28] that constrain the search space with simple heuristics, our method leverages the reliable information propagated from the confident matches to effectively deal with large intra-class geometric variations.
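A sketch of (8) under the earlier conventions. Since \(G'\) stores displacements (Eq. (6) subtracts i), the Gaussian is placed here at the guided target position \(i+G'_i\); this is our reading of the notation, not necessarily the authors' exact implementation.

```python
import numpy as np

def refine_correlation(C, Gp, h, w, c_M=5.0):
    """Eq. (8): modulate C with a Gaussian around the guided match.

    C:  (h*w, h*w) initial correlation volume.
    Gp: (h*w, 2) dense guidance displacement map G'.
    """
    ys, xs = np.divmod(np.arange(h * w), w)
    coords = np.stack([xs, ys], axis=1).astype(float)
    centers = coords + Gp                          # guided target positions
    # Squared distance from every target position j to each center.
    dist2 = ((coords[None, :, :] - centers[:, None, :]) ** 2).sum(-1)
    return np.exp(-dist2 / (2 * c_M ** 2)) * C
```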

With the resulting uni-modal likelihood hypotheses, in which matching ambiguities are significantly reduced, we subsequently formulate matching networks that regress residual displacements at the sub-pixel level, facilitating fine-grained localization. The final dense correspondence field \(\tau \) is computed as

$$\begin{aligned} {\tau _i} = {G'_i} + [{\mathcal{F}(C';{\mathbf{{W}}_M})}]_i \end{aligned}$$
(9)

where \(\mathbf {W}_M\) is the parameters of our matching networks.

4.2 Objective Functions

To overcome the limitation of insufficient training data for semantic correspondence, our matching networks are learned using weak image-level supervision in the form of matching image pairs. Additionally, we expedite the learning process by backpropagating only the gradients of the foreground pixels within the object masks of the source and target images, similar to  [19, 23, 24, 28].

Pruning Networks. To train the pruning networks with the parameters \(\mathbf {W}_P\), we define a novel loss function that consists of a silhouette consistency loss and a geometry consistency loss, such that

$$\begin{aligned} {\mathcal {L}_\mathrm {P}} = {\mathcal {L}_\mathrm {sil}} +\lambda {\mathcal {L}_\mathrm {geo}} \end{aligned}$$
(10)

where \(\lambda \) is the weighting parameter.

With the intuition that local structures between source and target image features should be similar at correct confident correspondences, we encourage the pruning networks to automatically discard matches that do not satisfy the following local geometry consistency constraint:

$$\begin{aligned} {\mathcal {L}_\mathrm {geo}} = \sum \nolimits _{i \in \mathcal {S}} {\sum \nolimits _{l \in \mathcal {N}_i}{||F_l^s-[G' \circ F^t]_l||_F^2}} \end{aligned}$$
(11)

where \(\mathcal {N}_i\) is a local window centered at the pixel i, \(\circ \) is a warping operator, and \(||\cdot ||_F^2\) denotes the squared Frobenius norm. By aggregating the contextual information of \(\mathcal {N}_i\) through the parameters \(\mathbf {W}_P\), we can predict more accurate confidence scores than the hand-crafted criterion of (4), which relies only on pixel-level similarity scores.

Additionally, we formulate the silhouette consistency loss that encourages the refined confidence volume \(Q'\) to lie within the silhouette of the initial volume Q:

$$\begin{aligned} {\mathcal {L}_\mathrm {sil}} = {\sum \nolimits _{\{i,j\} \in \mathcal {S}^*}{|\log ({Q'_{ij}}/{Q_{ij}})|}} \end{aligned}$$
(12)

where \(\mathcal {S}^*=\{i,j|Q_{ij} > \rho \}\); hence \({Q'_{ij}}/{Q_{ij}}\) becomes \([\mathcal {F}(Q; \mathbf {W}_P)]_{ij}\). Note that a similar loss function is used in the object landmark detection literature  [46] to encourage landmarks to lie within the silhouette of the object of interest.

Matching Networks. Thanks to the guidance displacements \(G'\), most geometric deformations are already resolved, and thus computing the residual transformation field \(\mathcal {F}(C';\mathbf {W}_M)\) with the weakly-supervised loss function of  [23] becomes tractable, such that

$$\begin{aligned} {\mathcal {L}_\mathrm {M}} = \sum \nolimits _{i}{-\log (P_i(\tau ))} \end{aligned}$$
(13)

where \(P(\tau )\) is the softmax matching probability defined with a local neighborhood \(\mathcal {M}_i\) as

$$\begin{aligned} P_i(\tau ) = \frac{{\exp ({\langle F^s_i,[\tau \circ {F^t}]_i\rangle })}}{{\sum \nolimits _{l\in \mathcal {M}_i} {\exp ({\langle F^s_i,[\tau \circ {F^t}]_l\rangle })} }}. \end{aligned}$$
(14)

This objective allows us to consider both positive and negative samples by maximizing the similarity score at the correct transformation while minimizing the scores of the remaining candidates within the local neighborhood \(\mathcal {M}_i\).
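Assuming the warped target features \([\tau \circ F^t]\) have already been computed (e.g. by nearest-neighbor warping) and omitting the foreground masking described above, the loss of (13)-(14) can be sketched as:

```python
import numpy as np

def matching_loss(Fs, Ft_warped, h, w, r=5):
    """Eqs. (13)-(14): negative log of the softmax matching probability.

    Fs:        (h*w, d) source features F^s.
    Ft_warped: (h*w, d) target features warped by tau, i.e. [tau o F^t].
    r:         radius of the local neighborhood M_i.
    """
    loss = 0.0
    for i in range(h * w):
        iy, ix = divmod(i, w)
        window = [y * w + x                       # linear indices of M_i
                  for y in range(max(iy - r, 0), min(iy + r + 1, h))
                  for x in range(max(ix - r, 0), min(ix + r + 1, w))]
        scores = Ft_warped[window] @ Fs[i]        # <F^s_i, [tau o F^t]_l>
        pos = Ft_warped[i] @ Fs[i]                # numerator term, l = i
        loss += -(pos - np.log(np.exp(scores).sum()))   # -log P_i(tau)
    return loss / (h * w)
```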

Final Objective Function. We additionally utilize an \(L_1\) regularization loss \(\mathcal {L}_\mathrm {sm}\) for spatial smoothness in the final correspondence field \(\tau \)  [26, 28]. The final objective is defined as a weighted sum of the three presented losses:

$$\begin{aligned} \mathcal {L}_\mathrm {final}=\lambda _\mathrm {P}\mathcal {L}_\mathrm {P} + \lambda _\mathrm {M}\mathcal {L}_\mathrm {M} + \lambda _\mathrm {sm}\mathcal {L}_\mathrm {sm}. \end{aligned}$$
(15)

4.3 Training Details

Inspired by recent works on finding good matches for wide-baseline stereo  [4, 34], we first freeze the network parameters \(\mathbf {W}_F\) and \(\mathbf {W}_M\) and learn the pruning networks \(\mathbf {W}_P\) only with the gradients from \(\mathcal {L}_\mathrm {P}\). This allows the pruning networks to converge stably by fixing the values Q in the silhouette consistency loss (12). In the second stage, we train the whole network in an end-to-end manner with \(\mathcal {L}_\mathrm {final}\), where the properly selected confident matches from the pruning networks boost the convergence of the feature extraction and matching networks by providing well-defined negative samples within the neighborhood \(\mathcal {M}_i\) of the matching loss (14).
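The schedule can be summarized in a short PyTorch-style sketch. The module and loss handles below (feature_net, pruning_net, matching_net, loss_P, loss_final) are hypothetical stand-ins for \(\mathbf {W}_F\), \(\mathbf {W}_P\), \(\mathbf {W}_M\), \(\mathcal {L}_\mathrm {P}\), and \(\mathcal {L}_\mathrm {final}\), and the optimizer and learning rate are illustrative:

```python
import torch

def train_two_stage(feature_net, pruning_net, matching_net,
                    loss_P, loss_final, pairs, lr=1e-4):
    # Stage 1: freeze W_F and W_M so Q in the silhouette loss (12) stays
    # fixed, and learn the pruning networks W_P from L_P alone.
    for p in list(feature_net.parameters()) + list(matching_net.parameters()):
        p.requires_grad = False
    opt = torch.optim.Adam(pruning_net.parameters(), lr=lr)
    for src, tgt in pairs:
        opt.zero_grad()
        loss_P(src, tgt).backward()
        opt.step()

    # Stage 2: unfreeze everything and train end-to-end with L_final (15).
    all_params = (list(feature_net.parameters())
                  + list(pruning_net.parameters())
                  + list(matching_net.parameters()))
    for p in all_params:
        p.requires_grad = True
    opt = torch.optim.Adam(all_params, lr=lr)
    for src, tgt in pairs:
        opt.zero_grad()
        loss_final(src, tgt).backward()
        opt.step()
```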

Following  [20, 26, 40], this two-stage learning procedure first utilizes synthetically generated image pairs, obtained by applying random synthetic transformations to single images from the PASCAL VOC 2012 segmentation dataset  [8] using the split in  [28]. Our networks are then finetuned with semantically similar image pairs from the PF-PASCAL dataset  [12] using the split in  [40].

5 Experimental Results

5.1 Implementation Details

For feature extraction, we used two CNNs as backbone networks: ImageNet  [6]-pretrained ResNet-101  [14] and PASCAL VOC 2012  [8]-pretrained SFNet  [28], where activations are sampled at ‘conv4-23’ and ‘conv5-3’. The activations from ‘conv5-3’ are upsampled using bilinear interpolation. We denote these backbone networks in the following evaluations as “Ours w/ResNet” and “Ours w/SFNet”. We set the threshold \(\rho \) to 0.9 and the variances \(\{c_P,c_M\}\) to \(\{7,5\}\); referring to the ablation study of  [23], the radius of the local window \(\mathcal {M}_i\) is set to 5. More details about the implementation and the performance analysis with respect to the hyper-parameters are provided in the supplemental material.

5.2 Results

PF-WILLOW and PF-PASCAL Dataset. PF-WILLOW dataset  [11] includes 10 object sub-classes with 10 keypoint annotations per image, providing 900 image pairs. PF-PASCAL dataset  [12] contains 1,351 image pairs over 20 object categories with PASCAL keypoint annotations  [3]. Following the split in  [13, 40], we used only the 300 testing image pairs for evaluation. We used the common metric of the percentage of correct keypoints (PCK), computed from the distance between flow-warped keypoints and the ground-truth ones  [31]. A warped keypoint is deemed correct if it lies within \(\alpha \cdot \max (h,w)\) pixels from the ground-truth keypoint for \(\alpha \in [0,1]\), where h and w are the height and width of either the image (\(\alpha _\mathrm {img}\)) or the object bounding box (\(\alpha _\mathrm {bb}\)). PCK with \(\alpha _\mathrm {bb}\) is a more stringent metric than that with \(\alpha _\mathrm {img}\)  [33]. In line with previous works, we used \(\alpha _\mathrm {bb}\) for PF-WILLOW  [11] and \(\alpha _\mathrm {img}\) for PF-PASCAL  [12].
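A minimal sketch of the PCK computation described above, with keypoints given as (n, 2) NumPy arrays of (x, y) coordinates:

```python
import numpy as np

def pck(warped_kps, gt_kps, h, w, alpha=0.1):
    """Percentage of correct keypoints.

    A warped keypoint counts as correct if it lies within
    alpha * max(h, w) pixels of its ground-truth location, where
    (h, w) come from the image (alpha_img) or the object bounding
    box (alpha_bb), depending on the evaluation variant.
    """
    dist = np.linalg.norm(warped_kps - gt_kps, axis=1)
    return float((dist <= alpha * max(h, w)).mean())
```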

Table 1. Matching accuracy compared to state-of-the-art correspondence techniques on PF-WILLOW dataset  [11], PF-PASCAL dataset  [12], and Caltech-101 dataset  [29]. Results of  [13, 39,40,41, 44] are borrowed from  [33].

The average PCK scores are summarized in Table 1, showing that our model (“Ours w/ResNet”) exhibits performance competitive with the latest weakly-supervised and even fully-supervised techniques for semantic correspondence, demonstrating the benefit of generating highly probable hypotheses based on confident matches. When combined with sophisticated CNN features (“Ours w/SFNet”), outstanding performance is attained.

Caltech-101 Dataset. We also evaluated our method on the Caltech-101 dataset  [29], which provides images of 101 object categories with ground-truth object masks. For evaluation, we used the 1,515 image pairs used in  [13, 40], i.e. 15 image pairs for each object category. Compared to the datasets described above, the Caltech-101 dataset  [29] enables us to evaluate performance under more general settings, with image pairs from more diverse classes. Following the experimental protocol in  [22], matching accuracy was evaluated with two metrics: the label transfer accuracy (LT-ACC) and the intersection-over-union (IoU) metric.

In Table 1, our method achieves competitive performance compared to state-of-the-art methods in terms of both the LT-ACC and IoU metrics. In particular, our results show better performance by significant margins compared to the methods  [39,40,41, 44] that consider all possible matching scores.

This reveals the effectiveness of the proposed pruning and propagation modules where only reliable information is propagated and leveraged to reduce the matching ambiguity.

Fig. 5. Qualitative results of semantic alignment on testing pairs of the SPair-71k benchmark  [33]: (a) input image pairs and warped source images using correspondences obtained from our method, and (b) warped source images from state-of-the-art methods; (left) RTNs  [23], (middle) NCNet  [41], (right) SFNet  [28].

Table 2. Matching accuracy compared to the state-of-the-art techniques on SPair-71k benchmark  [33]. Difficulty levels of viewpoints and scales are labeled ‘easy’, ‘medium’, and ‘hard’, while those of truncation and occlusion are indicated by ‘none’, ‘source’, ‘target’, and ‘both’. The performances are evaluated by fixing the levels of other variations as ‘easy’ and ‘none’. Results of  [39,40,41, 44] are borrowed from  [33].

SPair-71k Benchmark. The evaluation was also performed on the SPair-71k benchmark  [33], which includes 70,958 image pairs of 18 object categories from PASCAL 3D+  [48] and PASCAL VOC 2012  [8], providing 12,234 pairs for testing. This benchmark is more challenging than other semantic correspondence datasets  [11, 12, 29], as it covers significantly larger variations across four factors, as shown in Table 2. For the evaluation metric, we used PCK with the threshold set with respect to the object bounding box, \(\alpha _\mathrm {bb}=0.1\).

Table 2 reports the quantitative performance with respect to different levels of the four variation factors; qualitative results are visualized in Fig. 5. As shown in Table 2 and Fig. 5, our results are highly improved, both qualitatively and quantitatively, over the state-of-the-art techniques on all variation factors. In contrast to the methods  [23, 28] that cannot capture large geometric variations due to the simple heuristics used to constrain the search space, the large PCK gain on difficult image pairs in Table 2 indicates that our method is especially effective in the presence of severe appearance and shape variations, thanks to the guidance by the confident matches learned from all matching candidates. Though the performance was evaluated only on the sparsely annotated keypoints provided by the benchmark, the qualitative results in Fig. 5 indicate that the objective measure could be significantly boosted if dense ground-truth annotations were available for evaluation.

5.3 Ablation Study

Lastly, we conducted an ablation study on the different modules and losses of our “Ours w/ResNet” model, evaluating on the testing image pairs of the SPair-71k benchmark  [33].

Network Architecture. We report in Table 3(a) the quantitative assessment, in terms of average PCK at \(\alpha _\mathrm {bb} = 0.1\), when one of our modules is removed from the network architecture. Interestingly, the guidance displacement map \(G'\), i.e. the result obtained with only the pruning and propagation modules, already outperforms state-of-the-art methods by a large margin, as shown in Table 2. The performance degradation caused by removing the pruning or propagation module highlights the importance of the learning-based selection of confident matches and of the MLS layer. Figure 6 shows the intermediate results of our method.

Training Loss. To validate the effectiveness of the utilized losses, we examined the performance of our model when trained with different loss functions. In Table 3(b), the first three rows compare the performance of variants of the pruning networks. The performance gain from 25.1 to 28.5 with \(\mathcal {L}_\mathrm {geo}\) indicates the effectiveness of imposing the local geometry consistency constraint by aggregating contextual information. On the other hand, the performance drop from 28.5 to 24.3 without \(\mathcal {L}_\mathrm {sil}\) demonstrates the importance of regularizing the refined confidence scores to remain similar to the initial ones, so that the retained confident matches also satisfy mutual consistency.

Table 3. Ablation study on the testing pairs of SPair-71k benchmark  [33] for (a) different components and (b) different loss functions. Note that, in (a), when the ‘MLS layer’ in the propagation module is removed, the refined correlation volume \(C'\) is computed by applying Gaussian parametric model only on the confident pixels\(^{a}\).

\(C'_{ij} = \left\{ {\begin{array}{*{20}{l}} {\exp (-{{(j - G_i)}^2}/{2{{c_M}^2}}) \cdot C_{ij},}&{}{\mathrm{{if}} \quad i \in \mathcal {S}}\\ {C_{ij},}&{}{\mathrm{{otherwise.}}} \end{array}} \right. \)

Fig. 6. Visualization of the intermediate results: (a) source and target images, (b) the selected confident matches \(Q'\), (c) matching results with the guidance displacements \(G'\), and (d) matching results with the final correspondence fields \(\tau \).

The last two rows in Table 3(b) reveal the effect of the two-stage learning process. The performance drop from 33.5 to 30.2 when removing the first stage highlights that the properly selected confident matches from the pruning networks boost the convergence of training by allowing only well-defined matching candidates to be utilized during the second stage.

6 Conclusion

We presented a novel framework, guided semantic flow, that reliably infers dense semantic correspondences under large appearance and spatial variations. By taking advantage of the reliable information in confident matches, we effectively handle severe non-rigid geometric deformations and reduce matching ambiguities. The outstanding performance was validated through extensive experiments on various benchmarks.