1 Introduction

Finding pixel-level correspondences across semantically similar images facilitates a variety of computer vision applications, including non-parametric scene parsing  [22, 30, 52], image manipulation  [10, 26, 51], and visual localization  [41, 47], to name a few.

Classical approaches for dense correspondence assume visually similar images captured under constrained settings, such as the 1D epipolar line in stereo matching  [43, 50] and small 2D motion in optical flow estimation  [1, 9]. In contrast, semantic correspondence imposes no such constraints on the input image pairs except that the two images depict the same object or scene category, posing additional challenges due to large intra-class appearance and geometric variations. Recent state-of-the-art methods  [17, 19, 20, 23, 26, 28, 39,40,41, 44] have attempted to address these challenges by carefully designing convolutional neural networks (CNNs) that mimic the classical matching pipeline  [36]: feature extraction, similarity score computation, and correspondence estimation.

Fig. 1. Visualization of our intuition: (a) image pair, (b) selected confident matches, and (c) warped image using the correspondences from our method. The proposed method, guided semantic flow, establishes reliable dense semantic correspondences by leveraging the guidance from confident matches to reduce matching ambiguities.

Since no viewpoint constraint is imposed on the source and target images, the search space for each pixel in the source image must span all pixels of the target image. However, searching over the full set of pairwise matching candidates inevitably increases the uncertainty in the matching pipeline, especially in the presence of non-rigid deformations and repetitive patterns.

One possible approach to this issue is to design additional modules that vote for plausible transformation candidates from the full set of pairwise matches  [17, 39,40,41, 44]. Following the pioneering work of  [39], several methods  [40, 44] attempted to directly regress an image-level global transformation (e.g. affine or thin plate spline) between images. However, all matching scores are treated equally regardless of how confident they are, so these approaches are inherently vulnerable to the inaccurate matching scores that are often produced under severe intra-class variations. Dispensing with global geometry, some methods  [17, 41] recently proposed to identify locally consistent matches by analyzing neighborhood consensus patterns. They down-weight ambiguous matches by assessing the confidence of matching scores, but only with a hand-crafted criterion (e.g. mutual consistency) that often produces high confidence scores even for unconfident pixels.

Alternatively, similar to stereo matching and optical flow estimation  [9, 50], one can simply discard ambiguous matches by constraining the search space to a predefined local region centered at the querying pixel  [20, 26], but these approaches disregard the possibility of non-local matches that often appear across semantically similar images. To address this issue, a dilation technique  [49] was utilized in  [23], but it increases the number of ambiguous matches at the same time. Some methods alleviate this by limiting the search space based on heuristic matching cues, e.g. computing the discrete argmax  [28] or starting with an image-level global transformation  [19] estimated from the full set of pairwise similarity scores. However, such heuristics are often violated under large intra-class variations, where the feature representations are too inconsistent to measure accurate matching similarity, or under non-rigid geometric deformations that cannot be modeled with a global transformation model.

In this paper, we propose a novel approach, dubbed guided semantic flow, that reliably infers dense semantic correspondence fields under large intra-class variations, as illustrated in Fig. 1. Our key idea is based on two observations: sparse yet reliable matches can effectively capture non-rigid geometric variations, and these confident matches can guide adjacent pixels toward similar solution spaces, significantly reducing matching ambiguities. Our method realizes this idea through three modules for pruning, propagation, and matching. We first select confident matches from a complete set of pairwise matching candidates through deep networks, and then propagate their reliable information to invalid neighborhoods through a new differentiable upsampling layer inspired by the moving least squares (MLS) approach  [42]. Lastly, dense correspondence fields are reliably inferred from the refined correlation volume by constraining the search space with a Gaussian parametric model centered at the interpolated displacement vector. Experimental results on various benchmarks demonstrate the effectiveness of the proposed model over the latest methods for dense semantic correspondence.

2 Related Works

Stereo Matching and Optical Flow Estimation. There have been numerous efforts to reduce the matching ambiguity in classical dense correspondence problems, i.e. stereo matching and optical flow estimation.

Based on the seminal work of PatchMatch  [2], the randomized search scheme has been utilized and extended in numerous works thanks to its effectiveness in pruning the search space  [7, 15, 16]. Another popular idea is to leverage the spatial pyramid of an image, naturally imposing a hierarchical smoothness constraint in a coarse-to-fine manner  [5, 38, 45]. Also, in order to enhance matching scores, recent approaches for depth estimation  [35, 37] additionally exploit sparse yet reliable measurements retrieved from an external source (e.g. LiDAR). However, since these approaches are tailored to specific problem constraints such as epipolar geometry and relatively small motion, they are not directly applicable to the semantic correspondence task, where two images may have large variations in appearance and geometry.

Semantic Correspondence. Most conventional methods for semantic correspondence that use hand-crafted features and regularization terms  [22, 30, 32] have provided limited performance due to their low discriminative power. Recent state-of-the-art approaches have used deep CNNs to extract features  [11, 25, 27] and/or spatially regularize correspondence fields in an end-to-end manner  [19, 23, 39, 44].

To deal with large geometric deformations, several approaches  [17, 39,40,41, 44] first computed similarity scores with respect to all possible pairwise matching candidates and then predicted the semantic correspondence through deep networks. As a pioneering work, Rocco et al.  [39, 40] estimated a global geometric model, such as an affine or thin plate spline (TPS) transformation, through a CNN architecture mimicking the traditional matching pipeline. Seo et al.  [44] proposed an offset-aware correlation kernel to pay more attention to reliable similarity scores. Without relying on a global geometric model, Rocco et al.  [41] proposed to identify sets of spatially consistent matches by analyzing neighborhood consensus patterns. Huang et al.  [17] extended this architecture by leveraging context-aware semantic representations to further resolve local ambiguities.

Rather than considering all possible matching candidates, some methods  [19, 20, 23, 26, 28] constrain matching candidates to pre-defined local regions, as in stereo matching and optical flow approaches  [9, 50]. In  [20, 23, 26], locally-varying affine transformation fields are iteratively estimated within a locally constrained cost volume. More recently, Lee et al.  [28] proposed to leverage a kernel soft argmax function to deal with multi-modal distributions within a correlation volume.

The most relevant method to ours is  [19], which utilizes intermediate results from the previous level to constrain the search space of the current level in a coarse-to-fine manner. However, it starts with a global affine transformation estimate that often fails to capture reliable matches under large geometric variations with non-rigid transformations.

3 Problem Statement

Let us denote semantically similar source and target images as \(I^s\) and \(I^t\), respectively. The objective is to establish a two-dimensional correspondence field \(\tau _{i}=[u_i,v_i]^T\) between the two images, defined for each pixel \(i=[i_{\mathbf {x}},i_{\mathbf {y}}]^T\) in \(I^s\).

Analogously to the classical matching pipeline  [36], this objective involves first extracting dense feature maps from \(I^s\) and \(I^t\), denoted by \(F^s,F^t\in \mathbb {R}^{h \times w \times d}\), where (h, w) denotes the spatial resolution and d the dimensionality of the features. Then, given the two dense feature maps, a correlation volume C is computed by encoding the cosine similarity:

$$\begin{aligned} C_{ij}(F^s, F^t)={\langle F^s_{i}, F^t_{j}\rangle }/{{\Vert F^s_{i}\Vert }_2 {\Vert F^t_{j}\Vert }_2} \end{aligned}$$
(1)

where i and j indicate the individual feature position in the source and target images, respectively.
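For concreteness, (1) can be written in a few lines of NumPy. This is a minimal sketch, assuming the feature maps are given as (h, w, d) arrays and pixels are indexed linearly in row-major order; it is not the authors' implementation.

```python
import numpy as np

def correlation_volume(Fs, Ft, eps=1e-8):
    """Cosine-similarity correlation volume of Eq. (1).

    Fs, Ft: (h, w, d) source/target feature maps.
    Returns C of shape (h*w, h*w), where C[i, j] scores the match
    between source position i and target position j.
    """
    h, w, d = Fs.shape
    Fs = Fs.reshape(h * w, d)
    Ft = Ft.reshape(h * w, d)
    # L2-normalize each feature vector, then take all pairwise inner products.
    Fs = Fs / (np.linalg.norm(Fs, axis=1, keepdims=True) + eps)
    Ft = Ft / (np.linalg.norm(Ft, axis=1, keepdims=True) + eps)
    return Fs @ Ft.T
```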

Fig. 2. (a) Given an image pair and a reference pixel i, we visualize its corresponding match (\(j=\mathrm {argmax}_l(C_{il})\)) and correlation score map (\(C_{il}\)), computed with: (b) matching candidates \(\mathcal {J}_i^f\)  [17, 39,40,41, 44], (c) matching candidates \(\mathcal {J}_i^p\)  [19, 20, 23, 26, 28], and (d) the proposed method. (e) Our key observation is that sparse yet reliable matches can guide the adjacent pixels to have similar solution spaces, reducing matching ambiguities significantly.

In this stage, several methods  [17, 39,40,41, 44] construct a full correlation volume \(C^f\) considering a set of all possible matching candidates \(\mathcal {J}_i^f\), such that

$$\begin{aligned} \mathcal {J}_i^f = \{j|j_\mathbf {x} \in [1,...,w],j_\mathbf {y} \in [1,...,h]\}. \end{aligned}$$
(2)

Note that \(\mathcal {J}_i^f\) is independent of the pixel i and identical for all pixels i. However, as exemplified in Fig. 2(b), the similarity scores in \(C^f\) are not guaranteed to be accurate due to inconsistent feature representations under large semantic variations. To address this, several approaches  [39, 40, 44] design an additional module that votes for transformation candidates by regressing a single image-level transformation, but they treat the matching scores of all pixels evenly regardless of their confidence. While some methods  [17, 41] alleviate this by filtering the correlation volume with a mutual consistency constraint, they assess the confidences with a simple criterion such as maximum normalization, which may lack the robustness attainable with deep CNNs.

Meanwhile, as shown in Fig. 2(c), some approaches  [19, 20, 23, 26, 28] construct a partial correlation volume \(C^p\) by constraining the search space of each reference pixel i to a restricted local region \(\mathcal {N}_{k}\) centered at the pixel k in the target image. Formally, denoting the pixel k that depends on pixel i as k(i), the constrained matching candidates \(\mathcal {J}_i^p\) can be defined as

$$\begin{aligned} \mathcal {J}_i^p = \{j|j\in \mathcal {N}_{k(i)}\}. \end{aligned}$$
(3)

The center of the local region, k(i), is determined in various ways: as the reference pixel i itself (\(k(i)=i\))  [20, 23, 26], by applying the discrete argmax function to the fully constructed correlation volume  [28] (\(k(i)=\text {argmax}_j(C^f_{ij})\)), or by estimating an image-level coarse transformation \(\tau ^g(C^f)\)  [19] (\(k(i)=i+\tau _i^g(C^f)\)). However, as exemplified in Fig. 2(c), these approaches often fail to constrain the search space correctly under large intra-class variations, where the feature representations between the two input images are too inconsistent to measure accurate matching scores, or under complex geometric deformations that cannot be modeled with a global affine transformation model.
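The difference between the two candidate sets is easy to state in code. The sketch below uses hypothetical helper names and (x, y) coordinate tuples; it enumerates \(\mathcal {J}_i^f\) of (2) and \(\mathcal {J}_i^p\) of (3) for an arbitrary window center k(i).

```python
def full_candidates(h, w):
    """J_i^f of Eq. (2): every target position; identical for every i."""
    return [(x, y) for y in range(h) for x in range(w)]

def local_candidates(k, h, w, r):
    """J_i^p of Eq. (3): an r-radius window N_k around k = (kx, ky).

    k may be the reference pixel itself (k(i) = i), the discrete argmax
    of C^f over j, or i shifted by a global transformation tau^g, as in
    the methods cited above.
    """
    kx, ky = k
    return [(x, y)
            for y in range(max(ky - r, 0), min(ky + r + 1, h))
            for x in range(max(kx - r, 0), min(kx + r + 1, w))]
```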

4 Guided Semantic Flow

The proposed method leverages guidance cues from confident matches to generate reliable likelihood matching hypotheses, as illustrated in Fig. 2(d). Unlike existing methods that alleviate matching ambiguities with inaccurately assessed matching confidences  [17, 41] or with heuristically constrained search spaces  [19, 20, 23, 26, 28], we address this issue with a learning-based selection of confident matches and their propagation, significantly reducing matching ambiguities while maintaining robustness to large geometric variations.

Fig. 3. (a) Our overall framework consists of pruning, propagation, and matching modules. (b) The pruning module takes a full correlation volume C as input and predicts pairwise confidence scores \(Q'\) from it, retaining confident matches and rejecting ambiguous ones with the parameters \(\mathbf {W}_P\). The propagation module converts this volume \(Q'\) into a dense guidance map \(G'\) in a fully differentiable manner. The matching module refines the initial correlation volume C with the guidance map \(G'\) and then estimates a dense correspondence field \(\tau \) with the parameters \(\mathbf {W}_M\).

4.1 Network Architecture

The proposed method consists of three modules, as illustrated in Fig. 3: a pruning module that estimates the confidence probability volume \(Q'\), a propagation module that converts the confidence probability volume into a guidance displacement map \(G'\), and a matching module that refines the initial correlation volume and estimates dense correspondence fields \(\tau \) from it.

To extract convolutional feature maps of the source and target images, the input images are passed through shared feature extraction networks with parameters \(\mathbf {W}_F\), such that \(F=\mathcal {F}(I;\mathbf {W}_F)\), where \(\mathcal {F}\) denotes a feed-forward operation. The initial correlation volume \(C^f\) is then constructed over all possible pairwise matching candidates, following (1) and (2), to account for large intra-class geometric deformations.

Pruning Module. To establish an initial set of confidence probabilities over all pairwise matches, we adopt a differentiable mutual consistency criterion  [17, 41], such that

$$\begin{aligned} {Q_{ij}} = \frac{{{{({C_{ij}})}^2}}}{{{{\max }_i}{C_{ij}} \cdot {{\max }_j}{C_{ij}}}} \end{aligned}$$
(4)

where \({Q_{ij}}\) equals one if and only if the match between i and j satisfies the mutual consistency constraint, and is smaller than one otherwise. Recent works  [17, 41] utilized this confidence volume Q to filter their similarity scores C (e.g. \(Q \cdot C\)), but the confidence of each pixel is assessed only with the hand-crafted criterion of (4), thus often producing a high confidence score even for an unconfident pixel, as exemplified in Fig. 4(a).

In this work, we propose to refine the initial confidence volume with pruning networks that consist of an encoder-decoder architecture followed by a sigmoid function, yielding values in (0, 1) to suppress false positives, as exemplified in Fig. 4(b). Formally, the refined confidence probability volume \(Q'\) is obtained by

$$\begin{aligned} {Q'_{ij}}=T(Q_{ij} \cdot [\mathcal {F}(Q;\mathbf {W}_P)]_{ij},\rho ) \end{aligned}$$
(5)

where \(\mathbf {W}_P\) denotes the parameters of the pruning networks and \(T(\cdot ,\rho )\) is a truncation function that discards probabilities lower than a threshold \(\rho \) to retain only confident matches, such that \(T(X,\rho )=X\) if \(X>\rho \) and \(T(X,\rho )=0\) otherwise.
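Under the same NumPy conventions as before, (4) and (5) amount to the following sketch, where the encoder-decoder \(\mathcal {F}(Q;\mathbf {W}_P)\) is abstracted into a hypothetical callable pruning_net:

```python
import numpy as np

def mutual_consistency(C, eps=1e-8):
    """Initial confidence volume Q of Eq. (4).

    Q[i, j] reaches 1 only when C[i, j] is simultaneously the maximum
    over i (its column) and over j (its row), i.e. a mutually
    consistent match.
    """
    max_over_i = C.max(axis=0, keepdims=True)  # best source score per target j
    max_over_j = C.max(axis=1, keepdims=True)  # best target score per source i
    return C ** 2 / (max_over_i * max_over_j + eps)

def refined_confidence(C, pruning_net, rho=0.9):
    """Q' of Eq. (5): rescale Q by the pruning-network output and
    truncate everything at or below the threshold rho."""
    Q = mutual_consistency(C)
    Qp = Q * pruning_net(Q)        # pruning_net: hypothetical F(Q; W_P)
    return np.where(Qp > rho, Qp, 0.0)
```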

It should be noted that several works have also attempted to find reliable correspondences from the full pairwise similarity scores by thresholding  [40], by correspondence consistency  [19], or by learning with a probabilistic model  [20]. However, these constraints are used only in the loss functions as supervision for training their deep networks, and are not explicitly used to refine the correlation volume.

Propagation Module. Taking the refined confidence volume \(Q'\) as input, our propagation module first extracts the displacement vectors of the confident matches, which can guide nearby ambiguous pixels toward a similar solution space. Specifically, given the set of collected confident pixels \(\mathcal {S} = \{ i|\sum \nolimits _j {{{Q}'_{ij}}} \ne 0\}\), our propagation module converts the confidence volume \(Q'\) into a 2-dimensional displacement map G through a soft argmax layer  [21], such that

$$\begin{aligned} {G_i = \left\{ {\begin{array}{*{20}{l}} {\sum \nolimits _j {{{j \cdot \exp ({{ Q}'_{ij}})}}/{{\sum \nolimits _l {\exp ({{Q}'_{il}})} }}}-i ,}&{}{\mathrm{{if}} \quad i \in \mathcal {S}}\\ \mathrm{{invalid},}&{}{\mathrm{{otherwise.}}} \end{array}} \right. } \end{aligned}$$
(6)
Fig. 4. The effectiveness of the pruning networks: (a) matches that satisfy the mutual consistency criterion (i.e. \(Q_{ij}=1\)), and (b) matches from the refined confidence volume \(Q'\) (i.e. \(Q'_{ij}>\rho \)). Our pruning networks effectively suppress the false positive confident matches that often occur in ambiguous regions.

The displacement map G can then be used to constrain the plausible search range among all possible matching candidates, but this guidance is valid only for confident pixels (\(i \in \mathcal {S}\)). To guide the search space of the invalid pixels (\(i \notin \mathcal {S}\)) with the help of confident pixels, one could attempt to interpolate the sparse displacement map G using the existing bilinear upsampler of  [18]. However, this cannot be directly realized since the confident matches in \(\mathcal {S}\) are sparsely and irregularly distributed over the spatial dimensions. In this work, we introduce a new differentiable upsampling layer that interpolates the sparse displacement map G into a dense guidance map \(G'\). Concretely, inspired by the moving least squares approach  [42], the displacement vector \(G'_i\) at a pixel i is computed with a spatially-varying weight function w as

$$\begin{aligned} {G'_i} = \sum \nolimits _{s \in \mathcal {S}} {G_{s} \cdot w(s - i)}/\sum \nolimits _{s \in \mathcal {S}} {w(s - i)} \end{aligned}$$
(7)

where \(w(z) = \exp (-||z||^2 / {2{c_P}^2})\) is a Gaussian weight with coefficient \(c_P\). The differentiability of this operator with respect to G can be derived similarly to  [18].
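A minimal NumPy sketch of the propagation module, combining the soft argmax of (6) with the MLS interpolation of (7). It assumes row-major linear pixel indices, a non-empty confident set \(\mathcal {S}\), and dense matrices throughout; every step below is differentiable, as the layer requires.

```python
import numpy as np

def propagate(Qp, h, w, c_P=7.0):
    """Eqs. (6)-(7): sparse guidance displacements and their MLS interpolation.

    Qp: (h*w, h*w) refined confidence volume Q'.
    Returns G' of shape (h*w, 2), a displacement (u, v) per source pixel.
    """
    ys, xs = np.divmod(np.arange(h * w), w)
    coords = np.stack([xs, ys], axis=1).astype(float)      # pixel positions

    valid = Qp.sum(axis=1) != 0                            # the confident set S
    # Eq. (6): soft-argmax expectation over target positions j, minus i.
    soft = np.exp(Qp[valid])
    soft /= soft.sum(axis=1, keepdims=True)
    G = soft @ coords - coords[valid]                      # sparse displacements

    # Eq. (7): moving least squares with Gaussian weights w(s - i).
    diff = coords[:, None, :] - coords[None, valid, :]     # (h*w, |S|, 2)
    wgt = np.exp(-(diff ** 2).sum(-1) / (2 * c_P ** 2))
    return (wgt @ G) / wgt.sum(axis=1, keepdims=True)
```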

Matching Module. With the help of the densely interpolated guidance displacements \(G'\), we refine the initial correlation volume C, retaining only the similarity scores of highly probable matches. Specifically, we compute the refined correlation volume \(C'\) by modulating the original volume C with a Gaussian parametric model centered at the guidance displacement vector \(G'\):

$$\begin{aligned} {C'_{ij}} = \exp ( -{{(j - G'_i)}^2}/{2{{c_M}^2}}) \cdot C_{ij} \end{aligned}$$
(8)

where \(c_M\) adjusts the spread of the Gaussian model. Unlike the existing methods  [19, 20, 23, 26, 28] that constrain the search space with simple heuristics, our method leverages the reliable information propagated from the confident matches to effectively deal with large intra-class geometric variations.
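A sketch of (8) under the earlier conventions. Since \(G'\) stores displacements (Eq. (6) subtracts i), the Gaussian is placed here at the guided target position \(i+G'_i\); this is our reading of the notation, not necessarily the authors' exact implementation.

```python
import numpy as np

def refine_correlation(C, Gp, h, w, c_M=5.0):
    """Eq. (8): modulate C with a Gaussian around the guided match.

    C:  (h*w, h*w) initial correlation volume.
    Gp: (h*w, 2) dense guidance displacement map G'.
    """
    ys, xs = np.divmod(np.arange(h * w), w)
    coords = np.stack([xs, ys], axis=1).astype(float)
    centers = coords + Gp                          # guided target positions
    # Squared distance from every target position j to each center.
    dist2 = ((coords[None, :, :] - centers[:, None, :]) ** 2).sum(-1)
    return np.exp(-dist2 / (2 * c_M ** 2)) * C
```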

With the resulting uni-modal likelihood hypotheses, in which matching ambiguities are significantly reduced, we subsequently formulate matching networks that regress residual displacements at the sub-pixel level, facilitating fine-grained localization. The final dense correspondence field \(\tau \) is computed as

$$\begin{aligned} {\tau _i} = {G'_i} + [{\mathcal{F}(C';{\mathbf{{W}}_M})}]_i \end{aligned}$$
(9)

where \(\mathbf {W}_M\) is the parameters of our matching networks.

4.2 Objective Functions

To overcome the limitation of insufficient training data for semantic correspondence, our matching networks are learned using weak image-level supervision in the form of matching image pairs. Additionally, we expedite the learning process by backpropagating only the gradients of the foreground pixels within the object masks of the source and target images, similar to  [19, 23, 24, 28].

Pruning Networks. To train the pruning networks with the parameters \(\mathbf {W}_P\), we define a novel loss function that consists of a silhouette consistency loss and a geometry consistency loss, such that

$$\begin{aligned} {\mathcal {L}_\mathrm {P}} = {\mathcal {L}_\mathrm {sil}} +\lambda {\mathcal {L}_\mathrm {geo}} \end{aligned}$$
(10)

where \(\lambda \) is the weighting parameter.

With the intuition that local structures between source and target image features should be similar at correct confident correspondences, we encourage the pruning networks to automatically discard matches that do not satisfy the following local geometry consistency constraint:

$$\begin{aligned} {\mathcal {L}_\mathrm {geo}} = \sum \nolimits _{i \in \mathcal {S}} {\sum \nolimits _{l \in \mathcal {N}_i}{||F_l^s-[G' \circ F^t]_l||_F^2}} \end{aligned}$$
(11)

where \(\mathcal {N}_i\) is a local window centered at the pixel i, \(\circ \) is a warping operator, and \(||\cdot ||_F^2\) denotes the squared Frobenius norm. By aggregating the contextual information of \(\mathcal {N}_i\) through the parameters \(\mathbf {W}_P\), we can predict more accurate confidence scores than the hand-crafted criterion of (4), which relies only on pixel-level similarity scores.

Additionally, we formulate the silhouette consistency loss that encourages the refined confidence volume \(Q'\) to lie within the silhouette of the initial volume Q:

$$\begin{aligned} {\mathcal {L}_\mathrm {sil}} = {\sum \nolimits _{\{i,j\} \in \mathcal {S}^*}{|\log ({Q'_{ij}}/{Q_{ij}})|}} \end{aligned}$$
(12)

where \(\mathcal {S}^*=\{i,j|Q_{ij} > \rho \}\); hence \({Q'_{ij}}/{Q_{ij}}\) becomes \([\mathcal {F}(Q; \mathbf {W}_P)]_{ij}\). Note that a similar loss function is used in the object landmark detection literature  [46] to encourage landmarks to lie within the silhouette of the object of interest.

Matching Networks. Thanks to the guidance displacements \(G'\), most geometric deformations are already resolved, and thus computing the residual transformation field \(\mathcal {F}(C';\mathbf {W}_M)\) with the weakly-supervised loss function of  [23] becomes tractable, such that

$$\begin{aligned} {\mathcal {L}_\mathrm {M}} = \sum \nolimits _{i}{-\log (P_i(\tau ))} \end{aligned}$$
(13)

where \(P(\tau )\) is the softmax matching probability defined with a local neighborhood \(\mathcal {M}_i\) as

$$\begin{aligned} P_i(\tau ) = \frac{{\exp ({\langle F^s_i,[\tau \circ {F^t}]_i\rangle })}}{{\sum \nolimits _{l\in \mathcal {M}_i} {\exp ({\langle F^s_i,[\tau \circ {F^t}]_l\rangle })} }}. \end{aligned}$$
(14)

This objective allows us to consider both positive and negative samples by maximizing the similarity score at the correct transformation while minimizing the scores of the remaining candidates within the local neighborhood \(\mathcal {M}_i\).
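Assuming the warped target features \([\tau \circ F^t]\) have already been computed (e.g. by nearest-neighbor warping) and omitting the foreground masking described above, the loss of (13)-(14) can be sketched as:

```python
import numpy as np

def matching_loss(Fs, Ft_warped, h, w, r=5):
    """Eqs. (13)-(14): negative log of the softmax matching probability.

    Fs:        (h*w, d) source features F^s.
    Ft_warped: (h*w, d) target features warped by tau, i.e. [tau o F^t].
    r:         radius of the local neighborhood M_i.
    """
    loss = 0.0
    for i in range(h * w):
        iy, ix = divmod(i, w)
        window = [y * w + x                       # linear indices of M_i
                  for y in range(max(iy - r, 0), min(iy + r + 1, h))
                  for x in range(max(ix - r, 0), min(ix + r + 1, w))]
        scores = Ft_warped[window] @ Fs[i]        # <F^s_i, [tau o F^t]_l>
        pos = Ft_warped[i] @ Fs[i]                # numerator term, l = i
        loss += -(pos - np.log(np.exp(scores).sum()))   # -log P_i(tau)
    return loss / (h * w)
```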

Final Objective Function. We additionally utilize an \(L_1\) regularization loss \(\mathcal {L}_\mathrm {sm}\) for spatial smoothness in the final correspondence field \(\tau \)  [26, 28]. The final objective is defined as a weighted sum of the three presented losses:

$$\begin{aligned} \mathcal {L}_\mathrm {final}=\lambda _\mathrm {P}\mathcal {L}_\mathrm {P} + \lambda _\mathrm {M}\mathcal {L}_\mathrm {M} + \lambda _\mathrm {sm}\mathcal {L}_\mathrm {sm}. \end{aligned}$$
(15)

4.3 Training Details

Inspired by recent works on finding good matches for wide-baseline stereo  [4, 34], we first freeze the network parameters \(\mathbf {W}_F\) and \(\mathbf {W}_M\) and learn the pruning networks \(\mathbf {W}_P\) only with the gradients from \(\mathcal {L}_\mathrm {P}\). This allows the pruning networks to converge stably by fixing the values Q in the silhouette consistency loss (12). In the second stage, we train the whole network in an end-to-end manner with \(\mathcal {L}_\mathrm {final}\), where the properly selected confident matches from the pruning networks boost the convergence of the feature extraction and matching networks by providing well-defined negative samples within the neighborhood \(\mathcal {M}_i\) of the matching loss (14).
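The schedule can be summarized in a short PyTorch-style sketch. The module and loss handles below (feature_net, pruning_net, matching_net, loss_P, loss_final) are hypothetical stand-ins for \(\mathbf {W}_F\), \(\mathbf {W}_P\), \(\mathbf {W}_M\), \(\mathcal {L}_\mathrm {P}\), and \(\mathcal {L}_\mathrm {final}\), and the optimizer and learning rate are illustrative:

```python
import torch

def train_two_stage(feature_net, pruning_net, matching_net,
                    loss_P, loss_final, pairs, lr=1e-4):
    # Stage 1: freeze W_F and W_M so Q in the silhouette loss (12) stays
    # fixed, and learn the pruning networks W_P from L_P alone.
    for p in list(feature_net.parameters()) + list(matching_net.parameters()):
        p.requires_grad = False
    opt = torch.optim.Adam(pruning_net.parameters(), lr=lr)
    for src, tgt in pairs:
        opt.zero_grad()
        loss_P(src, tgt).backward()
        opt.step()

    # Stage 2: unfreeze everything and train end-to-end with L_final (15).
    all_params = (list(feature_net.parameters())
                  + list(pruning_net.parameters())
                  + list(matching_net.parameters()))
    for p in all_params:
        p.requires_grad = True
    opt = torch.optim.Adam(all_params, lr=lr)
    for src, tgt in pairs:
        opt.zero_grad()
        loss_final(src, tgt).backward()
        opt.step()
```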

Following  [20, 26, 40], this two-stage learning procedure first utilizes synthetically generated image pairs, obtained by applying random synthetic transformations to single images from the PASCAL VOC 2012 segmentation dataset  [8] using the split in  [28]. Our networks are then finetuned with semantically similar image pairs from the PF-PASCAL dataset  [12] using the split in  [40].

5 Experimental Results

5.1 Implementation Details

For feature extraction, we used two CNNs as backbone networks: ImageNet  [6]-pretrained ResNet-101  [14] and PASCAL VOC 2012  [8]-pretrained SFNet  [28], where activations are sampled at ‘conv4-23’ and ‘conv5-3’. The activations from ‘conv5-3’ are upsampled using bilinear interpolation. We denote these backbone networks in the following evaluations as “Ours w/ResNet” and “Ours w/SFNet”. We set the threshold \(\rho \) to 0.9 and the variances \(\{c_P,c_M\}\) to \(\{7,5\}\); referring to the ablation study of  [23], the radius of the local window \(\mathcal {M}_i\) is set to 5. More details about the implementation and the performance analysis with respect to the hyper-parameters are provided in the supplemental material.

5.2 Results

PF-WILLOW and PF-PASCAL Dataset. PF-WILLOW dataset  [11] includes 10 object sub-classes with 10 keypoint annotations per image, providing 900 image pairs. PF-PASCAL dataset  [12] contains 1,351 image pairs over 20 object categories with PASCAL keypoint annotations  [3]. Following the split in  [13, 40], we used only the 300 testing image pairs for evaluation. We used the common metric of the percentage of correct keypoints (PCK), computed from the distance between flow-warped keypoints and the ground-truth ones  [31]. A warped keypoint is deemed correct if it lies within \(\alpha \cdot \max (h,w)\) pixels from the ground-truth keypoint for \(\alpha \in [0,1]\), where h and w are the height and width of either the image (\(\alpha _\mathrm {img}\)) or the object bounding box (\(\alpha _\mathrm {bb}\)). PCK with \(\alpha _\mathrm {bb}\) is a more stringent metric than that with \(\alpha _\mathrm {img}\)  [33]. In line with previous works, we used \(\alpha _\mathrm {bb}\) for PF-WILLOW  [11] and \(\alpha _\mathrm {img}\) for PF-PASCAL  [12].
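A minimal sketch of the PCK computation described above, with keypoints given as (n, 2) NumPy arrays of (x, y) coordinates:

```python
import numpy as np

def pck(warped_kps, gt_kps, h, w, alpha=0.1):
    """Percentage of correct keypoints.

    A warped keypoint counts as correct if it lies within
    alpha * max(h, w) pixels of its ground-truth location, where
    (h, w) come from the image (alpha_img) or the object bounding
    box (alpha_bb), depending on the evaluation variant.
    """
    dist = np.linalg.norm(warped_kps - gt_kps, axis=1)
    return float((dist <= alpha * max(h, w)).mean())
```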

Table 1. Matching accuracy compared to state-of-the-art correspondence techniques on PF-WILLOW dataset  [11], PF-PASCAL dataset  [12], and Caltech-101 dataset  [29]. Results of  [13, 39,40,41, 44] are borrowed from  [33].

The average PCK scores are summarized in Table 1, showing that our model (“Ours w/ResNet”) exhibits performance competitive with the latest weakly-supervised and even fully-supervised techniques for semantic correspondence, demonstrating the benefit of generating highly probable hypotheses based on confident matches. When combined with sophisticated CNN features (“Ours w/SFNet”), outstanding performance is attained.

Caltech-101 Dataset. We also evaluated our method on the Caltech-101 dataset  [29], which provides images of 101 object categories with ground-truth object masks. For evaluation, we used the 1,515 image pairs used in  [13, 40], i.e. 15 image pairs for each object category. Compared to the datasets described above, the Caltech-101 dataset  [29] enables us to evaluate performance under more general settings, with image pairs from more diverse classes. Following the experimental protocol in  [22], matching accuracy was evaluated with two metrics: the label transfer accuracy (LT-ACC) and the intersection-over-union (IoU) metric.

In Table 1, our method achieves competitive performance compared to state-of-the-art methods in terms of both the LT-ACC and IoU metrics. In particular, our results show better performance by significant margins compared to the methods  [39,40,41, 44] that consider all possible matching scores.

This reveals the effectiveness of the proposed pruning and propagation modules where only reliable information is propagated and leveraged to reduce the matching ambiguity.

Fig. 5. Qualitative results of semantic alignment on testing pairs of the SPair-71k benchmark  [33]: (a) input image pairs and warped source images using correspondences obtained from our method, and (b) warped source images from state-of-the-art methods; (left) RTNs  [23], (middle) NCNet  [41], (right) SFNet  [28].

Table 2. Matching accuracy compared to the state-of-the-art techniques on SPair-71k benchmark  [33]. Difficulty levels of viewpoints and scales are labeled ‘easy’, ‘medium’, and ‘hard’, while those of truncation and occlusion are indicated by ‘none’, ‘source’, ‘target’, and ‘both’. The performances are evaluated by fixing the levels of other variations as ‘easy’ and ‘none’. Results of  [39,40,41, 44] are borrowed from  [33].

SPair-71k Benchmark. The evaluation was also performed on the SPair-71k benchmark  [33], which includes 70,958 image pairs of 18 object categories from PASCAL 3D+  [48] and PASCAL VOC 2012  [8], providing 12,234 pairs for testing. This benchmark is more challenging than other semantic correspondence datasets  [11, 12, 29], as it covers significantly larger variations across four factors, as shown in Table 2. For the evaluation metric, we used PCK with the threshold set with respect to the object bounding box, \(\alpha _\mathrm {bb}=0.1\).

Table 2 reports the quantitative performance with respect to different levels of the four variation factors; qualitative results are visualized in Fig. 5. As shown in Table 2 and Fig. 5, our results are highly improved, both qualitatively and quantitatively, over the state-of-the-art techniques on all variation factors. In contrast to the methods  [23, 28] that cannot capture large geometric variations due to the simple heuristics used to constrain the search space, the large PCK gain on difficult image pairs in Table 2 indicates that our method is especially effective in the presence of severe appearance and shape variations, thanks to the guidance by the confident matches learned from all matching candidates. Though the performance was evaluated only on the sparsely annotated keypoints provided by the benchmark, the qualitative results in Fig. 5 indicate that the objective measure could be significantly boosted if dense ground-truth annotations were available for evaluation.

5.3 Ablation Study

Lastly, we conducted an ablation study on the different modules and losses of our “Ours w/ResNet” model, evaluating on the testing image pairs of the SPair-71k benchmark  [33].

Network Architecture. We report in Table 3(a) the quantitative assessment, in terms of average PCK at \(\alpha _\mathrm {bb} = 0.1\), when one of our modules is removed from the network architecture. Interestingly, the guidance displacement map \(G'\), i.e. the result obtained with only the pruning and propagation modules, already outperforms state-of-the-art methods by a large margin, as shown in Table 2. The performance degradation caused by removing the pruning or propagation module highlights the importance of the learning-based selection of confident matches and of the MLS layer. Figure 6 shows the intermediate results of our method.

Training Loss. To validate the effectiveness of the utilized losses, we examined the performance of our model when trained with different loss functions. In Table 3(b), the first three rows compare the performance of variants of the pruning networks. The performance gain from 25.1 to 28.5 with \(\mathcal {L}_\mathrm {geo}\) indicates the effectiveness of imposing the local geometry consistency constraint by aggregating contextual information. On the other hand, the performance drop from 28.5 to 24.3 without \(\mathcal {L}_\mathrm {sil}\) demonstrates the importance of regularizing the refined confidence scores to remain similar to the initial ones, so that the retained confident matches also satisfy mutual consistency.

Table 3. Ablation study on the testing pairs of SPair-71k benchmark  [33] for (a) different components and (b) different loss functions. Note that, in (a), when the ‘MLS layer’ in the propagation module is removed, the refined correlation volume \(C'\) is computed by applying Gaussian parametric model only on the confident pixels\(^{a}\).

\(C'_{ij} = \left\{ {\begin{array}{*{20}{l}} {\exp (-{{(j - G_i)}^2}/{2{{c_M}^2}}) \cdot C_{ij},}&{}{\mathrm{{if}} \quad i \in \mathcal {S}}\\ {C_{ij},}&{}{\mathrm{{otherwise.}}} \end{array}} \right. \)

Fig. 6. Visualization of the intermediate results: (a) source and target images, (b) the selected confident matches \(Q'\), (c) matching results with the guidance displacements \(G'\), and (d) matching results with the final correspondence fields \(\tau \).

The last two rows in Table 3(b) reveal the effect of the two-stage learning process. The performance drop from 33.5 to 30.2 when removing the first stage highlights that the properly selected confident matches from the pruning networks boost the convergence of training by allowing only well-defined matching candidates to be utilized during the second stage.

6 Conclusion

We presented a novel framework, guided semantic flow, that reliably infers dense semantic correspondences under large appearance and spatial variations. By taking advantage of the reliable information in confident matches, we effectively handle severe non-rigid geometric deformations and reduce matching ambiguities. The outstanding performance was validated through extensive experiments on various benchmarks.