1 Introduction

Estimating dense optical flow is a fundamental problem in computer vision. Despite significant progress over the last decades, realistic scenes with displacements of several hundred pixels, strong changes in illumination and textureless or specular regions remain challenging to date [9, 17]. Traditionally, dense optical flow estimation has been formulated as a continuous optimization problem [20, 25], and many of today’s most successful methods leverage elaborate variants of the original formulation, allowing for more robust penalties or improving optimization. As continuous methods typically require linearizing the highly non-convex data term, they only permit the estimation of very small displacements up to a few pixels. Thus, in order to handle large displacements in real-world videos, a simple heuristic is often employed: Optical flow is estimated in a coarse-to-fine manner, thereby guaranteeing an upper bound to the maximal displacement at each level of the image pyramid. Unfortunately, this strategy is highly susceptible to local minima as small structures and textural details vanish at coarse image resolutions, leading to oversmoothing artifacts in the estimated flow field.

In contrast to optical flow, the most successful approaches to stereo matching typically rely on discrete inference in graphical models. While such models are loopy by nature and thus lead to NP-hard optimization problems, good approximate solutions can often be efficiently computed using graph cuts, belief propagation or mean field approximations. Importantly, no image pyramids are required as the full data cost volume is considered at the same time during inference. Unfortunately, the application of discrete methods to the problem of optical flow is not straightforward and hence there exists only relatively little work in this direction. The main reason for this is the huge size of the label space which needs to be considered for the 2D large-displacement optical flow problem as opposed to the 1D stereo problem.

Fig. 1. Strategies for Efficient Discrete Optical Flow. Left: We create a large set of diverse flow proposals per pixel (red node) by combining nearest neighbors in feature space from a set of grid cells (green nodes) with winner-takes-all solutions from neighboring pixels (blue nodes). The red square indicates the search region. Middle: We apply block coordinate descent, iteratively optimizing all image rows and columns (red) conditioned on neighboring blocks (white) via dynamic programming. Right: Taking advantage of robust penalties, we reduce pairwise computation costs by pre-computing the set of non-truncated (\(<\tau _\psi \)) neighboring flow proposals (black) for each flow vector (red) (Color figure online).

In this paper, we propose three different strategies to make discrete inference in a pairwise conditional random field (CRF) applicable to the estimation of optical flow, see Fig. 1 for an illustration. First, we restrict the label set by considering only the L most likely matches per pixel which we obtain via approximate nearest neighbor search in feature space subject to non-maxima suppression constraints. To validate this restriction, we experimentally show that the oracle solution of the restricted set outperforms all existing optical flow techniques by a significant margin. Second, our inference scheme takes advantage of efficient convergent block coordinate descent (BCD) and iteratively updates all image rows and columns conditioned on the remaining variables via dynamic programming. Third, we exploit the special form of the pairwise potentials used by our formulation to further decrease computational complexity, thereby making very large unordered label sets with hundreds of labels tractable. Upon convergence, we remove outliers (e.g., in occluded regions) using strategies borrowed from the stereo literature. Finally, we regress a real-valued dense flow field from our semi-dense integer flow estimate using variational techniques [32, 42]. We experimentally validate the proposed method on two challenging benchmarks: MPI Sintel [9] and KITTI [17]. Our experiments show that the proposed method attains state-of-the-art performance. Importantly, our results indicate that discrete optimization can be a powerful tool for solving optical flow problems, even when considering pairwise flow priors only. Our code and supplementary material are available from our project page: http://www.cvlibs.net/projects/discrete_flow.

2 Related Work

Global estimation of optical flow has traditionally been formulated as a continuous variational optimization problem with linearized data terms [4, 20] and many of the most successful works still follow the same paradigm to date [8, 12, 30, 34, 35, 37, 40, 43]. To estimate displacements larger than 1 pixel, as required by modern benchmarks [1, 9, 17] and for many applications, continuous methods typically rely on image pyramids [7]. Unfortunately, this heuristic is prone to local minima as texture details and fine structures vanish at small scales. As a consequence, methods leveraging sparse feature correspondences to guide the optimization process have been popularized [5, 6, 32, 38, 42], incorporating feature matches into initialization or as an additional data term. Our approach shares similarity with these methods in the sense that we also refine our integer-valued flow result to sub-pixel accuracy as proposed in [32, 42]. However, in contrast to the above-mentioned works, our integral matches are obtained via discrete optimization with optical flow priors. This allows our algorithm to establish denser correspondences than possible with independently matched sparse features and, in combination with sub-pixel refinement, leads to better results.

Alternatively, optical flow can be viewed as a discrete optimization problem. In the absence of regularization constraints, this corresponds to a complete search of the discretized 2D flow space which has to be carried out for each pixel [2, 47]. While local support windows [33] alleviate the effect of border bleeding, they generally cannot compete with methods that leverage global regularization. Incorporating smoothness constraints, however, is much more difficult for discrete optical flow than for the related stereo matching problem due to the extremely large label space of the discretized 2D flow field. To avoid this difficulty, a number of approaches formulate optical flow as a segmentation problem: Based on a small set of dense flow field proposals, the most likely flow field at each pixel subject to regularization constraints is approximated via quadratic pseudo-boolean optimization (QPBO) or belief propagation (BP) [11, 23, 44, 45, 48]. A reduced label set for all pixels is proposed by [28] restricting the method to scenes with little and non-complex motion. In contrast, here we pursue a more direct approach to discrete optical flow which does not rely on proposed flow fields or a global label set but allows for an arbitrary set of flow proposals per pixel. Compared to the approaches of Steinbrücker et al. [36] and Liu et al. [24], we neither require a complete search nor image pyramids [24]. Our approach is also related to particle-based methods such as particle BP [46] or PatchMatch BP [3, 21]. In contrast to those methods, our algorithm is not restricted to a relatively small set of particles or local resampling which can lead to local minima. Instead, we maintain diverse distributions over much larger state spaces throughout the whole inference process.

Very recently, object recognition and feature learning have been exploited as a powerful source of information for optical flow. Notable examples are the data-driven flow transfer method of Wei et al. [41] and FlowNets based on deep neural networks by Fischer et al. [15]. In this paper our focus lies on a more generic model without the need for large annotated training sets. However, these ideas could be easily incorporated into the proposed model, e.g., via additional unary terms, promising further gains in performance in the future.

3 Model

Let us assume two input images \(\mathbf {I}\) and \(\mathbf {I}'\) of size \(W\times H\). Our goal is to assign as many pixels as possible from the reference image \(\mathbf {I}\) to pixels in the target image \(\mathbf {I}'\). In other words, we aim for a semi-dense integer-valued flow field, establishing correspondences which are visible in both views. Flow vectors are selected from a diverse set of proposals (Sect. 3.1) based on a CRF model (Sect. 3.2) which is efficiently optimized (Sect. 3.3). Interpolation, extrapolation and sub-pixel refinement are then addressed in a subsequent post-processing step (Sect. 3.4).

3.1 Diverse Flow Proposals

Unfortunately, naïve pixel-wise discretization of all possible flow vectors would lead to an intractably large set of labels at each pixel. Consider for instance a flow range of \(\pm 250\) pixels in u- and v-direction. While this range is sufficiently large for the datasets tackled in this paper (see supplementary material), it leads to more than 250,000 labels per pixel (\((2\cdot 250+1)^2 = 251{,}001\)) for which even the data term alone would be challenging to calculate and to store. In this section, we therefore propose a mechanism which extracts a subset of L flow proposals (\(L=500\) in our experiments) while maintaining a high recall of ground truth optical flow.

Towards this goal, we partition the target image into cells of equal size as illustrated in Fig. 1 (left). For each cell, we apply the randomized k-d tree algorithm [29] on the feature descriptors of the pixels falling into that cell, yielding an efficient approximate search structure. Next, for each pixel in the reference image (red node in Fig. 1), we find all relevant cells in the target image according to the desired optical flow range (red shaded cells in Fig. 1) and concatenate all M flow vectors corresponding to the K nearest neighbor matches from the k-d tree of each cell (green nodes in Fig. 1). In contrast to a single search structure constructed for the entire target image, this strategy has two advantages: First, arbitrary optical flow search ranges can be implemented with little computational overhead by searching only in relevant cells. Second, the retrieved flow vectors respect an approximate non-maximal-suppression constraint, i.e., we obtain exactly K flow vectors per cell.

As neighboring pixels often exhibit similar optical flow, we additionally sample N random pixels (blue nodes in Fig. 1) from a local Gaussian distribution centered at the reference pixel and add the respective winner-takes-all solution (i.e., the flow vector corresponding to the best match at the sampled pixel) to the proposal set of the current pixel. In case it is already present in the proposal set, we proceed with the next best flow vector, ensuring that the final set of \(L=M+N\) flow vectors is unique. In our experiments we use \(M=300\) nearest neighbors and \(N=200\) proposals from neighboring pixels.
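As an illustration, the per-cell proposal search can be sketched in Python. This is not the authors' implementation: the descriptor layout, the coarse cell-overlap test and the use of SciPy's exact cKDTree (standing in for the randomized k-d trees of [29]) are assumptions of the sketch, and the winner-takes-all sampling of the N additional proposals is omitted.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_cell_trees(desc, cell):
    """Partition the target-image descriptors into square cells and build
    one search tree per cell (an exact cKDTree stands in for the
    randomized k-d trees used in the paper)."""
    H, W, D = desc.shape
    trees = {}
    for cy in range(0, H, cell):
        for cx in range(0, W, cell):
            ys, xs = np.mgrid[cy:min(cy + cell, H), cx:min(cx + cell, W)]
            pos = np.stack([xs.ravel(), ys.ravel()], axis=1)  # (x, y) per point
            feat = desc[ys.ravel(), xs.ravel()]
            trees[(cy // cell, cx // cell)] = (cKDTree(feat), pos)
    return trees

def proposals_for_pixel(p, d_p, trees, cell, search_range, K):
    """Collect the K nearest-neighbor flow vectors from every cell that
    (coarsely) overlaps the search window around reference pixel p."""
    x, y = p
    flows = []
    for (ry, rx), (tree, pos) in trees.items():
        cx, cy = rx * cell, ry * cell
        # coarse overlap test between the cell and the square search window
        if abs(cx + cell / 2 - x) > search_range + cell or \
           abs(cy + cell / 2 - y) > search_range + cell:
            continue
        _, idx = tree.query(d_p, k=min(K, tree.n))
        for i in np.atleast_1d(idx):
            qx, qy = pos[i]
            flows.append((qx - x, qy - y))  # flow = target pos - reference pos
    return flows
```

A flow proposal is simply the displacement from the reference pixel to a retrieved target pixel; to arrive at the final set of L unique proposals, duplicates would still have to be filtered as described above.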

3.2 Random Field Model

We associate a discrete label \(l_\mathbf {p}\in \{1,\dots ,L\}\) with each pixel \(\mathbf {p}=(x,y)\) in the reference image, corresponding to the (unique) integer-valued proposal flow vectors \(\mathbf {f}_\mathbf {p}(l_\mathbf {p})\in \mathbb {Z}^2\) described in Sect. 3.1. We consider optical flow estimation as MAP inference in a pairwise CRF. More specifically, we aim at minimizing

$$\begin{aligned} E(\mathbf {l}) = \lambda \sum _{\mathbf {p}\in \mathcal {P}} \underbrace{\varphi _\mathbf {p}(l_\mathbf {p})}_{\text {data}} \,+\, \sum _{\mathbf {p}\sim \mathbf {q}} \underbrace{\psi _{\mathbf {p},\mathbf {q}}(l_\mathbf {p},l_\mathbf {q})}_{\text {smoothness}} \end{aligned}$$
(1)

with respect to the set of image labels \(\mathbf {l}=\{l_\mathbf {p}|\mathbf {p}\in \mathcal {P}\}\) where \(\mathcal {P}\) denotes the set of pixels in image \(\mathbf {I}\). Here, we have dropped the dependency on the input images \(\mathbf {I}\) and \(\mathbf {I}'\) for clarity and \(\sim \) denotes all neighbors on a 4-connected image grid. The relative weight between the data term (measuring data fidelity) and the smoothness term (encouraging smooth flow fields) is defined by \(\lambda \).

The data term \(\varphi _\mathbf {p}(l_\mathbf {p})\) encodes the cost at pixel \(\mathbf {p}\) given label \(l_\mathbf {p}\). We model it as the truncated \(\ell _1\)-penalty evaluating the difference between feature descriptors

$$\begin{aligned} \varphi _\mathbf {p}(l_\mathbf {p}) = \min \left( {\Vert \mathbf {d}_\mathbf {p}- \mathbf {d}'_\mathbf {p}(l_\mathbf {p}) \Vert }_1,\tau _\varphi \right) \end{aligned}$$
(2)

where \(\mathbf {d}_\mathbf {p}\in \mathbb {R}^D\) denotes the feature descriptor at pixel \(\mathbf {p}\) in the reference image, \(\mathbf {d}'_\mathbf {p}(l_\mathbf {p})\in \mathbb {R}^D\) denotes the feature descriptor associated with pixel \(\mathbf {p}\) and label \(l_\mathbf {p}\) in the target image, and \(\tau _\varphi \) is the truncation threshold of the data term.

The second term in Eq. 1 encourages smooth flow fields and is modeled as the weighted truncated \(\ell _1\)-distance of neighboring optical flow vectors

$$\begin{aligned} \psi _{\mathbf {p},\mathbf {q}}(l_\mathbf {p},l_\mathbf {q}) = w_{\mathbf {p},\mathbf {q}} ~ \min \left( {\Vert \mathbf {f}_\mathbf {p}(l_\mathbf {p}) - \mathbf {f}_\mathbf {q}(l_\mathbf {q}) \Vert }_1,\tau _\psi \right) \end{aligned}$$
(3)

where \(\mathbf {f}_\mathbf {p}(l_\mathbf {p})\in \mathbb {R}^2\) denotes the flow vector at pixel \(\mathbf {p}\) associated with label \(l_\mathbf {p}\) and \(\tau _\psi \) is the truncation threshold of the smoothness term. We remark that a truncated penalty is not only robust against outliers and thus preferable from a statistical point of view [34], but it is also critical for tractable inference in our model as described in Sect. 3.3. Moreover, note that Eq. 3 is a generalization of the pairwise potential proposed for stereo matching by Hirschmüller [19] to arbitrary truncation thresholds \(\tau _\psi \). As we expect flow boundaries to coincide with image edges, we additionally weight the smoothness term by a weight factor \(w_{\mathbf {p},\mathbf {q}}=\exp \left( -\alpha \,\kappa _{\mathbf {p},\mathbf {q}}^2\right) \). Here, \(\kappa _{\mathbf {p},\mathbf {q}} \in \left[ 0,1\right] \) measures the strength of the edge between pixel \(\mathbf {p}\) and pixel \(\mathbf {q}\). We calculate \(\kappa _{\mathbf {p},\mathbf {q}}\) using structured edge detection [13].
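To make the indexing conventions concrete, the energy of Eqs. 1-3 can be written down directly. The sketch below is not the authors' implementation; the parameter values are illustrative placeholders, and descriptors and flows are assumed to be stored per pixel and label.

```python
import numpy as np

# illustrative parameter values, not the tuned values from the paper
TAU_PHI, TAU_PSI, LAM, ALPHA = 40.0, 15.0, 0.02, 5.0

def unary(d_ref, d_tgt):
    """Eq. 2: truncated L1 distance between feature descriptors."""
    return min(np.abs(d_ref - d_tgt).sum(), TAU_PHI)

def pairwise(f_p, f_q, kappa):
    """Eq. 3: edge-weighted truncated L1 distance between flow vectors."""
    w = np.exp(-ALPHA * kappa ** 2)
    return w * min(np.abs(np.asarray(f_p) - np.asarray(f_q)).sum(), TAU_PSI)

def energy(labels, flows, d_ref, d_tgt, kappa_h, kappa_v):
    """Eq. 1 on a 4-connected grid. labels[y, x] indexes into
    flows[y, x, l]; d_tgt[y, x, l] is the matched target descriptor."""
    H, W = labels.shape
    E = 0.0
    for y in range(H):
        for x in range(W):
            l = labels[y, x]
            E += LAM * unary(d_ref[y, x], d_tgt[y, x, l])
            if x + 1 < W:  # horizontal neighbor
                E += pairwise(flows[y, x, l], flows[y, x + 1, labels[y, x + 1]],
                              kappa_h[y, x])
            if y + 1 < H:  # vertical neighbor
                E += pairwise(flows[y, x, l], flows[y + 1, x, labels[y + 1, x]],
                              kappa_v[y, x])
    return E
```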

3.3 Inference

Despite the simplicity of the model described in Sect. 3.2 and the label set of size \(L=500\), inference using max-product loopy BP is still prohibitively slow. To enable efficient inference, we therefore exploit two additional strategies. First, instead of loopy belief propagation, we perform block coordinate descent (BCD). More specifically, we follow [10] and iteratively update alternating image rows and columns conditioned on the MAP solution of the remaining variables as illustrated in Fig. 1 (middle). Second, we exploit the truncated form of the pairwise potentials in Eq. 3. This is illustrated in Fig. 1 (right). In combination, both steps reduce computational complexity by about four orders of magnitude.

Without loss of generality, consider the optimization of image row y. The naïve dynamic programming algorithm recursively fills the cumulative cost matrix \(\mathbf {C}\) for each x from 1 to W using the following update equation:

$$\begin{aligned} \mathbf {C}(x,l) =\;&\lambda \, \varphi _{(x,y)}(l) + \psi _{(x,y),(x,y-1)}(l,l^*_{x,y-1}) + \psi _{(x,y),(x,y+1)}(l,l^*_{x,y+1}) \nonumber \\ &+ \min _{0\le k<L} \left( \psi _{(x,y),(x-1,y)}(l,k) + \mathbf {C}(x-1,k) \right) \end{aligned}$$
(4)

Here, \(l^*_\mathbf {p}\) denotes the assignment of the fixed variables, i.e., the variables outside row y. While the global problem is NP-hard and can be solved only approximately, each sub-problem corresponds to a chain MRF for which we obtain an optimal solution via dynamic programming (backtracking). This leads to the desirable property that this algorithm (unlike loopy BP) is guaranteed to converge.
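For a single row, the recursion of Eq. 4 is standard chain dynamic programming with backtracking. A minimal sketch, assuming the conditioning terms from the fixed rows above and below have already been folded into the unary costs, and using the naive \(O(L^2)\) transition:

```python
import numpy as np

def optimize_row(unary, pairwise):
    """Exact MAP for one chain of T pixels with L labels each.
    unary:    (T, L) array of lambda*phi costs (conditioning terms included)
    pairwise: (T-1, L, L) array, pairwise[t, k, l] = psi between
              label k at pixel t and label l at pixel t+1
    Returns the optimal label per pixel via backtracking."""
    T, L = unary.shape
    C = np.empty((T, L))
    back = np.zeros((T, L), dtype=int)
    C[0] = unary[0]
    for t in range(1, T):
        # naive O(L^2) transition: minimize over predecessor labels k
        trans = C[t - 1][:, None] + pairwise[t - 1]  # (L, L)
        back[t] = trans.argmin(axis=0)
        C[t] = unary[t] + trans.min(axis=0)
    labels = np.empty(T, dtype=int)
    labels[-1] = C[-1].argmin()
    for t in range(T - 2, -1, -1):  # backtracking pass
        labels[t] = back[t + 1, labels[t + 1]]
    return labels
```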

In case of ordered label sets (e.g., in stereo matching [14] or depth reconstruction [10]) the efficient distance transform [14] can be employed to lower the complexity of this algorithm from \(O(WL^2)\) to O(WL) by calculating the expression in the last row of Eq. 4 in linear time. Unfortunately, in our case the flow vectors involved in \(\psi _{(x,y),(x-1,y)}(l,k)\) are sparse and unordered, therefore prohibiting the application of this trick. However, we are still able to utilize the fact that our pairwise potentials \(\psi \) are truncated at \(\tau _\psi \): First, note that for practical truncation thresholds (i.e., \(\tau _\psi <15\)), the majority of the \(L^2\) pairwise terms evaluates to \(\tau _\psi \). We exploit this observation by partitioning the labels of each neighboring pixel pair \((\mathbf {p},\mathbf {q})\) into sets

$$\begin{aligned} \mathcal {K}_{\mathbf {p},\mathbf {q},l} = \{k\in \{1,\dots ,L\}\,|\,{\Vert \mathbf {f}_\mathbf {p}(l) - \mathbf {f}_\mathbf {q}(k) \Vert }_1<\tau _\psi \} \end{aligned}$$
(5)

which contain all labels k at pixel \(\mathbf {q}\) for which the flow \(\mathbf {f}_\mathbf {q}(k)\) is within \(\tau _\psi \) from the flow \(\mathbf {f}_\mathbf {p}(l)\) associated with label l at pixel \(\mathbf {p}\). Figure 1 (right) illustrates \(\mathcal {K}_{\mathbf {p},\mathbf {q},l}\) for a single flow vector at pixel \(\mathbf {p}\) (shown in red) and a set of flow vectors at pixel \(\mathbf {q}\) (black+gray). Given this definition, the last term in Eq. 4 can be written as

$$\begin{aligned} \min \left( \min _{k \in \mathcal {K}_{(x,y),(x-1,y),l}}~ \left( \psi _{(x,y),(x-1,y)}(l,k) + \mathbf {C}(x-1,k) \right) \!,c \right) \end{aligned}$$
(6)

where the constant c is given by

$$\begin{aligned} c = \min _{0\le k <L} \left( w_{(x,y),(x-1,y)} \tau _\psi + \mathbf {C}(x-1,k) \right) \end{aligned}$$
(7)

As c does not depend on l and can be calculated in O(L), Eq. 6 can be evaluated (for all l) in \(O(\sum _l|\mathcal {K}_{\mathbf {p},\mathbf {q},l}|)\) instead of \(O(L^2)\), where \(|\mathcal {K}_{\mathbf {p},\mathbf {q},l}| \ll L\) in practice due to the diversity of the proposal set as evidenced by our experimental evaluation in Sect. 4. It is important to note that the sets \(\mathcal {K}_{\mathbf {p},\mathbf {q},l}\) can be pre-computed and reused during all BCD iterations. In our implementation, we further accelerate this pre-computation step using hash maps for efficiently retrieving flow vectors.
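The construction of the sets \(\mathcal {K}_{\mathbf {p},\mathbf {q},l}\) and the reduced minimization of Eqs. 6 and 7 can be sketched as follows. The hash-map variant assumes integer flow vectors, so that the \(\ell _1\)-ball around each proposal can be enumerated exactly; this is an illustrative choice, not necessarily the authors' data structure.

```python
import numpy as np

def precompute_neighbour_sets(flows_p, flows_q, tau_psi):
    """Eq. 5: for every label l at p, the labels k at q whose flow lies
    within tau_psi (strict L1) of f_p(l), via a hash map over flows."""
    lookup = {}
    for k, f in enumerate(map(tuple, flows_q)):
        lookup.setdefault(f, []).append(k)
    # integer flows: ||.||_1 < tau_psi  <=>  ||.||_1 <= r
    r = int(np.ceil(tau_psi)) - 1
    sets = []
    for f in map(tuple, flows_p):
        K = []
        for du in range(-r, r + 1):  # enumerate the L1 ball around f
            for dv in range(-(r - abs(du)), r - abs(du) + 1):
                K.extend(lookup.get((f[0] + du, f[1] + dv), []))
        sets.append(K)
    return sets

def fast_transition(C_prev, flows_p, flows_q, sets, w, tau_psi):
    """Eqs. 6-7: minimum over predecessor labels, evaluating the pairwise
    term only for the non-truncated labels in `sets`."""
    c = w * tau_psi + C_prev.min()          # Eq. 7: truncated fallback
    out = np.full(len(flows_p), c)
    for l, K in enumerate(sets):
        for k in K:                         # Eq. 6: non-truncated pairs only
            psi = w * np.abs(flows_p[l] - flows_q[k]).sum()
            out[l] = min(out[l], psi + C_prev[k])
    return out
```

Since the truncated value is minimized once for all labels via the constant c, only the few non-truncated pairs contribute to the inner loop.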

3.4 Postprocessing

The inference algorithm described above solves the correspondence problem, i.e., it assigns each pixel in the reference image to a pixel in the target image. As we do not model occlusions explicitly, occluded pixels and pixels leaving the image domain are also assigned a label. We therefore remove outliers from our result, borrowing ideas from the stereo matching literature [19]. More specifically, we perform forward-backward consistency checking, i.e., we calculate the optical flow forwards and backwards in time and retain only flow vectors which are consistent. In addition, we remove small isolated segments, which often correspond to wrong optical flow estimates, using connected component labeling with a minimum segment size of 100 px and a flow consistency threshold of 10 px. Finally, in order to obtain sub-pixel flow values and to inter-/extrapolate into unmatched regions, we refine our results using the state-of-the-art flow refinement techniques of DeepFlow [42] and EpicFlow [32].
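A minimal sketch of the forward-backward consistency check, assuming integer flow fields; the round-trip threshold is an illustrative choice, and segment filtering and sub-pixel refinement are omitted.

```python
import numpy as np

def consistency_check(flow_fw, flow_bw, thresh=1.0):
    """Keep a forward flow vector only if the backward flow at the pixel
    it maps to points (approximately) back to the source pixel.
    flow_fw, flow_bw: (H, W, 2) integer flow fields storing (u, v).
    Returns a boolean validity mask."""
    H, W, _ = flow_fw.shape
    valid = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            u, v = flow_fw[y, x]
            xt, yt = x + int(u), y + int(v)
            if 0 <= xt < W and 0 <= yt < H:  # mapped point inside the image
                ub, vb = flow_bw[yt, xt]
                # the round trip should return (close) to the start
                valid[y, x] = abs(u + ub) + abs(v + vb) <= thresh
    return valid
```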

Table 1. Pilot Study on MPI Sintel and KITTI Training Sets. See text.

4 Experimental Evaluation

We evaluate our approach on the challenging MPI Sintel [9] and KITTI [17] datasets. Our evaluation starts with a pilot study where we experimentally validate the claims from the introduction, followed by a quantitative and qualitative evaluation of our model. The parameters in our model are set via block coordinate descent on the respective training set (see supplementary material). To further increase efficiency we use a stride of 4 pixels. For subpixel interpolation of our results, we leverage the interpolation stages of DeepFlow (“Ours+DeepFlow”) and EpicFlow (“Ours+EpicFlow”). On average, our MATLAB/C++ implementation requires about 3 min for one 0.5 megapixel color image pair from the KITTI training set (10 % descriptor extraction, 40 % proposal generation, 35 % BCD, 15 % postprocessing and overhead).

Fig. 2. Frequency of Neighboring Flow Proposals w.r.t. Distance. This figure shows the frequency of neighboring flow proposal vectors with respect to their endpoint distance on MPI Sintel (a) and KITTI (b) in red. The blue plots depict the cumulative frequency corresponding to \(|\mathcal {K}_{\mathbf {p},\mathbf {q},l}|/L\) over the distance threshold \(\tau _\psi \), respectively (Color figure online).

Table 2. Evaluation on MPI Sintel Test Set. This table lists the top 10 out of 44 ranked methods on the MPI Sintel flow benchmark in terms of endpoint error (EPE) in all, matched (Noc) and unmatched (Occ) regions. The results on the clean images are shown in (a). The results on the final set are shown in (b).

4.1 Pilot Study

Table 1 shows results on the non-occluded parts of the MPI Sintel and KITTI training sets in terms of endpoint error (EPE) and outliers (threshold: 3 px). We show (from top-to-bottom) the results of DeepFlow [42], EpicFlow [32], our results in combination with the refinement stages of DeepFlow [42] and EpicFlow [32], as well as an “Oracle” which refers to the flow map obtained by selecting the flow with the smallest EPE at each pixel from our proposal set. Note that our approach uses only the refinement stages of [32, 42] and not the DeepMatches (DM) which form the foundation for [32, 42]. As evidenced by this experiment, our method obtains state-of-the-art performance, outperforming [32, 42] in most of the error metrics. Furthermore, our proposal set contains flow vectors close to the ground truth for most of the pixels while being diverse enough to avoid local minima and reduce the computational complexity of the pairwise terms. Figure 2 shows histograms of the frequency of neighboring flow proposals with respect to the distance of their endpoints for \(L=500\). While the pairwise term in Eq. 3 can take \(L^2=250,000\) states in total, only \(2\,\%\) of them fall below the truncation value of 15 px and need to be evaluated in Eq. 6. Thus, pre-calculating the partitions \(\mathcal {K}_{\mathbf {p},\mathbf {q},l}\) saves almost two orders of magnitude in computation time.

Table 3. Evaluation on KITTI Test Set. This table lists the top 15 out of 57 ranked methods on the KITTI optical flow benchmark in terms of outliers and endpoint error (EPE) in non-occluded and all regions. For comparability, only pure optical flow methods are shown, excluding motion stereo methods and techniques which use stereo information or more than two frames as input.

4.2 Quantitative Results

We also compare our method to the state-of-the-art on the challenging held-out test data of MPI Sintel [9] and KITTI [17]. As requested by the evaluation protocol, we only submitted our strongest entry (“Ours+EpicFlow”) for evaluation on the test set. Tables 2 and 3 compare our results to the state-of-the-art in two-frame optical flow. On MPI Sintel we obtain rank 1 on the “clean” set and rank 2 on the “final” set out of 44 ranked methods in total. On KITTI our method ranks 6th (out of 57 submitted results) for an outlier threshold of 3 pixels and 1st when using a threshold of 5 pixels. Note that we have excluded motion stereo methods and methods using additional information from Table 3 for a fair comparison. For full results, we refer to the online leaderboards.

4.3 Qualitative Results

Figure 3 depicts qualitative results of our method on a subset of MPI Sintel (top four images) and KITTI (bottom four images). As we require ground truth for visualizing the error maps we utilize the training sets for this purpose. We show the input image (left), our results in combination with EpicFlow postprocessing (middle) and the color-coded error map with respect to the ground truth flow field (right). At the top of each error map we specify the percentage of outliers and the endpoint error (EPE). The color coding visualizes outliers (\(>3\) px EPE) in red and inliers (\(<3\) px EPE) in blue on a logarithmic color scale. Pixels without ground truth value are shown in black. Note how our method is able to capture fine details in the optical flow. The last result of each dataset shows a failure case. For MPI Sintel, errors can be mostly attributed to homogeneous regions, occlusions and large changes in motion blur. On KITTI, we identified large perspective distortions, illumination artifacts and saturated or reflective regions as the primary sources of error.

Fig. 3. Qualitative Results on MPI Sintel and KITTI Training Sets. From left-to-right: Input image, “Ours+EpicFlow”, inliers (blue)/outliers (red) (Color figure online).

5 Conclusions

We presented a discrete solution to the estimation of optical flow by exploiting three strategies which limit computational and memory requirements. Our experiments show that discrete optimization can lead to state-of-the-art optical flow results using a relatively simple prior model. In the future, we plan to integrate richer optical flow priors into our approach. In particular, we aim at jointly reasoning about several image scales and at incorporating semantic information [18, 41] into optical flow estimation. Furthermore, we plan to extend our approach to the estimation of 3D scene flow [26, 27].