Abstract
We present a method that leverages the complementarity of event cameras and standard cameras to track visual features with low latency. Event cameras are novel sensors that output pixel-level brightness changes, called “events”. They offer significant advantages over standard cameras, namely a very high dynamic range, no motion blur, and a latency on the order of microseconds. However, because the same scene pattern can produce different events depending on the motion direction, establishing event correspondences across time is challenging. By contrast, standard cameras provide intensity measurements (frames) that do not depend on motion direction. Our method extracts features on frames and subsequently tracks them asynchronously using events, thereby exploiting the best of both types of data: the frames provide a photometric representation that does not depend on motion direction and the events provide low-latency updates. In contrast to previous works, which are based on heuristics, this is the first principled method that uses raw intensity measurements directly, based on a generative event model within a maximum-likelihood framework. As a result, our method produces feature tracks that are both more accurate (subpixel accuracy) and longer than the state of the art, across a wide variety of scenes.
Multimedia Material. A supplemental video for this work is available at https://youtu.be/A7UfeUnG6c4.
1 Introduction
Event cameras, such as the Dynamic Vision Sensor (DVS) [1], work very differently from traditional cameras (Fig. 1). They have independent pixels that send information (called “events”) only in the presence of brightness changes in the scene, at the time they occur. Thus, their output is not an intensity image but a stream of asynchronous events. Event cameras excel at sensing motion, and they do so with very low latency (1 \(\upmu \)s). However, they do not provide absolute intensity measurements; rather, they measure only changes of intensity. Conversely, standard cameras provide direct intensity measurements for every pixel, but with comparatively much higher latency (10–20 ms). Event cameras and standard cameras are, thus, complementary, which calls for the development of novel algorithms capable of combining the specific advantages of both cameras to perform computer vision tasks with low latency. In fact, the Dynamic and Active-pixel Vision Sensor (DAVIS) [2] was recently introduced (2014) in that spirit. It is a sensor comprising an asynchronous event-based sensor and a standard frame-based camera in the same pixel array.
We tackle the problem of feature tracking using both events and frames, such as those provided by the DAVIS. Our goal is to combine both types of intensity measurements to maximize tracking accuracy and age, and for this reason we develop a maximum likelihood approach based on a generative event model.
Feature tracking is an important research topic in computer vision, and has been widely studied over the past decades. It is a core building block of numerous applications, such as object tracking [3] or Simultaneous Localization and Mapping (SLAM) [4,5,6,7]. While feature detection and tracking methods for frame-based cameras are well established, they cannot track in the blind time between consecutive frames, and they are expensive because they process information from all pixels, even in the absence of motion in the scene. Conversely, event cameras acquire only the information relevant for tracking and respond asynchronously, thus filling the blind time between consecutive frames.
In this work we present a feature tracker which works by extracting corners in frames and subsequently tracking them using only events. This allows us to take advantage of the asynchronous, high dynamic range and low-latency nature of the events to produce feature tracks with high temporal resolution. However, this asynchronous nature means that it becomes a challenge to associate individual events coming from the same object, which is known as the data association problem. In contrast to previous works which used heuristics to solve for data association, we propose a maximum likelihood approach based on a generative event model that uses the photometric information from the frames to solve the problem. In summary, our contributions are the following:
- We introduce the first feature tracker that combines events and frames in a way that (i) fully exploits the strength of the brightness gradients causing the events, (ii) circumvents the data association problem between events and pixels of the frame, and (iii) leverages a generative model to explain how events are related to brightness patterns in the frames.
- We provide a comparison with state-of-the-art methods [8, 9], and show that our tracker provides feature tracks that are both more accurate and longer.
- We thoroughly evaluate the proposed tracker using scenes from the publicly available Event Camera Dataset [10], and show its performance both on man-made environments with large contrast and in natural scenes.
2 Related Work
Feature detection and tracking with event cameras is a major research topic [8, 9, 12,13,14,15,16,17,18], where the goal is to unlock the capabilities of event cameras and use them to solve these classical problems in computer vision in challenging scenarios inaccessible to standard cameras, such as low-power, high-speed and high dynamic range (HDR) scenarios. Recently, extensions of popular image-based keypoint detectors, such as Harris [19] and FAST [20], have been developed for event cameras [17, 18]. Detectors based on the distribution of optical flow [21] for recognition applications have also been proposed for event cameras [16]. Finally, most event-based trackers use binary feature templates, either predefined [13] or built from a set of events [9], to which they align events by means of iterative point-set–based methods, such as iterative closest point (ICP) [22].
Our work is most related to [8], since both combine frames and events for feature tracking. The approach in [8] detects patches of Canny edges around Harris corners in the grayscale frames and then tracks such local edge patterns using ICP on the event stream. Thus, the patch of Canny edges acts as a template to which the events are registered to yield tracking information. Under the simplifying assumption that events are mostly generated by strong edges, the Canny edgemap template is used as a proxy for the underlying grayscale pattern that causes the events. The method in [8] converts the tracking problem into a geometric, point-set alignment problem: the event coordinates are compared against the point template given by the pixel locations of the Canny edges. Hence, pixels where no events are generated are not processed, which is efficient. However, the method has two drawbacks: (i) the information about the strength of the edges is lost (since the point template used for tracking is obtained from a binary edgemap), and (ii) explicit correspondences (i.e., data association) between the events and the template need to be established for ICP-based registration. The method in [9] can be interpreted as an extension of [8] with (i) the Canny-edge patches replaced by motion-corrected event point sets and (ii) the correspondences computed in a soft manner using Expectation-Maximization (EM)-ICP.
Like [8, 9], our method can be used to track generic features, as opposed to constrained edge patterns. However, our method differs from [8, 9] in that (i) we take into account the strength of the edge pattern causing the events and (ii) we do not need to establish correspondences between the events and the edgemap template. In contrast to [8, 9], which use a point-set template for event alignment, our method uses the spatial gradient of the raw intensity image, directly, as a template. Correspondences are implicitly established as a consequence of the proposed image-based registration approach (Sect. 4), but before that, let us motivate why establishing correspondences is challenging with event cameras.
3 The Challenge of Data Association for Feature Tracking
The main challenge in tracking scene features (i.e., edge patterns) with an event camera is that, because this sensor responds to temporal changes of intensity (caused by moving edges on the image plane), the appearance of the feature varies depending on the motion, and thus, continuously changes in time (see Fig. 2). Feature tracking using events requires the establishment of correspondences between events at different times (i.e., data association), which is difficult due to the above-mentioned varying feature appearance (Fig. 2).
Instead, if additional information is available, such as the absolute intensity of the pattern to be tracked (i.e., a time-invariant representation or “map” of the feature), as in Fig. 2(a), then event correspondences may be established indirectly, via correspondences between the events and the additional map. This, however, additionally requires continuously estimating the motion (optic flow) of the pattern. This is in fact an important component of our approach. As we show in Sect. 4, our method is based on a model that generates a prediction of the time-varying event-feature appearance using a given frame and an estimate of the optic flow. This generative model has not been considered in previous feature tracking methods, such as [8, 9].
4 Methodology
An event camera has independent pixels that respond to changes in the continuous brightness signal (Footnote 1) \(L(\mathbf {u},t)\). Specifically, an event \(e_k=(x_k,y_k,t_k,p_k)\) is triggered at pixel \(\mathbf {u}_k=(x_k,y_k)^\top \) and at time \(t_k\) as soon as the brightness increment since the last event at the pixel reaches a threshold \(\pm C\) (with \(C > 0\)):

$$\begin{aligned} \varDelta L(\mathbf {u}_k,t_k) \doteq L(\mathbf {u}_k,t_k) - L(\mathbf {u}_k,t_k-\varDelta t_k) = p_k\, C, \end{aligned}$$
(1)

where \(\varDelta t_k\) is the time since the last event at the same pixel and \(p_k\in \{-1,+1\}\) is the event polarity (i.e., the sign of the brightness change). Equation (1) is the event generation equation of an ideal sensor [23, 24].
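As a numerical illustration of the event generation equation (1), the following sketch (our own; the function name and sampling convention are hypothetical, not from the paper) simulates the triggering condition at a single pixel given samples of the brightness signal:

```python
import numpy as np

def generate_events(brightness, timestamps, C=0.25):
    """Ideal event generation model, Eq. (1), at one pixel: an event of
    polarity p_k = +/-1 fires whenever the brightness change since the
    last event reaches the contrast threshold C; the reference level is
    then advanced by one threshold step. Illustrative sketch only."""
    events = []
    L_ref = brightness[0]  # brightness at the time of the last event
    for L, t in zip(brightness[1:], timestamps[1:]):
        while abs(L - L_ref) >= C:
            p = 1 if L > L_ref else -1
            events.append((t, p))
            L_ref += p * C  # reset reference by one threshold step
    return events
```

For example, a brightness ramp from 0 to 1.05 with \(C = 0.25\) produces four positive events, one per threshold crossing; a constant signal produces none.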
4.1 Brightness-Increment Images from Events and Frames
Pixel-wise accumulation of event polarities over a time interval \(\varDelta \tau \) produces an image \(\varDelta L(\mathbf {u})\) with the amount of brightness change that occurred during the interval (Fig. 3a),

$$\begin{aligned} \varDelta L(\mathbf {u}) \doteq \sum _{k=1}^{N_e} p_k\, C\, \delta (\mathbf {u}-\mathbf {u}_k), \end{aligned}$$
(2)

where \(\delta \) is the Kronecker delta due to its discrete argument (pixels on a lattice).
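In code, the accumulation (2) is a scatter-add of signed contrast steps into a patch-sized image. The sketch below is ours, assuming an event tuple convention \((x, y, t, p)\):

```python
import numpy as np

def increment_image(events, shape, C=0.2):
    """Brightness-increment image, Eq. (2): the polarities of all events
    in the interval, scaled by the contrast threshold C, are accumulated
    at each event's pixel location."""
    dL = np.zeros(shape)
    for x, y, t, p in events:
        dL[y, x] += p * C  # Kronecker delta: only the event's pixel changes
    return dL
```

Two positive events at the same pixel yield an increment of \(+2C\) there, while pixels without events remain zero.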
For small \(\varDelta \tau \), such as in the example of Fig. 3a, the brightness increments (2) are due to moving edges, according to the formula (Footnote 2):

$$\begin{aligned} \varDelta L(\mathbf {u}) \approx -\nabla L(\mathbf {u}) \cdot \mathbf {v}(\mathbf {u})\, \varDelta \tau , \end{aligned}$$
(3)

that is, increments are caused by brightness gradients \(\nabla L(\mathbf {u}) = \bigl (\frac{\partial L}{\partial x}, \frac{\partial L}{\partial y}\bigr )^\top \) moving with velocity \(\mathbf {v}(\mathbf {u})\) over a displacement \(\varDelta \mathbf {u}\doteq \mathbf {v}\varDelta \tau \) (see Fig. 3b). As the dot product in (3) conveys, if the motion is parallel to the edge (\(\mathbf {v}\perp \nabla L\)), the increment vanishes, i.e., no events are generated. From now on (and in Fig. 3b) we denote the modeled increment (3) using a hat, \(\varDelta \hat{L}\), and the frame by \(\hat{L}\).
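The generative model (3) is straightforward to sketch numerically: the predicted increment is the negative dot product of the frame gradient with the flow, so motion parallel to an edge predicts no events. A minimal version (function name ours, gradients by finite differences):

```python
import numpy as np

def predicted_increment(L, v, dtau):
    """Linearized generative model, Eq. (3):
    Delta L_hat(u) = -grad L(u) . v * dtau,
    with the image gradient computed by central differences."""
    gy, gx = np.gradient(L)  # np.gradient returns (d/dy, d/dx) for a 2D array
    return -(gx * v[0] + gy * v[1]) * dtau
```

For a pattern whose gradient points along x, motion along y (parallel to the edge) predicts a zero increment everywhere, while motion along x predicts a uniform nonzero one.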
4.2 Optimization Framework
Following a maximum likelihood approach, we propose to use the difference between the observed brightness changes \(\varDelta L\) from the events (2) and the predicted ones \(\varDelta \hat{L}\) from the brightness signal \(\hat{L}\) of the frames (3) to estimate the motion parameters that best explain the events according to an optimization score.
More specifically, we pose the feature tracking problem using events and frames as that of image registration [25, 26], between images (2) and (3). Effectively, frames act as feature templates with respect to which events are registered. As is standard, let us assume that (2) and (3) are compared over small patches (\(\mathcal {P}\)) containing distinctive patterns, and further assume that the optic flow \(\mathbf {v}\) is constant for all pixels in the patch (same regularization as [25]).
Letting \(\hat{L}\) be given by an intensity frame at time \(t=0\) and letting \(\varDelta L\) be given by events in a space-time window at a later time t (see Fig. 4), our goal is to find the registration parameters \(\mathbf {p}\) and the velocity \(\mathbf {v}\) that maximize the similarity between \(\varDelta L(\mathbf {u})\) and \(\varDelta \hat{L}(\mathbf {u};\mathbf {p},\mathbf {v}) = -\nabla \hat{L}(\mathbf {W}(\mathbf {u};\mathbf {p}))\cdot \mathbf {v}\varDelta \tau \), where \(\mathbf {W}\) is the warping map used for the registration. We explicitly model optic flow \(\mathbf {v}\) instead of approximating it by finite differences of past registration parameters to avoid introducing approximation errors and to avoid error propagation from past noisy feature positions. A block diagram showing how both brightness increments are computed, including the effect of the warp \(\mathbf {W}\), is given in Fig. 5. Assuming that the difference \(\varDelta L- \varDelta \hat{L}\) follows a zero-mean additive Gaussian distribution with variance \(\sigma ^2\) [1], we define the likelihood function of the set of events \(\mathcal {E}\doteq \{e_k\}_{k=1}^{N_e}\) producing \(\varDelta L\) as

$$\begin{aligned} p(\mathcal {E}\,|\,\mathbf {p},\mathbf {v},\hat{L}) \propto \exp \Bigl (-\frac{1}{2\sigma ^2}\int _{\mathcal {P}}\bigl (\varDelta L(\mathbf {u}) - \varDelta \hat{L}(\mathbf {u};\mathbf {p},\mathbf {v})\bigr )^2\, d\mathbf {u}\Bigr ). \end{aligned}$$
(4)
Maximizing this likelihood with respect to the motion parameters \(\mathbf {p}\) and \(\mathbf {v}\) (since \(\hat{L}\) is known) yields the minimization of the \(L^2\) norm of the photometric residual,

$$\begin{aligned} \min _{\mathbf {p},\mathbf {v}} \Vert \varDelta L(\mathbf {u}) - \varDelta \hat{L}(\mathbf {u};\mathbf {p},\mathbf {v})\Vert ^2_{L^2(\mathcal {P})}, \end{aligned}$$
(5)

where \(\Vert f(\mathbf {u})\Vert ^2_{L^2(\mathcal {P})} \doteq \int _{\mathcal {P}}f^2(\mathbf {u})d\mathbf {u}\). However, the objective function (5) depends on the contrast sensitivity C (via (2)), which is typically unknown in practice. Inspired by [26], we propose to minimize the difference between unit-norm patches:

$$\begin{aligned} \min _{\mathbf {p},\mathbf {v}} \left\Vert \frac{\varDelta L(\mathbf {u})}{\Vert \varDelta L(\mathbf {u})\Vert _{L^2(\mathcal {P})}} - \frac{\varDelta \hat{L}(\mathbf {u};\mathbf {p},\mathbf {v})}{\Vert \varDelta \hat{L}(\mathbf {u};\mathbf {p},\mathbf {v})\Vert _{L^2(\mathcal {P})}} \right\Vert ^2_{L^2(\mathcal {P})}, \end{aligned}$$
(6)
which cancels the terms in C and \(\varDelta \tau \), and only depends on the direction of the feature velocity \(\mathbf {v}\). In this generic formulation, the same type of parametric warps \(\mathbf {W}\) as for image registration can be considered (projective, affine, etc.). For simplicity, we consider warps given by rigid-body motions in the image plane,

$$\begin{aligned} \mathbf {W}(\mathbf {u};\mathbf {p}) = \mathtt {R}\,\mathbf {u} + \mathbf {t}, \end{aligned}$$
(7)

where \((\mathtt {R},\mathbf {t})\in SE(2)\). The objective function (6) is optimized using the non-linear least-squares framework provided in the Ceres software [27].
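To make (6) and (7) concrete, here is a minimal numerical sketch (ours; the paper's implementation uses Ceres, not this code). It shows the key property of the normalized objective: invariance to the unknown scale \(C\varDelta \tau \):

```python
import numpy as np

def warp_se2(u, theta, t):
    """Rigid-body warp in the image plane, Eq. (7): W(u; p) = R u + t,
    with parameters p = (theta, tx, ty) in SE(2)."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ np.asarray(u) + np.asarray(t)

def normalized_objective(dL, dL_hat):
    """Objective (6): squared L2 distance between unit-norm patches,
    which cancels the unknown contrast C and interval dtau."""
    a = dL / np.linalg.norm(dL)
    b = dL_hat / np.linalg.norm(dL_hat)
    return np.sum((a - b) ** 2)
```

Scaling the predicted increment image by any positive constant leaves the objective unchanged, so the unknown threshold \(C\) indeed cancels; flipping the sign of the prediction (wrong flow direction) gives the maximum value of 4.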
4.3 Discussion of the Approach
One of the most interesting characteristics of the proposed method (6) is that it is based on a generative model for the events (3). As shown in Fig. 5, the frame \(\hat{L}\) is used to produce a registration template \(\varDelta \hat{L}\) that changes depending on \(\mathbf {v}\) (weighted according to the dot product) in order to best fit the motion-dependent event data \(\varDelta L\); hence, our method estimates not only the warping parameters of the event-feature but also its optic flow. This optic flow dependency was not explicitly modeled in previous works, such as [8, 9]. Moreover, for the template, we use the full gradient information of the frame \(\nabla \hat{L}\), as opposed to its Canny (i.e., binary-thresholded) version [8], which provides higher accuracy and the ability to track less salient patterns.
Another characteristic of our method is that it does not suffer from the problem of establishing event-to-feature correspondences, as opposed to ICP methods [8, 9]. We borrow the implicit pixel-to-pixel data association typical of image registration methods by creating, from events, a convenient image representation. Hence, our method has lower complexity (establishing data association in ICP [8] has quadratic complexity) and is more robust, since it is less prone to being trapped in local minima caused by data association (as will be shown in Sect. 5.3). As optimization iterations progress, all event correspondences evolve jointly as a single entity, according to the evolution of the warped pixel grid.
Additionally, monitoring the evolution of the minimum cost values (6) provides a sound criterion to detect feature track loss and, therefore, initialize new feature tracks (e.g., in the next frame or by acquiring a new frame on demand).
4.4 Algorithm
The steps of our asynchronous, low-latency feature tracker are summarized in Algorithm 1, which consists of two phases: (i) initialization of the feature patch and (ii) tracking the pattern in the patch using events according to (6). Multiple patches are tracked independently from one another. To compute a patch \(\varDelta L(\mathbf {u})\) in (2), we integrate over a given number of events \(N_e\) [28,29,30,31] rather than over a fixed time \(\varDelta \tau \) [32, 33]. Hence, tracking is asynchronous: an update is produced as soon as \(N_e\) events have been acquired on the patch, which typically happens at rates higher than the frame rate of the standard camera (\(\sim 10\) times higher). The supplementary material provides an analysis of the sensitivity of the method with respect to \(N_e\) and a formula to compute a sensible value, to be used in Algorithm 1.
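The event-batching structure of the tracker can be sketched as follows. This is an illustrative skeleton only: `update_params` stands in for the minimization of (6), and the event tuple convention \((x, y, t, p)\) and all names are ours, not from Algorithm 1:

```python
import numpy as np

def track_patch(frame, event_stream, N_e, update_params):
    """Skeleton of the per-feature loop of Algorithm 1: buffer events
    falling in the patch and, every N_e events, register the resulting
    increment image against the frame-based template, updating the warp
    parameters and the optic flow estimate."""
    params = np.zeros(3)   # (theta, tx, ty): identity warp at initialization
    v = np.zeros(2)        # optic flow estimate for the patch
    buffer, track = [], []
    for ev in event_stream:
        buffer.append(ev)
        if len(buffer) == N_e:  # asynchronous update, independent of frame rate
            params, v = update_params(frame, buffer, params, v)
            track.append((ev[2], params[1:].copy()))  # (timestamp, translation)
            buffer.clear()
    return track
```

With a stub registration step, ten events and \(N_e = 3\) produce three asynchronous track updates, each timestamped by the last event of its batch.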
5 Experiments
To illustrate the high accuracy of our method, we first evaluate it on simulated data, where we can control scene depth, camera motion, and other model parameters. Then we test our method on real data, consisting of high-contrast and natural scenes, with challenging effects such as occlusions, parallax and illumination changes. Finally, we show that our tracker can operate using frames reconstructed from a set of events [34, 35], which have higher dynamic range than those of standard cameras, thus opening the door to feature tracking in high dynamic range (HDR) scenarios.
For all experiments we use patches \(\varDelta L(\mathbf {u})\) of \(25 \times 25\) pixels (Footnote 3) and the corresponding events falling within the patches as the features move on the image plane. On the synthetic datasets, we use the 3D scene model and camera poses to compute the ground truth feature tracks. On the real datasets, we use KLT [25] as ground truth. Since our feature tracks are produced at a higher temporal resolution than the ground truth, interpolating ground truth feature positions may lead to wrong error estimates if the feature trajectory is not linear in between samples. Therefore, we evaluate the error by comparing each ground truth sample with the feature location given by linear interpolation of the two temporally closest estimated feature locations, and averaging the Euclidean distances between ground truth and estimated positions.
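This evaluation protocol amounts to sampling the (higher-rate) estimated track, by linear interpolation, at the ground-truth timestamps and averaging the distances. A sketch with names of our choosing:

```python
import numpy as np

def track_error(gt_t, gt_xy, est_t, est_xy):
    """Mean Euclidean distance between ground-truth positions and the
    estimated track linearly interpolated at the ground-truth times.
    est_t must be increasing (as track timestamps are)."""
    gt_xy, est_xy = np.asarray(gt_xy), np.asarray(est_xy)
    x = np.interp(gt_t, est_t, est_xy[:, 0])  # interpolate each coordinate
    y = np.interp(gt_t, est_t, est_xy[:, 1])
    return np.mean(np.hypot(x - gt_xy[:, 0], y - gt_xy[:, 1]))
```

For an estimated track moving along the x-axis and ground truth offset by one pixel in y, the error is exactly 1 pixel.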
5.1 Simulated Data. Assessing Tracking Accuracy
By using simulated data we assess the accuracy limits of our feature tracker. To this end, we used the event camera simulator presented in [10] and 3D scenes with different types of texture, objects and occlusions (Fig. 6). The tracker’s accuracy can be assessed by how the average feature tracking error evolves over time (Fig. 6(c)); the smaller the error, the better. All features were initialized using the first frame and then tracked until discarded, which happened if they left the field of view or if the registration error (6) exceeded a threshold of 1.6. We define a feature’s age as the time elapsed between its initialization and its disposal. The longer the features survive, the more robust the tracker.
The results for the simulated datasets are given in Fig. 6 and Table 1. Our method tracks features with very high accuracy, about 0.4 pixel error on average, which can be regarded as a lower bound for the tracking error (under noise-free conditions). The remaining error is likely due to the linearization approximation in (3). Note that feature age is reported only for completeness, since simulation time cannot be compared to the physical time of real data (Sect. 5.2).
5.2 Real Data
We compare our method against the state-of-the-art [8, 9]. The methods were evaluated on several datasets. For [8] the same set of features extracted on frames was tracked, while for [9] features were initialized on motion-corrected event images and tracked with subsequent events. The results are reported in Fig. 7 and in Table 2. The plots in Fig. 7 show the mean tracking error as a function of time (center line). The width of the colored band indicates the proportion of features that survived up to that point in time. The width of the band decreases with time as feature tracks are gradually lost. The wider the band, the more robust the feature tracker. Our method outperforms [8] and [9] in both tracking accuracy and length of the tracks.
In simple, black and white scenes (Fig. 7(a) and (d)), such as those in [8], our method is, on average, twice as accurate and produces tracks that are almost three times longer than [8]. Compared to [9], our method is also more accurate and robust. For highly textured scenes (Fig. 7(b) and (e)), our tracker maintains its accuracy even though many events are generated everywhere in the patch, which leads to significantly higher errors in [8, 9]. Although our method and [9] achieve similar feature ages, our method is more accurate. Similarly, our method performs better than [8] and is more accurate than [9] on natural scenes (Fig. 7(c) and (f)). For these scenes [9] exhibits the highest average feature age. However, being a purely event-based method, it suffers from drift due to the changing event appearance, as is most noticeable in Fig. 7(f). Our method does not drift since it uses a time-invariant template and a generative model to register events, as opposed to an event-based template [9]. Additionally, unlike previous works, our method also exploits the full range of the brightness gradients instead of using simplified, point-set–based edge maps, thus yielding higher accuracy. A more detailed comparison with [8] is further explored in Sect. 5.3, where we show that our objective function is better behaved.
The tracking error of our method on real data is larger than that on synthetic data, which is likely due to modeling errors concerning the events, including noise and dynamic effects (such as unequal contrast thresholds for events of different polarity). Nevertheless, our tracker achieves subpixel accuracy and consistently outperforms previous methods, leading to more accurate and longer tracks.
5.3 Objective Function Comparison Against ICP-Based Method [8]
As mentioned in Sect. 4, one of the advantages of our method is that data association between events and the tracked feature is implicitly established by the pixel-to-pixel correspondence of the compared patches (2) and (3). This means that we do not have to explicitly estimate it, as was done in [8, 9], which saves computational resources and prevents false associations that would yield bad tracking behavior. To illustrate this advantage, we compare the cost function profiles of our method and [8], which minimizes the alignment error (Euclidean distance) between two 2D point sets: \(\{\mathbf {p}_i\}\) from the events (data) and \(\{\mathbf {m}_j\}\) from the Canny edges (model),

$$\begin{aligned} \min _{\mathtt {R},\mathbf {t}} \sum _i b_i\, \Vert \mathtt {R}\,\mathbf {p}_i + \mathbf {t} - \mathbf {m}_{j(i)} \Vert ^2. \end{aligned}$$
(8)

Here, \(\mathtt {R}\) and \(\mathbf {t}\) are the alignment parameters and \(b_i\) are weights. At each step, the association between events and model points is made by assigning each \(\mathbf {p}_i\) to the closest point \(\mathbf {m}_{j(i)}\) and rejecting matches that are too far apart (\(>3\) pixels). By varying the parameter \(\mathbf {t}\) around the estimated value while fixing \(\mathtt {R}\) we obtain a slice of the cost function profile. The resulting cost function profiles for our method (6) and for (8) are shown in Fig. 8.
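For reference, the explicit-association cost (8) can be sketched as follows. This is our simplification with uniform weights \(b_i = 1\) and the 3-pixel rejection radius; the data-dependent association (the `argmin` inside the loop) is exactly what changes with the alignment parameters and causes the local minima discussed below:

```python
import numpy as np

def icp_cost(points, model, R, t, reject=3.0):
    """ICP alignment cost of [8], Eq. (8), with uniform weights: each
    warped event point is associated to its nearest model (Canny-edge)
    point; matches farther than `reject` pixels are discarded; the
    squared distances of the surviving matches are summed."""
    points = np.asarray(points, dtype=float)
    model = np.asarray(model, dtype=float)
    cost = 0.0
    for p in points:
        q = R @ p + t                            # warped event point
        d = np.linalg.norm(model - q, axis=1)    # distances to all model points
        if d.min() <= reject:                    # nearest-neighbor association
            cost += d.min() ** 2
    return cost
```

The cost is zero at perfect alignment and grows with the translation offset, but since each point may re-associate to a different model point as \(\mathbf {t}\) varies, the profile is only piecewise smooth.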
For simple black and white scenes (first row of Fig. 8), all generated events belong to strong edges. In contrast, for more complex, highly textured scenes (second row), events are generated more uniformly over the patch. Our method clearly shows a convex cost function in both situations. In contrast, [8] exhibits several local minima and very broad basins of attraction, making exact localization of the optimal registration parameters challenging. The broadness of the basins of attraction, together with the multitude of local minima, can be explained by the fact that the data association changes with each alignment parameter. This means that several alignment parameters may lead to partial overlapping of the point clouds, resulting in a suboptimal solution.
To show how non-smooth cost profiles affect tracking performance, we show the feature tracks in the last column of Fig. 8. The ground truth derived from KLT is marked in green. Our tracker (in blue) is able to follow the ground truth with high accuracy. On the other hand [8] (in red) exhibits jumping behavior leading to early divergence from ground truth.
5.4 Tracking Using Frames Reconstructed from Event Data
Recent research [34,35,36,37] has shown that events can be combined to reconstruct intensity frames that inherit the outstanding properties of event cameras (high dynamic range (HDR) and lack of motion blur). In the next experiment, we show that our tracker can be used on such reconstructed images, thus removing the limitations imposed by standard cameras. As an illustration, we focus here on demonstrating feature tracking in HDR scenes (Fig. 9). However, our method could also be used to perform feature tracking during high-speed motions by using motion-blur–free images reconstructed from events.
Standard cameras have a limited dynamic range (60 dB), which often results in under- or over-exposed areas of the sensor in scenes with a high dynamic range (Fig. 9(b)), which in turn can lead to tracking loss. Event cameras, however, have a much larger dynamic range (140 dB) (Fig. 9(b)), thus providing valuable tracking information in those problematic areas. Figure 9(c)–(d) show qualitatively how our method can exploit HDR intensity images reconstructed from a set of events [34, 35] to produce feature tracks in such difficult conditions. For example, Fig. 9(d) shows that some feature tracks were initialized in originally overexposed areas, such as the top right of the image (Fig. 9). Note that our tracker only requires a limited number of reconstructed images since features can be tracked for several seconds. This complements the computationally-demanding task of image reconstruction.
Supplementary Material. We encourage the reader to inspect the video, additional figures, tables and experiments provided in the supplementary material.
6 Discussion
While our method advances event-based feature tracking in natural scenes, there remain directions for future research. For example, the generative model we use to predict events is an approximation that does not account for severe dynamic effects and noise. In addition, our method assumes uniform optical flow in the vicinity of features. This assumption breaks down at occlusions and at objects undergoing large flow distortions, such as motion along the camera’s optical axis. Nevertheless, as shown in the experiments, many features in a variety of scenes and motions do not suffer from such effects, and are therefore tracked well (with sub-pixel accuracy). Finally, we demonstrated the method using a Euclidean warp since it was more stable than more complex warping models (e.g., affine). Future research includes ways to make the method more robust to sensor noise and to use more accurate warping models.
7 Conclusion
We presented a method that leverages the complementarity of event cameras and standard cameras to track visual features with low-latency. Our method extracts features on frames and subsequently tracks them asynchronously using events. To achieve this, we presented the first method that relates events directly to pixel intensities in frames via a generative event model. We thoroughly evaluated the method on a variety of sequences, showing that it produces feature tracks that are both more accurate (subpixel accuracy) and longer than the state of the art. We believe this work will open the door to unlock the advantages of event cameras on various computer vision tasks that rely on accurate feature tracking.
Notes
- 1.
- 2.
Equation (3) can be shown [24] by substituting the brightness constancy assumption (i.e., optical flow constraint) \( \frac{\partial L}{\partial t}(\mathbf {u}(t),t) + \nabla L(\mathbf {u}(t),t) \cdot \dot{\mathbf {u}}(t) = 0, \) with image-point velocity \(\mathbf {v}\equiv \dot{\mathbf {u}}\), in Taylor’s approximation \(\varDelta L(\mathbf {u},t) \doteq L(\mathbf {u},t) - L(\mathbf {u},t - \varDelta \tau ) \approx \frac{\partial L}{\partial t}(\mathbf {u},t) \varDelta \tau \).
- 3.
A justification of the choice of patch size can be found in the supplementary material.
References
Lichtsteiner, P., Posch, C., Delbruck, T.: A \(128 \times 128\) 120 dB 15 \(\upmu \)s latency asynchronous temporal contrast vision sensor. IEEE J. Solid-State Circ. 43(2), 566–576 (2008)
Brandli, C., Berner, R., Yang, M., Liu, S.C., Delbruck, T.: A \(240 \times 180\) 130 dB 3 \(\upmu \)s latency global shutter spatiotemporal vision sensor. IEEE J. Solid-State Circ. 49(10), 2333–2341 (2014)
Zhou, H., Yuan, Y., Shi, C.: Object tracking using SIFT features and mean shift. Comput. Vis. Image. Und. 113(3), 345–352 (2009)
Klein, G., Murray, D.: Parallel tracking and mapping on a camera phone. In: IEEE ACM International Symposium on Mixed and Augmented Reality (ISMAR) (2009)
Forster, C., Zhang, Z., Gassner, M., Werlberger, M., Scaramuzza, D.: SVO: semidirect visual odometry for monocular and multicamera systems. IEEE Trans. Robot. 33(2), 249–265 (2017)
Mur-Artal, R., Montiel, J.M.M., Tardós, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Robot. 31(5), 1147–1163 (2015)
Rosinol Vidal, A., Rebecq, H., Horstschaefer, T., Scaramuzza, D.: Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high speed scenarios. IEEE Robot. Autom. Lett. 3(2), 994–1001 (2018)
Kueng, B., Mueggler, E., Gallego, G., Scaramuzza, D.: Low-latency visual odometry using event-based feature tracks. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, pp. 16–23, October 2016
Zhu, A.Z., Atanasov, N., Daniilidis, K.: Event-based feature tracking with probabilistic data association. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 4465–4470 (2017)
Mueggler, E., Rebecq, H., Gallego, G., Delbruck, T., Scaramuzza, D.: The event-camera dataset and simulator: event-based data for pose estimation, visual odometry, and SLAM. Int. J. Robot. Res. 36, 142–149 (2017)
Mueggler, E., Huber, B., Scaramuzza, D.: Event-based, 6-DOF pose tracking for high-speed maneuvers. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2761–2768 (2014). Event camera animation: https://youtu.be/LauQ6LWTkxM?t=25
Ni, Z., Bolopion, A., Agnus, J., Benosman, R., Regnier, S.: Asynchronous event-based visual shape tracking for stable haptic feedback in microrobotics. IEEE Trans. Robot. 28, 1081–1089 (2012)
Lagorce, X., Meyer, C., Ieng, S.H., Filliat, D., Benosman, R.: Asynchronous event-based multikernel algorithm for high-speed visual features tracking. IEEE Trans. Neural Netw. Learn. Syst. 26(8), 1710–1720 (2015)
Clady, X., Ieng, S.H., Benosman, R.: Asynchronous event-based corner detection and matching. Neural Netw. 66, 91–106 (2015)
Tedaldi, D., Gallego, G., Mueggler, E., Scaramuzza, D.: Feature detection and tracking with the dynamic and active-pixel vision sensor (DAVIS). In: International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP), pp. 1–7 (2016)
Clady, X., Maro, J.M., Barré, S., Benosman, R.B.: A motion-based feature for event-based pattern recognition. Front. Neurosci. 10, 594 (2017)
Vasco, V., Glover, A., Bartolozzi, C.: Fast event-based Harris corner detection exploiting the advantages of event-driven cameras. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2016)
Mueggler, E., Bartolozzi, C., Scaramuzza, D.: Fast event-based corner detection. In: British Machine Vision Conference (BMVC) (2017)
Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of Fourth Alvey Vision Conference, Manchester, UK, vol. 15, pp. 147–151 (1988)
Rosten, E., Drummond, T.: Machine learning for high-speed corner detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 430–443. Springer, Heidelberg (2006). https://doi.org/10.1007/11744023_34
Chaudhry, R., Ravichandran, A., Hager, G., Vidal, R.: Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1932–1939, June 2009
Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14(2), 239–256 (1992)
Gallego, G., Lund, J.E.A., Mueggler, E., Rebecq, H., Delbruck, T., Scaramuzza, D.: Event-based, 6-DOF camera tracking from photometric depth maps. IEEE Trans. Pattern Anal. Mach. Intell. 40(10), 2402–2412 (2017)
Gallego, G., Forster, C., Mueggler, E., Scaramuzza, D.: Event-based camera pose tracking using a generative event model. arXiv:1510.01972 (2015)
Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 674–679 (1981)
Evangelidis, G.D., Psarakis, E.Z.: Parametric image alignment using enhanced correlation coefficient maximization. IEEE Trans. Pattern Anal. Mach. Intell. 30(10), 1858–1865 (2008)
Agarwal, A., Mierle, K., et al.: Ceres solver. http://ceres-solver.org
Gallego, G., Scaramuzza, D.: Accurate angular velocity estimation with an event camera. IEEE Robot. Autom. Lett. 2, 632–639 (2017)
Gallego, G., Rebecq, H., Scaramuzza, D.: A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3867–3876 (2018)
Rebecq, H., Gallego, G., Mueggler, E., Scaramuzza, D.: EMVS: event-based multi-view stereo–3D reconstruction with an event camera in real-time. Int. J. Comput. Vis. 1–21 (2017)
Rebecq, H., Horstschaefer, T., Scaramuzza, D.: Real-time visual-inertial odometry for event cameras using keyframe-based nonlinear optimization. In: British Machine Vision Conference (BMVC), September 2017
Maqueda, A.I., Loquercio, A., Gallego, G., García, N., Scaramuzza, D.: Event-based vision meets deep learning on steering prediction for self-driving cars. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5419–5427 (2018)
Bardow, P., Davison, A.J., Leutenegger, S.: Simultaneous optical flow and intensity estimation from an event camera. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 884–892 (2016)
Kim, H., Handa, A., Benosman, R., Ieng, S.H., Davison, A.J.: Simultaneous mosaicing and tracking with an event camera. In: British Machine Vision Conference (BMVC) (2014)
Rebecq, H., Horstschäfer, T., Gallego, G., Scaramuzza, D.: EVO: a geometric approach to event-based 6-DOF parallel tracking and mapping in real-time. IEEE Robot. Autom. Lett. 2, 593–600 (2017)
Reinbacher, C., Graber, G., Pock, T.: Real-time intensity-image reconstruction for event cameras using manifold regularisation. In: British Machine Vision Conference (BMVC) (2016)
Munda, G., Reinbacher, C., Pock, T.: Real-time intensity-image reconstruction for event cameras using manifold regularisation. Int. J. Comput. Vis. 1–13 (2018)
Acknowledgment
This work was supported by the DARPA FLA program, the Swiss National Centre of Competence in Research (NCCR) Robotics through the Swiss National Science Foundation, and the SNSF-ERC Starting Grant.
Electronic Supplementary Material
Supplementary video (mp4, 74492 KB)
© 2018 Springer Nature Switzerland AG
Gehrig, D., Rebecq, H., Gallego, G., Scaramuzza, D. (2018). Asynchronous, Photometric Feature Tracking Using Events and Frames. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science(), vol 11216. Springer, Cham. https://doi.org/10.1007/978-3-030-01258-8_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01257-1
Online ISBN: 978-3-030-01258-8