
1 Introduction

Camera motion estimation is an important technology with many applications in automation, smart transportation, and assistive technologies. However, despite the maturity the field has reached, challenges remain in scenarios with high dynamics, low texture distinctiveness, or difficult illumination conditions [5, 9]. Event cameras (also called dynamic vision sensors) present an interesting alternative in this regard, as they pair high dynamic range (HDR) with high temporal resolution. The potential advantages and challenges of event-based vision are well explained in the original work of Brandli et al. [3] as well as the recent survey by Gallego et al. [10].

Our work considers fronto-parallel motion estimation of an event camera. The flow of the events is hereby modelled by a general homographic warping in a space-time volume, and motion may be estimated by maximisation of contrast in the image of unwarped events  [12]. Various reward functions that maximise contrast have been presented and analysed in the recent works of Gallego et al.  [11] and Stoffregen and Kleeman  [29], and successfully used for solving a variety of problems with event cameras such as optical flow  [12, 28, 32, 34, 36, 37], segmentation  [21, 27, 28], 3D reconstruction  [26, 32, 35, 37], and motion estimation  [12, 13]. Our work focuses on the latter problem of camera motion estimation. However—different from many of the aforementioned works—we propose the first globally optimal solution to the underlying contrast maximisation problem, an important point given its generally non-convex nature.

Our detailed contributions are as follows:

  • We solve the global maximisation of contrast functions via Branch and Bound.

  • We derive bounds for six different contrast estimation functions. The bounds are furthermore calculated recursively, which enables efficient processing.

  • We successfully apply this strategy to Autonomous Ground Vehicle (AGV) planar motion estimation with a downward facing event camera (cf. Fig. 1), a problem that is complicated by motion blur, challenging illumination conditions, and indistinctive, noisy textures. We demonstrate that an event camera can overcome these challenges, thereby outperforming alternatives based on regular cameras.

Fig. 1.

(a): AGV equipped with a downward facing event camera for vehicle motion estimation. (b)–(d): collected image with detectable corners, image of warped events with \(\varvec{\theta } = \mathbf {0}\), and image of warped events with optimal parameters \(\hat{\varvec{\theta }}\).

2 Contrast Maximisation

Gallego et al.  [12] recently introduced contrast maximisation as a unifying framework allowing the solution of several important problems for dynamic vision sensors, in particular motion estimation problems in which the effect of camera motion may be described by a homography (e.g. motion in front of a plane, pure rotation). Our work relies on contrast maximisation, which we therefore briefly review in the following.

An event camera outputs a sequence of events denoting temporal logarithmic brightness changes above a certain threshold. An event \(e = \left\{ \mathbf {x},\ t,\ s \right\} \) is described by its pixel position \(\mathbf {{x}} = [x\text { }y]^T\), timestamp t, and polarity s (the latter indicates whether the brightness is increasing or decreasing, and is ignored in the present work). The core idea of contrast maximisation is relatively straightforward: The flow of the events is modelled by a time-parametrised homography. Given its position and time-stamp, every event may therefore be warped back along a point-trajectory into a reference view with timestamp \(t_{\mathrm {ref}}\). Since events are more likely to be generated by high-gradient edges, correct homographic warping parameters will likely lead to a sharp Image of Warped Events (IWE) in which events align along a crisp edge-map. Gallego et al.  [12] simply propose to consider the contrast of the IWE as a reward function to identify the correct homographic warping parameters. Note that homographic warping functions include 2D affine and Euclidean transformations, and thus can be used in a variety of vision problems such as optical flow, feature tracking, or fronto-parallel motion estimation.

Suppose we are given a set of N events \(\mathcal {E} = \{e_{k}\}_{k=1}^{N}\). We define a general warping function \({\mathbf {{x}}_{k}^{\prime }} = W(\mathbf {{x}}_k,t_{k};\varvec{\theta })\) that returns the position \({\mathbf {{x}}_{k}^{\prime }}\) of an event \(e_k\) in the reference view at time \(t_{\mathrm {ref}}\). \(\varvec{\theta }\) is a vector of warping parameters. The IWE is generated by accumulating warped events at each discrete pixel location:

$$\begin{aligned} I(\mathbf {p}_{ij};\varvec{\theta }) = \sum _{k = 1}^{N}\mathbf {1}(\mathbf {p}_{ij}-{\mathbf {{x}}_{k}^{\prime }}) = \sum _{k = 1}^{N}\mathbf {1}(\mathbf {p}_{ij}-W(\mathbf {{x}}_k,t_{k};\varvec{\theta })) \,, \end{aligned}$$
(1)

where \(\mathbf {1}(\cdot )\) is an indicator function that returns 1 if the absolute value of \((\mathbf {p}_{ij} - {\mathbf {{x}}_{k}^{\prime }})\) is less than a threshold \(\epsilon \) in each coordinate, and 0 otherwise. \(\mathbf {p}_{ij}\) is a pixel in the IWE with coordinates \([i\text { }j]^T\), and we refer to it as an accumulator location. We set \(\epsilon = 0.5\) such that each warped event will increment one accumulator only.
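To make the accumulation in (1) concrete, the following minimal numpy sketch builds an IWE; the event layout \((x, y, t)\) and the generic warp callable are our own illustrative assumptions, not a prescribed interface.

```python
import numpy as np

def build_iwe(events, warp, theta, height, width, t_ref=0.0):
    """Accumulate the Image of Warped Events (IWE) of Eq. (1).

    events : (N, 3) array with columns (x, y, t); polarity is ignored.
    warp   : callable (x, y, t, theta) -> (x', y') mapping an event into the
             reference view at t_ref (stands in for W in the text).
    """
    iwe = np.zeros((height, width), dtype=np.int64)
    for x, y, t in events:
        xw, yw = warp(x, y, t - t_ref, theta)
        # With epsilon = 0.5 the indicator selects the single accumulator
        # closest to the warped position, i.e. the nearest pixel.
        i, j = int(round(yw)), int(round(xw))
        if 0 <= i < height and 0 <= j < width:
            iwe[i, j] += 1
    return iwe
```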

Existing approaches replace the indicator function with a Gaussian kernel to make the IWE a smooth function of the warped events, and thus solve contrast maximisation problems via local optimisation methods (cf. [11,12,13]). In contrast, we show how our proposed method is able to find the global optimum of the above, discrete objective function.

As introduced in [11, 29], reward functions for event un-warping all rely on the idea of maximising the contrast or sharpness of the IWE (they have also been denoted as focus loss functions). They proceed by integration over the entire set of accumulators, which we denote \(\mathcal {P}\). The most relevant ones for us are summarized in Table 1. Note that for \(L_{\mathrm {Var}}\), \(\mu _{I}\) is the mean value of \(I(\mathbf {p}_{ij};\varvec{\theta })\) over all pixels (a function of \(\varvec{\theta }\) itself), and \(N_{p}\) is the total number of accumulators in I. For \(L_{\mathrm {SoSA}}\), \(\delta \) is a design parameter called the shift factor. Different from the other objective functions, locations with few accumulations contribute more to \(L_{\mathrm {SoSA}}\). The intuition here is that more empty locations again mean more events concentrated at fewer accumulators. \(L_{\mathrm {SoEaS}}\) is a combination of \(L_{\mathrm {SoS}}\) and \(L_{\mathrm {SoE}}\). Similarly, \(L_{\mathrm {SoSAaS}}\) is a combination of \(L_{\mathrm {SoS}}\) and \(L_{\mathrm {SoSA}}\).
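For the two losses whose definitions follow directly from the text and from (2), a short sketch of how they would be evaluated on an IWE; the remaining functions of Table 1 follow the same pattern.

```python
import numpy as np

def contrast_variance(iwe):
    # L_Var: variance of the IWE over all N_p accumulators,
    # with mu_I the mean accumulator value.
    iwe = iwe.astype(np.float64)
    return np.mean((iwe - iwe.mean()) ** 2)

def contrast_sos(iwe):
    # L_SoS: sum of squared accumulator values, cf. Eq. (2).
    return np.sum(iwe.astype(np.float64) ** 2)
```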

Let us now proceed to the main contribution of our work, which is a derivation of bounds on the above objectives as required by Branch and Bound.

Table 1. Contrast functions evaluated in this work

3 Globally Maximised Contrast Using Branch and Bound

Figure 2 illustrates how contrast maximisation for motion estimation is in general a non-convex problem, meaning that local optimisation may be sensitive to the initial parameters and fail to find the global optimum. We tackle this problem by introducing a globally optimal solution to contrast maximisation using Branch and Bound (BnB) optimisation. BnB is an algorithmic paradigm in which the solution space is subdivided into branches for which we then find upper and lower bounds on the maximal objective value. The globally optimal solution is isolated by an iterative search in which entire branches are discarded whenever their upper bound on the maximum objective value remains lower than the corresponding lower bound in another branch. The most important factor determining the effectiveness of this approach is the tightness of the bounds.

Our core contribution is given by a recursive method to efficiently calculate upper and lower bounds for the maximum value of a contrast maximisation function over a given branch. In short, the main idea is given by expressing a bound over \((N+1)\) events as a function of the bound over N events plus the contribution of one additional event. The strategy can be similarly applied to all six aforementioned contrast functions, which is why we limit the exposition to the derivation of bounds for \(L_{\mathrm {SoS}}\). Detailed derivations for all loss functions are provided in the supplementary material.

Fig. 2.

Visualization of the Sum of Squares contrast function. The camera is moving in front of a plane, and the motion parameters are given by translational and rotational velocity (cf. Sect. 4). The sub-figures from left to right are functions with increasing Noise-to-Events (N/E) ratios. Note that contrast functions are non-convex.

3.1 Objective Function

In the following, we assume that \(L=L_{\mathrm {SoS}}\). The maximum objective function value over all N events in a given time interval \([t_{\mathrm {ref}}, t_{\mathrm {ref}}+\Delta T]\) is given by

$$\begin{aligned} L_{N} = \max _{\begin{array}{c} \varvec{\theta }\in \varvec{\varTheta } \end{array}} \sum _\mathbf{p _{ij}\in \mathcal {P}} \left[ \sum _{k = 1}^{N}\textit{\textbf{1}} \left( \mathbf{p} _{ij}-W(\mathbf {{x}}_k,t_{k};\varvec{\theta }) \right) \right] ^2, \end{aligned}$$
(2)

where \(\varvec{\varTheta }\) is the search space (i.e. branch or sub-branch) over which we want to maximise the objective. Most globally optimal methods for geometric computer vision problems find bounds by a spatial division of the problem into individual, simpler maximisation sub-problems (cf. [6]). However, the contrast maximisation objective is related to the distribution over the entire IWE and not just individual accumulators, which complicates this strategy.

3.2 Upper and Lower Bound

The bounds are calculated recursively by processing the events one by one, each time updating the IWE. The events are notably processed in temporal order, i.e. with increasing timestamps.

The lower bound is readily given by evaluating the contrast function at an arbitrary point in the interval \(\varvec{\varTheta }\), which is commonly chosen as the interval centre \(\varvec{\theta }_{0}\). We present a recursive rule to efficiently evaluate this lower bound.

Theorem 1

For search space \(\varvec{\varTheta }\) centered at \(\varvec{\theta }_{0}\), the lower bound of SoS-based contrast maximisation may be given by

$$\begin{aligned} \underline{L_{N+1}} = \underline{L_{N}}+1+2 I^{N}(\varvec{\eta }_{N+1}^{\theta _0};\varvec{\theta }_{0}), \end{aligned}$$
(3)

where \(I^N(\mathbf {p}_{ij};\varvec{\theta }_{0})\) is the incrementally constructed IWE, its superscript N denotes the number of events that have already been taken into account, and

$$\begin{aligned} \varvec{\eta }_{N+1}^{\theta _0} = \mathrm {round}(W(\mathbf {{x}}_{N+1},t_{N+1};\varvec{\theta }_{0})) \end{aligned}$$
(4)

returns the accumulator closest to the warped position of the \((N+1)\)-th event.

Proof

According to the definition of the sum-of-squares focus loss function,

$$\begin{aligned} \begin{aligned} \underline{L_{N+1}}&= \sum _\mathbf{p _{ij}\in \mathcal {P}} \left[ \sum _{k = 1}^{N+1}\textit{\textbf{1}} \left( \mathbf{p} _{ij}-W(\mathbf {{x}}_k,t_{k};\varvec{\theta _0}) \right) \right] ^2 \\&= \sum _{\mathbf {p}_{ij} \in \mathcal {P}} \left[ I^{N}(\mathbf {p}_{ij};\varvec{\theta }_0) + \textit{\textbf{1}} \left( \mathbf {p}_{ij}-W(\mathbf {{x}}_{N+1},t_{N+1};\varvec{\theta _0}) \right) \right] ^2 \\&= a+b+c \, \text {, where} \end{aligned} \end{aligned}$$
(5)
$$\begin{aligned} \begin{aligned}&a = \sum _{\mathbf {p}_{ij} \in \mathcal {P}} I^{N}(\mathbf {p}_{ij};\varvec{\theta }_0)^2 \,, \\&b = 2\sum _{\mathbf {p}_{ij}\in \mathcal {P}} \left[ \textit{\textbf{1}}(\mathbf {p}_{ij}-W(\mathbf {{x}}_{N+1},t_{N+1};\varvec{\theta }_0)) I^{N}(\mathbf {p}_{ij};\varvec{\theta }_0) \right] \,, \\&c = \sum _{\mathbf {p}_{ij}\in \mathcal {P}} \left[ \textit{\textbf{1}}(\mathbf {p}_{ij}-W(\mathbf {{x}}_{N+1},t_{N+1};\varvec{\theta }_0)) \right] ^2. \end{aligned} \end{aligned}$$

It is clear that \(a = \underline{L_{N}}\). In c, owing to the definition of our indicator function, only the \(\mathbf {p}_{ij}\) which is closest to \(W(\mathbf {{x}}_{N+1},t_{N+1};\varvec{\theta }_0)\) makes a contribution, thus we have \(c = 1\). For b, the term \(\textit{\textbf{1}}(\mathbf {p}_{ij}-W(\mathbf {{x}}_{N+1},t_{N+1};\varvec{\theta }_0))\) is simply zero unless we are considering the accumulator \(\mathbf {p}_{ij} = \varvec{\eta }_{N+1}^{\theta _0}\), which gives \(b = 2I^{N}(\varvec{\eta }_{N+1}^{\theta _0};\varvec{\theta }_{0})\). Thus we obtain (3). Note that the IWE is iteratively updated by incrementing the accumulator \(\varvec{\eta }_{N+1}^{\theta _0}\), i.e. the one closest to the warped event.
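A minimal sketch of this recursive lower-bound evaluation, reusing the event layout and warp callable assumed earlier; events warped outside the image are simply skipped, which leaves their contribution at zero.

```python
import numpy as np

def lower_bound_sos(events, warp, theta0, height, width, t_ref=0.0):
    """Recursive lower bound of Theorem 1 (SoS loss evaluated at theta_0)."""
    iwe = np.zeros((height, width), dtype=np.int64)
    L = 0  # base case: L_0 = 0
    for x, y, t in events:  # events sorted by increasing timestamp
        xw, yw = warp(x, y, t - t_ref, theta0)
        i, j = int(round(yw)), int(round(xw))  # eta_{N+1}^{theta_0}
        if not (0 <= i < height and 0 <= j < width):
            continue
        # Eq. (3): L_{N+1} = L_N + 1 + 2 * I^N(eta)
        L += 1 + 2 * iwe[i, j]
        iwe[i, j] += 1
    return L, iwe
```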

Fig. 3.

(a) Incremental update of the IWE. For each new event e, we choose and increment the currently maximal accumulator in the bounding box \(\mathcal {P}^{\varvec{\varTheta }}\) around all possible locations \(W(\mathbf {{x}},t;\varvec{\theta }\in \varvec{\varTheta })\). We simply increment the center of the bounding box if no other accumulator exists. (b) Bounding boxes of two temporally distinct events generated by the same point in 3D.

We now proceed to our main contribution, a recursive upper bound for the contrast maximisation problem. Let us define \(\mathcal {P}^{\varvec{\varTheta }}_{i}\) as the bounding box around all possible locations \(W(\mathbf {{x}}_i,t_i;\varvec{\theta }\in \varvec{\varTheta })\) of the un-warped event. Lemma 1 is introduced as follows.

Lemma 1

Given a search space \(\varvec{\theta } \in \varvec{\varTheta }\), for a small enough time interval, if \(W(\mathbf {{x}}_i,t_i;\varvec{\theta })\) = \(W(\mathbf {{x}}_j,t_j;\varvec{\theta })\) and \(0< i < j \le N\), we have \(\mathcal {P}^{\varvec{\varTheta }}_{i} \subseteq \mathcal {P}^{\varvec{\varTheta }}_{j}\). An intuitive explanation is given in Fig. 3(b).

Lemma 1 now enables us to derive our recursive upper bound.

Theorem 2

The upper bound of the objective function \(L_N\) for SoS-based contrast maximisation satisfies

$$\begin{aligned} L_{N+1}= & {} L_{N}+1+2I^{N}(\varvec{\eta }_{N+1}^{\hat{\theta }};\hat{\varvec{\theta }}) \end{aligned}$$
(6)
$$\begin{aligned}\le & {} \overline{L_{N}}+1+2Q^N = \overline{L_{N+1}}, \, \\ \textit{where } Q^N= & {} \max _{\begin{array}{c} \mathbf {p}_{ij}\in \mathcal {P}_{N+1}^{\varvec{\varTheta }} \end{array}} \overline{I}^{N}(\mathbf {p}_{ij}) \ge I^{N}(\varvec{\eta }_{N+1}^{\hat{\theta }};\hat{\varvec{\theta }}) \, \nonumber \end{aligned}$$
(7)

\(\mathcal {P}_{N+1}^{\varvec{\varTheta }}\) is a bounding box for the \((N+1)\)-th event. \(\hat{\varvec{\theta }}\) is the optimal parameter set that maximises \(L_{N+1}\) over the interval \(\varvec{\varTheta }\). \(\overline{I}^{N}(\mathbf {p}_{ij})\) is the value of pixel \(\mathbf {p}_{ij}\) in the upper bound IWE, a recursively constructed image in which we always increment the maximum accumulator within the bounding box \(\mathcal {P}_{N+1}^{\varvec{\varTheta }}\) (i.e. the one used to define the value of \(Q^N\)). The incremental construction of \(\overline{I}^{N}(\mathbf {p}_{ij})\) is illustrated in Fig. 3(a).

Proof

(6) is straightforwardly derived from (3). The proof of inequality (7) then proceeds by mathematical induction.

For N = 0, it is obvious that \(L_{0} = \overline{L_{0}}= 0\). Similarly, for N = 1, \(L_{1} = 1 \le \overline{L_{0}} + 1 + 0\), and \(Q^0 = I^{0}(\varvec{\eta }_{1}^{\hat{\theta }};\hat{\varvec{\theta }}) = 0\) (which satisfies Theorem 2). We now assume that \(\overline{L_n}\) as well as the corresponding upper bound IWE \(\overline{I}^{n}\) are given for all \(0<n\le N\). We furthermore assume that they satisfy Theorem 2. Our aim is to prove that (7) holds for the \((N+1)\)-th event. It is clear that \(\overline{L_N} \ge L_N\), and we only need to prove that \(Q^N \ge I^{N}(\varvec{\eta }_{N+1}^{\hat{\theta }};\hat{\varvec{\theta }})\), for which we will make use of Lemma 1. There are two cases to be distinguished:

  • The first case is if there exists an event \(e_k\) with \(0< k < N+1\) and for which \(\varvec{\eta }_{k}^{\hat{\theta }} = \varvec{\eta }_{N+1}^{\hat{\theta }}\). In other words, the k-th and the \((N+1)\)-th event are warped to the same accumulator if choosing the locally optimal parameters. Note that if there are multiple previous events for which this condition holds, the k-th event is chosen to be the most recent one. Given our assumptions, \(\overline{L_{k-1}}\) as well as the \((k-1)\)-th constructed upper bound IWE satisfy Theorem 2, which means that \(Q^{k-1} \ge I^{k-1}(\varvec{\eta }_{k}^{\hat{\theta }};\hat{\varvec{\theta }})\). Let \(\mathbf {p}_k \in \mathcal {P}^{\varTheta }_k\) now be the pixel location with maximum intensity in \(\overline{I}^{k-1}\). Then, the k-th updated IWE satisfies \(\overline{I}^k(\mathbf {p}_k) = Q^{k-1}+1 \ge I^{k-1}(\varvec{\eta }_{k}^{\hat{\theta }};\hat{\varvec{\theta }}) +1 \). According to Lemma 1, we have \(\mathcal {P}^{\varTheta }_k \subseteq \mathcal {P}^{\varTheta }_{N+1}\), therefore \(\mathbf {p}_k \in \mathcal {P}^{\varTheta }_{N+1}\), and \(Q^N \ge \overline{I}^k(\mathbf {p}_k) \ge I^{k-1}(\varvec{\eta }_{k}^{\hat{\theta }};\hat{\varvec{\theta }})+1 \). With optimal warp parameters \(\hat{\varvec{\theta }}\), events with indices from \(k+1\) to N will not be warped to \(\varvec{\eta }_{N+1}^{\hat{\theta }} \), and therefore \(I^{k-1}(\varvec{\eta }_{k}^{\hat{\theta }};\hat{\varvec{\theta }})+1 = I^{N}(\varvec{\eta }_{N+1}^{\hat{\theta }};\hat{\varvec{\theta }}) \le Q^N \).

  • If there is no such event, it is obvious that \(Q^N \ge I^{N}(\varvec{\eta }_{N+1}^{\hat{\theta }};\hat{\varvec{\theta }})\).

With the base cases and the induction step proven, we conclude our proof that Theorem 2 holds for all natural numbers N.
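The recursive upper bound can be sketched analogously to the lower bound above; the bounding-box callable is an assumption here and is made concrete for the Ackermann model in Sect. 4.1.

```python
import numpy as np

def upper_bound_sos(events, bbox, height, width):
    """Recursive upper bound of Theorem 2 (SoS loss) over a branch Theta.

    bbox : callable (x, y, t) -> (i_min, i_max, j_min, j_max), the box
           P^Theta around all possible warped positions of the event,
           assumed to be clipped to the image area.
    """
    iwe_bar = np.zeros((height, width), dtype=np.int64)  # upper-bound IWE
    L_bar = 0  # base case
    for x, y, t in events:  # processed in temporal order
        i0, i1, j0, j1 = bbox(x, y, t)
        patch = iwe_bar[i0:i1 + 1, j0:j1 + 1]
        q = patch.max()  # Q^N: maximal accumulator inside the bounding box
        # Eq. (7): bar{L}_{N+1} = bar{L}_N + 1 + 2 Q^N
        L_bar += 1 + 2 * q
        if q == 0:
            # empty box: increment its centre (cf. Fig. 3(a))
            ic, jc = (i0 + i1) // 2, (j0 + j1) // 2
        else:
            k = np.unravel_index(np.argmax(patch), patch.shape)
            ic, jc = i0 + k[0], j0 + k[1]
        iwe_bar[ic, jc] += 1
    return L_bar
```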

We apply the proposed strategy to derive upper and lower bounds for all six aforementioned contrast functions, and list them in Table 2. Note that the initial case varies for different loss functions. The globally-optimal contrast maximisation framework (GOCMF) is outlined in Algorithm 1 and Algorithm 2. We propose a nested strategy for calculating upper bounds, in which the outer layer RB evaluates the objective function, while the inner layer BB estimates the bounding box \(\mathcal {P}_{N}^{\varvec{\varTheta }}\) and depends on the specific motion parametrisation.
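Since the algorithm listings themselves are only referenced here, the following best-first BnB sketch illustrates how the recursive bounds could drive the search over \((\omega , v)\); the queue-based strategy, function names, and termination rule are our own illustration rather than the exact GOCMF procedure.

```python
import heapq

def bnb_contrast_max(events, bounds_fn, w_range, v_range, tol_w, tol_v):
    """Best-first Branch-and-Bound sketch over theta = (omega, v).

    bounds_fn(events, ((w0, w1), (v0, v1))) -> (lower, upper) is assumed to
    wrap the recursive bounds of Table 2 (e.g. the SoS sketches above).
    """
    root = (tuple(w_range), tuple(v_range))
    lower, upper = bounds_fn(events, root)
    best_lower = lower
    best_theta = (sum(root[0]) / 2.0, sum(root[1]) / 2.0)
    queue = [(-upper, root)]
    while queue:
        neg_upper, ((w0, w1), (v0, v1)) = heapq.heappop(queue)
        if -neg_upper <= best_lower:
            break  # no remaining branch can beat the incumbent
        if (w1 - w0) <= tol_w and (v1 - v0) <= tol_v:
            continue  # branch already resolved to the required accuracy
        # split the longer side of the rectangular branch
        if (w1 - w0) >= (v1 - v0):
            m = 0.5 * (w0 + w1)
            children = [((w0, m), (v0, v1)), ((m, w1), (v0, v1))]
        else:
            m = 0.5 * (v0 + v1)
            children = [((w0, w1), (v0, m)), ((w0, w1), (m, v1))]
        for child in children:
            lo, up = bounds_fn(events, child)
            if lo > best_lower:
                best_lower = lo
                best_theta = (sum(child[0]) / 2.0, sum(child[1]) / 2.0)
            if up > best_lower:
                heapq.heappush(queue, (-up, child))
    return best_theta, best_lower
```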

Table 2. Recursive Upper and Lower Bounds

4 Application to Visual Odometry with a Downward-Facing Event Camera

Motion estimation for planar Autonomous Ground Vehicles (AGVs) is an important problem in intelligent transportation  [17, 24, 30]. An interesting alternative is given by employing a downward instead of a forward facing camera, thus permitting direct observation of the ground plane with known depth. This largely simplifies the geometry of the problem and notably turns the image-to-image warping into a homographic mapping that is linear in homogeneous space. The strategy is widely used in relevant applications such as sweeping robots and factory AGVs, and a good review is presented in  [1]. However, the method is affected by potentially severe challenges given by the image appearance: a) reliable feature matching or even extraction may be difficult for certain noisy ground textures, b) fast motion may easily lead to motion blur, and c) stable appearance may require artificial illumination. Many existing methods therefore do not employ feature correspondences but aim at a correspondence-less alignment or even a full photometric image alignment. Besides more classical RANSAC-based hypothesise-and-test schemes  [7], the community therefore has also developed appearance-based template matching approaches  [8, 15, 22, 23, 33], solvers based on efficient second-order minimisation  [18, 20, 38], and methods exploiting the Fast Fourier Transform  [2, 25], the Fourier-Mellin Transform  [16, 19], or the Improved Fourier Mellin Invariant  [4, 31]. In an attempt to tackle highly self-similar ground textures, Dille et al.  [8] propose to use an optical flow sensor instead of a regular CMOS camera.

A critical design question is the position of the camera. The camera may be mounted at the front or the rear of the vehicle, which gives an increased distance to the ground plane and in turn reduces motion blur. However, it also causes moving shadows in the image, and generally complicates the stabilisation of the image appearance and thus repeatable feature detection or region-based matching. A common alternative is therefore to install the camera underneath the vehicle paired with an artificial light source (e.g. [2, 8]). However, the short distance to the ground plane may easily lead to unwanted motion blur. We therefore consider an event camera as a highly interesting and much more dynamic alternative visual sensor for this particular scenario.

Fig. 4.

Left: The Ackermann steering model with the ICR  [14]. Both a left and a right turn are illustrated. Right: Connections between vehicle displacement, extrinsic transformation, and relative camera pose.

4.1 Homographic Mapping and Bounding Box Extraction

We rely on the globally-optimal BnB solver for correspondence-less AGV motion estimation presented in [14], which also employs a normal, downward facing camera. We employ the two-dimensional Ackermann steering model describing the commonly non-holonomic motion of an AGV. Employing this 2-DoF model leads to benefits in BnB, the complexity of which strongly depends on the dimensionality of the solution space. As illustrated in Fig. 4, the Ackermann model constrains the motion of the vehicle to follow a circular-arc trajectory about an Instantaneous Centre of Rotation (ICR). The motion between successive frames can be conveniently described by two parameters: the half-angle \(\phi \) of the relative rotation, and the baseline \(\rho \) between the two views. However, the alignment of the events requires a temporal parametrisation of the relative pose, which is why we employ the angular velocity \(\omega = \frac{\theta }{t} = \frac{2\phi }{t}\) as well as the translational velocity \(v = \omega r = \omega \rho \frac{1}{2\sin (\phi )}\) in our model. The relative transformation from vehicle frame \(v'\) back to v is therefore given by

(8)

Further details about the derivation are given in the supplementary material.
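As a small numerical illustration of this parametrisation, the helper below converts the per-interval parameters \((\phi , \rho )\) into \((\omega , v)\); it is our own convenience function, not part of the method itself.

```python
import math

def ackermann_velocities(phi, rho, dt):
    """Convert half rotation angle phi [rad] and baseline rho [m] over an
    interval of length dt [s] into (omega [rad/s], v [m/s])."""
    omega = 2.0 * phi / dt  # omega = theta / t = 2 * phi / t
    if abs(phi) < 1e-9:
        v = rho / dt        # limit of omega * rho / (2 sin(phi)) for phi -> 0
    else:
        v = omega * rho / (2.0 * math.sin(phi))
    return omega, v
```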

In practice the vehicle frame rarely coincides with the camera frame. The orientation and the height of the two origins can be chosen to be identical, and the camera can be mounted laterally centred on the vehicle. However, there is likely to be a displacement along the forward direction, which we denote by the signed variable s. In other words, \(\mathbf {R}_{v}^{c}=\mathbf {I}_{3\times 3}\) and \(\mathbf {t}_{v}^{c}=\left[ \begin{matrix} 0&s&0 \end{matrix} \right] ^T\). As illustrated in Fig. 4, the transformation from camera pose \(c'\) (at an arbitrary future timestamp) to c (at the initial timestamp \(t_{\mathrm {ref}}\)) is therefore given by

$$\begin{aligned} \begin{aligned}&\mathbf {R}_c = \mathbf {R}_{v}^{cT}\mathbf {R}_v \mathbf {R}_{v}^{c} \,, \\&\mathbf {t}_c = -\mathbf {R}_{v}^{cT}\mathbf {t}_{v}^{c} + \mathbf {R}_{v}^{cT}\mathbf {t}_v + \mathbf {R}_{v}^{cT}\mathbf {R}_v\mathbf {t}_{v}^{c}. \end{aligned} \end{aligned}$$
(9)
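A direct transcription of (9), assuming \(\mathbf {R}_v\) and \(\mathbf {t}_v\) are the \(3\times 3\) rotation and translation of the vehicle-frame transformation (8):

```python
import numpy as np

def camera_from_vehicle(R_v, t_v, s):
    """Eq. (9): map the vehicle-frame motion (R_v, t_v) into the camera frame,
    with R_v^c = I and t_v^c = [0, s, 0]^T."""
    R_vc = np.eye(3)
    t_vc = np.array([0.0, s, 0.0])
    R_c = R_vc.T @ R_v @ R_vc
    t_c = -R_vc.T @ t_vc + R_vc.T @ t_v + R_vc.T @ R_v @ t_vc
    return R_c, t_c
```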

Using the known plane normal vector \(\mathbf {n}=\left[ \begin{matrix}0&0&-1 \end{matrix}\right] ^T\) and depth-of-plane d, the image warping function \(W(\mathbf {x}_k,t_k;[\omega \text { }v]^T)\) that permits the transfer of an event \(e_k = \{ \mathbf {x}_k,t_k,s_k \}\) into the reference view at \(t_{\mathrm {ref}}\) is finally given by the planar homography equation

(10)

Note that \(\mathbf {K}\) here denotes a regular perspective camera calibration matrix with a single focal length f (identical in both directions), zero skew, and a principal point at \(\left[ \begin{matrix} u_0&v_0 \end{matrix} \right] ^T\). Note further that the substituted time parameter needs to be equal to \(t=t_k-t_{\mathrm {ref}}\), and that the result needs to be dehomogenised. After expansion, we easily obtain

$$\begin{aligned} \mathbf {x}^{\prime }_{k}= & {} W(\mathbf {x}_k,t_k;[\omega \text { }v]^T) = \left[ \begin{matrix} x_{k}^{\prime }&y_{k}^{\prime } \end{matrix} \right] ^T \\= & {} \left[ \begin{matrix} - [y_k - v_0 + s \frac{f}{d}] \sin (\omega t) + [x_k - u_0 - \frac{f}{d} (\frac{v}{\omega })] \cos (\omega t) + \frac{f}{d} (\frac{v}{\omega }) + u_0 \\ [x_k - u_0 - \frac{f}{d} (\frac{v}{\omega })] \sin (\omega t) + [y_k - v_0 + s \frac{f}{d}] \cos (\omega t) - s \frac{f}{d} + v_0 \end{matrix} \right] . \nonumber \end{aligned}$$
(11)

Finally, the bounding box \(\mathcal {P}_{k }^{\varvec{\varTheta }}\) is found by bounding the values of \(x_{k}^{\prime }\) and \(y_{k}^{\prime }\) over the intervals \(\omega \in \mathcal {W}=\left[ \omega _{\mathrm {min}};\omega _{\mathrm {max}}\right] \) and \(v\in \mathcal {V}=\left[ v_{\mathrm {min}};v_{\mathrm {max}}\right] \). The bounding is easily achieved by considering the monotonicity of the functions over the given sub-branches. For example, if \(\omega _{\mathrm {min}} \ge 0\), \(v_{\mathrm {min}} \ge 0\), \(x_k \ge u_0\), and \(y_k \ge v_0 - s \frac{f}{d}\), we obtain

$$\begin{aligned} \underline{x_{k}^{\prime }}= & {} - [y_k - v_0 + s \frac{f}{d}] \sin (\omega _{\mathrm {max}} t) + [x_k - u_0 - \frac{f}{d} (\frac{v_{\mathrm {min}}}{\omega _{\mathrm {max}}})] \cos (\omega _{\mathrm {max}} t) + \frac{f}{d} (\frac{v_{\mathrm {min}}}{\omega _{\mathrm {max}}}) + u_0, \nonumber \\ \overline{x_{k}^{\prime }}= & {} - [y_k - v_0 + s \frac{f}{d}] \sin (\omega _{\mathrm {min}} t) + [x_k - u_0 - \frac{f}{d} (\frac{v_{\mathrm {max}}}{\omega _{\mathrm {min}}})] \cos (\omega _{\mathrm {min}} t) + \frac{f}{d} (\frac{v_{\mathrm {max}}}{\omega _{\mathrm {min}}}) + u_0, \nonumber \\ \underline{y_{k}^{\prime }}= & {} [x_k - u_0 - \frac{f}{d} (\frac{v_{\mathrm {max}}}{\omega _{\mathrm {min}}})] \sin (\omega _{\mathrm {min}} t) + [y_k - v_0 + s \frac{f}{d}] \cos (\omega _{\mathrm {max}} t) - s \frac{f}{d} + v_0, \text { and} \nonumber \\ \overline{y_{k}^{\prime }}= & {} [x_k - u_0 - \frac{f}{d} (\frac{v_{\mathrm {min}}}{\omega _{\mathrm {max}}})] \sin (\omega _{\mathrm {max}} t) + [y_k - v_0 + s \frac{f}{d}] \cos (\omega _{\mathrm {min}} t) - s \frac{f}{d} + v_0. \end{aligned}$$
(12)

We kindly refer the reader to the supplementary material for all further cases.
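The warp of (11) and, for the particular sign case of (12), the bounds on \(x_{k}^{\prime }\) translate directly into code; this sketch assumes \(\omega \ne 0\) (respectively \(\omega _{\mathrm {min}} > 0\)) so that the divisions are well defined, and the bounds on \(y_{k}^{\prime }\) as well as the remaining cases follow the same pattern.

```python
import math

def warp_event(x, y, t, omega, v, f, d, s, u0, v0):
    """Warp of Eq. (11): event at time t (relative to t_ref) mapped into the
    reference view under Ackermann parameters (omega, v); assumes omega != 0."""
    r = (f / d) * (v / omega)
    a = y - v0 + s * f / d
    b = x - u0 - r
    xp = -a * math.sin(omega * t) + b * math.cos(omega * t) + r + u0
    yp = b * math.sin(omega * t) + a * math.cos(omega * t) - s * f / d + v0
    return xp, yp

def bounds_x(x, y, t, w_min, w_max, v_min, v_max, f, d, s, u0, v0):
    """Bounds on x' over a branch, for the case of Eq. (12)
    (omega_min > 0, v_min >= 0, x >= u0, y >= v0 - s*f/d)."""
    a = y - v0 + s * f / d
    r_lo = (f / d) * (v_min / w_max)
    r_hi = (f / d) * (v_max / w_min)
    x_lo = -a * math.sin(w_max * t) + (x - u0 - r_lo) * math.cos(w_max * t) + r_lo + u0
    x_hi = -a * math.sin(w_min * t) + (x - u0 - r_hi) * math.cos(w_min * t) + r_hi + u0
    return x_lo, x_hi
```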

5 Experimental Evaluation

We present two suites of experiments. The first one validates the global optimality, accuracy and robustness of our solver on simulated data. The second one then applies it to the real-world scenario of AGV motion estimation.

5.1 Accuracy and Robustness of Globally Optimal Motion Estimation

We start by evaluating the accuracy of the motion estimation with contrast maximisation function \(L_{\mathrm {SoS}}\) over synthetic data. As already implied in [11], \(L_{\mathrm {SoS}}\) can be considered a solid starting point for the evaluation. Our synthetic data consists of randomly generated horizontal and vertical line segments on a plane at a depth of 2.0 m. We consider Ackermann motion with an angular velocity \(\omega = 28.6479^{\circ }\)/s (0.5 rad/s) and a linear velocity \(v = 0.5\) m/s. Events are generated by randomly choosing a 3D point on a line, and reprojecting it into a random camera pose sampled at a random timestamp within the interval \([0, 0.1\,\mathrm{s}]\). The result of our method is finally evaluated by running BnB over the search space \(\mathcal {W}=[0.4,0.6]\) and \(\mathcal {V}=[0.4,0.6]\), and comparing the retrieved solution against the result of an exhaustive search with sampling points every \(\delta \omega =0.001\) rad/s and \(\delta v=0.001\) m/s. BnB is furthermore configured to terminate the search if \(|\omega _{max}-\omega _{min}| \le 0.00078\) rad/s or \(|v_{max}-v_{min}| \le 0.00078\) m/s. The experiment is repeated 1000 times.

Figures 5(a) and 5(b) illustrate the distribution of the errors for both methods in the noise-free case. The standard deviations of the exhaustive search and BnB are \(\sigma _{\omega }=1.0645^{\circ }\)/s, \(\sigma _{v}=0.0151\) m/s and \(\sigma _{\omega }=1.305^{\circ }\)/s, \(\sigma _{v}=0.0150\) m/s, respectively. While this result suggests that BnB works well and consistently returns a result very close to the optimum found by exhaustive search, we still note that the optimum identified by both methods has a bias with respect to ground truth, even in the noise-free case. Note however that this is related to the nature of the contrast maximisation function, and not to our globally optimal solution strategy.

Fig. 5.

Simulation results. (a) and (b) indicate the error distribution for \(\omega \) and v over all experiments for both our proposed method as well as an exhaustive search. (c) and (d) visualise the average error of the estimated parameters caused by additional salt and pepper noise on the event stream. Results are averaged over 1000 random experiments. Note that our proposed method has excellent robustness even for N/E ratios up to 40%.

In order to analyse robustness, we randomly add salt and pepper noise to the event stream with noise-to-event (N/E) ratios between 0 and 0.4 (example objective functions for different N/E ratios are illustrated in Fig. 2). Figures 5(c) and 5(d) show the error for each noise level, again averaged over 1000 experiments. As can be observed, the errors are very similar and behave more or less independently of the amount of added noise. The latter result underlines the high robustness of our approach.
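For completeness, a small sketch of how such salt-and-pepper noise could be injected into an event stream at a given N/E ratio; the uniform sampling over image area and time window is our own assumption of the noise model.

```python
import numpy as np

def add_salt_pepper_events(events, ne_ratio, width, height, t_span, seed=None):
    """Append uniformly random noise events at a given noise-to-event ratio
    and return the stream re-sorted by timestamp."""
    rng = np.random.default_rng(seed)
    n_noise = int(ne_ratio * len(events))
    noise = np.column_stack([
        rng.uniform(0.0, width, n_noise),   # x
        rng.uniform(0.0, height, n_noise),  # y
        rng.uniform(0.0, t_span, n_noise),  # t
    ])
    merged = np.vstack([events, noise])
    return merged[np.argsort(merged[:, 2])]
```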

5.2 Application to Real Data and Comparison Against Alternatives

We apply our method to real data collected by a DAVIS346 event camera, which outputs event streams with a maximum time resolution of 1 \(\mu \)s as well as regular frames at a frame rate of 30 Hz. Images have a resolution of 346 \(\times \) 260. We mount the camera on the front of an XQ-4 Pro robot and let it face downward. The displacement from the non-steering axis to the camera is \(s = -0.45\) m, and the height difference between camera and ground is \(d = 0.23\) m. We recorded several motion sequences on wood grain foam, which has a highly self-similar texture and makes it challenging to reliably extract and match features. Ground truth is obtained via an Optitrack optical motion tracking system. Our algorithm operates in undistorted coordinates, which is why normalisation and undistortion are applied in advance. The following aspects are evaluated:

Different objective functions: We test the algorithm with all six aforementioned contrast functions over various types of motions, including a straight line, a circle, and an arbitrarily curved trajectory. Table 3 shows the RMS errors of the estimated dynamic parameters, and compares the accuracy of all six alternatives. We furthermore apply two state-of-the-art approaches for regular images, namely the correspondence-less globally optimal feature-based approach (GOVO) from [14], as well as the Improved Fourier Mellin Invariant transform (IFMI) in [4, 31]. Even though these alternatives use the same non-holonomic or planar motion models, the event-based motion estimation methods significantly outperform the intensity-camera-based alternatives (\(L_{\mathrm {SoSAaS}}\) performs best, while \(L_{\mathrm {SoS}}\) and \(L_{\mathrm {Var}}\) also perform well).

Table 3. RMS errors for different datasets and methods.

Event-Based vs Frame-Based: GOVO [14] and IFMI [31] are frame-based algorithms specifically designed for planar AGV motion estimation under featureless conditions. Figure 1 shows an example frame of the wood grain foam texture, and Fig. 7 the results obtained for all methods. As can be observed, GOVO finds as few as three corner features for some of the images, thus making it difficult to accurately recover the vehicle displacement despite the globally-optimal correspondence-less nature of the algorithm. Both IFMI and GOVO occasionally lose tracking (especially for linear motion), which leaves our proposed globally-optimal event-based method using \(L_{\mathrm {SoSAaS}}\) as the best-performing method.

Fig. 6.

Estimated trajectories by our method (SoS), gradient ascent with various initializations, and ground truth (gt). The table indicates the RMS errors for the best performing gradient ascent run and SoS.

BnB vs Gradient Ascent: We apply both gradient ascent and BnB to the Foam dataset with curved motion. For the first temporal interval and the local search method, we vary the initial angular velocity \(\omega \) and linear velocity v between \(-1\) and 0.8 with steps of 0.2 (rad/s or m/s, respectively). For later intervals, we use the previous local optimum as the initialisation. Figure 6 illustrates the estimated trajectories for all initial values, compared against ground truth and a BnB search using \(L_{\mathrm {SoS}}\). RMS errors are also indicated. As clearly shown, even the best initial guess eventually diverges under a local search strategy, thus leading to clearly inferior results compared to our globally optimal search.
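The local baseline relies on a smoothed IWE (cf. the Gaussian-kernel variant mentioned in Sect. 2) and follows its gradient; the sketch below uses a generic finite-difference ascent for illustration, which is not necessarily the exact optimiser used in the referenced works.

```python
import numpy as np

def gradient_ascent(loss_fn, theta_init, step=1e-2, eps=1e-4, iters=200):
    """Local contrast maximisation by finite-difference gradient ascent.
    loss_fn(theta) evaluates a smoothed contrast function; theta = [omega, v]."""
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            d = np.zeros_like(theta)
            d[i] = eps
            grad[i] = (loss_fn(theta + d) - loss_fn(theta - d)) / (2.0 * eps)
        theta += step * grad
    return theta
```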

Fig. 7.

Results for all methods over different datasets. The first two columns are errors over time for \(\omega \) and v, and the third column illustrates a bird’s eye view onto the integrated trajectories.

Various Textures: More results over datasets with other ground floor textures can be found in the supplementary material.

6 Discussion

We have introduced the first globally optimal solution to contrast maximisation for un-warped event streams. To the best of our knowledge, we are also the first to apply the idea of homography estimation via contrast maximisation to the real-world case of non-holonomic motion estimation with a downward facing camera mounted on an AGV. The challenging conditions in this scenario favour dynamic vision sensors over regular frame-based cameras, a claim that is supported by our experimental results. The latter furthermore show that global solutions are important and significantly outperform incremental local refinement. The recursive formulation of our bounds lets us find the global optimum over event streams of 0.04 s in less than one minute, a respectable achievement given the typically low computational efficiency of BnB solvers.