1 Introduction

Event cameras are novel biologically-inspired sensors that differ from conventional cameras. Rather than capturing intensity frames at a fixed rate, each pixel of an event camera responds independently and asynchronously to intensity changes, producing a stream of events that encodes the spatiotemporal coordinates and polarity (sign) of the corresponding brightness changes. Event cameras offer several promising properties, such as low power consumption, high temporal resolution (on the order of microseconds), and high dynamic range (up to 140 dB), making them a valuable alternative and complementary sensor to conventional cameras in challenging scenarios (Gallego et al., 2020).

Since event cameras produce inherently sparse and asynchronous output, it is desirable to transform events into representations compatible with existing computer vision techniques. Spiking neural networks (SNNs) are well-suited to event data due to their minimal latency and high temporal resolution, but the lack of efficient backpropagation makes them difficult to train and thus less effective for event-driven applications (Dongsung & Terrence, 2018). In contrast, modern deep learning architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) can easily handle event data by aggregating sequences of events into grid-like representations. However, this approach has some limitations. First, grid-like representations are constructed by merging and stacking event data within a small time interval, sacrificing temporal information and leading to an irreversible and lossy mapping, which limits the performance of computer vision algorithms. Second, grid-like representations do not consider the sparsity of event data, which leads to redundant computation and may fail to highlight details. Although some active research (Yusuke et al., 2019; Jiancheng et al., 2019; Simon et al., 2022; Baldwin et al., 2022) focuses on architectures designed for sparse data, it introduces negative effects such as computational inefficiency (Jiancheng et al., 2019; Baldwin et al., 2022) and failure to generalize to low-level tasks in the image plane (Yusuke et al., 2019; Jiancheng et al., 2019; Simon et al., 2022) (see Fig. 1).

Fig. 1 CES example generated from shapes_6dof in the Event Camera Dataset (Mueggler et al., 2017). CES volumes are a low-loss representation that records randomly selected frequency components of event signals

Table 1 Comparison of various event representations used for event-driven applications

In this paper, we propose a novel event data representation called compressed event sensing (CES) volumes. Motivated by the compressed sensing technique, which can accurately represent a signal that is sparse in some domain using few samples, we exploit the sparsity of events and the compressed sensing scheme to compress event signals with minimal loss of information. Specifically, we model the events at each pixel as a discrete-time signal. We construct the CES volumes by mapping the event signal into the frequency domain using the Fourier transform and randomly selecting a subset of frequency components. The resulting volumes can be easily processed by existing computer vision algorithms while retaining the high fidelity of the original event data. Downstream applications built on CES volumes achieve superior performance.

The main contributions of this paper are summarized as follows:

  • We propose CES volumes as an effective and efficient representation of event data. Our CES volumes hinge on the sparsity of events and the compressed sensing scheme.

  • We provide a theoretical analysis, based on the neural tangent kernel approximation, showing that the proposed CES representation offers greater expressive power when integrated with a deep learning framework.

  • We demonstrate the merits of the proposed CES, including the restricted isometry property of its sensing matrix and its high fidelity, the latter verified empirically through raw event signal recovery.

  • We qualitatively and quantitatively evaluate the proposed CES representation on a phantom validation of dense frame regression as well as two downstream applications: intensity-image reconstruction and object recognition.

2 Related Work

Event Representation

Owing to the asynchronous and sparse nature of event data, event-driven processing differs considerably from traditional image-driven processing. Event representation is the first and critical step in the event-driven computer vision pipeline: the raw event data is transformed into an intermediary format compatible with existing techniques that implement applications such as classification, image reconstruction, and motion tracking. A comparison of event-based representations and their design choices is summarized in Table 1.

Existing algorithms designed for event cameras trade off high temporal resolution against prediction performance. Prior approaches based on individual events, such as probabilistic filters (Orchard et al., 2015; Lagorce et al., 2016) and spiking neural networks (SNNs) (Zhao et al., 2014; Gehrig et al., 2020), attain minimal latency and high temporal resolution. Filter-based algorithms process events sequentially by merging the information from past events with the incoming one using a continuous-time model. However, as they hinge on handcrafted design and parameter tuning, they are less effective on more complex high-level tasks (Orchard et al., 2015; Amos et al., 2018). As for SNNs (Zhao et al., 2014; Gehrig et al., 2020), although they are more flexible by extending manual filters in a data-driven fashion, they are difficult to train to convergence because SNNs remain an immature technique. Moreover, these representations based on individual events are computationally intensive due to the per-event update.

To efficiently and effectively aggregate information, much attention has been paid to aggregating batches of events into grid-like representations (Zhu et al., 2019; Rebecq et al., 2019; Gehrig et al., 2019) which can be fed to traditional computer vision algorithms designed for images. Each voxel in the grid represents the accumulation of event information, e.g. count and polarity, at a particular pixel and time interval. According to the working principle of event cameras, the grid-like representations intuitively interpret scenes, such as brightness increment and edges. Moreover, they benefit from their data structure compatible with computer vision algorithms and fast inference on commodity graphics hardware. Our work is highly related to the grid-like representations above.

Recently, some efforts have been devoted to treating batches of events as a set of spatiotemporal points and learning features directly from sparse events using point-based geometric processing methods, such as PointNet (Yusuke et al., 2019), transformers (Jiancheng et al., 2019), and graph neural networks (GNNs) (Simon et al., 2022). These algorithms preserve the sparsity and high temporal resolution of events without redundancy, showing promising results on several tasks, such as object classification and detection. However, because they treat events as points in 3D space rather than on the 2D image plane, these algorithms are not well-suited to low-level tasks such as intensity-image reconstruction and optical flow estimation.

Compressed Sensing

Compressed sensing (CS) has gained tremendous interest from academic and industrial communities. By exploiting a sparsity prior, the CS scheme can reconstruct the underlying signals from far fewer samples than the Shannon-Nyquist rate (Nyquist, 1928), with theoretical guarantees (Candès et al., 2006; Donoho, 2006). Existing works exploit various sparsity-imposing “norms", e.g. \(l_p\) or \(l_1\) norms, for sparse coding and representation of the signals, while sensing design is guided by the mutual coherence condition or the Restricted Isometry Property (RIP) (Nguyen et al., 2013). In practice, many natural signals are found to be sparsifiable, such as images (Basarab et al., 2013), geophysics data (Zhang et al., 2013), biological data (Mohtashemi et al., 2010), and communication signals (Bajwa et al., 2010; Eldar et al., 2012), so CS can be applied and deployed in the corresponding applications. However, no work to date has investigated the CS of event cameras and data representation, or proposed any practical event CS schemes.

Fig. 2 Sensing matrices \(\Psi \) of existing grid-like event representations. Existing works mostly compress event data with varied sliding-window schemes, such as (a) direct sum (Songnan et al., 2020; Lin et al., 2019; Song et al., 2020; Zhu et al., 2018; Anton et al., 2019), (b) fixed weighted sum (Zhu et al., 2019; Rebecq et al., 2019), and (c) learnt weighted sum (Gehrig et al., 2019). The \(\Psi \) in (c) is recorded by generating a look-up table from the mapping learnt in the object recognition task. Our representation (d) transforms events to the frequency domain with more temporal information preserved. The real part (cosine wave) is denoted by a solid line, and the imaginary part (sine wave) by a dashed line. \(M=5\), \(N=10\) in this example

3 Methodology

In this section, we provide the problem statement of event representation, followed by the formulation of the proposed compressed event sensing (CES) method.

3.1 Event Data

Event cameras trigger an event once they detect a log-intensity change within a pixel above a preset threshold. Within a time interval, a stream of events is recorded as a set of tuples:

$$\begin{aligned} \Xi = \{{\textbf {e}}_k\}_{k=1}^K = \{({\textbf {u}}_k, t_k, p_k)\}_{k=1}^K, \end{aligned}$$
(1)

where \({\textbf {e}}_k\) denotes the \(k^{th}\) event, \({\textbf {u}}_k=(u_k,v_k)\) and \(t_k\) are the spatiotemporal coordinates, \(p_k\in \{-1, 1\}\) indicates the direction (decrease or increase) of the log-intensity change.

As mentioned above, to utilize the high learning capacity of convolutional neural networks (CNNs), it is necessary to convert the event set into a grid-like representation. Ideally, this mapping should preserve the event stream’s spatial structure and high temporal resolution.

3.2 Discrete-Time Event Signals

Intuitively, an event set can be viewed as several points in a four-dimensional manifold spanned by polarity and spatiotemporal coordinates. Since event cameras record events independently for each pixel, we consider an event subset \(\Xi ^{{\textbf {u}}}\) at a specific pixel \({\textbf {u}}\). Within a time interval [0, T], the subset \(\Xi ^{{\textbf {u}}}\) can be modeled as a discrete-time signal sampled with time resolution \(\tau \):

$$\begin{aligned} e^{{\textbf {u}}}(t) = \sum _{n=0}^{N-1} x^{{\textbf {u}}}(n\tau )\delta (t-n\tau ), \end{aligned}$$
(2)

where \(N = \frac{T}{\tau }\) denotes the sample length, \(\delta (\cdot )\) is Dirac pulse, and \(x^{{\textbf {u}}}(n\tau )\) is a function defined on the domain of events:

$$\begin{aligned} x^{{\textbf {u}}}(n\tau ) = \left\{ \begin{aligned} x_k&\;&\text {if an event } e_k \text { occurs at } ({\textbf {u}}, n\tau ), \\ 0~~&\;&\text {if no event occurs at } ({\textbf {u}}, n\tau ), \\ \end{aligned} \right. \end{aligned}$$
(3)

where \(x_k\) denotes a measurement associated with event \(e_k\). This model underlies various representations in the literature. Examples of such measurements are the event polarity (Lin et al., 2019; Songnan et al., 2020; Song et al., 2020) \(x_k=p_k\), the event count (Zhu et al., 2018; Anton et al., 2019; Zhu et al., 2019; Rebecq et al., 2019) \(x_k=1\), and the normalized timestamp (Mitrokhin et al., 2018; Alonso & Murillo, 2019; Gehrig et al., 2019) \(x_k=t_k / T\). We use \(x_k=p_k\) in this work.

A simple approach to representing event data is to construct a 3D grid with dimensions \(N\times H\times W\), where H and W denote the spatial resolution of the event camera and N is the number of channels in the grid. However, due to the high temporal resolution of event cameras (\(\tau \approx 1 \mu s\)), the resulting number of channels can be extremely large, and the sparse occupancy of events in such a grid leads to inefficiencies in both storage and processing. Therefore, it is desirable to compress event signals into a representation with fewer channels while ensuring minimal loss of information.
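For concreteness, the following is a minimal NumPy sketch of this naive dense binning; the function and variable names are ours and purely illustrative. It makes explicit why the channel count explodes at microsecond resolution.

```python
import numpy as np

def naive_event_grid(events, T, tau, H, W):
    """Scatter events into a dense (N, H, W) grid with one channel per time bin.

    events: (K, 4) array with rows (u, v, t, p); u, v are pixel coordinates,
            t is the timestamp in [0, T), p is the polarity in {-1, +1}.
    With tau ~ 1e-6 s and T ~ 40e-3 s, N = T / tau ~ 40,000 channels,
    which is why this dense grid is impractical and motivates compression.
    """
    N = int(round(T / tau))
    grid = np.zeros((N, H, W), dtype=np.float32)
    u = events[:, 0].astype(int)
    v = events[:, 1].astype(int)
    n = np.clip((events[:, 2] / tau).astype(int), 0, N - 1)  # time-bin index
    p = events[:, 3]
    np.add.at(grid, (n, v, u), p)   # accumulate x_k = p_k per (bin, pixel)
    return grid
```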

3.3 Compressed Event Sensing

Given an event stream \(\vec {x^{{\textbf {u}}}} \in {\mathbb {R}}^{N}\) which is the vector format of \(x^{{\textbf {u}}}\), event compression can be summarized as a matrix multiplication:

$$\begin{aligned} \vec {y^{{\textbf {u}}}}=\Psi ^{\textsf{T}} \vec {x^{{\textbf {u}}}}, \end{aligned}$$
(4)

in which \(\vec {y^{{\textbf {u}}}} \in {\mathbb {R}}^{M}\) is the compressed result and \(\Psi = [\psi _0 ~ \psi _1 ~... ~ \psi _{M-1}] \in {\mathbb {R}}^{N\times M}\) is a sensing matrix, \(M \ll N\).

Previous works mainly adopt varied sliding-window techniques to compress event data. One common approach (Songnan et al., 2020; Lin et al., 2019; Song et al., 2020; Zhu et al., 2018; Anton et al., 2019) divides an event stream into M equal-length portions and sums up the event measurements to form an M-channel grid, whose sensing matrix is illustrated in Fig. 2 (a). To retain more temporal information, some methods (Zhu et al., 2019; Rebecq et al., 2019) merge events using a linearly weighted summation similar to bilinear interpolation, as shown in Fig. 2 (b). Gehrig et al. (2019) further utilize a data-driven multilayer perceptron (MLP) to learn the best weights end-to-end for event compression. However, as shown in Fig. 2 (c), it tends to predict a simple mapping from event timestamps to weights, similar to Fig. 2 (b), and thus fails to make full use of the MLP’s model capacity. Overall, according to the Nyquist-Shannon sampling theorem (Nyquist, 1928), the above time-domain event compressions inevitably discard substantial temporal information, making neural networks less effective for event-driven applications.
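As a concrete reference point, below is a minimal NumPy sketch (ours, not the implementation of the cited works) of the direct-sum sensing matrix from Fig. 2 (a), applied to a toy event vector as in Eq. (4). The sizes match the \(M=5\), \(N=10\) example of Fig. 2.

```python
import numpy as np

def direct_sum_psi(N=10, M=5):
    """Sensing matrix for the direct-sum scheme of Fig. 2(a): each column of Psi
    sums the events falling into one of M equal-length temporal windows."""
    psi = np.zeros((N, M))
    for m in range(M):
        psi[m * (N // M):(m + 1) * (N // M), m] = 1.0
    return psi

# Toy event vector x (polarities in {-1, 0, +1}) compressed as y = Psi^T x, Eq. (4)
x = np.random.choice([-1.0, 0.0, 1.0], size=10)
y = direct_sum_psi().T @ x
```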

In practice, event signals are highly sparse in the original time domain, i.e. \(\Vert \vec {x^{{\textbf {u}}}} \Vert _0 \ll N\). Hence, compressed sensing can be utilized to compactly compress event signals while ensuring minimal information loss with an appropriate sensing matrix \(\Psi \). In this study, we employ the random Fourier transform to generate the sensing matrix \(\Psi \), which is widely used in natural signal processing, such as geophysical data analysis (Zhang et al., 2013), communications (Bajwa et al., 2010), and medical image processing (Basarab et al., 2013). Specifically, the Fourier transform matrix \(\Psi \) maps an input event signal \(\vec {x^{{\textbf {u}}}} \in {\mathbb {R}}^{N}\) into a set of complex compressive measurements \(\vec {y^{{\textbf {u}}}} \in {\mathbb {C}}^{M}\) at M random frequencies \(\{f_m\}_{m=0}^{M-1}\). Each column of \(\Psi \) is defined as \(\psi _m = (e^{-i2\pi f_m n\tau })^{\textsf{T}}_{n=0,...,N-1}\), or equivalently:

$$\begin{aligned} \psi _m = [1 ~~~ \varpi _m ~~~ \varpi _m^2 ~~~ \varpi _m^3 ~~~ \cdots ~~~ \varpi _m^{N-1}]^{\textsf{T}}, \end{aligned}$$
(5)

where \(\varpi _m = e^{-i2\pi f_m \tau }\). To avoid complex-valued operations in the subsequent neural networks, we split the complex \(\vec {y^{{\textbf {u}}}}\) into its real and imaginary parts and obtain the compressed result \(\vec {z^{{\textbf {u}}}} \in {\mathbb {R}}^{C}\), where C denotes the number of channels of the representation and \(C=2M\). Therefore, the \(c^{th}\) element of the representation, \(\vec {z^{{\textbf {u}}}_c}\), is calculated by:

$$\begin{aligned} \vec {z^{{\textbf {u}}}_c} = \left\{ \begin{aligned} \sum _{n=0}^{N-1} x^{{\textbf {u}}}(n\tau ) \cos (2\pi f_m n\tau ) ~~{} & {} \text {if } c=2m,~~~~~\\ \sum _{n=0}^{N-1} x^{{\textbf {u}}}(n\tau ) \sin (-2\pi f_m n\tau ){} & {} \text {if } c=2m+1. \\ \end{aligned} \right. \end{aligned}$$
(6)

Because the random Fourier transform satisfies the Restricted Isometry Property (RIP), it ensures unique and stable recovery of the full signal from few samples (Nguyen et al., 2013), and thus retains much of the temporal information of the event data.
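To make Eq. (6) concrete, the following is a minimal NumPy sketch of building a CES volume from a batch of events; the function and variable names are ours, and the frequencies are assumed to be sampled in advance (e.g. from a Gaussian as in Sect. 3.4). Because \(x^{{\textbf {u}}}(n\tau )\) is nonzero only at event times with \(x_k=p_k\), the sums in Eq. (6) reduce to sums over events.

```python
import numpy as np

def ces_volume(events, freqs, H, W):
    """Compressed event sensing volume per Eq. (6).

    events: (K, 4) array with rows (u, v, t, p); timestamps t lie in [0, T).
    freqs:  (M,) array of sampled frequencies {f_m}.
    Returns a (2M, H, W) volume: channel 2m holds the cosine (real) component
    and channel 2m+1 the sine (imaginary) component at frequency f_m.
    """
    M = len(freqs)
    z = np.zeros((2 * M, H, W), dtype=np.float32)
    u = events[:, 0].astype(int)
    v = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]                       # measurement x_k = p_k
    for m, f in enumerate(freqs):
        phase = 2.0 * np.pi * f * t        # 2*pi*f_m*t_k for every event
        np.add.at(z[2 * m],     (v, u), p * np.cos(phase))
        np.add.at(z[2 * m + 1], (v, u), p * np.sin(-phase))
    return z
```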

Moreover, the frequencies can be tuned to match downstream tasks. As shown in Fig. 3, the power spectral density of event streams on “shapes_6dof" (Mueggler et al., 2017) exhibits typical characteristics of event data in the frequency domain, suggesting that sparsely-sampled frequency components have the potential to represent events effectively. It is also noteworthy that although low-frequency components exhibit high energy, high-frequency ones, which previous works (Songnan et al., 2020; Lin et al., 2019; Song et al., 2020; Zhu et al., 2018; Anton et al., 2019) do not consider, may benefit downstream tasks as well. Our method instead provides greater flexibility by allowing the frequencies to be designed manually.

Fig. 3 Visualization of the power spectral density of event data on “shapes_6dof" (Mueggler et al., 2017). Before the statistical spectral analysis, the event streams between two adjacent frames are first discretized into 1000 bins (\(N=1000\)). Event signals show a strong structure in the frequency domain

3.4 CES Implementation

After introducing our CES framework, we provide some strategies for effective and efficient implementation.

As mentioned above, the event representation with the Fourier transform is tunable for downstream tasks. By choosing the frequency sampling in terms of distribution, number, and range, the performance of the resulting networks can change dramatically. The Gaussian distribution offers good flexibility through its variance \(\sigma ^2\), and thus we adopt it in the downstream applications:

  • Gaussian: Randomly sample M frequencies using Gaussian distribution \(f_m \sim {\mathcal {N}}(0, \sigma ^2)\).

In principle, users can choose other distributions commonly used in the compressed sensing field (Candès et al., 2006; Nguyen et al., 2013), such as:

  • Naive FFT: Select the first M low-order Fourier bases.

  • Uniform: Randomly sample M frequencies using the uniform distribution \(f_m \sim {\mathcal {U}}[0, N]\), where N is the length of the event vectors.

Sect. 5 provides a thorough discussion of frequency sampling; the sketch below illustrates the three sampling options.
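The following is a minimal sketch of the three sampling strategies above; the function name and default parameter values are ours and purely illustrative.

```python
import numpy as np

def sample_frequencies(M, N, scheme="gaussian", sigma=5.0, rng=None):
    """Return M frequencies {f_m} for the CES sensing matrix.

    scheme:
      "gaussian"  - f_m ~ N(0, sigma^2), the option used in our applications
      "naive_fft" - the first M low-order Fourier bases, f_m = m
      "uniform"   - f_m ~ U[0, N], where N is the event-vector length
    """
    rng = np.random.default_rng() if rng is None else rng
    if scheme == "gaussian":
        return rng.normal(0.0, sigma, size=M)
    if scheme == "naive_fft":
        return np.arange(M, dtype=float)
    if scheme == "uniform":
        return rng.uniform(0.0, N, size=M)
    raise ValueError(f"unknown scheme: {scheme}")
```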

To perform the Fourier transform efficiently, we implement it using CUDA, which enables parallel computation of each pixel. Specifically, when a new event \({\textbf {e}}_k=({\textbf {u}}_k,t_k,p_k)\) arrives, we assign it to the corresponding thread based on its spatial coordinate \({\textbf {u}}_k\). We then update the representation \(\vec {z^{{\textbf {u}}}_c}\) by adding the term \(\delta \vec {z^{{\textbf {u}}}_c}\) as follows:

$$\begin{aligned} \delta \vec {z^{{\textbf {u}}}_c} = \left\{ \begin{aligned} x_k \cos (2\pi f_m t_k) ~~~&~&\text {if } c=2m,~~~~~\\ x_k \sin (-2\pi f_m t_k)&~&\text {if } c=2m+1, \\ \end{aligned} \right. \end{aligned}$$
(7)

where \(x_k = p_k\). This update incorporates the new event into the Fourier transform representations, and enables efficient processing of large event streams.
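The update of Eq. (7) can also be expressed compactly in PyTorch. The sketch below is a vectorized illustration rather than the actual CUDA kernel, and the names are our own.

```python
import math
import torch

def ces_update(z, events, freqs):
    """In-place update of a CES volume with a batch of new events, Eq. (7).

    z:      tensor (2M, H, W), the running CES representation
    events: tensor (K, 4) with rows (u, v, t, p)
    freqs:  tensor (M,), the sampled frequencies {f_m}
    """
    u, v = events[:, 0].long(), events[:, 1].long()
    t, p = events[:, 2], events[:, 3]
    idx = v * z.shape[2] + u                     # flattened pixel index
    z_flat = z.view(z.shape[0], -1)              # (2M, H*W) view sharing z's storage
    for m in range(freqs.shape[0]):
        phase = 2.0 * math.pi * freqs[m] * t
        z_flat[2 * m].index_add_(0, idx, p * torch.cos(phase))       # real part
        z_flat[2 * m + 1].index_add_(0, idx, p * torch.sin(-phase))  # imaginary part
    return z
```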

3.5 Theoretical Analysis

The proposed CES generates measurements that retain more information from the raw event streams. However, it is unclear whether these measurements lead to more effective representations when integrated with a deep learning framework for downstream tasks. In this section, we provide a theoretical analysis showing that CES offers greater expressive power in deep learning by analyzing its Reproducing Kernel Hilbert Space using Neural Tangent Kernel approximation.

Problem Formulation Given a distinct event dataset \({\mathfrak {X}} = \{\vec {x_i}\}_{i=1}^I\), deep learning methods first represent events using a sensing matrix \(\Psi \) and then infer using a learnable network g. Equivalently, this process is an end-to-end network p whose first layer is a fixed-weight linear transformation without activation. The output of an event input \(\vec {x}\) can be formulated as

$$\begin{aligned} p(\vec {x}) = g(\Psi ^{\textsf{T}} \vec {x}). \end{aligned}$$
(8)

Given a network g with a fixed architecture, the expressive power of the whole network p is fully determined by the expressive power of the chosen \(\Psi \). Thus, the objective is to measure the function space of p, as an indicator of the expressive power of \(\Psi \). Since p involves highly complex embedding using neural network g, we first revisit the recent neural tangent kernel (NTK) methods (Jacot et al., 2018; Tancik et al., 2020) as the tool for model simplification.

Neural Tangent Kernel (NTK)

Recent works (Jacot et al., 2018; Tancik et al., 2020; Arora et al., 2019; Bietti et al., 2019; Chen et al., 2020; Liu et al., 2023) show that, under the infinite width assumption, a neural network can be approximated as a kernel regression using NTK. Specifically, we assume that g is a fully-connected deep network with its parameters initialized from a Gaussian distribution \({\mathcal {N}}\). When the width of the layers in g tends to infinity, g can be approximated by kernel regression (Jacot et al., 2018). We denote its NTK function as \(K_g(\cdot ,\cdot )\).

Since the whole network p is essentially network g with the linear-transformed inputs \(\{\Psi ^T\vec {x_i}\}_{i=1}^I\), the NTK function of p can be obtained by:

$$\begin{aligned} K(\vec {x_i}, \vec {x_j}) = K_g(\Psi ^T\vec {x_i}, \Psi ^T\vec {x_j}). \end{aligned}$$
(9)

Thus, p can also be approximated by kernel regression.
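The identity in Eq. (9) holds because the sensing matrix \(\Psi \) is fixed and excluded from training, so the trainable parameters of p coincide with those of g. Writing the (empirical, at-initialization) NTK as an inner product of parameter gradients, a step we add here for clarity under the same notation, makes this explicit:

$$\begin{aligned} K(\vec {x_i}, \vec {x_j}) = \left\langle \nabla _{\theta }\, g(\Psi ^{\textsf{T}}\vec {x_i}; \theta ),\; \nabla _{\theta }\, g(\Psi ^{\textsf{T}}\vec {x_j}; \theta ) \right\rangle = K_g(\Psi ^{\textsf{T}}\vec {x_i}, \Psi ^{\textsf{T}}\vec {x_j}), \end{aligned}$$

and taking the infinite-width limit of both sides recovers the deterministic kernels of Eq. (9).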

According to Bernhard et al. (2002); Seeger (2004); Saitoh et al. (2016), the reproducing kernel Hilbert space (RKHS) of a positive definite kernel encompasses a set of functions learned by kernel regression, and thus, RKHS can be used to characterise the expressive power of NTK (Chen et al., 2020). Similarly, when fixing the architecture of the network g, we propose to evaluate the expressive power of sensing matrix \(\Psi \) by comparing the RKHS of NTK corresponding to the whole network p. Formally, we introduce the following definition:

Definition 1

(Expressive Power) Given a distinct dataset \({\mathfrak {X}}=\{\vec {x_i}\}_{i=1}^I\), let \({\mathfrak {H}}_a\) and \({\mathfrak {H}}_b\) be the RKHSs associated with the NTKs of a given network with \(\Psi _a^T\vec {x}\) and \(\Psi _b^T\vec {x}\) as input, respectively. When

$$\begin{aligned} {\mathfrak {H}}_b \subsetneq {\mathfrak {H}}_a, \end{aligned}$$
(10)

we say that the sensing matrix \(\Psi _a\) is more expressive than \(\Psi _b\).

Comparing Reproducing Kernel Hilbert Space (RKHS)

The proposed event representation is based on compressed sensing and can satisfy the Restricted Isometry Property (RIP) with a proper sensing matrix. Therefore, for any distinct event vector pairs \(\vec {x_i}, \vec {x_j} \in {\mathfrak {X}}\), our compressed results are also distinct, that is

$$\begin{aligned} \vec {x_i} \ne \vec {x_j} \Leftrightarrow \Psi _{ces}^T \vec {x_i} \ne \Psi _{ces}^T \vec {x_j}. \end{aligned}$$
(11)

We refer to this property as the non-degenerate property of the sensing matrix \(\Psi _{ces}^T\) on \({\mathfrak {X}}\). Based on this observation, we prove the following theorem:

Theorem 1

Given a non-zero distinct s-sparse dataset \({\mathfrak {X}}=\{\vec {x_i}\}_{i=1}^I\), let \({\mathfrak {H}}_a\) and \({\mathfrak {H}}_b\) be the RKHSs associated with the NTKs of fully-connected networks of the same architecture with \(\Psi _a^T\vec {x}\) and \(\Psi _b^T\vec {x}\) as input, respectively, where \(\Psi _a\) holds the non-degenerate property while \(\Psi _b\) does not. Then the following inclusion relation holds:

$$\begin{aligned} {\mathfrak {H}}_b \subsetneq {\mathfrak {H}}_a. \end{aligned}$$
(12)

Please refer to the appendix 7 for detailed proof.

The proposed CES has the non-degenerate property, while the existing grid-like representations do not. According to Theorem 1, under the NTK assumption, neural networks with CES can reach a larger RKHS than those with other representations. Thus, based on Definition 1, we conclude that our CES exhibits greater expressive power.

4 Fidelity of CES Volumes

The existing representations of event data often suffer from an irreversible loss of temporal information. In contrast, our proposed compressed event sensing (CES) approach encodes the full temporal information of the event data with minimal loss. This section discusses the quality of the sensing matrices of existing event representations and further evaluates the fidelity of our approach by recovering raw event signals from the proposed CES volumes.

Table 2 Comparison of the quality of the sensing matrix \(\Psi ^T\) for different representations in terms of the restricted isometry constant \(\delta _s\) for \(s=2\)-sparse vectors of length \(N=1000\)

4.1 Quality of Sensing Matrix

According to the theory of compressed sensing (Simon & Holger, 2013), the restricted isometry property (RIP) is a well-established measure of the quality of a measurement matrix. The \(s\)-th restricted isometry constant \(\delta _s=\delta _s(\Psi ^T)\) of the sensing matrix \(\Psi ^T\) is the smallest \(\delta \ge 0\) such that

$$\begin{aligned} (1-\delta ) \Vert \vec {x^{{\textbf {u}}}}\Vert _2^2 \le \Vert \Psi ^T\vec {x^{{\textbf {u}}}}\Vert _2^2 \le (1+\delta ) \Vert \vec {x^{{\textbf {u}}}}\Vert _2^2, \end{aligned}$$
(13)

for all s-sparse event vectors \(\vec {x^{{\textbf {u}}}} \in {\mathbb {R}}^N\). Equivalently, it is given by:

$$\begin{aligned} \delta _s = \max _{S\subseteq [N],\, \mathrm {card}(S)\le s} \Vert \Psi _S\Psi _S^{\textsf{T}} - I_{\mathrm {card}(S)}\Vert _{2\rightarrow 2}, \end{aligned}$$
(14)

where \(\Psi _S\) denotes the restriction of \(\Psi \) to the rows indexed by S. If \(\delta _s\) is small for a reasonably large s, the sensing matrix satisfies the RIP and ensures stable and unique sparse recovery.
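For small s the constant of Eq. (14) can be computed by enumerating index sets. Below is a minimal NumPy sketch for \(s=2\); it assumes the columns of the sensing matrix are \(l_2\)-normalized and uses a reduced N (our choice, for tractability), so the numbers it produces are illustrative rather than those of Table 2.

```python
import itertools
import numpy as np

def rip_constant_s2(A):
    """Restricted isometry constant delta_2 of a sensing matrix A (rows = measurements).

    Enumerates all column pairs S with |S| = 2 and takes the largest spectral norm
    of A_S^T A_S - I, equivalent to Eq. (14). Columns are assumed l2-normalized.
    """
    _, N = A.shape
    delta = 0.0
    for i, j in itertools.combinations(range(N), 2):
        G = A[:, [i, j]].T @ A[:, [i, j]] - np.eye(2)
        delta = max(delta, np.abs(np.linalg.eigvalsh(G)).max())
    return delta

# Example: random-Fourier sensing matrix (real/imaginary parts stacked), reduced size
rng = np.random.default_rng(0)
N, M = 200, 16
freqs = rng.normal(0.0, 5.0, size=M)
n = np.arange(N)
A = np.vstack([np.cos(2 * np.pi * np.outer(freqs, n) / N),
               np.sin(-2 * np.pi * np.outer(freqs, n) / N)])   # (2M, N)
A /= np.linalg.norm(A, axis=0, keepdims=True)                   # normalize columns
print("delta_2 ~", rip_constant_s2(A))
```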

Table 2 compares the \(\delta _s\) of the sensing matrices \(\Psi ^T\) of different representations for N-length s-sparse event vectors (\(N=1000, s=2\)). Compared with existing representations, our method achieves the lowest restricted isometry constant for all numbers of frequencies M, and thus exhibits a high probability of successfully recovering the sparse signal via standard recovery algorithms.

Moreover, we observe that the quality of our sensing matrix improves as the number of sampled frequencies increases. According to Candès et al. (2006), the number of sampled frequencies required for accurate recovery via \(l_1\) minimization should satisfy \(M \ge C_M \cdot \log N \cdot \Vert \vec {x^{{\textbf {u}}}} \Vert _0\) for some constant \(C_M>0\). Therefore, the number of representation channels should be designed according to the sparsity of the event signal \(s = \Vert \vec {x^{{\textbf {u}}}} \Vert _0\) and the desired temporal resolution N.

4.2 Raw Event Signal Recovery

Fig. 4 Visualization of the reversibility of the proposed CES volumes on “shapes_6dof" data (Mueggler et al., 2017). Unlike existing event representations, which suffer from irreversible losses, the proposed CES can be used to accurately recover the raw event signals. (a) shows an original event signal between two adjacent frames. (b)(c)(d) show the event signals recovered by feeding our representations with frequency numbers \(M=4,~16,~64\) into a conventional sparse recovery algorithm (Candes et al., 2008) and plotting the points whose absolute signal intensities exceed a threshold of 0.5. The recovered results resemble the original event signal and reach PSNRs of 46.45, 62.20 and 219.38, respectively. The experiment shows the reversible recovery ability of the proposed CES volumes and demonstrates its powerful representational capability

We conduct an experiment on the “shapes_6dof" dataset (Mueggler et al., 2017) captured by a DAVIS240C. Each event stream between two adjacent frames, with a duration of approximately 40 ms, is compressed into a CES volume and then reconstructed by a conventional sparse recovery algorithm. However, due to the high temporal resolution of event cameras (\(\tau \approx 1 \mu s\)), the vector form of the event streams is very long (\(N \approx 40,000\)), which poses computational challenges for conventional sparse recovery algorithms. To address this, we instead conduct an approximate evaluation by shortening the event signal length N. Specifically, we normalize the timestamps of the event data to the range [0, 1] and discretize them into N=1000 intervals. We randomly sample \(M = 4,8,16,32,64\) frequencies \(\{f_m\}_{m=0}^{M-1}\) from a uniform distribution over the interval [0, N], and calculate the corresponding frequency components at each pixel to form a grid-like representation. Then, we feed the representations into the iteratively re-weighted \(l_1\) minimization algorithm (Candes et al., 2008). By examining the quality of the reconstructed signals, we can evaluate the fidelity of our representation.
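As a rough illustration of the recovery step, the sketch below implements a basic iteratively re-weighted \(l_1\) scheme in the spirit of Candes et al. (2008), with a linear-programming inner solver. It is not the exact solver configuration used in our experiments, and the problem sizes in the toy example are reduced for tractability; recovery is expected to succeed only when M is large enough relative to the sparsity.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_l1(A, y, w):
    """min_x sum_i w_i|x_i| s.t. A x = y, solved as an LP with x = u - v, u, v >= 0."""
    m, n = A.shape
    c = np.concatenate([w, w])
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]

def reweighted_l1(A, y, n_iter=5, eps=1e-3):
    """Iteratively re-weighted l1 minimization (in the spirit of Candes et al., 2008)."""
    w = np.ones(A.shape[1])
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = weighted_l1(A, y, w)
        w = 1.0 / (np.abs(x) + eps)   # large entries get small weights, small entries are pushed to zero
    return x

# Toy example: recover a sparse +/-1 event vector from random Fourier measurements
rng = np.random.default_rng(0)
N, M, s = 200, 16, 4
x_true = np.zeros(N)
x_true[rng.choice(N, s, replace=False)] = rng.choice([-1.0, 1.0], s)
freqs = rng.uniform(0, N, M)
n = np.arange(N)
A = np.vstack([np.cos(2 * np.pi * np.outer(freqs, n) / N),
               np.sin(-2 * np.pi * np.outer(freqs, n) / N)])    # (2M, N) sensing matrix
x_hat = reweighted_l1(A, A @ x_true)
print("max abs error:", np.abs(x_hat - x_true).max())
```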

Table 3 Raw event signal recovery from the proposed CES volumes with different numbers of frequencies M on “shapes_6dof” dataset (Mueggler et al., 2017)

We evaluate the average PSNR between the original events and the reconstructed ones as shown in Table 3. We also provide qualitative results in Fig. 4, where the decompressed event data are highlighted in panels (b), (c), and (d) using a threshold of 0.5 for the absolute values of signal intensities.

The results demonstrate that our proposed representation, which relies on the compressed sensing scheme, excels in reconstructing realistic events with remarkable fidelity. Furthermore, we observed that the performance of our method improves as the number of sampled frequencies increases, which is consistent with the observation in Table 2. When M is small, such as in Fig. 4 (b), the insufficient number of frequencies causes missing and misaligned signals during sparse recovery. However, when a sufficient number of frequencies are sampled, as in Fig. 4 (c) and (d), the restored events closely resemble the original ones.

However, in practice, downstream applications may not rely on temporal information as heavily as the raw event signal recovery does. The detailed discussion on frequency numbers will be given in Sect. 5.

5 Experimental Results

In this section, we first investigate the advantage of CES volumes in preserving temporal information through a challenging case: a dense frame regression test on a synthetic phantom dataset. Then, we demonstrate the effectiveness of the proposed CES volume on two event-driven applications: intensity-image reconstruction and object classification. Finally, we compare the running time of the existing representations.

5.1 Dense Frame Regression Test

To better understand the proposed method’s advantage of keeping high temporal resolution, we regress dense frames from a starting frame and the subsequent events on a simulated phantom dataset. Preserving more temporal information should result in better reconstruction of dense frames.

Given a starting intensity \(z_0^{{\textbf {u}}} \in {\mathbb {R}}\) at a pixel \({\textbf {u}}\) and the subsequent event stream \(\vec {x^{{\textbf {u}}}} \in {\mathbb {R}}^{N}\), dense frame regression aims to output corresponding frames \([z_1^{{\textbf {u}}}, z_2^{{\textbf {u}}},...,z_{Z-1}^{{\textbf {u}}}] \in {\mathbb {R}}^{Z-1}\). This procedure can be formulated as

$$\begin{aligned} [z_1^{{\textbf {u}}}, z_2^{{\textbf {u}}},...,z_{Z-1}^{{\textbf {u}}}] = z_0^{{\textbf {u}}} + f(\vec {y^{{\textbf {u}}}};\theta ) = z_0^{{\textbf {u}}} + f(\Psi ^{\textsf{T}}\vec {x^{{\textbf {u}}}};\theta ), \end{aligned}$$
(15)

in which \(f(\cdot )\) is a fully-connected network (also called a multilayer perceptron or MLP) with weights \(\theta \), \(\Psi \in {\mathbb {R}}^{N\times M}\) is a sensing matrix, and \(\vec {y^{{\textbf {u}}}}\in {\mathbb {R}}^{M}\) is the compressed representation. The pipeline of dense frame regression is illustrated in Fig. 5 (a).

Specifically, we compare the proposed CES volumes with state-of-the-art representations, including the fixed weighted sum-based Voxel Grid (Zhu et al., 2019; Rebecq et al., 2019) and the learnt weighted sum-based Event Spike Tensor (EST) (Gehrig et al., 2019). The timestamps of events are normalized to the range [0, 1]. For our CES volumes, we sample the frequencies from a Gaussian distribution \(f_m \sim {\mathcal {N}}(0, \sigma ^2)\) similar to Tancik et al. (2020), and form the volumes with the corresponding frequency components. For a fair comparison, we set the number of channels of each representation to \(C=16\).

The network \(f(\cdot )\) contains 8 fully-connected (FC) layers. The number of channels of the last layer is set to \(Z-1\), where \(Z=50\) in this experiment. The other layers have 256 channels and are followed by GELU activations (Hendrycks et al., 2016). We implement the network using PyTorch (Paszke et al., 2017) and use ADAM (Diederik & Jimmy, 2014) with a learning rate of 0.001. The networks are trained with the mean squared error (MSE) loss for 50 epochs (937,500 iterations) with a batch size of 16.
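A minimal PyTorch sketch of the regression head described above is given below; the class and argument names are ours, and the per-pixel compressed representation \(\vec {y^{{\textbf {u}}}}\) is assumed to be precomputed.

```python
import torch
import torch.nn as nn

class FrameRegressor(nn.Module):
    """8-layer MLP f(.) of Eq. (15): maps a C-channel pixel representation to the
    Z-1 subsequent intensity values, with 256-d hidden layers and GELU activations."""
    def __init__(self, in_channels=16, hidden=256, z_frames=50, n_layers=8):
        super().__init__()
        layers, dim = [], in_channels
        for _ in range(n_layers - 1):
            layers += [nn.Linear(dim, hidden), nn.GELU()]
            dim = hidden
        layers += [nn.Linear(dim, z_frames - 1)]     # last layer outputs Z-1 values
        self.net = nn.Sequential(*layers)

    def forward(self, y, z0):
        # y:  (B, C) compressed event representation at a pixel
        # z0: (B, 1) starting intensity; prediction is z0 + f(y; theta)
        return z0 + self.net(y)

model = FrameRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()   # trained with the MSE loss, as in Sect. 5.1
```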

Fig. 5 Pipeline and dataset synthesis of the dense frame regression test. Given a starting frame \(z^{{\textbf {u}}}_0\) at a pixel \({\textbf {u}}\) and its subsequent event stream \(\vec {x^{{\textbf {u}}}}\), dense frame regression aims to predict the corresponding subsequent frames \([z_1^{{\textbf {u}}}, z_2^{{\textbf {u}}},...,z_{Z-1}^{{\textbf {u}}}]\). The event data \(\vec {x^{{\textbf {u}}}}\) is first represented as \(\vec {y^{{\textbf {u}}}}\) using different sensing matrices \(\Psi ^T\), and then fed into a fully-connected network. For the synthetic phantom dataset, we generate video sequences by rendering a brighter rectangle moving on a darker background and simulating event data using Vid2E (Gehrig et al., 2020). Please see the text for more details

Fig. 6 Convergence of dense frame regression. Compared with existing representations, the proposed compressed sensing-based method preserves more temporal information of events, showing a good ability to regress and distinguish dense frames

Table 4 Dense frame regression performance on the synthetic phantom dataset in terms of mean squared error (MSE) and structural similarity (SSIM) (Zhou et al., 2004)

Datasets To obtain the phantom dataset, we first generate several video sequences by rendering a brighter rectangle of size \(50\times 100\) on a darker background of size \(100\times 100\). The rectangle starts at the center of the background and moves horizontally at 3 key times with random step lengths, otherwise remaining stationary. Each sample contains Z frames. Figure 5 (b)-(e) show an example in which the rectangle moves at the 21st, 47th, and 49th frames. After obtaining the video sequences, we feed them into an event simulator, Vid2E (Gehrig et al., 2020), to generate event streams. The dataset contains 300,000 video samples and corresponding events. We select 225,000 pairs for training and 75,000 pairs for testing.

Results and Discussion Figure 6 shows the convergence behavior of the compared event representations. Existing representations converge rapidly but show limited power in frame regression. Based on compressed sensing theory, the proposed CES volumes capture the frequency characteristics of events and encode much more temporal information, and thus can distinguish frames at dense timestamps, exhibiting good learning capacity for dense frame regression.

Table 4 compares the mean squared error (MSE) and structural similarity (SSIM) (Zhou et al., 2004) of regression results. Voxel Grid (Rebecq et al., 2019) and EST (Gehrig et al., 2019) utilize a sliding window approach, which fuses a large number of events within each window, leading to blurred temporal information and limited ability to recover high time resolution. In contrast, the proposed method utilizes compressed sensing scheme to preserve more temporal information, thereby overcoming the “temporal bias” observed when training networks with sliding window. It allows for the distinction of dense timestamps and results in high regression performance.

It is perhaps unsurprising that our frequency-domain representation performs favorably. Previous work (Tancik et al., 2020) has demonstrated on a variety of regression tasks relevant to the computer vision and graphics communities that a Fourier feature mapping can overcome the spectral bias of coordinate-based MLPs towards low frequencies by allowing them to learn much higher frequencies.

5.2 Intensity-Image Reconstruction

Intensity-image reconstruction aims to generate high-quality video frames from sparse event data, allowing for the use of off-the-shelf computer vision techniques designed for conventional cameras. Ideally, benefiting from the pleasing properties of events, the reconstructed intensity images can exhibit a significantly larger dynamic range, sharper edges, and reduced motion blur compared to conventional cameras.

However, there are several challenges associated with reconstructing high-quality images from event data. First, events are unevenly distributed in both space and time, with the number of events being highly dependent on the movement in the scene. In the presence of large movements, the captured events can be massive, posing difficulties in event representation, as essential information may be lost, resulting in blurry reconstructed images. On the other hand, small movements can lead to sparser events, requiring event representation methods to effectively highlight the relevant details without sacrificing important information.

Furthermore, the imbalance in the amounts of events introduced by different object motions adds another layer of complexity to the design of data representation methods. Finding a suitable event representation approach that can capture both large and small movements while preserving the essential details poses a significant challenge in intensity-image reconstruction from event data. Addressing these challenges is crucial to achieving high-quality image reconstruction from event data and unlocking the full potential of event cameras in computer vision applications.

Table 5 Intensity-image reconstruction performance compared with existing grid-like representations and state-of-the-art reconstruction methods on the Blur-DVS (Jiang et al., 2020) and Event Camera Dataset (Mueggler et al., 2017) in terms of mean squared error (MSE), structural similarity (SSIM) (Zhou et al., 2004), and the learned perceptual image patch similarity (LPIPS) (Richard et al., 2018)
Fig. 7 Visual comparisons of intensity-image reconstruction on Blur-DVS (Jiang et al., 2020). The reconstructed images from Voxel Grid (Rebecq et al., 2019), EST (Gehrig et al., 2019), and our representation, as well as the ground truth, are shown in the odd rows, while the center channel of each representation and the input event data are shown in the even rows. The network designed with our representation generates more details with fewer artifacts

In this section, we conduct an evaluation on a state-of-the-art high-pass filter-based non-deep method (Cedric et al., 2018) as well as a deep learning-based method E2VID (Rebecq et al., 2019) integrating various representations, including fixed weighted sum-based Voxel Grid (Zhu et al., 2019; Rebecq et al., 2019), learnt weighted sum-based Event Spike Tensor (EST) (Gehrig et al., 2019), TORE (Baldwin et al., 2022) and our compressed event sensing method. To ensure a fair comparison, we set the number of channels for each representation to be the same, with \(C=8\). These representations are then fed into the same base network, E2VID (Rebecq et al., 2019), for further processing.

We implemented the networks using PyTorch (Paszke et al., 2017) and adopted ADAM (Diederik & Jimmy, 2014) as the optimization algorithm, with a learning rate of 0.0001. The networks were trained for 20 epochs, corresponding to 56,540 iterations, with a batch size of 2.

Fig. 8 Visual comparisons of intensity-image reconstruction on the Event Camera Dataset (Mueggler et al., 2017). The reconstructed images from Voxel Grid (Rebecq et al., 2019), EST (Gehrig et al., 2019), and our representation, as well as the ground truth, are shown in the odd rows, while the center channel of each representation and the input event data are shown in the even rows. The network designed with our representation generates more details with fewer artifacts

Datasets In our experiments, we utilize the slow subset of the Blur-DVS dataset (Jiang et al., 2020) for both training and validation. The slow Blur-DVS dataset was captured using a DAVIS240C event camera with slow and stable camera movement, capturing relatively static scenes. As a result, this dataset provides sharp intensity frames as ground truths and corresponding event streams as inputs, which are suitable for training our model. The dataset is split into 11,308 pairs for training and 3,700 pairs for validation.

To further assess the generalization ability of the representations, we also conducted experiments on the Event Camera Dataset (Mueggler et al., 2017). This dataset presents a different set of scenes and events compared to the Blur-DVS dataset.

Fig. 9 Effect of the number of representation channels on intensity-image reconstruction on Blur-DVS (Jiang et al., 2020). The proposed method achieves the highest performance under all numbers of channels. It performs favorably with few channels, resulting in an efficient representation

Results and Discussion We report the mean squared error (MSE), structural similarity (SSIM) (Zhou et al., 2004), and learned perceptual image patch similarity (LPIPS) (Richard et al., 2018) in Table 5. On the Blur-DVS dataset, the proposed representation demonstrates superior performance with an \(8\%\) decrease in MSE, a \(3\%\) increase in SSIM, and a \(2\%\) decrease in LPIPS. On the Event Camera Dataset, our method exhibits better generalization capability across almost all scenes, as indicated by improvements in MSE (\(1\%\) decrease), SSIM (\(4\%\) increase) and LPIPS (\(5\%\) decrease).

We provide visual examples from the validation and testing sets in Figs. 7 and 8, respectively. Voxel Grid (Rebecq et al., 2019), which relies on a fixed weighted sum, sacrifices temporal information and is less effective in handling significant camera movement, resulting in noticeable ringing artifacts. Event Spike Tensor (EST) (Gehrig et al., 2019), which uses learnable weights for event data embedding, exhibits limited learning ability, and the learnable weights are similar to the fixed weights in Voxel Grid (see Fig. 2), leading to blurry intensity images. In contrast, our proposed representation method leverages the frequency characteristics and sparsity of events to encode data using a compressed sensing scheme, resulting in reconstructed images with finer details and fewer visual artifacts.

Effect of Frequency Number

According to compressed sensing theory, our proposed CES volumes can retain extensive temporal information when a sufficient number of frequencies are sampled. Nevertheless, in event-driven applications, preserving accurate temporal resolution with numerous frequencies may not always be necessary. To explore this further, we carry out an experiment where we vary the number of channels in the representations, setting them to \(C=8, 16, 32, 64\), and evaluate their performance on the intensity-image reconstruction task, as shown in Fig. 9. We compare our method with Voxel Grid (Rebecq et al., 2019) and Event Spike Tensor (EST) (Gehrig et al., 2019).

We observe that the accuracy of all methods generally increases with the number of channels. However, our method exhibits greater robustness to the number of channels and consistently achieves superior performance across all channel levels. These findings suggest that intensity-image reconstruction may be less sensitive to temporal information than event signal reconstruction, and satisfactory performance can be achieved with a limited number of frequencies using our proposed method. This makes our method valuable in scenarios where computation or storage constraints exist.

Fig. 10 Comprehensive analysis of the interaction between channel numbers and sampling variances on intensity-image reconstruction on Blur-DVS (Jiang et al., 2020). Voxel Grid (Rebecq et al., 2019), EST (Gehrig et al., 2019), and the proposed method are highlighted in blue, cyan, and red, respectively. The 3D surface plots illustrate the interaction on (a) MSE, (b) SSIM, and (c) LPIPS. The channel numbers are set as 8, 16, 32, 64, while frequencies are sampled from a Gaussian distribution with variance \(\sigma ^2\) ranging from 1 to 25. The 2D line plots in (a)(b)(c) show the performance as the sampling variance changes when the channel numbers are 8, 16, 64 from top to bottom. The overall trend of the different line plots is similar; however, the inflection points and ranges differ

Fig. 11 Effect of the frequency sampling distribution on intensity-image reconstruction on Blur-DVS (Jiang et al., 2020). The uniform distribution, “Naive FFT", and Gaussian distribution are highlighted in green, black, and red, respectively. The 3D surface plots illustrate the performance, including (a) MSE, (b) SSIM, and (c) LPIPS, for different channel numbers and variances of the representations. The “naive FFT" and the Gaussian distribution obtain comparable reconstruction results, but the uniform distribution yields poor reconstruction accuracy

Effect of Frequency Sampling Range

The tunable nature of the proposed representation allows for manipulation of frequency settings to optimize downstream network performance. To investigate the effects of frequency sampling range on the intensity-image reconstruction task, we conduct an experiment where the frequency number is set as 4, 8, 16, 32, 64 and the frequencies are sampled from a Gaussian distribution with variance \(\sigma ^2\) ranging from 1 to 25.

Figure 10 illustrates the interaction on intensity-image reconstruction in terms of (a) MSE, (b) SSIM, and (c) LPIPS. We note that given a fixed channel number of event representations, the image reconstruction accuracy initially improves and then declines with an increase in sampling variance. Moreover, as the channel number increases, the inflection point occurs later and the reconstruction network obtains higher accuracy.

This observation may stem from two reasons. 1) Low-frequency components reflect the overall trend of event streams, which is crucial for the intensity-image reconstruction application. A Gaussian distribution with small variance guarantees the presence of low-frequency components, while a large variance does not; therefore, the accuracy decreases as the sampling variance increases. 2) The richness of the sampled frequencies is also crucial. When the channel number is large but the variance is small, the sampled frequencies become redundant, leading to limited expressive capacity in the representation and a loss of detailed information; therefore, the accuracy initially improves as the sampling variance increases. This also explains why the inflection point in the 2D plots occurs later as the channel number increases.

Effect of Frequency Sampling Distribution

We have discussed the effect of the frequency sampling range using the Gaussian distribution above; in this section, we further investigate the impact of different frequency sampling distributions. We conduct an experiment on the uniform distribution as well as the “naive FFT", which selects the first few low-order Fourier bases instead of a random set of bases.

Table 6 Object recognition accuracy compared to existing grid-like representations and state-of-the-art classification algorithms on the N-Cars (Amos et al., 2018) and N-Caltech101 (Orchard et al., 2015)

Figure 11 illustrates the comparison results. The “naive FFT" ensures the presence of low-frequency components and a richness of frequencies similar to the Gaussian distribution, and thus obtains reconstruction results comparable to the Gaussian distribution with suitable variances. However, given a limited number of frequencies, the uniform distribution cannot ensure the inclusion of enough low-frequency components and thus yields poor reconstruction performance.

In practice, determining the optimal sampling distribution is challenging and there are no theoretical guarantees. Different applications may place different frequency requirements on the event representation. Moreover, the complexity and intrinsic characteristics of the event stream may influence the optimal frequency sampling as well.

5.3 Object Recognition

In recent years, event-based classification has gained significant attention for challenging scenarios where conventional cameras may struggle, such as low light conditions and fast object motion. In this section, we use object classification as an example to investigate the performance of the proposed compressed event sensing (CES) volumes on inference tasks.

Following the settings used in Gehrig et al. (2019), we use a pre-trained ResNet-34 (He et al., 2016) as the base network for object prediction. To ensure a fair comparison, events are represented with the same number of channels \(C=4\) using various representations, including the fixed weighted sum-based Voxel Grid (Zhu et al., 2019; Rebecq et al., 2019), the learnt weighted sum-based Event Spike Tensor (EST) (Gehrig et al., 2019), and the proposed CES volumes. The networks are implemented using PyTorch (Paszke et al., 2017) and trained with the cross-entropy loss using the ADAM optimizer (Diederik & Jimmy, 2014) with an initial learning rate of 0.0001, which is halved every 10,000 iterations. The training is conducted for 50 epochs (192,800 and 54,450 iterations for the N-Cars and N-Caltech101 datasets, respectively) with a batch size of 4.
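One common way to feed a C-channel event representation into a pre-trained ResNet-34 is sketched below, assuming (our assumption, not a detail taken from Gehrig et al. (2019)) that only the first convolution and the classifier head need to be replaced; it requires a recent torchvision with the weights API.

```python
import torch.nn as nn
from torchvision.models import resnet34

def build_classifier(in_channels=4, num_classes=101):
    """ResNet-34 backbone adapted to a C-channel event representation."""
    net = resnet34(weights="IMAGENET1K_V1")                    # pre-trained on ImageNet
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                          stride=2, padding=3, bias=False)     # accept C-channel input
    net.fc = nn.Linear(net.fc.in_features, num_classes)        # task-specific head
    return net

model = build_classifier(in_channels=4, num_classes=101)       # e.g. N-Caltech101
```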

Furthermore, we conduct additional comparisons with several state-of-the-art classification algorithms, including two handcrafted representations: HOTS (Lagorce et al., 2016) and HATS (Amos et al., 2018); two baseline implementations of spiking neural networks (SNNs): H-First (Orchard et al., 2015) and Gabor (Jun Haeng et al., 2016; Neil et al., 2016); and one event-based graph neural network (GNN) method: AEGNN (Simon et al., 2022).

Datasets We utilize two public datasets: N-Cars (Amos et al., 2018) and N-Caltech101 (Orchard et al., 2015). The N-Cars dataset is used for binary classification and contains 12,336 car samples and 11,693 non-car samples. The event data is recorded by an ATIS event camera (Posch et al., 2010) with a length of 100ms per sample. The N-Caltech101 dataset is an event-based version of the frame-based Caltech101 dataset (Fei-Fei et al., 2006). It mounts an ATIS sensor on a motorized pan-tilt unit, focuses the sensor on an LCD monitor displaying the original Caltech101 data, and records events as the sensor moves. The N-Caltech101 dataset contains 6,968 samples with 100 object classes plus a background class. We split the training and testing datasets as suggested in Gehrig et al. (2019).

Results and Discussion

Table 6 shows the classification results of our proposed CES volumes compared to other grid-like representations with the same channel number, as well as state-of-the-art methods, on N-Cars and N-Caltech101 datasets.

Compared to Voxel Grid (Rebecq et al., 2019) and EST (Gehrig et al., 2019), our CES volumes outperform them by \(1.3\%\) and \(2.2\%\) on N-Cars and by \(2.5\%\) and \(4.2\%\) on N-Caltech101, respectively. This improvement is attributed to the minimal information loss of our compressed sensing-based approach.

Furthermore, our proposed CES volumes perform favorably against state-of-the-art classification methods on both datasets. HOTS (Lagorce et al., 2016) and HATS (Amos et al., 2018) cluster recent events to form a time surface, and thus obscure much temporal information and are less effective for object classification. H-First (Orchard et al., 2015) and Gabor (Jun Haeng et al., 2016; Neil et al., 2016), which feed events into spiking neural networks (SNNs), gain minimal latency but are limited by the learning ability of SNNs. The recent event-based GNN method (Simon et al., 2022) preserves the high temporal resolution of events but discards the spatial structure, which limits its effectiveness. In contrast, the proposed CES method maintains high temporal information via the compressed sensing scheme and fully exploits the capacity of the convolutional neural network model, resulting in state-of-the-art performance on both the N-Cars and N-Caltech101 datasets.

Effect of Frequency Number

Fig. 12 Effect of the number of representation channels on object recognition on N-Caltech101 (Orchard et al., 2015). The channel numbers C are set from 4 to 64. The proposed method achieves the highest performance under all numbers of channels and performs favorably with few channels, resulting in an efficient representation

We further investigate the sensitivity of the frequency numbers on the inference task. We conduct an experiment by setting the number of channels of representations C to different values, specifically \(C=4, 8, 16, 32, 64\), and compare the results with Voxel Grid (Rebecq et al., 2019) and EST (Gehrig et al., 2019) in Fig. 12.

Our method performs favorably across all numbers of channels, and the accuracy generally increases with the number of channels. However, we observe that the improvement in accuracy becomes less significant when \(C\ge 16\), or equivalently, \(M\ge 8\). This suggests that object recognition does not need to preserve temporal information as accurately as event signal reconstruction does. A smaller number of channels is sufficient for achieving good performance, leading to more efficient event data compression. This can be valuable in practical event-driven applications where reducing data size and processing requirements are important considerations.

Effect of Frequency Sampling Range

In our proposed representation, the frequencies for the Fourier transform are tunable. We investigate the effects of the frequency range on the inference task. In this experiment, we set frequency number as 4, 8, 16, 32, 64 and sample frequencies from a Gaussian distribution with variance \(\sigma ^2\) ranging from 1 to 25. The representations are fed into the prediction network and the prediction results are shown in Fig. 13.

Figure 13 illustrates this interaction for object recognition in terms of classification accuracy. We observe that, under different channel numbers, the classification accuracy initially improves and then declines as the sampling variance increases, similar to the results for intensity-image reconstruction. However, the inflection points are roughly the same, unlike the drift observed in intensity-image reconstruction. One possible reason is that for low-level computer vision tasks, such as intensity-image reconstruction, dynamic and temporal information play an important role, so incorporating reasonably high-frequency components benefits reconstruction; whereas for high-level computer vision tasks, such as object recognition, the spatial structure may be more important, and high-frequency components may disturb and confuse the spatial structure, leading to poor recognition performance.

Fig. 13 Comprehensive analysis of the interaction between channel numbers and sampling variances on object recognition on N-Caltech101 (Orchard et al., 2015). Voxel Grid (Rebecq et al., 2019), EST (Gehrig et al., 2019), and the proposed method are highlighted in blue, cyan, and red, respectively. The 3D surface plot illustrates the interaction on accuracy. The channel numbers are set as 4, 8, 16, 32, 64, while frequencies are sampled from a Gaussian distribution with variance \(\sigma ^2\) ranging from 1 to 25. The 2D line plots show the performance as the sampling variance changes when the channel numbers are 4, 32, 64 from top to bottom. The overall trend and inflection points of the different line plots are similar; however, the ranges differ

Effect of Frequency Sampling Distribution

We have discussed the effect of the frequency sampling range using the Gaussian distribution above; in this section, we further investigate the impact of different frequency sampling distributions. We conduct an experiment on the uniform distribution as well as the “naive FFT", which selects the first few low-order Fourier bases instead of a random set of bases.

Figure 14 illustrates the comparison results. As discussed for the frequency sampling range, object recognition depends more on lower frequencies and may be disturbed by higher ones. The “naive FFT" method, with its fixed sampling strategy, retains more high-frequency information, and thus exhibits limited improvement as the channel number increases and performs less effectively at large channel numbers. The uniform distribution cannot ensure the inclusion of enough low-frequency components and thus yields poor classification performance. In contrast, the Gaussian distribution is more flexible and effective: it performs favourably with a small sampling variance across different channel numbers and attains stable, high accuracy at large channel numbers. Therefore, we adopt the Gaussian distribution throughout the paper.

Overall, determining the optimal sampling distribution is challenging and there are no theoretical guarantees. The optimal frequency sampling may be influenced by the application as well as the complexity and intrinsic characteristics of the event stream.

Fig. 14 Effect of the frequency sampling distribution on object recognition on N-Caltech101 (Orchard et al., 2015). The uniform distribution, “Naive FFT", and Gaussian distribution are highlighted in green, black, and red, respectively. The 3D surface plots illustrate the prediction accuracy for different channel numbers and variances of the representations. The Gaussian distribution performs favourably with a small sampling variance throughout all channel numbers and obtains a stable and large improvement as the channel number increases

5.4 Running Time and Latency

One of the key advantages of event cameras is their low latency and high update rate, which makes them suitable for high-speed predictions. To meet the computational demands of event-driven applications, the efficiency of event representations is critical.

We evaluate the computational time of different event representations on the N-Cars testing dataset (Amos et al., 2018) and report the number of processed events per second and the total time used to process a 100ms sample in Table 7. All representations run on an RTX A5000 GPU. Our proposed CES has the same computational complexity as the other representations. However, owing to the efficient CUDA implementation of the update in Eq. (7), it runs more efficiently and is sufficient for most high-speed applications.

Table 7 Running time for 100ms of event data and number of events processed per second on N-Cars testing dataset (Amos et al., 2018)

5.5 Limitations

The “sim-to-real gap" refers to a degradation in performance when a neural network is trained on simulated data and tested on real data. Currently, there are significant differences between simulated and real events because existing simulators adopt approximate and simple models that cannot comprehensively capture the noise and dynamic effects of event cameras. As a result, the frequency properties and distributions of simulated and real events differ. Since our CES operates in the frequency domain and encodes more temporal information of the training data, in this situation our advantage may turn into a disadvantage.

We conduct an experiment to investigate the influence of “sim-to-real gap" in our proposed CES volumes. Specifically, we focus on the intensity-image reconstruction task. The training dataset is obtained by a simulator ESIM (Rebecq et al., 2018) similar to Rebecq et al. (2019), and the testing dataset is the real Event Camera Dataset (Mueggler et al., 2017).

The results are shown in Table 8. The existing methods, which merge and stack the events into grid-like representations, obscure much temporal information but in turn show more robustness to the “sim-to-real gap". Unfortunately, the proposed CES method hinges on the frequency domain and encodes more temporal information based on compressed sensing theory. Therefore, neural networks integrating our CES and trained on simulated data learn a bias towards the simulated distribution, resulting in reduced effectiveness on real data.

It should be emphasized that when the neural networks are trained and tested on real datasets, our CES takes advantage of its powerful representational capability, resulting in favorable performance against state-of-the-art event representations as well as high generalization ability across different real datasets (see Table 5).

Table 8 Limitation of CES volumes to the sim-to-real gap. The intensity-image reconstruction algorithms are trained on the simulated event data generated by ESIM (Rebecq et al., 2018) and tested on the real Event Camera Dataset (Mueggler et al., 2017). The existing representations obscure much temporal information and thus are robust to the sim-to-real gap. Unfortunately, the proposed CES volumes, which encode more temporal information, learn the bias towards simulated training distribution, resulting in less effectiveness on real testing dataset

6 Conclusion and Future Work

We leverage the sparsity of events and the compressed sensing scheme to show that compressed event sensing (CES) volumes can encode more temporal information of event data, thereby improving event-driven applications. We experimentally show that CES preserves high fidelity and achieves accurate, reversible event reconstruction. We provide a theoretical analysis of the expressive power of CES in the deep learning framework. We validate the advantage of CES on a challenging case: a dense frame regression test on synthetic phantom data. The intensity-image reconstruction and object recognition applications demonstrate that the proposed representation achieves superior performance against existing representations. Moreover, we thoroughly analyze the effects of the Fourier mapping in terms of frequency numbers and frequency selection.