
1 Introduction

Event-based cameras are driven by the events occurring in a scene, like their biological counterparts. Conventional vision sensors, by contrast, are driven by artificial timing and control signals that have nothing to do with the source of the visual information [18]. The Dynamic Vision Sensor (DVS) [27] (Fig. 1a) is one such event-based camera; it provides a stream of asynchronous events [9] (Fig. 1b). These bio-inspired sensors overcome some limitations of traditional cameras: they offer high temporal resolution, high dynamic range, low power consumption and so on. Hence, event cameras are advantageous for high-speed and high-dynamic-range visual applications in challenging scenarios with large brightness contrast. Owing to these advantages, vision algorithms based on event cameras have been applied to event-based tracking, Simultaneous Localization and Mapping (SLAM), object recognition and other areas [26, 31, 32, 35].

Fig. 1.

(a) The Dynamic Vision Sensor (DVS). (b) The difference between an event camera and a traditional camera when a black spot moves on the platform.

Visual tracking is a hot topic, widely used in video surveillance, unmanned driving, human-computer interaction and so on. Although tracking algorithms are well established and have been applied successfully in many settings, their visual input is acquired as conventional fixed-rate frames, which suffer from high redundancy, high latency and high data volume. Event-based sensors instead provide a continuous stream of asynchronous events. The position, time and polarity of each event are encoded by the address-event representation (AER) [19], which is triggered by the event itself. The pixels work asynchronously and output only the address and information of the pixels whose light intensity changes. Instead of passively reading out every pixel in a frame, they eliminate redundant information at the source. Real-time dynamic response to scene changes, super-sparse image representation and asynchronous event output make them well suited to high-speed object tracking and robot vision.

Current visual tracking algorithms work on natural images. Since every pixel of a frame needs a uniform exposure time, fast object motion causes image blur and information loss, and the tracking algorithms are susceptible to lighting changes, fast target motion, etc. Event-based tracking may solve these problems. However, event-based cameras do not directly output frames, so they cannot be directly fed into computer vision algorithms designed for ordinary cameras.

In our work, we convert the event stream generated by the event camera into image representations. Each converted image is formed by integrating a certain number of events with a sliding event window. We select seven event data records from the DVS benchmark data sets [14]. After accumulating the DVS data into frames, we re-annotate the ground truth to determine the locations of the tracked object more accurately for evaluating current tracking algorithms. All frames are annotated with axis-aligned bounding boxes, and the sequences are labeled with visual attributes such as noise events, occlusion, deformation and so on. The experiments verify the validity of the labeled data and show that tracking algorithms can track specific targets with high accuracy and robustness in complex scenes based on the output frames of the event camera.

2 Related Work

Many methods for event-based tracking have been presented so far. Because of the low data rate and low latency of event-based cameras, early researchers tracked targets moving in a static scene as clusters of events, achieving good performance in applications such as traffic monitoring and high-speed robot tracking [7]. At the same time, event-by-event adaptive tracking algorithms were demonstrated on some high-contrast user-defined shapes. Ni et al. [25] proposed a nearest-neighbor strategy that links each incoming event with the target shape and updates its transformation parameters. Glover et al. [10] proposed an improved particle filter that automatically adjusts the temporal window of the target observation for tracking a single target in event space. All of the above methods require prior knowledge or a user-defined description of the target to be tracked. As the motion range of objects grows, other methods identify distinctive natural features to track by analyzing the events [8]. Zhu et al. [36] proposed a soft data association modeled with probabilities, which relies on grouping events into a model; features were generated from motion-compensated events, which were turned into point sets registered against templates of new events. Lagorce et al. [18] proposed an event-based multi-kernel algorithm that tracks the characteristics of incoming events by integrating various kernels such as Gaussian and user-defined kernels. The appearance features of the event-stream objects were obtained in a multi-scale space independent of the underlying data, so the original features could not be retained. Kogler et al. [17] presented an event-to-frame converter and tested it on two conventional stereo vision algorithms. Schraml et al. [30] proposed to integrate DVS events over periods of 5–50 ms and used them to track moving objects in stereo vision. However, one difference between an event camera and a normal camera is that stationary objects are not imaged, which results in data that is sparse in space and time. When integrating events over a fixed time, the temporal information is destroyed and the spatial data is unevenly sparse across frames. Li et al. [20] proposed a tracking algorithm based on the correlation filter (CF) mechanism that encodes the event-stream object by rate coding, but it produces many noise events.

3 Event-Image Representation Based on Event Time-Stamp

Event-based cameras have independent pixels that respond to changes in the logarithm of the light intensity. In the ideal noise-free case, an event is described by the address-event representation, which combines the position, time, and polarity of the event (a signal indicating a change in brightness: an ON event is a positive event indicating an increase in brightness, and an OFF event is a negative event indicating a decrease in brightness). The event is written as:

$$\begin{aligned} e_m=\left( X_m,t_m,p_m\right) \end{aligned}$$
(1)

where \(X_m=\left( x_m,y_m\right) ^T\) indicates the pixel address; \(t_m\) indicates the time when the event occurs; and \(p_m\in \left\{ +1,-1\right\} \) indicates the polarity of the event, with \(p_m=+1\) indicating a brightening event and \(p_m=-1\) a darkening event. An event is triggered when the brightness change since the last event at that pixel reaches a preset threshold \(\pm C\), namely:

$$\begin{aligned} \varDelta L\left( X_m,t_m\right) =p_mC \end{aligned}$$
(2)

where:

$$\begin{aligned} \varDelta L\left( X_m,t_m\right) =L\left( X_m,t_m\right) -L\left( X_m,t_m-\varDelta t_m\right) \end{aligned}$$
(3)

\(\varDelta t_m\) indicates the time elapsed since the last triggered event at pixel \(X_m\).
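The following Python fragment is a minimal sketch of the event-generation model in Eqs. (1)–(3); the threshold value and all names are illustrative assumptions, and it emits at most one event per pixel per comparison for simplicity.

import numpy as np

C = 0.2  # assumed contrast threshold

def emit_events(L_ref, L_new, t):
    """Compare a new log-intensity image with the last triggered level L_ref
    and return events (x, y, t, p) wherever |Delta L| reaches the threshold C."""
    events = []
    delta = L_new - L_ref
    ys, xs = np.nonzero(np.abs(delta) >= C)
    for x, y in zip(xs, ys):
        p = 1 if delta[y, x] > 0 else -1   # ON / OFF polarity
        events.append((x, y, t, p))
        L_ref[y, x] += p * C               # update the per-pixel reference level
    return events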

As mentioned above, the event camera outputs the captured scene as an asynchronous event stream. Since a single data stream contains a large amount of data, it should first be processed in batches. In this paper, a sliding event window [28, 29] with a fixed number of N events slides over the data stream, thereby converting an event stream containing huge amounts of data into small batches with a fixed number of events. The sliding window divides the event stream into multiple small windows and passes one small window at a time, which copes with the large data volume. The sliding window works as shown in Fig. 2, and a minimal sketch is given below.
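The sketch below illustrates the fixed-count sliding event window; `events` is assumed to be a sequence of (x, y, t, p) tuples, and the stride parameter (not discussed in the text) defaults to N, giving non-overlapping windows.

def sliding_event_windows(events, N=8, stride=None):
    """Yield consecutive windows of N events from the event stream."""
    stride = N if stride is None else stride
    for start in range(0, len(events) - N + 1, stride):
        yield events[start:start + N]   # one small window of N events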

Fig. 2.

The sliding event window diagram. The incoming event stream is depicted as red (positive events)/green (negative events) marks on the timeline. Events are divided into windows of N events (blue boxes) by the sliding window, and each window is formed into one frame. In this example, N = 8. (Color figure online)

The valid information of the event stream depends on the number of events processed at the same time. There are generally two processing methods: one is event-driven and works event by event, and the other works on a set of events. The former processes every incoming single event; however, an independent event frequently does not provide enough information and may produce many noise events. The latter integrates all of the information contained in a set of events. We choose a fixed number of events to integrate the event stream, which achieves better results. We define E(t) as the sum of events during the short time interval of the sliding window:

$$\begin{aligned} E(t)=\sum _{t\in \text {event window}}\left( event_x\pm 1,event_y\pm 1,t\right) \end{aligned}$$
(4)

We accumulate 7500 events into each frame in this paper. During the accumulation and framing process, the position where an event occurs is converted to a pixel position to form the image frame. The position of the event in world coordinates is mapped to image coordinates by the relationship in Eq. 5:

$$\begin{aligned} \left[ \begin{array}{c}u\\ v\\ 1\\ \end{array}\right] =\frac{1}{z_c}\left[ \begin{array}{cccc}f_x&0&c_x&0\\ 0&f_y&c_y&0\\ 0&0&1&0\\ \end{array}\right] \left[ \begin{array}{cc}R&t\\ 0&1\\ \end{array}\right] \left[ \begin{array}{c}X_w\\ Y_w\\ Z_w\\ 1\\ \end{array}\right] \end{aligned}$$
(5)

Using a fixed number of events keeps the deviation between two adjacent frames within a small range.
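A minimal sketch of the accumulation step, assuming events are already expressed in pixel coordinates; the 7500-event window and the 200*200 resolution follow the text, while the function name and layout are our own.

import numpy as np

def events_to_count_frame(window, height=200, width=200):
    """Accumulate one window of (x, y, t, p) events into a per-pixel count image."""
    frame = np.zeros((height, width), dtype=np.uint16)
    for x, y, t, p in window:
        if 0 <= int(x) < width and 0 <= int(y) < height:
            frame[int(y), int(x)] += 1   # count how many events hit each pixel
    return frame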

To reduce the effect of noise on the image surface, we add an event counter while accumulating the events: we record each pixel's coordinate position (x, y) and the number of times it fires. The more events a pixel accumulates, the higher its weight and the higher its activity during accumulation and framing, so it is more likely to survive the event selection and less likely to be eliminated. Conversely, a pixel with few accumulated events is more likely to be eliminated during the event selection. Compared with the time surface [23, 34], the ability to handle local patches is weaker, but the processing is faster and consumes fewer resources.
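As a rough sketch of this counter-based selection, pixels whose event count falls below a threshold can be treated as noise and eliminated; the threshold value here is an assumption for illustration only.

import numpy as np

def filter_by_activity(count_frame, min_count=2):
    """Keep only pixels hit by at least min_count events; weaker pixels are zeroed out."""
    return np.where(count_frame >= min_count, count_frame, 0)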

Fig. 3.

An example of a girl scene accumulated from a certain number of events. (a) The girl captured by a normal camera. (b) The visualization of the space-time information within the girl's event stream. (c) The corresponding integrally reconstructed girl image.

The accumulated events are represented by image binarization, with each pixel's gray value set to 0 or 255, so the entire image exhibits a distinct black-and-white effect. Since the sizes of the recordings provided by the DVS benchmark data sets differ, we crop them to 200*200 pixels and adjust the object to the middle of the view to obtain effective information. This processing improves the robustness of the input data. Figure 3 shows a scene accumulated from a fixed number of events.
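The binarization and cropping can be sketched as follows, assuming the object centre (cx, cy) is known; the helper name and interface are illustrative, not part of the data sets.

import numpy as np

def binarize_and_crop(count_frame, cx, cy, size=200):
    """Set active pixels to 255 and the rest to 0, then crop a size*size window
    roughly centred on the object."""
    binary = np.where(count_frame > 0, 255, 0).astype(np.uint8)
    half = size // 2
    y0 = max(int(cy) - half, 0)
    x0 = max(int(cx) - half, 0)
    return binary[y0:y0 + size, x0:x0 + size]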

Table 1. The distribution of the visual attributes and the length of each video recording; 1 means the sequence has the attribute, 0 means it does not.

4 Experiment and Evaluation

4.1 The DVS Data Sets

We choose seven event data records from the DVS benchmark, recorded with the DVS output of a DAVIS camera [14]. The tracking targets include a person, a head and a doll. The seven sequences are “figure_skating”, “singer”, “girl”, “sylvester”, “Vid_D_person”, “Vid_E_person_part_occluded” and “Vid_J_person_floor”, which are accumulated into 696, 263, 731, 437, 443, 160 and 422 frames, respectively.

The ground truth provided by the tracking data sets is slightly offset from the actual position and size. Therefore, we relabel the locations of the tracked object in the seven sequences. Each frame is annotated with an axis-aligned bounding box described as (x, y, w, h), where (x, y) is the top-left corner position and w, h are the width and height. Moreover, all of the sequences are labeled with five visual attributes: noise events, occlusion, complicated background, scale variation and deformation. The distribution of these attributes and the lengths of the recordings are presented in Table 1.
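A small sketch of reading such annotations, assuming one comma-separated "x,y,w,h" line per frame; the file path is a placeholder, not the actual layout of our data.

def load_ground_truth(path):
    """Parse one (x, y, w, h) bounding box per line."""
    boxes = []
    with open(path) as f:
        for line in f:
            x, y, w, h = map(float, line.strip().split(","))
            boxes.append((x, y, w, h))   # top-left corner, width, height
    return boxes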

4.2 Evaluation

The evaluation methodology: The success rate and accuracy are used to evaluate the tracking algorithms. The former is based on the average overlap rate (AOR): the percentage of frames in a sequence whose overlap with the ground truth exceeds a set threshold. The latter is based on the center location error (CLE): the percentage of frames in a sequence whose distance between the predicted and actual target centers is less than a set threshold.
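The two measures can be sketched as below, with boxes given as (x, y, w, h); the thresholds 0.5 and 20 pixels mirror the suc-50 and pre-20 scores reported later, and the function names are our own.

import numpy as np

def overlap(b1, b2):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def center_error(b1, b2):
    """Euclidean distance between the two box centers."""
    c1 = (b1[0] + b1[2] / 2, b1[1] + b1[3] / 2)
    c2 = (b2[0] + b2[2] / 2, b2[1] + b2[3] / 2)
    return float(np.hypot(c1[0] - c2[0], c1[1] - c2[1]))

def success_rate(pred, gt, thr=0.5):
    return float(np.mean([overlap(p, g) > thr for p, g in zip(pred, gt)]))

def accuracy(pred, gt, thr=20.0):
    return float(np.mean([center_error(p, g) < thr for p, g in zip(pred, gt)]))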

The evaluated tracking algorithms: According to their learning mechanism, we divide them into tracking algorithms based on deep learning (SiamBAN [5], MDNet [24], DIMP [3], SiamMask [33]), correlation-filter tracking algorithms with hand-crafted features (CSK [12], KCF [13], FDSST [6], LCT [22], Staple [2], CSRT [21]), and other tracking algorithms (CT [15], BOOSTING [11], MIL [1], TLD [16], MOSSE [4]).
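As a minimal usage sketch, one of the hand-crafted-feature trackers (here KCF) can be run on the converted frames with OpenCV, assuming opencv-contrib-python is installed; exact factory names vary across OpenCV versions, and this is not the authors' evaluation code.

import cv2

def track_sequence(frames, init_box):
    """Run the OpenCV KCF tracker over a list of frames given an initial (x, y, w, h) box."""
    tracker = cv2.TrackerKCF_create()
    to_bgr = lambda f: cv2.cvtColor(f, cv2.COLOR_GRAY2BGR) if f.ndim == 2 else f
    tracker.init(to_bgr(frames[0]), tuple(int(v) for v in init_box))
    boxes = [tuple(init_box)]
    for frame in frames[1:]:
        ok, box = tracker.update(to_bgr(frame))
        boxes.append(box if ok else None)   # None marks a tracking failure
    return boxes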

Fig. 4.

The performance comparison of the tracking algorithms. The success and accuracy plots are shown in the left and right figures, respectively.

Quantitative Analysis. The experiments show that the visual tracking algorithms can effectively track our event-stream sequences and achieve good performance. Among them, the KCF tracking algorithm performs best. The comparison results of the accuracy and success-rate plots are shown in Fig. 4. From Fig. 4 we can see that the SOTA deep-learning algorithms do not perform well. The algorithm with improved feature extraction (KCF with the HOG feature) tracks accurately and effectively; its pre-20 and suc-50 are 0.856 and 0.628, respectively. It unsurprisingly performs better than the CSK algorithm, which uses a circulant matrix based on gray features. Its accuracy is also better than that of the Staple tracking algorithm, which combines HOG and color histograms; this may be related to our video sequences being binary grayscale images. In addition, tracking drift is easily caused by scale changes; the FDSST algorithm introduces scale estimation, and the LCT algorithm, which adds a confidence filter to the scale estimate, performs slightly better than FDSST, but both are worse than the KCF tracking algorithm. In summary, the data transformed from the event camera is valid and reasonable.

Fig. 5.

Examples of the results of some tracking algorithms on the seven event-stream sequences.

Qualitative Analysis. The tracking results of some trackers on the seven event-stream sequences are shown in Fig. 5. In the figure_skating sequence, the scale of the skater changes repeatedly. At the beginning, all six tracking algorithms are able to track the object, but as the scale continues to change, the trackers drift gradually; at the 485th frame, the other five tracking algorithms completely lose the target, while KCF still keeps tracking. In the singer sequence, the singer is blurred by the disturbance of the background, and due to the sudden change of the stage (the 111th and 131st frames), the target is lost by the CT and CSK tracking algorithms. Similarly, in the Vid_D_person and Vid_E_person_part_occluded sequences, CSK and CT also drift. However, in the sylvester sequence, all the trackers still track robustly even when the target deforms. In the girl sequence, the face target constantly deforms and changes, and the trackers gradually drift; at the 619th frame, only Staple and FDSST keep tracking effectively. In the Vid_J_person_floor sequence, when the target is interfered with by another person, CT and CSK track the wrong target and fail, but the remaining tracking algorithms still track stably after the two separate (see the 125th and 183rd frames).

5 Conclusion

We accumulated the event stream into frames by integrating a certain number of asynchronous events. The resulting frames were successfully applied to SOTA tracking algorithms designed for ordinary cameras and achieved good performance in complex visual scenes. The experiments showed the rationality of converting the event stream into frame images. At the same time, this processing not only avoids the effects of lighting but also reduces the effects of the background, which can protect privacy outside the target. In the future, we will propose a novel, highly robust target tracking algorithm for the data sets made in this paper.